Robert Grzesiak
Oracle DBA, LexisNexis Sp. z o.o.
Subject: stability problem with Oracle 11.2.0.2 RAC on Solaris 10
Hello. I have an Oracle 11.2.0.2 RAC system installed on Solaris 10.
Every few days I have a stability problem: one node reboots the other (or the other way around), and I cannot pin down why it happens...
I have two nodes, named Mars and Venus. Today, for example, node Mars restarted the second node, Venus (via IPMI), but why... The database alert log only shows that there were IPC timeouts, so the interconnect communication path probably failed, but nothing more specific...
The restart happened around 02:30…
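For reference, the basic cluster-side settings can be checked like this (just a sketch; it assumes the Grid home /md1/app/11.2.0/grid that appears in the logs below, and the values in the comments are the 11.2 defaults):

# overall clusterware health on both nodes
/md1/app/11.2.0/grid/bin/crsctl check cluster -all
# CSS heartbeat thresholds (11.2 defaults: misscount 30 s, disktimeout 200 s)
/md1/app/11.2.0/grid/bin/crsctl get css misscount
/md1/app/11.2.0/grid/bin/crsctl get css disktimeout
# private and public networks as registered in the clusterware
/md1/app/11.2.0/grid/bin/oifcfg getif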
DB alert log on Mars:
Thread 2 advanced to log sequence 5461 (LGWR switch)
Current log# 10 seq# 5461 mem# 0: +DATA/lnred/onlinelog/group_10.308.743814917
Thu Jul 07 01:02:23 2011
Archived Log entry 3047 added for thread 2 sequence 5460 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:31:28 2011
minact-scn master exiting with err:12751
Thu Jul 07 02:31:43 2011
IPC Send timeout detected. Sender: ospid 1940 [oracle@Mars (MMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 26
IPC Send timeout: Terminating pid 26 osid 1940
Thu Jul 07 02:31:46 2011
IPC Send timeout detected. Sender: ospid 18305 [oracle@Mars (M000)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 75
IPC Send timeout: Terminating pid 75 osid 18305
Thu Jul 07 02:32:07 2011
IPC Send timeout detected. Sender: ospid 1908 [oracle@Mars (LMD0)]
Receiver: inst 1 binc 36502 ospid 1949
Thu Jul 07 02:32:10 2011
Restarting dead background process MMON
Thu Jul 07 02:32:11 2011
MMON started with pid=26, OS id=19687
Thu Jul 07 02:32:15 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle@Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1951
IPC Send timeout to 1.1 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:32:19 2011
Communications reconfiguration: instance_number 1
Thu Jul 07 02:32:24 2011
IPC Send timeout to 1.0 inc 32 for msg type 65521 from opid 12
Thu Jul 07 02:32:29 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle@Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1955
IPC Send timeout to 1.2 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:33:18 2011
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Thu Jul 07 02:33:38 2011
Remote instance kill is issued with system inc 34
Remote instance kill map (size 1) : 1
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 1
Thu Jul 07 02:33:48 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:18 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:40 2011
Reconfiguration started (old inc 32, new inc 36)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:34:40 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:34:40 2011
LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Thu Jul 07 02:34:40 2011
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
Completed redo scan
read 3662 KB redo, 135 data blocks need recovery
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Started redo application at
Thread 1: logseq 10409, block 19647
Recovery of Online Redo Log: Thread 1 Group 5 Seq 10409 Reading mem 0
Mem# 0: +DATA/lnred/onlinelog/group_5.303.743814897
Completed redo application of 0.02MB
Completed instance recovery at
Thread 1: logseq 10409, block 26971, scn 2612073761
62 data blocks read, 121 data blocks written, 3662 redo k-bytes read
Thread 1 advanced to log sequence 10410 (thread recovery)
Redo thread 1 internally disabled at seq 10410 (SMON)
Thu Jul 07 02:34:41 2011
Archived Log entry 3048 added for thread 1 sequence 10409 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:42 2011
ALTER SYSTEM SET service_names='lnwyd_service','lnred_service','backup' SCOPE=MEMORY SID='lnred2';
Thu Jul 07 02:34:42 2011
ARC3: Archiving disabled thread 1 sequence 10410
Archived Log entry 3049 added for thread 1 sequence 10410 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:53 2011
minact-scn: Inst 2 is now the master inc#:36 mmon proc-id:19687 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.9bb067c4 gcalc-scn:0x0000.9bb06ebf
Thu Jul 07 02:35:51 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 no longer running (check cable)
Thu Jul 07 02:36:52 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 is now running
Thu Jul 07 02:37:28 2011
IPC Send timeout detected. Sender: ospid 1886 [oracle@Mars (PMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 2
Thu Jul 07 02:41:17 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 02:42:28 2011
Reconfiguration started (old inc 36, new inc 38)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Thu Jul 07 02:42:28 2011
* domain 0 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:42:28 2011
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:42:28 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Thu Jul 07 02:42:29 2011
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:38 errcnt:0
Reconfiguration complete
Thu Jul 07 02:56:18 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 04:09:37 2011
Thread 2 advanced to log sequence 5462 (LGWR switch)
Current log# 8 seq# 5462 mem# 0: +DATA/lnred/onlinelog/group_8.306.743814913
Thu Jul 07 04:09:39 2011
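The SKGXP messages above are about the 169.254.x.x address, which in 11.2 is the HAIP that Grid Infrastructure runs on top of the private interconnect (aggr2, according to the instance startup on Venus below), so the private network seems to have dropped for about a minute. A rough sketch of what can be checked, assuming the Grid home from these logs and standard Solaris 10 tools:

# state of the HAIP resource managed by the clusterware
/md1/app/11.2.0/grid/bin/crsctl stat res ora.cluster_interconnect.haip -init
# health of the link aggregation carrying the interconnect (that aggr2 is an aggregation is my assumption from the GPnP line)
dladm show-aggr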
DB alert log on Venus:
Wed Jul 06 22:09:06 2011
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_1_seq_10394.415.755808529
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_2_seq_5418.421.755808553
Archived Log entry 3044 added for thread 1 sequence 10408 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:42:16 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Private Interface 'aggr2:1' configured from GPnP for use as a private interconnect.
[name='aggr2:1', type=1, ip=169.254.16.251, mac=00-1b-21-49-f4-34, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'aggr1' configured from GPnP for use as a public interface.
[name='aggr1', type=1, ip=10.122.10.150, mac=00-1b-21-49-f4-30, net=10.122.10.0/23, mask=255.255.254.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters option.
Using parameter settings in server-side pfile /md1/app/oracle/product/11.2.0/db1/dbs/initlnred1.ora
System parameters with non-default values:
processes = 300
sessions = 480
timed_statistics = TRUE
spfile = "+DATA/lnred/spfilelnred.ora"
nls_language = "POLISH"
nls_territory = "POLAND"
nls_length_semantics = "CHAR"
resource_manager_plan = ""
sga_target = 8G
control_files = "+DATA/lnred/controlfile/current.256.741561271"
db_block_size = 8192
compatible = "11.2.0.0.0"
cluster_database = TRUE
db_create_file_dest = "+DATA"
db_recovery_file_dest = "+FRA"
db_recovery_file_dest_size= 500G
thread = 1
undo_tablespace = "UNDOTBS1"
instance_number = 1
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=lnredXDB)"
remote_listener = "lnscan.adln.local:1521"
audit_file_dest = "/md1/app/oracle/admin/lnred/adump"
optimizer_features_enable= "11.1.0.7"
audit_trail = "DB"
db_name = "lnred"
open_cursors = 300
pga_aggregate_target = 8G
aq_tm_processes = 1
diagnostic_dest = "/md1/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
169.254.16.251
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Thu Jul 07 02:42:25 2011
PMON started with pid=2, OS id=1914
OS messages on Venus:
Jul 7 02:34:38 Venus unix: [ID 836849 kern.notice]
Jul 7 02:34:38 Venus ^Mpanic[cpu6]/thread=ffffffff9bec7e60:
Jul 7 02:34:38 Venus genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017ceeb0 genunix:kadmin+517 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef00 genunix:uadmin+c7 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef10 unix:brand_sys_syscall+21d ()
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 672855 kern.notice] syncing file systems...
Jul 7 02:34:39 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:39 Venus genunix: [ID 733762 kern.notice] 220
Jul 7 02:34:40 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:40 Venus genunix: [ID 733762 kern.notice] 205
Jul 7 02:34:42 Venus genunix: [ID 433738 kern.notice] [6]
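The uadmin/kadmin frames in the panic stack mean the forced crash dump was requested from userland on Venus itself, which on 11.2 is typically how cssdagent/cssdmonitor takes a node down when CSS heartbeats are lost. The reason should be in the CSSD-side logs; a sketch, assuming the default 11.2 log layout under the Grid home:

# eviction / reboot decisions logged by ocssd on both nodes
egrep -i "evict|reboot|fatal" /md1/app/11.2.0/grid/log/venus/cssd/ocssd.log
egrep -i "evict|reboot|fatal" /md1/app/11.2.0/grid/log/mars/cssd/ocssd.log
# the agent that actually reboots a node in 11.2
tail -200 /md1/app/11.2.0/grid/log/venus/agent/ohasd/oracssdagent_root/oracssdagent_root.log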
Not much information there, but in the Grid log on Venus, grid/log/venus/crsd/crsd.log, I found:
2011-07-07 02:02:10.052: [ CRSPE][47] {1:3804:25345} Processing PE command id=25249. Description: [Stat Resource : 3703910]
2011-07-07 02:02:10.054: [UiServer][48] {1:3804:25345} Done for ctx=32ca350
At this point the log contains some garbage like the following (maybe the same communication problem): [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL] [NUL]
[ clsdmt][3]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=venusDBG_CRSD))
2011-07-07 02:41:48.808: [ clsdmt][3]PID for the Process [1619], connkey 1
And in the Grid log on Mars, grid/log/mars/crsd/crsd.log:
2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server received the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server replying to the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2011-07-07 02:20:48.482: [ CRSCOMM][39] Ipc: Client disconnected.
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] IpcL: Listener got clsc error 11 for memNum. 8
2011-07-07 02:20:48.482: [ CRSCOMM][39] IpcL: connection to member 8 has been removed
2011-07-07 02:20:48.483: [CLSFRAME][39] Removing IPC Member:{Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [CLSFRAME][39] Disconnected from AGENT process: {Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agfw Proxy Server received process disconnected notification, count=1
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} /md1/app/11.2.0/grid/bin/oraagent_oracle disconnected.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agent /md1/app/11.2.0/grid/bin/oraagent_oracle[18546] stopped!
2011-07-07 02:20:48.483: [ CRSCOMM][42] {2:44971:2690} IpcL: removeConnection: Member 8 does not exist.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Created alert : (:CRSAGF00129:) : The agent: /md1/app/11.2.0/grid/bin/oraagent has been disabled because it exceeded the maximum failures (5) over a period of 600 seconds.
At that time the Venus OS was restarted (via IPMI from Mars):
2011-07-07 02:34:39.963: [ CRSSE][32] Forwarding Node Leave to PE for: venus
2011-07-07 02:34:39.965: [ CRSCCL][34]Reconfig event received by clssgsgrpstat
2011-07-07 02:34:39.965: [ CRSCCL][34]cclGetMemberData called
2011-07-07 02:34:39.979: [ CRSCCL][34]Member (2, 307107, mars:11.2.0.2.0) @ found.
2011-07-07 02:34:39.999: [CLSFRAME][34] CCL MEMBER LEFT:1:1:CRSD:venus
2011-07-07 02:34:39.999: [CLSFRAME][34] Disconnected from CRSD:venus process:
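Since the timeouts seem to point at the private interconnect, one more thing worth running while both nodes are up is a clusterware connectivity check (a sketch; node names taken from the logs):

# node connectivity check across the public and private networks
/md1/app/11.2.0/grid/bin/cluvfy comp nodecon -n mars,venus -verbose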
So what next... do you have any ideas?
Regards,
Robert