Monday, January 27, 2014

Find which ASM disk maps to which Linux partition

There are various ways to check which Linux partition backs which ASM disk and which disk group it belongs to.

1- # ls -ls /dev/oracleasm/disks/DATA3*

0 brw-rw---- 1 grid asmadmin 8, 17 Oct 20 12:46 /dev/oracleasm/disks/DATA3
0 brw-rw---- 1 grid asmadmin 8, 18 Oct 20 12:46 /dev/oracleasm/disks/DATA3

The two columns after the group (asmadmin) are the device major and minor numbers,

e.g. for DATA3: major = 8, minor = 17.

# cat /proc/partitions
major minor #blocks name
8 0 83886080 sda
8 1 104391 sda1
8 2 5116702 sda2
8 3 5116702 sda3
8 4 1 sda4
8 5 73545538 sda5
8 16 93487104 sdf
8 17 4008186 sde1   <-- here you go (major 8, minor 17)

Hence, partition /dev/sde1 is mapped to ASM disk DATA3.
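To avoid eyeballing /proc/partitions for every disk, the matching can be scripted. A minimal sketch, assuming the ASMLib device nodes live under /dev/oracleasm/disks (adjust the path for multipath setups):

# loop over every ASMLib disk, read its major/minor numbers, and look them up in /proc/partitions
for d in /dev/oracleasm/disks/*; do
  maj=$((16#$(stat -c '%t' "$d")))   # stat prints the major number in hex; convert to decimal
  min=$((16#$(stat -c '%T' "$d")))   # same for the minor number
  part=$(awk -v M="$maj" -v m="$min" '$1 == M && $2 == m {print $4}' /proc/partitions)
  echo "$d -> /dev/${part:-unknown}"
done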


2- /sbin/blkid | grep asm   -- very useful --

/dev/mapper/ASM_EMC_XX_292Dp1: LABEL="VOTE" TYPE="oracleasm"
..
/dev/sdn1: LABEL="ARCH1" TYPE="oracleasm"
/dev/sdh1: LABEL="DATA6" TYPE="oracleasm"
/dev/sde1: LABEL="DATA3" TYPE="oracleasm"
/dev/sdd1: LABEL="DATA2" TYPE="oracleasm"
/dev/sda1: LABEL="DATA1" TYPE="oracleasm"
/dev/sdb1: LABEL="DATA7" TYPE="oracleasm"
/dev/sdy1: LABEL="DATA2" TYPE="oracleasm"
/dev/sdx1: LABEL="DATA11" TYPE="oracleasm"
/dev/sdw1: LABEL="DATA7" TYPE="oracleasm"
/dev/sdg1: LABEL="DATA4" TYPE="oracleasm"

[+ASM2] /dev/mpath > cat /proc/partitions |grep sdb1
   8       17  141419456 sdb1
[+ASM2] /dev/mpath >
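If ASMLib is installed, oracleasm querydisk can do the label-to-device mapping directly. A quick sketch (the -p option is present in recent ASMLib releases, so check yours; the output format also varies a little between versions):

# ask ASMLib which block device carries the DATA3 label
/usr/sbin/oracleasm querydisk -p DATA3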

Master node details in RAC

- The CRSd process on the master node is responsible for initiating the OCR backups as per the backup policy.

- The master node is also responsible for syncing the OCR cache across the nodes.
- The CRSd process on the master node reads from and writes to the OCR on disk.
- When nodes must be evicted, the cluster is divided into two sub-clusters. The sub-cluster containing the fewer number of nodes is evicted; if both sub-clusters have the same number of nodes, the sub-cluster containing the master node survives and the other sub-cluster is evicted.

The Oracle Clusterware master's identity can be found:


    - by scanning the ocssd logs from the various nodes
    - by scanning the crsd logs from the various nodes
    - by identifying the node which takes the backups of the OCR; the node that stores the OCR backups is the master node (see the commands sketched below)

    - with the help of the V$GES_RESOURCE view

If the master node gets evicted or rebooted, another node becomes the master.
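A couple of quick checks along the lines of the list above (a sketch; $GRID_HOME is assumed to point at the Grid Infrastructure home, and log locations and message text vary between Clusterware versions):

# the node shown by ocrconfig -showbackup as taking the automatic backups is the current master
ocrconfig -showbackup

# or grep the crsd log on each node for the master-election messages (exact wording differs per release)
grep -i "ocr master" $GRID_HOME/log/$(hostname -s)/crsd/crsd.log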



Wednesday, January 8, 2014

ORA-27301: OS failure message: No space left on device

Got this just now :) on one instance of an 11-node cluster:

SQL> startup
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpcreates
SQL> exit

We have 50+ databases running on this cluster. After searching for details on this error I found a useful Oracle note and changed the values as per that document.

This is caused by the kernel semaphore settings. Since we are running many databases on this server, we needed to check our semaphore limits.


In our case it was:

$ /sbin/sysctl -a | grep sem
error: permission denied on key 'kernel.cad_pid'
kernel.sem = 250 524288 256 2048
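To see how close the host actually is to those limits, ipcs can show the kernel's current semaphore limits and usage (a quick sanity check, not Oracle-specific):

$ ipcs -ls           # semaphore limits as the kernel currently sees them
$ ipcs -s | wc -l    # rough count of allocated semaphore sets (the count includes a few header lines)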

We then raised the kernel.sem values in /etc/sysctl.conf as per that document.
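For reference, the four kernel.sem fields are SEMMSL, SEMMNS, SEMOPM and SEMMNI. A sketch of the /etc/sysctl.conf entry we ended up with:

# kernel.sem = SEMMSL SEMMNS SEMOPM SEMMNI
kernel.sem = 275 524288 256 2048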

Then run sysctl to check the value and sysctl -p to reload it:
[root@test ~]# sysctl kernel.sem
kernel.sem = 275        524288  256     2048
[root@test ~]# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
fs.aio-max-nr = 1048576
fs.file-max = 6815744
kernel.shmall = 32988817
kernel.shmmax = 67561097216
kernel.shmmni = 4096
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.ip_local_port_range = 1024 65500
kernel.msgmax = 8192
kernel.msgmni = 2878
fs.file-max = 6815744
net.ipv4.ip_local_port_range = 9000 65500
kernel.shmmni = 4096
kernel.msgmnb = 65536
kernel.sysrq = 1
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.ipv4.tcp_syncookies = 1
fs.aio-max-nr = 3145728
kernel.shmall = 1073741824
kernel.shmmax = 4398046511104
net.core.wmem_max = 1048576
kernel.sem = 275 524288 256 2048

Tried to start the database again:
SQL> startup
ORACLE instance started.

Total System Global Area 1670221824 bytes
Fixed Size                  2227072 bytes
Variable Size            1241515136 bytes
Database Buffers          402653184 bytes
Redo Buffers               23826432 bytes
Database mounted.
Database opened.

wow all set :)

ORA-17503: ksfdopn:2 Failed to open file

Today I received a Critical Incident alert from OEM 12c. The alert message was: Message=Health checker runs found new failures in 12 at time/line number: Wed Dec 18 23:49:17 2013/122332.

First I did a quick check of the CRS status with crsctl status cluster -all; it showed all cluster resources online. Since the alert was coming from only one node, I expected the ASM alert log to give me a clue, and it did. Below is the key line from the ASM alert log:

Errors in file /XX/grid/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_XXX.trc:
ORA-17503: ksfdopn:2 Failed to open file +DATA/XXX/spfileXXX.ora
ORA-15173: entry 'XXX' does not exist in directory '/'

Here XXX indicates the database name.
After checking in more detail, I found that this error appears when a database has been dropped but is still registered in the OCR. After removing the database from the OCR with srvctl remove database, the incident cleared.
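The check and cleanup come down to a couple of srvctl calls; a sketch, with XXX standing in for the actual db_unique_name:

# confirm the dropped database is still registered in the OCR
srvctl config database -d XXX

# remove the stale registration (the database itself is already gone)
srvctl remove database -d XXX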

 

Tuesday, January 7, 2014

Archive Logs Are Not Copied To Standby: Solving the Issue from Scratch

I have debugged this same issue at least once a week since May 2009 :). The steps below help solve it from scratch.
We usually find the following error message in the alert log:

————————————————————
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
————————————————————

Check:

- That you are using password files
- That the password file is named $ORACLE_HOME/dbs/orapw${ORACLE_SID}
- That the SYS password is the same on all the servers involved in the archive copy
- If one of the databases is 11g, make sure you set sec_case_sensitive_logon to false
- Also check that you created the password file with ignore_case=y (see the sketch below)

I think this covers my usual debug process.
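When the password file itself is the suspect, the simplest fix is to recreate it on the primary and copy it to every standby. A minimal sketch, assuming the default file location and an 11g orapwd (host and SID names are placeholders):

# recreate the password file with case-insensitive entries (11g)
orapwd file=$ORACLE_HOME/dbs/orapw$ORACLE_SID password=<sys_password> ignore_case=y force=y

# copy the same file to each standby host, renaming it for the standby SID
scp $ORACLE_HOME/dbs/orapw$ORACLE_SID standby_host:$ORACLE_HOME/dbs/orapw<standby_sid>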


Never forget that 11g has case-sensitive passwords.

Saturday, January 4, 2014

ORA-39095: Dump file space has been exhausted: Unable to allocate 8192 bytes

Today, while running an export job, I got the issue below.

Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/POST_INSTANCE/PROCACT_INSTANCE
Processing object type SCHEMA_EXPORT/TABLE/POST_INSTANCE/PROCDEPOBJ
Processing object type SCHEMA_EXPORT/POST_SCHEMA/PROCOBJ
Processing object type SCHEMA_EXPORT/POST_SCHEMA/PROCACT_SCHEMA
. . exported "XXX"."XML_DOCUMENT"           3.716 GB 1501732 rows
ORA-39095: Dump file space has been exhausted: Unable to allocate 8192 bytes
Job "SYSTEM"."SYS_EXPORT_SCHEMA_01" stopped due to fatal error at 20:22:02
Job SYS_EXPORT_SCHEMA_01 has been reopened at Friday, 03 January, 2014 20:27
Job "SYSTEM"."SYS_EXPORT_SCHEMA_01" stopped due to fatal error at 20:27:11

The system, all tablespaces, and the dump mount point had sufficient space. After doing some tests, I found the cause:

Cause:
If there are fewer dump files than parallel processes, a dump file may be locked by one process while the other processes wait to write to it. Sometimes a process does not release the lock even after it has finished, so the entire job never ends because the other processes keep waiting for the lock.

Solution:
The degree of parallelism should be less than or equal to the number of dump files, or avoid the parallel option altogether.
To avoid this problem I use dumpfile=dumpfile_name_%U.dmp with parallel set to roughly half the number of CPUs, and my job then ran without any issue.
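Putting that together, a hedged example of the expdp call I use (directory, schema, and file names here are placeholders):

# %U lets Data Pump create as many dump files as it needs, so each parallel worker gets its own file
expdp system/<password> schemas=XXX directory=DATA_PUMP_DIR dumpfile=expdp_XXX_%U.dmp logfile=expdp_XXX.log parallel=4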