Saturday, July 19, 2014

Oracle Database Corruption



Master Note for Handling Oracle Database Corruption Issues (Doc ID 1088018.1). This will help you in case if you get future issues, please take a look at it.

How to perform ORACLE_HOME/binaries backup

Oracle binaries backup is needed when we are doing any OS  label upgrade like
OEL upgrade .

1. (Suggested but not mandatory) Shutdown databases, listeners, or any other processes related to the ORACLE_HOME which you are taking backup.

2. cd to the directory where ORACLE_HOME is located for example:

cd /u01/app/oracle/product/11.2

3. backup the ORACLE_HOME for example:

tar -pcvf /u01/app/oracle/backup/oracle_home_bkup.tar db1

In the above example, ORACLE_HOME is /u01/app/oracle/product/11.2/db1 and backup the directory is /u01/app/oracle/backup/

Below is an example of restoring the ORACLE_HOME:

1. (This step would be mandatory in the case of restore) Shutdown databases, listeners, or any other processes related to the ORACLE_HOME which you are restoring.

2. Go to the directory where ORACLE_HOME is located for example:

cd /u01/app/oracle/product/11.2

3. Rename or move the current ORACLE_HOME for example:

mv db1 db1_bkup

3. Restore the ORACLE_HOME for example:

tar -pxvf /u01/app/oracle/backup/oracle_home_bkup.tar

Make sure that is enough enough free disk space before doing the backup.

Monday, January 27, 2014

find which asm disk maps to which linux partition

There is varies option  to check which disk is part of Which disk Group .

- # ls -ls /dev/mpath/DATA3*

0 brw-rw—- 1 grid asmadmin 8, 17 Oct 20 12:46 /dev/oracleasm/disks/DATA3
0 brw-rw—- 1 grid asmadmin 8, 18 Oct 20 12:46 /dev/oracleasm/disks/DATA3

2 columns after group (asmadmin) refer to major/minor

e.g. for ASMDISK01 major = 8, minor = 17

# cat /proc/partitions
major minor #blocks name
8 0 83886080 sda
8 1 104391 sda1
8 2 5116702 sda2
8 3 5116702 sda3
8 4 1 sda4
8 5 73545538 sda5
8 16 93487104 sdf
8 17 4008186 sde1 --- Here you GO

Hence, partition /dev/sde1 is mapped to ASMDISK01.


2- /sbin/blkid |grep asm -- Very usefully --

/dev/mapper/ASM_EMC_XX_292Dp1: LABEL="VOTE" TYPE="oracleasm"
..
/dev/sdn1: LABEL="ARCH1" TYPE="oracleasm"
/dev/sdh1: LABEL="DATA6" TYPE="oracleasm"
/dev/sde1: LABEL="DATA3" TYPE="oracleasm"
/dev/sdd1: LABEL="DATA2" TYPE="oracleasm"
/dev/sda1: LABEL="DATA1" TYPE="oracleasm"
/dev/sdb1: LABEL="DATA7" TYPE="oracleasm"
/dev/sdy1: LABEL="DATA2" TYPE="oracleasm"
/dev/sdx1: LABEL="DATA11" TYPE="oracleasm"
/dev/sdw1: LABEL="DATA7" TYPE="oracleasm"
/dev/sdg1: LABEL="DATA4" TYPE="oracleasm"

[+ASM2] /dev/mpath > cat /proc/partitions |grep sdb1
   8       17  141419456 sdb1
[+ASM2] /dev/mpath >

Master node detail in Rac

- CRSd process on the Master node is responsible to initiate the OCR backup as per the backup policy

- Master node  is also responsible to sync OCR cache across the nodes
- CRSd process oth the master node reads from and writes to OCR on disk
- In case of node eviction, The cluster is divided into two sub-clusters. The sub-cluster containing fewer no. of nodes is evicetd. But, in case both the sub-clusters have same no. of nodes, the sub-cluster having the master node survives whereas the other sub-cluster is evicted.

Oracle ClusterWare master’s  information can be found


    - by scanning ocssd logs from various nodes
    - by scanning  crsd logs from various nodes.
    - by identifying the node which  takes the backup of the OCR.
    -the node that store OCR backups is the master node. 

    - With the help of V$GES_RESOURCE view .

If master node gets evicted/rebooted, another node becomes the master.



Wednesday, January 8, 2014

ORA-27301: OS failure message: No space left on device

Got this just now :) on one of  11 node cluster instance,

SQL> startup
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpcreates
SQL> exit

 We have 50+ database running on this cluster .after searching detail  on this error found one usefully oracle doc , make value as per  document .

This is caused by Kernel parameter setting. Since we were running many databases on a server, we should check our semaphore.


In our case it's

$ /sbin/sysctl -a | grep sem
error: permission denied on key 'kernel.cad_pid'
kernel.sem = 250 524288 256 2048

reset value as per above document.

run sysctl and check.
[root@test ~]# sysctl kernel.sem
kernel.sem = 275        524288  256     2048
[root@test ~]# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
fs.aio-max-nr = 1048576
fs.file-max = 6815744
kernel.shmall = 32988817
kernel.shmmax = 67561097216
kernel.shmmni = 4096
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.ip_local_port_range = 1024 65500
kernel.msgmax = 8192
kernel.msgmni = 2878
fs.file-max = 6815744
net.ipv4.ip_local_port_range = 9000 65500
kernel.shmmni = 4096
kernel.msgmnb = 65536
kernel.sysrq = 1
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.ipv4.tcp_syncookies = 1
fs.aio-max-nr = 3145728
kernel.shmall = 1073741824
kernel.shmmax = 4398046511104
net.core.wmem_max = 1048576
kernel.sem = 275 524288 256 2048

Tried to start db again,
SQL> startup
ORACLE instance started.

Total System Global Area 1670221824 bytes
Fixed Size                  2227072 bytes
Variable Size            1241515136 bytes
Database Buffers          402653184 bytes
Redo Buffers               23826432 bytes
Database mounted.
Database opened.

wow all set :)

ORA-17503: ksfdopn:2 Failed to open file

Today received one Critical Incident from  Oem 12C . The alerts message was Message=Health checker runs found new failures in 12 at time/line number: Wed Dec 18 23:49:17 2013/122332.

first i quick check crs status  . crsctl status cluster -all  this command  shows all cluster resource are online, since alerts was coming from only 1 node . i thought  asm alerts will give me some clue on this  . yes it give me ,below was key line in ASM alerts ,

Errors in file /XX/grid/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_XXX.trc:
ORA-17503: ksfdopn:2 Failed to open file +DATA/XXX/spfileXXX.ora
ORA-15173: entry 'XXX' does not exist in directory '/'

XXX is indicate here db name. 
after checking more detail found,  The error comes when db is droped and still it's register into OCR. after removing db from OCR byi srvcet remove .. Incident clear,

 

Tuesday, January 7, 2014

Archive Logs Are Not Copied To Standby solve issue from scratch

I debug this same issue at least once a week since May 2009 :), below step help to solve issue from scratch
We usually find below erro message on alert logfile.

————————————————————
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
————————————————————

Check:

That you are using password files
That the name of the password file is $ORACLE_HOME/dbs/orapw${ORACLE_SID}
That the password of sys is the same on all the servers involved in the archive copy
If one of the databases is 11g, make sure you set sec_case_sensitive_logon to false
Also check that you created the password file with ignore_case=y
I think this covers my usual debug process.


Never forget  11g has  case sensitive passwords.