Introduction:

These days, terms like "cloud", "virtualization" and "consolidation" are used very frequently in the IT industry, yet many organizations are still not convinced to run their application and database workloads on virtualized platforms. Is it really a bad idea to run databases and applications on a virtualized platform? I believe the answer is both "YES" and "NO". If the virtualized environment is poorly configured and deployed, then the answer is YES, because the business may face poor system performance, outages, and long delays in finding and fixing the root cause of problems. If the virtualized environment is properly deployed, with all hardware considerations taken into account, then the answer is NO, and such environments are far better in terms of availability and hardware utilization.

This article does not cover the step-by-step method for configuring and installing Oracle Real Application Clusters (RAC) on Oracle Solaris SPARC based virtualization (LDOMs); rather, it covers the best practices you should follow when configuring the shared disk devices inside the LDOMs that will be used for the RAC installation. In a typical deployment of SPARC based virtualization there are two physical servers, and Oracle RAC is installed on one LDOM from each physical server. The main purpose of this article is to help individuals who are planning to use LDOMs for their RAC deployments understand what to consider while configuring the shared storage devices on the LDOM servers. If the shared devices are not configured properly, we may encounter node eviction issues now and then, so we must be very careful while configuring them. In this article I will demonstrate one issue which was encountered by at least three customers.

Environment Details:

- Two physical servers: Oracle SPARC T5-4
- RAC deployed on two LDOMs, one from each physical server
- Oracle ZFS Storage used for shared storage

S.No.  Server          Description
1      controlhost01   Controller host domain on server1
2      racnode1        Guest LDOM on server1
3      controlhost02   Controller host domain on server2
4      racnode2        Guest LDOM on server2

Oracle Grid Infrastructure 11.2.0.3 had been running without issues, but one of the servers was rebooted for maintenance, and after that reboot the node kept getting evicted from the cluster.

Observations:

Log messages from the operating system:

Jan 14 03:52:10 racnode1 last message repeated 1 time
Jan 14 04:26:32 racnode1 CLSD: [ID 770310 daemon.notice] The clock on host racnode1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Jan 14 04:45:22 racnode1 vdc: [ID 795329 kern.notice] NOTICE: vdisk@1 disk access failed
Jan 14 04:49:33 racnode1 last message repeated 5 times
Jan 14 04:50:23 racnode1 vdc: [ID 795329 kern.notice] NOTICE: vdisk@1 disk access failed
Jan 14 04:51:14 racnode1 last message repeated 1 time
Jan 14 05:00:11 racnode1 CLSD: [ID 770310 daemon.notice] The clock on host racnode1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Jan 14 05:33:34 racnode1 last message repeated 1 time
Jan 14 05:45:25 racnode1 vdc: [ID 795329 kern.notice] NOTICE: vdisk@1 disk access failed
Jan 14 05:48:54 racnode1 last message repeated 4 times
Jan 14 05:49:44 racnode1 vdc: [ID 795329 kern.notice] NOTICE: vdisk@1 disk access failed
Jan 14 05:52:22 racnode1 last message repeated 3 times
Jan 14 06:09:21 racnode1 CLSD: [ID 770310 daemon.notice] The clock on host racnode1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Log messages from the Grid Infrastructure (GI) logs:

NOTE: cache mounting group 3/0xF98788E3 (OCR) succeeded
NOTE: cache ending mount (success) of group OCR number=3 incarn=0xf98788e3
GMON querying group 1 at 10 for pid 18, osid 8795
Thu Jan 28 02:15:18 2016
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup ARCH was mounted
GMON querying group 2 at 11 for pid 18, osid 8795
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2
SUCCESS: diskgroup DATA was mounted
GMON querying group 3 at 12 for pid 18, osid 8795
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3
SUCCESS: diskgroup OCR was mounted
SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:0:2} */
Thu Jan 28 02:15:19 2016
WARNING: failed to online diskgroup resource ora.ARCH.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.OCR.dg (unable to communicate with CRSD/OHASD)
Thu Jan 28 02:15:36 2016
NOTE: Attempting voting file refresh on diskgroup OCR
NOTE: Voting file relocation is required in diskgroup OCR
NOTE: Attempting voting file relocation on diskgroup OCR

[/u01/grid/bin/oraagent.bin(9694)]CRS-5818:Aborted command 'check' for resource 'ora.ARCH.dg'. Details at (:CRSAGF00113:) {1:57521:2} in /u01/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log.
2016-01-28 02:18:42.213
[/u01/grid/bin/oraagent.bin(9694)]CRS-5818:Aborted command 'check' for resource 'ora.DATA.dg'. Details at (:CRSAGF00113:) {1:57521:2} in /u01/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log.
2016-01-28 02:18:42.213
[/u01/grid/bin/oraagent.bin(9694)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN3.lsnr'. Details at (:CRSAGF00113:) {1:57521:2} in /u01/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log.
2016-01-28 02:18:42.410
[/u01/grid/bin/oraagent.bin(9694)]CRS-5016:Process "/u01/grid/opmn/bin/onsctli" spawned by agent "/u01/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log"
2016-01-28 02:18:50.897
[/u01/grid/bin/oraagent.bin(9694)]CRS-5818:Aborted command 'check' for resource 'ora.asm'. Details at (:CRSAGF00113:) {1:57521:2} in /u01/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log.

Cause:

On further investigation we found that the logical device names used by ASM are not the same on both LDOM cluster nodes, and this is the reason one of the instances keeps getting evicted.
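A quick way to start this kind of investigation is to map the failing vdisk@<ID> from the vdc messages to a guest device path and to the volume exported from the control domain. The following is a minimal sketch; nothing in it is specific to this incident beyond the hostnames and the guest domain name already shown.

# On the guest: the symlink targets under /dev/dsk end in .../disk@<ID>, so c0d1 maps to device ID 1.
-bash-3.2# ls -l /dev/dsk/c0d1*

# On the control domain: find which volume is exported to the guest as device ID 1 (column DEVICE = disk@1).
root@controlhost01:~# ldm list -o disk racnode | grep 'disk@1 '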
Let's compare the virtual disks presented to the guest domain on each control domain.

RACNODE1 (guest domain racnode on controlhost01):

root@controlhost01:~# ldm list -o disk racnode
NAME
racnode

DISK
    NAME     VOLUME           TOUT ID   DEVICE   SERVER    MPGROUP
    OS       OS@racnode            0    disk@0   primary
    cdrom    cdrom@racnode         1    disk@1   primary
    DATA     DATA@RACDisk          5    disk@5   primary
    OCR1     OCR1@RACDisk          2    disk@2   primary
    OCR2     OCR2@RACDisk          3    disk@3   primary
    OCR3     OCR3@RACDisk          4    disk@4   primary
    ARCH     ARCH@RACDisk          6    disk@6   primary
root@controlhost01:~#

RACNODE2 (guest domain racnode on controlhost02):

root@controlhost02:~# ldm list -o disk racnode
NAME
racnode

DISK
    NAME     VOLUME           TOUT ID   DEVICE   SERVER    MPGROUP
    OS       OS@racnode            0    disk@0   primary
    OCR1     OCR1@RACDisk          2    disk@2   primary
    OCR2     OCR2@RACDisk          3    disk@3   primary
    OCR3     OCR3@RACDisk          4    disk@4   primary
    DATA     DATA@RACDisk          5    disk@5   primary
    ARCH     ARCH@RACDisk          1    disk@1   primary
root@controlhost02:~#

If we observe the output, there is a difference in the number of devices presented to racnode on controlhost01 and controlhost02: controlhost01 has seven devices in total while controlhost02 has six. There is also a difference in the logical device ID of the ARCH disk:

controlhost01 ==> ARCH     ARCH@RACDisk          6    disk@6   primary
controlhost02 ==> ARCH     ARCH@RACDisk          1    disk@1   primary

On racnode1 the ARCH disk is allocated logical device ID "6", while on racnode2 it is allocated logical device ID "1".

Let's see the device names allocated on each cluster node:

RACNODE1:

-bash-3.2# echo|format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0d0
          /virtual-devices@100/channel-devices@200/disk@0
       1. c0d2 OCR1
          /virtual-devices@100/channel-devices@200/disk@2
       2. c0d3 OCR2
          /virtual-devices@100/channel-devices@200/disk@3
       3. c0d4 OCR3
          /virtual-devices@100/channel-devices@200/disk@4
       4. c0d5 DATA
          /virtual-devices@100/channel-devices@200/disk@5
       5. c0d6 ARCH
          /virtual-devices@100/channel-devices@200/disk@6
Specify disk (enter its number): Specify disk (enter its number):
-bash-3.2#

RACNODE2:

-bash-3.2# echo|format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0d0
          /virtual-devices@100/channel-devices@200/disk@0
       1. c0d1 ARCH
          /virtual-devices@100/channel-devices@200/disk@1
       2. c0d2 OCR1
          /virtual-devices@100/channel-devices@200/disk@2
       3. c0d3 OCR2
          /virtual-devices@100/channel-devices@200/disk@3
       4. c0d4 OCR3
          /virtual-devices@100/channel-devices@200/disk@4
       5. c0d5 DATA
          /virtual-devices@100/channel-devices@200/disk@5
Specify disk (enter its number): Specify disk (enter its number):
-bash-3.2#

Let's check the ASM disks from the ASM instances on both RAC nodes:

racnode2:

SQL> select name, path from v$asm_disk;

NAME                                               PATH
-------------------------------------------------- --------------------
OCR_0002                                           /dev/rdsk/c0d4s4
DATA_0000                                          /dev/rdsk/c0d5s4
OCR_0000                                           /dev/rdsk/c0d2s4
OCR_0001                                           /dev/rdsk/c0d3s4
ARCH_0000                                          /dev/rdsk/c0d1s4

racnode1:

SQL> select name, path from v$asm_disk;

NAME                                               PATH
-------------------------------------------------- --------------------
OCR_0002                                           /dev/rdsk/c0d4s4
DATA_0000                                          /dev/rdsk/c0d5s4
OCR_0000                                           /dev/rdsk/c0d2s4
OCR_0001                                           /dev/rdsk/c0d3s4
ARCH_0000                                          /dev/rdsk/c0d6s4

Here is the problem: the ASM disk path for the ARCH disk group is different on each node, and this creates a problem for Grid Infrastructure in determining which path is the correct and valid one. So we must be very careful about the logical device names of shared disks. If we come across such a situation, what should we do to overcome this issue? On racnode1 there is an additional device allocated, a CDROM, and it is the one that caused the device names to change.
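At this point it can also help to confirm what ASM itself sees behind each path. kfed ships with Grid Infrastructure and can read the ASM disk header directly. The following is a hedged sketch for this environment, assuming the GI home /u01/grid and the slice-4 devices from the v$asm_disk output above:

# Read the ASM disk header behind the path each node uses for the ARCH disk.
-bash-3.2# /u01/grid/bin/kfed read /dev/rdsk/c0d6s4 | egrep 'dskname|grpname'    # on racnode1
-bash-3.2# /u01/grid/bin/kfed read /dev/rdsk/c0d1s4 | egrep 'dskname|grpname'    # on racnode2
# Both commands should report the same kfdhdb.dskname and kfdhdb.grpname, confirming that the
# two nodes reach the same backend disk through different logical device IDs.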
root@controlhost01:~# ldm list -o disk racnode
NAME
racnode

DISK
    NAME     VOLUME           TOUT ID   DEVICE   SERVER    MPGROUP
    OS       OS@racnode            0    disk@0   primary
    cdrom    cdrom@racnode         1    disk@1   primary
    DATA     DATA@RACDisk          5    disk@5   primary
    OCR1     OCR1@RACDisk          2    disk@2   primary
    OCR2     OCR2@RACDisk          3    disk@3   primary
    OCR3     OCR3@RACDisk          4    disk@4   primary
    ARCH     ARCH@RACDisk          6    disk@6   primary
root@controlhost01:~#

On racnode2 the CDROM device does not even exist:

root@controlhost02:~# ldm list -o disk racnode
NAME
racnode

DISK
    NAME     VOLUME           TOUT ID   DEVICE   SERVER    MPGROUP
    OS       OS@racnode            0    disk@0   primary
    OCR1     OCR1@RACDisk          2    disk@2   primary
    OCR2     OCR2@RACDisk          3    disk@3   primary
    OCR3     OCR3@RACDisk          4    disk@4   primary
    DATA     DATA@RACDisk          5    disk@5   primary
    ARCH     ARCH@RACDisk          1    disk@1   primary
root@controlhost02:~#

There is no CDROM available for racnode2. The solution is to remove the incorrectly numbered logical device from the guest domain using the controller domain and assign it again with the correct logical device ID. We also need to delete the CDROM from the racnode1 guest domain.

Remove the CDROM:

root@controlhost01:~# ldm rm-vdisk cdrom racnode

Check the status of the devices:

root@controlhost01:~# ldm list -o disk racnode
NAME
racnode

DISK
    NAME     VOLUME           TOUT ID   DEVICE   SERVER    MPGROUP
    OS       OS@racnode            0    disk@0   primary
    DATA     DATA@RACDisk          5    disk@5   primary
    OCR1     OCR1@RACDisk          2    disk@2   primary
    OCR2     OCR2@RACDisk          3    disk@3   primary
    OCR3     OCR3@RACDisk          4    disk@4   primary
    ARCH     ARCH@RACDisk          6    disk@6   primary
root@controlhost01:~#

The CDROM is now removed from the guest domain. Next we need to remove and re-add the ASM logical device, so we must ensure that a backup has been taken and can be restored. ASM will not recognize the device once it has been removed and reconnected, so we must follow the complete procedure for provisioning a new LUN to ASM.

Action Plan:
- Perform a backup of the data residing on the disk group
- Drop the disk group
- Remove the device from the guest domain
- Add the device back using the correct logical device ID
- Label the device
- Change the ownership and permissions
- Create the ASM disk group
- Restore the data to the ASM disk group

I will not demonstrate the detailed steps for the above action plan, but I will list the steps that need to be performed on the controller domain and the guest domain.

Remove the incorrect logical device from the guest domain:

root@controlhost01:~# ldm rm-vdisk ARCH racnode
root@controlhost01:~# ldm list -o disk racnode
NAME
racnode

DISK
    NAME     VOLUME           TOUT ID   DEVICE   SERVER    MPGROUP
    OS       OS@racnode            0    disk@0   primary
    DATA     DATA@RACDisk          5    disk@5   primary
    OCR1     OCR1@RACDisk          2    disk@2   primary
    OCR2     OCR2@RACDisk          3    disk@3   primary
    OCR3     OCR3@RACDisk          4    disk@4   primary
root@controlhost01:~#

The ARCH disk is gone after removal of the logical device.

Add the logical device back with the correct device ID:

root@controlhost01:~# ldm add-vdisk id=1 ARCH ARCH@RACDisk racnode
root@controlhost01:~# ldm list -o disk racnode
NAME
racnode

DISK
    NAME     VOLUME           TOUT ID   DEVICE   SERVER    MPGROUP
    OS       OS@racnode            0    disk@0   primary
    DATA     DATA@RACDisk          5    disk@5   primary
    OCR1     OCR1@RACDisk          2    disk@2   primary
    OCR2     OCR2@RACDisk          3    disk@3   primary
    OCR3     OCR3@RACDisk          4    disk@4   primary
    ARCH     ARCH@RACDisk          1    disk@1   primary
root@controlhost01:~#

The newly added device is now available with the correct logical device ID. Label the disk from the guest LDOM operating system and change the permissions and ownership of the newly added device. After this step the device is ready to be used by an ASM disk group.
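The labeling and permission steps are not shown above in detail. The following is a minimal sketch for this environment, assuming the re-added ARCH disk now appears as c0d1 on racnode1, that ASM uses slice 4 (as in the v$asm_disk paths shown earlier), and that the Grid Infrastructure owner and group are grid:asmadmin (the group name is an assumption):

-bash-3.2# format c0d1                              # label the disk and create slice 4 interactively
-bash-3.2# chown grid:asmadmin /dev/rdsk/c0d1s4     # ownership expected by the ASM instance (group name assumed)
-bash-3.2# chmod 660 /dev/rdsk/c0d1s4               # read/write for owner and group only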
Conclusion:

The purpose of this article is to help individuals who are implementing Oracle RAC on Oracle Solaris LDOMs. It took three days for us and Oracle Support to complete the root cause analysis for this problem. I strongly recommend verifying the logical device names across all cluster nodes before installing the cluster software, including after multiple hard and soft reboots, and verifying them again after the cluster installation.
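As a simple way to carry out that verification, the volume-to-ID maps of the two control domains can be compared directly. The sketch below assumes the hostnames and guest domain name used throughout this article, and awk field positions matching the listings above (TOUT column empty):

root@controlhost01:~# ldm list -o disk racnode | awk 'NF>=5 && $1!="NAME" {print $1, $3}' | sort > /tmp/vdisk_ids_controlhost01.txt
root@controlhost02:~# ldm list -o disk racnode | awk 'NF>=5 && $1!="NAME" {print $1, $3}' | sort > /tmp/vdisk_ids_controlhost02.txt
# Copy one file to the other control domain and compare; any diff output means the mapping differs:
root@controlhost01:~# diff /tmp/vdisk_ids_controlhost01.txt /tmp/vdisk_ids_controlhost02.txt

Going one step further, always specifying an explicit id= when adding shared vdisks (as in the ldm add-vdisk id=1 command above) prevents the device IDs from being auto-assigned differently on each control domain in the first place.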