
Wiki Page: Managing & Troubleshooting Exadata – Part 2

Written by Syed Jaffar Hussain

Part 1 discussed the significance of Exadata patching, explained the cell and DB node patching concepts and the patching tools, and demonstrated with hands-on examples how to apply a patch at the various layers of Exadata. This part of the series focuses on cell health check verification and on collecting the right information from the various log, trace, and dump files to troubleshoot cell and InfiniBand issues. You will also learn about the automated file deletion policy on the cell server.

Hierarchy of the logs, traces

Oracle keeps track of useful information in various log files and dumps critical information into trace or dump files. Reviewing these files from time to time is strongly recommended, as they provide a glimpse of the current state of the cell, database, RAC, and so on. This section takes you through the hierarchy of logs on the Exadata cell server and explains the importance of each file.

Every cell has a /var/log/oracle file system, as shown in the picture below:

You will find the following sub-directories underneath /var/log/oracle:

- diag
- cellos
- crashfiles
- deploy

Cell alert.log

Like the database and the Oracle cluster, each cell maintains its own alert.log file, in which it keeps track of cell start/stop events, services information, and other important details. Whenever there is an issue with the Exadata services, this is the first file to review.
Location: /opt/oracle/cell/log/diag/asm/cell/{cellname}/trace
Name: alert.log

MS logfile

Review the log below whenever you encounter issues with the Management Server (MS) service:

Location: /opt/oracle/cell/log/diag/asm/cell/{cellname}/trace
Name: ms-odl.log

Crash and Core files

By default, crash core files are dumped to the following location on an Exadata cell:

/var/log/oracle/crashfiles

To change the crash core file location, modify the following configuration file on the cell:

/etc/kdump.conf (change the path to the new location)

Cell patching log files

For cell patching related log files, review the files under the following location:

/var/log/oracle/cellos

OS log file

All OS related messages can be reviewed in the following file:

/var/log/messages

The image below depicts which patching tool is used to patch the Exadata stack:

Disk controller Firmware logs

Battery capacity and feature properties can be viewed with the following command:

/opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -a0

alerthistory & cell details

alerthistory is another powerful command that gives significantly useful information about the cell. It is strongly recommended to run through alerthistory on each cell from time to time.

Another powerful command on the cell, used to determine the health state of the cell, is the following:

To ensure the stability of the cell, verify its health status: check the fan status, power status, and cell status, and confirm that the CELLSRV, MS, and RS services are up and running.

Proper Tools to verify the Exadata components health check

It is essential to know the proper tools on Exadata for verifying the health status of the cell components. The following are a few important tools that can be used to verify the status of different components, such as the cell boot location/files, InfiniBand status, etc.
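Before turning to the individual tools, the cell-level check described above (fan, power, and cell status plus the CELLSRV/MS/RS services) can be scripted. The following is a minimal Python sketch, assuming the cell details are obtained with the CellCLI command list cell detail and arrive as "name: value" lines. The expected values in EXPECTED and the text in SAMPLE are illustrative assumptions, not output captured from a real cell, and may differ between releases.

```python
import subprocess

# Attribute names as printed by `cellcli -e list cell detail`.
# The "healthy" values are assumptions and may vary by software release.
EXPECTED = {
    "status": "online",
    "fanStatus": "normal",
    "powerStatus": "normal",
    "cellsrvStatus": "running",
    "msStatus": "running",
    "rsStatus": "running",
}


def parse_cell_detail(text):
    """Turn 'name: value' lines into a dict; lines without a colon are ignored."""
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            attrs[name.strip()] = value.strip()
    return attrs


def unhealthy_attributes(attrs, expected=EXPECTED):
    """Return {name: actual value} for attributes that are missing or unexpected."""
    return {
        name: attrs.get(name, "<missing>")
        for name, want in expected.items()
        if attrs.get(name) != want
    }


def check_local_cell():
    """Run CellCLI on the cell itself; only works on an Exadata cell server."""
    out = subprocess.run(
        ["cellcli", "-e", "list cell detail"],
        capture_output=True, text=True, check=True,
    ).stdout
    return unhealthy_attributes(parse_cell_detail(out))


# Made-up sample text (hypothetical cell name), used only to show the parsing.
SAMPLE = """
         name:                   cel01
         status:                 online
         fanStatus:              normal
         powerStatus:            normal
         cellsrvStatus:          running
         msStatus:               stopped
         rsStatus:               running
"""

print(unhealthy_attributes(parse_cell_detail(SAMPLE)))  # {'msStatus': 'stopped'}
```

An empty result means every checked attribute matched its expected value. Keeping the parsing separate from the CellCLI call lets the same check run against saved output, for example output collected from several cells.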
Imageinfo

Imageinfo provides crucial information about the cell software, the possibility of rolling back to the previous image, and the location/file of the cell boot USB partition. It is especially useful before and after patching the cell servers:

Verifying network topology

To verify spine/leaf switch status, topology, and errors, use the following command:

/opt/oracle.SupportTools/ibdiagtools/verify-topology

InfiniBand Link details

Run the iblinkinfo command to review the InfiniBand link details on the cell:

CA: uso17 S 192.168.2.112,192.168.2.113 HCA-1:
    0x0010e00001495101  5  1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  2 10[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
    0x0010e00001495102  6  2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  1 10[ ] "SUN DCS 36P QDR uso28 10.0.9.92" ( )
Switch: 0x0010e04071e5a0a0 SUN DCS 36P QDR uso28 10.0.9.92:
    1  1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 12  2[ ] "uso19 C 192.168.2.116,192.168.2.117 HCA-1" ( )
    1  2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  4  2[ ] "uso18 C 192.168.2.114,192.168.2.115 HCA-1" ( )
    1  3[ ] ==( Down/ Polling)==> [ ] "" ( )
    1  4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 10  2[ ] "uso20 C 192.168.2.118,192.168.2.119 HCA-1" ( )
    1  5[ ] ==( Down/ Polling)==> [ ] "" ( )
    1  6[ ] ==( Down/ Polling)==> [ ] "" ( )
    1  7[ ] ==( Down/ Polling)==> [ ] "" ( )
    1  8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  8  2[ ] "uso26 S 192.168.2.110,192.168.2.111 HCA-1" ( )
    1  9[ ] ==( Down/ Polling)==> [ ] "" ( )
    1 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  6  2[ ] "uso17 S 192.168.2.112,192.168.2.113 HCA-1" ( )
    1 11[ ] ==( Down/Disabled)==> [ ] "" ( )
    1 12[ ] ==( Down/ Polling)==> [ ] "" ( )
    1 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  2 14[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
    1 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  2 13[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
    1 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  2 16[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
    1 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  2 15[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
    1 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  2 18[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )
    1 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  2 17[ ] "SUN DCS 36P QDR uso27 10.0.9.91" ( )

Ibstatus

Review the IB status and speed details using the ibstatus command:

Diagnostic collection

Collecting the right information is always important when troubleshooting or diagnosing an issue. However, when the information has to be gathered from dozens of different files on different servers, such as cell and DB servers, it becomes time consuming. Oracle provides a couple of utilities/tools to gather diagnostic information from logs/traces across all cell and DB servers at one time. The tools below can do the job:

sundiag.sh

sundiag.sh is available under /opt/oracle.SupportTools on each cell. The tool collects information from the cell server and DB server; the script needs to be run as the root user.

root> ./sundiag.sh
Oracle Exadata Database Machine - Diagnostics Collection Tool
Gathering Linux information
Skipping ILOM collection. Use the ilom or snapshot options, or login to ILOM over the network and run Snapshot separately if necessary.
/tmp/sundiag_usdwilo11_1418NML055_2016_02_07_13_53
Generating diagnostics tarball and removing temp directory
==============================================================================
Done. The report files are bzip2 compressed in /tmp/sundiag_usdwilo11_1418NML055_2016_02_07_13_53.tar.bz2
==============================================================================

The *.tar.bz2 file contains several files, including alert.log, celldisk details, etc.

Automated Cell File Management

Like the automated cluster file management deletion policy, there is automated cell maintenance that applies a file deletion policy based on date. The feature has the following characteristics:

- The Management Server (MS) service is responsible for running the file deletion policy.
- The retention for ADR is 7 days.
- Metric history files older than 7 days are deleted.
- The alert.log file is renamed once it reaches 10 MB.
- MS also triggers the deletion policy when file system utilization becomes high.
- If utilization of the / (root) directory or the /var/log/oracle directory reaches 80%, the automatic deletion policy is applied.
- The automatic deletion policy is applied to the /opt/oracle file system when its utilization reaches 90%.
- Files over 5 MB or older than one day under the / file system, /tmp, /var/crash, and /var/spool are deleted.

Conclusion

This part explained the hierarchy of the log/trace files on the cell server and the important tools that can be used to view the status of various Exadata components, such as the cell, InfiniBand, and disks. In the next part, you will learn the best approach to Exadata migration.
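As a supplementary illustration of the utilization thresholds listed under Automated Cell File Management, the following Python sketch reports which of the monitored file systems have reached the level at which the deletion policy would apply. It is not how MS implements the policy. The names THRESHOLDS, current_utilization, and filesystems_over_threshold are hypothetical, and the sketch reads the first path in that list as the root file system (/), which is an assumption.

```python
import os
import shutil

# Thresholds in percent, taken from the policy list in this article.
# Treating the first entry as the root file system "/" is an assumption.
THRESHOLDS = {
    "/": 80,
    "/var/log/oracle": 80,
    "/opt/oracle": 90,
}


def current_utilization(paths):
    """Return percent used for each path that exists on this host."""
    result = {}
    for path in paths:
        if os.path.exists(path):
            usage = shutil.disk_usage(path)
            if usage.total:  # guard against pseudo file systems reporting 0
                result[path] = 100.0 * usage.used / usage.total
    return result


def filesystems_over_threshold(utilization, thresholds=THRESHOLDS):
    """Return the monitored paths whose utilization has reached its threshold."""
    return sorted(
        path
        for path, limit in thresholds.items()
        if utilization.get(path, 0) >= limit
    )


if __name__ == "__main__":
    # Paths that do not exist on this host are simply skipped.
    print(filesystems_over_threshold(current_utilization(THRESHOLDS)))
```

Note that shutil.disk_usage computes used/total, which can differ by a few percent from the figure df prints, so results near a threshold should be treated as approximate.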
