
Wiki Page: Collaboration of DBA and Development team

Abstract

In the IT field it is very common to see the DBA team and the Development/Application team not working closely together when performance issues arise. Each team blames the other, which directly hampers their ability to resolve the issue. Often it is vague communication between the teams that leads to misunderstanding and ultimately impacts customers. Healthy conversations between teams lead to a tremendous gain in knowledge and skill when dealing with issues, and substantial value can be derived by adopting agile, iterative development approaches that evolve into continuous integration among the teams. Below are a few common scenarios that help shift the teams' mindset and overcome the barrier between the DBA and Development/Application teams.

Case 1: Unaccounted LOB access response time in total SQL elapsed time

While benchmarking an application, the Application team found that a few of their SQL statements accessing a LOB column were slow; in particular, whenever they accessed large LOB records (>20MB), the statement took more than 30 seconds. The DBA team was engaged to study this behavior and recommend ways to improve the response time of these statements. Curiously, the DBAs found that all of these statements were reported as completing in less than 2 seconds when they queried the dynamic view V$SQL. This contradiction between the DBA and Application teams escalated the issue. For the DBA it was an awkward situation: why would V$SQL misreport the elapsed time of a SQL statement? We therefore decided to simulate the issue and profile it to get more detail on SQL statements that access LOB data.

A demo table was created with a CLOB column and 10K records, and a simple "select * from DEMO" was executed; it took about 54 seconds to complete and return the results.

SQL> select * from DEMO;
...
.....
.......
10000 rows selected.

Elapsed: 00:00:53.80

Tracing with event 10046 was enabled for this statement and a report was generated using tkprof. It was strange to see the elapsed time of the SQL reported as roughly 7 seconds (0.19 seconds of database time plus the SQL*Net waits below) when the SQL had in fact run for about 54 seconds.

SQL ID: 27uhu2q2xuu7r Plan Hash: 3617692013

select * from demo

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    10001      0.12       0.19         72      10005          0       10000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    10003      0.12       0.19         72      10005          0       10000

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: DEMO

Rows     Row Source Operation
-------  ---------------------------------------------------
  10000  TABLE ACCESS FULL DEMO (cr=10005 pr=72 pw=0 time=0 us cost=21 size=19744985 card=9799)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   10003        0.00          0.01
  SQL*Net message from client                 10003        4.99          6.94
  db file sequential read                         2        0.00          0.00
  db file scattered read                         12        0.00          0.02
********************************************************************************

Checking the elapsed time of this SQL through V$SQL did not reveal the actual elapsed time either; it reported just 0.23 seconds, as shown below. This also confirms that it would be impossible to determine the true runtime of these queries by looking at an AWR report or its underlying tables.

SQL> select elapsed_time/1000000 from v$sql where sql_id='27uhu2q2xuu7r';

ELAPSED_TIME/1000000
--------------------
             .239964

Following the tkprof report further, we found that the total response time of all non-recursive statements was about 55 seconds, but this time was not accounted against the individual sql_id.
Under the wait event section we could see that the "direct path read" wait event consumed 46 seconds out of the 55-second total. This wait event correlates strongly with the LOB access, as the LOBs were not configured for caching.

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute      2      0.00       0.02          0          0          0           0
Fetch    10001      0.12       0.19         72      10005          0       10000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    10005      0.12       0.21         72      10005          0       10000

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   50003        0.00          0.08
  SQL*Net message from client                 50003        4.99          9.13
  db file sequential read                         2        0.00          0.00
  direct path read                           160000        0.54         46.03
  db file scattered read                         12        0.00          0.02

To investigate this unaccounted elapsed time further we looked into the raw trace file; below is a snippet.

=====================
PARSING IN CURSOR #1 len=16 dep=0 uid=0 oct=3 lid=0 tim=1469601320938670 hv=2245880055 ad='ccf827178' sqlid='27uhu2q2xuu7r'
select * from demo
END OF STMT
PARSE #1:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3617692013,tim=1469601320938670
EXEC #1:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3617692013,tim=1469601320938670
WAIT #1: nam='SQL*Net message to client' ela= 4 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1469601320957461
WAIT #1: nam='SQL*Net message from client' ela= 202 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=1469601320957688
WAIT #1: nam='db file sequential read' ela= 630 file#=1 block#=27490 blocks=1 obj#=206441 tim=1469601320959545
WAIT #1: nam='db file sequential read' ela= 649 file#=1 block#=27491 blocks=1 obj#=206441 tim=1469601320960278
WAIT #1: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=206441 tim=1469601320960316
FETCH #1:c=0,e=0,p=2,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=3617692013,tim=1469601320938670
WAIT #1: nam='SQL*Net message from client' ela= 78 driver id=1650815232 #bytes=1 p3=0 obj#=206441 tim=1469601320960429
WAIT #0: nam='direct path read' ela= 1148 file number=1 first dba=6266 block cnt=1 obj#=206442 tim=1469601320961741
WAIT #0: nam='direct path read' ela= 12 file number=1 first dba=6266 block cnt=1 obj#=206442 tim=1469601320961802
WAIT #0: nam='direct path read' ela= 266 file number=1 first dba=6266 block cnt=1 obj#=206442 tim=1469601320962108
WAIT #0: nam='direct path read' ela= 3 file number=1 first dba=6266 block cnt=1 obj#=206442 tim=1469601320962119
...
....

If we look closely at the wait events, we see WAIT #0 reported while the cursor actually being executed is CURSOR #1, and thus CURSOR #1 does not report the full elapsed time. But which cursor is WAIT #0 referring to, given that there is no CURSOR #0 in the trace, and why does it appear like this? Cursor #0 is a pseudocursor, an internal cursor created by Oracle for reading LOB data; since the pseudocursor is a different cursor, the time it spends is not included in the original cursor. The reason for the pseudocursor lies in the way LOBs are accessed: when we select a LOB we only get the LOB locator, and the LOB data must then be fetched through separate OPI (Oracle Program Interface) calls using that locator. These LOB access OPI calls are not normally reported (event 10051 must be set to trace OPI calls), but in releases 11.2 and higher there is an enhancement that reports the LOB access OPI calls in the trace file, as shown below.
PARSING IN CURSOR #47681193136232 len=16 dep=0 uid=0 oct=3 lid=0 tim=15759353126161 hv=2245880055 ad='92770f20' sqlid='27uhu2q2xuu7r'
select * from demo
END OF STMT
PARSE #47681193136232:c=4000,e=9126,p=1,cr=78,cu=0,mis=1,r=0,dep=0,og=1,plh=3617692013,tim=15759353126161
EXEC #47681193136232:c=0,e=17,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3617692013,tim=15759353126208
WAIT #47681193136232: nam='SQL*Net message to client' ela= 4 driver id=1650815232 #bytes=1 p3=0 obj#=93403 tim=15759353126248
WAIT #47681193136232: nam='SQL*Net message from client' ela= 277 driver id=1650815232 #bytes=1 p3=0 obj#=93403 tim=15759353126559
WAIT #47681193136232: nam='SQL*Net message to client' ela= 4 driver id=1650815232 #bytes=1 p3=0 obj#=93403 tim=15759353126658
FETCH #47681193136232:c=999,e=2026,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=3617692013,tim=15759353128631
WAIT #47681193136232: nam='SQL*Net message from client' ela= 14506 driver id=1650815232 #bytes=1 p3=0 obj#=93403 tim=15759353143164
WAIT #0: nam='direct path read' ela= 9988 file number=1 first dba=101059 block cnt=1 obj#=93401 tim=15759353155958
WAIT #0: nam='direct path read' ela= 309 file number=1 first dba=101059 block cnt=1 obj#=93401 tim=15759353156385
WAIT #0: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=93401 tim=15759353156413
LOBREAD: type=PERSISTENT LOB,bytes=20,c=2000,e=13225,p=2,cr=4,cu=0,tim=15759353156432
WAIT #0: nam='SQL*Net message from client' ela= 1663 driver id=1650815232 #bytes=1 p3=0 obj#=93401 tim=15759353158125
WAIT #0: nam='direct path read' ela= 246 file number=1 first dba=101059 block cnt=1 obj#=93401 tim=15759353158510
WAIT #0: nam='direct path read' ela= 220 file number=1 first dba=101059 block cnt=1 obj#=93401 tim=15759353158805
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=93401 tim=15759353158825
LOBREAD: type=PERSISTENT LOB,bytes=20,c=0,e=669,p=2,cr=3,cu=0,tim=15759353158839
...
....
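The pseudocursor accounting above can be checked without a database: each WAIT line carries its own ela= microsecond value, so a small script can show how time on CURSOR #0 never rolls up into CURSOR #1. The following is a minimal sketch (the trace lines are abbreviated stand-ins modeled on the snippet above, not a full 10046 trace parser):

```python
import re

# Abbreviated stand-in lines modeled on the 10046 trace above: waits on
# CURSOR #1 are charged to the statement, waits on the pseudocursor
# CURSOR #0 (LOB access) are not.
trace = """\
WAIT #1: nam='db file sequential read' ela= 630 file#=1 block#=27490
WAIT #1: nam='db file sequential read' ela= 649 file#=1 block#=27491
WAIT #0: nam='direct path read' ela= 1148 file number=1 first dba=6266
WAIT #0: nam='direct path read' ela= 266 file number=1 first dba=6266
"""

pattern = re.compile(r"WAIT #(\d+): nam='([^']+)' ela= (\d+)")

per_cursor = {}
for cursor, event, ela in pattern.findall(trace):
    # ela is reported in microseconds; accumulate per cursor number
    per_cursor[cursor] = per_cursor.get(cursor, 0) + int(ela)

print(per_cursor)  # cursor "0" time is invisible in cursor "1" totals
```

Summing per cursor number this way mirrors what tkprof does, and makes it obvious why the LOB reads attributed to cursor #0 are missing from the statement's own elapsed time.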
Armed with this information from the demo, we investigated the application's SQL statements and found that they did indeed take more than 30 seconds whenever they accessed LOB data; of those 30 seconds, 27 were spent on the wait event "db file sequential read", because LOB caching was enabled and the large LOBs were flooding the buffer cache. This in fact did not help the application: the buffer cache was used inefficiently, so nearly every LOB access still had to perform disk reads. With caching enabled, LOBs go into the buffer cache when read (whereas with caching disabled they bypass it), but the initial read of the LOB is still required, and for the larger LOBs that read could take 30+ seconds from the application. The solution for this slow LOB access was either to disable LOB caching so that "direct path read" takes place, or to allocate a separate buffer pool (KEEP/RECYCLE) for the LOBs to avoid flooding the main buffer cache.

Case 2: Implicit datatype conversion

One of the applications suddenly stopped working because one of its SQL statements was throwing ORA-01722, and according to the Application team this same SQL had been working perfectly until then. The Application team was surprised, as no code changes had been made to the application recently; this is when they engaged the DBA team to investigate. As usual, the DBA team gathered all the information they could from the Application team and concluded that it was impossible for the same SQL to have worked earlier, contradicting the Application team's claim. Their justification was the way the SQL was written compared to the datatype declared for the referenced column: the column is declared as VARCHAR2, but the SQL does not use single quotes around the value when referencing that column, and thus it throws ORA-01722.
Annoyed, the Application team pursued their own investigation to prove that this SQL really had been working fine earlier and had stopped without any change on the application side; they came back with application log files containing a history of the same SQL running successfully. At this point there was finger-pointing between the Application and DBA teams, each trying to vindicate itself. In the end, based on the evidence, it was concluded that the SQL had indeed been running fine earlier and the issue had to be database-related. As usual the DBA was blamed for the failure and made to take responsibility for the outage, and the business and all other teams waited for the DBA team to investigate and explain it. The first task was to determine whether a VARCHAR2 column can be referenced as a number (passing the value without single quotes) in SQL. A simple demonstration:

SQL> create table tab (col1 varchar2(20));

Table created.

SQL> begin
  2    for i in 1..5 loop
  3      insert into tab values (i);
  4    end loop;
  5    commit;
  6  end;
  7  /

PL/SQL procedure successfully completed.

SQL> select * from tab where col1=1;

COL1
--------------------
1

Surprisingly, it did work; let's see whether the execution plan can give some clues to explain the behaviour.
SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------
SQL_ID  bavq1swrqakx6, child number 0
-------------------------------------
select * from tab where col1=1

Plan hash value: 1910818592

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |       |       |     3 (100)|          |
|*  1 |  TABLE ACCESS FULL| TAB  |     1 |    42 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(TO_NUMBER("COL1")=1)

It is interesting to see from the predicate information that the optimizer converted the column value to the NUMBER datatype by applying the TO_NUMBER function. So far there is no problem, because the data is clean: the VARCHAR2 column contains only numbers. This confirms the Application team's claim that their SQL had been running fine earlier. Now let's check what happens if we insert a record with a character value.

SQL> insert into tab values ('AB');

1 row created.

SQL> commit;

Commit complete.

SQL> select * from tab where col1=1;
ERROR:
ORA-01722: invalid number

no rows selected

Suddenly the query that had been working fine started failing: a single row can cause every query that does not handle the datatype properly to fail.

SQL> select * from tab where col1='1';

COL1
--------------------
1

We saw the cleverness of the optimizer in converting the datatype with TO_NUMBER while the data in the table was clean, but this breaks as soon as the data is a mixture of numeric and character values. It is always recommended not to depend on implicit datatype conversion: declare proper datatypes when designing tables and match them in your SQL.
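The failure mode in Case 2 is not Oracle-specific: any implicit string-to-number coercion breaks the same way once a single non-numeric row appears. As a hedged, language-agnostic illustration (made-up data, Python's int() standing in for Oracle's TO_NUMBER):

```python
rows = ["1", "2", "3", "4", "5"]  # clean data: digits stored in a string column

# Implicit-conversion style predicate: coerce every stored value to a
# number, like Oracle's filter(TO_NUMBER("COL1")=1).
assert [r for r in rows if int(r) == 1] == ["1"]  # works while data is clean

rows.append("AB")  # one bad row poisons the predicate for ALL queries
try:
    [r for r in rows if int(r) == 1]
except ValueError:
    print("conversion failed")  # analogue of ORA-01722: invalid number

# Explicit fix: compare in the column's declared type (string vs string),
# like rewriting the predicate as col1 = '1'.
assert [r for r in rows if r == "1"] == ["1"]
```

The same query text succeeds or fails depending purely on the data present, which is exactly why the application "worked until yesterday".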
Case 3: Wrong interpretation of ASH data due to the assumption that each record represents one second

One of the application jobs was taking much more time than its previous daily runs. The job extracts 150 million records from the database and stores them in a flat file on the application side for further processing. The database it connects to is an Active Data Guard standby used for read-only work. Since the job runs on Active Data Guard, the only place to get its run-time details is ASH, so one of the DBAs immediately copied the standby ASH content into the primary database over a database link for later investigation. ASH was the only source of information, so the DBA tried to narrow down the problem area by mining it. A common assumption about ASH is that each record represents one second, so DBAs tend to write SQL that groups by sql_id and sql_exec_id to derive the run-time duration of each execution of the problematic SQL, and then break down the time spent on each wait event, as shown below.

SQL> SELECT count(*), sql_id, sql_exec_id
     FROM GV$ACTIVE_SESSION_HISTORY
     WHERE sql_id = '0hnujnc0a8b9j'
     GROUP BY sql_id, sql_exec_id
     ORDER BY 1;

  COUNT(*) SQL_ID        SQL_EXEC_ID
---------- ------------- -----------
       411 0hnujnc0a8b9j    50331653
       432 0hnujnc0a8b9j    50331651
       440 0hnujnc0a8b9j    50331652
       463 0hnujnc0a8b9j    67108864

So there were four executions of this sql_id according to ASH, and all four appear to have taken almost the same amount of time. Let's take one sql_exec_id (67108864) for a further breakdown of elapsed time by wait event.
SQL> SELECT count(*), session_state, event
     FROM V$ACTIVE_SESSION_HISTORY
     WHERE sql_id = '0hnujnc0a8b9j'
     AND sql_exec_id = 67108864
     GROUP BY session_state, event
     ORDER BY 1;

  COUNT(*) SESSION EVENT
---------- ------- -----------------------------------
        10 WAITING cell multiblock physical read
        23 WAITING direct path write temp
        71 WAITING gc cr multi block request
       164 WAITING direct path read temp
       195 ON CPU

There is nothing unusual in this information: each execution appears to have taken about 450 seconds, which does not match the Application team's claim that the job took about 4 hours to complete. This is where the DBA and Application teams started arguing with each other, and ultimately the DBA was pressured to look into the issue in greater detail. Anyone with a good understanding of how ASH works will easily spot the mistake in the SQL above. It is true that ASH snapshots active sessions every second, so many DBAs simply sum up the records, treating each one as a second of runtime, as shown above. But an important detail of ASH is that it does not capture idle wait events: if a session is idle in the middle of a SQL execution, that time is not captured by ASH. Hence it is always recommended to calculate the start and end time of the SQL when using ASH, as shown below.

SQL> SELECT sql_exec_id,
            ( Cast( max_tim AS DATE ) - Cast( min_tim AS DATE ) ) * 60 * 60 * 24 AS seconds
     FROM ( SELECT Min( sample_time ) min_tim,
                   Max( sample_time ) max_tim,
                   sql_exec_id
            FROM V$ACTIVE_SESSION_HISTORY
            WHERE sql_id = '0hnujnc0a8b9j'
            GROUP BY sql_exec_id );

SQL_EXEC_ID SECONDS
----------- -------
   50331653     411
   50331651     432
   50331652     440
   67108864   15015

This confirms what the Application team were saying: the job took about 4 hours (15015 seconds) to complete, but according to ASH only 463 of those seconds were spent doing work; all the remaining time was idle.
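The two ways of reading ASH can be mimicked on plain timestamped samples: counting rows (one per sampled active second) understates wall-clock time whenever the session goes idle, while max(sample_time) - min(sample_time) recovers it. A small sketch with made-up sample times (not real ASH data):

```python
from datetime import datetime, timedelta

# Made-up ASH-style samples: one row per second the session was ACTIVE.
# The session works for 5 s, sits idle (unsampled) for 60 s, then works 5 s.
start = datetime(2024, 1, 1, 9, 0, 0)
samples = [start + timedelta(seconds=s) for s in range(5)]
samples += [start + timedelta(seconds=65 + s) for s in range(5)]

active_seconds = len(samples)                       # the naive count(*) reading
wall_clock = (max(samples) - min(samples)).seconds  # the min/max reading

print(active_seconds, wall_clock)  # 10 vs 69: idle time is invisible to count(*)
```

The gap between the two numbers is exactly the idle time ASH never sampled, which is why count(*) reported ~463 seconds while the job really ran for 15015.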
After receiving this information, the Application team was able to track the issue down to their storage system, which was having problems and delaying their load; because of this, the database session sat idle on SQL*Net idle wait events, which ASH does not capture.

Conclusion

In all the cases explained above the issue lay with either the DBA team or the Application team, but both teams had to revisit the issue because the problem description was vaguely communicated, delaying root-cause analysis. That is why the best option is to aim for consensus and to listen openly to the concerns of the other side instead of throwing issues over the wall. Precise communication improves the situation between the teams and leads to faster resolution. There should be no barrier between the DBA and Application teams; strong collaboration leads to comprehensive performance improvements. The database and development teams have different perspectives; both are important and need to be coordinated. Early in the project life-cycle, adopt agile, iterative database development approaches, which imply continuous integration. In summary, a commitment to collaborative work has high potential for faster issue resolution and greater insight into each other's technical areas.
