Analyzing a Statspack Report
If you could choose just two Oracle utilities
to find and monitor performance problems in your Oracle9i Database
system, those two utilities would be Oracle Enterprise Manager (now available
in Release 4.0) and Statspack. Which area of the Summary page you focus
on first depends on whether you are investigating a performance problem
or monitoring changes in load; if you are investigating a performance problem,
you should start by checking the Top 5 Wait Events section.
When statistics and wait events can be misleading
There are certain checks which can be performed to help identify whether
a statistic or event is really of interest. When TIMED_STATISTICS is false,
wait events are ordered by the number of waits. This information may indicate
which events are of interest; however, it may be misleading. An event may be
waited for a large number of times, but the wait time (if it were available
for comparison) might show that the actual time waited is small despite the
high count, and hence the event is not really of interest. If wait time is available,
a useful comparison can be made by taking the total wait time for an event,
and comparing it to the elapsed time between snapshots. For example, if the
wait event accounts for only 30 seconds out of a two hour period, there is
probably little to be gained by investigating this event. However, if the
event accounts for 30 minutes of a 45 minute period, the event may be worth
investigating. There is a warning here, too: even an event which had a wait
of 30 minutes in a 45 minute snapshot may not be indicative of a problem,
when you take into account there were 2000 users on the system, and the host
hardware was a 64 node machine.
When interpreting computed statistics (such as percentages, or per-second
rates), it is important to cross-verify the computed statistic with the actual
statistic counts. This acts as a sanity check to determine whether the derived
rates are really of interest. On initial examination, a soft-parse ratio of
50% would normally indicate a potential tuning area. However if the actual
statistic counts are small, this would not be an area of interest. For example,
if there was one hard parse and one soft parse during the Statspack interval,
the soft-parse ratio would be 50%, even though the statistic counts show this
is not an area of concern.
SNAPSHOT LEVELS
LEVEL 0 – GENERAL PERFORMANCE
This level can be used to gather general performance information about the
database.
LEVEL 5 – GENERAL PERFORMANCE + SQL STATEMENTS
This snapshot level will gather all the information from the previous levels,
plus it will collect performance data on high resource SQL statements. This
is also the default snapshot level when Statspack is installed.
LEVEL 6 – GENERAL PERFORMANCE + SQL STATEMENTS + SQL PLANS AND SQL PLAN
USAGE
This level is new in Oracle9i and it will include all the information collected
from the previous snapshot levels, plus execution path and plan usage information
as they relate to high resource SQL statements. This type of information can
prove critical when determining if the execution path or plan has changed
for high resource SQL statements. Oracle recommends using this level when
one of the following situations has occurred:
• A plan has possibly changed after large volumes of data have been added.
• New optimizer setting information needs to be obtained.
LEVEL 10 – GENERAL PERFORMANCE + SQL STATEMENTS + SQL PLANS AND SQL PLAN
USAGE + PARENT
AND CHILD LATCHES
This level includes all the information collected at the previous snapshot
levels, plus parent and child latch information. Snapshots taken at this level
take even longer to complete: the previous two levels are already
information-gathering intensive, first because the volume of data collected
depends on the shared_pool_size and second because of the SQL statement detail
gathered, and the parent and child latch information is added on top of that.
Oracle recommends using this level only when requested by Oracle technical
support personnel.
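If a level other than the default is needed, it can be requested for a single
snapshot or made the new default for all future snapshots (a minimal sketch;
run as the PERFSTAT user):
SQL> EXECUTE STATSPACK.SNAP(i_snap_level=>6);
SQL> EXECUTE STATSPACK.MODIFY_STATSPACK_PARAMETER(i_snap_level=>6);
The first call applies level 6 to that one snapshot only; the second changes
the stored default used by subsequent snapshots.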
LEVEL SETTING RECOMMENDATION
It is recommended to set TIMED_STATISTICS to true BEFORE the first snapshot
because it helps to establish a better baseline; otherwise another baseline
will be needed AFTER it is turned on. This can be done with the ALTER SYSTEM
command and/or by setting it in the init.ora file.
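For example (a minimal sketch), the parameter can be enabled dynamically and
also placed in the init.ora so that it persists across instance restarts:
SQL> ALTER SYSTEM SET TIMED_STATISTICS = TRUE;
and in the init.ora file:
timed_statistics = true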
SESSION SPECIFIC SNAPSHOT
Statspack also provides the capability to gather session specific information.
Passing the i_session_id value to the Statspack.snap procedure will enable
this option.
The following is an example of using this feature:
SQL> EXECUTE STATSPACK.SNAP(i_session_id=>20);
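If the target session's SID is not already known, it can be looked up in
V$SESSION first (a sketch; the username SCOTT is only illustrative):
SQL> SELECT sid, serial#, username FROM v$session WHERE username = 'SCOTT';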
EXECUTING SNAPSHOTS
Executing a snapshot interactively can be as easy as accessing SQL*Plus
as the PERFSTAT user and using the STATSPACK.SNAP procedure, or the execution
of snapshots can be automated. The interactive method is highly beneficial when
a problem is reported in the database and a snapshot could prove beneficial
for troubleshooting, whereas the value of an automated snapshot is realized
when a problem is reported at a later time and a comparison needs to be made
between two specific times that occurred in the past.
INTERACTIVE METHOD
Access SQL*Plus as the PERFSTAT user and execute either method 1, 2 or 3
as discussed in the above snapshot Configuration section. The simplest form
of the interactive mode is as follows:
SQL> EXECUTE STATSPACK.SNAP
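Once at least two snapshots exist, the report covering the interval between
any two snapshot IDs is produced with the spreport.sql script shipped with
Statspack; it prompts for the begin and end snapshot IDs and a report name:
SQL> @?/rdbms/admin/spreport.sql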
AUTOMATED METHOD
The ability to automate a snapshot is another one of the great features
of the Statspack utility. Automating and scheduling when to take snapshots
allows for the collection of database performance information that would
be beneficial for troubleshooting performance problems that occurred earlier.
The following are two ways that snapshots can be automated:
• Oracle’s DBMS_JOB utility to schedule snapshots. This utility will be
discussed in greater detail.
• An operating system-specific job scheduler. For example, on Unix, shell
scripts can be written and then scheduled through the cron scheduler; on NT,
the AT scheduler can be used in combination with .cmd files.
DBMS_JOB UTILITY
The DBMS_JOB utility provides a way to schedule database related tasks that
are controlled within the database. Through the DBMS_JOB utility snapshots
can be taken at a scheduled interval. When the spcpkg.sql script was executed
as part of the Statspack installation, the DBMS_JOB package was created for
the PERFSTAT user. One of the requirements to use the DBMS_JOB utility is
that the init.ora parameter job_queue_processes must be set to a value greater
than 0. The spauto.sql script is designed to setup the automation of executing
snapshots once every hour. The following line from the script is how the job
is added to the schedule:
dbms_job.submit(:jobno, 'statspack.snap;', trunc(sysdate+1/24,'HH'), -
    'trunc(SYSDATE+1/24,''HH'')', TRUE, :instno);
The benefits of using the spauto.sql script are that it:
• Displays the job number assigned
• Identifies the number of job_queue_processes set for the database
• Shows the next time that the snapshot will occur
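If automated snapshots do not appear to be running, a quick check (as the
PERFSTAT user) is to confirm that the job queue is enabled and to look at the
scheduled job; a minimal sketch, in which the value of 1 is only an example:
SQL> SHOW PARAMETER job_queue_processes
SQL> ALTER SYSTEM SET job_queue_processes = 1;
SQL> SELECT job, what, next_date, interval FROM user_jobs;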
If you are not comparing two reports, it's still a good idea to scan
the Load Profile for any rates that seem high, irrespective of a baseline.
For example, a high hard-parse rate (say, greater than 100 per second) may
have serious implications for performance. High hard-parse rates are likely
to be accompanied by latch contention, so you would expect to see latch free
waits in the Top 5 Wait Events or high in the complete Wait Events list.
Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:            351,530.67              7,007.37
              Logical reads:              5,449.81                108.64
              Block changes:              1,042.08                 20.77
             Physical reads:                 37.71                  0.75
            Physical writes:                134.68                  2.68
                 User calls:              1,254.72                 25.01
                     Parses:                  4.92                  0.10
                Hard parses:                  0.02                  0.00
                      Sorts:                 15.73                  0.31
                     Logons:                 -0.01                  0.00
                   Executes:                473.73                  9.44
               Transactions:                 50.17

  % Blocks changed per Read:   19.12    Recursive Call %:    4.71
 Rollback per transaction %:    2.24       Rows per Sort:   20.91
Where:
. Redo size: This is the amount of redo generated during this report.
. Logical Reads: This is calculated as Consistent Gets + DB Block Gets =
Logical Reads
. Block changes: The number of blocks modified during the sample interval
. Physical Reads: The number of requests for a block that caused a physical
I/O.
. Physical Writes: The number of physical writes issued.
. User Calls: The number of queries generated
. Parses: Total of all parses: both hard and soft
. Hard Parses: Those parses requiring a completely new parse of the SQL
statement. These consume both latches and shared pool area.
. Soft Parses: Not listed but derived by subtracting the hard parses from
parses. A soft parse reuses a previous hard parse and hence consumes
far fewer resources.
. Sorts, Logons, Executes, and Transactions are all self-explanatory
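Outside of a Statspack interval, a rough feel for several of these figures
can also be taken directly from V$SYSSTAT, bearing in mind that the values
are cumulative since instance startup rather than per-second rates (a sketch):
SQL> SELECT name, value FROM v$sysstat
     WHERE name IN ('redo size', 'user calls', 'parse count (total)',
                    'parse count (hard)', 'sorts (memory)', 'sorts (disk)');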
Basically, you need to keep in mind the characteristics of your application—whether it is query-intensive or update-intensive, whether it involves lots of sorting, and so on—when you're evaluating the Instance Efficiency Percentages. Here's how each ratio is calculated, along with which related sections of the report you should look at when investigating suspicious values:
Buffer Nowait Ratio. This ratio relates to requests that a server
process makes for a specific buffer; it gives the percentage of those requests
in which the requested buffer is immediately available. All buffer types
are included in this statistic. If the ratio is low, check the Buffer Wait
Statistics section of the report for more detail on which type of block is
being contended for.
Buffer Hit Ratio. This ratio, also known as the buffer-cache hit ratio,
gives the percentage of block requests that were satisfied within the cache
without requiring physical I/O. Although historically known as one of the
most important statistics to evaluate, this ratio can sometimes be misleading.
A low buffer hit ratio does not necessarily mean the cache is too small;
it may be that potentially valid full-table scans are artificially reducing
what is otherwise a good ratio. Similarly, a high buffer hit ratio (say,
99 percent) normally indicates that the cache is adequately sized, but this
assumption may not always be valid.
For example, frequently executed SQL statements that repeatedly refer
to a small number of buffers via indexed lookups can create a misleadingly
high buffer hit ratio. When these buffers are read, they are placed at the
most recently used (MRU) end of the buffer cache; iterative access to these
buffers can artificially inflate the buffer hit ratio. This inflation makes
tuning the buffer cache a challenge. Sometimes you can identify a too-small
buffer cache by the appearance of the write complete waits event, which
indicates that hot blocks (that is, blocks that are still being modified)
are aging out of the cache while they are still needed; check the Wait Events
list for evidence of this event.
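The same ratio can be computed outside the report from V$SYSSTAT, again as a
cumulative figure since instance startup (a sketch using the classic formula):
SQL> SELECT 1 - (phy.value / (cur.value + con.value)) "Buffer Hit Ratio"
     FROM v$sysstat cur, v$sysstat con, v$sysstat phy
     WHERE cur.name = 'db block gets'
       AND con.name = 'consistent gets'
       AND phy.name = 'physical reads';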
Library Hit Ratio. This ratio, also known as the library-cache hit
ratio, gives the percentage of pin requests that result in pin hits. A pin
hit occurs when the SQL or PL/SQL code to be executed is already in the
library cache and is valid to execute. A low library hit ratio could imply
that SQL is prematurely aging out of a too-small shared pool, or that non-shareable
SQL is being used. If the soft parse ratio is also low, check whether there's
a parsing issue.
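The per-namespace detail behind this ratio is visible in V$LIBRARYCACHE
(a minimal sketch):
SQL> SELECT namespace, pins, pinhits, reloads, invalidations
     FROM v$librarycache;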
Redo Nowait Ratio. This ratio indicates the amount of redo entries
generated for which there was space available in the redo log. The percentage
is calculated as follows:
100 x (1- (redo-log space requests/redo entries))
The redo-log space-request statistic is incremented when an Oracle process attempts to write a redo-log entry but there is not sufficient space remaining in the online redo log. Thus, a value close to 100 percent for the redo nowait ratio indicates minimal time spent waiting for redo logs to become available, either because the logs are not filling up very often or because the database is able to switch to a new log quickly whenever the current log fills up.
If your alert log shows that you are switching logs frequently (that
is, more than once every 15 minutes), you may be able to reduce the amount
of switching by increasing the size of the online redo logs. If the log
switches are not frequent, check the disks on which the redo logs reside
to see why the switches are not happening quickly. If these disks are not
overloaded, they may be slow, which means you could put the files on faster
disks.
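The two statistics behind this ratio can also be checked directly in
V$SYSSTAT if you want to see the raw counts (a sketch):
SQL> SELECT name, value FROM v$sysstat
     WHERE name IN ('redo log space requests', 'redo entries');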
In-Memory Sort Ratio. This ratio gives the percentage of sorts that
were performed in memory, rather than requiring a disk-sort segment to complete
the sort. Optimally, in an OLTP environment, this ratio should be high.
If it isn't, consult the Oracle8i Designing and Tuning for Performance
manual for information on tuning sorts.
Soft Parse Ratio. This ratio gives the percentage of parses that were
soft, as opposed to hard. A soft parse occurs when a session attempts to
execute a SQL statement and a usable version of the statement is already
in the shared pool. In other words, all data (such as the optimizer execution
plan) pertaining to the statement in the shared pool is equally applicable
to the statement currently being issued. A hard parse, on the other hand,
occurs when the current SQL statement is either not in the shared pool or
not there in a shareable form. An example of the latter case would be when
the SQL statement in the shared pool is textually identical to the current
statement but the tables referred to in the two statements resolve to physically
different tables.
Hard parsing is an expensive operation and should be kept to a minimum in an OLTP environment. The aim is to parse once, execute many times.
Ideally, the soft parse ratio should be greater than 95 percent. When
the soft parse ratio falls much below 80 percent, investigate whether you
can share SQL by using bind variables or force cursor sharing by using the
init.ora parameter cursor_sharing (new in Oracle8i Release 8.1.6).
Before you jump to any conclusions about your soft parse ratio, however, be sure to compare it against the actual hard and soft parse rates shown in the Load Profile. If the rates are low (for example, 1 parse per second), parsing may not be a significant issue in your system. Another useful standard of comparison is the proportion of parse time that was not CPU-related, given by the following ratio:
(parse time CPU) / (parse time elapsed)
A low value for this ratio could mean that the non-CPU-related parse
time was spent waiting for latches, which might indicate a parsing or latching
problem. To investigate further, look at the shared-pool and library-cache
latches in the Latch sections of the report for indications of contention
on these latches.
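The raw figures for this comparison also come from V$SYSSTAT (a sketch;
the values are cumulative since instance startup):
SQL> SELECT name, value FROM v$sysstat
     WHERE name IN ('parse time cpu', 'parse time elapsed', 'parse count (hard)');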
Latch Hit Ratio. This ratio gives the percentage of latch gets that did
not involve a miss, computed across all latches (that is, 1 minus total
misses divided by total gets). A low value for this ratio
indicates a latching problem, whereas a high value is generally good. However,
as the data is rolled up over all latches, a high latch hit ratio can artificially
mask a low get rate on a specific latch. Cross-check this value with the
Top 5 Wait Events to see if latch free is in the list, and refer to the Latch
sections of the report.
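To see which individual latches are behind a suspicious overall figure,
V$LATCH can also be queried directly (a minimal sketch):
SQL> SELECT name, gets, misses, sleeps
     FROM v$latch
     WHERE misses > 0
     ORDER BY misses DESC;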
Top 5 Wait Events
When you are trying to eliminate bottlenecks on your system, your Statspack report's Top 5 Wait Events section is the first place to look. This section of the report shows the top 5 wait events, the full list of wait events, and the background wait events. If your system's TIMED_STATISTICS initialization parameter is set to true, the events are ordered by time waited, which is preferable, since the number of waits alone does not show each event's real impact. If TIMED_STATISTICS is false, the events are ordered by the number of waits.
Listing 1
shows a large number of waits related to reading
a single block (db file sequential read) as well as waits for latches (latch
free). You can see in this listing high waits for some of the writing to
datafiles and log files. To identify which of these are major issues, you
must narrow down the list by investigating the granular reports within other
sections of Statspack.
Code Listing 1: Statspack report showing
waits related to reading a single block
Top 5 Wait Events
------------------------------------
Event                                 Waits   Wait Time (cs)   % Total Wt Time
------------------------------- ----------- ---------------- -----------------
db file sequential read          18,977,104       22,379,571             82.29
latch free                        4,016,773        2,598,496              9.55
log file sync                     1,057,224          733,490              2.70
log file parallel write           1,054,006          503,695              1.85
db file parallel write            1,221,755          404,230              1.49
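A cumulative, instance-wide version of this list (since instance startup,
rather than between two snapshots) is also available from V$SYSTEM_EVENT,
which can be useful when no suitable pair of snapshots exists (a sketch):
SQL> SELECT event, total_waits, time_waited
     FROM v$system_event
     ORDER BY time_waited DESC;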
Resolving Your Wait Events
                                                               Avg
                                                  Total Wait  wait    Waits
Event                            Waits  Timeouts   Time (cs)  (ms)     /txn
---------------------------- --------- --------- ----------- ----- --------
latch free                      88,578    32,522      18,341     2   ######
enqueue                            319       230       5,932   186     14.5
row cache lock                   4,941         0       2,307     5    224.6
control file parallel write      1,172         0         332     3     53.3
db file parallel write             176         0          67     4      8.0
log file parallel write            315         0          65     2     14.3
db file scattered read             137         0          62     5      6.2
LGWR wait for redo copy             66        10          47     7      3.0
The following are 10 of the most common causes for
wait events, along with explanations and potential solutions:
1. DB File Scattered Read. This generally indicates waits related to full table scans. As full table scans are pulled into memory, they rarely fall into contiguous buffers but instead are scattered throughout the buffer cache. A large number here indicates that your table may have missing or suppressed indexes. Although it may be more efficient in your situation to perform a full table scan than an index scan, check to ensure that full table scans are necessary when you see these waits. Try to cache small tables to avoid reading them in over and over again, since a full table scan is put at the cold end of the LRU (Least Recently Used) list.
2. DB File Sequential Read. This event generally indicates a single block read (an index read, for example). A large number of waits here could indicate poor joining orders of tables, or unselective indexing. It is normal for this number to be large for a high-transaction, well-tuned system, but it can indicate problems in some circumstances. You should correlate this wait statistic with other known issues within the Statspack report, such as inefficient SQL. Check to ensure that index scans are necessary, and check join orders for multiple table joins. The DB_CACHE_SIZE will also be a determining factor in how often these waits show up. Problematic hash-area joins should show up in the PGA memory, but they're also memory hogs that could cause high wait numbers for sequential reads. They can also show up as direct path read/write waits. These circumstances are usually interrelated. When they occur in conjunction with the appearance of the db file scattered read and db file sequential read in the Top 5 Wait Events section, first you should examine the SQL Ordered by Physical Reads section of the report, to see if it might be helpful to tune the statements with the highest resource usage. Then, to determine whether there is a potential I/O bottleneck, examine the OS I/O statistics for corresponding symptoms. Also look at the average time per read in the Tablespace and File I/O sections of the report. If many I/O-related events appear high in the Wait Events list, re-examine the host hardware for disk bottlenecks and check the host-hardware statistics for indications that a disk reconfiguration may be of benefit.
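In addition to the SQL Ordered by Physical Reads section, one quick way to
find the statements behind heavy scattered or sequential reads is to query
V$SQLAREA directly (a sketch; the threshold of 100,000 reads is only illustrative):
SQL> SELECT sql_text, disk_reads, executions
     FROM v$sqlarea
     WHERE disk_reads > 100000
     ORDER BY disk_reads DESC;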
3. Free Buffer. This indicates your system is waiting for a buffer in memory, because none is currently available. Waits in this category may indicate that you need to increase the DB_CACHE_SIZE, if all your SQL is tuned. Free buffer waits could also indicate that unselective SQL is causing data to flood the buffer cache with index blocks, leaving none for this particular statement that is waiting for the system to process. This normally indicates that there is a substantial amount of DML (insert/update/delete) being done and that the Database Writer (DBWR) is not writing quickly enough; the buffer cache could be full of multiple versions of the same buffer, causing great inefficiency. To address this, you may want to consider accelerating incremental checkpointing, using more DBWR processes, or increasing the number of physical disks. To investigate whether this is an I/O problem, look at the Statspack File I/O Statistics section.
4. Buffer Busy. This is a wait for a buffer that is being used in an unshareable way or is being read into the buffer cache. Buffer busy waits should not be greater than 1 percent. Check the Buffer Wait Statistics section (or V$WAITSTAT; see the sketch after item 5 below) to find out if the wait is on a segment header. If this is the case, increase the freelist groups or increase the pctused to pctfree gap. If the wait is on an undo header, you can address this by adding rollback segments; if it's on an undo block, you need to reduce the data density on the table driving this consistent read or increase the DB_CACHE_SIZE. If the wait is on a data block, you can move data to another block to avoid this hot block, increase the freelists on the table, or use Locally Managed Tablespaces (LMTs). If it's on an index block, you should rebuild the index, partition the index, or use a reverse key index. To prevent buffer busy waits related to data blocks, you can also use a smaller block size: fewer records fall within a single block in this case, so it's not as "hot." When a DML (insert/update/delete) occurs, Oracle Database writes information into the block, including all users who are "interested" in the state of the block (Interested Transaction List, ITL). To decrease waits in this area, you can increase the initrans, which will create the space in the block to allow multiple ITL slots. You can also increase the pctfree on the table where this block exists (this writes the ITL information up to the number specified by maxtrans, when there are not enough slots built with the initrans that is specified).
5. Latch Free. Latches are low-level queuing mechanisms (they're accurately referred to as mutual exclusion mechanisms) used to protect shared memory structures in the system global area (SGA). Latches are like locks on memory that are very quickly obtained and released. Latches are used to prevent concurrent access to a shared memory structure. If the latch is not available, a latch free miss is recorded. Most latch problems are related to the failure to use bind variables (library cache latch), redo generation issues (redo allocation latch), buffer cache contention issues (cache buffers LRU chain), and hot blocks in the buffer cache (cache buffers chain). There are also latch waits related to bugs; check MetaLink for bug reports if you suspect this is the case (oracle.com/support). When latch miss ratios are greater than 0.5 percent, you should investigate the issue. I will cover latch waits in detail in my next Oracle Magazine column; the topic requires an article in itself. If latch free waits are in the Top 5 Wait Events or high in the complete Wait Events list, look at the latch-specific sections of the Statspack report to see which latches are contended for.
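For the buffer busy waits described in item 4, the class of block being
contended for can be checked directly in V$WAITSTAT (a minimal sketch):
SQL> SELECT class, count, time
     FROM v$waitstat
     ORDER BY count DESC;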
6. Enqueue. An enqueue is a lock that protects a shared resource. Locks protect shared resources, such as data in a record, to prevent two people from updating the same data at the same time. An enqueue includes a queuing mechanism, which is FIFO (first in, first out). Note that Oracle's latching mechanism is not FIFO. Enqueue waits usually point to the ST enqueue, the HW enqueue, the TX4 enqueue, and the TM enqueue. The ST enqueue is used for space management and allocation for dictionary-managed tablespaces. Use LMTs, or try to preallocate extents or at least make the next extent larger for problematic dictionary-managed tablespaces. HW enqueues are used with the high-water mark of a segment; manually allocating the extents can circumvent this wait. TX4s are the most common enqueue waits. TX4 enqueue waits are usually the result of one of three issues. The first issue is duplicates in a unique index; you need to commit/rollback to free the enqueue. The second is multiple updates to the same bitmap index fragment. Since a single bitmap fragment may contain multiple rowids, you need to issue a commit or rollback to free the enqueue when multiple users are trying to update the same fragment. The third and most likely issue is when multiple users are updating the same block. If there are no free ITL slots, a block-level lock could occur. You can easily avoid this scenario by increasing the initrans and/or maxtrans to allow multiple ITL slots and/or by increasing the pctfree on the table. Finally, TM enqueues occur during DML to prevent DDL to the affected object. If you have foreign keys, be sure to index them to avoid this general locking issue.
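Where a shortage of ITL slots on a specific segment is the cause, the
transaction slots can be raised with DDL of this form (a sketch; the object
names are hypothetical):
SQL> ALTER TABLE orders INITRANS 4 MAXTRANS 16;
SQL> ALTER INDEX orders_pk INITRANS 4;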
7. Log Buffer Space. This wait occurs because you are writing the log buffer faster than LGWR can write it to the redo logs, or because log switches are too slow. To address this problem, increase the size of the log files, or increase the size of the log buffer, or get faster disks to write to. You might even consider using solid-state disks, for their high speed.
8. Log File Switch. All commit requests are waiting for "logfile switch (archiving needed)" or "logfile switch (chkpt. Incomplete)." Ensure that the archive disk is not full or slow. DBWR may be too slow because of I/O. You may need to add more or larger redo logs, and you may potentially need to add database writers if the DBWR is the problem.
9. Log File Sync. When a user commits or rolls back data, the LGWR flushes the session's redo from the log buffer to the redo logs. The log file sync process must wait for this to successfully complete. To reduce wait events here, try to commit more records (try to commit a batch of 50 instead of one at a time, for example). Put redo logs on a faster disk, or alternate redo logs on different physical disks, to reduce the archiving effect on LGWR. Don't use RAID 5, since it is very slow for applications that write a lot; potentially consider using file system direct I/O or raw devices, which are very fast at writing information.
10. Idle Event.
There are several idle wait events listed after the output; you can ignore
them. Idle events are generally listed at the bottom of each section and
include such things as SQL*Net message to/from client and other background-related
timings. Idle events are listed in the stats$idle_event table.
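The full list of events that Statspack treats as idle can be reviewed in
that table (a minimal sketch; run as the PERFSTAT user):
SQL> SELECT event FROM stats$idle_event ORDER BY event;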
A guide to the detail sections of the Statspack report
Section(s) | What You Can Use the Section(s) for |
Wait Events | Look for excessive waits and wait times; drill down to specific problems |
SQL Ordered by Buffer Gets, Physical Reads, and Rows Processed | Figure out which SQL statements to tune |
Instance Activity Statistics | Compare with baseline report; compute additional statistics |
Tablespace and File I/O | Investigate I/O bottlenecks, identify files and tablespaces with heavy I/O |
Buffer Pool | Identify specific buffer pools with high contention or I/O |
Buffer Wait Statistics | Identify types of buffers with large number of buffer waits |
Enqueue Activity | Investigate specific lock types that are causing the most waits |
Rollback Segment Statistics and Storage | Investigate waits for rollback segment headers |
Latch Activity, Latch Sleep Breakdown, Latch Miss Sources | Identify latching bottlenecks and diagnose related problems |
Library Cache | Diagnose problems with shared pool |
Non-default init.ora | Look for unnecessary or problematic parameter definitions |
Wait Problem | Potential Fix |
DB File Scattered Read | Indicates many full table scans: tune the code; cache small tables. |
DB File Sequential Read | Indicates many index reads: tune the code (especially joins). |
Free Buffer | Increase the DB_CACHE_SIZE; shorten the checkpoint; tune the code. |
Buffer Busy | Segment header: add freelists or freelist groups. |
Buffer Busy | Data block: separate "hot" data; use reverse key indexes and/or smaller blocks. |
Buffer Busy | Data block: increase initrans and/or maxtrans. |
Buffer Busy | Undo header: add rollback segments or areas. |
Buffer Busy | Undo block: commit more often; use larger rollback segments or areas. |
Latch Free | Investigate the latch detail. |
Enqueue—ST | Use LMTs or preallocate large extents. |
Enqueue—HW | Preallocate extents above high-water mark. |
Enqueue—TX4 | Increase initrans and/or maxtrans on the table or index. |
Enqueue—TM | Index foreign keys; check application locking of tables. |
Log Buffer Space | Increase the log buffer; use faster disks for the redo logs. |
Log File Switch | Archive destination slow or full; add more or larger redo logs. |
Log File Sync | Commit more records at a time; use faster redo log disks or raw devices. |
Idle Event | Ignore it. |
Event | Idle Event Type |
Dispatcher timer | Shared server |
Lock manager wait for remote message | Oracle9i Real Application Clusters |
Pipe get | User process |
pmon timer | Background process |
PX Idle wait | Parallel query |
PX Deq Credit: need buffer | Parallel query |
PX Deq Credit: send blkd | Parallel query |
rdbms ipc message | Background process |
smon timer | Background process |
SQL*Net message from client | User process |
virtual circuit status | Shared server |