Saturday 21 September 2013

What is WIO in Sar reports

When analysing sar data, one of the values that causes a lot of confusion is wio (wait for IO).  It is counted in the CPU reports as part of the CPU time, but is it really used CPU time?

If you look at the picture below you can see that even KSAR (http://sourceforge.net/projects/ksar/) makes it red in its graphs:


If you look closely, you can see that the red part is bigger at times when the blue and green parts are smaller.  This is a first indication that the red part is not literally a problem.

When a process has to wait for IO, it is shifted off the CPU.  If the CPU has nothing else to run at that moment (if it has some CPU time to spare), it simply sits idle until the IO completes, so the process can be handled immediately.  That idle time, spent while the process waits for IO, is the red part.

However, if the CPU is busier and has other work to do, it leaves the first process waiting in a queue and serves other processes in the meantime.  That time is then reported as user or system time instead of wio, which is why the red part shrinks when the blue and green parts grow.

So, we can say that the WIO is significant, but it is not real "busy" time.
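As a rough illustration of this reading of the numbers, here is a minimal Python sketch that splits one line of CPU data into "really busy" time and time that is still available.  It assumes the column order time, %usr, %sys, %wio, %idle; the function name and the sample line are made up for the example, so check the header of your own sar output first.

```python
def summarize_cpu(sar_line):
    """Split one sar CPU data line into busy and available time.

    Assumed column layout (hypothetical sample, verify against your
    own sar header): time %usr %sys %wio %idle
    """
    fields = sar_line.split()
    usr, sys_pct, wio, idle = (float(value) for value in fields[1:5])
    return {
        # Time the CPU was actually executing work.
        "busy": usr + sys_pct,
        # Time the CPU was idle with IO outstanding.
        "waiting_on_io": wio,
        # Time the CPU could have spent on other processes.
        "available": wio + idle,
    }


if __name__ == "__main__":
    print(summarize_cpu("10:00:00 20 10 30 40"))
```

A high "waiting_on_io" value still says something about the IO subsystem, but only "busy" tells you how much CPU is really in use.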

SCHED_NOAGE CPU problems with Oracle log writer on HPUX

One of Oracle's recommendations for running Oracle on HPUX is to use the SCHED_NOAGE CPU scheduler:

http://docs.oracle.com/cd/B28359_01/server.111/b32009/appb_hpux.htm

This replaces the standard HPUX CPU scheduler, which changes the priority of processes during their lifetime.  One of the reasons for this recommendation is to avoid "cursor: pin S" problems when CPU peaks are common:

http://srivenukadiyala.wordpress.com/2012/01/30/sched_noage-and-latch-contention/

By changing the CPU scheduling policy, we introduced another problem.  One of the characteristics of SCHED_NOAGE is that every process is started with the same priority.  By default, Oracle recommends using 178, which is the highest priority (the lower the number, the higher the priority) you can give to a process in SCHED_NOAGE.  The lower numbers are reserved for other scheduling policies (like realtime scheduling).

The problem we encountered is that at high loads, the log writer has to wait for other processes to complete.  Processes that are waiting for a commit must wait for the log writer to receive CPU time.  This can cause heavy delays in the applications waiting for these commits.

To solve this, we changed the priority of the log writer process (lgwr_DBNAME) so it would have priority over all the other processes.  

This reduced the commit time tremendously, and application performance improved accordingly.

The ideal way to give more CPU time to the log writer process is to change the default priority of all the Oracle processes to a lower-priority value, for example 180.  After the DB is started, the priority of the log writer is then changed to 178.
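The priority change described above could be scripted roughly as follows.  This is only a sketch under assumptions: it expects "ps -ef" style output, matches any process whose name ends in lgwr_DBNAME (the "ora_" prefix and the sample ps lines in the usage part are made up), and builds an rtsched command line in the form I understand from the HPUX rtsched(1) man page (-s scheduler, -p priority, -P pid).  Verify the flags on your own system before running anything; the code only builds the command, it does not execute it.

```python
def find_lgwr_pid(ps_output, db_name):
    """Return the PID of the log writer for db_name, or None if not found.

    Assumes 'ps -ef' style output where the second column is the PID.
    """
    target = "lgwr_" + db_name
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header and malformed lines: the PID column must be numeric.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if any(field.endswith(target) for field in fields[2:]):
            return int(fields[1])
    return None


def rtsched_command(pid, priority=178):
    """Build (but do not run) the command that re-prioritises one process.

    Flag layout is an assumption based on rtsched(1); check it locally.
    """
    return ["rtsched", "-s", "SCHED_NOAGE", "-p", str(priority), "-P", str(pid)]


if __name__ == "__main__":
    # Hypothetical ps output for a database called PROD.
    sample = (
        "     UID   PID  PPID  C    STIME TTY       TIME COMMAND\n"
        "  oracle  4321     1  0  Sep 20  ?         0:42 ora_lgwr_PROD\n"
        "  oracle  4322     1  0  Sep 20  ?         0:10 ora_dbw0_PROD\n"
    )
    pid = find_lgwr_pid(sample, "PROD")
    if pid is not None:
        print(" ".join(rtsched_command(pid)))
```

Such a script would have to run after every database start, because the log writer starts with the default priority again each time.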

The only problem is that Oracle support recommended against it, but was unable to supply an alternative.