With apologies to R.E.M.
I often get teased at work because I have the following little graphic in my Sidebar:
It’s a thumbnail of a custom webpage I built that displays the hourly cpu graphs of the 5 main production database servers my team and I write code for.
I glance at it all day, just to make sure all of my “children” are behaving as I expect them to. By basic patterns, I can see when one of them isn’t healthy — and then I start digging in to see why.
This isn’t rocket science — it’s probably what everyone does while driving in their car every day. You look at the gas and temperature gauges to see if you need to do anything. I’m amazed at people who don’t watch how their systems are doing — especially when it’s so easy.
Anyway, yesterday I saw one of the systems go to 80% cpu at a time when it normally runs between 40 and 60%. No one reported any issues — there weren’t any reports of slow transactions or poor response times. But whenever I see something out of the ordinary, I want to know why — and, I know that if the system had gone to 95-100% cpu for any sustained period, we’d start to get calls.
I prefer to have an answer ready when I get those calls
I made one mistake while diagnosing this problem — I chose invalid endpoints and comparisons when looking at the time window during which I thought the problem occurred — this led me down a circuitous path instead of directly at the problem. In this case I “saw” the cpu spike at 12 noon (actually, it appeared like it started at 11am, rose through noon, and declined from noon to 1pm). In this system the primary activity is servicing orders, so I compared order volume by hour with previous hours from 1pm back through 8am and didn’t see any appreciable difference in order counts from hour to hour.
(My mistake was in not comparing order volume to a “baseline” for the appropriate time period — i.e., Wednesdays between 8am and 1pm. Had I done so initially, I would have seen that we had higher than normal order volume right away — but then I would not have been able to use the techniques I eventually developed
.
I ran a quick statspack report for the 11am to 1pm time window and didn’t see any instance of inefficient SQL being run or ad-hoc queries, so I started looking at other system statistics to help me understand why the server got so busy. In particular, I started manually querying the stats$sysstat view (it’s a 9i database) to see what things got higher
Everything pointed at higher call volume — higher numbers of user calls, commits, parses, etc. Session logical reads seemed steady, so it looked like a lot of calls that didn’t do anything — polling for work maybe?
So now I wanted to see which statements were being called at higher frequencies than “normal” — and here’s where the work gets…..not hard, but tedious. I worked through the various stats tables to capture hourly execution counts for all SQL statements (at least the ones captured) and I built up execution profiles for Monday, Tuesday and Wednesday of this week and last week. During my comparisons, I saw that stored procedures and statements related to order submission were being called more frequently.
By then going back to the tables, I was able to see that we did indeed get more customer orders — not just during the 11am to 1pm window, but from 7am through 1pm. The cpu spike was just the cumulative affect of the orders being processed as the backlog built up.
I have three points from this:
- Always watch your production systems — you really want to know about potential problems before you get the phone calls.
- Broaden your window of investigation as much as possible — that 12 noon problem had its genesis at 7am, not 11am.
- I wish there was an easy way to see which SQL has changed its call frequency compared to “normal” — and be alerted when it happens. I think OEM 10g does this,, but only when you know the SQL to look for, and I miss the visualization techniques used by the old Savant product (which was bought by Precise, which was bought by Veritas, which was bought by Symantec)