"Can’t you just…?"

I’ve got to say that those three words never have their intended effect.  They’re usually spoken by someone trying to tell someone else how to do their job.  And that’s never well received :-)

I usually hear these words when I’m asked to address a performance issue — often because all other ways appear to be more difficult or risky, even if they’re the right way to address the problem.

And oftentimes, the performance problem is 4 or 5 problems removed from the root cause — which makes it even worse.

Here’s an example based on an old chestnut — the case of using tables as rows.  Full credit to Worse Than Failure, written up as follows: http://worsethanfailure.com/Articles/Hacking_the_jProject.aspx

Looking at the system one day, I see it spending large amounts of CPU, while apparently accomplishing very little.  When I browse active sessions, I see a connection from a system monitoring role account, running the following query:

select nvl(sum(bytes), 0)
from sys.dba_free_space
where tablespace_name = 'XXXXX'

Ok, looks like a standard space monitoring query, so why is it taking so long?

I go and run the query myself — and yep, it takes a long time.  I try it for different tablespaces, but only one of them takes forever.

“Hmm…” (I say to myself), “…maybe the tablespace is super fragmented?  But why would a tablespace get super fragmented?  We almost never drop tables or shrink things.”
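One quick way to test that theory is to count the free extents instead of summing them (just a rough sketch, not part of the original diagnosis); a huge pile of small free extents would explain why the monitoring query has so much to chew on:

-- how chopped up is the free space in each tablespace?
select tablespace_name,
       count(*)   free_extents,
       min(bytes) smallest,
       max(bytes) largest
  from sys.dba_free_space
 group by tablespace_name
 order by count(*) desc;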

So, I go look at what’s in the tablespace using:

select *
from dba_segments
where tablespace_name = 'XXXXX'

And THAT query takes forever and a day.  I do a bit more digging and find out that there are 50,000 tables in that tablespace.  And I find a variation of the WTF article from above, complete with an infrastructure to “clean up” the tables by dropping old ones on a regular basis.
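A census along these lines is all it takes to see that (again just a sketch, with the same stand-in tablespace name):

-- what lives in the tablespace, and how much of it?
select segment_type, count(*)
  from dba_segments
 where tablespace_name = 'XXXXX'
 group by segment_type
 order by count(*) desc;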

What would you do in this situation?  Keep in mind that changing the code or design would be “super expensive” :-)

What’s the Frequency, Kenneth?

With apologies to R.E.M.

I often get teased at work because I have the following little graphic in my Sidebar:

It’s a thumbnail of a custom webpage I built that displays the hourly CPU graphs of the 5 main production database servers my team and I write code for.

I glance at it all day, just to make sure all of my “children” are behaving as I expect them to.  By basic patterns, I can see when one of them isn’t healthy — and then I start digging in to see why.

This isn’t rocket science — it’s probably what everyone does while driving in their car every day.  You look at the gas and temperature gauges to see if you need to do anything.  I’m amazed at people who don’t watch how their systems are doing — especially when it’s so easy.

Anyway, yesterday I saw one of the systems go to 80% CPU at a time when it normally runs between 40 and 60%.  No one reported any issues — there weren’t any reports of slow transactions or poor response times.  But whenever I see something out of the ordinary, I want to know why — and I know that if the system had gone to 95-100% CPU for any sustained period, we’d start to get calls.

I prefer to have an answer ready when I get those calls :-)

I made one mistake while diagnosing this problem — I chose the wrong endpoints and comparisons for the time window during which I thought the problem occurred, and that led me down a circuitous path instead of straight to the problem.  In this case I “saw” the CPU spike at 12 noon (actually, it appeared to start at 11am, rise through noon, and decline from noon to 1pm).  The primary activity in this system is servicing orders, so I compared order volume by hour with previous hours from 1pm back through 8am and didn’t see any appreciable difference in order counts from hour to hour.

(My mistake was in not comparing order volume to a “baseline” for the appropriate time period — i.e., Wednesdays between 8am and 1pm.  Had I done so initially, I would have seen that we had higher than normal order volume right away — but then I would not have been able to use the techniques I eventually developed :-).
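For what it’s worth, that baseline comparison can be done in a single pass.  Here’s a sketch against a hypothetical ORDERS table with an ORDER_DATE column (the real schema is obviously different); it puts today’s hourly order counts next to the average for the same weekday over the previous four weeks:

-- ORDERS / ORDER_DATE are hypothetical names
-- baseline = average of the four prior same-weekday counts for each hour
select to_char(o.order_date, 'HH24')                                             hr,
       sum(case when o.order_date >= trunc(sysdate) then 1 else 0 end)           today_orders,
       round(sum(case when o.order_date < trunc(sysdate) then 1 else 0 end) / 4) baseline_orders
  from orders o
 where o.order_date >= trunc(sysdate) - 28
   and to_char(o.order_date, 'DY') = to_char(sysdate, 'DY')
 group by to_char(o.order_date, 'HH24')
 order by hr;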

I ran a quick Statspack report for the 11am to 1pm window and didn’t see any inefficient SQL or ad-hoc queries being run, so I started looking at other system statistics to help me understand why the server got so busy.  In particular, I started manually querying the stats$sysstat view (it’s a 9i database) to see which things got higher :-)
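The manual digging boiled down to queries like the following sketch, which assumes the usual Statspack tables in the PERFSTAT schema, a single instance, and two snapshot ids that bracket the window of interest:

-- delta of every system statistic between two Statspack snapshots;
-- :begin_snap and :end_snap bracket the 11am to 1pm window
select e.name,
       e.value - b.value  delta
  from perfstat.stats$sysstat b,
       perfstat.stats$sysstat e
 where b.snap_id = :begin_snap
   and e.snap_id = :end_snap
   and b.name    = e.name
   and e.value   > b.value
 order by delta desc;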

Everything pointed at higher call volume — higher numbers of user calls, commits, parses, etc.  Session logical reads seemed steady, so it looked like a lot of calls that didn’t do anything — polling for work maybe?

So now I wanted to see which statements were being called at higher frequencies than “normal” — and here’s where the work gets… not hard, but tedious.  I worked through the various Statspack tables to capture hourly execution counts for all SQL statements (at least the ones that were captured) and built up execution profiles for Monday, Tuesday and Wednesday of this week and last week, roughly along the lines of the sketch below.  During my comparisons, I saw that stored procedures and statements related to order submission were being called more frequently.
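The queries behind the tedious part looked something like this sketch.  It assumes hourly Statspack snapshots with consecutive snap_ids, relies on the usual stats$snapshot and stats$sql_summary columns, and ignores the corner case where a cursor ages out and reloads between snapshots (which would make a delta go negative):

-- per-statement execution counts for each hourly snapshot interval
select to_char(sn.snap_time, 'DY HH24')    snap_hour,
       cur.hash_value,
       cur.executions - prv.executions     execs_in_hour
  from perfstat.stats$snapshot    sn,
       perfstat.stats$sql_summary cur,
       perfstat.stats$sql_summary prv
 where cur.snap_id    = sn.snap_id
   and prv.snap_id    = sn.snap_id - 1   -- assumes this is the previous hourly snapshot
   and prv.hash_value = cur.hash_value
   and cur.executions > prv.executions
 order by snap_hour, execs_in_hour desc;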

By then going back to the tables, I was able to see that we did indeed get more customer orders — not just during the 11am to 1pm window, but from 7am through 1pm.  The CPU spike was just the cumulative effect of the orders being processed as the backlog built up.

I have three points from this:

  1. Always watch your production systems — you really want to know about potential problems before you get the phone calls.
  2. Broaden your window of investigation as much as possible — that 12 noon problem had its genesis at 7am, not 11am.
  3. I wish there were an easy way to see which SQL has changed its call frequency compared to “normal” — and to be alerted when it happens.  I think OEM 10g does this, but only when you know the SQL to look for, and I miss the visualization techniques used by the old Savant product (which was bought by Precise, which was bought by Veritas, which was bought by Symantec) :-)  A rough sketch of the kind of comparison I mean follows this list.
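Here’s that rough sketch.  It assumes the hourly execution deltas from the earlier sketch have been stashed in a hypothetical SQL_HOURLY_EXECS table (week_start, day_hour, hash_value, execs), and it simply flags statements running more than twice as often as they did in the same hour last week:

-- SQL_HOURLY_EXECS is a hypothetical holding table for the per-hour deltas
select t.hash_value,
       t.execs   this_week,
       l.execs   last_week
  from sql_hourly_execs t,
       sql_hourly_execs l
 where t.day_hour   = l.day_hour
   and t.hash_value = l.hash_value
   and t.week_start = trunc(sysdate, 'IW')        -- this week
   and l.week_start = trunc(sysdate, 'IW') - 7    -- same hours last week
   and t.execs      > 2 * greatest(l.execs, 1)
 order by t.execs desc;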

Mean Mr. Modeler

With apologies to the Beatles. I read a post by Tom Kyte the other day about a data model that stored passwords in clear text, used them as unique keys on a table, and even used them as foreign keys! After reading it, I was reminded of a conversation I had with my wife, Wendy, about how she’s seeing a revival of interest in enterprise data modeling due to the SOA / BPEL fad surge ;-).

Despite some people’s opinions to the contrary, I’m less of a purist than I sound. I love data modeling, but I recognize that it can be an endless quest for unachievable perfection. I once worked on a project to fully model all of the financial reporting attributes for a Fortune 500 company — the modeling took a full year, at which point the model was already obsolete :-).

After that debacle, I’ve come to realize that data modeling (and the eventual goal of a data model, which is its representation in the physical implementation world) has a heavy dose of artistry along with the science of analysis and normalization. And I’m not sure how to teach the art — so much of how to model something includes the requirement to understand how the data will be accessed and processed.

A common problem I encounter is a data “model” produced by programmers who have no experience in the art. These are usually physical models that mimic flat files or maybe, just maybe, an advanced structure like a linked list or a tree. Most of the time they suffer greatly from poor attribute definition, a lack of any keys, and no appreciation for scalability or schema evolution.

In the end, those last 2 items appear to be the biggest problems. I usually handle them with 3 questions every time I’m presented with a physical model:

  1. How will this handle concurrency?
  2. What will it look like with 1 million entries? 10 million? 1 billion?
  3. How hard will it be to add attributes? Add dimensions (versions, history, etc.)?

An Answer to #2

I’m not really satisfied with this answer, but so far this is the best I can do.

  1. It attempts to walk the path from top to bottom, starting at the root
  2. It attempts to validate each step on the path, but it does it pretty crudely in my opinion
  3. At least the code is pretty easy to understand

select account_id
  from (select account_id,
               sys_connect_by_path(a.account_name, '.') account_path
          from account a
         start with a.account_name =
                    substr('sys.all_apply_progress.apply_name', 1,
                           instr('sys.all_apply_progress.apply_name', '.') - 1)
               and a.parent_account_id is null
       connect by prior a.account_id = a.parent_account_id
              and instr('.' || 'sys.all_apply_progress.apply_name' || '.',
                        '.' || a.account_name || '.') > 0)
 where account_path = '.' || 'sys.all_apply_progress.apply_name';

ACCOUNT_ID
----------
     10100
Execution Plan
----------------------------------------------------------
SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=2015)
  VIEW (Cost=1 Card=1 Bytes=2015)
    CONNECT BY (WITH FILTERING)
      NESTED LOOPS
        TABLE ACCESS (BY INDEX ROWID) OF 'ACCOUNT' (Cost=2 Card=1 Bytes=15)
          INDEX (RANGE SCAN) OF 'ACCOUNT_IK_NAME' (NON-UNIQUE) (Cost=1 Card=4)
        TABLE ACCESS (BY USER ROWID) OF 'ACCOUNT'
      NESTED LOOPS
        BUFFER (SORT)
          CONNECT BY PUMP
        TABLE ACCESS (BY INDEX ROWID) OF 'ACCOUNT' (Cost=1 Card=1 Bytes=19)
          INDEX (RANGE SCAN) OF 'ACCOUNT_FK_PARENT' (NON-UNIQUE) (Cost=1 Card=12)
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         24  consistent gets
          0  physical reads
          0  redo size
        213  bytes sent via SQL*Net to client
        234  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          7  sorts (memory)
          0  sorts (disk)
          1  rows processed

Answer #1

This answer comes from one of my staff, Jun Feng:

Good things about this answer:

  1. The code is simple
  2. It uses the fastest way to assemble the correct account path (starting at the child and working “backwards” up the tree to the ultimate parent)
  3. Realizes that the account_path we want is the one where the parent_account_id is null
  4. Uses the reverse() function — I know it’s undocumented, but boy is it cool!
  5. And heck, 9 consistent gets, 7 sorts and 0 recursive calls is pretty good too.

set autotrace on

select rtrim(reverse(sys_connect_by_path(reverse(account_name), '.')), '.') account_path
  from account
 where parent_account_id is null
 start with account_id = 10100
connect by account_id = prior parent_account_id;

ACCOUNT_PATH
--------------------------------------------------------------------------------
sys.all_apply_progress.apply_name

Execution Plan
----------------------------------------------------------
SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=19)
  FILTER
    CONNECT BY (WITH FILTERING)
      NESTED LOOPS
        INDEX (UNIQUE SCAN) OF 'ACCOUNT_PK' (UNIQUE) (Cost=1 Card=1 Bytes=5)
        TABLE ACCESS (BY USER ROWID) OF 'ACCOUNT'
      NESTED LOOPS
        BUFFER (SORT)
          CONNECT BY PUMP
        TABLE ACCESS (BY INDEX ROWID) OF 'ACCOUNT' (Cost=1 Card=1 Bytes=19)
          INDEX (UNIQUE SCAN) OF 'ACCOUNT_PK' (UNIQUE) (Cost=1 Card=1)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          9  consistent gets
          0  physical reads
          0  redo size
        251  bytes sent via SQL*Net to client
        234  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          7  sorts (memory)
          0  sorts (disk)
          1  rows processed