What’s in a name?

I’d like to commend another blog to you for your reading pleasure.  Full of good content and written by someone with an illustrious forename.

Speaking of that forename (which I’m fortunate enough to share), I found a cool site which shows the historical popularity of a given forename in the United States.  Looking up my name I saw the following graph:


(I know, it’s hard to read — X axis is in years from 1880 to the present, Y axis is popularity).

(One of my interests is in data visualization, and so I really enjoyed playing with the software on that site). 

It confirms my suspicion that my name has become more popular lately — as I now hear mothers in loud, scolding voices at various public places shouting: “Dominic, don’t touch that!”.

Perhaps I should get used to it :-)

Whither catproc?

I had an interesting conversation yesterday with one of my folks who was attempting to install APEX into an Oracle 9i database (you know who you are). :-)

One thing I bet Oracle is starting to have to deal with is the fact that a lot of existing databases began life as a 7.3 or 8.1 system and have been upgraded and patched up into current versions — with the end result that they’re probably a missing a few things.  We were among many who encountered the issue of upgrading a database that “was originally” created as 32-bit into a 64-bit version.  I gather that the outcry over that was pretty loud.

Anyway, my friend’s installation of APEX into 9i ran into a problem that DBMS_XMLDOM was missing and came to me for help after scouring Metalink and the Internet to little or no avail.

So I got to show off my spelunking skills by digging through the scripts in rdbms/admin.

A long time ago, we all used to create databases from scratch using scripts, so we got used to running catalog.sql and then catproc.sql (to install the PL/SQL option), and then other scripts like catrep.sql (to install the replication option).  I used to tell new Oracle consultants that after attending classes and reading manuals, the best way to learn about the database was to read through sql.bsq and catalog.sql.

I started off by grep-ing for DBMS_XMLDOM in all of the .sql files in rdbms/admin, just to make sure we had some file somewhere which would re-create it if we needed it.  Found it in dbmsxmld.sql (a pretty obvious name for it).  I then grep-ed for dbmsxmld.sql to see what called it and came up with catxdbst.sql.  The file names made sense to me, as the dbms file was the base package create and the cat file was the caller — usually catalog.sql and catproc.sql called other cat files to install their stuff.

I kept going just to see where I’d end up and grep-ed for catxdbst.sql and found it called from catqm.sql.  Still no catalog.sql or catproc.sql, but I’m nothing if not persistent, so I kept going and found this in catproc.sql:

Rem XDB Schema creation
Rem Comment this out until catqm.sql is finalized.
Rem @@catqm.sql

I guess it never got finalized :-)

At this point it was pretty clear that we had the necessary code, it was probably just a matter of properly installing it.  The “cowboy” in me wanted to fire up sqlplus as sysdba and just run catqm.sql — but since other people actually use this database, I suggested we open a TAR (ooops!  I mean an SR), to have support walk us through the steps.

Later, out of curiousity, I wandered through the Oracle 9i doc set to see how to manually install XDB  / XMLDB.  A search on DBMS_XMLDOM listed the XML Developer’s Guide with 16 matches, so I figured it was in there somewhere.  What’s interesting to me is that the installation instructions are in Appendix A.  26 chapters on how to use it, but the installation is an afterthought.  I know, it’s automatically installed, so no need to have that up front.  But there was enough complexity in the installation section that I’m glad I suggested we open the TAR on the installation.

"Share and Enjoy"

Clearly, not enough people have read the Hitchhiker’s Guide to the Galaxy.

We’ve recently installed a new coffee machine here at work, and it immediately reminded me of the Nutrimatic Drink Dispenser.

I posted the quote from above on the wall near the machine, but it just appears to be confusing people.  Apparently not enough of them have contacted the Sirius Cybernetics Corporation to voice their opinion.

We now return you to your regularly scheduled blog programming.

"Can’t you just…?"

I’ve got to say that those three words never have their intended effect.  They’re usually spoken by someone trying to tell someone else how to do their job.  And that’s never well received :-)

I usually hear these words when I’m asked to address a performance issue — often because all other ways appear to be more difficult or risky, even if they’re the right way to address the problem.

And often times, the performance problem is 4 or 5 problems away from the root of the problem — and that’s even worse.

Here’s an example based on an old chestnut — the case of using tables as rows.  Full credit to Worse Than Failure, written up as follows: http://worsethanfailure.com/Articles/Hacking_the_jProject.aspx

Looking at the system one day, I see it spending large amounts of CPU, while apparently accomplishing very little.  When I browse active sessions, I see a connection from a system monitoring role account, running the following query:

select nvl (sum (bytes), 0)
from sys.dba_free_space
where tablespace_name = ‘XXXXX’

Ok, looks like a standard space monitoring query, so why is it taking so long?

I go an run the query myself — and yep, it takes a long time.  I try it for different tablespaces, but only one of them takes forever.

“Hmm…” (I say to myself), “…maybe the tablespace is super fragmented?  But why would a tablespace get super fragmented?  We almost never drop tables or shrink things.”

So, I go look at what’s in the tablespace using:

select *
from dba_segments
where tablespace_name = ‘XXXXX’

And THAT query takes forever and a day.  Do a bit more digging and I find out that there are 50,000 tables in that tablespace.  And I find a variation of the WTF article from above, complete with an infrastructure to “clean up” the tables by dropping old ones on a regular basis.

What would you do in this situation?  Keep in mind that changing the code or design would be “super expensive” :-)

What’s the Frequency, Kenneth?

With apologies to R.E.M.

I often get teased at work because I have the following little graphic in my Sidebar:

It’s a thumbnail of a custom webpage I built that displays the hourly cpu graphs of the 5 main production database servers my team and I write code for.

I glance at it all day, just to make sure all of my “children” are behaving as I expect them to.  By basic patterns, I can see when one of them isn’t healthy — and then I start digging in to see why.

This isn’t rocket science — it’s probably what everyone does while driving in their car every day.  You look at the gas and temperature gauges to see if you need to do anything.  I’m amazed at people who don’t watch how their systems are doing — especially when it’s so easy.

Anyway, yesterday I saw one of the systems go to 80% cpu at a time when it normally runs between 40 and 60%.  No one reported any issues — there weren’t any reports of slow transactions or poor response times.  But whenever I see something out of the ordinary, I want to know why — and, I know that if the system had gone to 95-100% cpu for any sustained period, we’d start to get calls.

I prefer to have an answer ready when I get those calls :-)

I made one mistake while diagnosing this problem — I chose invalid endpoints and comparisons when looking at the time window during which I thought the problem occurred — this led me down a circuitous path instead of directly at the problem.  In this case I “saw” the cpu spike at 12 noon (actually, it appeared like it started at 11am, rose through noon, and declined from noon to 1pm).  In this system the primary activity is servicing orders, so I compared order volume by hour with previous hours from 1pm back through 8am and didn’t see any appreciable difference in order counts from hour to hour.

(My mistake was in not comparing order volume to a “baseline” for the appropriate time period — i.e., Wednesdays between 8am and 1pm.  Had I done so initially, I would have seen that we had higher than normal order volume right away — but then I would not have been able to use the techniques I eventually developed :-).

I ran a quick statspack report for the 11am to 1pm time window and didn’t see any instance of inefficient SQL being run or ad-hoc queries, so I started looking at other system statistics to help me understand why the server got so busy.  In particular, I started manually querying the stats$sysstat view (it’s a 9i database) to see what things got higher :-)

Everything pointed at higher call volume — higher numbers of user calls, commits, parses, etc.  Session logical reads seemed steady, so it looked like a lot of calls that didn’t do anything — polling for work maybe?

So now I wanted to see which statements were being called at higher frequencies than “normal” — and here’s where the work gets…..not hard, but tedious.  I worked through the various stats tables to capture hourly execution counts for all SQL statements (at least the ones captured) and I built up execution profiles for Monday, Tuesday and Wednesday of this week and last week.  During my comparisons, I saw that stored procedures and statements related to order submission were being called more frequently.

By then going back to the tables, I was able to see that we did indeed get more customer orders — not just during the 11am to 1pm window, but from 7am through 1pm.  The cpu spike was just the cumulative affect of the orders being processed as the backlog built up.

I have three points from this:

  1. Always watch your production systems — you really want to know about potential problems before you get the phone calls.
  2. Broaden your window of investigation as much as possible — that 12 noon problem had its genesis at 7am, not 11am.
  3. I wish there was an easy way to see which SQL has changed its call frequency compared to “normal” — and be alerted when it happens.  I think OEM 10g does this,, but only when you know the SQL to look for, and I miss the visualization techniques used by the old Savant product (which was bought by Precise, which was bought by Veritas, which was bought by Symantec) :-)