Injection Nation

I’m somewhat surprised to see a lack of Oracle blogging reaction to the recent post on The Daily WTF which goes into great detail on a case of SQL injection.  Maybe we’ve either become tired of it or we assume that “my systems don’t do that!”.

So, how do you audit or track if your system is being hit by injection?  How would you detect it?  Assume you’re “just a DBA” — and no one tells you about applications being deployed that talk to the database.  Is there a way you could tell just by looking from within the database?  What kind of assumptions would you make?

The Value of Information

There has been an interesting and somewhat heated discussion going on about a recent blog post by Dominic Brooks and referenced by Doug Burns about the relative value of data vs. applications.  Actually, most of the heat seems to be directed at a comment made by Tim Gorman on several mailing lists in which he states that:

Data, not programs, is the only thing that matters — applications are transient and have no value except to acquire, manipulate, and display data. Data is the only thing with value.

I’ve deliberately taken the quote out of context — for that is how it’s being reacted to, fairly or unfairly on Doug Burns’ blog entry.

I’m not actually going to add any fuel to that fire, only offer up some observations.  I think I agree with many who are stating that data that lies about, unexploited by any application, is a pretty useless waste of storage.  That the true value of data comes from an ability to use it through an application which allows one to analyze, manipulate and visualize information synthesized from the data soup.  One reason I’m excited about the new company I’m with is its focus on helping people increase their ability to exploit their data.

To that end, one of my burning interests is in the ease of which the average employee has access to data and the means to create value out of it.  This includes data accessibility combined with compliance controls as well as tools and applications which allow the employee to tease ideas out of the data.  I wish Excel was a better data manipulation and analysis tool, since it’s so ubiquitous.  But my real concern is my perception that the language of data access has been kicked into a corner, shunned by end users and application programmers alike.  I find the lack of SQL knowledge and use appalling in most of the technologists I’ve encountered.  And that’s a real shame — for SQL’s ability to make data accessible I find second to none.  I have an idea about why SQL ability is failing, and I think it goes back to its original development.  The following is from a fascinating interview at McJones titled: The 1995 SQL Reunion: People, Projects, and Politics

Don Chamberlin: So what this language group wanted to do when we first got organized: we had started from this background of SQUARE, but we weren’t very satisfied with it for several reasons. First of all, you couldn’t type it on a keyboard because it had a lot of funny subscripts in it. So we began saying we’ll adapt the SQUARE ideas to a more English keyword approach which is easier to type, because it was based on English structures. We called it Structured English Query Language and used the acronym SEQUEL for it. And we got to working on building a SEQUEL prototype on top of Raymond Lorie’s access method called XRM.

At the time, we wanted to find out if this syntax was good for anything or not, so we had a linguist on our staff, for reasons that are kind of obscure. Her name was Phyllis Reisner, and what she liked to do was human-factors experiments. So she went down to San Jose State and recruited a bunch of San Jose State students to teach them the SEQUEL language and see if they could learn it. She did this for several months and wrote a paper about it, and gained recognition in the human-factors community for her work.[30], 31 I’m not sure if the results were very conclusive; it turned out that sure enough if you worked hard enough, you could teach SEQUEL to college students. [laughter] Most of the mistakes they made didn’t really have anything to do with syntax. They made lots of mistakes - they wouldn’t capitalize correctly, and things like that.

Looking back on it, I don’t think the problem we thought we were solving was where we had the most impact. What we thought we were doing was making it possible for non-programmers to interact with databases. We thought that this was going to open up access to data to a whole new class of people who could do things that were never possible before because they didn’t know how to program. This was before the days of graphical user interfaces which ultimately did make that sort of a revolution, and we didn’t know anything about that, and so I don’t think we impacted the world as much as we hoped we were going to in terms of making data accessible to non-programmers. It kind of took Apple to do that. The problem that we didn’t think we were working on at all - at least, we didn’t pay any attention to it - was how to embed query languages into host languages, or how to make a language that would serve as an interchange medium between different systems - those are the ways in which SQL ultimately turned out to be very successful, rather than as an end-user language for ad hoc users. So I think the problem that we solved wasn’t really the problem that we thought we were solving at the time.

Anyway, we were working on this language, and we adapted it from SQUARE and turned it into English and then we started adding a bunch of things to it like GROUP BY that didn’t really come out of the SQUARE heritage at all. So you couldn’t really say it had much to do with SQUARE before we were done. Ray and I wrote some papers about this language in 1974. We wrote two papers: one on SEQUEL/DML[32] and one on SEQUEL/DDL[33]. We were cooperating very closely on this. The DML paper’s authors were Chamberlin and Boyce; the DDL paper’s authors were Boyce and Chamberlin, for no special reason; we just sort of split it up. We wanted to go to Stockholm that year because it was the year of the IFIP Congress in Stockholm. I had a ticket to Stockholm because of some work I’d done in Yorktown, so Ray submitted the DDL paper to the IFIP Congress in Stockholm, and the DML paper we submitted to SIGMOD. This is the cover page of the SEQUEL/DML paper. It was 24 pages long. These were twin papers in our original estimation. We wrote them together and thought they were of comparable value and impact. But what happened to them was quite different. The DDL paper got rejected by the IFIP Congress; Ray didn’t get to go to Stockholm. I still have that paper in my drawer; it’s never been published. The DML paper did get accepted at SIGMOD. Several years later I got a call from a guy named Larry Ellison who’d read that paper; he basically used some of the ideas from that paper to good advantage. [laughter] The latest incarnation of these ideas is longer than 24 pages long; it’s the ISO standard for the SQL language, which was just described last week at SIGMOD by Nelson Mattos[34]. It’s now about 1600 pages.

It’s from this quote that I believe SQL gained its second-class status — it’s not for programmers, but it’s “too complicated” for end-users who became used to graphically interacting with applications.

Do you have someone on staff who really knows SQL?  Who can make the data super easily accessible to application programmers and end-users alike?  Who removes the barrier and lowers the hurdle in the way of turning data into value?  You’re probably gathering more and more relational data every day — and probably shredding your XML and storing your BLOBs there too.  I’m not saying that SQL is more important than data or the means to analyze it — I am saying that experts at SQL can make your databases perform better AND make it easier for your application people to focus on delivering that data to the people who want to use it.  Don’t put it in the limbo land of being not for programmers and not for end-users.

Update:  I wanted to give credit to the source of my quote:

Copyright (c) 1995, 1997 by Paul McJones, Roger Bamford, Mike Blasgen, Don Chamberlin, Josephine Cheng, Jean-Jacques Daudenarde, Shel Finkelstein, Jim Gray, Bob Jolls, Bruce Lindsay, Raymond Lorie, Jim Mehl, Roger Miller, C. Mohan, John Nauman, Mike Pong, Tom Price, Franco Putzolu, Mario Schkolnick, Bob Selinger, Pat Selinger, Don Slutz, Irv Traiger, Brad Wade, and Bob Yost. You may copy this document in whole or in part without payment of fee provided that you acknowledge the authors and include this notice.

Free Data Now!

So this post is only tangentially related to the current online effort to get access to Oracle 10g’s AWR/ASH data.  It’s actually more about data mashups.

When building an internal corporate application, most of the time you’re familiar with the internal sources of data and how to get at them — although, you can often have lots of “fun” doing the detective work to uncover hidden sources of internal data, fighting the data guardians and mastering the necessary API incantations to pry loose the secrets. :-)

By basing your application on only internal data you may be missing out on interesting opportunities to leverage public and freely available data and feeds.  Although, even there you need to be wary of usage restrictions and odd APIs.  Usage restrictions can be odd — unlimited access for “non-profit use”, or “public use”, but heavy costs and fines for “private” use.  An interesting example is Google Maps — here’s the quote from the Google Maps API page:

The Maps API is a free beta service, available for any web site that is free to consumers. Please see the terms of use for more information.

To use the Maps API on an intranet or in a non-publicly accessible application, please check out Google Maps for Enterprise.

Following the links shows you that to use Google Maps on an internal application will set you back $10,000 per year.

The question of usage rights for data has been a huge source of problems in professional sports leagues.  In 1997, the NBA sued Motorola and STATS, Inc. regarding the real-time dissemination of scores and game information and lost.  However, recently the NCAA clamped down on live-blogging at one of its events.  Even today, you’ll have a hard time finding real-time RSS feeds for MLB sports scores.  Although, those of you who love data analysis may enjoy poring over the MLB Enhanced GameCast data — latest example here and initial research here.

So, clearly you’ll need to be selective when working with external data — there’s also the questions of data reliability and validity.  However, the sheer plethora of public RSS feeds should be a new and interesting source of data considered when building data analysis applications.

I was working on this post last week when the 10g AWR/ASH petition came online here.  I’m of two minds about the issue.  In the end, I think limiting access to this data isn’t going to achieve what I believe to be Oracle’s goal of encouraging sales of Enterprise Manager Diagnostic Pack.  I’ve tried to evaluate the Diagnostic Pack — it’s not easy to install, and not easy to navigate.  Go ahead, try and find the links on oracle.com to the download and/or documentation.  (I tried this for the Change Management pack, and it was just about impossible to find).  Compare that with how easy it is to find and download the database, client tools and/or SQL Developer.  In the end, I think you’ll see people trying to re-create STATSPACK for 10g in a “public” fashion rather than running out and installing Enterprise Manager.  So, you’ll have a lot of time and effort spent in lobbying Oracle for access to this data, and/or effort spent on re-creating the data capture and storage in a public way — instead of inventing useful analysis tools based on the data that’s already there.

Ah well…

I know why it’s always the database

I got to this post via Elisa Gabbert and Jeff Hunter after doing my daily reading which started with Jeremy Schneider’s excellent post on a recent healthcheck.

Jeff sums it up best in wondering why, when an application is a complex combination of system components, is the database always blamed for poor performance?

I’ll tell you why.

People ask for data. And they don’t like the first set of data they get, so they ask again, and then they browse back and forth through the data.

And where is the data? In the database.

People don’t ask the network, or the disk, or the CPU or the memory for their data. And despite the amount of Javascript, PHP and Java code in their browsers and middle-tiers, they don’t ask their browser or application server either.

The database is responsible for serving data quickly. And since you’re the DBA, when the data doesn’t come back quickly enough, you’re asked to “look at the database”.

BTW, this is one of the reasons I’m a huge fan of constant monitoring — I like to see if the database is acting differently from when “everything was ok”. This requires some kind of historical view or “baseline”. Some way to look back in time to yesterday, or last week at this time, or last month during the monthly close, etc. If the profile during the “slow response” is the same as it was during the “fast response”, generally there’s nothing wrong with the database overall — there may be something going on with that particular request, but you know how to fix those.

And, unfortunately, the database is ridiculously well instrumented, and you’ve convinced everybody that you can quickly see everything going on in the database. Oracle’s event model has really no counterpart on the other parts of the application stack.

Check out Jeremy’s post again — look at how “easy” it was to see that AIX was hopelessly mis-configured and was using a ton of swap space.

I’ve never really liked the idea of swap space — Why would you ever want anything at all swapped out of memory on a database server? Do you monitor memory usage on your PC? What happens when you start using swap space? You go to the Kingston web site and purchase more memory, that’s what. I used to get annoyed at H/W O/S vendors when I’d ask why we needed so much swap. My favorite answers:

  1. Just in case
  2. We’ll crash / panic otherwise
  3. We “pre-swap” code text images to speed up swapping
  4. The system becomes more efficient during non-peak usage as it can handle more processes then as long as they’re idle

This is nuts.

Another fun excuse I’ve heard is when there is significant data caching at the middle tier and web tier — “it takes a long time for the database to give us data to fully populate the cache, so queries are slow”.

This is why I advocate SIMPLE architectures for data-intensive applications. Fewer non-database places where things can go wrong. Fewer times that the DBA gets asked about “why isn’t my data coming back fast enough” and has to spend time finding out which part of the application stack isn’t pulling its weight.

"Can’t you just…?"

I’ve got to say that those three words never have their intended effect.  They’re usually spoken by someone trying to tell someone else how to do their job.  And that’s never well received :-)

I usually hear these words when I’m asked to address a performance issue — often because all other ways appear to be more difficult or risky, even if they’re the right way to address the problem.

And often times, the performance problem is 4 or 5 problems away from the root of the problem — and that’s even worse.

Here’s an example based on an old chestnut — the case of using tables as rows.  Full credit to Worse Than Failure, written up as follows: http://worsethanfailure.com/Articles/Hacking_the_jProject.aspx

Looking at the system one day, I see it spending large amounts of CPU, while apparently accomplishing very little.  When I browse active sessions, I see a connection from a system monitoring role account, running the following query:

select nvl (sum (bytes), 0)
from sys.dba_free_space
where tablespace_name = ‘XXXXX’

Ok, looks like a standard space monitoring query, so why is it taking so long?

I go an run the query myself — and yep, it takes a long time.  I try it for different tablespaces, but only one of them takes forever.

“Hmm…” (I say to myself), “…maybe the tablespace is super fragmented?  But why would a tablespace get super fragmented?  We almost never drop tables or shrink things.”

So, I go look at what’s in the tablespace using:

select *
from dba_segments
where tablespace_name = ‘XXXXX’

And THAT query takes forever and a day.  Do a bit more digging and I find out that there are 50,000 tables in that tablespace.  And I find a variation of the WTF article from above, complete with an infrastructure to “clean up” the tables by dropping old ones on a regular basis.

What would you do in this situation?  Keep in mind that changing the code or design would be “super expensive” :-)