The Value of Information

There has been an interesting and somewhat heated discussion going on about a recent blog post by Dominic Brooks and referenced by Doug Burns about the relative value of data vs. applications.  Actually, most of the heat seems to be directed at a comment made by Tim Gorman on several mailing lists in which he states that:

Data, not programs, is the only thing that matters — applications are transient and have no value except to acquire, manipulate, and display data. Data is the only thing with value.

I’ve deliberately taken the quote out of context, for that is how it’s being reacted to, fairly or unfairly, on Doug Burns’ blog entry.

I’m not actually going to add any fuel to that fire, only offer up some observations.  I agree with the many who say that data lying about, unexploited by any application, is a pretty useless waste of storage, and that the true value of data comes from the ability to use it through an application that lets one analyze, manipulate and visualize information synthesized from the data soup.  One reason I’m excited about the new company I’m with is its focus on helping people increase their ability to exploit their data.

To that end, one of my burning interests is the ease with which the average employee can access data and create value out of it.  This includes data accessibility combined with compliance controls, as well as tools and applications that let the employee tease ideas out of the data.  I wish Excel were a better data manipulation and analysis tool, since it’s so ubiquitous.  But my real concern is my perception that the language of data access has been kicked into a corner, shunned by end users and application programmers alike.  I find the lack of SQL knowledge and use appalling in most of the technologists I’ve encountered.  And that’s a real shame, for I find SQL’s ability to make data accessible second to none.  I have an idea about why SQL ability is failing, and I think it goes back to its original development.  The following is from a fascinating interview at McJones titled “The 1995 SQL Reunion: People, Projects, and Politics”:

Don Chamberlin: So what this language group wanted to do when we first got organized: we had started from this background of SQUARE, but we weren’t very satisfied with it for several reasons. First of all, you couldn’t type it on a keyboard because it had a lot of funny subscripts in it. So we began saying we’ll adapt the SQUARE ideas to a more English keyword approach which is easier to type, because it was based on English structures. We called it Structured English Query Language and used the acronym SEQUEL for it. And we got to working on building a SEQUEL prototype on top of Raymond Lorie’s access method called XRM.

At the time, we wanted to find out if this syntax was good for anything or not, so we had a linguist on our staff, for reasons that are kind of obscure. Her name was Phyllis Reisner, and what she liked to do was human-factors experiments. So she went down to San Jose State and recruited a bunch of San Jose State students to teach them the SEQUEL language and see if they could learn it. She did this for several months and wrote a paper about it, and gained recognition in the human-factors community for her work.[30], [31] I’m not sure if the results were very conclusive; it turned out that sure enough if you worked hard enough, you could teach SEQUEL to college students. [laughter] Most of the mistakes they made didn’t really have anything to do with syntax. They made lots of mistakes – they wouldn’t capitalize correctly, and things like that.

Looking back on it, I don’t think the problem we thought we were solving was where we had the most impact. What we thought we were doing was making it possible for non-programmers to interact with databases. We thought that this was going to open up access to data to a whole new class of people who could do things that were never possible before because they didn’t know how to program. This was before the days of graphical user interfaces which ultimately did make that sort of a revolution, and we didn’t know anything about that, and so I don’t think we impacted the world as much as we hoped we were going to in terms of making data accessible to non-programmers. It kind of took Apple to do that. The problem that we didn’t think we were working on at all – at least, we didn’t pay any attention to it – was how to embed query languages into host languages, or how to make a language that would serve as an interchange medium between different systems – those are the ways in which SQL ultimately turned out to be very successful, rather than as an end-user language for ad hoc users. So I think the problem that we solved wasn’t really the problem that we thought we were solving at the time.

Anyway, we were working on this language, and we adapted it from SQUARE and turned it into English and then we started adding a bunch of things to it like GROUP BY that didn’t really come out of the SQUARE heritage at all. So you couldn’t really say it had much to do with SQUARE before we were done. Ray and I wrote some papers about this language in 1974. We wrote two papers: one on SEQUEL/DML[32] and one on SEQUEL/DDL[33]. We were cooperating very closely on this. The DML paper’s authors were Chamberlin and Boyce; the DDL paper’s authors were Boyce and Chamberlin, for no special reason; we just sort of split it up. We wanted to go to Stockholm that year because it was the year of the IFIP Congress in Stockholm. I had a ticket to Stockholm because of some work I’d done in Yorktown, so Ray submitted the DDL paper to the IFIP Congress in Stockholm, and the DML paper we submitted to SIGMOD. This is the cover page of the SEQUEL/DML paper. It was 24 pages long. These were twin papers in our original estimation. We wrote them together and thought they were of comparable value and impact. But what happened to them was quite different. The DDL paper got rejected by the IFIP Congress; Ray didn’t get to go to Stockholm. I still have that paper in my drawer; it’s never been published. The DML paper did get accepted at SIGMOD. Several years later I got a call from a guy named Larry Ellison who’d read that paper; he basically used some of the ideas from that paper to good advantage. [laughter] The latest incarnation of these ideas is longer than 24 pages long; it’s the ISO standard for the SQL language, which was just described last week at SIGMOD by Nelson Mattos[34]. It’s now about 1600 pages.

I believe this quote explains how SQL gained its second-class status: it’s not for programmers, but it’s “too complicated” for end-users who became used to interacting with applications graphically.

Do you have someone on staff who really knows SQL?  Who can make the data super easily accessible to application programmers and end-users alike?  Who removes the barrier and lowers the hurdle in the way of turning data into value?  You’re probably gathering more and more relational data every day, and probably shredding your XML and storing your BLOBs there too.  I’m not saying that SQL is more important than data or the means to analyze it.  I am saying that experts at SQL can make your databases perform better AND make it easier for your application people to focus on delivering that data to the people who want to use it.  Don’t leave SQL in the limbo land of being not for programmers and not for end-users.

Update:  I wanted to give credit to the source of my quote:

Copyright (c) 1995, 1997 by Paul McJones, Roger Bamford, Mike Blasgen, Don Chamberlin, Josephine Cheng, Jean-Jacques Daudenarde, Shel Finkelstein, Jim Gray, Bob Jolls, Bruce Lindsay, Raymond Lorie, Jim Mehl, Roger Miller, C. Mohan, John Nauman, Mike Pong, Tom Price, Franco Putzolu, Mario Schkolnick, Bob Selinger, Pat Selinger, Don Slutz, Irv Traiger, Brad Wade, and Bob Yost. You may copy this document in whole or in part without payment of fee provided that you acknowledge the authors and include this notice.

The Rule of 5

During my 2006 Hotsos presentation I mentioned 2 “rules of 5” that I like to use.  I didn’t come up with them myself, but I’m pleasantly surprised when I find evidence to support them.  Of course, the human brain always finds evidence to support its own prejudiced hypotheses (for an excellent read that demonstrates this concept, try Foucault’s Pendulum by Umberto Eco).  Anyway, the 2 rules of 5 are:

  1. Most people have 5 times as much hardware as they need (Tom Kyte)
  2. A useful tuning goal for SQL is 5 LIOs per row per row source (Cary Millsap)

Of course, you need to know what LIOs (logical I/Os) are, and a depressingly growing number of the DBAs I meet don’t have the foggiest notion of them.
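To make rule #2 concrete, here is a minimal sketch of the arithmetic as I read it: the LIO budget for a statement is 5 times the rows returned times the number of row sources.  The function names and the sample numbers below are mine, made up for illustration, and the way zero-row queries are handled is an assumption rather than part of the rule.

```python
def lio_budget(rows_returned, row_sources, lios_per_row=5):
    """Target LIO count under the '5 LIOs per row per row source' goal.

    Assumption: a query returning zero rows is still budgeted as if it
    returned one row, so the budget is never zero.
    """
    return lios_per_row * max(rows_returned, 1) * row_sources


def over_budget_ratio(actual_lios, rows_returned, row_sources):
    """How many times over (or under) the budget a statement ran."""
    return actual_lios / lio_budget(rows_returned, row_sources)


# Hypothetical statement: 100 rows returned from a 3-row-source plan,
# observed to perform 45,000 LIOs.
budget = lio_budget(100, 3)
ratio = over_budget_ratio(45_000, 100, 3)
print(f"budget={budget} LIOs, actual is {ratio:.1f}x the budget")
```

A ratio well above 1 marks the statement as a tuning candidate; it does not say what is wrong with it.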

I point you at an excellent blog post by Shakir Sadikali at the Pythian Group which shows off a ten-node RAC cluster brought to its knees by unindexed foreign keys (doh!).  Fixing that, along with other tuning work, allowed them to reduce the cluster from 10 nodes to 2 (that is, 1/5th of their original hardware).  Score one for #1!
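For anyone who hasn’t watched an unindexed foreign key hurt, here is a small sketch of one part of the problem.  It uses SQLite through Python’s sqlite3 module purely as a stand-in, not Oracle or RAC, and the parent/child tables and the index name are hypothetical.  It only shows the access-path side (a lookup by the foreign key column scanning the whole child table); it says nothing about the locking behaviour Oracle adds on top.

```python
import sqlite3


def plan_details(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


def foreign_key_plans():
    """Plan for a child lookup before and after indexing the foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE child (
            id INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL REFERENCES parent(id),
            payload TEXT
        );
    """)
    lookup = "SELECT * FROM child WHERE parent_id = ?"
    before = plan_details(conn, lookup, (1,))
    # The fix: index the foreign key column on the child table.
    conn.execute("CREATE INDEX child_parent_idx ON child (parent_id)")
    after = plan_details(conn, lookup, (1,))
    conn.close()
    return before, after


before, after = foreign_key_plans()
print("without index:", before)
print("with index:   ", after)
```

The first plan should report a scan of the child table and the second a search using child_parent_idx; the exact wording of the plan text varies between SQLite versions.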

BTW, most people argue against #2 by pointing to aggregates.  My standard response is that any aggregate that is queried heavily is an opportunity for derivation, pre-calculation or optimization.
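As one illustration of what pre-calculation can look like, here is a sketch that keeps a running total in a summary table via a trigger, so the heavily queried aggregate becomes a single-row lookup.  Again this is SQLite via Python as a stand-in, the sales and sales_by_region tables are hypothetical, and only inserts are handled; updates and deletes would need their own triggers.  On Oracle a materialized view would likely be the more natural tool.

```python
import sqlite3


def build_sales_db():
    """Create a detail table plus a trigger-maintained summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (region TEXT NOT NULL, amount INTEGER NOT NULL);
        CREATE TABLE sales_by_region (
            region TEXT PRIMARY KEY,
            total  INTEGER NOT NULL
        );
        CREATE TRIGGER sales_after_insert AFTER INSERT ON sales
        BEGIN
            -- Make sure a summary row exists, then add the new amount.
            INSERT OR IGNORE INTO sales_by_region (region, total)
            VALUES (NEW.region, 0);
            UPDATE sales_by_region
               SET total = total + NEW.amount
             WHERE region = NEW.region;
        END;
    """)
    return conn


conn = build_sales_db()
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 10), ("west", 5), ("east", 7)],
)

# Aggregate computed on the fly: touches every detail row.
on_the_fly = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Pre-calculated aggregate: one row per region.
precalculated = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()

print(on_the_fly == precalculated)
```

The trade-off is the usual one: each insert does a little more work so that each read of the aggregate does far less.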

2008 Hotsos Conference Material

I’ve uploaded my presentation and the DDL code generation scripts I referenced in my talk.  Just scroll down the right-hand side of this blog to the section marked “Content”.