Sqoop /*+ parallel */

Keeping one’s ear to the ground is a useful habit for anyone responsible for database performance and efficiency. It’s a talent I wish more DBAs had: a sense of curiosity about how the database is being used to support particular application use cases.

Today’s post deals with something rather basic, but it speaks to this idea of “agile” collaboration between developers and DBAs around something that has a lot of buzz right now: processing “Big Data” with Hadoop / MapReduce. In particular, a process that someone has already deemed to be “too big” to be accomplished within an existing Oracle database.

Opinions aside about the appropriate tool for the job, I happened to overhear some application developers explaining that extracting data from the database for Hadoop processing was the “long pole” in the time their entire process was taking. Of course, since they had only basic database and SQL knowledge, they attributed the delay to the database being “slow”.

Normally this is enough to set off even the most mild-mannered DBAs (at least the ones who care, anyway), but in this case I decided to just take a look and see if I could help them.

It turns out that they were using Sqoop to pull several tables out of Oracle, and then using the MapReduce framework to “correlate” them (i.e., join them).

Of course, there were limits to this approach, as doing the joins in MapReduce required custom code to map and hash column values. They could always use Hive instead (which is what they ended up doing; it could have been worse, they might have used Pig). Either way, it was a distraction from what they were actually trying to do.

So, I volunteered to help them — first by creating a view that joined everything they needed together, complete with an id column that uniquely identified each row. Bingo! No more Hive code necessary.
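
To make that concrete, here’s a minimal sketch of the kind of view I mean. The table and column names are invented for illustration, and the unique id is simply generated with ROW_NUMBER (which matters later in this story, since you can’t just index it in place):

    -- hypothetical tables; the point is the join plus a generated unique id
    CREATE OR REPLACE VIEW export_v AS
    SELECT ROW_NUMBER() OVER (ORDER BY o.order_id) AS id,  -- unique row identifier
           o.order_date,
           c.customer_name,
           p.product_name,
           oi.quantity
    FROM   orders      o
    JOIN   customers   c  ON c.customer_id = o.customer_id
    JOIN   order_items oi ON oi.order_id   = o.order_id
    JOIN   products    p  ON p.product_id  = oi.product_id;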

While this helped them by eliminating the extra code in their project, they were still disappointed with the speed at which they were able to extract the data using Sqoop. They even tried “adding threads” by using Sqoop with parallelism.

From the Sqoop User Guide:

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

Of course at this point the competent performance DBA can see a problem right away — specifically in the following description of how parallelism is achieved:

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

In the absence of an index on the id column, the example above requires five full scans: one for the boundary query that retrieves the minimum and maximum id values, plus one for each of the four split queries. My view was actually making things worse, since I had assumed they were simply dumping ALL of the results from the view over a single connection, not making four separate attempts to slice it up, each of which re-executed the underlying join.
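
So against a view like the hypothetical export_v sketched above, with no usable index on id, Sqoop’s five statements all degenerate into full executions of the underlying join, roughly:

    -- boundary query: one pass over the join to find the split range
    SELECT MIN(id), MAX(id) FROM export_v;

    -- then one query per mapper, each re-executing the entire join
    SELECT * FROM export_v WHERE id >= 0   AND id < 250;
    SELECT * FROM export_v WHERE id >= 250 AND id < 500;
    SELECT * FROM export_v WHERE id >= 500 AND id < 750;
    SELECT * FROM export_v WHERE id >= 750 AND id < 1001;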

In the end I simply materialized the results of the view into a table and indexed the id column — resulting in much happier developers and a blameless (for the time being) database.
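
A sketch of that fix, continuing with the invented names from above:

    -- materialize the join once...
    CREATE TABLE export_t AS
    SELECT * FROM export_v;

    -- ...then give Sqoop's boundary and range queries an index to work with
    CREATE UNIQUE INDEX export_t_id_ix ON export_t (id);

    -- and let the optimizer know about it (SQL*Plus shorthand)
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EXPORT_T')

With the table and index in place, the min/max boundary query and each of the four range splits become cheap index-driven reads instead of four more trips through the join.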

Of course, the next thing they wanted was the full result set as an array of JSON objects, but that’s another story…

Injection Nation

I’m somewhat surprised to see a lack of Oracle blogging reaction to the recent post on The Daily WTF which goes into great detail on a case of SQL injection.  Maybe we’ve become tired of it, or maybe we all assume that “my systems don’t do that!”

So, how do you audit or track if your system is being hit by injection?  How would you detect it?  Assume you’re “just a DBA” — and no one tells you about applications being deployed that talk to the database.  Is there a way you could tell just by looking from within the database?  What kind of assumptions would you make?
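
Here’s one crude way you might start looking from inside the database, purely as an illustration; the patterns and the threshold are made up, and false positives abound. Scan the shared pool for statements carrying classic injection fragments, and for families of nearly identical statements that differ only in their literals:

    -- statements containing classic injection fragments
    SELECT sql_id, executions, first_load_time, sql_text
    FROM   v$sql
    WHERE  UPPER(sql_text) LIKE '%UNION%SELECT%NULL%'
       OR  UPPER(sql_text) LIKE '%OR 1=1%'
       OR  UPPER(sql_text) LIKE '%''--%';

    -- families of statements identical except for their literals
    -- (often a sign of string-concatenated SQL, which is where injection lives)
    SELECT force_matching_signature, COUNT(*) AS variants
    FROM   v$sql
    WHERE  force_matching_signature <> 0
    GROUP  BY force_matching_signature
    HAVING COUNT(*) > 100
    ORDER  BY variants DESC;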

The Value of Information

There has been an interesting and somewhat heated discussion going on about a recent blog post by Dominic Brooks and referenced by Doug Burns about the relative value of data vs. applications.  Actually, most of the heat seems to be directed at a comment made by Tim Gorman on several mailing lists in which he states that:

Data, not programs, is the only thing that matters — applications are transient and have no value except to acquire, manipulate, and display data. Data is the only thing with value.

I’ve deliberately taken the quote out of context, for that is how it’s being reacted to, fairly or unfairly, on Doug Burns’ blog entry.

I’m not actually going to add any fuel to that fire, only offer up some observations.  I think I agree with many who are saying that data which lies about, unexploited by any application, is a pretty useless waste of storage, and that the true value of data comes from the ability to use it through an application which lets one analyze, manipulate, and visualize information synthesized from the data soup.  One reason I’m excited about the new company I’m with is its focus on helping people increase their ability to exploit their data.

To that end, one of my burning interests is the ease with which the average employee can get access to data and the means to create value out of it.  This includes data accessibility combined with compliance controls, as well as tools and applications which allow the employee to tease ideas out of the data.  I wish Excel were a better data manipulation and analysis tool, since it’s so ubiquitous.  But my real concern is my perception that the language of data access has been kicked into a corner, shunned by end users and application programmers alike.  I find the lack of SQL knowledge and use appalling in most of the technologists I’ve encountered.  And that’s a real shame, because I find SQL’s ability to make data accessible second to none.  I have an idea about why SQL ability is so lacking, and I think it goes back to the language’s original development.  The following is from a fascinating transcript hosted by Paul McJones, titled The 1995 SQL Reunion: People, Projects, and Politics:

Don Chamberlin: So what this language group wanted to do when we first got organized: we had started from this background of SQUARE, but we weren’t very satisfied with it for several reasons. First of all, you couldn’t type it on a keyboard because it had a lot of funny subscripts in it. So we began saying we’ll adapt the SQUARE ideas to a more English keyword approach which is easier to type, because it was based on English structures. We called it Structured English Query Language and used the acronym SEQUEL for it. And we got to working on building a SEQUEL prototype on top of Raymond Lorie’s access method called XRM.

At the time, we wanted to find out if this syntax was good for anything or not, so we had a linguist on our staff, for reasons that are kind of obscure. Her name was Phyllis Reisner, and what she liked to do was human-factors experiments. So she went down to San Jose State and recruited a bunch of San Jose State students to teach them the SEQUEL language and see if they could learn it. She did this for several months and wrote a paper about it, and gained recognition in the human-factors community for her work.[30], [31] I’m not sure if the results were very conclusive; it turned out that sure enough if you worked hard enough, you could teach SEQUEL to college students. [laughter] Most of the mistakes they made didn’t really have anything to do with syntax. They made lots of mistakes – they wouldn’t capitalize correctly, and things like that.

Looking back on it, I don’t think the problem we thought we were solving was where we had the most impact. What we thought we were doing was making it possible for non-programmers to interact with databases. We thought that this was going to open up access to data to a whole new class of people who could do things that were never possible before because they didn’t know how to program. This was before the days of graphical user interfaces which ultimately did make that sort of a revolution, and we didn’t know anything about that, and so I don’t think we impacted the world as much as we hoped we were going to in terms of making data accessible to non-programmers. It kind of took Apple to do that. The problem that we didn’t think we were working on at all – at least, we didn’t pay any attention to it – was how to embed query languages into host languages, or how to make a language that would serve as an interchange medium between different systems – those are the ways in which SQL ultimately turned out to be very successful, rather than as an end-user language for ad hoc users. So I think the problem that we solved wasn’t really the problem that we thought we were solving at the time.

Anyway, we were working on this language, and we adapted it from SQUARE and turned it into English and then we started adding a bunch of things to it like GROUP BY that didn’t really come out of the SQUARE heritage at all. So you couldn’t really say it had much to do with SQUARE before we were done. Ray and I wrote some papers about this language in 1974. We wrote two papers: one on SEQUEL/DML[32] and one on SEQUEL/DDL[33]. We were cooperating very closely on this. The DML paper’s authors were Chamberlin and Boyce; the DDL paper’s authors were Boyce and Chamberlin, for no special reason; we just sort of split it up. We wanted to go to Stockholm that year because it was the year of the IFIP Congress in Stockholm. I had a ticket to Stockholm because of some work I’d done in Yorktown, so Ray submitted the DDL paper to the IFIP Congress in Stockholm, and the DML paper we submitted to SIGMOD. This is the cover page of the SEQUEL/DML paper. It was 24 pages long. These were twin papers in our original estimation. We wrote them together and thought they were of comparable value and impact. But what happened to them was quite different. The DDL paper got rejected by the IFIP Congress; Ray didn’t get to go to Stockholm. I still have that paper in my drawer; it’s never been published. The DML paper did get accepted at SIGMOD. Several years later I got a call from a guy named Larry Ellison who’d read that paper; he basically used some of the ideas from that paper to good advantage. [laughter] The latest incarnation of these ideas is longer than 24 pages long; it’s the ISO standard for the SQL language, which was just described last week at SIGMOD by Nelson Mattos[34]. It’s now about 1600 pages.

It’s from this history, I believe, that SQL gained its second-class status: it’s not for programmers, but it’s “too complicated” for end users, who became accustomed to interacting with applications graphically.

Do you have someone on staff who really knows SQL?  Who can make the data super easily accessible to application programmers and end-users alike?  Who removes the barrier and lowers the hurdle in the way of turning data into value?  You’re probably gathering more and more relational data every day — and probably shredding your XML and storing your BLOBs there too.  I’m not saying that SQL is more important than data or the means to analyze it — I am saying that experts at SQL can make your databases perform better AND make it easier for your application people to focus on delivering that data to the people who want to use it.  Don’t put it in the limbo land of being not for programmers and not for end-users.

Update:  I wanted to give credit to the source of my quote:

Copyright (c) 1995, 1997 by Paul McJones, Roger Bamford, Mike Blasgen, Don Chamberlin, Josephine Cheng, Jean-Jacques Daudenarde, Shel Finkelstein, Jim Gray, Bob Jolls, Bruce Lindsay, Raymond Lorie, Jim Mehl, Roger Miller, C. Mohan, John Nauman, Mike Pong, Tom Price, Franco Putzolu, Mario Schkolnick, Bob Selinger, Pat Selinger, Don Slutz, Irv Traiger, Brad Wade, and Bob Yost. You may copy this document in whole or in part without payment of fee provided that you acknowledge the authors and include this notice.

Free Data Now!

So this post is only tangentially related to the current online effort to get access to Oracle 10g’s AWR/ASH data.  It’s actually more about data mashups.

When building an internal corporate application, most of the time you’re familiar with the internal sources of data and how to get at them — although, you can often have lots of “fun” doing the detective work to uncover hidden sources of internal data, fighting the data guardians and mastering the necessary API incantations to pry loose the secrets. :-)

By basing your application only on internal data, you may be missing out on interesting opportunities to leverage public and freely available data and feeds.  Although even there you need to be wary of odd APIs and usage restrictions.  The usage restrictions can be strange: unlimited access for “non-profit” or “public” use, but heavy costs and fines for “private” use.  An interesting example is Google Maps; here’s the quote from the Google Maps API page:

The Maps API is a free beta service, available for any web site that is free to consumers. Please see the terms of use for more information.

To use the Maps API on an intranet or in a non-publicly accessible application, please check out Google Maps for Enterprise.

Following the links shows that using Google Maps in an internal application will set you back $10,000 per year.

The question of usage rights for data has been a huge source of problems in professional sports leagues.  In 1997, the NBA sued Motorola and STATS, Inc. regarding the real-time dissemination of scores and game information and lost.  However, recently the NCAA clamped down on live-blogging at one of its events.  Even today, you’ll have a hard time finding real-time RSS feeds for MLB sports scores.  Although, those of you who love data analysis may enjoy poring over the MLB Enhanced GameCast data — latest example here and initial research here.

So, clearly you’ll need to be selective when working with external data; there are also questions of data reliability and validity.  Still, the sheer plethora of public RSS feeds should be considered a new and interesting source of data when building data analysis applications.

I was working on this post last week when the 10g AWR/ASH petition came online here.  I’m of two minds about the issue.  In the end, I think limiting access to this data isn’t going to achieve what I believe to be Oracle’s goal of encouraging sales of Enterprise Manager Diagnostic Pack.  I’ve tried to evaluate the Diagnostic Pack — it’s not easy to install, and not easy to navigate.  Go ahead, try and find the links on oracle.com to the download and/or documentation.  (I tried this for the Change Management pack, and it was just about impossible to find).  Compare that with how easy it is to find and download the database, client tools and/or SQL Developer.  In the end, I think you’ll see people trying to re-create STATSPACK for 10g in a “public” fashion rather than running out and installing Enterprise Manager.  So, you’ll have a lot of time and effort spent in lobbying Oracle for access to this data, and/or effort spent on re-creating the data capture and storage in a public way — instead of inventing useful analysis tools based on the data that’s already there.

Ah well…

I know why it’s always the database

I got to this post via Elisa Gabbert and Jeff Hunter after doing my daily reading which started with Jeremy Schneider’s excellent post on a recent healthcheck.

Jeff sums it up best in wondering why, when an application is a complex combination of system components, the database is always the one blamed for poor performance.

I’ll tell you why.

People ask for data. And they don’t like the first set of data they get, so they ask again, and then they browse back and forth through the data.

And where is the data? In the database.

People don’t ask the network, or the disk, or the CPU or the memory for their data. And despite the amount of Javascript, PHP and Java code in their browsers and middle-tiers, they don’t ask their browser or application server either.

The database is responsible for serving data quickly. And since you’re the DBA, when the data doesn’t come back quickly enough, you’re asked to “look at the database”.

BTW, this is one of the reasons I’m a huge fan of constant monitoring — I like to see if the database is acting differently from when “everything was ok”. This requires some kind of historical view or “baseline”. Some way to look back in time to yesterday, or last week at this time, or last month during the monthly close, etc. If the profile during the “slow response” is the same as it was during the “fast response”, generally there’s nothing wrong with the database overall — there may be something going on with that particular request, but you know how to fix those.
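
One home-grown way to get that kind of baseline, sketched here with invented names: snapshot a few of the cumulative system statistics on a schedule, then compare how much they grew in the last hour against the same hour a week ago.

    CREATE TABLE stat_history (
      snap_time DATE,
      name      VARCHAR2(64),
      value     NUMBER
    );

    -- run from a scheduled job, say every 15 minutes
    INSERT INTO stat_history
    SELECT SYSDATE, name, value FROM v$sysstat;
    COMMIT;

    -- how much did each statistic grow in the last hour,
    -- versus the same hour one week ago? (ignores instance restarts)
    SELECT name,
           MAX(CASE WHEN snap_time >= SYSDATE - 1/24 THEN value END)
         - MIN(CASE WHEN snap_time >= SYSDATE - 1/24 THEN value END) AS last_hour,
           MAX(CASE WHEN snap_time BETWEEN SYSDATE - 7 - 1/24 AND SYSDATE - 7 THEN value END)
         - MIN(CASE WHEN snap_time BETWEEN SYSDATE - 7 - 1/24 AND SYSDATE - 7 THEN value END) AS same_hour_last_week
    FROM   stat_history
    GROUP  BY name
    ORDER  BY name;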

And, unfortunately for you, the database is ridiculously well instrumented, and you’ve convinced everybody that you can quickly see everything going on inside it. Oracle’s wait event model really has no counterpart in the other parts of the application stack.
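
As a tiny illustration of what that instrumentation looks like, one query against a standard dynamic performance view tells you what every session is waiting on right now (the idle-event filter below is just an example); good luck getting the equivalent out of the average application server:

    -- what is each session waiting on at this instant?
    SELECT sid, event, state, seconds_in_wait
    FROM   v$session_wait
    WHERE  event NOT IN ('SQL*Net message from client', 'rdbms ipc message')
    ORDER  BY seconds_in_wait DESC;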

Check out Jeremy’s post again — look at how “easy” it was to see that AIX was hopelessly mis-configured and was using a ton of swap space.

I’ve never really liked the idea of swap space. Why would you ever want anything at all swapped out of memory on a database server? Do you monitor memory usage on your PC? What happens when you start using swap space? You go to the Kingston web site and purchase more memory, that’s what. I used to get annoyed at hardware and O/S vendors when I’d ask why we needed so much swap. My favorite answers:

  1. Just in case
  2. We’ll crash / panic otherwise
  3. We “pre-swap” code text images to speed up swapping
  4. The system becomes more efficient during off-peak usage because it can handle more processes then, as long as they’re idle

This is nuts.

Another fun excuse I’ve heard comes up when there is significant data caching at the middle and web tiers: “it takes a long time for the database to give us data to fully populate the cache, so queries are slow.”

This is why I advocate SIMPLE architectures for data-intensive applications. Fewer non-database places where things can go wrong. Fewer times that the DBA gets asked about “why isn’t my data coming back fast enough” and has to spend time finding out which part of the application stack isn’t pulling its weight.