Sqoop /*+ parallel */

It’s often useful to keep one’s ear to the ground when one is responsible for database performance and efficiency. It’s a talent I wish more DBAs had: a sense of curiosity about how the database is actually being used to support particular application use cases.

Today’s post deals with something rather basic, but it speaks to this idea of “agile” collaboration between developers and DBAs around something that has a lot of buzz right now: processing “Big Data” with Hadoop / MapReduce. In particular, it concerns a process that someone had already deemed “too big” to be accomplished within an existing Oracle database.

Opinions about the appropriate tool for the job aside, I happened to overhear some application developers explaining that their process to extract data from the database for Hadoop processing was the “long pole” in their overall processing time. Of course, having only basic database and SQL knowledge, they attributed this delay to the database being “slow”.

Normally this is enough to set off even the most mild-mannered DBAs (at least the ones who care anyway), but in my case I decided to just take a look and see if I could help them.

Turns out that they were using Sqoop to pull data out of Oracle for several tables, and then using the MapReduce framework to “correlate” them (i.e., join them).

Of course, there were limits to this approach, as doing joins in MapReduce required custom code to map and hash column values; alternatively, they could always use Hive (which is what they actually ended up using; it could have been worse: they might have used Pig). Either way, it was a distraction from what they were actually trying to do.

So, I volunteered to help them — first by creating a view that joined everything they needed together, complete with an id column that uniquely identified each row. Bingo! No more Hive code necessary.
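
The view itself was nothing exotic. A minimal sketch of the idea, with hypothetical table and column names standing in for the real ones:

create or replace view hadoop_extract_v as
select o.order_id      as id,       -- unique per row of this particular join
       o.order_date,
       c.customer_name,
       p.product_name
  from orders    o
  join customers c on c.customer_id = o.customer_id
  join products  p on p.product_id  = o.product_id;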

While this helped them by eliminating the extra code in their project, they were still disappointed with the speed at which they were able to extract the data using Sqoop. They even tried “adding threads” by using Sqoop with parallelism.

From the Sqoop User Guide:

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

Of course at this point the competent performance DBA can see a problem right away — specifically in the following description of how parallelism is achieved:

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

In the absence of an index on the id column, the example above would require 5 full-table scans: one for the query that retrieves the minimum and maximum values of the splitting column, plus one for each of the four mappers’ range queries. My view was actually making things worse, as I had assumed they were simply dumping ALL of the results from the view over a single connection, not making 4 attempts to slice it up.
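
To make the cost concrete, here is roughly the workload generated for the documentation’s example, sketched against the hypothetical view above: one bounding query to fetch the low and high values of the split column, plus one range query per mapper, and without an index on id each one is a full scan:

select min(id), max(id) from hadoop_extract_v;                  -- bounding query: scan 1
select * from hadoop_extract_v where id >= 0   and id < 250;    -- mapper 1: scan 2
select * from hadoop_extract_v where id >= 250 and id < 500;    -- mapper 2: scan 3
select * from hadoop_extract_v where id >= 500 and id < 750;    -- mapper 3: scan 4
select * from hadoop_extract_v where id >= 750 and id < 1001;   -- mapper 4: scan 5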

In the end I simply materialized the results of the view into a table and indexed the id column — resulting in much happier developers and a blameless (for the time being) database.
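
Something along these lines, again with the hypothetical names from the sketch above, so that the per-mapper range predicates on id could use an index instead of scanning the whole join four more times:

create table hadoop_extract as
select * from hadoop_extract_v;

create index hadoop_extract_id_ix on hadoop_extract (id);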

Of course, the next thing they wanted was the full result set as an array of JSON objects, but that’s another story…

Good Forever

Oracle 10.2.0.1 (I know, an out-of-date version) on Windows 2003:

The other day one of my stored procedures errored out with:

ORA-01801: date format is too long for internal buffer

somewhere near a line in which I was doing the following assignment:

d := to_char(c.date_field,'YYYYMMDD');

Here d is of type varchar2, c is a row from a cursor FOR loop, and date_field is a DATE column in the cursor.
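
For context, the failing code looked roughly like this; a sketch using the sanitized names from this post, with only the to_char call taken verbatim:

declare
  d varchar2(8);
begin
  for c in (select date_field from table_name) loop
    d := to_char(c.date_field, 'YYYYMMDD');   -- ORA-01801 raised here for the bad row
  end loop;
end;
/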

Selecting the date field from the database didn’t look odd:

select date_field from table_name t where t.id = <id of failing row>;

'04-AUG-17'

However, selecting to_char did yield something odd:

select to_char(date_field,'YYYYMMDD') from table_name t where t.id = <id of failing row>;

'00000000'

Odd: no 04, no AUG, and no 17.

At this point I could tell something was up with the data in the field, so I resorted to DUMPing its contents:

select dump(date_field) from table_name t where t.id = <id of failing row>;

Typ=12 Len=7: 220,117,8,4,1,1,1

Uh-oh — the first byte is supposed to indicate the century (subtract 100 from the value to get the century), and legitimate centuries stop at 99. This value indicates a century value of 120. For an “actual” date of:

August 4th, 12017
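
For reference, Oracle stores a DATE in seven bytes, with the century and year in excess-100 form and the time components offset by one, so the dump above decodes as:

220 - 100 = 120      century
117 - 100 =  17      year within the century (i.e., year 12017)
        8            month (August)
        4            day
  1, 1, 1            hour + 1, minute + 1, second + 1 (i.e., 00:00:00)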

I guess what was weird to me was that the PL/SQL to_char reported an error, while the SQL to_char didn’t (instead it returned an odd '00000000' result).

Going forward, it was relatively easy to find rows like this in the table:

select id, dump(date_field) from table_name t where date_field > to_date('99991231','YYYYMMDD');

We actually left the bad data alone and popped a view on top of the table using LEAST / GREATEST to cap the value at 9999-12-31, which minimized the code impact to our PL/SQL routines.
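
The view boiled down to something like this; a sketch using the same sanitized names, showing only the LEAST side of the capping (the real view also carries the table’s other columns):

create or replace view table_name_v as
select t.id,
       least(t.date_field, to_date('99991231', 'YYYYMMDD')) as date_field
  from table_name t;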

Still and all, I can’t wait to hop into my time machine with Jules Verne and see what happens on that date.

Quick SQL Developer Hint — Dial0gInput

The other day I was attempting to debug a SQL statement which was doing a REGEXP_REPLACE; the comments said something like “remove all zeros from the string”.

However, in looking at the code, it looked like it was removing all capital O’s from the string instead of 0’s; see, even here within WordPress I can’t tell the difference between an O (the letter) and a 0 (the number).
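
A contrived example of the difference; this is not the original statement, just the two patterns side by side:

select regexp_replace('O0O0O0', '0') as zeros_removed,    -- returns 'OOO' (digit zero stripped)
       regexp_replace('O0O0O0', 'O') as letters_removed   -- returns '000' (letter O stripped)
  from dual;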

At first I thought it a coding mistake until I tried to type both of them into a window within SQL Developer — nope, they looked the same. Must be a font issue.

Fonts can be changed in SQL Developer under the Preferences / Code Editor / Fonts selection — you may want to limit choices to fixed-width fonts by checking “Display Only Fixed-Width Fonts”. The default on my system was DialogInput — which appears to have the O/0 problem. I ended up choosing Consolas 12pt and exiting back out to the editor.

Only to see a horribly grainy-looking font. Yuck!

Back to the Preferences — somewhat hidden under Preferences / Code Editor / Display is an item for “Enable Text Anti-Aliasing” — check it.

Much, much better.

I think that preference item should be moved to the Fonts category, but at least I was able to find it — and my eyes feel a lot better.

Oh yeah, now I can tell zeros from O’s again. Now on to those pesky 1’s and l’s … :-)

UPDATE

You may also want to adjust the fonts used for Printing to match your Editor selection; to do so, navigate to Preferences / Code Editor / Printing (and Printing HTML) and change the fonts there as well.