It’s often useful to keep one’s ear to the ground when one is responsible for database performance and/or efficiency — it’s a talent that I wish more DBAs had: a sense of curiosity about how the database is being used to support particular application use cases.
Today’s post deals with something rather basic, but it speaks to this idea of “agile” collaboration between developers and DBAs around something that has a lot of buzz right now: processing “Big Data” with Hadoop / MapReduce. In particular, a process that someone has already deemed to be “too big” to be accomplished within an existing Oracle database.
Opinions aside about the appropriate tool for the job, I happen to overhear some application developers explaining that their process to extract data from the database for Hadoop processing was the “long pole” in the time their entire process was taking. Of course, having only basic database and SQL knowledge, this time delay was attributed to the database being “slow”.
Normally this is enough to set off even the most mild-mannered DBAs (at least the ones who care anyway), but in my case I decided to just take a look and see if I could help them.
Turns out that they were using Sqoop to pull data out of Oracle for several tables, and then using the MapReduce framework to “correlate” them (i.e., join them).
Of course, there were limits to this approach as doing joins in MapReduce required custom code to map and hash column values, or they could always use Hive (which they actually ended up using — it could have worse: they might have used Pig). But it was a distraction from what they were actually trying to do.
So, I volunteered to help them — first by creating a view that joined everything they needed together, complete with an id column that uniquely identified each row. Bingo! No more Hive code necessary.
While this helped them by eliminating the extra code in their project, they were still disappointed with the speed at which they were able to extract the data using Sqoop. They even tried “adding threads” by using Sqoop with parallelism.
From the Sqoop User Guide:
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or –num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
Of course at this point the competent performance DBA can see a problem right away — specifically in the following description of how parallelism is achieved:
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
In the absence of an index on the id column, the example above would require 5 full-table scans — my view was actually making things worse as I had assumed they were simply dumping ALL of the results from the view from one connection, not 4 attempts to slice it up.
In the end I simply materialized the results of the view into a table and indexed the id column — resulting in much happier developers and a blameless (for the time being) database.
Of course, the next thing they wanted was the full result set as an array of JSON objects, but that’s another story…