Hate the player, not the game — or my case for Data Engineers
July 11th, 2012 — ddelmoli
Has “database” become a dirty word within your organization lately? If you’re someone who has been a data technologist for the better part of your career, you may be wondering why the technologies you work with every day seem to be acquiring such a bad rap. From NoSQL to No DB, the current influx of brogrammers seems to take extreme pride in describing how they’re able to write code while avoiding any kind of database technology whatsoever.
The impetus for this post actually started with something I read on the ‘Net the other day about Command Query Responsibility Segregation (CQRS), and how I was initially excited about the concept.
Martin Fowler has a nice, gentle introduction to the topic here.
Before I get into the post, however, I think it’s useful for me to describe some of my attitudes toward data management. What’s really odd is that while I rather strongly disagree with the tone of Uncle Bob Martin’s Rant, I actually strongly agree with his assertion about the high value of use-case driven development.
I’ve had gentle debates about the meaning of “data as truth” with several people, and the age-old debate of whether data is more “important” than application code. Generally I’ve found that such debates end up as religious arguments instead of attempts to get at the value of acting on data, or of data in action. In the end, it’s hard for data to have value unless it’s acted on by a set of processing directives (applications), and while it’s possible to have valuable applications that don’t require knowledge about the past (basic rule engine codifications), in general data and applications need each other.
Why I call myself a data engineer
I’ve been impressed with EMC’s attempt to define a Data Science curriculum. In particular, I like how they describe the different skills and roles necessary for a successful data science team, including the hot new title of data scientist. The data science team often includes a data architect, a data engineer, and a database administrator. So, what is a data engineer? In a blog by Steve Todd, Director of EMC’s Global Research and Innovation Portfolio, he has the following characterizations:
The “Database Administrator” provisions and configures the database environment to support the analytical needs of the working team. The “Data Engineer” tends to have deep technical skills to assist with tuning SQL queries for data management and extraction. They also support data ingest to the analytic sandbox. These people can be one in the same, but many times the data engineer is an expert on queries and data manipulation (and not necessarily analytics as such). The DBA may be good at this too, but many times they may simply be someone who is primarily skilled at setting up and deploying a large database schema, or product, or stack.
Many, many DBAs wear both hats, but I think it’s not a good idea. In general I think that DBA is to data engineer as system administrator is to software engineer, but the lack of data engineers has forced DBAs into dual roles for which they are often not well suited. While I have basic DBA skills, I’m much better at the skills listed under the data engineer, and I enjoy working with the data scientists or application developers who have questions about the data and/or how they’d like it structured to support their use cases.
This is one of the reasons why I agree with Uncle Bob’s rant, in which he also rails against frameworks in addition to the database. I just wish frameworks had received equal billing in the rant and title, but I’m guessing that the No DB vitriol resonated more strongly with readers. In general I like making sure data is organized in such a way as to support as many use cases as possible. That includes being performant for each use case, which may mean taking advantage of techniques to denormalize, duplicate and synchronize, cache and distribute data.
I suppose I could write a similar rant on No Data Frameworks, but then I’d probably step into the ORM battle, which really isn’t the focus of this post. But just to follow on to Uncle Bob’s rant — the reason I dislike many ORM Data Frameworks is that they tightly bind the application to a particular physical implementation of a data layout, which then limits and constrains my ability to adapt the data layout for new use cases, and leads to “persistence avoidance” in application code.
True story — on a recent Agile project, I was providing guidance on the data layer when I noticed that a bit of information for a new use case wasn’t being captured. I suggested to the team that it would be easy to extend the data layer in order to retain the additional information and I was met with groans: “But that means touching the persistence framework — that’s a big change!” — I was flabbergasted. Isn’t the data layer usually blamed for being inflexible? Are you telling me that it’s actually the framework causing the inflexibility?
If you’re still reading this, I’m sure you’re wondering how this ties in to CQRS and the original blog title.
When I first read about CQRS in Martin Fowler’s post, I became really interested. The idea that you would use different models for commands (“change” the data) and queries (“read” the data) made me think that frameworks that directly map models into applications could be retired in favor of messages related to use cases instead of model objects. To me, this means a data service API or set of virtual data layers which provide interfaces to data for applications, regardless of how the data is physically stored or organized. Huzzah! This would free me as a data engineer to ensure that I organized the data in ways which efficiently supported use cases. Since I tend to work with full-featured RDBMSs, that meant I could wrap the data in a data service API built with whatever works, including things like stored procedures or RESTful web APIs using something like Oracle’s APEX listener.
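To make the idea concrete, here is a minimal sketch of such a data service in Python, using the standard library's `sqlite3` as a stand-in for a full-featured RDBMS. Everything named here (`PlaceOrder`, `OrderSummary`, `OrderDataService`, the `order_line` table) is hypothetical and invented for illustration; it is one possible shape under those assumptions, not a reference CQRS implementation.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceOrder:
    """Command message: a request to change data (hypothetical use case)."""
    customer: str
    item: str
    quantity: int


@dataclass(frozen=True)
class OrderSummary:
    """Query result shaped for the use case, not for the tables."""
    customer: str
    total_items: int


class OrderDataService:
    """Hypothetical data service API; callers never see table names."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        # The physical layout lives here and can change without touching callers.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS order_line ("
            "customer TEXT NOT NULL, item TEXT NOT NULL, quantity INTEGER NOT NULL)"
        )

    def handle(self, command: PlaceOrder) -> None:
        """Command side: validate the request and record the change."""
        if command.quantity <= 0:
            raise ValueError("quantity must be positive")
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO order_line (customer, item, quantity) VALUES (?, ?, ?)",
                (command.customer, command.item, command.quantity),
            )

    def order_summary(self, customer: str) -> OrderSummary:
        """Query side: read-only, returns a use-case-shaped result."""
        row = self._conn.execute(
            "SELECT COALESCE(SUM(quantity), 0) FROM order_line WHERE customer = ?",
            (customer,),
        ).fetchone()
        return OrderSummary(customer=customer, total_items=row[0])


service = OrderDataService(sqlite3.connect(":memory:"))
service.handle(PlaceOrder("alice", "widget", 2))
service.handle(PlaceOrder("alice", "gadget", 3))
print(service.order_summary("alice"))
```

The application only exchanges messages tied to use cases. Whether the summary is computed from a normalized table, a denormalized copy, or a cache is a decision that stays behind the service boundary.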
So imagine my dismay when reading about CQRS and coming upon a whole series of blog posts about implementing CQRS expressly to “get rid of the database”. I intently read through the entire series trying to figure out what was wrong with the database that necessitated “getting rid of it” to implement CQRS. All to no avail. I’ve left a comment asking for that information, because I’m genuinely curious about it, but I have a guess.
It’s not about technology — it’s about the organization and its associated personalities that foster such an attitude.
Really now. In an organization with responsive data engineers there shouldn’t be a need to “get rid of the database”. One of the best reasons to have a database is that it provides so many ways to build the different kinds of models and transform the data between them with minimal need for additional frameworks or mountains of custom code.
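As a small illustration of letting the database build one model from another, the sketch below keeps a normalized "write" model in two tables and exposes a "read" model as a plain SQL view. It uses Python's built-in `sqlite3` purely for convenience; the table and view names (`customer`, `purchase`, `customer_spend`) are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized "write" model (hypothetical tables).
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE purchase (
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount_cents INTEGER NOT NULL
    );

    -- "Read" model: a view the database derives from the write model.
    CREATE VIEW customer_spend AS
    SELECT c.name AS name, COALESCE(SUM(p.amount_cents), 0) AS total_cents
    FROM customer c
    LEFT JOIN purchase p ON p.customer_id = c.id
    GROUP BY c.id, c.name;
    """
)

# Writes go to the normalized tables only.
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")]
)
conn.executemany(
    "INSERT INTO purchase (customer_id, amount_cents) VALUES (?, ?)",
    [(1, 1250), (1, 750)],
)

# Reads use the view; no mapping code sits between the two models.
rows = conn.execute(
    "SELECT name, total_cents FROM customer_spend ORDER BY name"
).fetchall()
print(rows)  # [('alice', 2000), ('bob', 0)]
```

A production RDBMS offers more options along the same lines, such as materialized views or triggers, but the point is the same: the transformation between models is declared once, in the database.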
In the end, I’m guessing that after years of hearing “No, we can’t do that” from the DBAs-designated-as-data-engineers, the application team had come to equate the people with the technology. The implication is that the technology is the constraint instead of the people responsible for it.
So, what’s a way out? If your existing technology is the game, make sure you get the best players for every role and responsibility — don’t make your DBAs “play out of position” or else they’ll become hated representations of barriers to progress. If your organizational structure is the game, hate the game, change the game and advocate for skilled data engineers who can make your data more responsive to your business’s use cases. If you believe in “data as truth”, then invest in people who can make that data as useful as possible to as many use cases as you have.