Of Wine and Fish

In my last post touching on my case for Data Engineers, my friend Greg Rahn provided a humorous quote about data from Andy Todd:

“Data matures like wine, applications like fish”

Which, near as I can tell, came from an Open Source Developer’s Conference in Brisbane, 2009 at which Andy talked about, of all things, “Change in Database Schemas and Source Code”.

I’ve dropped Andy an email to see if his presentation is online anywhere, since it touches on a topic that is near and dear to my heart.

In this post, though, I’d like to address some of the humor behind the quote — the implication that data gets better as it ages, while applications get worse (and start to smell like stinky fish).

Having worked with old data and old applications, I’m not sure I agree with the sentiment. Imagine the following:

“Mr. Hawkins, I need you to do some analysis on this music file I have — I want to know how many times C immediately follows A in this song.”

“No problem, Mr. Silver, I’ll get right on it — where’s the data and how do I read it?”

“Here’s the data file, Mr. Hawkins,” Mr. Silver says, handing Jim a grooved red plastic disk with a small hole in the middle of it, the faded words “78 RPM” written on the attached paper label. “And here’s the application code that reads the data file.” Mr. Silver bends over and grunts to lift an oddly shaped box with a huge bell and crank attached to it.

“Good luck, Mr. Hawkins! Let me know when you’ve finished that analysis!”

In my little story, both the data and application code have become ancient and almost unusable, leading me to another quote, this time from Kurt Bollacker:

“Data that is loved tends to survive”

It’s that aspect of loving your data (nod to my friends at Pythian and the estimable @datachick Karen Lopez) that keeps me interested in the efforts to make data transformation and evolution agile and easier.

There’s a balance to be struck in keeping data fresh and usable — it doesn’t just get better with age, but rather needs to be continually assessed against use cases in order to keep it useful. Applications need the same attention too, lest they start to smell like last week’s catch.

The trick is to minimize the effort in keeping both fresh — some of you may recognize this as minimizing technical debt. Really good software and data engineers (and perhaps this should be the responsibility of the software and data architects) constantly assess data and code against current and future use cases. If they’re smart, they invest in changes which reduce current and probable future technical debt on an ongoing, even agile, basis. The extra challenge for the data engineer is to balance this need not only for individual applications and use cases, but to discern ways to leverage data as is while not placing too much burden on the applications.

Hate the player, not the game — or my case for Data Engineers

Has “database” become a dirty word within your organization lately? If you’re someone who has been a data technologist for the better part of your career, you may be wondering why the technologies you work with every day seem to be acquiring such a bad rap. From NoSQL to No DB, the current influx of brogrammers seems to take extreme pride in describing how they’re able to write code while avoiding any kind of database technology whatsoever.

The impetus for this post actually started with something I read on the ‘Net the other day about Command Query Responsibility Segregation (CQRS), and my initial excitement about the concept.

Martin Fowler has a nice, gentle introduction to the topic here.

Before I get into the post, however, I think it’s useful for me to describe some of my attitudes toward data management. What’s really odd is that while I rather strongly disagree with the tone of Uncle Bob Martin’s Rant, I actually strongly agree with his assertion about the high value of use-case driven development.

I’ve had gentle debates with several people about the meaning of “data as truth”, and the age-old question of whether data is more “important” than application code. Generally I’ve found that such debates end up as religious arguments instead of getting at the value of acting on data, or data in action. In the end it’s hard for data to have value unless it’s acted on by a set of processing directives (applications), and while it’s possible to have valuable applications that don’t require knowledge of the past (basic rule-engine codifications), in general they need each other.

Why I call myself a data engineer

I’ve been impressed with EMC’s attempt to define a Data Science curriculum. In particular, I like how they describe the different skills and roles necessary for a successful data science team, including the hot new title of data scientist. The data science team often includes a data architect, a data engineer, and a database administrator. So, what is a data engineer? In a blog by Steve Todd, Director of EMC’s Global Research and Innovation Portfolio, he has the following characterizations:

The “Database Administrator” provisions and configures the database environment to support the analytical needs of the working team. The “Data Engineer” tends to have deep technical skills to assist with tuning SQL queries for data management and extraction. They also support data ingest to the analytic sandbox. These people can be one in the same, but many times the data engineer is an expert on queries and data manipulation (and not necessarily analytics as such). The DBA may be good at this too, but many times they may simply be someone who is primarily skilled at setting up and deploying a large database schema, or product, or stack.

Many, many DBAs wear both hats, but I don’t think it’s a good idea — in general, I think that DBA is to data engineer as system administrator is to software engineer, but the lack of data engineers has forced DBAs into dual roles for which they are often not well-suited. While I have basic DBA skills, I’m much better at the skills listed under the data engineer, and I enjoy working with the data scientists or application developers who have questions about the data and/or how they’d like it structured to support their use cases.

This is one of the reasons why I agree with Uncle Bob’s rant, in which he also rails against frameworks in addition to the database. I just wish frameworks had received equal billing in the rant and its title, but I’m guessing the No DB vitriol resonated more strongly with readers. In general, I like making sure data is organized in such a way as to support as many use cases as possible. That includes being performant for each use case, which may mean taking advantage of techniques to denormalize, duplicate and synchronize, or cache and distribute data.

I suppose I could write a similar rant on No Data Frameworks, but then I’d probably step into the ORM battle, which really isn’t the focus of this post. But just to follow on to Uncle Bob’s rant — the reason I dislike many ORM Data Frameworks is that they tightly bind the application to a particular physical implementation of a data layout, which then limits and constrains my ability to adapt the data layout for new use cases, and leads to “persistence avoidance” in application code.

True story — on a recent Agile project, I was providing guidance on the data layer when I noticed that a bit of information for a new use case wasn’t being captured. I suggested to the team that it would be easy to extend the data layer in order to retain the additional information and I was met with groans: “But that means touching the persistence framework — that’s a big change!” — I was flabbergasted. Isn’t the data layer usually blamed for being inflexible? Are you telling me that it’s actually the framework causing the inflexibility?

Again I point back to Uncle Bob on Clean and Screaming Architecture.

If you’re still reading this, I’m sure you’re wondering how this ties in to CQRS and the original blog title.

When I first read about CQRS in Martin Fowler’s post, I became really interested — the idea that you would use different models for commands (“change” the data) and queries (“read” the data) made me think that frameworks that directly map models into applications could be retired in favor of messages related to use cases instead of model objects. To me, this means a data service API or set of virtual data layers which provide interfaces to data for applications, regardless of how the data is physically stored or organized. Huzzah! This would free me as a data engineer to ensure that I organized the data in ways which efficiently supported use cases. Since I tend to work in full-featured RDBMS systems, that meant I could wrap data using a data service API using whatever works, including things like stored procedures or RESTful web APIs using something like Oracle’s APEX listener.
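To make that concrete, here’s a minimal sketch of what such a data service API might look like inside an Oracle database. All table, procedure, and view names here are hypothetical, invented for illustration: a use-case-shaped stored procedure on the command side, and a read-optimized view on the query side, so the application never binds to the physical layout directly.

```sql
-- Command side: applications call a use-case-shaped procedure
-- rather than issuing INSERTs against physical tables.
CREATE OR REPLACE PROCEDURE place_order (
  p_customer_id IN NUMBER,
  p_product_id  IN NUMBER,
  p_quantity    IN NUMBER
) AS
BEGIN
  INSERT INTO orders (order_id, customer_id, product_id, quantity, placed_at)
  VALUES (orders_seq.NEXTVAL, p_customer_id, p_product_id, p_quantity, SYSTIMESTAMP);
END place_order;
/

-- Query side: a denormalized, read-optimized view shaped for a
-- reporting use case, free to evolve independently of the command side.
CREATE OR REPLACE VIEW customer_order_summary AS
SELECT c.customer_name,
       COUNT(*)        AS order_count,
       SUM(o.quantity) AS total_quantity
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
GROUP  BY c.customer_name;
```

The point isn’t this particular code — it’s that each side of the interface can be reorganized underneath (denormalized, cached, distributed) without touching the applications that call it.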

So imagine my dismay when, reading about CQRS, I came upon a whole series of blog posts about implementing CQRS expressly to “get rid of the database”. I intently read through the entire series trying to figure out what was wrong with the database that necessitated “getting rid of it” to implement CQRS. All to no avail. I’ve left a comment asking for that information, because I’m genuinely curious about it, but I have a guess.

It’s not about technology — it’s about the organization and its associated personalities that foster such an attitude.

Really now. In an organization with responsive data engineers there shouldn’t be a need to “get rid of the database”. One of the best reasons to have a database is that it provides so many ways to build the different kinds of models and transform the data between them with minimal need for additional frameworks or mountains of custom code.

In the end, I’m guessing that after years of hearing “No, we can’t do that” from DBAs designated as data engineers, the application team had come to equate the people with the technology. The implication is that the technology is the constraint, instead of the people responsible for it.

So, what’s a way out? If your existing technology is the game, make sure you get the best players for every role and responsibility — don’t make your DBAs “play out of position” or else they’ll become hated representations of barriers to progress. If your organizational structure is the game, hate the game, change the game, and advocate for skilled data engineers who can make your data more responsive to your business’s use cases. If you believe in “data as truth”, then invest in people who can make that data as useful as possible to as many use cases as you have.

Death of the Enterprise Architect?

Last week I read an interesting article about how cloud computing is changing the role of the enterprise architect and it got me thinking about the bad rap many architects are getting in the brave new agile, cloud, big data world.

From what I’ve been reading, there’s been a bit of a straw man argument going on — enterprise architects are often described as uber-control freaks who attempt to dictate software architectures in a repressive way to implementation teams. Mention the term “reference architecture” and you’ll often raise the hackles of the new developer-led world.

To be sure, there are many enterprise architects who match that description. And they’re the ones who give architects a bad name, just like undisciplined developers can give agile a bad rap too.

I tend to agree with the article in many respects. The idea that a software architecture can be fully contained and prescribed up front, in a way that potentially limits, or increases the cost of, software that adds value to the business, really isn’t a good one. To me, it almost always comes back to the value question: does the architect add net positive business value?

One of the best roles I’ve seen for enterprise architects is in attention to technical debt — measuring the growth of it, assisting agile teams in ways to address emerging technical debt, and generally ensuring that the business isn’t accumulating potentially crippling amounts of it. To me, the companies which have these kinds of architects never have to “stop all development” in order to do vendor-software upgrades, re-platforming, or huge re-write / re-factoring efforts.

Again, I don’t think this is about the enterprise architect defining an architecture and enforcing it — it’s a lot more about being an architectural scout: staying out in front of the agile development teams to bring back ideas and help them choose concepts which keep the system flexible in the face of change. Not in defining the ultimate generic architecture — we’ve been down that path before with SOA, “message-bus”, integration servers and broker architectures.

I hope you’re lucky enough to work with such an enterprise architect.

In the abstract

During conference season, it can be a challenge to come up with abstracts that you can feel passionate about, while making sure to craft them in a way that is both attractive to selection committees and the audience you feel like you want to reach. I often find that tri-purpose (satisfying myself, a committee, and the potential audience) to be daunting and occasionally conflicting — leading to abstract paralysis.

Starting today, I’m going to work harder at it. If you’ve been to any of my presentations in the recent past, you know that I like to spend more time on what I think are technical “culture” issues rather than examples of how to implement or interpret the technical features of the latest software release. It’s an area that I’m passionate about, and it’s one that I feel is drastically underrepresented and underserved at most technical conferences.

The biggest challenge I have with those kinds of presentations is making them selectable and attractive — for the topics mostly concern our ability to collaborate and communicate effectively in support of our business and mission objectives. And in that case, we all feel (myself included) like we’re from Lake Wobegon.

To me, nowhere is this more apparent than in the discussions about the Agile movement in software development, testing and production operations. Fellow Oak Table member Martin Widlake has some excellent examples of these issues in his two recent blog posts on the subject:

“Friday Philosophy – Why Doesn’t Agile Work?” and “In Defense of Agile Development (and their Ilk)”

(I especially like “Ilk”)

In a small, forgotten corner of the Internet, I belong to a Yahoo! Group (yes, they still exist!) on Agile Databases, which has as its description:

Discussion about Database management with regards to Extreme Programming practices and principles.

You can visit the group here.

In a recent discussion, there was a post from Scott Ambler that I found myself violently agreeing with:

A question was asked about coordinating and scheduling changes made by database and ETL teams with the development teams in order to reduce confusion and churn during development.

Question / Comment: While one or more code iterations are taking place in parallel, the data design and ETL are working on their iteration of the db schema and data, which will be consumed by later code iterations.

Scott’s Comment / Answer: Better yet, this could occur in a “whole team” manner where data-experienced people are embedded in the actual team.  This can improve productivity by reducing overall overhead.  Unfortunately this can be difficult in many companies due to the organizational complexities resulting from the cultural impedance mismatch between data and development professionals.

(Emphasis mine)

I feel like I’ve had the privilege of working in places where those organizational complexities and cultural impedance mismatches were overcome, and I’d love to talk about what I think made that happen.

Now just to write some compelling abstracts on the subject — ideas welcome!

Can you see what I see?

If you’ve got SQL access to your database servers, I want you to tell me the results of the following query (if you’re allowed to!):

SELECT value FROM v$parameter WHERE name = 'audit_trail';

Go ahead, I’ll wait.

If it’s anything other than DB; DB, EXTENDED; XML; or XML, EXTENDED, you’re doing yourself and your organization a disservice.
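If yours comes back NONE (or OS), the fix is a one-line parameter change. A sketch, assuming you have ALTER SYSTEM privileges and can schedule a restart — audit_trail is a static parameter, so it only takes effect after the instance bounces:

```sql
-- Write the new value to the spfile; it takes effect at the next restart.
ALTER SYSTEM SET audit_trail = DB, EXTENDED SCOPE = SPFILE;

-- ...restart the instance, then verify:
SELECT value FROM v$parameter WHERE name = 'audit_trail';
```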

Lately I’ve been amazed at the number of customers I’ve been at who are flabbergasted by “random” changes to their production databases. They’ll say things like “someone logged in and added an index” or “someone changed a stored procedure”. When I ask who did these things no one can say. Reactions to the resulting production issues usually range from witch-hunts to draconian password lockup procedures.

After the fire cools, the first question I ask is “Have you turned on database auditing?”

Usually I get an answer from the DBAs saying that auditing isn’t their role; it’s the job of Security. When I ask Security about it, they say those kinds of issues are an application problem as long as no data was compromised.

(I love separation of duties. In this case, though, I think there’s a way to combine the peanut butter and the chocolate: have the DBAs and Security leads combine their knowledge and roles to add additional value to the organization.)

In every database I can, I ask the DBAs to turn on auditing and run the following commands:

AUDIT TABLE;
AUDIT CLUSTER;
AUDIT CONTEXT;
AUDIT DIMENSION;
AUDIT DATABASE LINK;
AUDIT DIRECTORY;
AUDIT INDEX;
AUDIT MATERIALIZED VIEW;
AUDIT OUTLINE;
AUDIT PROCEDURE;
AUDIT PROFILE;
AUDIT PUBLIC DATABASE LINK;
AUDIT PUBLIC SYNONYM;
AUDIT ROLE;
AUDIT ROLLBACK SEGMENT;
AUDIT SEQUENCE;
AUDIT SESSION;
AUDIT SYNONYM;
AUDIT SYSTEM AUDIT;
AUDIT SYSTEM GRANT;
AUDIT TABLESPACE;
AUDIT TRIGGER;
AUDIT TYPE;
AUDIT USER;
AUDIT VIEW;
AUDIT ALTER, GRANT ON DEFAULT;

This way, just about every DDL command run against the database is logged to the audit trail. If it’s set to DB or DB, EXTENDED each command is written to a table in the database. If it’s set to XML or XML, EXTENDED commands are written to XML files in the audit_file_dest directory and also viewable via the DBA_COMMON_AUDIT_TRAIL view.
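For example, here’s a sketch (assuming audit_trail is set to DB or DB, EXTENDED) of how “who changed what, when, and from where” becomes a simple query against the database audit trail:

```sql
-- Recent DDL activity from the standard audit trail: who issued the
-- statement, from which OS account and host, against which object.
SELECT username,
       os_username,
       userhost,
       action_name,
       owner,
       obj_name,
       timestamp
FROM   dba_audit_trail
WHERE  timestamp > SYSDATE - 7
ORDER  BY timestamp;
```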

I LOVE having this on in development: it allows me to track database changes that may need to be promoted to QA. In QA, it lets me verify that what I sent from development actually got installed. And in production, it gives me an accurate record of which changes were introduced into the production database, by whom, when, and from where.

I often get some resistance, with people saying that this will negatively affect performance if every DDL command causes this kind of write activity.

I usually laugh and say I sure as heck hope so! I want your program that creates “work tables” every second to feel some pain. It’s all part of my plan to Make Bad Practices Painful: I’ve decided to spend less time arguing about Best Practices and more time on hindering the use of Bad Practices.

Seriously, though, this is a great built-in feature, and much easier than writing DDL triggers.
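For comparison, here’s roughly what the hand-rolled DDL-trigger alternative looks like (the table and trigger names here are hypothetical). You end up building and maintaining a poor man’s audit trail yourself, which the built-in feature gives you for free:

```sql
-- A custom log table and a database-level DDL trigger that records
-- schema changes using Oracle's system event attribute functions.
CREATE TABLE ddl_log (
  event_time  TIMESTAMP DEFAULT SYSTIMESTAMP,
  db_user     VARCHAR2(128),
  event_type  VARCHAR2(30),
  object_type VARCHAR2(30),
  object_name VARCHAR2(128)
);

CREATE OR REPLACE TRIGGER log_ddl_changes
AFTER DDL ON DATABASE
BEGIN
  INSERT INTO ddl_log (db_user, event_type, object_type, object_name)
  VALUES (ora_login_user, ora_sysevent, ora_dict_obj_type, ora_dict_obj_name);
END;
/
```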

BTW, Oracle recommends setting this parameter to OS; XML; or XML, EXTENDED.

From the 11.2 Security Guide (http://download.oracle.com/docs/cd/E11882_01/network.112/e16543/auditing.htm#BCGBCFAD)

Advantages of the Operating System Audit Trail

Using the operating system audit trail offers these advantages: