
Friday, November 19, 2010

Prepare Now For Possible Future Head Transplant

This is the Database Programmer blog, for anybody who wants practical advice on database use.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Planning For The Unlikely

We programmers love to plan for things that will hardly ever happen, like coding the system's upgrade engine to handle spontaneous human combustion, making sure the SQL scrubbing layer can also launch a rocket into space, and, well, trying to work out ahead of time what to do if we ever need a head transplant.

The boss comes over and says, "Can you toss a simple plot onto the sales staff's home page that shows sales by day? Use that 'jquicky' or whatever you call it. Should take a couple of hours, right?" And three days later we're working on the world's greatest plotting system that can report everything except what the boss actually asked for, because we haven't gotten around to that part of it yet. (Really, can he expect me to just bang this out without the required Seven Holy Layers of Abstraction and Five Ritual Forms of Parameterization, and the Just and Wholesome Mobile Support, or the features Not Yet Required but visible to the Far Seeing Wise and Learned Men?)

Abstraction Contraptions

So what I am getting at is that programmers of all stripes are addicted to abstraction, it gives us goosebumps and makes us feel warm and tingly, and so we do it even when we do not need to. We build abstraction contraptions.

When it comes to designing a database, this unhealthy proclivity can seriously slow you down, because of what I call:

Ken's Law

Everybody wants to be remembered for something. If I could write my own epitaph, it might be:

Table-based datastores are optimally abstract

This law is not about database access, it is about database design. It can be expressed informally as:

People Understand Tables Just Fine

Or more rigorously as:

Table-based datastores are optimally abstract; they require no additional abstraction when requirements are converted to design; they cannot be reduced to a less abstract form.

Structured Atomic Values

I should point out that this essay deals with structured atomic values, which live in the Kingdom of The Relational Database. The concepts discussed here do not apply to free-text documents, images, sound files, or any other media.

No Additional Abstraction Required

My basic claim here is that you cannot create an abstraction of data schemas that will pay for itself. At best you will create a description of a database where everything has been given a different name, where tables have been designated 'jingdabs' and columns have been designated 'floopies' and in the end all of your jingdab floopies will become columns in tables. Oh, and I suppose I should mention the Kwengars will be foreign keys and the Slerzies will be primary keys.

After that it goes downhill, because if we generate an abstraction that is not a simple one-to-one mapping, we actually obscure the end goal. Consider an example so simple as to border on the trivial or absurd. Why would we ever use the terms "One-to-Many" or "Many-to-Many" when the more precise terms "child table" and "cross-reference table" convey the same idea without the noise? I said above that this would sound trivial, and you can accuse me of nit-picking, but this is one of those camel's nose things, or perhaps a slippery slope. When technical folk get together to design a system, we should call things what they are, and not make up new words to make ourselves sound smarter.
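To make the vocabulary concrete, here is a sketch of both structures. The table and column names are invented for illustration:

```sql
-- A "child table": each timeslip belongs to at most one invoice
-- (the abstract vocabulary would call invoices-to-timeslips
-- "one-to-many").
CREATE TABLE invoices (
    invoice   int PRIMARY KEY
);

CREATE TABLE timeslips (
    timeslip  int PRIMARY KEY,
    invoice   int REFERENCES invoices (invoice)
);

-- A "cross-reference table": an item may appear on many price lists,
-- and a price list holds many items ("many-to-many").
CREATE TABLE items (
    item      varchar(20) PRIMARY KEY
);

CREATE TABLE pricelists (
    pricelist varchar(20) PRIMARY KEY
);

CREATE TABLE items_x_pricelists (
    item      varchar(20) REFERENCES items (item),
    pricelist varchar(20) REFERENCES pricelists (pricelist),
    price     numeric(10,2),
    PRIMARY KEY (item, pricelist)
);
```

The names say exactly what the tables are; no translation layer required.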

No de-Abstraction is Possible

The second half of Ken's law says that you cannot de-Abstract a table schema into some simpler form. This should be very easy to understand, because relational databases deal with atomic values, that is, values which cannot themselves be decomposed. If you cannot decompose something, then it cannot be an abstraction of something more specific.

Going further, if the schema has been normalized, then every fact is stored in exactly one place, so no further simplification is possible. If you cannot simplify it or resolve it into something more specific, then it is not an abstraction of something else.

But Does it Work?

I originally began to suspect the existence of what I call, in all humility, "Ken's Law" when I was sitting at a large conference table with my boss, her boss, a couple of peers, and 3 or 4 reps from a Fortune 5 company. My job was basically to be C3PO: human-cyborg relations. Some people at the table protested loudly that they were non-technical, while others had technical roles. But everybody at the table spent all day discussing table design.

Later on, when at a different position, the programmers received their instructions from Project Managers. The best Project Managers worked with their customers to figure out what they were trying to keep track of, and handed us specs that were basically table layouts. The customers loved these guys because they felt they could "understand what the project manager was talking about", and the project managers, who swore they were not technical, were respected because they handed us requirements we could actually understand and implement.

Since that time I have further learned that it is very easy for anybody who deals with non-technical people to bring them directly to table design without telling them you are doing it. All you have to do is listen to what words they use and adapt your conversation accordingly. If they say things like "I need a screen that shows me orders by customer types," they have told you there will be a table of customer types; talk to them in terms of screens. If they say, "Our catalog has three different price lists and four discount schemes," then you know that there will be a PRICELIST table, a DISCOUNTS table, and likely some cross-references and parent-child relationships going on.
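For instance, the first quote translates into tables almost verbatim. The names below are invented for illustration:

```sql
-- "I need a screen that shows me orders by customer types"
-- gives away three tables before the user has finished the sentence.
CREATE TABLE custtypes (
    custtype  varchar(20) PRIMARY KEY
);

CREATE TABLE customers (
    customer  varchar(20) PRIMARY KEY,
    custtype  varchar(20) NOT NULL REFERENCES custtypes (custtype)
);

CREATE TABLE orders (
    order_id  int PRIMARY KEY,
    customer  varchar(20) NOT NULL REFERENCES customers (customer)
);
```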

So How Does ORM Come Into This?

One of the greatest abstraction contraptions of this century (so far), is ORM, or Object-Relational Mapping, which I do not use precisely because it is an abstraction contraption.

To be clear, the mistake that ORM makes is not at the design phase, but at the access phase. The ORM meme complex instructs its victims that it is ok to put structured atomic values into a Relational Database, but when it comes time to access and use that data we will pretend we did not put it into tables and we will pretend that the data is in objects. In this sense the term Object-Relational Mapping is a complete misnomer, because the point is not to map data to objects but to create the illusion that the tables do not even exist.

Perhaps ORM should stand for Obscuring Reality Machine.

Getting Back to That Head Transplant

So what about that weird title involving head transplants? Obviously a head transplant is impossible, which also makes it very unlikely, besides being silly and nonsensical. It came to mind as a kind of aggregate of all of the bizarre and unrealistic ideas about abstracting data designs that I have heard over the years.

One of these ideas is that it is possible and beneficial to create a design that is abstract so that it can be implemented in any model: relational, hierarchical, or network. I'm not saying such a thing is impossible, it is likely just a small matter of programming, but for heaven's sake, what's the point?

Conclusion

So don't waste time creating abstractions that add steps, possibly obscure the goal, and add no value. Don't plan for things that are not likely to happen, and avoid abstraction contraptions.


Sunday, June 15, 2008

Why I Do Not Use ORM

An impedance mismatch occurs when two devices are connected so that neither is operating at peak efficiency. This lack of efficiency is not due to any intrinsic incompatibilities between the devices, it only exists once they are connected together the wrong way. Object Relational Mapping (ORM) does not cure a pre-existing impedance mismatch, it creates one, because it connects databases to applications in a way that hobbles both.

UPDATE: There is a newer Historical Review of ORM now available.

UPDATE: In response to comments below and on reddit.com, I have a new post that gives a detailed analysis of an algorithm implemented as a sproc, in app code with embedded SQL, and in ORM.

Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications.


Good Marketing, Terrible Analogy

Normally I like to reserve this space for a positive presentation of things that I have found to work; I don't like to waste time ranting against things I don't like. However, ORM is so pervasive in some circles that it is important to establish why you will not see it used here, if only to head off unnecessary chatter about its absence.

The use of the term "impedance mismatch" is great marketing, worthy of a certain software company in the Northwest of the US, but the analogy is utterly wrong. It is used to imply an intrinsic incompatibility, but the real irony is that there is no such incompatibility; if we insist on the analogy, we are forced to say that ORM itself is the impedance mismatch, because it creates the inefficient connection.

It always comes back to the fact that modern databases were designed to provide highly reliable permanent storage, and they possess a slew of features that promote that end. Programming languages on the other hand are meant to process data in a stepwise fashion. When the two of them meet it is very important to establish a strategy that uses the strengths of both, instead of trying to morph one into the other, which never yields efficient results.

The Very Basics

The language SQL is the most widely supported, implemented, and used way to connect to databases. But since most of us have long lists of complaints about the language, we end up writing abstraction layers that make it easier for us to avoid coding SQL directly. For many of us, the following diagram is a fair (if simplified) representation of our systems:

This diagram is accurate for ORM systems, and also for non-ORM systems. What they all have in common is that they seek to avoid manually coding SQL in favor of generating SQL. All systems seek to give the programmer a set of classes or functions that makes it easy and natural to work with data without coding SQL manually.

This brings us to a very simple conclusion: the largest part of working out an efficient database strategy is working out a good SQL generation strategy. ORM is one such strategy, but it is far from the simplest or the best. To find the simplest and the best, we have to start looking at examples.

First Example: Single Row Operations

Consider the case of a generic CRUD interface to a database having a few dozen tables. These screens will support a few single-row operations, such as fetching a row, saving a new row, saving updates, or deleting a row.

We will assume a web form that has inputs with names that begin with "inp_", like "inp_name", "inp_add1", "inp_city" and so forth. The user has hit [NEW] on their AJAX form, filled in the values, and hit [SAVE]. What is the simplest possible way to handle this on the server? If we strip away all pre-conceived ideas about the "proper" thing to do, what is left? There are only these steps:

  1. Assemble the relevant POST variables into some kind of data structure
  2. Perform sanity checks and type-validations
  3. Generate and execute an insert statement
  4. Report success or failure to the user

The simplest possible code to do this looks something like this (the example is in PHP):

# This is a great routine to have.  If you don't have
# one that does this, write it today!  It should return
# an associative array equivalent to:
#    $row = array( 
#       'name'=>'....'
#      ,'add1'=>'....'
#    )
# This routine does NOT sanitize or escape special chars
$row = getPostStartingWith("inp_");

# get the table name.  
$table_id = myGetPostVarFunction('table_id');

# Call the insert generation program.  It should have
# a simple loop that sanitizes, does basic type-checking,
# and generates the INSERT.  After it executes the insert
# it must cache database errors for reporting to the user.
#
if (!SQLX_Insert($table_id,$row)) {
    myFrameworkErrorReporting();
}

Without all of my comments the code is 5 lines! The Insert generation program is trivial to write if you are Using a Data Dictionary, and it is even more trivial if you are using Server-side security and Triggers.
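For the curious, the core of such a generator can be sketched in a few lines. Everything here is hypothetical (the function name, the dictionary format), and a real routine would use the database driver's own escaping rather than the naive quoting shown; the point is only that a simple loop over a dictionary suffices:

```php
<?php
// A minimal sketch of what an insert generator like SQLX_Insert()
// might do internally.  The dictionary maps each column to a type;
// the loop type-checks, escapes, and builds the INSERT statement.
function sqlx_insert_sql($table, $row, $dictionary) {
    $cols = array();
    $vals = array();
    foreach ($dictionary[$table] as $column => $type) {
        if (!array_key_exists($column, $row)) continue;  // skip absent inputs
        $value = $row[$column];
        if ($type == 'int') {
            if (!is_numeric($value)) return false;       // basic type check
            $vals[] = intval($value);
        } else {
            // double up single quotes; a real routine would use the
            // driver's escaper instead
            $vals[] = "'" . str_replace("'", "''", $value) . "'";
        }
        $cols[] = $column;
    }
    return "INSERT INTO $table (" . implode(',', $cols) . ") "
         . "VALUES (" . implode(',', $vals) . ")";
}

// Hypothetical dictionary entry for a customers table
$dictionary = array(
    'customers' => array('customer' => 'char', 'credit_limit' => 'int')
);
$sql = sqlx_insert_sql('customers',
    array('customer' => "O'Leary", 'credit_limit' => '5000'),
    $dictionary);
echo $sql;
// INSERT INTO customers (customer,credit_limit) VALUES ('O''Leary',5000)
```

Updates and deletes are the same loop with a different final string.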

This is the simplest possible way to achieve the insert, and updates and deletes are just as easy. Given how simple this is (and how well it performs), any more complicated method must justify itself considerably in order to be considered.

ORM cannot be justified in this case because it is slower (objects are slower than procedural code), more complicated (anything more than 5 lines loses), and therefore more error-prone, and, worst of all, it accomplishes nothing that the five lines have not already accomplished.

Objection! What About Business Logic?

The example above does not appear to allow for implementing business logic, but in fact it does. The SQLX_Insert() routine can call out to functions (fast) or objects (much slower) that massage data before and after the insert operation. I will be demonstrating some of these techniques in future essays, but of course the best performing and safest method is to use triggers.

Example 2: Processes, Or, There Will Be SQL

Many programmers use the term "process" to describe a series of data operations that are performed together, usually on many rows in multiple tables. While processes are not common on a typical website, they are plenty common in line-of-business applications such as accounting, ERP, medical programs, and many many others.

Consider a time entry system, where the employees in a programming shop record their time, and once per week the bookkeeper generates invoices out of the time slips. When this is performed in SQL, we might first insert an entry into a table of BATCHES, obtain the batch number, and then enter a few SQL statements like this:

-- Step 1, mark the timeslips we will be working with
UPDATE timeslips SET batch = $batch
 WHERE batch IS NULL;
 
-- Step 2, generate invoices from unprocessed timeslips
INSERT INTO Invoices (customer,batch,billing,date)
SELECT CUSTOMER,$batch,SUM(billing) as billing,NOW()
  FROM timeslips
 WHERE batch = $batch
 GROUP BY customer;
 
-- Step 3, mark the timeslips with their invoices
UPDATE timeslips 
   SET invoice = invoices.invoice
  FROM invoices 
 WHERE timeslips.customer = invoices.customer
   AND timeslips.batch    = $batch;

While this example vastly simplifies the process, it ought to get across the basic idea of how to code things in SQL that end up being simple and straightforward.

Counter Example: The Disaster Scenario

The biggest enemy of any software project is success. Code that works wonderfully on the developer's laptop is suddenly thrown into a situation with datasets that are hundreds of times larger than the test data. That is when performance really matters. Processes that took 3 minutes on the laptop suddenly take 10 hours, and the customer is screaming. How do these things happen?

Mostly they happen because programmers ignore the realities of how databases work and try to deal with them in terms they understand, such as objects or even simple loops. Most often what happens is that the programmer writes code that ends up doing something like this:

foreach( $list_outer as $item_outer) {
    foreach( $list_inner as $item_inner) {
        ...some database operation
    }
}

The above example will perform terribly because it executes round trips to the database server instead of working with sets. While nobody (hopefully) would knowingly write such code, ORM encourages you to do this all over the place, by hiding logic in objects that themselves instantiate other objects. Any code that encourages you to go row-by-row, fetching each row as you need it and saving rows one-by-one, is going to perform terribly in a process. If the act of saving a row causes the object to load more objects to obtain subsidiary logic, the situation rapidly deteriorates into exactly the code snippet above - or worse!
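The cure is the same set-based thinking shown in the timeslip example: push the loop into the database. A sketch, with table and column names invented for illustration:

```sql
-- One statement, one round trip, instead of a round trip per
-- (outer, inner) pair.  This prices every order line from a price
-- table in a single set operation.
UPDATE order_lines
   SET price = prices.price
  FROM prices
 WHERE order_lines.item = prices.item;
```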

On a personal note, I have to confess that I am continually amazed and flabbergasted when I see blog posts or hear conversations in user groups about popular CMS systems and web frameworks that will make dozens of database calls to refresh a single page. A seasoned database programmer simply cannot write such a system, because they have habits and practices that instinctively guard against such disasters. The only possible explanation for these systems is the overall newness of the web and the extreme ignorance of database basics on the part of the CMS and framework authors. One can only hope the situation improves.

Sidebar: Object IDs are Still Good

There are some people who, like myself, examine how ORM systems work and say, "no way, not in my code." Sometimes they also go so far as to refuse to use a unique numeric key on a table, which some people call an "Object ID," or OID for short.

But these columns are very useful for single-row operations, which tend to dominate in CRUD screens (but not in processes). It is a bad idea to use them as primary keys (see A Sane Approach To Choosing Primary Keys), but they work wonderfully in any and all single-row operations. They make it much easier to code updates and deletes.
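For example, assuming a hypothetical numeric column named "skey," a CRUD screen can do its single-row work without knowing anything about the real primary key:

```sql
-- Single-row operations keyed on the OID-style column.  The screen
-- only has to remember one number, no matter what the real
-- (possibly multi-column) primary key looks like.
UPDATE customers SET credit_limit = 5000 WHERE skey = 4711;
DELETE FROM customers WHERE skey = 4711;
```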

Conclusions

The recurring theme of these essays is that you can write clean and efficient code if you know how databases work on their own terms. Huge amounts of application code can be swept away when you understand primary keys and foreign keys and begin to normalize your tables. The next step from there is knowing how to code queries, but sooner or later you have to grapple with the overall architecture. (Well supposedly you would do that first, but many of us seem to learn about architectural concerns only after we have coded long enough to recognize them).

A thorough knowledge of database behavior tends to lead a person away from ORM. First off, the two basic premises of ORM are factually incorrect: One, that there is some native incompatibility between databases and code, and two, that all the world must be handled in objects. These two misconceptions themselves might be excusable if they turned out to be harmless, but they are far from harmless. They promote a willful ignorance of actual wise database use, and in so doing are bound to generate methods that are inefficient at best and horrible at worst.

Overall, there are always simpler and better performing ways to do anything that ORM sets out to achieve.

Next Essay: Performance on Huge Inserts

Addendum June 19, 2008

After reading the comments on the blog over the last few days I have decided to put in this addendum rather than attempt to answer each comment independently. I have attempted to answer the objections in descending order of relevance.

The Example is Trivial or "Cheats"

This is a very compelling challenge to the article offered by bewhite and Michael Schuerig and it deserves a meaningful response. What I want to do is flesh out my approach and why I find it better than using ORM. While I do not expect this to lead to agreement, I hope that it answers their challenges.

  • My sphere of activity is business applications, where two dozen tables is trivial and the norm is for dozens or hundreds of tables.
  • When table counts rise beyond the trivial, many concerns come into play that do not appear at lower table counts.
  • I have found that a single unified description of the database works best for these situations, provided it can specify at the very least schema, automations, constraints, and security. This is what I refer to as the data dictionary.
  • The first use of the data dictionary is to run a "builder" program that builds the database. This builder updates schemas, creates keys and indexes, and generates trigger code. The same builder is used for clean installs and upgrades.
  • The generated trigger code answers directly the challenges as to how non-trivial inserts are handled. Downstream effects are handled by the triggers, which were themselves generated out of the dictionary, and which implement security, automations, and constraints. No manual coding of SQL routines, thank you very much.
  • All framework programs such as SQLX_Insert() read the dictionary and craft the trivial insert. The code does what you would expect, which is check for type validity and truncate overlong values (or throw errors). But it does not need to know anything more than is required to generate an INSERT; all downstream activity occurs on the server.
  • The dictionary is further used to generate CRUD screens, using the definitions to do such things as gray out read-only fields, generate lookup widgets for foreign keys, and so forth. This generated code does not enforce these rules, the db server does that, it simply provides a convenient interface to the data.
  • A big consequence here is that there is no need for one-class-per-table, as most tables can be accessed by these 'free' CRUD screens.
  • That leaves special-purpose programs where 'free' CRUD screens don't reflect the work flow of the users. In a business app these usually come down to order entry, special inquiry screens and the lot. These can be programmed as purely UI elements that call the same simple SQLX_Insert() routines that the framework does, because the logic is handled on the server.
  • This approach is not so much about code reuse as code elimination. In particular, the philosophical goal is to put developer assets into data instead of code.
  • When this approach is taken to its full realization, you simply end up not needing ORM, it is an unnecessary layer of abstraction that contributes nothing to quality at any stage.
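To make the bullets above concrete, here is one possible shape for a dictionary entry. The format is invented for this sketch (Andromeda's actual format may differ); the point is only that schema, automations, constraints, and security all live in one description that a builder program can read:

```php
<?php
// Hypothetical data dictionary entry for an invoices table.  A builder
// program would read this to create the table, generate trigger code
// for the automations and constraints, and enforce the security rules.
$dictionary = array();
$dictionary['invoices'] = array(
    'columns' => array(
        'invoice'  => array('type' => 'int',  'primary_key' => true),
        'customer' => array('type' => 'char', 'foreign_key' => 'customers'),
        'billing'  => array('type' => 'numb'),
        'date'     => array('type' => 'date'),
    ),
    'automations' => array('date' => 'set to NOW() on insert'),
    'constraints' => array('billing must be >= 0'),
    'security'    => array('write' => 'bookkeepers'),
);
```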

These ideas are implemented in my Andromeda framework. It is not the purpose of this blog to promote that framework, but it has been successfully used to produce the types of applications I describe on this blog. I make mention of it here for completeness.

So to conclude, both of these gentlemen are correct that the example says nothing about how the crucial SQLX_Insert() routine is coded, and I hope at least that this addendum fleshes this out and makes clear where it is different from ORM.

The Model Should Be Based On Classes

bewhite asks "Do you propose us to organize our applications in terms of tables and records instead of objects and classes?"

Yes. Well, maybe not you, but that's how I do it. I do not expect to reach agreement on this point, but here at least is why I do it this way:

  • My sphere of activity is business applications, things like accounting, ERP, medical management, job control, inventory, magazine distribution and so forth.
  • I have been doing business application programming for 15 years, but every program I have ever written (with a single recent exception) has replaced an existing application.
  • On every job I have been paid to migrate data, but the old program goes in the trash. Every program I have written will someday die, and every program written by every reader of this blog will someday die, but the data will be migrated again and again. (...and you may even be paid to re-deploy your own app on a new platform).
  • The data is so much more important than the code that it only makes sense to me to cast requirements in terms of data.
  • Once the data model is established, it is the job of the application and interface to give users convenient, accurate and safe access to their data.
  • While none of this precludes ORM per se, the dictionary-based approach described above allows me to write both procedural and OOP code and stay focused on what the customer is paying for: convenient, accurate and safe access.
  • The danger in casting needs in any other terms is that it places an architectural element above the highest customer need, which is suspect at best and just plain bad customer service at worst. We all love to write abstractions, but I much prefer the one that gets the job done correctly in the least time, rather than the one that, to me, appears to most in fashion.

Old Fashioned Technologies

More than one comment said simply that triggers and other server-side technologies "went out". Since I was there and watched it happen I would contend that when the web exploded a new generation came along with different needs. In particular the need for content and document management caused people to question all of the conventional uses of the SQL databases, and like all programmers they are convinced their world is the only world and all of the world, ergo, triggers are history because I don't use them. Nevertheless, those of us who continue to write business applications continue to use the technologies that worked well then and only work better now.

Ken Does Not Like OOP

I love OOP, especially for user interfaces. I just don't think it should own the domain model, and I don't think that "trapping" business logic inside of classes gives nearly the same independence as a data dictionary does. I've tried it both ways and I'll stick with the dictionary.

Any Use of OO Code Implies ORM

A few comments said outright that if you are using OOP code then you are by definition mapping. Technically this is untrue if you understand the use of the term "map" as opposed to "interface". Mapping is the process of creating a one-to-one correspondence between items in one group (the code) to items in the other (the database). A non-ORM interface is one in which any code, procedural or OOP, passes SQL and handles data without requiring a one-to-one mapping of tables or rows to classes or functions. My apps are not ORM because I have no such requirement that there be a class for every table, and no such requirement that there be any specific code to handle a table.

Don't Talk about Procedural Being Faster

At least three comments blasted this contention. To put things in context, performance in a database application goes in two stages. First and absolutely most critical is to be extremely smart about reducing database reads, since they are thousands of times slower than in-memory operations. However, once that is done, there is no reason to ignore speed improvements that can be gained by optimizing the code itself. The commenters are correct that this gain is of a very low order, but I would stand by the statement after making this contextual addendum.

Thank You All

This was the most widely read piece in this series, definitely the most controversial. There will not likely be any other articles this controversial, as the main purpose of this essay was to provide regular readers with some background as to why they will not see ORM-based examples in future essays. Thanks for the comments!
