There's No Action!

Tuesday, May 23, 2006

Why All Web Programming Frameworks Are Broken

(and how we can fix it)

Anyone who has been around the web programming world (or any programming world, really) has no doubt realized that there isn't a whole lot to what we do; basically, we create pretty forms and screens that manipulate data in some sort of database and return formatted results to the end user. Input goes in, processing happens, and output comes out. There are many different languages used to express this, but the underlying formula remains the same.

Now, programming is a very individualist exercise. We are generally a pretty talented group (at least in our own minds), and understanding the above formula often leads us to believe that we can somehow streamline the process so that creating a new web application takes much less code than it currently does in whatever platform we are using. And in this we are right; everybody who does this professionally comes up with their own framework or CMS system that they think is the cat's meow and purrfectly (sorry) summarizes their view of how a web application should fit together. Very often these systems allow us, as individuals, to spit out web applications faster. Some of the better ones (or better marketed ones) can even help other people make applications faster, as long as they can live with the assumptions made by the original developer.

That said, we have myriad application frameworks out there which promise the great MVC experience, where content is perfectly separated from the presentation, designers can be designers, coders can be coders, and the world will live in perfect harmony. Despite 10 years of this, however, we are still stuck in a world of buggy and difficult-to-maintain code. Systems like Fusebox, Ruby on Rails, Struts, and even .Net can all make your job a little easier if you follow their rules closely, but none are able to live up to the productivity hype which surrounds them or they would quickly jump to the front of the pack and take over the space. Think about it. If RoR made you 10 times as productive as .Net, would anyone still code in .Net? These systems all provide incremental improvements at best. The reason for this is that the problem doesn't lie in the web layer, but in the persistence/database layer.

databases you say?

Yes, databases.

I'll get right to the point; databases are great devices for storing tabular data, but they lack the crucial ability to represent state/time information in a uniform way. This probably requires some explanation, so I will try to do that below.

Relational databases handle data as rows, a logical abstraction which is what makes them good for being able to store arbitrary objects (the classes of which are represented by tables) in a uniform way. The problem comes in how the database can only see 3 different types of transactions on a per-row basis, the CREATE (in which the row is created), the UPDATE (any update to the row), and the DELETE (where the row is removed from the table entirely). For the most part, decent application design rules out ever performing DELETES (except for in associational tables) because it will result in a loss of rows needed for foreign key constraints, so practically speaking there are only two DML statements ever run, and one of those is only done once.

Now the database engine itself is only geared to handle these two transactions. Columns can be declared as NOT NULL, but this is only useful for cases in which the row will ALWAYS contain data. There is no clean way of saying (without ridiculous check constraints or trigger logic) that a field must always have a value when the row it is in reaches a certain point in its workflow. Similarly, any attempt to validate data at the database level (always a good practice if your database supports it) will likely result in adding another ON UPDATE trigger or independent stored procedure, resulting in another complicated piece of application logic that is left to the whims of the developer coding the procedure. Five years down the road, will anyone currently working on the project understand that STPROC_ADD_TICKNUM_U is the procedure to call that will associate a user with a problem ticket? Maybe or maybe not. I would guess that 50% of the applications out there are coded by one programmer who does things his or her own way, making naming conventions entirely arbitrary. Most shops that specialize in development adopt systems to handle just this situation, but even those involve retraining yourself to think like the guy who first developed the system.

Let's do a quick thought exercise to illustrate these issues.

The scenario: a typical web shopping cart, with the requirement that all of the steps of the checkout save themselves to the database so that the customer can pick up the process mid-order in case they leave the site in the middle for some reason.

The schema is simple (and not totally normalized):

Table 1: orders
order_id: int
order_num: varchar(10)
order_date: datetime
ship_date: datetime
status: int
create_date: datetime
mod_date: datetime
bill_name: varchar(100)
bill_address: varchar(200)
ship_name: varchar(100)
ship_address: varchar(200)

Table 2: order_items
order_item_id: int
item_id: int
price: float

Now the workflow. For the orders, we need to make all of the fields except for create_date and order_id nullable, since when the user is first entering data they can enter it piecemeal. The customer said the order_num should be sequential based on the order in which the orders were completed, a silly restraint but one which we must deal with anyway. The billing address, order number, and at least one item must exist when the order reaches "ordered" status, so we need to account for that somewhere. The ship date is only entered if the order reaches the "shipped" status. The create_date is pretty trivial in this example since it only represents the point in time that the user pressed the checkout button the first time, and the mod_date only represents the last time somebody touched the record for any reason. We are forced to have separate fields to account for the progression dates through the various statuses, and to update those either through some kind of application logic.

Even in this trivial and everyday example, we are left with putting the bulk of our order logic within application code, not data definition code. This is a critical difference, because it means the database itself has no concept of an order except for a list of possible fields it might have and therefore we can't use its powerful built-in integrity features to make sure the data stays pristine. In place of that, we have a bunch of procedures or DML statements written by developers in whichever style they choose, which aside from being prone to error, will also be totally arbitrary and difficult for new developers to decode.

a rant

I'd like to take a moment to talk about the art of software development.

I must say art here, because software is unlike any other type of engineering or mathematical discipline. It is often lamented by IT companies, project managers, and customers alike that software development projects are impossible to accurately forecast in terms of time and resources needed.

This is due to the fact that regardless of language, technology stack, and even methodology, 90% of what makes a program do interesting work is arbitrary code written by a single developer. Even if this code follows some sort of framework, this framework is only useful in doing what the designer of the framework intended for it to do. It is doing stuff that it was not designed to do that gets the developer paid.

Taking Ruby on Rails as an example, it contains handy convenience methods and conventions for creating new rows, updating mod_date fields, and even handling row locking (provided Rails is the only thing updating the database). But virtually every single statement you will write will involve some arbitrary WHERE clause, will need to SELECT rows using a JOIN and needing to skip the built-in functions because of the "id" column convention, or do some other cockimamy thing that the designers couldn't include in a general purpose framework. That is before you even get to how you choose to design and allocate functions between your various model classes, what you decide to name your functions, how you choose to design your controllers, what brand of HTML you want to use in your view layer, how you use CSS, etc.

The net result to all of this arbitrary code is that programming doesn't scale very well as you add more developers to it. Every new team member must learn what was done by those who came before in terms of both functionality and coding style, and then they will usually end up adding their own twists on top of that, making the next developer even more confused. For a project lasting a substantial period of time you must also deal with attrition as people leave the team, and then somehow regenerating their knowledge amongst new or existing staff.

As a result of all this, programming productivity doesn't increase linearly as you throw more bodies at it. If you get productivity 1.5 from 2 programmers, it may take you 7 to get to productivity 3.0. And even that is assuming you hire good people. :)

what does this have to do with web programming frameworks?

Good question. Getting back to the task at hand, the issue is that most web programming frameworks base themselves on some sort of database abstraction layer. This is usually the source of their substantial productivity claims, since it lets you (alternately)
  1. Do all your programming in one language!
  2. Not know SQL!
  3. Let the framework handle database/SQL optimization!
  4. Let the framework build your database for you!
  5. Update your model classes based on any schema changes you make!
While these sound nice on paper, they are often of dubious value (3, 4, 5), are a bad programming practice (1), or outright falsehoods (2). The bigger issue, though, is that when the database doesn't let you design your schema in a way that much metadata can be extracted from it, any database abstraction layer will always be a lowest-common-denominator abstraction. It's convenient to not have to manually update your classes when you add new columns to the database, but you still need to incorporate those changes into any web forms that use the new field and any functions that need to incorporate it into the workflow. This means still more arbitrary, single developer code.

the solution...

The way to get out of this conundrum is to come up with a new version of the SQL standard that can account for status/time as a database-level entity. We need to give rows the concept of status that is separate from a programmer-designated, arbitrary field and a batch of procedures or application language functions.

One idea would be to have an

ALTER TABLE tablename ADD STATUS statusname LEADS TO subsequent_statusname1, subsequent_statusname2, ...

statement which can show the order in which statuses occur. The status that has nothing leading to it will be the initial status. Then triggers could be created on a per-status basis, and stored procedures could be defined that are tied to moves between statuses, either to a new one or back to an old one. Also, special syntaxes can be added for NOT NULL fields to account for the status changes.

Such a system, with some natural extensions, would provide a richer set of DDL-level data for the various frameworks to take into account when creating their model libraries, and some could even go to the lengths of making web pages templates based one what fields are available for editing on which statuses. Over time we can evolve to the point where more logic is stored in the database/persistence layer where it belongs than in arbitrary language code.

conclusion

Web development has reached a point in it is receiving diminishing returns with attempts to refine it further at the application code level. We have long ago distilled websites into their basic elements and attempted to abstract out all that can be abstracted out, and we are left with a battery of competing standards and frameworks which differ very little from one another in anything more than syntax. At the same time, demands for increasing productivity and faster time to market are always ringing in our ears.

The only way to fix this in a fundamental way is to add metadata that is presently being defined in an ad-hoc basis by developers into the database/persistence layer. This will accomplish the dual objectives of giving the framework developers a richer set of data with which to abstract the database, and incorporate information presently stored in crude 3rd generation languages into an elegant 4th generation language like SQL. Database integrity will be assured, and the world will be as one.

Until next time.

1 Comments:

  • Some in the database world have been proposing similar recommendations for years. Hugh Darwen and Chris Date (de facto successors to Dr. E. F. Codd, creator of the relational model) have been criticizing SQL for years, and have proposed replacing it with languages that more fully implement the relational model. State transition constraints are just one small feature of what they've been agitating for.

    http://thethirdmanifesto.com

    Unfortunately, I think they're usually dismissed as ivory-tower Data Architecture "academics". I understand the criticism, as I live in the "real world" of SQL myself, and there's not much I can do about it.

    The shame of it is that some very smart people have some very well thought out ideas for the next generation of databases, but it seems very few are listening. The entrenched DBMS players are, understandably, only interested in increasing their own market share, and really have no incentive to gamble on something new.

    There is a project out there, called Rel, that is an attempt to implement these ideas, but I haven't had time to check it out yet. Maybe you can look into it, and let us know what you think.

    http://dbappbuilder.sourceforge.net/Rel.html

    By Blogger Column Meaney, at 7:33 PM  

Post a Comment

<< Home