Immutability -> Safer Denormalization

Denormalization in database design is generally considered a Bad Thing™, but it can make some operations much faster and more convenient. One way to make denormalization safer is to have immutable fields. For example, let’s say you have the following tables:

A {
  ...
}

B {
  a A immutable;
}

C {
  b B;
  a A assert a = b.a;
}

Now, if you want to know which A record a C record is associated with by way of B, you’d only have to follow one reference instead of two. This is especially handy when you want to find all the C records that are associated with a particular A record. If this were SQL, you wouldn’t have to join any tables. You’d just query C:

bad:  select * from C join B on B.id = C.b where B.a = ?
good: select * from C where a = ?

Nice, but this probably doesn’t seem like a huge improvement. But what if you’re using a database system that doesn’t support joins? Or what if you’re using SQL and you have to join many intermediate tables? In practice, I have run into both of these problems. The particular non-SQL database system I’m thinking of is Google App Engine’s Datastore, but the same principle applies to Amazon Web Services’ SimpleDB, and probably others. In the case of SimpleDB, there is no schema (Expando models make this true to a certain extent for Datastore), which means you can’t declare fields as immutable. But if you’re using a library that maps domains to classes, and those classes contain field specifications (as in the case of Datastore’s Python API), then you can always implement immutability in the library.
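For instance, here’s a minimal sketch of how a library could enforce immutability at the field level. It uses a plain Python descriptor rather than any real Datastore API, and the ImmutableField name is just something I made up for illustration:

class ImmutableField(object):
    """Descriptor that lets a field be set exactly once."""

    def __init__(self, name):
        self.name = '_' + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        # Reject any attempt to overwrite a value that has already been set.
        if hasattr(obj, self.name):
            raise AttributeError('field is immutable once set')
        setattr(obj, self.name, value)


class B(object):
    a = ImmutableField('a')

    def __init__(self, a):
        self.a = a

Any later assignment to b.a raises, so code that copies b.a into other records can trust that the copied value never goes stale.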

If the a field in the B table were mutable, C would have to worry about changes to B in order to ensure that the constraint on its a field still holds. Of course, you can still break the constraint by changing C’s b field without changing its a field, but it’s much easier to prevent that sort of thing than to make sure changes in other tables don’t mess things up for C. You only need to inspect values within a single record to determine whether something is wrong, as opposed to having to check values in other records in other tables. Rails’ solution for this is validation, a central feature of the ActiveRecord library. Datastore has property validation too, but property validators in Datastore aren’t powerful enough to make sure values in two or more fields sync up. You can try to enforce constraints involving multiple properties by overriding the put method in your model classes, but that wouldn’t be very OO and just doesn’t feel right.
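To see the limitation, here’s a sketch of the A/B/C models, assuming the old google.appengine.ext.db API. A property validator receives only the value being assigned, so the check_a function below (a hypothetical name) has no way to compare that value against the entity’s b property:

from google.appengine.ext import db

class A(db.Model):
    pass

class B(db.Model):
    a = db.ReferenceProperty(A)

def check_a(value):
    # The validator is handed just the new value for C.a; it gets no
    # reference to the C entity, so comparing against self.b.a is
    # impossible from here.
    if value is None:
        raise db.BadValueError('a is required')

class C(db.Model):
    b = db.ReferenceProperty(B)
    a = db.ReferenceProperty(A, validator=check_a)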

The reason overriding put would not be very OO is that you’d be changing the interface one way or another. Sure, you wouldn’t need to change the number of arguments, but a method’s interface consists of more than just what arguments it accepts. Return values and exceptions are also part of a method’s interface (yes, exceptions too!). You might be tempted to return None if your multi-property constraint isn’t met, but according to Google’s documentation, put returns a Key. “That’s ok,” you’re thinking, “I can just have my put implementation throw a ValueError.” Sorry, you can’t do that either, because put does not throw ValueErrors. It might throw an exception if the Datastore service is down or some other problem like that, but ValueErrors are not on put’s exception guest list.
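To make that concrete, here’s a sketch (again assuming the db API and the hypothetical C model from the sketch above) of the tempting override this paragraph argues against:

class C(db.Model):
    b = db.ReferenceProperty(B)
    a = db.ReferenceProperty(A)

    def put(self, **kwargs):
        # Tempting, but this quietly widens put's interface: callers of
        # db.Model.put expect a Key back and Datastore-related errors,
        # not a ValueError about C's denormalized field.
        if self.a.key() != self.b.a.key():
            raise ValueError('C.a must match C.b.a')
        return super(C, self).put(**kwargs)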

If it seems odd that an OO implementation of put shouldn’t throw anything other than Google-sanctioned exceptions, consider what Java does in the following situation:

public class Foo {
  public void foo() {
    // ...
  }
}

public class Bar extends Foo {
  public void foo() throws Exception {
    // Danger, Will Robinson!
  }
}

This won’t compile, because the foo method has no business throwing a checked Exception in Bar. This would totally screw up Java’s type system. For example:

Foo bar_in_disguise = new Bar();

Since Bar is a subclass of Foo, this should totally be valid. But what if you call bar_in_disguise.foo() and it ends up wanting to throw an Exception? Disaster! Java’s static type checking will look at this code and say to itself, “If bar_in_disguise is (statically) declared to be of type Foo, I know that calling bar_in_disguise.foo() will not throw any checked exceptions.” If you’re skeptical that this is how Java works and you try this in Eclipse, you’ll see a descriptive albeit terse explanation of the problem. If you unpack what it says, you’ll basically get the explanation I just gave.

Anyway, back to databases: this is why data warehouses can be heavily denormalized: all the fields are immutable! Once data goes into your data warehouse, it’s pretty much fixed. Of course, you can still change it, but that shouldn’t happen very often. When changes do become necessary, people are going to be very careful about data integrity, because they’re probably performing manual changes meant to fix some kind of data quality or integrity issue. In other words, those changes aren’t driven by transient application state, which is where normalization saves your butt.
