Tuesday, September 28, 2010

Questioning ORM Assumptions

I come from a stored procedure background in data access, with output parameters and DataTables strewn throughout C# code.  I have "recently" been learning ORMs (specifically NHibernate and Entity Framework).  I've done some prototyping with both, and used NHibernate on a few projects.

In 1-1 situations I have found it is incredibly nice to not have to write SQL, or virtually any data access code at all.  In situations that required some form of mapping (components, inheritance, etc) it's also very nice, though things become more brittle and error-prone.  In fact, even in 1-1 situations, I've been surprised by how brittle NH mappings are.  Change just about anything on your entity and you're likely to break your mapping somehow.  But that seems to be a price worth paying to avoid writing SQL and manual mapping code.

However, I've recently been questioning some of the features ORMs bring.  I think most people would consider these features absolute requirements of an ORM.  However, I'm beginning to doubt how valuable they really are.  Perhaps some of this is in reality more trouble than it's worth?

Unit of Work

The first pattern I have some issues with is the Unit of Work pattern.  This is the pattern used by ORMs to allow you to get a bunch of objects from the ORM, make any changes you want, and then just tell the ORM to save.  The ORM figures out what you changed, and takes care of it.  There are two major benefits to this pattern:
1. You don't have to manually keep track of all the objects you changed in order to save them.  The ORM will just know what you changed, and make sure it gets persisted.
2. You don't have to concern yourself with the order things get saved in.  The ORM will automatically figure it out for you.

My first issue with this pattern is that it is not very intuitive.  You have to tell the ORM about new objects, and you have to tell it to delete objects, but you don't have to tell it to update objects.  And, in fact, you don't have to tell it about ALL new objects, as it will automatically insert some of them depending on how your mappings and objects are set up (parent/child relationships, for example).  It tends to be further confused by the APIs frequently used.  For example, a lot of people use a Repository/Unit of Work pattern to hide NHibernate's session object.
var crypto = BookRepo.GetByTitle( "Cryptonomicon" );
crypto.Rating = 5;

var ender = new Book { Title = "Ender's Game", Author = "Orson Scott Card" };
BookRepo.Add( ender );

uow.Save();
What happens at BookRepo.Add( ender )?  Does that issue an Insert to the database?  Is the crypto.Rating update saved?  And where the heck did this uow object come from and what relationship does it have with the BookRepo?!  If you know this pattern, you're probably so used to it that it doesn't seem strange.  But when you step back from it, I think you'll agree this is a pretty bizarre API.

Truth be told, some of this confusion is actually due to the Repository pattern.  You are supposed to think of a Repository as an in-memory collection of objects.  The persistence is under-the-covers magic.  If you're writing an application where persistence is one of the primary concerns, I always thought it was kind of stupid to adopt a pattern which tries to pretend that persistence isn't happening...

But back to Unit of Work, the second issue I have is a certain loss of control.  It is very easy to write code using a unit of work and then have no idea what is actually being saved to the database when you issue the Save command.  To me, that's a really scary thing.  Now, to be fair, if you find yourself with code like that, it's probably really bad code.  But that doesn't change the fact that this pattern almost encourages it.  There is something nice about Active Record's approach of calling Save on each entity you want persisted.  It certainly gives you back that control.
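For contrast, here's roughly the same scenario in an Active Record style.  This is just a sketch; the Book API shown here (FindByTitle, Save) is hypothetical:

```csharp
var crypto = Book.FindByTitle( "Cryptonomicon" );
crypto.Rating = 5;
crypto.Save();   // exactly one update, issued right here

var ender = new Book { Title = "Ender's Game", Author = "Orson Scott Card" };
ender.Save();    // exactly one insert, issued right here
```

There's no mystery about what hits the database, or when.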

My last issue, and this one isn't really that big of a deal, but it still bothers me a bit...  The Unit of Work pattern couples the way you make changes to the transactions that are used to save them.  In other words, you can't change object A and object B, then save A in one transaction and B in another.  Instead, you'd have to change A, save it, change B, save it.  Like I said, this is a minor quibble, but it demonstrates again the assumptions made by the UoW pattern, which steal some of your control.
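To make the coupling concrete, here's the shape of it with NHibernate's session API (the sessionFactory variable is assumed to exist):

```csharp
using ( var session = sessionFactory.OpenSession() )
using ( var tx = session.BeginTransaction() )
{
    var a = session.Get<Book>( 1 );
    var b = session.Get<Book>( 2 );
    a.Rating = 5;
    b.Rating = 3;
    tx.Commit();  // both updates flush here, in the same transaction
}
// Saving a and b in separate transactions means restructuring this code,
// not just changing how you call save.
```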

None of these issues are all that serious.  But I still believe that Unit of Work is a very awkward way of dealing with your objects and persistence.

Lazy Loading

ORMs use Lazy Loading to combat the "Object Web" problem.  The object web problem arises when you have entities that reference other entities that reference other entities that reference other entities that ...  How do you load a single object in that web without loading the ENTIRE web?  Lazy Loading solves the problem by not loading all the references up front.  It instead loads them only when you ask for them.

NHibernate and Entity Framework use some pretty advanced and somewhat scary "dynamic proxy" techniques to accomplish this.  Basically they inherit from your class at run time and change the implementation of your reference properties so they can intercept when they are accessed.  There are some scenarios where this dynamic inheritance can cause you problems, but by and large it works and you can pretend it's not even happening.
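A hand-rolled sketch gives the flavor of what a generated proxy roughly does.  All the names here are hypothetical; real proxies are emitted at run time, not written by hand:

```csharp
public class Book
{
    public virtual Author Author { get; set; }
}

public class BookProxy : Book
{
    private readonly Func<Author> _load;  // the deferred query
    private Author _author;

    public BookProxy( Func<Author> load ) { _load = load; }

    public override Author Author
    {
        get { return _author ?? ( _author = _load() ); }  // query on first access
        set { _author = value; }
    }
}
```

This is also why the ORMs require your reference properties to be virtual.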

Lazy loading as a technique is very valuable.  But I think ORMs depend on it too heavily.  The problem with Lazy Loading is performance.  It's easy to write code that looks like it executes a single query against the database, but in reality ends up executing 10 or more.  At the extreme you have the N+1 select problem.  Once again, it boils down to trying to pretend the data access isn't happening.
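The classic shape of the problem looks completely innocent (repository names hypothetical):

```csharp
// Looks like one query, but lazy loading turns it into 1 + N.
var books = BookRepo.GetAll();                // 1 select for all the books
foreach ( var book in books )
    Console.WriteLine( book.Author.Name );    // 1 select per book for its Author
```

Nothing in that code hints that each loop iteration is a round trip to the database.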

DDD's solution to the Object Web problem is Aggregates.  An Aggregate is a group of entities.  The assumption is that when you load an Entity all its members will be loaded.  If you want to access another aggregate, then you have to query for it.  This cleanly defines when you can use an object traversal, and when you need to execute a query.  Basically, it forces you to remove some of the links in your object web.
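A sketch of what that looks like in code (all types hypothetical):

```csharp
// Order is an aggregate root: its lines always load with it.
public class Order
{
    public IList<OrderLine> Lines { get; set; }  // inside the aggregate: traverse freely
    public int CustomerId { get; set; }          // another aggregate: hold an id, not a reference
}

// Crossing the aggregate boundary means an explicit query:
var order = OrderRepo.GetById( 42 );
var customer = CustomerRepo.GetById( order.CustomerId );
```

Holding an id instead of an object reference is exactly how you remove a link from the web.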

By making Lazy Loading so easy, ORMs kind of encourage you to build large object webs.  Entity Framework in particular, because its designer will automatically make your objects mimic the database if you use the db-first approach and drag and drop your tables into the designer.  Meaning you will have every association, navigable in both directions, included in your model.

While I don't have a problem with Lazy Loading, I do have a problem with using it too much.  This is the main reason why you read so much about people "profiling" their ORM applications and discovering crazy performance problems.  Personally, I'd rather put some thought into how I'm going to get my data from the persistence store up front than have to come back after the fact and waste tons of time trying to find all the areas where my app is executing a crazy number of queries needlessly.
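With NHibernate, that up-front thought can be as simple as declaring the fetch strategy in the query (the session variable is assumed to exist):

```csharp
// Join Author in the same select instead of lazy loading it later.
var books = session.CreateCriteria<Book>()
    .SetFetchMode( "Author", FetchMode.Join )
    .List<Book>();
```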

Object Caching

NHibernate and Entity Framework keep a cache of the objects they load.  So if you ask for the same object twice, they'll be sure to give you the same instance of the object both times.  This prevents you from having two different versions of the same object in memory at the same time.  If you think about that for a while, I'm sure you'll come up with all kinds of horror scenarios you could get into if you had two representations of the same object.

But I think this is an example of the ORM protecting me from myself too much; it's just not that important a feature.  Instead it adds more magic that makes the data access of my application even harder to understand.  One time when I call GetById( 1 ), it issues a select.  But the next time it doesn't.  So if I actually wanted it to (to get the latest data, for example), I now have to call Refresh()...
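A sketch of the behavior with NHibernate's session API (the session variable is assumed to exist):

```csharp
var first = session.Get<Book>( 1 );    // issues a select
var second = session.Get<Book>( 1 );   // no select: same instance from the session cache
// first and second are the same object reference

session.Refresh( first );              // forces a fresh select to get the latest data
```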

Wrap Up

I got into all this because I didn't want to write SQL and I didn't want to write manual mappings.  I certainly got that.  But I also got Unit of Work, Lazy Loading, and Implicit Caching.  None of which I actually NEED and certainly never wanted.  And some of which actually create more problems than I had before!

Some Active Record implementations manage to fix these issues.  But I have concerns with using Active Record on DDD-style code.  The main concern is that I want to model my domain, not my database.  The other big concern is I prefer keeping query definitions out of the entities, as it doesn't feel like their responsibility.

Now I'm not claiming any of these issues are a deal breaker to using NHibernate or Entity Framework or other ORMs.  But on the other hand, it doesn't feel like these patterns are the best possible approach.  I suspect there are alternative ways of thinking about Object Relational Mapping which may have some subtle effects on how we code data access and lead to better applications, developed more efficiently.  For now though, I'm settling for NHibernate.

Tuesday, September 21, 2010

Decoupling tests with .NET 4

Recently, I was struggling with an annoying smell in some tests I was writing and found a way to use optional and default parameters to decouple my tests from the object under test's constructor.  Not too long ago, Rob Conery wrote about using C#'s dynamic keyword to do all kinds of weird stuff.  When I was running into these issues with those tests, I took a look at what he'd been playing with.  Nothing there jumped out, but it led to the optional parameters.

Specifically, I was TDDing an object that had many dependencies injected through the constructor.  Basically what happened was each new test introduced a new dependency, which caused the previous tests to have to be updated.  For example:
[Test]
public void Test1()
{
  var testobj = new Testobj( new Mock<ISomething>().Object );
}
I'm using Moq here, and creating a default mock, which takes care of itself.

Then I write the next test:
[Test]
public void Test2()
{
  var setupMock = new Mock<ISomething>();
  // setup the mock
  var testobj = new Testobj( setupMock.Object, new Mock<ISomethingElse>().Object );
}
Notice this test has introduced a new dependency, ISomethingElse.  Now the first test won't compile; we have to go update it and add a mock for ISomethingElse.  This will continue, with each test that introduces a new dependency causing every previous test to be updated.

You could simply refactor the constructor into a helper method so you only have to change it in one place.  But this doesn't work so well when the tests are passing in their own mocks.  You'd need lots of helper methods with lots of different method overloads.  Enter optional and default parameters!
public Testobj BuildTestobj(Mock<ISomething> something = null, Mock<ISomethingElse> somethingElse = null )
{
  return new Testobj(
    ( something ?? new Mock<ISomething>() ).Object,
    ( somethingElse ?? new Mock<ISomethingElse>() ).Object );
}
Now we can update the tests:
[Test]
public void Test1()
{
  var testobj = BuildTestobj();
}

[Test]
public void Test2()
{
  var setupMock = new Mock<ISomething>();
  // setup the mock
  var testobj = BuildTestobj( something: setupMock );
}
Simple, clean, refactor friendly, and your tests are now nicely decoupled from the constructor's method signature!
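And since the parameters are named, each test overrides only the subset it cares about, no matter how many dependencies get added later (setupElseMock here is a hypothetical configured mock):

```csharp
// Only override ISomethingElse; ISomething still gets a default mock.
var testobj = BuildTestobj( somethingElse: setupElseMock );
```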