Monday, June 29, 2009

SQL Performance: Child data

In this post I’d like to talk about a specific issue in SQL and all the various ways you could approach it. Specifically, I’d like to talk about dealing with a parent entity’s child data. As an example, let’s use a very simple document database that stores documents and their versions. Obviously, a document will have many versions, but each version will have only one document, as in this diagram:

[Diagram: the Document table with a one-to-many relationship to the DocumentVersion table]

This is not a complete database, clearly, but it indicates that the DocumentVersion table is a child of the Document table.

Now we get to the part where this gets at least partly interesting. Let’s say we’re going to write a search that returns documents. In the results we want to display information about the current version of each document as well as the total number of versions for each document.

This is a surprisingly non-trivial query…

select d.Id as DocumentId, d.Name, v.Id as VersionId, v.FileName, v.Extension, v.FileSize,
    ( select count(Id) from DocumentVersion where DocumentId = d.Id ) as NumVersions
from Document d
inner join DocumentVersion v on d.Id = v.DocumentId
where v.Id = ( select max(Id) from DocumentVersion where DocumentId = d.Id )

Now, there are a bunch of ways to write this, but this is a perfectly good example. Notice we have an inner select in the select clause to get the number of versions, and we have another select in the where clause to get the “latest” version. Here I’m depending on SQL Server’s Identity Specification to give me the latest row because it simplifies the query. If we didn’t want to do that, I’d have to either “select top 1” while ordering by the inserted date (which isn’t on the table in our example) or use a row number function and get the row where the row number = 1, again ordered by the inserted date. Both of these subqueries are correlated, meaning they’re run once for each document in our results.
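For illustration, here’s a rough sketch of that ROW_NUMBER() variant, assuming a hypothetical InsertedDate column on DocumentVersion that isn’t in our example schema:

-- sketch only: InsertedDate is a hypothetical column, not in our example table
select d.Id as DocumentId, d.Name, v.Id as VersionId, v.FileName, v.Extension, v.FileSize, v.NumVersions
from Document d
inner join (
    select *,
        row_number() over ( partition by DocumentId order by InsertedDate desc ) as RowNum,
        count(*) over ( partition by DocumentId ) as NumVersions
    from DocumentVersion
) v on d.Id = v.DocumentId
where v.RowNum = 1

This swaps both correlated subqueries for window functions, but it still has to look at every version row on every search.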

This query is ugly, but it works. We could optimize it and tweak the way it’s written to try to get the best possible performance out of it. But is this really the best way to do this? If we think about it, we’re going to be looking at all the versions for every document returned in our search. The more documents we return, the worse this gets. And worse, we’re going to do WAY more reads than writes in this case. New versions simply are not going to be added that often. So it seems silly to be constantly looking up information about the versions over and over again when we know it’s unlikely to have changed since the last time we looked.

Wouldn’t it be better to cache this information on the Document table so we don’t have to keep calculating it repeatedly, thereby simplifying the query and improving its performance?

To do this, we simply add “NumVersions” and “CurrentDocumentVersionId” columns to the Document table (a sketch of the schema change, and the simplified query it enables, follows the list below). But now we have to keep these columns up to date. There are a few ways to do this:

  1. Trigger on DocumentVersion updates Document’s cached columns on Insert/Delete
  2. Code that does inserts or deletes to DocumentVersion must update cached columns
  3. Cached columns are calculated columns that use a function to lookup values
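Before we weigh those options, here’s a minimal sketch of the schema change itself (the default value is an assumption):

alter table Document add
    NumVersions int not null default 0,
    CurrentDocumentVersionId int null

With the cached columns in place, the search query collapses to a simple join:

select d.Id as DocumentId, d.Name, d.NumVersions, v.Id as VersionId, v.FileName, v.Extension, v.FileSize
from Document d
inner join DocumentVersion v on v.Id = d.CurrentDocumentVersionId

(A document with no versions drops out of the inner join here, just as it did in the original query.)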
We'll take these in turn. #1, using triggers (sketched after the drawbacks below), has these benefits:
  • Ensures the columns will always be up to date, no matter how the version records are changed
  • Code will be in just one place and we won’t have to worry about it again
However, triggers have these drawbacks:
  • Slow (like, REALLY slow. Batch operations like inserting or deleting many versions will slow to a crawl)
  • Potential for subtle and hard to track down deadlock problems
  • Increased code complexity because the trigger must be written to handle ALL cases, even if you only use some (ex: inserting many versions at once)
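For concreteness, here’s a minimal sketch of what such a trigger might look like, written set-based so multi-row inserts and deletes are handled (the trigger name is illustrative):

create trigger trg_DocumentVersion_SyncDocument on DocumentVersion
after insert, delete
as
begin
    set nocount on;

    -- recalculate the cached columns for every document touched by this statement
    update d set
        NumVersions = ( select count(*) from DocumentVersion where DocumentId = d.Id ),
        CurrentDocumentVersionId = ( select max(Id) from DocumentVersion where DocumentId = d.Id )
    from Document d
    where d.Id in ( select DocumentId from inserted
                    union
                    select DocumentId from deleted );
end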
On #2, updating the cached columns wherever we insert or delete versions (sketched below), we have these benefits:
  • Simplest possible code
  • Performant
  • Deadlock issues are easier to see and handle
But it comes with its own downsides:
  • Same code may end up in many places (ex: Insert and Delete stored procedures, if using sps)
  • Potential for error if someone inserts/deletes versions from a new location and forgets to update the cached columns
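As a sketch of what #2 looks like in practice, here’s an insert procedure that maintains the cached columns in the same transaction (the procedure name and parameters are illustrative):

create procedure dbo.InsertDocumentVersion
    @DocumentId int,
    @FileName nvarchar(260),
    @Extension nvarchar(10),
    @FileSize bigint
as
begin
    set nocount on;
    begin transaction;

    insert into DocumentVersion ( DocumentId, FileName, Extension, FileSize )
    values ( @DocumentId, @FileName, @Extension, @FileSize );

    -- maintain the cached columns in the same transaction as the insert
    update Document set
        NumVersions = NumVersions + 1,
        CurrentDocumentVersionId = scope_identity()
    where Id = @DocumentId;

    commit transaction;
end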
#3, using calculated columns, amounts to putting the lookup logic right back in the query, since SQL Server won’t persist a computed column whose function reads another table, and it adds the overhead of a scalar function call on top (sketched below).
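Here’s a sketch of #3 with illustrative names, just to show the shape of it:

-- a scalar function looks up the count; the computed column calls it per row
create function dbo.fn_DocumentNumVersions ( @DocumentId int )
returns int
as
begin
    return ( select count(*) from DocumentVersion where DocumentId = @DocumentId );
end
go

-- cannot be marked persisted, because the function reads another table
alter table Document add NumVersionsCalc as dbo.fn_DocumentNumVersions( Id )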

So, between #1, #2, and #3, which is the right option?

I used to use triggers, out of fear that someone would forget to update the columns if I went with #2. But the performance and deadlocking issues with triggers have now caused me to go with the "API layer" approach of #2.

I think the answer, as always, is it depends. If the tables you're using are likely to be always accessed through an API layer, then you should go with #2. But if many people will be manipulating those tables from many different areas and there is no central API layer, you're pretty much forced to go with #1.

And the question remains: is it really worth caching the data this way, or should you just keep the lookups in the queries? Once again, my favorite theme for this blog: it depends. The big question is really performance, and that depends on how the queries will be used. Are you going to be returning thousands of results, or just hundreds? Are you going to be running this query often?

In SQL there is no one-size-fits-all rule. And worse, SQL is so complex it has to be treated as a black box, meaning you really can't reason your way to the answer. Therefore, your only hope is to test and test and test. You pretty much have to write the query every way you can imagine and then performance-test each one… And that takes a lot of time.
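In SQL Server, one simple way to run that comparison is to turn on the statistics output before running each candidate:

-- shows logical reads and CPU/elapsed time for each statement you run
set statistics io on;
set statistics time on;

-- run candidate query A here, then candidate B, and compare the
-- "logical reads" and "elapsed time" numbers in the Messages output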

As Scott Hanselman would say, "Dear Reader," what do you think? Have you been faced with this issue? What did you do?

3 comments:

  1. That's a really interesting problem, and oddly enough I can't say I've ever had to do this, although I could see it being really common.

    I think, just off the top of my head, I would lean toward #2 unless some compelling reason pulled me in another direction.

    I personally prefer most logic like this to be in code (by which I mean, say, C# instead of SQL like an SP). I reserve SPs for simple insert/update/delete, but even then I'm starting to lean more toward ORM-style solutions that don't use SPs. Considering that, this means I'd probably put in some sort of hook on that particular object's add or delete.

    The main reason is that code is generally a bit easier to test than pure SQL. Also, I'd want to isolate this logic into a method or class, and I find it a bit more maintainable to have a method in a class somewhere than another SP or function that a 2nd SP calls.

  2. Agreed. I've been moving that way too.

  3. You could add some abstraction, using views and stored procedures for all the CRUD operations; that would give you the room to switch strategies.
