Showing posts with label sql. Show all posts

Tuesday, May 29, 2012

Minor Issues: Query Results vs. Models

I want to take a look at a minor issue that crops up in a very common application structure where you have a list of data, possibly from a search, that the user selects from to view details.

There are some minor issues that must be addressed, and they all have to do with queries, especially when we're dealing with SQL.  There will be a query that returns the list by gathering all the data, maybe doing some formatting, and joining to all the relevant tables.  For example, if it's a list of books it will return the title, publish date, author (join to author table; format name), and genre (join to genre table).

Apart from listing the books, the app also needs to be able to add new books.  This will work as follows:
  1. A dialog pops up with all the fields to fill in
  2. On save, if everything validates, the book is saved in the database
  3. The new book is added to the list with AJAX (did I mention it's a web app?)
Since I don't want to leave you hanging, here are the "minor issues" I'm going to look at:
  • Query performance (N+1 Select)/Query complexity
  • Formatting logic
  • Type conversion
To illustrate my points, I'll use the Active Record pattern.  Using the book example, a naive implementation of the query might look like this:
var books = Books.All();
foreach (var book in books) {
  // Render is a stand-in for however the view displays each value.
  Render(book.Title);
  Render(book.PublishedDate.ToString("MM/dd/yyyy"));  // .NET format string
  Render(book.Author.FormattedName);  // lazy loads Author: one query per book
  Render(book.Genre.Name);            // lazy loads Genre: one query per book
}
Some things to note about this code:
  • It suffers from the N+1 Select problem: for each book it runs one query to lazy load the Author and another to lazy load the Genre (with two lazy loads per book, it's really 2N+1 queries).
  • It formats the date with a .NET format string.
  • It formats the author name using the format logic built in to the Author class in the FormattedName property
The first is a serious issue that we *must* correct, but there isn't anything inherently wrong with the other two.  

Query performance/complexity
To fix the N+1 Select problem, eager loading could be applied.  Eager loading is an ORM feature that adds joins to your query and expands the joined rows into referenced objects without separate database calls.  Entity Framework, for example, has a nice method called Include, so you could write .Include("Author").Include("Genre").  NHibernate allows you to define this as part of the mapping.
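In Entity Framework terms, the eager-loaded version of the list might look something like this (a sketch; BookContext and the entity names are assumed, and Console.WriteLine stands in for the view):

```csharp
// Sketch: eager loading with Entity Framework to avoid N+1.
// BookContext, Book, Author, and Genre are assumed names.
using (var db = new BookContext())
{
    var books = db.Books
        .Include("Author")   // joined in the same SQL statement
        .Include("Genre")    // joined too; no lazy loads in the loop below
        .ToList();

    foreach (var book in books)
    {
        // Author and Genre are already populated; no extra queries here.
        Console.WriteLine("{0} | {1} | {2} | {3}",
            book.Title,
            book.PublishedDate.ToString("MM/dd/yyyy"),
            book.Author.FormattedName,
            book.Genre.Name);
    }
}
```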

This solves the N+1 Select problem and is generally good enough for a simple example.  But when the query is more complicated, using the ORM to generate the SQL can be troublesome.  It's also worth pointing out that, written this way, the SQL will return all the fields from every table it joined to and selected from, even if only a small subset is needed.  This may or may not affect performance, but it will impact the way indexes are defined.

The N+1 Select problem can also be solved by not using Books.All(), and instead writing a SQL query to do the necessary joins and come back with only the required data.  There are two clear benefits to this:
  1. Using SQL directly means there are no limits on what features of the database can be used.  Plus, the query can be optimized however needed.
  2. Only the required data fields need to be selected, instead of all the fields.  And data from more than one table can be returned in one select without fancy eager loading features.
To represent the results, a Query Result class can be defined.  This class will be very similar to the AR models, but only contain properties for the returned fields.  
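For the book list, such a hand-written query might look like this (a sketch; table and column names are assumed):

```sql
-- Sketch: one query, returning only the fields the list needs.
-- Table and column names are assumed for illustration.
select b.Id,
       b.Title,
       b.PublishedDate,
       a.FirstName, a.MiddleName, a.LastName,  -- formatted by the result object
       g.Name as GenreName
from Book b
inner join Author a on a.Id = b.AuthorId
inner join Genre g on g.Id = b.GenreId
```

Each row of such a result would map onto a query result instance with exactly those properties.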

Formatting Logic
But this is where those two other bullet points from earlier come into play.  Remember how the date was formatted with a .NET format string?  In a custom query, this can easily be moved into the query result object.  It's the formatting of the author name that is going to cause some trouble.

Pretend there are three columns that represent name: FirstName, MiddleName, LastName.  There are three choices for how to format this into a single name display:
  1. Put the formatting logic in the select statement of the SQL query (duplicates the logic on Author)
  2. Put the formatting logic in a property of the query result object (duplicates the logic on Author)
  3. Refactor Author and call its method to format the name (awkward)
To explain, here's what Author might have looked like:
public class Author {
  ...
  public string FormattedName { get { return FirstName + " " + MiddleName + " " + LastName; } }
}
This formatting logic is coupled to the fields of the Author class, and so it can't be reused. To make it reusable, it could be refactored into a function that takes the fields as parameters. One way might look like:
public class Author {
  ...
  public string FormattedName { get { return FormatName(FirstName, MiddleName, LastName); } }
  public static string FormatName(string first, string middle, string last) {
    return first + " " + middle + " " + last;
  }
}
This is now in a format that could be used from within our query result object:
public class BookListResult {
  ...
  public string FormattedName { get { return Author.FormatName(FirstName, MiddleName, LastName); } }
}
Part of me loves this, and part of me hates it.

Type Conversion
The other issue that must be dealt with when using the Query Result approach comes from the AJAX part of our scenario.  Remember how we wanted to add the book to the top of the list after the add?  The view that renders the list item is typed to expect a BookListResult, which is what the query returns.  However, after the Add, the code will have a Book instance, not a BookListResult.  So this requires a way to convert a Book into a BookListResult.  I usually do this by adding a constructor to BookListResult that accepts a Book, and that constructor then "dots through" the book collecting all the data it needs.
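A conversion constructor along those lines might look like this (a sketch; the Book properties are assumed, and the FormattedName property from the earlier snippet is omitted for brevity):

```csharp
// Sketch: BookListResult with a "dots through" conversion constructor.
public class BookListResult
{
    public string Title { get; set; }
    public DateTime PublishedDate { get; set; }
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
    public string LastName { get; set; }
    public string GenreName { get; set; }

    public BookListResult() { }  // used when materializing query rows

    // Builds the same shape the list query returns from a freshly
    // saved Book.  Note this lazy loads Author and Genre, which is
    // acceptable for a single just-added book.
    public BookListResult(Book book)
    {
        Title = book.Title;
        PublishedDate = book.PublishedDate;
        FirstName = book.Author.FirstName;
        MiddleName = book.Author.MiddleName;
        LastName = book.Author.LastName;
        GenreName = book.Genre.Name;
    }
}
```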

From a certain perspective, this can be viewed as duplicating the query logic because knowledge of what fields the QueryResult's data comes from appears in two places: once in terms of the physical SQL tables in the SQL query, and again in terms of the Active Record objects.

Yet somehow I still prefer the Custom Query approach to the eager loading approach...  I just like having that absolute control over the SQL query.  The cost of the boilerplate code is worth it to me if it means I can directly leverage the query features of my database (row numbers, full text search, CTEs, pivots, etc.).

As in the last "Minor Issues" post (constructors and MVC controllers), I'd love to hear your thoughts or experiences with these patterns.

Monday, May 14, 2012

Powershell Listing SQL Table Columns

Powershell has an awesome utility called sqlps that both lets you execute SQL statements against a database and implements a file system provider over SQL Server.

One of the things I use this for all the time is to inspect the columns of a table.  Management Studio's tree view is terrible for this, especially compared to the flexibility of powershell which allows you to do things like:

  1. Sort the columns by name, or by data type
  2. Filter the columns by name, or by data type, or by nullable, etc
Here's a series of commands I use a lot that I thought was worth sharing:
  1. sqlps
  2. cd sql\localhost\default\databases\mydatabase\tables\schema.table\columns
  3. ls | where { $_.Name -notlike 'ColPrefix*' } | select Name, @{Name="Type"; Expression={"$($_.DataType.Name)($($_.DataType.MaximumLength))"}}, Nullable
That will display all the columns whose names DO NOT start with ColPrefix, showing each column's Name, Data Type (formatted like "nvarchar(255)"), and whether it allows nulls.

Enjoy!

Monday, March 5, 2012

Database Seed Data

Almost two years ago, we started a major new development effort at Pointe Blank.  We had the extreme fortune to be starting some new projects on a completely new platform (namely ASP.NET MVC instead of WinForms), which gave us the opportunity to take all the things we'd learned and start from scratch.

One of the main things we had learned was the importance of designing for automation.  Probably the single most valuable automation related thing we did was automate the scripting of our database.  We are using FluentMigrator to define migrations and psake with sqlcmd to create and drop database instances.  I'm not all that crazy about FluentMigrator (its syntax is overly verbose, and its code seems overly complex), but it has worked very well for us so far.

We use this tool in a few different circumstances:

  1. To apply other people's changes to our local dev databases
  2. To migrate our QA environment
  3. To spin up from scratch (and tear down) automated test environments
  4. To deploy database changes to production
Database migrations have made working with our code base so dramatically easier, it's really amazing.  You can spin up a new environment with one command in seconds, and incorporate other devs' changes without a second thought.

Migration tools pay most of their attention to schema changes, but data gets largely ignored.  So much so that we had to rough in our own approach to dealing with data, which works, but also sucks.  And there doesn't seem to be any clear recommendations for data.  There are some categories of data I'm concerned with:
  1. List Data: these are values that are usually displayed in drop downs and are either not user generated, or there is a default set we want to seed (ex: states, name suffixes, record types, etc)
  2. Configuration Data: this is data that is required for the system to work, (ex: server uris, an email address to send from, labels to display in the app, etc)
There are only two approaches to dealing with seed that I'm aware of:
  1. Run the schema migrations, then insert all the seed data
  2. Insert the seed data inside the migrations
At first blush #1 seemed easier, so it's the approach we originally took, but it has some drawbacks. The first challenge is to avoid duplicates.  You can do that with IF NOT EXISTS(...) statements, or by inserting Primary Keys and ignoring PK violation errors.  The second is dealing with schema changes to tables with seed data.
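An idempotent seed script for approach #1 might look like this (a sketch; the NameSuffix table and its columns are assumed, and explicit primary keys keep environments consistent):

```sql
-- Sketch: re-runnable seed inserts for a list-data table.
-- The IF NOT EXISTS guards make the script safe to run repeatedly.
if not exists ( select 1 from NameSuffix where Id = 1 )
    insert into NameSuffix ( Id, Suffix ) values ( 1, 'Jr.' )
if not exists ( select 1 from NameSuffix where Id = 2 )
    insert into NameSuffix ( Id, Suffix ) values ( 2, 'Sr.' )
```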

For example, suppose you write a migration to create a table, and you write insert statements to add seed data to it.  You have already run your migrations and seeded the database.  Now you write a new migration which adds a new not null column to the table.  This migration will fail because the table already has data from the seeds and the new column requires a value.  In this situation, you're hosed.  You have to write the migration to add data into that column before making the column not allow nulls (or use a default value).  Effectively you have to duplicate the seed data for that column in both the migration and the seed data file.  You need it in both places in order to support creating the database from scratch and migrating an existing database.
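One way around the not-null failure described above is to add the column with a default, so already-seeded rows get a value when the migration runs (a sketch; table and column names are assumed):

```sql
-- Sketch: a later migration adding a not-null column to a seeded table.
-- The default constraint fills the column for existing seed rows,
-- so the migration succeeds against an already-seeded database.
alter table NameSuffix
    add SortOrder int not null
    constraint DF_NameSuffix_SortOrder default (0)
```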

#2 seems to have its own downsides too.  There is no one place that shows you all the data that will be seeded into a given table; it's spread throughout the migrations.  It also precludes you from having a schema.rb type of system (unless that system could somehow be smart enough to include data).  That point is somewhat academic, though, because FluentMigrator doesn't have anything like this.

However, #2 is much more flexible and can accommodate anything you could ever dream of doing with data (that SQL would support, of course).  And I feel that pure #2 would be better than a #1/#2 hybrid, because it's more straightforward.
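With FluentMigrator, approach #2 means the seed rows live in a migration right alongside the schema change (a sketch using FluentMigrator's Insert.IntoTable API; the table and version number are assumed):

```csharp
// Sketch: seeding inside a FluentMigrator migration (approach #2).
[Migration(20120305001)]
public class CreateNameSuffix : Migration
{
    public override void Up()
    {
        Create.Table("NameSuffix")
            .WithColumn("Id").AsInt32().PrimaryKey()
            .WithColumn("Suffix").AsString(10).NotNullable();

        // The seed data travels with the schema change, so creating
        // from scratch and migrating an existing database both work.
        Insert.IntoTable("NameSuffix").Row(new { Id = 1, Suffix = "Jr." });
        Insert.IntoTable("NameSuffix").Row(new { Id = 2, Suffix = "Sr." });
    }

    public override void Down()
    {
        Delete.Table("NameSuffix");
    }
}
```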

And now I'd like to ask you, do you know of a good way of dealing with seed data?  Are there any existing tools or strategies out there that you could link me to?  

Tuesday, June 30, 2009

SQL Deadlocks: More with child data

In last week's post, SQL Performance: Child data, I wandered through an issue involving caching data about a parent table's child data. In that post I talked about writing SQL that would save a foreign key on the parent table to the most recent child record. This is very simple and would look like this:

begin tran TxExample
BEGIN TRY

declare @NewChildId int

insert into Child ( blah, blah, blah ) values ( @blah, @blah, @blah )

select @NewChildId = SCOPE_IDENTITY()

update Parent set CurrentChildId = @NewChildId where ParentId = @ParentId

if @@TRANCOUNT > 0 commit tran TxExample

END TRY
BEGIN CATCH
if @@TRANCOUNT > 0 rollback tran
END CATCH

This is relatively straightforward. It inserts the child, then updates the parent's cached data. Those two operations are wrapped in a transaction and a try/catch to ensure that if anything fails for any reason, both statements are rolled back. This preserves data integrity.

And now it's time to talk about deadlocks. This code is susceptible to them. As a relatively contrived but nonetheless possible example, suppose the following SQL could also be run:

begin tran TxExample2

update Parent set blah = @blah where ParentId = @ParentId

...

select * from Child where ChildParentId = @ParentId

commit tran
If these two queries were to run at the same time, operating on the same ParentId, and SQL Server were to context switch them at just the right moment, they would deadlock. Specifically, if the first query completed its insert into Child and then SQL switched to the second query, we would deadlock.

This is because when the second query tries to select from the Child table, it has to wait: the first query has inserted a new row, and SQL Server's default isolation level is read committed, which means uncommitted data will not be read; instead the select waits for the data to be committed. So it's going to sit there, waiting for the first query to commit.

This isn't a deadlock yet. The deadlock happens when SQL switches back to the first query and attempts to execute the update on the parent. When it does this, it will try to obtain an exclusive lock on that parent row, but it won't be able to because the second query already has an exclusive lock from its update. So it will wait for the second query to commit.

The first query is now waiting for the second query which is waiting for the first query and you have yourself a deadlock.

Before we fix it, we should ask ourselves "is this a big deal?" The answer is, it depends, but in general yes. If all your SQL is small, all your transactions complete quickly, and you don't have very many users banging on the system concurrently, then you probably won't see any deadlocks. But unless you can guarantee that all those conditions will remain the same you have to be at least a little worried. And if those conditions don't apply to you, you definitely have to be worried.

So how do we fix it? First thing we could do is to commit the transaction in the second query before executing the select. If this is possible, then it's a good idea. You want your transaction to commit as quickly as possible and you want to touch as few objects as you can while in the transaction. That said, there are plenty of reasons why you might not be able to commit the transaction after the update. For example, maybe you're reading the child data because you need it to perform another update, and those two updates have to be in the same transaction. In that case, there is nothing you can do to fix query #2.

But even if you could fix query #2, someone could some day come along and write query #3 which would introduce the same problem again. So what we really need to do is fix query #1. The way we do that is by having query #1 obtain a shared lock on all the resources we know it will need to touch, immediately at the top of the query.

Add this code after the BEGIN TRY:
set transaction isolation level repeatable read
select ParentId from Parent where ParentId = @ParentId
set transaction isolation level read committed

With this code in place, query #2 will not be able to execute its update until query #1 completes, thus preventing the deadlock and saving the day!

This example was simple but the deadlock was still subtle and hard to see. This problem just gets more complicated the more complicated your SQL gets. And your SQL will get more complicated in direct relation to how complicated your data schema is. So you really have to be on the look out for this issue.

Before I wrap this up, I should mention that if you need to lock more than just one row in one table at the top of your query (like we did in query #1), life can get interesting. If the tables you are locking are all related, you can lock them by inner joining to them. But if they're unrelated, you can't join from one to the next, so you need to execute separate select statements. And if two queries need to lock the same records in two unrelated tables, but they lock them in different orders (A, B vs. B, A), you can still end up with a deadlock! For these cases you have to resort to what you learned in your operating systems class: always lock your resources in the same order. Good luck with that.
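For the related-tables case, the shared locks can be taken in one statement by joining (a sketch; OtherChild is an assumed related table):

```sql
-- Sketch: taking shared locks on a parent row and its related rows
-- in one select at the top of the transaction.  Repeatable read
-- holds the shared locks until the transaction ends.
set transaction isolation level repeatable read
select p.ParentId
from Parent p
inner join OtherChild o on o.ParentId = p.ParentId
where p.ParentId = @ParentId
set transaction isolation level read committed
```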

I'll leave you with some rules of thumb, which apply to most cases but, of course, not all:
  1. Keep your transactions as small as possible by touching as few objects as possible
  2. Keep your transactions as fast as possible: if you have a query that can execute on n records in a single transaction where n is unbounded you are likely to find yourself in a world of hurt
  3. Obtain shared locks on everything your transaction will eventually require exclusive locks on before you acquire any other locks
  4. If you need to do any reads that don't need to be repeatable, do them before you obtain any shared or exclusive locks (this is really just in keeping with #2)
  5. If you set the transaction isolation level to repeatable read, make sure you set it back to read committed (even if it's the last line of your query; this ensures triggers don't execute in repeatable read)
Now, all of this has been learned by trial and error, experimentation, and a lot of reading. If you know of other ways around these issues, or if you have a different take, I definitely want to hear about it.

Monday, June 29, 2009

SQL Performance: Child data

In this post I'd like to talk about a specific issue in SQL and the various ways you could approach it. Specifically, I'd like to talk about dealing with a parent entity's child data. As an example, let's use a very simple document database that stores documents and their versions. Obviously, a document will have many versions but each version will have only one document, as in this diagram:

Document Db Diagram

This is not a complete database, clearly, but it indicates that the DocumentVersion table is a child of the Document table.

Now we get to the part where this gets at least partly interesting. Let's say we're going to write a search that returns documents. In the results, we want to display information about the current version of each document as well as the total number of versions for each document.

This is a surprisingly non-trivial query…

select d.Id, d.Name, v.Id, v.FileName, v.Extension, v.FileSize,
( select count(Id) from DocumentVersion where DocumentId = d.Id ) as numVersions
from Document d
inner join DocumentVersion v on d.Id = v.DocumentId
where v.Id = ( select max(Id) from DocumentVersion where DocumentId = d.Id )

Now, there are a bunch of ways to write this, but this is a perfectly good example. Notice we have an inner select in the select clause to get the number of versions, and another select in the where clause to get the "latest" version. Here I'm depending on SQL Server's Identity Specification to give me the latest row because it simplifies the query. If we didn't want to do that, I'd have to either "select top 1" while ordering by the inserted date (which isn't on the table in our example) or use a row number function and get the row where the row number = 1, again ordered by the inserted date. Both of these queries are correlated, meaning they're run for each document in our results.
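The row number variant would look roughly like this (a sketch; it assumes an InsertedDate column that isn't on the example table):

```sql
-- Sketch: row_number() alternative for "latest version per document",
-- assuming an InsertedDate column existed on DocumentVersion.
select Id, Name, VersionId, FileName, Extension, FileSize, NumVersions
from (
    select d.Id, d.Name, v.Id as VersionId, v.FileName, v.Extension, v.FileSize,
           count(v.Id) over ( partition by d.Id ) as NumVersions,
           row_number() over ( partition by d.Id
                               order by v.InsertedDate desc ) as rn
    from Document d
    inner join DocumentVersion v on d.Id = v.DocumentId
) x
where rn = 1
```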

This query is ugly, but it works. We could optimize it and tweak the way it's written to try to get the best possible performance out of it. But is this really the best way to do this? If we think about it, we're going to be looking at all the versions for every document returned in our search. The more documents we return, the worse this is. But worse, we're going to do WAY more reads than updates in this case. New versions simply are not going to be added that often. So it seems silly to keep looking up information about the versions over and over again when we know it's unlikely to have changed since the last time we looked.

Wouldn’t it be better to cache this information on the Document table so we don’t have to keep calculating it repeatedly, thereby simplifying the query and improving its performance?

To do this, we simply add “NumVersions” and “CurrentDocumentVersionId” columns to the Document table. But now we have to keep these columns up to date. There are a few ways to do this:

  1. Trigger on DocumentVersion updates Document’s cached columns on Insert/Delete
  2. Code that does inserts or deletes to DocumentVersion must update cached columns
  3. Cached columns are calculated columns that use a function to lookup values
We'll take these in turn. #1, using triggers, has these benefits:
  • Ensures the columns will always be up to date, no matter how the version records are changed
  • Code will be in just one place and we won't have to worry about it again
However, triggers have these drawbacks:
  • Slow (like, REALLY slow. Batch operations like inserting or deleting many versions will slow to a crawl)
  • Potential for subtle and hard to track down deadlock problems
  • Increased code complexity because the trigger must be written to handle ALL cases, even if you only use some (ex: inserting many versions at once)
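For reference, an option-#1 trigger might look roughly like this (a sketch; it recalculates the cached columns for every affected parent so multi-row inserts and deletes are handled):

```sql
-- Sketch: trigger keeping Document's cached columns current (option #1).
-- Written against the inserted/deleted pseudo-tables so multi-row
-- statements are handled correctly.
create trigger trg_DocumentVersion_Cache
on DocumentVersion
after insert, delete
as
begin
    set nocount on;

    update d
    set NumVersions = ( select count(*) from DocumentVersion v
                        where v.DocumentId = d.Id ),
        CurrentDocumentVersionId = ( select max(v.Id) from DocumentVersion v
                                     where v.DocumentId = d.Id )
    from Document d
    where d.Id in ( select DocumentId from inserted
                    union
                    select DocumentId from deleted );
end
```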
On #2, updating cached columns when updating Versions, we have these benefits:
  • Simplest possible code
  • Performant
  • Deadlock issues are easier to see and handle
But it comes with its own downsides:
  • Same code may end up in many places (ex: Insert and Delete stored procedures, if using sps)
  • Potential for error if someone inserts/deletes versions from a new location and forgets to update the cached columns
#3, using calculated columns, is essentially the same as putting the lookup logic in the query (since you can't persist the value), but with the added overhead of a function call.

So, between #1, #2, and #3, which is the right option?

I used to use triggers, out of fear that someone would forget to update the columns if I went with #2. But the performance and deadlocking issues with triggers have now caused me to go with the "API layer" approach of #2.

I think the answer, as always, is it depends. If the tables you're using are likely to be always accessed through an API layer, then you should go with #2. But if many people will be manipulating those tables from many different areas and there is no central API layer, you're pretty much forced to go with #1.

And the question remains, is it really worth caching the data this way, or should you just keep the lookups in the queries. Once again, my favorite theme for this blog: it depends. The big question is really performance, and that depends on how the queries will be used. Are you going to be returning thousands of results, or just hundreds? Are you going to be running this query often?

In SQL there is no one-size-fits-all rule. And worse, SQL is so complex it has to be treated as a black box, meaning you really can't reason about it. Therefore, your only hope is to test, and test, and test. You pretty much have to write the query every way you can imagine and then performance test each one... and that takes a lot of time.

As Scott Hanselman would say, "Dear Reader," what do you think? Have you been faced with this issue? What did you do?