Tuesday, June 30, 2009

SQL Deadlocks: More with child data

In last week's post, SQL Performance: Child data, I wandered through an issue involving caching data about a parent table's child data. In that post I talked about writing SQL that would save a foreign key on the parent table to the most recent child record. This is very simple and would look like this:

begin tran TxExample
BEGIN TRY

declare @NewChild int

insert into Child ( blah, blah, blah ) values ( @blah, @blah, @blah )

select @NewChildId = SCOPE_IDENTITY()

update Parent set CurrentChildId = @NewChildId where Parent = @ParentId

if @@TRANCOUNT > 0 commit tran TxExample

END TRY
BEGIN CATCH
if @@TRANCOUNT > 0 rollback tran
END CATCH

This is relatively straight forward. It inserts the child, then updates the parent's cached data. Those two operations are wrapped in a transaction and a try catch to ensure that if anything should fail for any reason, both statements will be rolled back. This ensures data integrity.

And now it's time to talk about deadlocks. This code is susceptible to deadlocks. As a relatively contrived but none the less possible example suppose the following SQL could also be run:

begin tran TxExample2

update Parent set blah = @blah where ParentId = @ParentId

...

select * from Child where ChildParentId = @ParentId

commit tran
If these two queries were to run at the same time, operating on the same parentId, and SQL Server were to context switch them at just the write moment, they would deadlock. Specifically, if the first query completed its insert statement (line 6) and then SQL switched to the second query, we would deadlock.

This is because when the second query tries to select from the Child table, it will wait because the first query has inserted a new row and SQL Server's default isolation level is read committed, which means dirty data will not be read, instead it will wait for the data to be committed. So it's going to sit there, waiting for the first query to commit.

This isn't a deadlock yet. The deadlock happens when SQL switches back to the first query and attempts to execute the update on the parent. When it does this, it will try to obtain an exclusive lock on that parent row, but it won't be able to because the second query already has an exclusive lock from it's update. So it will wait for the second query to commit.

The first query is now waiting for the second query which is waiting for the first query and you have yourself a deadlock.

Before we fix it, we should ask ourselves "is this a big deal?" The answer is, it depends, but in general yes. If all your SQL is small and all your transactions complete quickly and you don't have very many users banging on the system concurrently then you probably wont see any deadlocks. But unless you can guarantee that all those conditions will remain the same you have to be at least a little worried. And if those conditions don't apply to you, you definitely have to be worried.

So how do we fix it? First thing we could do is to commit the transaction in the second query before executing the select. If this is possible, then it's a good idea. You want your transaction to commit as quickly as possible and you want to touch as few objects as you can while in the transaction. That said, there are plenty of reasons why you might not be able to commit the transaction after the update. For example, maybe you're reading the child data because you need it to perform another update, and those two updates have to be in the same transaction. In that case, there is nothing you can do to fix query #2.

But even if you could fix query #2, someone could some day come along and write query #3 which would introduce the same problem again. So what we really need to do is fix query #1. The way we do that is by having query #1 obtain a shared lock on all the resources we know it will need to touch, immediately at the top of the query.

Add this code after the BEGIN TRY:
set transaction isolation level repeatableread
select ParentId from Parent where ParentId = @ParentId
set transaction isolation level readcommitted

With this code in place, query #2 will not be able to execute it's update until query #1 completes. Thus, preventing the deadlock and saving the day!

This example was simple but the deadlock was still subtle and hard to see. This problem just gets more complicated the more complicated your SQL gets. And your SQL will get more complicated in direct relation to how complicated your data schema is. So you really have to be on the look out for this issue.

Before I wrap this up, I should mention that if you need to lock more than just one row in one table at the top of your query (like we did in query #1), life can get interesting. If the tables you are locking are all related you can lock them by inner joining to them. But if they are unrelated, things get interesting. If they're unrelated, you can't join from one to the next, so you need to execute separate select statements. And if two queries need to lock the same records in two unrelated tables, but they lock them in different orders (A, B vs. B, A) you can end up with a deadlock! For these cases you have to resort to what you learned in your operating systems class: always lock all your resources in the same order. Good luck with that.

I'll leave you with some rules of thumb, which apply to most cases but, of course, not all:
  1. Keep your transactions as small as possible by touching as few objects as possible
  2. Keep your transactions as fast as possible: if you have a query that can execute on n records in a single transaction where n is unbounded you are likely to find yourself in a world of hurt
  3. Obtain shared locks on everything your transaction will eventually require exclusive locks on before you acquire any other locks
  4. If you need to do any reads that don't need to be repeatable, do them before you obtain any shared or exclusive locks (this is really just in keeping with #2)
  5. If you set the transaction isolation level to repeatable read make sure you're setting it back to read committed (even if its the last line of your query, this will make sure triggers don't execute in repeatable read)
Now, all of this has been learned by trial and error, experimentation, and a lot of reading. If you know of other ways around these issues, or if you have a different take, I definitely want to hear about it.

Monday, June 29, 2009

SQL Performance: Child data

In this post I’d like to talk about a specific issue in SQL and all the various ways you could approach it. Specifically, I’d like to talk about dealing with a parent entity’s child data. As an example lets use a very simple document database that stores documents and their versions. Obviously, a document will have many versions but each version will have only one document, as in this diagram:

Document Db Diagram

This is not a complete database, clearly, but it indicates that the DocumentVersion table is a child of the Document table.

Now we get to the part where this gets at least partly interesting. Lets say we’re going to write a search that returns documents. In the results we want to display information about the current version of each document as well as the total number of versions for each document.

This is a surprisingly non trivial query…

select d.Id, d.Name, v.Id, v.FileName, v.Extension, v.FileSize,
( select count(Id) from DocumentVersion where DocumentId = d.Id ) as numVersions
from Document d
inner join DocumentVersion v on d.Id = v.DocumentId
where v.Id = ( select max(Id) from DocumentVersion where DocumentId = d.Id )

Now, there are a bunch of ways to write this, but this is a perfectly good example. Notice we have an inner select in the select clause to get the number of versions and we have another select in the where to get “latest” version. Here I’m depending on SQL Server’s Identity Specification to give me the latest row because it simplifies the query. If we didn’t want to do that, I’d have to either “select top 1” while ordering by the inserted date (which isn’t on the table in our example) or use a row number function and get the row where the row number = 1 again ordered by the inserted date. Both of these query are correlated, meaning they're run for each document in our results.

This query is ugly, but it works. We could optimize it and tweak the way its written to try to get the best possible performance out of it. But is this really the best way to do this? If we think about it, we’re going to be looking at all the versions for every document returned in our search. The more documents we return, the worse this is. But worse, we’re going to do WAY more reads than we are updates in this case. New versions simply are not going to be added that often. So it seems silly to be constantly looking up information about the versions over and over and over and over again when we know its unlikely it will have changed from the last time we looked at it.

Wouldn’t it be better to cache this information on the Document table so we don’t have to keep calculating it repeatedly, thereby simplifying the query and improving its performance?

To do this, we simply add “NumVersions” and “CurrentDocumentVersionId” columns to the Document table. But now we have to keep these columns up to date. There are a few ways to do this:

  1. Trigger on DocumentVersion updates Document’s cached columns on Insert/Delete
  2. Code that does inserts or deletes to DocumentVersion must update cached columns
  3. Cached columns are calculated columns that use a function to lookup values
We'll take these in turn. #1, using triggers, has these benefits:
  • Ensures the columns will always be up to date, no matter how the version records are changed
  • Code will be in just one place and we wont have to worry about it again
However, triggers have these drawbacks:
  • Slow (like, REALLY slow. Batch operations like inserting or deleting many versions will slow to a crawl)
  • Potential for subtle and hard to track down deadlock problems
  • Increased code complexity because the trigger must be written to handle ALL cases, even if you only use some (ex: inserting many versions at once)
On #2, updating cached columns when updating Versions, we have these benefits:
  • Simplest possible code
  • Performant
  • Deadlock issues are easier to see and handle
But it comes with it's own downsides:
  • Same code may end up in many places (ex: Insert and Delete stored procedures, if using sps)
  • Potential for error if someone inserts/deletes versions from a new location and forgets to update the cached columns
#3, using calculated columns, is the same as putting the lookup logic in the query (since you can't persist the value) but has the overhead of a function.

So, between #1, #2, and #3, which is the right option?

I used to use triggers, because of fear that someone would forget to update the columns if I went with #2. But the performance and deadlocking issues with triggers has now caused me to go with the "API layer" approach of #2.

I think the answer, as always, is it depends. If the tables you're using are likely to be always accessed through an API layer, then you should go with #2. But if many people will be manipulating those tables from many different areas and there is no central API layer, you're pretty much forced to go with #1.

And the question remains, is it really worth caching the data this way, or should you just keep the lookups in the queries. Once again, my favorite theme for this blog: it depends. The big question is really performance, and that depends on how the queries will be used. Are you going to be returning thousands of results, or just hundreds? Are you going to be running this query often?

In SQL there is no one size fits all rule. And worse, SQL is so complex it has to be treated as a black box, meaning you really can't reason about it. Therefore, your only hope is to test and test and test. You pretty much have to write the query every way you can imagine and then performance test each one... And that takes a lot of time.

As Scott Hanselman would say, "Dear Reader," what do you think? Have you been faced with this issue? What did you do?

Monday, June 22, 2009

The Paradox of Quality

In college I had a professor named Dr. Lang. He was a great guy, and he was full of great quotes. Some of them were just inane, but others were serious and insightful. Though, usually even those were fairly silly. One of those was, "All programmers are control freaks."

There is a lot of truth to that, but I think there is a deeper reason that causes that control freak behavior. The control freak behavior is just a side effect of a deeper motivation, namely that developers are quality freaks.

The book "First, Break All The Rules" is a really great "management" book that focuses on the concept of "talents." They use the word talent in a different way than we typically use it in conversation, and they break it down in to different kinds of talents. Of particular interest is what they call Striving talents,
"Striving talents explain the why of a person. They explain why he gets out of bed every day, why he is motivated to push and push just that little bit harder. Is he driven by his desire to stand out, or is good enough good enough for him? Is he intensely competitive or intensely altruistic or both? Does he define himself by his technical competence, or does he just want to be liked?"

Given this definition, I think we can make the case that developers have the Quality Striving Talent. This is what drives us to want our code to be perfect and our UIs to be perfect and our architecture to be perfect and our designs to be perfect and on and on. Without this talent, I think a developer would go completely insane!

But, quality is not a very straight forward concept, which brings us to:

Paradox #1: Programmers vs Management
Programmers, being so quality driven, want quality in everything they do. Like, the aforementioned, code, UI, architecture, design, etc.

On the other hand management wants a product that people will buy. Obviously, quality factors in here too: people will want quality in the product they buy, so management will want quality in the product they sell. But management doesn't need quality that goes as deep as the programmers are interested in. After all, very little of the details of programming are visible to the users or to management.

The paradox here is clear, and it explains an awful lot of the friction you can encounter between programmers and management.

Paradox #2: Quality is in the eye of the beholder
For example, one person may like a UI with bright color, gradients, shadows, reflections, slidy animations and some content mixed in where it fits. Other people might look at that UI and see distracting noise and bloat. They would prefer black on white densely packed text emphasizing content.

To each his own I guess. But, which of these UIs is higher quality?

Which of these paintings is higher quality?

Picasso - Weeping Woman  da Vinci - Mona Lisa

Paradox #3: Defining Quality
This has been a lot of talk about quality, but what is quality? Here are some factors you might look at to define the quality of an application:

  1. Bug count
  2. Learnability
  3. User Friendliness
  4. Appearance
  5. Feature parity
But there's a problem, some of these are competing concepts. My favorite example as always is Vim and Notepad. If you sit a normal person down in front of Vim they wont even be able to get text into it. But watch a real Vim wizard work and you'll think its magic. Vim is a powerfully usable editor, but it's very very difficult to learn. On the other hand, any moron can use Notepad, but the morons and the experts both use it the same way. It's learnable, but not terribly powerful, and therefore not nearly as usable.

What about factors for code quality:
  1. Bug count
  2. Simplicity
  3. Cohesion
  4. Readability
  5. Lines of code
  6. Style
  7. Broken Windows
Again, we find competing concepts! Take simplicity and cohesion for example. To achieve cohesion you may follow the SOLID design principles, all of which involve increasing abstraction, adding interfaces, and so forth. But all of these things detract from simplicity and increase the number of lines of code. Paradox!

Realizing that quality means different things to different people at different times is important to remaining sane. Especially when you're a quality driven developer. But as we've seen, quality is a slippery contradictory customer, and this really makes things non-trivial. One thing that might help is focusing on an intended audience to help you decide just what quality matters the most to that audience.  By focusing on an imaginary third party’s quality needs you can remove some of your own quality driven needs and find acceptable compromises.

Tuesday, June 9, 2009

An Anemic Community

I am a .NET developer. I've dabbled with Ruby and Python, learned Java and C++ in college, but there's no two ways about it, I'm a .NET dev. Occasionally I think it would be nice to work at one of those dev shops that hires out bodies to companies that need them so that I'd get to work with more languages. But most of the time I like being specialized, it means I have time to really learn my language and its tools. And it should mean that I turn out higher quality product because of it.

.NET is a great platform. The language is very much at the front of the curve as far as strongly typed languages go, and with C# 4.0 coming it will soon be incorporating a number of dynamic language concepts as well. Plus, as IronRuby and IronPython mature, it may become feasible to work with dynamic languages for some things and static for others, which is cool. Add to this picture LINQ, Windows Forms, WPF, WCF, ASP.NET and MVC and you're looking at a pretty compelling platform.

Now, the platform itself has its issues. The major one being that the tooling is targeted at the "least common demoninator" developer, which can make things interesting for a dev working with non-trivial applications in a non-trivial environment. But, that's a topic for another blog post. For our purposes here, lets agree that .NET as a platform is pretty darn good.

Unfortunately, the .NET community kind of sucks. There has been some improvement here in recent years with things like ALT.NET, but for the most part, the community remains pretty anemic. Compare it with the Java community for example. The vast majority of the non-microsoft products in .NET are ports from Java: nhibernate and nunit for example. Or compare it with jquery, the number of quality plugins and code samples is just ridiculous. It's a bloody fruited plain!

Now take a look at the blog posts that turn up from Google searches for .NET stuff. Especially around WPF and WCF all you find is amature quality stuff. Look at Codeplex, all you find are tons of discontinued projects people started, and what work remains there is terrible. And don't even get me started on the half solutions that riddle Code Project.

Why is this? There are a large number of .NET developers. There are all kinds of conferences and user groups. So why would it be that there is so little happening on the web with .NET. And why is it that what is happening is so bad?

I think its because .NET is a corporate community. The people who are sharing their work or writing code sample blog posts are either amatures, college kids, or "low ranking" people just getting started in their career. The actual "professionals" who are doing good work are doing it for some company or other and are unable to share it on the web! The few who actually do get involved on the web stick with waxing philosophical on their blogs (kind of like what I do...) and don't really add much value. Worse, if there were projects available, many companies wouldn't use them simply because they weren't Microsoft's.

The corporate view point is that any ideas or tools or products that have been developed by their people represent a competitive advantage. Giving away your advantage in the interest of helping the community is just stupid. Who would do such a thing?

Because of this there is little organizing going on. People are not starting useful and interesting projects and finding contributors. They can't, for fear that their employer will find out and they'll get into some kind of trouble they don't understand.

This is a real problem for .NET. I think that's why Microsoft works so hard at planting .NET bloggers in the community, like some kind of covert operation run by the CIA. That's helpful, because those people are writing good useful stuff. But it all falls short of any kind of organizing. And it also tends to keep its sights quite low, not really trying to contribute anything impressive.

The fact that no one is sharing really lowers the averge state of the art across the board. The ridiculous strides taken by Ruby on Rails and jQuery are examples of how much an open oriented platform can do. It makes it easier for people to jump on board and do non-trivial things. It also makes it more likely that as those people gain experience they'll contribute back. If it goes well, you can get a kind of snow ball effect where the people keep getting better and better and so does the technology.

I'm unclear on how the .NET world could begin to do this, since its so corporate driven. All I know is the low quality material makes a .NET developers life much much harder. And at times, its really frustrating.

Monday, June 8, 2009

Thinking like a TDDer

Welcome to my 100th post! I'm going to return to a topic that I've touched on many times in the past, namely, unit testing.

Unit testing is one of those things that is very attractive in principle but gets ugly real fast in practice. I've tried it countless times but never managed to follow through. But in principle, I'm a fan. Having tests to fall back on to help prevent regression and to make refactoring easier and less stressful is a big win. And following the TDD process is very rewarding. Every couple minutes you get another green bubble, encouraging you, telling you you're making progress and doing a good job. Plus, unit testing reinforces the benefits of following good design practices like loose coupling, DRY, and Do One Thing (which are all the same...).

For me personally, unit testing still has 2 primary stumbling blocks:
  1. Mocking
  2. "Maintenance" development
I've talked about many other "pitfalls" including Mocking in an earlier post called Unit Testing Pitfalls. I still think those are true, but I'm coming to terms with the ways around some of them. However, mocking still drives me crazy. I hate having to specify every method and/or property that my object under test is likely to call on the dependency. On the other hand, I hate writing Stub objects because it takes time and effort and you frequently make mistakes that cause tests to fail. Maybe I just need to learn all the features of RhinoMocks (if only the documentation was better) or learn Moq or something. If I find a way to resolve this, I'll let you know.

The second issue is "Maintenance development," by which I mean, going back to code you've written and changing it. When I'm first working on some code I find TDD very easy, but when I have to go back to it to fix bugs or add features I always go straight for the code and cause my Unit Tests to be out of date. And therefore, worthless.

A Continuous Integration server which ran the Unit Tests would force me to keep the tests up to date, but I don't like to be forced to do things. In fact, I'd probably just start commenting out the failing tests so I could get my bug fix checked in. Yeah, I know! Hideous behavior right? Should be punishable by death. But I'm lazy, and so are you.

My problem is that I'm still not thinking like a TDDer. If a TDDer needed to fix a bug, they would start by adding a failing test, then they would work until the test passed, THEN they'd try to repo the bug and make sure it was gone. That's the process they would follow, but what's really interesting in the mindset that process requires.

You've probably heard people say that tests are as good as if not better than documentation. I've even heard people says tests can replace specifications! These people are completely insane! Lock them up in a padded room, throw away the key, and please please please don't let them write any more blog posts!

Ok, I know, they aren't REALLY crazy, they're just using the words to mean something different than what the words REALLY mean. Welcome to software development, we love overrides so much we can't resist overriding words in the English language!

So if you fight through the crazy and think about the documentation argument for a moment, that unit tests are documentation, you can see where this might make sense. First, assume we're talking about API documentation and not end user documentation, obviously. I'd like to see an end user figure out which button they need to click on the screen by reading a bunch of unit tests. Don't worry, I'm sure they'll re-up your contract next year!

Anyway, you have a bunch of tests like, Apply_is_enabled_when_required_fields_are_filled() and so forth. A smart enough and patient enough person could read through tests like this and figure out what the object under test does and how to code against it. In fact, because of the volume of code in the tests, they'd be able to understand and extract much more meaning than they could from some MSDN style documentation.

But can you imagine if the only documentation Microsoft published for the .NET framework was its unit tests? The MSDN style docs are hard enough most of the time! Unit tests ARE NOT documentation! But they are incredibly documentative (made up words are fun!), and they are very very useful for someone who intends to work on the object under test. So, it makes a lot of sense to think about them this way. That is, its a useful mindset.

What about the argument that they are specifications. This one is even more off base. A specification is what you write up front to describe what the software should do and how it should work and basically what it should look like. Yeah, Agile people jump up and down and say you don't need specs. "Just write a story card!" they squeak. "Pin it to your wall!" Sorry little agile buddy, your story card is not a spec. It's not even the replacement of a spec. At some point you're still sitting down and figuring out what the software needs to do, and deciding how it should work. Maybe you're meeting with your "customer representative" and you show him your little story card and he starts rambling on about the details of that story while you furiously scribble notes on a legal pad. Then you sit down with your programming buddy (read: "pair") and you start writing code. Well, the scribbles on your note pad are the spec.

Despite all the condescending language in that paragraph, this is actually fine by me. As long you manage to capture enough detail in that meeting and are able to think through the consequences of your implementation decisions as well as the issues of integrating this part with the other parts of your software, this is a perfectly acceptable way of producing specifications. You might want to be a bit more diligent if you are writing the software that flys the space shuttle, but for most business and banking apps, I think you'll be just fine.

But, back to my point, unit tests are not Specifications. However, they do specify how the object under test should behave. In fact, they *should* specify absolutely everything about how that object should behave, if they're going to be truly effective. And this again indicates something about the mindset of the TDDer.

If unit tests are really going to work, they have to be more than regression tests. And they have to be more than a crutch you lean on during "initial development." They have to be both documentation and specification. This mind set leads you to all kinds of realizations.

For example, to be effective documentation and specification, they have to focus on behavior and they have to have good names. So, "Add_returns_false_null_person" is not a good test name. Maybe you should have gone with "Add_fails_when_person_is_missing."

If you're thinking about your tests this way and someone finds a bug, what's the best way to approach it? Not diving into the code and looking at stack traces trying to find the line of code that's in error, no sir! If there's a bug, you must have missed something in your specification of the problem. So what you're going to do is go look through your spec, find the missing piece that explains the bug, then update your spec.

What this means is that your goal when you're writing your tests shouldn't really be code coverage. And it shouldn't be to test every input/output combination. It should be to fully describe and specify all the required and expected behavior of the object you are testing. These should more or less turn out to be the same thing, but because the mindset is so completely different the details will be different. The details will be better. And you might actually stand a chance at keeping those tests up to date. Heck, you might even enjoy keeping them up to date instead of feeling like it's a chore.

Well, I don't know, that's asking a lot.

Is changing the way you approach your tests going to keep the real life details from getting ugly? Nope. You're still going to end up with test code that is ridiculously repetitive because you have to test the same method with these inputs, and those inputs, and those other inputs. And you'll still have to update all this repetitive code when you decide to change the type of one of the method parameters. And you'll still struggle with different ways to refactor your tests to cut down on the repetitiveness, which adds more abstraction and occasionally doesn't work out so well. And don't forget, you'll still be mocking out the dependencies. You know how I feel about mocking.

But now that your tests are more valueable, at least in your mind, this might all be worth it. I guess we'll just have to try it and see.

Friday, May 1, 2009

What Mercurial Can't Do: Subtree Repos

On the Mercurial website, at the very bottom of this page is a section titled "What Mercurial can't do" which says it can't host related projects together in one repository. Then it links out to the Forest Extension which makes nesting repositories easier to work with (and which comes with TortoiseHg, you just have to turn it on).

Its true that Mercurial can't do this, not even with the Forest Extension. Apparently there are plans for Mercurial to build some kind of subtree repo support in the near future. However, I'm not sure what it will look like. It might not enable the behavior I'm going to discuss here. We'll have to wait and see.

To describe what this really means, here's a possible layout you might find in TFS:
$\
MathUtils\
GraphUtils\
ProjectA\
ProjectA\
MathUtils\
GraphUtils\

In this example MathUtils and GraphUtils are seperate projects which have been written to be reused by other projects. ProjectA is reusing both of them.

ProjectA could just reference $\MathUtils and $\GraphUtils directly. But if there are many projects reusing them and making changes to them that could get hairy. Every time someone in ProjectB changes MathUtils and checks in, they could potentially break ProjectA immediately. To avoid this, ProjectA wants to isolate itself from changes other people might be making to the shared utils. To do this, they create a branch of MathUtils and GraphUtils in their own project. Now they can decide when the right time to bring in changes from other people is, or when the right time to share their own changes with other people is.

To bring in changes from outside in TFS you'd right click on $\MathUtils and say merge, selecting $\ProjectA\MathUtils as the target. To push changes back in TFS you'd right click on $\ProjectA\MathUtils and say merge, selecting $\MathUtils as the target.

In TFS, if you made changes to ProjectA and MathUtils and checked in, you'd get a single checkin containing all changes. When you then merged MathUtils back, the changeset would basically be split so only the changes to $\ProjectA\MathUtils got merged to $\MathUtils (For more on how TFS does merges, check out this post What Mercurial Can't Do: Merge by Changeset). This works because TFS doesn't respect "changesets" across branches.

This does not work in Mercurial. You could setup the same directory tree structure, but you couldn't have just one repository. Instead, you'd have to have $\ProjectA\MathUtils and $\ProjectA\GraphUtils setup as their own repositories nested in the $\ProjectA repository. Mercurial is smart, and it wont try to check in MathUtils or GraphUtils to ProjectA because it recoginizes them to be distinct repos.

So this setup actually works just fine, but it's more work. In TFS, we could do one checkin. In Mercurial, we'll have to checkin changes to each repo individually. That's probably not such a big deal, but in TFS you didn't have to keep track of where you were making changes. In Mercurial you might forget you changed something in MathUtils, and then you might forget to check it in. This is because when you do an "hg status" you'll only see changes in ProjectA. This is where the Forest Extension comes in handy. It adds an "hg fstatus" command which is basically a recursive status command. That way you wont lose track of your changes, but you still have to commit them individually.

But lets think about this for a minute. Is it really a smart thing to do, having one changeset with changes that affect both ProjectA and MathUtils? What's your checkin comment likely to be? Probably something like "ProjectA's xxx feature now can whizbang." If you're looking at the change history for $\Project\MathUtils and you see "ProjectA's xxx feature now can whizbang" does that makes any sense? Does it tell you what changed in MathUtils?

So once again, this is something that Mercurial just can't do. But maybe that's a good thing.

Thursday, April 30, 2009

What Mercurial Can't Do: Merge by Changeset

Mercurial is a distributed source control system that works very well on Windows and has great windows shell integration with TortoiseHg.

TFS is a centralized behemoth that does source control but also integrates (usually poorly) with every product Microsoft has ever released (not including Bob).

I've used both of these, but I've used TFS much more extensively. I recently started looking into what it would take to switch from TFS to Mercurial and was rather surprised to find a couple things that TFS can do that Mercurial cannot.

The first of these is the ability to do a merge by changeset. In TFS, say you create some branches as follows:
  1. Create a new TFS project called Project
  2. Check in some source at $\Project\Source
  3. Branch that source to $\Project\NewBranch
  4. Do 3 checkins to $\Project\NewBranch
Suppose you're not completely done with whatever it is you're working on so you can't merge all three changesets back to $\Project\Source just yet. But lets say your second changeset includes a bug fix to a file that you wont be changing as part of your "main" work in the other changesets. Maybe you want to go ahead and merge this bugfix back to $\Project\Source right now.

In TFS you could do this very easily.
  1. In Source Control Explorer, go to $\Project\NewBranch
  2. Right click and select "merge"
  3. Change to the "Merge by selected changeset" radio button
  4. Make sure the target is $\Project\Source
  5. Click next
  6. On the next page, select only the second changeset
  7. Click next and the merge is performed
Ta-Da! You just merged changeset #2 without merging changeset #1 or #3.

You cannot do this in Mercurial, at least, not with a "merge" operation. The only way to accomplish the same type of thing would be to create a patch out of NewBranch and apply it to Source using the hg export and hg import commands.

So the big question is, why can't you do this in Mercurial? The answer goes to the heart of what makes Mercurial so different from TFS. The first thing to realize is that Mercurial does not have "branch lines."

In TFS when you branch code TFS knows that the one is a parent of the other and you can only merge across that branch line. This means you can't do merges between two siblings. For example, in B -> A <- C, you can't merge B to C. You can only merge along the branch lines (Unless you do a baseless merge, which doesn't really count). In Mercurial, the normal way of creating a branch is to simply clone the repository, which means you have a full copy of the entire history of the repo. The image to the right shows what an hg pull would look like when bringing changes from a cloned repository into the original repository.

"A" represents the starting point. "B" represents the first change from the repository we're pulling in. What you can see here is that when you pull changes from mercurial, a temporary "branch" is created containing all the changesets from NewProject in parallel with any changesets from Source.

"C" is the only actual merge, in the tranditional way we think of merging, because it actually brings all the changes together. This is beautiful in its simplicity because until you get to C you don't have to do any work. Each changeset represents what changed from the parent, so you just import all the changesets and associate them with the correct parent. Then only at the very end do you have to do any merges.

A merge in TFS is not so clever. Basically all TFS does is figure out every file that changed on either side, and do a 3-way merge on each in turn, resulting in a new changeset. The upside of this is we can select a single changeset, ignore all the changesets around it, and do a merge.

In Mercurial, you can't pick just one changeset and merge. You have to merge all the changesets before it too because that's the definition of how a merge works in Mercurial. The upside of this is that all the changesets are preserved.

For example, say you want to know who added a certain file. In mercurial, you'll be able to figure this out regardless of what "branch" (cloned repository) it was added in. In TFS, you're screwed because the file will be added in a "merge" changeset. The merge may not have been done by the same person who added the file (in fact, it usually wont be), so to find out who added it, you have to manually follow the branches and inspect the history on each in turn. The same is true (and worse, actually) if you want to know who updated a line in a certain file.

Sadly for me, we're constantly "de-tangling" our changes by doing merges by changeset. But lets think about that for a minute. Is merging a single changeset even a sane thing to do? It turns out, not so much, because its possible for this to result in a broken state. Here's how:
  1. Joe Bob adds a new file "hippo.cs" and updates the C# project file
  2. Joe Smith adds a different new file "giraffe.cs" and updates the C# project file
  3. Joe Smith merges his changeset and ONLY his changeset up
  4. The result of the merge does not compile. The error is, "Can not find file "hippo.cs"
WTF!? Why is it looking for hippo.cs?! We didn't merge that changeset! How does it even know about hippo.cs? Its because Joe Bob and Joe Smith both changed the C# project file. When Joe Smith changed it, Joe Bob's changes were already in it. So Joe Smith's final C# project file includes Joe Bob's "hippo.cs" file. But when Joe Smith did a merge, he didn't include Joe Bob's changeset, so the hippo.cs file didn't get merged. And now you're broken.

This happens like all the freaking time with project files (which are the bane of branches and merges). But fortunately it's easy to fix. Just remove the missing files from the project file. But I think you can probably see that if this happened to any file other than the project file, like a real source file, you'd be in a world of hurt.

I'm actually STUNNED, given how much effort TFS puts into protecting the users from themselves that it allows you to merge selected changesets in this way! But it does. And its a feature that Mercurial just can't match, even if it is a feature that can lead to trouble. But maybe that's a good thing.