ADO.NET 2.0: DataTables, Expression Columns, and Poor Performance

Over the last few days I’ve been working on a high priority performance issue for a client. This issue arose not long after they started using the .NET 2.0 framework, and it was a big enough issue to delay an impending release to clients. I was told that when certain elements on the screen were refreshed by clicking a “Refresh” button, that the performance of the application paired with .NET 2.0 was not acceptable—although it was when paired with .NET 1.1. I was also told the bottleneck occurred in the DataSet.Merge method. “Interesting,” I thought before saying, “Shouldn’t take me too long to figure it out, I think.”

I did figure it out, but it took a while longer than I thought it would.

The problem centered around the “new and improved” ADO.NET classes (this was the problem at a high-level, anyway). ADO.NET is new and improved, but if you’re not careful you may end up wondering why certain operations are slow. If you call DataSet.Merge or DataTable.Merge and the DataTable(s) have DataColumns that make use of the Expression property, you may run into performance problems. I say “may” because it all depends upon how many columns have expressions defined as to whether the problem will hit you. Although I don’t have any testing in place to discover when the number of tables and expressions will begin to cause a problem when merging tables, in our particular case we were dealing with a DataSet with 8 tables that contained a total of 22 expression columns. The fix for this, which I’ll describe below, increased performance by an order of magnitude, moving the tests I had in place for this scenario from anywhere between six and ten seconds to between 600 and 1000 milliseconds.

The fix for the issue is relatively simple: Prior to merging, store the string value of the Expression property for every DataColumn, set every DataColumn’s Expression property to null or string.Empty, and after merging add the stored expressions back to each corresponding DataColumn.

I’m not entirely sure how the changes in ADO.NET 2.0 caused this to happen. To be honest, I’m not even sure if the issue didn’t exist in the 1.1 framework because I didn’t see it myself, and I haven’t run any tests against 1.1. However, our client definitely saw a big difference, so since the issue is fixed I don’t find a lot of value in trying to discover the differences between 1.1 and 2.0 in terms of performance.

However, in the course of discovering the issue, I did reverse engineer the System.Data assembly and although a lot of the algorithms between framework versions look similar in regards to DataSet/DataTable merges , the evaluation of Expressions has definitely changed. Seems to me like Microsoft let this issue slip through their fingers.

On the other hand, I suppose, as I always have, I’d recommend staying away from using DataColumn expressions in the first place. They might help you get something done in a way that appears as though it is easier, but feeding in an expression algorithm via a string (that often emanates from an XSD file instead of a code file) doesn't work for me. I’d much rather see real, compiled code that does the same thing—with tests written against it.