.NET (283) administrative (42) Ajax (42) ASP.NET (144) bicycle (2) books (152) browser (7) C# (121) cars (1) chess (25) CodePlex (10) Coma (7) database (37) deployment (3) Entity Framework (2) essay (96) flash/shockwave (2) flex (1) food (3) friend (2) game (14) idea (5) IIS (8) javascript (76) LInQ (2) Linux (6) management (4) manga (35) misc (590) mobile (1) movies (70) MsAccess (1) murder (2) music (62) mysql (1) news (96) permanent (1) personal (58) PHP (1) physics (2) picture (261) places (12) politics (13) programming (462) rant (107) religion (3) science (38) Sharepoint (3) software (53) T4 (2) technology (10) Test Driven Development (4) translation (2) VB (2) video (87) Visual Studio (44) web design (45) Windows API (8) Windows Forms (2) Windows Server (4) WPF/Silverlight (60) XML (10)

Wednesday, April 09, 2008

The DataRow value setter is slow!

It was about time I wrote a smashing IT entry. Here is to the obnoxious DataTable object, something about I have written before of bugs and difficulty in handling. Until now I haven't really thought about what kind of performance issues I might face when using it. I mean, yeah, everybody says it is slow, but how slow can it be? Twice as slow? Computers are getting faster and faster, I might not need a personal research into this. I tried to make a DataTable replacement object once and it was not really compatible with anything that needed DataTables so I gave up. But in this article I will show you how a simple piece of code became 7 times faster when taking into account some DataTable issues.

But let's get to the smashing part :) I was using C# to transform the values in a column from a datatable into columns. Something like this:
NameColumnValue
GeorgeMoney100
GeorgeAge31
GeorgeChildren1
JackMoney150
JackAge26
JackChildren0
JaneMoney300
JaneAge33
JaneChildren2


and it must look like this:

NameMoneyAgeChildren
George100311
Jack150260
Jane300332


I have no idea how to do this in SQL, if you have any advice, please leave a comment.
Update: Here are some links about how to do it in SQL and SSIS:
Give the New PIVOT and UNPIVOT Commands in SQL Server 2005 a Whirl
Using PIVOT and UNPIVOT
Transposing rows and columns in SQL Server Integration Services

Using PIVOT, the SQL query would look like this:

SELECT *
FROM #input
PIVOT (
MAX([Value])
FOR [Column]
IN ([Money],[Age],[Children])
) as pivotTable

Anyway, the solution I had was to create the necessary table in the code behind add a row for each Name and a column for each of the distinct value of Column, then cycle through the rows of the original table and just place the values in the new table. All the values are present and already ordered so I only need to do it using row and column indexes that are easily computed.

The whole operation lasted 36 seconds. There were many rows and columns, you see. Anyway, I profiled the code, using the great JetBrains dotTrace program, and I noticed that 30 seconds from 36 were used by DataRow.set_Item(int, object)! I remembered then that the DataTable object has two BeginLoadData and EndLoadData methods that disable/enable the checks and constraints in the table. I did that and the operation went from 36 to 27 seconds.

Quite an improvement, but the bottleneck was still in the set_Item setter. So, I thought, what will happen if I don't use a DataTable at all. After all, the end result was being bound to a GridView and it, luckily, knows about object collections. But I was too lazy for that, as there was quite a complicated binding code mess waiting for refactoring. So I just used a List of object arrays instead of the datatable, then I used DataTable.Rows.Add(object[]) from this intermediary list to the DataTable that I originally wanted to obtain. The time spent on the operation went from... no, wait

The time spent on the operation went from the 27 seconds I had obtained to 5! 5 seconds! Instead of 225.351 calls to DataRow.set_Item, I had 1533 calls to DataRowCollection.Add, from 21 seconds to 175 miliseconds!

Researching the reflected source of System.Data.dll I noticed that the DataRow indexer with an integer index was going through
DataColumn column=_columns[index]; return this[column];
How bad can it get?! I mean, really! There are sites that recommend you find the integer index of table columns and then use them as integer variables. Apparently this is NOT the best practice. Best is to use the DataColumn directly!

So avoid the DataRow setter.

No comments: