.NET (296) administrative (41) Ajax (42) AngularJS (2) ASP.NET (145) bicycle (2) books (180) browser (8) C# (135) cars (1) chess (29) CodePlex (10) Coma (8) database (49) deployment (3) Entity Framework (2) essay (114) flash/shockwave (2) flex (1) food (3) friend (2) game (20) idea (5) IIS (8) javascript (83) LInQ (2) Linux (6) management (4) manga (43) misc (676) mobile (1) movies (92) MsAccess (1) murder (2) music (64) mysql (1) news (100) permanent (1) personal (68) PHP (1) physics (2) picture (309) places (12) politics (14) programming (507) rant (120) religion (3) science (43) Sharepoint (3) software (58) space (1) T4 (2) technology (12) Test Driven Development (4) translation (2) VB (2) video (98) Visual Studio (45) web design (46) Windows API (8) Windows Forms (3) Windows Server (6) WPF/Silverlight (63) XML (11)

Wednesday, April 09, 2008

The DataRow value setter is slow!

It was about time I wrote a smashing IT entry. Here is to the obnoxious DataTable object, something about I have written before of bugs and difficulty in handling. Until now I haven't really thought about what kind of performance issues I might face when using it. I mean, yeah, everybody says it is slow, but how slow can it be? Twice as slow? Computers are getting faster and faster, I might not need a personal research into this. I tried to make a DataTable replacement object once and it was not really compatible with anything that needed DataTables so I gave up. But in this article I will show you how a simple piece of code became 7 times faster when taking into account some DataTable issues.

But let's get to the smashing part :) I was using C# to transform the values in a column from a datatable into columns. Something like this:
NameColumnValue
GeorgeMoney100
GeorgeAge31
GeorgeChildren1
JackMoney150
JackAge26
JackChildren0
JaneMoney300
JaneAge33
JaneChildren2


and it must look like this:

NameMoneyAgeChildren
George100311
Jack150260
Jane300332


I have no idea how to do this in SQL, if you have any advice, please leave a comment.
Update: Here are some links about how to do it in SQL and SSIS:
Give the New PIVOT and UNPIVOT Commands in SQL Server 2005 a Whirl
Using PIVOT and UNPIVOT
Transposing rows and columns in SQL Server Integration Services

Using PIVOT, the SQL query would look like this:
SELECT * 
FROM #input
PIVOT (
MAX([Value])
FOR [Column]
IN ([Money],[Age],[Children])
) as pivotTable

Anyway, the solution I had was to create the necessary table in the code behind add a row for each Name and a column for each of the distinct value of Column, then cycle through the rows of the original table and just place the values in the new table. All the values are present and already ordered so I only need to do it using row and column indexes that are easily computed.

The whole operation lasted 36 seconds. There were many rows and columns, you see. Anyway, I profiled the code, using the great JetBrains dotTrace program, and I noticed that 30 seconds from 36 were used by DataRow.set_Item(int, object)! I remembered then that the DataTable object has two BeginLoadData and EndLoadData methods that disable/enable the checks and constraints in the table. I did that and the operation went from 36 to 27 seconds.

Quite an improvement, but the bottleneck was still in the set_Item setter. So, I thought, what will happen if I don't use a DataTable at all. After all, the end result was being bound to a GridView and it, luckily, knows about object collections. But I was too lazy for that, as there was quite a complicated binding code mess waiting for refactoring. So I just used a List of object arrays instead of the datatable, then I used DataTable.Rows.Add(object[]) from this intermediary list to the DataTable that I originally wanted to obtain. The time spent on the operation went from... no, wait

The time spent on the operation went from the 27 seconds I had obtained to 5! 5 seconds! Instead of 225.351 calls to DataRow.set_Item, I had 1533 calls to DataRowCollection.Add, from 21 seconds to 175 miliseconds!

Researching the reflected source of System.Data.dll I noticed that the DataRow indexer with an integer index was going through
DataColumn column=_columns[index]; return this[column];
How bad can it get?! I mean, really! There are sites that recommend you find the integer index of table columns and then use them as integer variables. Apparently this is NOT the best practice. Best is to use the DataColumn directly!

So avoid the DataRow setter.

Update July 18, 2013:

Someone requested code, so here is a console application with some inline classes to replace the DataTable in GridView situations:

  class Program
  {
    static void Main(string[] args)
    {
      fillDataTable(false);
      fillDataTable(true);
      fillDataTableWriter();
      Console.ReadKey();
    }

    private static void fillDataTable(bool loadData)
    {
      var dt=new DataTable();
      dt.Columns.Add("cInt", typeof(int));
      dt.Columns.Add("cString", typeof(string));
      dt.Columns.Add("cBool", typeof(bool));
      dt.Columns.Add("cDateTime", typeof(DateTime));
      if (loadData) dt.BeginLoadData();
      for (var i=0; i<100000; i++)
      {
        dt.Rows.Add(dt.NewRow());
      }
      var now = DateTime.Now;
      for (var i = 0; i < 100000; i++)
      {
        dt.Rows[i]["cInt"] = 1;
        dt.Rows[i]["cString"] = "Some string";
        dt.Rows[i]["cBool"] = true;
        dt.Rows[i]["cDateTime"] = now;
      }
      if (loadData) dt.EndLoadData();
      Console.WriteLine((DateTime.Now-now).TotalMilliseconds);
    }

    private static void fillDataTableWriter()
    {
      var dt = new DataTableReplacement();
      dt.Columns.Add("cInt", typeof(int));
      dt.Columns.Add("cString", typeof(string));
      dt.Columns.Add("cBool", typeof(bool));
      dt.Columns.Add("cDateTime", typeof(DateTime));
      for (var i = 0; i < 100000; i++)
      {
        dt.Rows.Add(dt.NewRow());
      }
      var now = DateTime.Now;
      for (var i = 0; i < 100000; i++)
      {
        dt.Rows[i]["cInt"] = 1;
        dt.Rows[i]["cString"] = "Some string";
        dt.Rows[i]["cBool"] = true;
        dt.Rows[i]["cDateTime"] = now;
      }
      Console.WriteLine((DateTime.Now - now).TotalMilliseconds);
      now = DateTime.Now;
      var newDataTable = dt.ToDataTable();
      Console.WriteLine((DateTime.Now - now).TotalMilliseconds);
    }
  }

  public class DataTableReplacement:IEnumerable<IEnumerable<object>>
  {
    public DataTableReplacement()
    {
      _columns=new DtrColumnCollection();
      _rows=new DtrRowCollection();
    }

    private readonly DtrColumnCollection _columns;
    private readonly DtrRowCollection _rows;

    public DtrColumnCollection Columns
    {
      get { return _columns; }
    }

    public DtrRowCollection Rows { get { return _rows; } }

    public DtrRow NewRow()
    {
      return new DtrRow(this);
    }

    public DataTable ToDataTable()
    {
      var dt=new DataTable();
      dt.BeginLoadData();
      _columns.CreateColumns(dt);
      _rows.CreateRows(dt);
      dt.EndLoadData();
      return dt;
    }

    #region Implementation of IEnumerable

    public IEnumerator<IEnumerable<object>> GetEnumerator()
    {
      foreach (var row in _rows)
      {
        yield return row.ToArray();
      }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
      return GetEnumerator();
    }

    #endregion
  }

  public class DtrRowCollection:IEnumerable<DtrRow>
  {
    private readonly List<DtrRow> _rows;

    public DtrRowCollection()
    {
      _rows=new List<DtrRow>();
    }

    public void Add(DtrRow newRow)
    {
      _rows.Add(newRow);
    }

    public DtrRow this[int i]
    {
      get { return _rows[i]; }
    }

    public void CreateRows(DataTable dt)
    {
      foreach (var dtrRow in _rows)
      {
        dt.Rows.Add(dtrRow.ToArray());
      }
    }

    #region Implementation of IEnumerable

    public IEnumerator<DtrRow> GetEnumerator()
    {
      return _rows.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
      return GetEnumerator();
    }

    #endregion
  }

  public class DtrRow
  {
    private readonly object[] _arr;
    private readonly DataTableReplacement _dtr;

    public DtrRow(DataTableReplacement dtr)
    {
      _dtr = dtr;
      var columnCount = _dtr.Columns.Count;
      _arr=new object[columnCount];
    }

    public object this[string columnName]
    {
      get
      {
        var index = _dtr.Columns.GetIndex(columnName);
        return _arr[index];
      }
      set
      {
        var index = _dtr.Columns.GetIndex(columnName);
        _arr[index] = value;
      }
    }

    public object this[int columnIndex]
    {
      get
      {
        return _arr[columnIndex];
      }
      set
      {
        _arr[columnIndex] = value;
      }
    }

    public object[] ToArray()
    {
      return _arr;
    }
  }

  public class DtrColumnCollection
  {
    private readonly Dictionary<string,int> _columnIndexes;
    private readonly Dictionary<string,Type> _columnTypes;

    public DtrColumnCollection()
    {
      _columnIndexes=new Dictionary<string, int>();
      _columnTypes=new Dictionary<string, Type>();
    }

    public int Count { get { return _columnIndexes.Count; } }

    public void Add(string columnName, Type columnType)
    {
      var index = _columnIndexes.Count;
      _columnIndexes.Add(columnName, index);
      _columnTypes.Add(columnName, columnType);
    }

    public int GetIndex(string columnName)
    {
      return _columnIndexes[columnName];
    }

    public void CreateColumns(DataTable dt)
    {
      foreach (var pair in _columnTypes)
      {
        dt.Columns.Add(pair.Key, pair.Value);
      }
    }
  }

As you can see, there is a DataTableReplacement class which uses three other classes instead of DataColumnCollection, DataRowCollection and DataRow. For this example alone, the DtrRowCollection could have been easily replaced with a List<DtrRow>, but I wanted to allow people to replace DataTable wherever they had written code without any change to the use code.

In the example above, on my computer, it takes 1300 milliseconds to populate the DataTable the old fashioned way, 1000 to populate it with BeginLoadData/EndLoadData, 110 seconds to populate the DataTableReplacement. It takes another 920 seconds to create a new DataTable with the same data (just in case you really need a DataTable), which brings the total time to 1030. So this is the overhead the DataTable brings for simple scenarios such as these.

14 comments:

Shargon said...

please... code sample

Siderite said...

Here you go.

Anonymous said...

Thanks for the insight. In 64bit version using .Net 4.0 MS has improved the setter but your approach still works and cuts the time (your example) by a third.

Siderite said...

Thank you for your comment. Looking at my code I should point out that probably the column dictionaries should be initialized with case insensitive dictionaries in order to be consistent with DataTable, but I won't update the post now.

Anonymous said...

Thanks for the heads up on the case-insensitive dictionaries (though I still have to get my head around that). In any case, I tried your approach on a large data table. To my utter amazement, even with 64-bit Dot Net 4.0, your approach cut the time by about 65%. I guess I should dump the MS DataTable entirely and create my own version (inspired by you) as I need a very tiny fraction of its "rich" (aka bloated) features. Thanks for the inspiration. HAPPY NEW YEAR!

Siderite said...

Thank you! A Happy New Year to you and all the readers of the blog! Ohhhh-kay, let the non-readers of the blog also have a happy new year...

The case insensitive dictionary is just a dictionary with a IEqualityComparer as the constructor parameter. In the case of string keys you can use http://msdn.microsoft.com/en-us/library/system.stringcomparer.ordinalignorecase(v=vs.110).aspx.

I am glad my class worked. I wouldn't mind if you would share your discoveries here as well. My intention when I wrote the post was to make a class that I could use to replace a DataTable directly, without changing code. It was a half an hour effort, though, so I am sure things can be improved.

Anonymous said...

I would be glad to share stuff but I am still struggling. Basically, I am trying to create a simple DataTableX (based on your ideas) so that I can use it as a DataSource for a DataGridView. So far no success. I obviously don't understand dot net or the DataGridView enough! Do you have any suggestion on what I need to add to your DTR to use it as a DataSource? Any hints/code snippets would be greatly appreciated.

Siderite said...

From MSDN (http://msdn.microsoft.com/en-us/library/system.windows.forms.datagridview.datasource.aspx#remarksToggle) I see that the DataGridView DataSource property must implement one in a number of interfaces: IList, IListSource, IBindingList or IBindingListView. I think all you have to do is go implement :)

I suggest IBindingList (http://msdn.microsoft.com/en-us/library/system.componentmodel.ibindinglist.aspx), as it would be useful in WPF scenarios as well, I think. You can just implement all methods and properties with throw new NotSupportedException(); where it is not trivial to implement them, then see where binding throws errors and implement only the things you need.

Anonymous said...

I figured it out. Just needed two tiny changes. (1) Make the DataTableReplacement class also based on IListSource. (2) Add the following two lines in the class
bool IListSource.ContainsListCollection {get { return false; }}
IList IListSource.GetList(){ return (IList)_rows; }
Now I can also use your class to view the DataTableReplacement instance directly in a DataGridView (i.e., without the ToDataTable conversion. Lovely!

Anonymous said...

OOPS I spoke too soon. My previous solution does not work. If possible please remove it from your blog so that others don't get confused. Sorry about that.

Siderite said...

No problem, man :) Let them see you struggle. Maybe they would go through the same hoops as you if it wouldn't be for your comments. Thanks for tracking your problem on my post. Tomorrow I would be in the position to also check it out in other situations, so together we may come up with a stable, partially tested class.

Siderite said...

I thought about it. It doesn't work in the DataGridView because it has a hardcoded reference to DataTables. If the datasource is ... then ...

Of course this sucks. The problem is with the row objects. It expects the rows to have some information about the columns (i.e. having them as properties) is the source is not the DataTable.

That is why probably the only way is to use dynamic objects in the IListSource GetList method, some custom type inherited from ExpandoObject maybe, that would also propagate the changes to the original structure if you want it editable. That's rather complicated and stupid, I agree, but then that's how DataGridView works.

Siderite said...

Perhaps using a DataView, rather than a DataTable? I would research this in the coming days if I find the time.

Siderite said...

I tried some things and didn't find any solution. I can confirm that creating a list of dynamic object did NOT work and any attempt to override DataTable or DataView members or implementing any of the interfaces in my previous comments did not prove fruitful.

The only thing I didn't try (and I don't intend to) is to dynamically create a type that would have properties with the column names (assuming the column names are valid for property names) that would connect to the values in a DtrRow. That means run time compilation, which may defeat the purpose of speed :) but may be valuable in some scenarios.

Good luck!