Friday, October 07, 2016

Fragmentation in an SQL table with lots of inserts, updates and deletes

I've met an interesting case today when we needed to manipulate data from tens of thousands of people daily. Assuming we would use table rows for the information, then we get a table in which rows are constantly added, updated and deleted. The issue is with the space allocated in table pages.

SQL works like this: If it needs space it allocates some as a "page" which can contain more records. When you delete records the space is not reclaimed, it remains as is (this is called ghosting). The exception is when all records in a page are deleted, in which case the page is reused as an empty page. When you update a record with more data then it held before (like when you have a variable length column), the page is split, with the rest of the records on the page moved to a new page.

In a heap table (no clustered index) the space inside pages is reused for new records or for updated records that don't fit in their allocated space, however if you use a clustered index, like a primary key, the space is not reused, since there needs to be a correlation between the value of the column and its position in the page. And here lies the problem. You may end up with a lot of pages with very few records in them. A typical page is 8 kilobytes, so a table with a few integers in a record can hold hundreds of records on a single page.

Fragmentation can be within a page, as described above, also called internal, but also external, between pages, when the recycled pages are used for data that is out of order. To get a large swathe of records the disk might be worked hard in order to jump from page to page to get what is logically a continuous blob of data. It is disk input/output that kills a database.

OK, back to our case. A possible solution was to store all the data for a user in a "blob", a VARBINARY column. For reads or changes only the disk space occupied by the blob would be changed, with C# code handling everything. It's what is called trading CPU for IO, which is generally good. However this NoSql-like idea itself smelled badly to me. We are supposed to trust our databases, not work against them. The solution I chose is monitoring index fragmentation and occasionally issuing clustered index rebuilding or reorganizing. I am willing to bet that reading/writing the data equivalent to several pages of table is going to be more expensive than selecting the changes I want to make. Also, rebuilding the index will end up storing all the data per user in the same space anyway.

However, this case made me think. Here is a situation in which the solution might have been (and it was in a similar case implemented by someone else) to micromanage the way the database works. It made me question using a clustered index/primary key on a table.

These articles helped me understand more: