Monday, May 18, 2015

Encapsulating indexes or creating a database that allows for fast inserts and updates as well as indexed queries

Relational databases have the great advantage of being able to use indexes. These are constructs that trade space and the speed of inserts and updates for query speed. This introduces a problem: what do you do when you have a high input and high output application? How can you make your database fast for both changes and queries?

The traditional solution is a write-database and a read-database. There are background services that ensure that the read-database is updated as fast as necessary with the new data, most of the time also doing extra computation and caching for the information one needs extracted. OLAP systems are a common solution, where the aggregated data required by various reports is computed and stored, ready to be read by your applications. I am not saying this is a bad idea, in fact I am saying this is the best idea for this scenario. The problem that I have with it is that you can hardly automate the process. You need to know what you want to read, you need to write the software to create the data and aggregate it into what you need.

So I decided to try to build a system that obeys the following rules:
  1. The speed of insert and update operations needs to be unhindered by indexes. Indeed, changes to the original data should be avoided.
  2. The select operation need to benefit from the indexing system.
  3. The system must be able to determine by itself the data structures needed to cover the first two rules.
  4. Optionally, there should be a way to encapsulate this new behaviour into a separate data structure from the original database.

In order to insert and update as fast as possible, I needed tables that only had primary keys, identity integers rather than uniqueidentifiers, as they lead to fragmentation of clustered indexes. In order to not only index the columns that are used in where and join conditions, but also encapsulate the behaviour in some other data structure, I decided to create shadow tables with the columns that I needed for querying. These tables I would then index in order to improve selects. The connection between the original insert table and the select table would be made via primary keys and the synchronization between the two types of tables would be done in the background, whenever needed. Best of all, analysis on execution plans could automatically determine the columns needed for this system and create the tables and indexes required, then suggest improvements on the observed queries.

In order to test my theory I created the following tables:
  • OriginalTable - with a primary key identity ingeniously called Id and two other columns called Latitude and Longitude, containing random decimal(18,6) values
  • LocationIndexTable - with a refId integer primary key column that links to the OriginalTable Id - without being a foreign key - and two Latitude and Longitude columns that are indexed by the same non clustered index
  • LocationIndexTable2 - with a refId integer primary key column that links to the OriginalTable Id - without being a foreign key - and a locId integer column that links to another table called LocationTable and has its own non clustered index
  • LocationTable - with a primary key identity column and Latitude and Longitude columns. I know that this looks identical to LocationIndexTable, but this is the more complex case where there are multiple records with the same location. LocationTable holds the distinct Location and Latitude values
  • LocationIndexTable3 - with a refId integer column that links to the OriginalTable Id and is indexed by its own nonclustered index - without being a foreign key - and two Latitude and Longitude columns that are indexed by a clustered index

With a number of 77179072 original table rows, I attempted the queries for each case, careful to clean the cache and memory buffers before that. Here are the results:
  • SELECT count(1) FROM OriginalTable WHERE Latitude BETWEEN 45.5 AND 46 AND Longitude BETWEEN 8.5 AND 9 - no indexes whatsoever. Result: 30 seconds
  • SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable lit ON lit.RefId=ot.Id WHERE lit.Latitude BETWEEN 45.5 AND 46 AND lit.Longitude BETWEEN 8.5 AND 9. Result: 17 seconds
  • SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable2 lit2 ON lit2.RefId=ot.Id INNER JOIN LocationTable lt ON lit2.LocId=lt.Id WHERE lt.Latitude BETWEEN 45.5 AND 46 AND lt.Longitude BETWEEN 8.5 AND 9. Result: 41 seconds
  • SELECT count(1) FROM OriginalTable ot INNER JOIN LocationIndexTable3 lit ON lit.RefId=ot.Id WHERE lit.Latitude BETWEEN 45.5 AND 46 AND lit.Longitude BETWEEN 8.5 AND 9. Result: 22 seconds

Unexpectedly for me, the most comfortable solution also seems the faster. Indeed, one issue is that I didn't have duplicate location data in the database, so there was no advantage in adding a new table to link locations to the original table. That means that in cases where the indexed data has many duplicates, it might be advantageous to experiment with a "distinct" table, although indexing should take this case into account, as well. The clustered index is slower than the unclustered one, that was a surprise. Also, the indexing just made the query twice as fast - I had expected more. Of course, all this benchmarking was done after deleting the cache and buffers with the commands DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS. It is interesting to see how fast they queries go without this clearing. The first unindexed query takes 3 or 4 seconds, while all the others take less than 2.

There is one more thing that needs to be addressed: moving these tables into another database. Are the indexes just as fast? They should be, but we must test interdatabase communication. This would allow to move the entire system outside a target database, truly encapsulated, really adding value to a database that remains unaffected. My tests show the following results: 28, 18, 29 and 18 seconds, respectively. Yes, you saw that right, the speed of the joins when the indexed databases are in another database on the same server seems faster! Just to make sure I reran the original tests on the same database and I got approximately the same results: 29,19,43 and 24, respectively. The only thing I didn't try (and I don't intend to) is to create primary-foreign key associations, as this means modifying the original tables, something I wish to avoid.

So, after all this I believe my idea has been validated. A lot more work has to be done in order to automate this system, but at least a no-effort manual solution is possible. Let's recap:
  1. The speed of row inserts and updates remains unchanged: the only index is the original primary key identity integer key that should always exist in a table anyway.
  2. The speed of selects is improved by creating tables that have an integer primary key that links to the original table, and only the columns used in queries, over which indexes are created.
  3. Queries can be used to determine the columns needed to index. Even better, computed columns can be used instead of the original columns, which adds a little more performance.
  4. Encapsulation is achieved by not only creating other tables for reading, but also moving these tables into another database, after which, unexpectedly, the performance is even better.

The end result is a system similar to indexing, but which takes space in another database and which is updated on demand, not when the system deems it necessary. Any SQL statements that worked before will continue to work unaffected, but faster alternatives for selects will become available. Overall, I am very satisfied, although I had expected better performance improvements than observed.