Wednesday, August 26, 2015

The story of the LEFT JOIN in T-SQL (and why I learned to love the NOT EXISTS)

As a software developer - and by that I mean someone writing programs in C#, Javascript and so on, and occasionally using databases when something needs to be stored somewhere - I have an instinctual fear of the Arrow Anti-pattern. Therefore I really dislike stuff like NOT EXISTS(SELECT 1 FROM Something). However, recent tests have convinced me that this is the best solution for testing for existing records. I am not going to reinvent the wheel; here are some wonderful links regarding this, after which I will summarize:

Let's say you want to insert in a table all records from another source that do not already exist in the table. You have several options, but the most commonly used are:
FROM SourceTable
LEFT JOIN DestinationTable
ON SomeCondition
WHERE DestinationTable.Id IS NULL
FROM SourceTable
WHERE NOT EXIST(SELECT 1 FROM DestinationTable WHERE SomeCondition)

Personally I prefer the first version, for readability reasons as well as having listened to the mantra "Never do selects in selects" for all my life. However, it becomes apparent that the second version is a lot more efficient. The simple reason is that for the first example Microsoft SQL Server will first join the two tables in memory, then filter. If you have multiple combinations of records that satisfy the condition this will result in some huge memory and CPU usage, especially if you have no indexes defined and, sometimes, because you have some indexes defined. The second option uses one of the few methods guaranteed to exit, NOT EXISTS, which immediately invalidates a record at the first match.

Other options involve using the EXCEPT or INTERSECT operations in SQL, but they are not really helping. Intersecting ids, for example, then inner joining with SourceTable works, but it is somewhere in between the two solutions above and it looks like crap as well. Join hints don't help either.