Wednesday, July 22, 2015

T-SQL: Joining two very large tables (or not) when trying to determine the hierarchical relationship between the records

Update: after another very useful comment from NULLable, I tried several new ideas:
  • range queries - trying to specify that the child StartIp, for example, is not only greater than or equal to the parent StartIp, but also less than or equal to the parent EndIp. In my case the query didn't run faster, and adding the new indexes recommended in the comment made it slower. I believe that is because the range values are not static, or simply because clustering the start/end IP index is way faster than any logical implementation of the search algorithm
  • cursor hints - obviously a very important hint that I should add to almost any cursor is LOCAL. A GLOBAL cursor can be accessed from outside the stored procedure and weird things can happen when running the stored procedure twice at the same time. NULLable recommended also STATIC READ_ONLY and FORWARD_ONLY. In truth the performance of the query doesn't really depend on the speed of the cursor, anyway, but I found an article that discusses the various cursor hints and ends up recommending LOCAL FAST_FORWARD. Check it out, it is very informative. My tests showed no real difference in this particular scenario.
  • RI-Tree implementation in SQL - the article that NULLable linked to is amazing! I just don't get it :) I will update this more when I gain more IQ points.

Update 2: I kind of understood the Relational Interval Tree implementation, but I couldn't find a way for it to help me. The code there creates a computed column of the same type as the IP columns then makes a BETWEEN comparison and/or a join or an apply with two table functions. I can't imagine how it could help me since the original query is basically just two BETWEEN conditions. But still a very interesting article.

I wanted to have a database of all RIPE records, in order to quickly determine the Internet Service Provider for an IP. We are discussing IPv4 only, so the structure of the table in the database looked like this:
CREATE TABLE [dbo].[RipeDb](
  [Id] [int] IDENTITY(1,1) NOT NULL,
  [StartIp] [bigint] NULL,
  [EndIp] [bigint] NULL,
  [NetName] [nvarchar](450) NULL,
  [StartTime] [datetime2](7) NULL,
  [EndTime] [datetime2](7) NULL,
  [ParentId] [int] NULL)

As you can see, I translate IPs into BIGINT so that I can quickly sort and select stuff. I also added a ParentId column that represents the parent ISP, as you have some huge chunk of IPs, split and sold to other ISPs, which in turn are selling bits of the IP range they own to others and so on. The data I receive, though, is a simple text file with no hierarchical relations.

The task, therefore, is to take a table like the one described above, with more than four million records, and find the parent of each record, if it has one.

The simplest idea is to join the table with itself like this:
SELECT rp.Id as ParentId, 
   r.Id 
FROM   RipeDb r 
   INNER JOIN RipeDb rp 
       ON rp.StartIp <= r.StartIp 
          AND rp.EndIp >= r.EndIp 
          AND rp.EndIp - rp.StartIp > r.EndIp - r.StartIp 
This gets all ancestors for each record, so we need to use a RANK() OVER() in an inner select in order to select only the parent, but that's beyond the scope of the article.
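For the curious, here is a minimal sketch of what that inner select could look like. It uses ROW_NUMBER() instead of RANK() so that ties return a single row, and it assumes the direct parent is the enclosing range with the smallest span. The CTE name Ancestors and the rn alias are illustrative, not part of the original query.

```sql
-- Sketch: for every record keep only the narrowest enclosing range,
-- which is assumed to be its direct parent.
WITH Ancestors AS (
    SELECT r.Id,
           rp.Id AS ParentId,
           ROW_NUMBER() OVER (
               PARTITION BY r.Id
               ORDER BY rp.EndIp - rp.StartIp ASC  -- smallest span first
           ) AS rn
    FROM   dbo.RipeDb r
           INNER JOIN dbo.RipeDb rp
               ON rp.StartIp <= r.StartIp
              AND rp.EndIp >= r.EndIp
              AND rp.EndIp - rp.StartIp > r.EndIp - r.StartIp
)
SELECT Id, ParentId
FROM   Ancestors
WHERE  rn = 1;
```

This does not make the join any cheaper; it only reduces the output to one row per record.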

Since we have conditions on the StartIp and EndIp columns, we need an index on them. But which?

Through trial and error, more than anything else, I realised that the best solution is a clustered index on StartIp,EndIp. That is why the first column (Id) is not marked as PRIMARY KEY in the definition of the table, because it has to look like this:
[Id] [int] PRIMARY KEY NONCLUSTERED IDENTITY(1,1) NOT NULL
Yes, primary keys don't have to be clustered.
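With the primary key declared as nonclustered, the clustered index can then be created on the two range columns. A minimal sketch, assuming the table definition above; the index name is made up for illustration:

```sql
-- Sketch: cluster the table on the range columns.
-- IX_RipeDb_StartIp_EndIp is a hypothetical name.
CREATE CLUSTERED INDEX IX_RipeDb_StartIp_EndIp
    ON dbo.RipeDb (StartIp, EndIp);
```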

But now you hit the snag. The process is EXTREMELY slow. On my computer this query would finish in a few days (and take about twice as long with a nonclustered index). What the hell is going on?

I tried several things:
  • JOIN hints (Merge, Loop and Hash joins) - the query optimizer seems to choose the best solution anyway
  • Various index combinations - nothing beats a clustered index
  • Taking a bunch of records and joining only them in a WHILE loop - it doesn't fill up the temp db, but it is just as slow, if not worse

At this point I kind of gave up. Days of work trying to figure out why this is going so slow led to a simple explanation: 4 million records squared means 16 thousand billion comparisons. No matter how ingenious SQL Server is, this will be slow. "But, Siderite, I have tables as large as this and joining them is really fast!" you will say. True, with equality conditions the joins are orders of magnitude faster. There is probably room for improvement either in the way I used the indexes or in the way they are implemented. If you have any ideas, please let me know.

So did I solve the problem? Yes, of course, by not relying on an SQL join. Think about how the ranges are arranged. If we order the IP ranges on their start and end values, you get something like this:
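As a small illustration with made-up ranges (the values and Ids below are hypothetical, not actual RIPE data), sorting by start ascending and end descending lists every parent before its children:

```sql
-- Sketch with invented ranges, inserted in arbitrary order.
DECLARE @Ranges TABLE (Id INT, StartIp BIGINT, EndIp BIGINT);

INSERT INTO @Ranges (Id, StartIp, EndIp) VALUES
    (10, 150, 199),
    (11, 100, 199),
    (12, 300, 399),
    (13, 110, 119),
    (14, 100, 149);

SELECT Id, StartIp, EndIp
FROM   @Ranges
ORDER BY StartIp ASC, EndIp DESC;
-- Row order: 11 (100-199), 14 (100-149), 13 (110-119), 10 (150-199), 12 (300-399)
-- 11 contains 14 and 10, 14 contains 13, and 12 starts a new top-level range.
```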



For each range, the one that follows is either a direct child or a sibling. I created a stored procedure that called itself recursively, which should have worked, but then it reached the maximum nesting level for stored procedures in SQL Server (32 - a value that one cannot change!) and so I had to do everything myself. How? With a cursor. Here is the final code:
DECLARE @ParentIds TABLE (Id INT,StartIp BIGINT, EndIp BIGINT)
DECLARE @ParentId INT
DECLARE @Id INT
DECLARE @StartIp BIGINT
DECLARE @EndIp BIGINT
DECLARE @OldParentId INT

DECLARE @i INT=0
DECLARE @c INT

DECLARE curs CURSOR LOCAL FAST_FORWARD FOR
SELECT r.Id, r.StartIp, r.EndIp, r.ParentId
FROM RipeDb r
WHERE r.EndTime IS NULL
ORDER BY StartIp ASC, EndIp DESC

OPEN curs

FETCH NEXT FROM curs
INTO @Id, @StartIp, @EndIp, @OldParentId

WHILE @@FETCH_STATUS=0
BEGIN

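    -- Drop every stored range that ended before the current one starts:
    -- those are finished siblings, not ancestors
    DELETE FROM @ParentIds WHERE EndIp<@StartIp

    -- The innermost remaining range is the direct parent. Order by the range
    -- itself rather than by Id, since Ids need not follow the nesting order
    SET @ParentId=NULL
    SELECT TOP 1 @ParentId=Id FROM @ParentIds 
    ORDER BY StartIp DESC, EndIp ASC

    -- Number of open ancestors, used only for the progress message
    SELECT @c=COUNT(1) FROM @ParentIds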

    IF (@i % 1000=0)
    BEGIN

    PRINT CONVERT(NVARCHAR(100),SysUtcDatetime())+' Updated parent id for ' + CONVERT(NVARCHAR(100),@i) +' rows. ' + CONVERT(NVARCHAR(100),@c) +' parents in temp table.'
    RAISERROR ('', 0, 1) WITH NOWAIT

    END
    SET @i=@i+1

    IF (ISNULL(@OldParentId,-1) != ISNULL(@ParentId,-1))
    BEGIN
        UPDATE RipeDb SET ParentId=@ParentId WHERE Id=@Id
    END

    INSERT INTO @ParentIds VALUES(@Id,@StartIp,@EndIp)

    -- The INTO list must match the four columns selected by the cursor
    FETCH NEXT FROM curs
    INTO @Id, @StartIp, @EndIp, @OldParentId
END

CLOSE curs
DEALLOCATE curs

The explanation of the algorithm follows, for people hitting the exact issue that I had, but first let me write the conclusion of this blog post: even if SQL is awesome at sorting and indexing, that doesn't mean it is the only solution. In my case, the SQL indexes proved to be a golden hammer that wasted days of my work.

So, the logic here is really simple, which makes this entire endeavour educational, but really frustrating to me:
  1. Sort the table by start IP ascending, then end IP descending - this makes the parents come before their children in the list
  2. Create a table variable to store the previous parents - so when you are finished with a range you automatically find yourself back in its parent
  3. Use a cursor to move through all the items and for each one:
     • Remove all parents that ended before the current item starts - this removes finished siblings from the list
     • Get the last parent in the list - that is the current parent range
     • Set the parent id to be the one of the last parent
     • Add the current item to the list, since it may be the parent of the items that follow

It's that deceptively simple and the query now ends in 15 minutes instead of days.
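A cheap way to sanity-check the result, assuming the cursor has already populated ParentId, is to count the rows whose assigned parent does not actually contain them. This is only a sketch and the BadRows alias is illustrative. Because it joins on equality (ParentId = Id), it should not suffer from the slowness of the range join.

```sql
-- Sketch: count records whose assigned parent does not enclose them.
SELECT COUNT(*) AS BadRows
FROM   dbo.RipeDb r
       INNER JOIN dbo.RipeDb p
           ON p.Id = r.ParentId
WHERE  r.EndTime IS NULL
  AND (p.StartIp > r.StartIp OR p.EndIp < r.EndIp);
```

A result of 0 means every assigned parent encloses its child; it does not prove that the parent is the closest one.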

Another issue that might be interesting is that after the original import is created, the new records added to the table should be just a few. In that case, the first join and update might work faster! The next thing that I will do is count how many items I need to update and use one method or another based on that.

Hope that helps someone.

2 comments:

NULLable said...

1) Cursor based solution: I would try to declare the cursor thus (could be faster):
CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR

2) I would try also next solution
CREATE INDEX a ON dbo.RipeDb (StartIp) INCLUDE (EndIp, Id, ParentId)
+
I would replace
ON rp.StartIp <= r.StartIp
with
ON rp.StartIp <= r.StartIp AND r.StartIp <= rp.EndIp
This solution allows a range query on a single column (r.StartIp).

3) Look at above solutions:
http://blogs.solidq.com/en/businessanalytics/interval-queries-in-sql-server-wrap-up/

Siderite said...

Thanks, man! I will try them and update the post. The range query thing makes a lot of sense. Logically I know that the start time will always be less than end time, but SQL doesn't know that. Good catch!