Tuesday, September 20, 2011

T-SQL: Determining the byte size of a text in a specific encoding

There was this FTP surrogate program that used SQL as a filesystem. I needed to store the size of the file (which was an HTML template and was stored as NTEXT) in the row where the content was stored. The problem is that the size of a text in a Microsoft SQL Server NTEXT column is about two bytes per character, while the actual size of the content, stored web like in UTF8, was different to almost half.

I thought that there must be an easy way to compute it, trying to cast the string to TEXT then using LEN, trying DATALENGTH, BINARY, etc. Nothing worked. In the end I made my own function, because the size of a string in UTF8 is documented on the Wikipedia page of that encoding: 1 byte for ASCII characters (character code<128), 2 bytes for less than 2048, 3 for 65536 and 4 for the rest. So here is the sql function that computes the size in UTF8:

CREATE FUNCTION [fn_UTF8Size]
(
@text NVARCHAR(max)
)
RETURNS INT
WITH SCHEMABINDING
AS

BEGIN
DECLARE @i INT=1
DECLARE @size INT=0
DECLARE @val INT
WHILE (@i<=LEN(@text))
BEGIN

SET @val=UNICODE(SUBSTRING(@text,@i,1))

SET @size=@size+
CASE
WHEN @val<128 THEN 1
WHEN @val<2048 THEN 2
WHEN @val<65536 THEN 3
ELSE 4
END
SET @i=@i+1

END

RETURN @size
END


A similar approach would work for any other encoding.

0 comments: