Pages

Saturday, March 6, 2010

String comparison and collation

You probably already know that string comparison is dependent on the collation. But sometimes this dependency is slightly strange…

I have installed an SQL Server with LATIN1_GENERAL_CI_AS as default collation (like is also the default setting during installation). Some weeks ago I came across a specific behavior in combination of this collation and the German language.

German is a beautiful language! Really. You may know that we Germans have some special vowels (the “Umlaut”: ä, ö, and ü) and also that extra ß which is often (but not always) simply interchangeable with “ss”. Look at this query that uses the default collation for comparison:

select case
         when 'ss' collate latin1_general_ci_as
             = 'ß' collate latin1_general_ci_as
              then 'Yes'
         else 'No'
       end [ss=ß?]

Considering that LATIN1_GENERAL_CI_AS is the default collation, the query above is identical to the following query:

select case
         when 'ss' = 'ß' then 'Yes'
         else 'No'
       end [ss=ß?]

In both cases the result is Yes, so “ß” is considered as to be identical to “ss” when using the LATIN1_GENERAL_CI_AS collation.

Hmm. Now look at this query:

select case
         when 'ä' = 'ae' then 'Yes'
         else 'No'
       end [ä=ae?]

This time the answer is No which is somewhat surprising, since one could consider the overall behavior inconsistent. In German, “ä” and “ae” (and also “ö” and “oe” as well as “ü” and “ue”) are just as interchangeable as “ß” and “ss”.

When this happened to me, I started browsing books online. There’s a hint, where you are advised setting the default collation to LATIN1_GENERAL_BIN for new installations. I didn’t know this so far, but ok: let’s retry our experiment:

select case
         when 'ss' collate latin1_general_bin
             = 'ß' collate latin1_general_bin
              then 'Yes'
         else 'No'
       end [ss=ß?]

Now the answer is No.

But considering the fact that most newer installations use LATIN1_GENERAL_CI_AS, simply because that’s the default during installation, I can’t just change the collation to LATIN1_GENERAL_BIN, since this will certainly create other problems with queries spanning multiple databases with different collations. Not taking into account that changing the collation for an existing server and all of its databases/columns is a cumbersome and also very risky task on its own…

No comments:

Post a Comment

Followers