You Installed Your SQL Server Wrong: Why Collation is Important

gsiDatabaseEssentials banner 1

You Installed Your SQL Server Wrong: Why Collation is Important

Bill Rehm, ATS - Database Lead

database top pic

At GSI we get a lot of companies asking us to analyze their databases after they experience performance and other issues. With Microsoft SQL Server there is one setting that, to the best of my recollection, is never set correctly – database collation. I know, never say never, but if I found it set correctly somewhere I sure don’t remember it. It’s a key setting and some applications – most notably JD Edwards EnterpriseOne – require a custom collation setting that is different from the default. If collation isn’t set right you’re going to have problems.

What is collation? Microsoft’s online documentation says:

Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types such as char and varchar dictate the code page and corresponding characters that can be represented for that data type. Whether you are installing a new instance of SQL Server, restoring a database backup, or connecting server to client databases, it is important that you understand the locale requirements, sorting order, and case and accent sensitivity of the data that you are working with.

When you select a collation for your server, database, column, or expression, you are assigning certain characteristics to your data that affects the results of many operations in the database. For example, when you construct a query by using ORDER BY, the sort order of your result set might depend on the collation that is applied to the database or dictated in a COLLATE clause at the expression level of the query.

The default collation of SQL Server is SQL_Latin1_General_CP1_CI_AS. This translates to:
• Latin1 – String values use the Latin 1 character set which is essentially ASCII
• CP1 – Code Page 1252. This is a lookup table that maps the hex value to a specific character in the character set.
• CI – Case Insensitive. When you look for a string value it doesn’t care if it’s upper or lower case. Searching for ‘VALUE’ returns ‘Value’, ‘VaLuE’, and so on.
• AS – Accent Sensitive. If a letter has an accent it is treated as a separate string, e.g. A is different from Ä.

JDE used to be ‘collation agnostic’, but it was found that double-byte languages in JDE were causing width-sensitivity issues. To fix that, in 9.0 Oracle began shipping JDE with a collation of Latin1_General_CI_AS_WS. As you probably guessed, the WS means that it’s width-sensitive and it solves JDE’s problem with double-byte languages. The good news is that it’s a logical subset of MSSQL’s default collation, so it works well with other data that prefers the standard US collation.

The collation difference does still matter if you’re not using double-byte languages. When your user data collation doesn’t match that of the tempdb you could get odd results. Your business data will have one collation and the tempdb has another, so they’re going to use different criteria depending on how you request data. A simple query of your data with nothing fancy would be sorted based on one collation and might look fine, but if you do any JOINs or an ORDER BY that result set gets sent to the tempdb. The data now has a different collation applied to the results and the potential for errors intensifies.

The problem can become more significant when Read-Committed Snapshot Isolation RCSI is enabled. I always recommend enabling RCSI, especially for high transaction OLTP databases. Microsoft and Oracle both strongly recommend enabling it as well. In a nutshell, this setting makes MSSQL behave like Oracle DB when it comes to transaction isolation. It tells MSSQL it can read data asynchronously in the hopes that nothing will change before the statement is done. If something does change, it goes back and reads the updated data. A handful of transactions might run longer, but the vast majority will run much faster since they don’t have to wait in line for their turn.

What does that have to do with tempdb? If your database is set up with RCSI according to best practice, then the tempdb is used to store the before/after data to support the optimistic locking. Virtually all your transactions will pass through the tempdb and a non-matching collation could cause wide ranging effects.

Admittedly the difference between the default collation and JDE collation is very slight, and if you don’t use double-byte languages it’s possible you haven’t noticed an issue yet. You can find the collation of all the databases on an instance by running:

SELECT name, collation_name FROM sys.databases

You can display your server collation with:

SELECT SERVERPROPERTY('Collation') AS 'Collation'

Unfortunately, the only good way to change the collation after the DB is installed is to rebuild the server from scratch or rebuild your system databases. You can’t simply run an ALTER DATABASE statement to change it either. If you just installed your database, you can just blow it away and reinstall. Otherwise you’re going to end up losing some weekend time getting this fixed.