Over the years I have come to see that every database has what I call data type drift. Simply put, data type drift is when you have columns with the same name but different data types or lengths. I’d say about 97% of the databases I’ve reviewed have some form of drift. So why is that number so high?
Have you ever gone out to eat at your favorite restaurant and noticed that something in your favorite meal was a bit off? This happened to me. My favorite Chinese restaurant had the same cook for 17 years, and one day he decided to become a driver for DHL. I might never have noticed, since he wasn’t visible from the front of the house, but I knew the food was quite different. I must have had the same Kung Pao 100 or more times, and this was not the same. I knew the owner well, and he told me about the cook.
This same scenario happens over time with an application. As developers come and go, the database can drift. One developer has always used varchar(20) for FirstName and the next developer has always used nvarchar(50).
This may also happen all at once. Consider an application that has different modules, each with its own developer or development team. When a data architect isn’t present and the data model is not enforced through an ERD, you get drift.
Why is this important?
When columns that don’t have the same data type are used in a join or union, an implicit conversion is performed to make one column match the other. This consumes more I/O, processor, and memory, resulting in longer processing times for the affected queries. In short, things are slower, and it’s fairly easy to fix.
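As a sketch of the cost (the table and column names here are hypothetical, invented for illustration), joining a varchar column to an nvarchar column forces the lower-precedence side to be converted:

```sql
-- Hypothetical tables showing drift: the same logical column, two types.
CREATE TABLE dbo.Customer  (FirstName varchar(20));
CREATE TABLE dbo.Marketing (FirstName nvarchar(50));

-- varchar has lower type precedence than nvarchar, so Customer.FirstName
-- is implicitly converted to nvarchar for the comparison. The actual
-- execution plan shows a CONVERT_IMPLICIT, and the conversion can prevent
-- an index seek on Customer.FirstName.
SELECT c.FirstName
FROM dbo.Customer AS c
JOIN dbo.Marketing AS m
    ON c.FirstName = m.FirstName;
```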
How do I identify drift?
I wrote this script around eleven years ago and haven’t really updated it since.
Get the script here: Microsoft TechNet Gallery
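For readers who can’t reach the linked script, here is a minimal sketch of the same idea — written against the sys catalog views, not taken from the original — that groups columns by name and flags any name appearing with more than one type or length:

```sql
-- Flag column names that appear with more than one data type or length.
SELECT  c.name AS ColumnName,
        COUNT(DISTINCT CONCAT(TYPE_NAME(c.user_type_id), '(', c.max_length, ')'))
            AS TypeVariants
FROM    sys.columns AS c
JOIN    sys.tables  AS t ON t.object_id = c.object_id
GROUP BY c.name
HAVING  COUNT(DISTINCT CONCAT(TYPE_NAME(c.user_type_id), '(', c.max_length, ')')) > 1
ORDER BY c.name;
```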
If a drifting column is used in a union or join, then it should probably be corrected.
Here’s an example of the output:
Researching each column is easy, just use the following query:
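A typical query for this, sketched here against INFORMATION_SCHEMA.COLUMNS with 'FirstName' standing in for whichever column you are researching:

```sql
-- List every table that declares a given column name, with its type.
SELECT  TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
        DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   COLUMN_NAME = 'FirstName'   -- substitute the column under review
ORDER BY TABLE_SCHEMA, TABLE_NAME;
```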
I’ve got some drift. Should I fix it?
I recommend updating a column only if it’s frequently used in a join or union with another column whose data type does not match. If you’re never going to query the two together, don’t bother. A good example of this is the [description] column. This column name is used across various tables that have nothing to do with each other. These columns are not used in joins and never will be, so their types never need to match.
When creating a new database, I recommend using a data modeling tool that enforces “domains” on column names: any time a column name is used again, it automatically gets the same data type and length.
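SQL Server itself offers a lightweight, in-database version of the same idea through alias types; the type name below is invented for illustration:

```sql
-- An in-database "domain": every column declared with this alias type
-- automatically shares the same base type and length.
CREATE TYPE dbo.PersonName FROM nvarchar(50) NOT NULL;

CREATE TABLE dbo.Customer (FirstName dbo.PersonName);
CREATE TABLE dbo.Employee (FirstName dbo.PersonName);
```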
8 thoughts on “Does my database have data type drift?”
Data type drift can also cause many different kinds of hard-to-debug SQL runtime errors. For example, I worked with a legacy database (more than 4 generations of developers) that had several specific Id columns (e.g. Department, Employee, Customer, Site, …) used extensively across the 500+ tables. Over the years the related columns, parameters, and variables had been defined as TinyInt, SmallInt, Int, Varchar, or NVarchar. The database performance certainly suffered from extensive implicit conversions in the joins and WHERE clauses.
However, there were also frequent “un-reproducible” runtime errors caused by truncation. For example, one table defined a coded key column as TinyInt while a referencing procedure defined the parameter as Varchar(20). The occasional customer data entry of “300” or “3AF”, etc. would result in either silent truncation or conversion errors, which were not properly trapped and handled in the application.
Obviously, there were more problems here than just data type mismatch, but data type mismatch was (and remains) a major issue throughout the application suite.
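The failure mode described above can be reproduced in miniature: tinyint holds 0–255, so a “300” or “3AF” arriving through a Varchar(20) parameter errors only when such a value actually appears in the data, which is what makes the errors look un-reproducible.

```sql
-- tinyint holds 0-255, so both of these raise errors at run time:
SELECT CAST('300' AS tinyint);  -- arithmetic overflow
SELECT CAST('3AF' AS tinyint);  -- conversion failure
```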
Very true and it’s an issue that most developers and DBAs don’t think about.
Enjoyed your article. I see this daily between the modules of the ERP system I do reporting against. It is an annoyance at best, and can sometimes create a need for significant process workarounds that drain much of the intended robustness and efficiency out of the application.
One way to combat datatype drift is to use an ERD tool that supports domains (user definable data types). When the tool generates the SQL for new columns/tables the domain is replaced with the actual datatype.
Hi Roger, Thanks for the comment. There are many great data modeling (ERD) tools out there. I usually use ERStudio but I’ve heard of a lot of outstanding open source tools as well.
ModelRight’s my personal favorite. It’s pricy, but none of the open source ERD designers had enough oomph for me.
I was aware of the issue of data drift and the engine having to do conversions in the queries, and it brought up something interesting today as I was designing some staging tables for a data warehouse. The original tables were designed before the Date data type was added, so there are a lot of SmallDateTime fields that really contain just dates with a midnight time stamp. When designing the new staging tables, I wanted to take advantage of the Date data type to avoid any confusion of whether the fields had a timestamp or not. When developers are writing queries against the data, it’s extremely helpful to know if an event can occur any time during a day vs. just on that date. So my question is, when a Date data type is joined with a SmallDateTime data type, does it exhibit this same behavior, or is the engine smart enough to do the join without having to convert one or the other? I think the answer is the former, but it would be nice for it to be the latter.
Conversions happen with the date data types as well. You can test with this query:
DECLARE @Date TABLE (A date);
DECLARE @SmallDate TABLE (A smalldatetime);
INSERT INTO @Date (A) VALUES (GETDATE());
INSERT INTO @SmallDate (A) VALUES (GETDATE());
-- Join the two and check the actual execution plan: smalldatetime has
-- higher type precedence than date, so the date column is implicitly
-- converted (CONVERT_IMPLICIT) to smalldatetime for the comparison.
SELECT d.A, s.A
FROM @Date AS d
JOIN @SmallDate AS s ON d.A = s.A;
One thing to note is that newer versions of SQL Server have introduced DateTime2, which supports a wider date range and higher precision than the original DateTime type and, at lower precisions, can use less storage.