Sometimes it helps to go back to the basics and indexing is always a great topic. SQL Server has a wide variety of indexes; but, today we’re going classic and talking about Non-clustered indexes.
Microsoft has a lot of great resources discussing these topics, though I thought I’d take a moment to sum up some key points.
What is a Non-Clustered Index?
A non-clustered index is very similar to a clustered in the sense that they are both B-Tree structures; however, the non-clustered index does not contain the physical data on the leaf level. Instead, it holds pointers.
These pointers are different depending on if the table has a clustered index or not. If the table has a clustered index, the pointers are the clustered key. If the table has no clustered index defined, it’s called a HEAP and the non-clustered index pointers are comprised of row ids (RID). This is why you will see either Key Lookup or RID Lookup in your query plan.
Let’s take a look at the following query:
SELECT AddressLine1, City, PostalCode
WHERE ModifiedDate = ‘2007-05-01’
I added a non-clustered index (ncl_demo) on ModifiedDate and we can see that it was chosen by the optimizer. SQL Server is cost based, meaning that it will choose plans with the least cost. Sometimes SQL does a very poor job at this but that’s a whole separate blog topic.
CREATE NONCLUSTERED INDEX [ncl_demo] ON [Person].[Address] ([ModifiedDate])
We often see documentation that shows a b-tree but it never really relates back to an actual query, so I thought it would be neat to do just that.
The items in blue represent the non-clustered index and items in red come from the clustered.
Since [ncl_demo] does not include AddressLine1, City, or PostalCode, we must go to the physical data to retrieve and display them. Here’s a closer look:
1) Note that the non-clustered index seek on [ncl_demo] seeks ModifiedDate and outputs 13 rows but the output is AddressID, which isn’t even defined in the index. So why is it in the plan?
AddressID is the clustered key. Since the leaf level of a non-clustered index contains the RID or clustered key, AddressID is automatically part of [ncl_demo]. This is why it’s important to keep your clustered key narrow. Creating a clustered key with 16 columns would bloat your non-clustered index. Remember you do NOT need to make your PK clustered.
2) These 13 AddressID’s are then pushed into the Nested Loops which executes once but loops over the clustered index for each input row causing the Key Lookup to execute 13 times (1 for each row of the index seek).
3) The key lookup matches each row based on AddressID = AddressID and then outputs the missing columns from [ncl_demo].
So here’s a wild question: Why not just include every column in every non-clustered index? That’s a good idea, right? No more look ups?
Actually it’s a really bad idea. The reason is that for each column added to the index, even those in the “include” take up room. The index page can only hold 8096 bytes of data.
Looking at the math datetime (8 bytes) + int (4 bytes) + overhead (3 bytes), each entry costs 15 bytes of storage, allowing for 539 rows per page.
Now let’s say you have 15 additional columns in the clustered key. Those 15 columns would severely limit the amount of rows per page making the index very inefficient. So narrow clustered indexes are much better for everyone.
How do we fix the Lookup?
This really depends. The general objective would be to add the columns to the “include” section of the non-clustered; however, what it every column was needed? Well in that case you’d want to fix the application instead.
Once I had an application that was doing quite a bit of lookups. I looked at the queries and discovered they were all “SELECT *”. I spoke with the developer and he said “We don’t use SELECT * anywhere”. I explained that it wasn’t literal. They were selecting every column on every query; whether it was needed or not and he said “Well yeah, we do that everywhere.”
It’s important to understand that returning only what’s needed will greatly assist your index / performance tuning efforts. This is the battle we often have with ORM tools like Entity Framework. Preaching this to your devs will only help. Perhaps give them a custom coffee mug that says “Only query what you need”?
Let’s say that they are following best practice and we have the above query to tune. How do we fix it? Simply add the columns to the include:
CREATE NONCLUSTERED INDEX [ncl_demo] ON [Person].[Address]
INCLUDE ([AddressLine1], [City], [PostalCode])
WITH (DROP_EXISTING = ON, ONLINE = ON)
Now with this complete we rerun the query and find that the lookup has been resolved.
How do we find other potential lookup problems?
It’s fairly easy to find which tables experience a lot of Lookups. Simply query the SYS.DM_DB_INDEX_USAGE_STATS DMV and observe the USER_LOOKUPS column.
SELECT DB_NAME(database_id) as db,
OBJECT_NAME(object_id, database_id) as obj,
ORDER BY user_lookups DESC
Here we can see that another index is causing more lookups and should get our attention.
I’ll post a follow up blog on how to identify which queries are causing lookups. Stay tuned!
When you see a lookup, you’ll probably want to add the columns from the “output” of the key lookup or RID lookup. But before you do, evaluate the query and be sure that you aren’t adding too many.
For this and many other issues, check out HowsMyPlan.com