Query tuning 101: Problems with IN ()

If you’re a DBA or database developer, chances are you will have to write or rewrite queries from time to time. Rewriting may be for new features or simply to improve the performance of existing code.

There are tens of thousands of tuning tips and tricks, and you can find just about every one of them online. Even so, I still see many developers writing queries in less-than-efficient ways. I’ve even been told that tuning doesn’t matter much anymore since the advent of SSD and flash storage. If only that were true.

Today I’d like to discuss IN (Transact-SQL) and a neat tip I learned from Joe Sack. Using IN with a list of values is logically the same as a chain of ORs.
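As a quick sanity check of that equivalence, here’s a minimal sketch using SQLite from Python rather than SQL Server (so plan shapes will differ, but the logical rewrite is the same; the table and column names are made up for the demo):

```python
import sqlite3

# In-memory demo table standing in for Sales.SalesOrderDetail
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, modified_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2007-01-05"), (2, "2007-01-17"), (3, "2007-03-01")],
)

# IN with a list of values...
in_rows = conn.execute(
    "SELECT id FROM orders "
    "WHERE modified_date IN ('2007-01-05', '2007-01-17') ORDER BY id"
).fetchall()

# ...is logically the same as a chain of ORs.
or_rows = conn.execute(
    "SELECT id FROM orders "
    "WHERE modified_date = '2007-01-05' OR modified_date = '2007-01-17' "
    "ORDER BY id"
).fetchall()

print(in_rows == or_rows)  # True: both return rows 1 and 2
```

The interesting part is that even though the two forms are logically identical, the optimizer may treat them differently once the list grows, which is what the queries below demonstrate.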

Let’s take a look at the following queries:

IN with 15 vs 16 values

SELECT SalesOrderID, ProductID, UnitPrice
FROM Sales.SalesOrderDetail
WHERE ModifiedDate IN ('2007-01-05', '2007-01-17', '2007-01-26', '2007-02-01', '2007-02-04', '2007-02-14', '2007-02-28', '2007-03-06', '2007-03-07', '2007-03-10', '2007-04-15', '2007-04-17', '2007-08-08', '2007-11-11', '2007-12-25')

Vs

SELECT SalesOrderID, ProductID, UnitPrice
FROM Sales.SalesOrderDetail
WHERE ModifiedDate IN ('2007-01-05', '2007-01-17', '2007-01-26', '2007-02-01', '2007-02-04', '2007-02-14', '2007-02-28', '2007-03-06', '2007-03-07', '2007-03-10', '2007-04-15', '2007-04-17', '2007-08-08', '2007-11-11', '2007-12-25', '2007-12-26')

Normally, you’d think these two queries would run with the exact same plan. After all, they’re the same query, except one has 16 values in its IN list while the other has only 15. Let’s examine the two plans.

(Screenshot: execution plan comparison for the 15-value vs 16-value queries)

As you can see, the second query is much slower; the extra value in the IN list caused late filtering. This is a limitation of certain operators, such as the clustered index scan in this plan.

The limit at 15 input values isn’t the only one; there’s another at 64. At the 65th input value the list is converted to a Constant Scan, which is then sorted and joined. Interestingly, the list in my demo query is already sorted in ascending order.

IN with 64 vs 65 values

SELECT DISTINCT city
FROM Person.Address
WHERE city IN (
'Boulogne-Billancourt','Boulogne-sur-Mer','Bountiful',
'Bracknell','Bradenton','Braintree','Brampton',
'Branch','Branson','Braunschweig','Bremerton',
'Brisbane','Brossard','Burbank','Burien',
'Burlingame','Burnaby','Bury','Byron','Calgary',
'Caloundra','Camarillo','Cambridge','Campbellsville',
'Canoga Park','Carnation','Carol Stream','Carrollton',
'Carson','Casper','Cedar City','Cedar Park',
'Central Valley','Cergy','Cerritos','Chalk Riber',
'Chandler','Chantilly','Charlotte','Chatou',
'Cheektowaga','Chehalis','Cheltenham','Cheyenne',
'Chicago','Chula Vista','Cincinnati','Citrus Heights',
'City Of Commerce','Clackamas','Clarkston','Clay',
'Clearwater','Cliffside','Cloverdale','Coffs Harbour',
'College Station','Colma','Colombes','Colomiers',
'Columbus','Concord','Coronado','Corpus Christi'
)

Vs

SELECT DISTINCT city
FROM Person.Address
WHERE city IN (
'Boulogne-Billancourt','Boulogne-sur-Mer','Bountiful',
'Bracknell','Bradenton','Braintree','Brampton',
'Branch','Branson','Braunschweig','Bremerton',
'Brisbane','Brossard','Burbank','Burien',
'Burlingame','Burnaby','Bury','Byron','Calgary',
'Caloundra','Camarillo','Cambridge','Campbellsville',
'Canoga Park','Carnation','Carol Stream','Carrollton',
'Carson','Casper','Cedar City','Cedar Park',
'Central Valley','Cergy','Cerritos','Chalk Riber',
'Chandler','Chantilly','Charlotte','Chatou',
'Cheektowaga','Chehalis','Cheltenham','Cheyenne',
'Chicago','Chula Vista','Cincinnati','Citrus Heights',
'City Of Commerce','Clackamas','Clarkston','Clay',
'Clearwater','Cliffside','Cloverdale','Coffs Harbour',
'College Station','Colma','Colombes','Colomiers',
'Columbus','Concord','Coronado','Corpus Christi','Seattle'
)

Here we see a significant difference between the two query plans from simply adding 'Seattle' to the list.

(Screenshot: execution plan comparison for the 64-value vs 65-value queries)

How do we work around these limitations?

I’ve always liked joining to a list instead of using IN or NOT IN. For IN, a simple INNER JOIN works great. For NOT IN, a LEFT JOIN with a WHERE ... IS NULL check on the joined column works just as well. Let’s look at the IN case:

DECLARE @in TABLE (city nvarchar(60))

INSERT INTO @in
VALUES ('Boulogne-Billancourt'),('Boulogne-sur-Mer'),('Bountiful'),
('Bracknell'),('Bradenton'),('Braintree'),('Brampton'),
('Branch'),('Branson'),('Braunschweig'),('Bremerton'),
('Brisbane'),('Brossard'),('Burbank'),('Burien'),
('Burlingame'),('Burnaby'),('Bury'),('Byron'),('Calgary'),
('Caloundra'),('Camarillo'),('Cambridge'),('Campbellsville'),
('Canoga Park'),('Carnation'),('Carol Stream'),('Carrollton'),
('Carson'),('Casper'),('Cedar City'),('Cedar Park'),
('Central Valley'),('Cergy'),('Cerritos'),('Chalk Riber'),
('Chandler'),('Chantilly'),('Charlotte'),('Chatou'),
('Cheektowaga'),('Chehalis'),('Cheltenham'),('Cheyenne'),
('Chicago'),('Chula Vista'),('Cincinnati'),('Citrus Heights'),
('City Of Commerce'),('Clackamas'),('Clarkston'),('Clay'),
('Clearwater'),('Cliffside'),('Cloverdale'),('Coffs Harbour'),
('College Station'),('Colma'),('Colombes'),('Colomiers'),
('Columbus'),('Concord'),('Coronado'),('Corpus Christi'),('Seattle')

SELECT DISTINCT a.city
FROM Person.Address a
INNER JOIN @in i ON (i.city = a.City)

Since we don’t already have a table to join against, we create a table variable and insert the 65 values. Then we join the table variable in our query, and you can see that this approach appears to cost much less. Notice that the IN version forces an index scan while the join uses an index seek.
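The same rewrite can be sanity-checked outside SQL Server. This minimal Python/SQLite sketch (table names and data are made up for the demo, and SQLite’s planner is not SQL Server’s) shows the IN version and the JOIN version returning identical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (city TEXT)")
conn.executemany("INSERT INTO address VALUES (?)",
                 [("Seattle",), ("Chicago",), ("Paris",), ("Seattle",)])

# The lookup values go into their own table, playing the role of the
# @in table variable from the post.
conn.execute("CREATE TABLE lookup (city TEXT)")
conn.executemany("INSERT INTO lookup VALUES (?)",
                 [("Seattle",), ("Chicago",)])

# Original form: filter with an IN list.
in_version = conn.execute(
    "SELECT DISTINCT city FROM address "
    "WHERE city IN ('Seattle', 'Chicago') ORDER BY city"
).fetchall()

# Rewritten form: join to the lookup table instead.
join_version = conn.execute(
    "SELECT DISTINCT a.city FROM address a "
    "JOIN lookup l ON l.city = a.city ORDER BY a.city"
).fetchall()

print(in_version == join_version)  # True
```

The results are the same either way; the point of the rewrite is purely to give the optimizer a join it can handle with a seek rather than a long list of disjuncts.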

(Screenshot: execution plans for the IN query vs the JOIN workaround)
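For the NOT IN side, the LEFT JOIN ... WHERE ... IS NULL pattern can be sketched the same way. This Python/SQLite example (again with made-up demo tables) shows it returning the same rows as the equivalent NOT EXISTS form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (city TEXT)")
conn.executemany("INSERT INTO address VALUES (?)",
                 [("Seattle",), ("Chicago",), ("Paris",)])
conn.execute("CREATE TABLE excluded (city TEXT)")
conn.execute("INSERT INTO excluded VALUES ('Seattle')")

# LEFT JOIN ... IS NULL: keep only rows with no match in the exclusion list.
left_join = conn.execute(
    "SELECT a.city FROM address a "
    "LEFT JOIN excluded e ON e.city = a.city "
    "WHERE e.city IS NULL ORDER BY a.city"
).fetchall()

# NOT EXISTS: the same rows, expressed as an anti-semi-join.
not_exists = conn.execute(
    "SELECT a.city FROM address a "
    "WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE e.city = a.city) "
    "ORDER BY a.city"
).fetchall()

print(left_join == not_exists)  # True
```

One caveat worth knowing: NOT IN behaves surprisingly when the exclusion list contains a NULL (it returns no rows at all), while the LEFT JOIN and NOT EXISTS forms don’t have that problem.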

Summary

I hope you’ve found this post useful. We all know the saying “it depends”; it’s a phrase to live by in SQL Server and one you should always keep in mind. Adequate testing is always needed for any proper tuning job.

When tuning or writing new queries, take this tip into account: IN and OR lists have limits, and those limits can slow your query. Test with an IN and with a JOIN to see which is better in your environment.

If you liked this post you may want to read some more from me at: http://www.sqlservercentral.com/blogs/confessions-of-a-microsoft-addict/ and https://sqltechblog.com/

I also invite you to follow me on Twitter and LinkedIn.

11 thoughts on “Query tuning 101: Problems with IN ()”

  1. In case you want to test the second query you’ll want to add the index shown in the post. Here’s the DDL:
    USE [AdventureWorks2012]
    GO
    CREATE NONCLUSTERED INDEX [ncl_demo2] ON [Person].[Address] ( [City] ASC ) ON [PRIMARY]
    GO


  2. Instead of LEFT JOIN with IS NULL as a substitute for NOT IN, it might be better to use NOT EXISTS and take advantage of a semi join. This is especially true for very large tables.


  3. Interesting, the limits at 15 and 64. But execution plans are just optimised, not always optimal. And statistics play a big role here, especially in your last example. Table variables do not have any statistics, so the optimizer assumes there’s just one row in the table – you should be careful when using them.


  4. I think this would be useful only when you have a small number of values you need to check for. From my knowledge, when using a table variable the estimates for that table’s cardinality will always be 1, which favors the use of a Nested Loop in your plan. In this case, for a number of 65 (or slightly more – tens to a few reasonable hundreds) the plan will work well, but for a larger number of values this will go very badly.

    Anyway, as you said, this “depends” and I just wanted to add this additional information to your article, since I think it’s worth considering if your query goes slower than expected.


  5. Hi,
    great article, I actually didn’t know that. Thx a lot.
    I think that the JOIN query could actually be improved by using a temporary table instead of a table variable. Tested on an Azure VM with SQL Server 2014 SP2.

    These are my queries:

    create table #tempin (city nvarchar(60))
    go
    INSERT INTO #tempin
    VALUES ('Boulogne-Billancourt'),('Boulogne-sur-Mer'),('Bountiful'),
    ('Bracknell'),('Bradenton'),('Braintree'),('Brampton'),
    ('Branch'),('Branson'),('Braunschweig'),('Bremerton'),
    ('Brisbane'),('Brossard'),('Burbank'),('Burien'),
    ('Burlingame'),('Burnaby'),('Bury'),('Byron'),('Calgary'),
    ('Caloundra'),('Camarillo'),('Cambridge'),('Campbellsville'),
    ('Canoga Park'),('Carnation'),('Carol Stream'),('Carrollton'),
    ('Carson'),('Casper'),('Cedar City'),('Cedar Park'),
    ('Central Valley'),('Cergy'),('Cerritos'),('Chalk Riber'),
    ('Chandler'),('Chantilly'),('Charlotte'),('Chatou'),
    ('Cheektowaga'),('Chehalis'),('Cheltenham'),('Cheyenne'),
    ('Chicago'),('Chula Vista'),('Cincinnati'),('Citrus Heights'),
    ('City Of Commerce'),('Clackamas'),('Clarkston'),('Clay'),
    ('Clearwater'),('Cliffside'),('Cloverdale'),('Coffs Harbour'),
    ('College Station'),('Colma'),('Colombes'),('Colomiers'),
    ('Columbus'),('Concord'),('Coronado'),('Corpus Christi'),('Seattle')
    SELECT DISTINCT a.city
    FROM Person.Address a
    INNER JOIN #tempin i on (i.city = a.City)
    drop table #tempin

    these are the stats for the table variable query:

    Table '#BC81FB54'. Scan count 0, logical reads 65, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (65 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 69 ms.

    (65 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Address'. Scan count 1, logical reads 216, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#BC81FB54'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:
    CPU time = 32 ms, elapsed time = 870 ms.

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.

    and these are my stats for the temp table query:

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.
    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 11 ms.
    Table '#tempin_____________________________________________________________________________________________________________000000000058'. Scan count 0, logical reads 65, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (65 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:
    CPU time = 31 ms, elapsed time = 80 ms.
    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 19 ms.

    (65 row(s) affected)
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Address'. Scan count 1, logical reads 216, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#tempin_____________________________________________________________________________________________________________000000000058'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:
    CPU time = 31 ms, elapsed time = 488 ms.

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.

    kr

    Michal Piatek


    • Hi Michal, Thanks for the comment. I think in every case “it depends.” On my Surface running SQL 2016, a temp table (#) is consistently 100 ms slower than a table variable (@) for this tiny query. The temp table also produces an undesired plan. It’s true there are no stats for table variables, but sometimes that helps. The other thing to note is that a table variable can be passed as a parameter, which could be very helpful if you’re just trying to pass a small list of values to a stored procedure. I say, test and retest to find the most efficient code for your environment. Thanks again!


    • It doesn’t. Using a table in your IN "should" produce the same plan as a join. Always be sure to test your queries and view the plan for anything funky like the Constant Scan or Filter operators. If you see those, you may be able to improve the query.

