Pages

Thursday, November 4, 2010

Multiple statistics sharing the same leading column

SQL Server will not prevent you from creating identical indexes or statistics. You may think that everything is under your control, but have you ever added a missing index that has been reported by the query optimizer, or the DTA? If so, you will certainly have created duplicate statistics. Have a look at the following example:

-- create test table with 3 columns
-- filler is only nedded to widen the row

create
table T1
 (

  c1
int not null identity(1,1) primary
key
 ,c2 int not
null
 ,filler nchar(200) not null default
'#'
 
)
go

-- insert 500000 rows with random
-- values between 0 and 49999 for c2

insert T1(c2
)
  select top(500000) abs(checksum(newid())) %
50000
   
from sys.fulltext_system_stopwords as
a
       
,sys.fulltext_system_stopwords as b

Before invoking the following SELECT command, allow the actual execution plan being displayed (Strg-M or selecting Query/Include Actual Execution Plan from the menu).

-- get some data
select * from T1 where c2 = 100

If you look at the execution plan, you see that the optimizer complains about a missing index on column c2. The prognosticated improvement is about 99%, so adding the index it’s certainly a good idea. Let’s do it:

-- add missing index
create nonclustered index ix_1 on t1(c2)

Perfect! The query runs much faster now and needs a lot fewer resources. But have a look at the table’s statistics:

image

You see three statistics, one for the primary key, a second one for our created index ix_1, and a third one that was automatically created during execution plan generation for the first SELECT statement. This is the statistics, named _WA_Sys…. If the AUTO CREATE STATISTICS option is set to ON, the optimizer will add missing statistics automatically. In our little experiment, the optimizer had to generate this column statistics on column c2 in order to make some assumptions about the number of rows that had to be processed.

And here’s the problem: When creating the index on column c2, a statistics on this column is also created, since every index has a corresponding linked statistics. That’s just the way it works. At the time the index was added, the column statistics on c2 (that _WA_Sys… statistics) already existed. If you don’t remove it manually, this statistics will remain there forever, although it is needless now. All it’s good for is to increase maintenance efforts during statistics updates. You can safely remove this statistics by executing:

drop statistics t1._WA_Sys_...

If you didn’t think about this before, there’s a chance that you’ll find some of those superfluous statistics duplicates inside your database(s). Here’s a query that finds index-related and column-statistics that match on the first column. Looking for matches on the first column is sufficient here, since the optimizer only automatically adds missing single-column statistics.

with all_stats as
 
(
  select s.object_id, s.name as stats_name, s.stats_id, s.
has_filter
        ,s.auto_created, sc.
stats_column_id
        ,sc.column_id, c.name as
column_name
    from sys.stats as
s
         inner join sys.stats_columns as
sc
                 on s.stats_id = sc.
stats_id
                and s.object_id = sc.
object_id
                and sc.stats_column_id =
1
        
inner join sys.columns as
c
                 on c.object_id = s.
object_id
                and c.object_id = sc.
object_id
                and c.column_id = sc.
column_id
   where objectproperty(s.object_id, 'IsUserTable') = 1
)
select row_number() over(partition by s1.object_id, s1.column_name order by s1.column_name) as
stats_no
      ,object_name(s1.object_id) as
table_name
      ,s1.stats_name as
stats_name
      ,s2.stats_name as
identical_stats_name
      ,s2.
column_name
  from all_stats as
s1
       inner join all_stats as
s2
               on s1.object_id = s2.
object_id
              and s1.stats_id != s2.
stats_id
              and s1.stats_column_id = s2.
stats_column_id
              and s1.column_id = s2.
column_id
              and s1.has_filter = s2.
has_filter
              and s1.auto_created != s2.auto_created

With that query at hand, you may easily find redundant statistics. Here’s a sample output:

image

If you’d like to find out more about SQL Server statistics, you may want to check out my series of two articles, published recently on the Simple-Talk platform.

Part 1: Queries, Damned Queries and Statistics
Part 2: SQL Server Statistics: Problems and Solutions

It’s free, so you might want to give it a try. Feel free to vote, if you like it!

10 comments:

  1. I can only run this by removing references to has_filter

    ReplyDelete
  2. Randy,

    are you on SQL Server 2005? The filtered index (and also filtered stats) feature is new for 2008.

    Regards,
    Holger

    ReplyDelete
  3. Hmmm I got two similar database (same tables, about the same amount of content rows but different rows and different views). On the one db this 'get dual statistics' query takes 19 seconds to return 22 results, on the other it takes like no time to come up with 40 results.
    Something is wrong in the first DB, but finding out what could be a nice puzzle :)

    ReplyDelete
  4. Thanks, these three posts are great. Spend a day reading them and enjoying playing with statistics and indexes.
    But then I ran SQL Profiler for 10 minutes on a production database, and then ran the trace through the tuning wizard. It came up with suggestions to add 90 stats (every time with different columns) on 1 table.
    It correctly selected the most heavily used table for its suggestions so its somewhat correct. But creating 90 stats on 1 table? It says those will do about 80% performance improvement. But how do I know it is 1% per stat, or that 5 stats will do 50% and the rest all together 30%? How can I evaluate which of the suggestions the tuning advisor reports are the ones with the most impact?

    ReplyDelete
  5. To me, the DTA seems somewhat aggressive regarding statistics recommendations. May be that's because of multi-column-stats which will not be added automatically during query execution.
    When evaluating DTA hints, I usually start with indexes, leaving stats aside.
    All in all, there's no rule of thumb on how to "do it right". You'll need some sure instincts and experience for this task...

    ReplyDelete
  6. Ok, thanks. Yes the indexes are already there, did help in the past, but not anymore. The advisor also doesn't suggest any new indexes anymore so I guess it thinks we've already got them all. I'll guess I'll have to look into the slowest and most executed queries and determine which multi-column-stat hint could help on those.
    Thanks for the answer because now I can stop looking at a way to get more info out of the DETA :).

    ReplyDelete
  7. When I execute the appropriate DROP STATISTICS commands for the auto-created statistics based on these findings, I get this error for each one:

    Cannot drop the statistics , because it does not exist or you do not have permission.

    However, they no longer exist. Do you know why it would say it cannot drop them and then do it anyway?

    - Mark

    ReplyDelete
  8. Mark,
    I have no idea. Sorry. Are you on SQL Server 2008 (R2) or later? I forgot to mention that I engineered the script for SQL Server 2008.

    Regards,
    Holger

    ReplyDelete
  9. Hi, is there a way to estimate the size of these stats objects in the database?
    Knowing that adds to the argument for dropping them.

    Thank you for this post,
    Marios Philippopoulos

    ReplyDelete
    Replies
    1. I remember an article on simple-talk.com that explains how to guess the size of the stats.
      Please do yourself a favor and do NOT delete those.

      Regards,
      Holger

      Delete

Followers