DBA

Statistics – how often and how precise? (Sept 2004)

Jonathan Lewis, JL Computer Consultancy

Preamble

The Cost Based Optimizer needs information about your data - but why, and how much?

Some sites recompute statistics for all columns of every schema every night; some sites take a small estimate of the odd object from time to time; most sites will probably be somewhere in the middle of the range. Some people will just switch to AUTO with 10g.

Is there any approach that will be just right for everybody? And if not, why not?

The presentation explains some of the basics of optimizer statistics, and will help you decide how to minimize the resources spent gathering statistics.

Optimal Thinking

The first (and only) rule of optimization is: "Avoid unnecessary effort". But you have to operate this rule at many levels. In the case of statistics, for example, you have three considerations:

·  The optimizer should be able to use your statistics to find the most efficient execution plan for a statement.

·  You should not use unnecessary machine resources generating statistics.

·  You should not spend excessive amounts of time trying to work out a minimal strategy for generating statistics.

You can cause problems (other than the simple waste of resources) by collecting too many statistics, but as a general rule: if you don't see any performance problems that could be caused by inappropriate statistics, and if you don't have any problems hitting your SLAs because of the time spent gathering statistics, then point 3 applies: you can probably find better things to do with your time than refine a strategy that is already working.

This paper is particularly for people who think their performance problems may be related to optimizer issues, or for people who are under pressure to find some more time in their batch windows. However, if you are interested in how the optimizer works, you might want to read this paper simply to enhance your understanding.

What does the CBO use statistics for?

In order to work out the optimum execution path, the optimizer needs to decide the best join order, the best join mechanism (e.g. hash, merge, nested loop), and the best data acquisition method (e.g. index, tablescan) for each table. Unless Oracle can work out a reasonably accurate estimate of the number of rows that will be acquired at each stage of the join it is likely to make some bad decisions.

Consider, for example, the single-table query:

select
        {list of columns}
from    t1
where
        n1 between 1000 and 2000
;

If Oracle's estimate of the number of rows to be acquired is very small then it is possible that an index on (n1) will be used. If Oracle's estimate is that a few hundred rows will be acquired then the index may not be used, even if the target rows are very tightly clustered. In other words, the estimated number of rows returned can affect the data acquisition mechanism.
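You can see this estimate directly in the "Rows" column of an execution plan. A minimal sketch, assuming the t1/n1 example above and the dbms_xplan package (available from 9.2):

explain plan for
select  *               -- any column list will do for checking the estimate
from    t1
where   n1 between 1000 and 2000;

select * from table(dbms_xplan.display);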

Consider a more complex case:

select
        t1.*, t2.*
from    t1, t2
where
        t1.n1 between 10 and 20
and     t2.id = t1.id
and     t2.n2 between 100 and 200
;

Oracle may decide that there are just a few relevant rows in t1, and that these rows can be acquired using an index on (n1), and that there are just a few rows in t2 which can be acquired using an index on (id) for each row in t1. In this case the optimizer could decide that the join order should be (t1 to t2), and the join mechanism should be a nested loop.

But if the optimizer's estimate of the number of relevant rows in t1 is large, then it may decide that a nested loop would do too many indexed accesses into t2 to be efficient, and it might switch to a hash join. (Even then it might be a good idea to use an indexed access path into t2, through the index on (n2), just once.) So an error in the estimated number of rows to be acquired from t1 can result in a change in join mechanism.

Similarly, if the optimizer decided that each value of t1.id would find a lot of rows in t2, then it might decide to start by selecting rows from t2 using an index on (n2) before joining to t1. So an error in the estimated number of rows can result in a change of join order.
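If you want to watch these decisions change, you can force the alternatives with hints and compare the resulting plans and their row estimates (a sketch using standard hints; it is not part of the original test):

-- Nested loop join driven from t1:
explain plan for
select  /*+ ordered use_nl(t2) */
        t1.*, t2.*
from    t1, t2
where   t1.n1 between 10 and 20
and     t2.id = t1.id
and     t2.n2 between 100 and 200;

select * from table(dbms_xplan.display);

-- Repeat with /*+ ordered use_hash(t2) */ to see how the cost and
-- estimated row counts change when the join mechanism changes.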

Technically, Oracle bases its arithmetic on estimates of selectivity (fraction of available rows) and then derives the cardinality (number of rows to be acquired); however, the switch between cardinality and selectivity doesn't really make a lot of difference to the discussion, and it is sometimes easier to comprehend what is going on by looking at the cardinality.
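As a simple illustration of the arithmetic (an assumed example, using the basic model for an equality predicate with no histogram): if a table holds 1,000,000 rows and a column has 50,000 distinct values, the selectivity of "column = constant" is taken as 1/50,000 = 0.00002, and the derived cardinality is 1,000,000 * 0.00002 = 20 rows.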

How do you get accuracy?

For the simpler cases (ignoring the requirement for histograms, the problems of column dependencies, and some 'sanity checks' that have appeared in recent versions of Oracle), there are basically eight critical numbers that Oracle wants:

·  Number of used blocks in each table **

·  Number of rows in each table

·  Number of distinct values for each column

·  Number of nulls in each column

·  Average length of each column

·  Number of branch levels of index **

·  Number of leaf-blocks in index

·  Clustering factor of index

The figures marked ** can be found very cheaply, but Oracle may have to do a lot of work to 'learn' the other numbers.
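For reference, all eight figures are visible in the dictionary once statistics have been gathered; a sketch, assuming the table t1 and an index called t1_i1:

select  blocks, num_rows
from    user_tables
where   table_name = 'T1';

select  column_name, num_distinct, num_nulls, avg_col_len
from    user_tab_columns
where   table_name = 'T1';

select  blevel, leaf_blocks, clustering_factor
from    user_indexes
where   index_name = 'T1_I1';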

For the table-related figures, Oracle could read a random selection of rows (checking the figure for rows per block – nrow in a block dump – for each block it had to visit) and assume that the rows are a true representation of the table. The arithmetic would not be very difficult.
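You can mimic the idea with the sample clause; the following is just an illustrative sketch (not the code Oracle itself runs) that estimates rows per block from a 1% block sample:

select
        round(
                count(*) / count(distinct dbms_rowid.rowid_block_number(rowid)),
                2
        )       rows_per_block
from
        t1 sample block (1)
;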

For the index-related figures, Oracle has to check a selection of blocks from the index to be able to generate a clustering factor. It is interesting to note that one of the most extreme differences between using the newer dbms_stats call and the old analyze call for collecting statistics on indexes is that they count index leaf blocks differently – dbms_stats considers only leaf blocks which are currently in use, analyze includes the leaf blocks that are on the index's free list.
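To see the difference on your own system you could gather the index statistics both ways and compare the recorded figure (a sketch; t1_i1 is an assumed index name):

analyze index t1_i1 compute statistics;

select  leaf_blocks from user_indexes where index_name = 'T1_I1';

begin
        dbms_stats.gather_index_stats(
                ownname => user,
                indname => 'T1_I1'
        );
end;
/

select  leaf_blocks from user_indexes where index_name = 'T1_I1';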

Given that the mechanisms are relatively straightforward, the important questions are:

·  How much error can you allow in any particular statistic before the optimizer does the wrong thing?

·  How large a sample is needed to keep the error within that limit?

·  How long can you leave the statistics in place before they drift outside that limit?

The general answer is: you can get away with being very lazy in most cases. However, you do need to be careful some of the time. Moreover, there are cases where Oracle is simply unable to produce a meaningful statistic however much work you make it do. To demonstrate the principles, consider the following worked example.

Test Case

I am going to generate four sets of numeric data in a table of 1,000,000 rows. Three of the data sets will each have 50,000 distinct values and 20 rows for each value, but the distribution of the values across the rows will be different in each set. One data set will cluster the values very tightly, one data set will scatter them very evenly through the table, and the third data set will use the dbms_random package to scatter the values randomly – in fact, this last data set will probably not have exactly 20 rows for each value.

The final data set will be generated from the "normal distribution" function that is available in the package dbms_random. As a result, you will not see the same number of rows for each value generated.

The SQL uses subquery factoring (an Oracle 9i innovation) to produce a large result set easily:

create table t1
as
with milli_row as (
        select  /*+ materialize */
                rownum
        from    all_objects
        where   rownum <= 1000
)
select
        mod(rownum-1, 50000)                    scattered,
        trunc((rownum-1)/20)                    clustered,
        trunc(dbms_random.value(0,50000))       uniform,
        trunc(7000 * dbms_random.normal)        normal
from
        milli_row       m1,
        milli_row       m2
where
        rownum <= 1000000
;
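Statistics can then be gathered at a chosen sample size with a call along these lines (a sketch; the method_opt and cascade settings are assumptions, not taken from the original test):

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                estimate_percent => 10,         -- varied for each test
                method_opt       => 'for all columns size 1',
                cascade          => true
        );
end;
/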

We can now use dbms_stats.gather_table_stats() against this table at different sample sizes, to see how good Oracle is at estimating the number of distinct values there are for each column. The results from one test run on a 9.2.0.4 system were as follows (because of Oracle's randomized sampling, these results will not be exactly reproducible):

Sample %    Scattered    Clustered    Uniform    Normal
   100       50,000       50,000      50,000     42,186
    80       50,001       50,001      50,001     41,106
    60       50,001       50,001      50,001     39,702
    40       49,999       49,999      49,994     37,541
    20       50,003       49,957      48,942     29,726
    10       49,929       49,924      48,942     29,726
     8       49,909       49,898      48,784     28,818
     6       49,929       49,395      48,486     27,737
     4       49,492       49,673      48,153     26,574
     2       50,540       49,594      46,622     25,643
     1       50,939       48,794      49,676     25,062

This highlights two important points. First, there are columns where even a 1% sample size does not introduce much error; second, there are columns where the sample size has a huge impact on the statistical results.

Let's deal with the 'nice' data first – columns where the distribution of values is even, boring and dull. In the worst possible case, Oracle has produced the value 46,622 as the estimate of the number of distinct values. This means that if you supplied the predicate "column = constant", the optimizer would estimate the number of relevant rows to be 1,000,000 / 46,622, which is 21.449, which Oracle would round to 21. It's probably close enough in most cases to leave the optimizer doing the right thing with a large table.

But what about the last column of results, the 'normal' column – depending on the sample size, the optimizer will predict anything between 24 and 40 rows for a simple predicate like 'column = constant'. It's quite possible that a variation of nearly 100% would cause the optimizer to pick a sub-optimal execution path. Why has this happened?

Let's look at the data more closely:

select count(*) from t1 where normal = -15000;  -- 6 rows
select count(*) from t1 where normal =  -5000;  -- 47 rows
select count(*) from t1 where normal =   -200;  -- 59 rows
select count(*) from t1 where normal =      0;  -- 114 rows
select count(*) from t1 where normal =    200;  -- 62 rows
select count(*) from t1 where normal =   5000;  -- 41 rows
select count(*) from t1 where normal =  15000;  -- 8 rows

The normal distribution is the famous 'bell curve' of the statistician – lots of values in the middle, tapering away to either side. How is Oracle supposed to provide one 'magic number' that describes the data? If our business is always asking about values in the range 10,000 – 20,000 then we want Oracle to believe that we will only get half a dozen rows from our queries. If our business is always asking about values in the range –10 to +10, we want Oracle to believe that we will always get about 100 rows from our queries. There is no solution to this problem – except, perhaps, to invent some statistics that manage to describe the data to Oracle in the way that we see it.
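As an aside, the dbms_stats package does allow you to write your own figures into the data dictionary; a minimal sketch (the figures shown are purely illustrative, not a recommendation):

begin
        dbms_stats.set_column_stats(
                ownname => user,
                tabname => 'T1',
                colname => 'NORMAL',
                distcnt => 20000,       -- invented number of distinct values
                density => 1/20000,     -- invented density
                nullcnt => 0,
                avgclen => 4
        );
end;
/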

Histograms

Of course, for cases like this, Oracle does give us some help with histograms. So let's take a look at what a histogram is.

To start, let's draw a graph of our 'normal' data. Obviously, we could count how many rows there are for each value, and then plot a point (value, number of rows) on a simple line graph. However, there are a million rows in the table, so this would be quite a lot of effort. So let's approximate by sorting the rows and batching them. We have two obvious options:

Option 1 – we could draw a bar chart by saying: "How many rows have a value between –35,000 and –30,000? Draw a bar that height. How many rows have a value between –30,000 and –25,000? Draw a bar that height" – and so on across the range.

Option 2 – we could say: "Break the sorted list into equal-sized chunks (say 100,000 rows each); what are the lowest and highest values in the first chunk? Draw a bar across that range. The average number of rows for any value in that range is the number of rows in the chunk divided by the number of different possible values in the range, i.e. 100,000 / (highest – lowest)."

The nice thing about option 2 is that we don't have to know in advance the highest and lowest values in the table before we decide on the widths of the bars we need to draw – so let’s use that method, and write a little SQL to work the results out for us. Analytic functions are excellent for this type of work:

select
        tenth,
        min(normal),
        max(normal),
        round(100000 / (max(normal) - min(normal)),2)   height
from    (
        select
                normal,
                ntile(10) over (order by normal)        tenth
        from    t1
        )
group by
        tenth
order by
        tenth
;

The inline view sorts the data into order, and breaks it into tenths, giving each row an extra column that identifies which 'tenth' of the data it belongs to. The outer query then gets the low/high for each 'tenth' and works out a height for the bar by dividing this range into the number of rows in this tenth (in this case hard-coded at 100,000). The resulting graph is shown below.
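To get Oracle to build its own version of this picture, you ask dbms_stats for a histogram on the column; a sketch of the call (the bucket count of 75 and the sample size are arbitrary choices for illustration):

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                estimate_percent => 10,
                method_opt       => 'for columns normal size 75'
        );
end;
/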

BUT – an interesting effect appears when you use the dbms_stats package to create a histogram for this data – you find that Oracle runs SQL like the following.

select
        min(minbkt),
        maxbkt,
        substrb(dump(min(val),16,0,32),1,120)   minval,
        substrb(dump(max(val),16,0,32),1,120)   maxval,