How to fix a cardinality misestimate?

In this post, I am going to explain how to correct cardinality estimates using five different methods. The cardinality estimate is one of the first places to check in the execution plan when dealing with query performance.

Examples have been implemented in Oracle 19c.

1) Missing Statistics

As you might imagine, if the estimated number of rows is not correct, the first thing to check is whether the statistics are up to date. One indicator that statistics are stale or missing is a note in the execution plan saying that dynamic sampling was used. I will disable dynamic sampling on purpose so that it does not affect the execution plans, and I will come back to it later in this post. The DBMS_STATS package can be used to update the statistics. Let's begin with a fundamental example.

alter session set optimizer_dynamic_sampling = 0;

create table t_test
(
    id number,
    txt varchar2(100),
    num number
);

insert into t_test(id, txt, num)
select 
    level, rpad('X', mod(level, 10), 'X'), mod(level, 10) 
from dual connect by level <= 100000;

commit;

select num, count(*) from t_test group by num order by 1;

In the first step, I created a table called T_TEST and populated it with 100K rows. As can be seen from the result above, rows have been generated evenly for each value in the NUM column: there are 10K records for each value.

explain plan for
select * from t_test where num = 5;

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2796558804
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |     1 |    78 |   104   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST |     1 |    78 |   104   (2)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=5)

The ROWS field shows 1 in the explain plan output, whereas the correct value should be 10000. Let's gather the table statistics and check the explain plan again.

select 
    table_name, num_rows, stale_stats, notes 
from USER_TAB_STATISTICS where table_name = 'T_TEST';

begin
    dbms_stats.gather_table_stats(ownname => user, 
                                  tabname => 'T_TEST', 
                                  method_opt => 'FOR ALL COLUMNS SIZE 1');
end;
/

select 
    table_name, num_rows, stale_stats, notes 
from USER_TAB_STATISTICS where table_name = 'T_TEST';

explain plan for
select * from t_test where num = 5;

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2796558804
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        | 10000 |   136K|   104   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST | 10000 |   136K|   104   (2)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=5)

After the statistics have been gathered, the ROWS field, which represents the estimated number of rows, shows the correct value.
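
To verify what was collected at the column level, you can also query the USER_TAB_COL_STATISTICS view (used again later in this post); since we gathered with SIZE 1, the HISTOGRAM column should report NONE:

select 
    table_name, column_name, num_distinct, histogram 
from USER_TAB_COL_STATISTICS where table_name = 'T_TEST';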

2) What if the data values are highly skewed?

Histograms come into play at this point. Let's recreate our test table, but this time without distributing the data in the NUM column evenly.

drop table t_test;

create table t_test
(
    id number,
    txt varchar2(100),
    num number
);

insert into t_test(id, txt, num)
with 
q1 as (
    select level lv from dual connect by level <= 50
)
select rownum, rpad('X', mod(rownum, 10), 'X'), lv 
from q1, lateral (select level lv2 from dual connect by level <= power(lv,3));

commit;

select count(*) from t_test;
select num, count(*) from t_test group by num order by 1;

In total, 1,625,625 rows have been generated: the number of rows for each NUM value from 1 to 50 is the cube of that value. For example, we know that there will be 3375 (15³) rows for NUM = 15.

explain plan for
select * from t_test where num = 15;
select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2796558804
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |     1 |    78 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST |     1 |    78 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=15)

However, the estimated number of rows is shown as 1 in the explain plan output. Let's first gather statistics without histograms.

begin
    dbms_stats.gather_table_stats(ownname => user, 
                                  tabname => 'T_TEST', 
                                  method_opt => 'FOR ALL COLUMNS SIZE 1');
end;
/
explain plan for
select * from t_test where num = 15;
select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2796558804
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        | 32513 |   412K|  1192   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST | 32513 |   412K|  1192   (3)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=15)

The plan now shows a cardinality estimate roughly 10 times higher than the actual value. You may be wondering where 32513 came from. The table has 1,625,625 rows and the NUM column has 50 distinct values, so the optimizer simply divides the number of rows by the number of distinct values: 1625625 / 50 ≈ 32513. As a result, gathering statistics without histograms may not be sufficient to get correct cardinality estimates for columns with skewed data.
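
You can reproduce this arithmetic from the dictionary; here is a quick sanity check against the same statistics views used elsewhere in this post:

select 
    t.num_rows, c.num_distinct, 
    round(t.num_rows / c.num_distinct) estimated_rows
from USER_TAB_STATISTICS t, USER_TAB_COL_STATISTICS c
where t.table_name = c.table_name
and   t.table_name = 'T_TEST'
and   c.column_name = 'NUM';

Lastly, let's gather statistics again, this time with histograms, and check the plan once more.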

begin
    dbms_stats.gather_table_stats(ownname => user, 
                                  tabname => 'T_TEST', 
                                  method_opt => 'FOR ALL COLUMNS SIZE AUTO');
end;
/
explain plan for
select * from t_test where num = 15;
select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2796558804
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |  3375 | 43875 |  1192   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST |  3375 | 43875 |  1192   (3)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=15)

explain plan for
select * from t_test where num = 50;
select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2796558804
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |   124K|  1586K|  1192   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST |   124K|  1586K|  1192   (3)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=50)

explain plan for
select * from t_test where num = 4;
select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2796558804
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |    64 |   832 |  1192   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST |    64 |   832 |  1192   (3)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=4)

So, we see that the cardinality is calculated correctly for different values after creating the histogram.
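
If you are curious what the optimizer now knows about NUM, the bucket endpoints can be inspected through the USER_TAB_HISTOGRAMS view; with only 50 distinct values, a frequency (or top-frequency) histogram is expected here:

select endpoint_value, endpoint_number 
from USER_TAB_HISTOGRAMS 
where table_name = 'T_TEST' and column_name = 'NUM' 
order by endpoint_number;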

3) Multiple related columns on a single table

If you have a cardinality issue, you should also consider whether extended statistics will resolve it. If multiple columns in a table are correlated with each other, extended statistics can be created on those columns, because standard single-column statistics are not sufficient in this case. For example, city and country columns in the same table will be related to each other; car brands and models are another example. In the following example, suppose that there is a correlation between the C1 and C2 columns; the data is set up accordingly.

drop table t_test2;

create table t_test2
(
    id number,
    c1 varchar2(20),
    c2 number
);

insert into t_test2
select 
    level id,
    case 
        when level between 1 and 5000 then 'A'
        when level between 5001 and 6500 then 'B'
        when level between 6501 and 7000 then 'C'
        when level between 7001 and 9000 then 'D'
        when level between 9001 and 9010 then 'E'
        when level between 9011 and 9100 then 'F'
        when level between 9101 and 9200 then 'G'
        when level between 9201 and 10000 then 'H'
    end c1,
    case 
        when level between 1 and 5000 then floor(DBMS_RANDOM.VALUE(1, 101))
        when level between 5001 and 6500 then floor(DBMS_RANDOM.VALUE(101, 201))
        when level between 6501 and 7000 then floor(DBMS_RANDOM.VALUE(201, 301))
        when level between 7001 and 9000 then floor(DBMS_RANDOM.VALUE(301, 401))
        when level between 9001 and 9010 then floor(DBMS_RANDOM.VALUE(401, 501))
        when level between 9011 and 9100 then floor(DBMS_RANDOM.VALUE(501, 601))
        when level between 9101 and 9200 then floor(DBMS_RANDOM.VALUE(601, 701))
        when level between 9201 and 10000 then floor(DBMS_RANDOM.VALUE(701, 801))
    end c2
from dual connect by level <= 10000;

commit;
select count(*) from t_test2 where c1 = 'A' and c2 = 52;

begin
    dbms_stats.gather_table_stats(ownname => user, tabname => 'T_TEST2');
end;
/

select 
    table_name, column_name, num_distinct, last_analyzed, histogram 
from USER_TAB_COL_STATISTICS where table_name = 'T_TEST2';

explain plan for
select count(*) from t_test2 where c1 = 'A' and c2 = 52;

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2872170939
 
------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     6 |     7   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |         |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_TEST2 |    24 |   144 |     7   (0)| 00:00:01 |
------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("C2"=52 AND "C1"='A')

In total, 10K rows have been inserted into the test table. As seen above, the query actually returns 47 rows, while the optimizer estimates only 24. Note that histograms were also created for the C1 and C2 columns. For a more accurate estimate, we can tell the optimizer about this correlation by creating extended statistics on these columns.

select 
  dbms_stats.create_extended_stats(ownname => user,
                                   tabname => 'T_TEST2', 
                                   extension => '(C1, C2)') 
from dual;
select * from user_stat_extensions where table_name = 'T_TEST2';
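
To see why the optimizer's default independence assumption misfires here, you can compare the number of distinct values in each column with the number of distinct combinations; the combined NDV is far smaller than the product of the individual NDVs:

select 
    count(distinct c1) ndv_c1, 
    count(distinct c2) ndv_c2, 
    count(distinct c1 || '~' || c2) ndv_c1_c2 
from t_test2;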

After creating the extended statistics, statistics need to be gathered again on the T_TEST2 table in order to inform the optimizer about the correlation.

begin
    dbms_stats.gather_table_stats(ownname => user, 
                                  tabname => 'T_TEST2', 
                                  method_opt => 'FOR ALL COLUMNS SIZE SKEWONLY');
end;
/

Besides, if you look at the USER_TAB_COL_STATISTICS view, you can see that a histogram was also created on the column group.

select 
    table_name, column_name, num_distinct, last_analyzed, histogram 
from USER_TAB_COL_STATISTICS where table_name = 'T_TEST2';

explain plan for
select count(*) from t_test2 where c1 = 'A' and c2 = 52;

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2872170939
 
------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     6 |     7   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |         |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_TEST2 |    47 |   282 |     7   (0)| 00:00:01 |
------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("C2"=52 AND "C1"='A')

select count(*) from t_test2 where c1 = 'D' and c2 = 350;

explain plan for
select count(*) from t_test2 where c1 = 'D' and c2 = 350;

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 2872170939
 
------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     6 |     7   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |         |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_TEST2 |    19 |   114 |     7   (0)| 00:00:01 |
------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("C2"=350 AND "C1"='D')

So the optimizer now uses the extended statistics, and the cardinality estimates are accurate.
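
As a side note, if you later decide that a column group is no longer needed, the extension can be dropped with DBMS_STATS as well:

begin
    dbms_stats.drop_extended_stats(ownname => user, 
                                   tabname => 'T_TEST2', 
                                   extension => '(C1, C2)');
end;
/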

4) Function Calls

As with correlated columns, extended statistics can also be used for function calls. In other words, if you need to apply a function to a column in the WHERE condition, you can create extended statistics on the expression to help the optimizer produce good-quality cardinality estimates.

drop table t_test3;

create table t_test3
(
    id number,
    c1 varchar2(1)
);

insert into t_test3
select level id, DBMS_RANDOM.STRING('A', 1) c1 from dual connect by level <= 100000;

commit;

select count(*) from t_test3 where lower(c1) = 'a';

explain plan for
select count(*) from t_test3 where lower(c1) = 'a';

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 3076426585
 
------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     2 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |         |     1 |     2 |            |          |
|*  2 |   TABLE ACCESS FULL| T_TEST3 |     1 |     2 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(LOWER("C1")='a')

begin
    dbms_stats.gather_table_stats(ownname => user, tabname => 'T_TEST3');
end;
/

explain plan for
select count(*) from t_test3 where lower(c1) = 'a';

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 3076426585
 
------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     2 |    71   (5)| 00:00:01 |
|   1 |  SORT AGGREGATE    |         |     1 |     2 |            |          |
|*  2 |   TABLE ACCESS FULL| T_TEST3 |  1000 |  2000 |    71   (5)| 00:00:01 |
------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(LOWER("C1")='a')

select 
  dbms_stats.create_extended_stats(ownname => user, 
                                   tabname => 'T_TEST3', 
                                   extension => '(LOWER(C1))') 
from dual;

The expression statistics can then be gathered via the DBMS_STATS package as follows (specifying the expression in METHOD_OPT would also create the extension if it did not already exist).

begin
    dbms_stats.gather_table_stats(user, 'T_TEST3', method_opt => 'FOR COLUMNS (LOWER(C1))');
end;
/

select * from user_stat_extensions where table_name = 'T_TEST3';

select 
    table_name, column_name, num_distinct, last_analyzed, histogram 
from USER_TAB_COL_STATISTICS where table_name = 'T_TEST3';

explain plan for
select count(*) from t_test3 where lower(c1) = 'a';

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 3076426585
 
------------------------------------------------------------------------------
| Id  | Operation          | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |         |     1 |     2 |    71   (5)| 00:00:01 |
|   1 |  SORT AGGREGATE    |         |     1 |     2 |            |          |
|*  2 |   TABLE ACCESS FULL| T_TEST3 |  3869 |  7738 |    71   (5)| 00:00:01 |
------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(LOWER("C1")='a')

So now we can see that the optimizer makes a much more accurate estimate.

Finally, I would like to mention that you can also create a function-based index on the expression used in the WHERE condition. However, if the index would never be used as an access path, you can consider creating extended statistics instead.
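
As a sketch, the function-based index alternative would look like this (the index name is illustrative). Creating the index adds a hidden virtual column for the expression, so statistics should be gathered again afterwards:

create index ix_t_test3_lower_c1 on t_test3 (lower(c1));

begin
    dbms_stats.gather_table_stats(ownname => user, tabname => 'T_TEST3');
end;
/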

5) Dynamic Sampling

Dynamic sampling is another way to inform the optimizer. It is useful when an SQL statement contains complex predicate expressions and the existing statistics are not sufficient. The feature was renamed and enhanced as Dynamic Statistics in 12c. At the beginning of this post, I set it to 0 so that it would not affect the optimizer. Dynamic sampling can be controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter or the DYNAMIC_SAMPLING hint, with available levels ranging from 0 to 11.

select name, value from v$parameter where name = 'optimizer_dynamic_sampling';

Let’s set it back to its default value 2 and create a new test table.

alter session set optimizer_dynamic_sampling = 2;

create table t_test4
(
    id number,
    num number
);

insert into t_test4(id, num)
select 
    level,
    case 
        when mod(level, 100000) = 0 then 1
        when mod(level, 10000) = 0 then 2
        when mod(level, 1000) = 0 then 3
        when mod(level, 100) = 0 then 4
        else mod(level, 5) + 5
    end
from dual connect by level <= 100000;

commit;

select num, count(*) from t_test4 group by num order by 1;

As can be seen from the test data, the NUM column is skewed.

explain plan for 
select * from t_test4 where num = 4;

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));
Plan hash value: 1810199436
 
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |   937 | 24362 |    70   (3)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T_TEST4 |   937 | 24362 |    70   (3)| 00:00:01 |
-----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("NUM"=4)
 
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

So, thanks to dynamic statistics, the cardinality estimate is very close to the actual value even though statistics have not been gathered. And as I said at the beginning of this post, a note saying that dynamic sampling was used is an indication that the statistics are missing or out of date.
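
For an individual statement, you can also raise the sampling level with the DYNAMIC_SAMPLING hint instead of changing the session parameter; a minimal sketch, where level 4 samples more blocks than the default level 2:

explain plan for 
select /*+ dynamic_sampling(t 4) */ * from t_test4 t where num = 4;

select plan_table_output from table(dbms_xplan.display('PLAN_TABLE', null, 'LAST'));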

In this post, I covered five different ways to correct a cardinality misestimate. I hope that by using these methods you can ensure that the optimizer makes accurate estimates.
