Tuesday, May 1, 2007

Cheetah spot by spot: Better update statistics

I think I will get many of you to agree with me if I say that the only thing we really know about statistics is that they're really important.

Everything else seems a bit unclear and remains more or less a mystery. This article won't clear your doubts about update statistics, since I'll focus only on Cheetah's new features.

If you want to learn more I advise you to check these resources:

I have to agree that something should be done about the documentation of the UPDATE STATISTICS command given its complexity and its importance for achieving maximum performance from the RDBMS.

The information gathered by the UPDATE STATISTICS is the fuel that drives the query optimizer. You can't expect good choices for query plans without having good and up to date statistics about your data.

I'll make a mental note for a future article about these issues. For now, and since this is an article in the Cheetah Spot by Spot series, let's just see what's new.

Most of you will certainly have some contact with your company or clients development teams, and if that's the case, you've probably found some situations where they are confident they've created the correct indexes, but nevertheless the query performance is poor.

Many times they simply don't know, or forgot, that they have to UPDATE STATISTICS before the optimizer picks up the index.

If you lived this situation you surely wondered why the engine doesn’t do this on index creation... Well that's a good question and it led the IBM Informix R&D team to change this behavior.

Statistical information is stored in systables, sysfragments, sysindices, sysdistrib and syscolumns. Cheetah brings improvements in the following situations:

  • More information about statistics (more columns in the above tables)
  • More automatic statistical data collected when manipulating Indexes and temporary tables
  • More options for the UPDATE STATISTICS MEDIUM to define the sampling size
  • More information when using the SET EXPLAIN instruction

Let's see these points in detail.


More information about statistics

Cheetah will have more columns for storing information about UPDATE STATISTICS. Namely;

  • systables.ustlowts : Time stamp (DATETIME YEAR TO FRACTION(5)) when the last UPDATE STATISTICS LOW was run for this table
  • sysdistrib.constr_time : Time stamp (DATETIME YEAR TO FRACTION(5)) when this distribution was collected
  • sysdistrib.smplsize : real sample size used for this distributions calculation
  • sysdistrib.rowssmpld : real number of rows sampled for this distribution
  • sysdistrib.ustnrows : number of rows in the table when the distribution was calculated

More automatic statistical data collected when manipulating Indexes and temporary tables

In Cheetah, the engine will make the equivalent of running UPDATE STATISTICS LOW when an index is created. It will also compute histograms for the first index column. The distributions will not be calculated (this does not affect the LOW MODE) if any of these verifies:

  • The header column is an UDT (either built-in or not)
  • It's a functional Index
  • It's a VII (Virtual Index Interface) index

The distributions calculated are equivalent to what UPDATE STATISTICS generates in MEDIUM mode, with a resolution of 1% for tables of fewer than a million rows, and 0.5% for larger tables.

This means that without having to do anything else, after the index creation, the optimizer will have information about it, so the index can and will be used (if it's good for the query plans)

  • The following improvements were made on temporary table statistics:
  • The number of rows and pages will be maintained automatically
  • The creation of indexes will generate distributions and statistics for the table and index

More options for the UPDATE STATISTICS MEDIUM to define the sampling size

Another new feature is a new clause for the UPDATE STATISTICS MEDIUM: SAMPLING SIZE . This clause can be used with the resolution and confidence clauses and allows the establishment of a minimum number of rows to be sampled. For more information about the sampling size in MEDIUM mode please consult the above reference for John Miller's article on UPDATE STATISTICS. Many people don't know that the sample size is pre-defined given the parameters of resolution and confidence. From now on, we can change this with this new SAMPLING SIZE option. The number given will have two meanings:

  • When it's less or equal to 1: It's a percentage of the table rows
  • When it's greater than 1: It's the real number of rows to be used for sampling
More information when using the SET EXPLAIN instruction

Although this is not specific for the UPDATE STATISTICS statement, the SET EXPLAIN statement was also improved. In Cheetah you can define the pathname for the SQEXPLAIN file. This is achieved by the statement

SET EXPLAIN FILE TO "/your/favorite/path/with/fs/permissions"

Another improvement refers to the already existing in v10, but undocumented statement

SET EXPLAIN STATISTICS

This will give you much more information about the engine behavior when solving your queries. You can turn off this feature with an ONCONFIG called EXPLAIN_STAT. If set to 0 it will disable the enhanced EXPLAIN ON. As more and more ONCONFIG parameters you can change it dynamically with

onmode -wm EXPLAIN_STAT=1

onmode -wf EXPLAIN_STAT=1

This has also implications in the onmode -Y command. You can now give 3 values:

  • onmode -Y 0 SESSION_ID
    disables EXPLAIN for session SESSION_ID
  • onmode -Y 1 SESSION_ID
    enables EXPLAIN with STATISTICS for session SESSION_ID
  • onmode -Y 2 SESSION_ID
    enables EXPLAIN without STATISTICS for session SESSION_ID

One final note: Although I've written above that the automatic statistics will be gathered for CREATE INDEX and temporary tables, in reality there are more instructions that will benefit from this:

CREATE INDEX ...

ALTER FRAGMENT FOR TABLE ... INIT

ALTER FRAGMENT FOR INDEX ... INIT

ALTER TABLE ADD UNIQUE CONSTRAINT

ALTER TABLE ADD PRIMARY KEY...

ALTER TABLE ADD FOREIGN KEY...

As for now, it's all. But this is an area that will probably have new features in the future.

I'd like to end with two references for utilities for handling update statistics.
Art S. Kagel, a distinguished participant in the Informix community has developed an extremely useful utility called dostats. It will probably handle everything you want to do related to Informix statistics and works with versions 5 (Online), 6 (does it exist?!), 7, 9, 10 and 11. He recently updated it to handle the SAMPLING SIZE feature of IDS 11.10.

You can get it on IIUG's software archive in a package called utils2_ak. It's written in ESQL/C, so you'll have to compile it for your platform using CSDK or c4gl (7.20+).

I have also made two SHELL scripts (dbs_updstats and tbl_updstats) that will handle statistics for database and single tables. These scripts are based on older scripts that were made publicly available and modified by the Informix technical team in Portugal (namely António Lima, I and possibly others). The current versions are very different but I based my work on their efforts so a great "thanks!" is due.

These scripts don't have all the functionalities of Art Kagel's dostats, but being simple SHELL scripts make them easy to pass along and maintain. The current versions have a functionality you won't find in the current dostats. I called it a "quick low mode" and it allows the LOW mode to collect only systables data. This will prevent the cleaning of indexes that is done by LOW mode. This cleaning can take very long. Be aware that although this mode will make it run almost instantly, its speed is achieved by not doing the hardest part of the work.

If you're interested you can find this scripts at my homepage.

No comments:

Post a Comment