Statistics collection recommendations

Cover Image:

http://www.flickr.com/photos/gertcha/6898156643/sizes/m/

Statistical information is vital for the optimizer when it builds query plans. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

This recently-updated compilation of statistics collection recommendations are intended for sites that are on any of the Teradata 14.0 software release levels. Some of these recommendations apply to releases earlier than Teradata 14.0, however some rely on new features available only in Teradata 14.0.

Contributors: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair, February 12, 2013

Collect Full Statistics

Non-indexed columns used in predicates
All NUSIs
USIs/UPIs if used in non-equality predicates (range constraints)
Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
PARTITION for all partitioned tables undergoing upward growth
Partitioning columns of a row-partitioned table

Can Rely on Random AMP Sampling

USIs or UPIs if only used with equality predicates
NUSIs with an even distribution of values
NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
See “Other Considerations” for additional points related to random AMP sampling

Option to use USING SAMPLE

Unique index columns
Nearly-unique[1] columns or indexes

Collect Multicolumn Statistics

Groups of columns that often appear together with equality predicates. These statistics will be used for single-tables estimates.
Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics will be in this situation.

New Considerations for Teradata 14.0

When multiple statistics on a table are collected for the first time, group all statistics with the same USING options into a single request.
After first time collections, collect all statistics being refreshed on a table into one statement, and if all are being refreshed, re-collect at the table level.
Do not rely on copied or transferred SUMMARY statistics or PARTITION statistics between tables within a system or across systems. Recollect them natively. This ensures that changes to the configuration and other internal details (such as how internal partitions are arranged) will be available to the optimizer.
Recollect SUMMARY statistics after data loads, it runs very quickly and provides up-to-the-date row counts to the optimizer.
In Teradata 14.0, PARTITION statistics are no longer required on nonpartitioned tables, as the new SUMMARY table replaces the function previously provided by PARTITION or primary index statistics in determining stale statistics.

Other Considerations

Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.
NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, then random AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:

EXPLAIN SELECT DISTINCT nupi-column FROM table;

For a partitioned table, it is recommended that you always collect statistics on:

PARTITION. This tells the optimizer how many row partitions are empty, a histogram of how many rows are in each row partition, and the compression ratio for column partitions. This statistic is used for optimizer costing.
Any partitioning columns. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.

For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):

(PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
(PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.

Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). All-AMP random AMP sampling has these particular advantages:

It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
Statistics extrapolation for any column in a table will not be attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP random AMP sampling is turned on. Because a random AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate random AMP sample row count is critical for determining if collected statistics are stale, or not.

For temporal tables, follow all collection recommendations made above. Currently, statistics are not supported on BEGIN and END period types. That capability is planned for a future release.

Contributors: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair, February 12, 2013

[1] Any column which is over 95% unique is considered as a nearly-unique column.

Ignore ancestor settings:

Tags:

database

statistics collection

Apply supersede status to children:

Statistics collection recommendations – Teradata 14.0

Collect Full Statistics

Can Rely on Random AMP Sampling

Option to use USING SAMPLE

Collect Multicolumn Statistics

New Considerations for Teradata 14.0

Other Considerations

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112