Channel: Teradata Developer Exchange - Database

Hybrid Row/Column-stores: A General and Flexible Approach

Short teaser: 
Why both row and columnar storage together?

Guest post by Dr. Daniel Abadi, March 6, 2015

During a recent meeting with a post-doc in my lab at Yale, he reminded me that this summer will mark the 10-year anniversary of the publication of C-Store in VLDB 2005. C-Store was by no means the first ever column-store database system (the column-store idea has been around since the 70s --- nearly as long as relational database systems), but it was quite possibly the first proposed architecture of a column-store designed for petabyte-scale data analysis. The C-Store paper has been extremely influential, with close to every major database vendor developing column-oriented extensions to their core database product in the past 10 years, with most of them citing C-Store (along with other influential systems) in their corresponding research white-papers about their column-oriented features.

Given my history with the C-Store project, I surprised a lot of people when some of my subsequent projects such as HadoopDB/Hadapt did not start with a column-oriented storage system from the beginning. For example, industry analyst Curt Monash repeatedly made fun of me on this topic (see, e.g. http://www.dbms2.com/2012/10/16/hadapt-version-2/).

In truth, my love and passion for column-stores has not diminished since 2005. I still believe that every analytical database system should have a column-oriented storage option. However, it is naïve to think that column-oriented storage is always the right solution. For some workloads --- especially those that scan most rows of a table but only a small subset of the columns --- column-stores are clearly preferable. On the other hand, there are many workloads that contain very selective predicates and require access to the entire tuple for those rows which pass the predicate. For such workloads, row-stores are clearly preferable.

 

 

There is thus general consensus in the database industry that a hybrid approach is needed. A database system should have both column-oriented and row-oriented storage options, and the optimal storage can be utilized depending on the expected workload.

Despite this consensus around the general idea of the need for a hybrid approach, there is a glaring lack of consensus about how to implement the hybrid approach. There have been many different proposals for how to build hybrid row/column-oriented database systems in the research and commercial literature. A sample of such proposals include:

(1)    A fractured mirrors approach where the same data is replicated twice --- once in a column-oriented storage layer and once in a row-oriented storage layer. For any particular query, data is extracted from the optimal storage layer for that query, and processed by the execution engine.

(2)    A column-oriented simulation within a row-store. Let’s say table X contains n columns. X gets replaced by n new tables, where each new table contains two columns --- (1) a row-identifier column and (2) the column values for one of the n columns in the original table. Queries are processed by joining together, on the fly, the particular set of these two-column tables that corresponds to the columns accessed by that query (a sketch follows this list).

(3)    A “PAX” approach where each page/block of data contains data for all columns of a table, but data is stored column-by-column within the page/block.

(4)    A column-oriented index approach where the base data is stored in a row-store, but column-oriented storage and execution can be achieved through the use of indexes.

(5)    A table-oriented hybrid approach where a database administrator (DBA) is given a choice to store each table row-by-row or column-by-column, and the DBA makes a decision based on how they expect the tables to be used.
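To make proposal (2) concrete, here is a rough sketch with hypothetical table and column names: a three-column table X(a, b, c) is replaced by three two-column tables keyed on a shared row identifier, and a query that touches only columns a and c joins just those two tables.

CREATE TABLE X_a (row_id INTEGER NOT NULL, a INTEGER);
CREATE TABLE X_b (row_id INTEGER NOT NULL, b DATE);
CREATE TABLE X_c (row_id INTEGER NOT NULL, c DECIMAL(10,2));

-- Reassemble only the columns a given query needs by joining on the row identifier
SELECT X_a.a, SUM(X_c.c)
FROM X_a JOIN X_c ON X_a.row_id = X_c.row_id
GROUP BY X_a.a;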

In the rest of this post, I will overview Teradata’s elegant hybrid row/column-store design and attempt to explain why I believe it is more flexible than the above mentioned approaches.

The flexibility of Teradata’s approach is characterized by three main contributions.

 

1: Teradata views the row-store vs. column-store debate as two extremes in a more general storage option space.

The row-store extreme stores each row continuously on storage and the column-store extreme stores each column continuously on storage. In other words, row-stores maintain locality of horizontal access to a table, and column-stores maintain locality of vertical access to a table. In general, however, the optimal access-locality could be over a rectangular region of a table.

 

 

Figure 1:  Row and Column Stores (uncompressed)

 

To understand this idea, take the following example. Many workloads have frequent predicates on date attributes. By partitioning the rows of a table according to date (e.g. one partition per day, week, month, quarter, or year), those queries that contain predicates on date can be accelerated by eliminating all partitions corresponding to dates outside the range of the query, thereby efficiently utilizing I/O by reading data from only those partitions that contain data matching the requested date range. However, different queries may analyze different table attributes for a given date range. For example, one query may examine the total revenue brought in per store in the last quarter, while another query may examine the most popular pairs of widgets bought together in each product category in the last quarter. The optimal storage layout for such queries would be to have store and revenue columns stored together in the same partition, and to have product and product category columns stored together in the same partition. Therefore we want both column-partitions (store and revenue in one partition and product and product category in a different partition) and row-partitions (by date).
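As an illustrative sketch only (hypothetical table and column names, simplified syntax --- the exact column-grouping and RANGE_N forms are in the Teradata DDL documentation), such a layout might be declared roughly as follows, with two column-partition groups plus row partitions by month:

CREATE TABLE sales
  ( store_id          INTEGER,
    revenue           DECIMAL(12,2),
    product_id        INTEGER,
    product_category  VARCHAR(30),
    sale_date         DATE )
NO PRIMARY INDEX
PARTITION BY (
  COLUMN (COLUMN(store_id, revenue), COLUMN(product_id, product_category)),  -- column partitions
  RANGE_N(sale_date BETWEEN DATE '2015-01-01' AND DATE '2015-12-31'
          EACH INTERVAL '1' MONTH) );                                        -- row partitions by date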

 

This arbitrary partitioning of a table by both rows and columns results in a set of rectangular partitions, each partition containing a subset of rows and columns from the original table.  This is far more flexible than a “pure” column-store that enforces that each column be stored in a different physical or virtual partition.

 

Note that allowing arbitrary rectangular partitioning of a table is a more general approach than a pure column-store or a pure row-store. A column-store is simply a special type of rectangular partitioning where each partition is a long, narrow rectangle around a single column of data. Row-oriented storage can also be simulated with special types of rectangles.  Therefore, by supporting arbitrary rectangular partitioning, Teradata is able to support “pure” column-oriented storage, “pure” row-oriented storage, and many other types of storage between these two extremes.

 

2: Teradata can physically store each rectangular partition in “row-format” or “column-format”.

One oft-cited advantage of column-stores is that for columns containing fixed-width values, the entire column can be represented as a single array of values. The row id for any particular element in the array can be determined directly by the index of the element within the array. Accessing a column in an array-format can lead to significant performance benefits, including reducing I/O and leveraging the SIMD instruction set on modern CPUs, since expression or predicate evaluation can occur in parallel on multiple array elements at once.

 

Another oft-cited advantage of column-stores (especially within my own research --- see e.g. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf ) is that column-stores compress data much better than row-stores because there is more self-similarity (lower entropy) of data within a column than across columns, since each value within a column is drawn from the same attribute domain. Furthermore, it is not uncommon to see the same value repeat multiple times consecutively within a column, in which case the column can be compressed using run-length encoding --- a particularly useful type of compression since it can result in high compression ratios and is also trivial to operate on directly, without requiring decompression of the data.

 

Both of these advantages of column-stores are supported in Teradata when the column-format is used for storage within a partition. In particular, multiple values of a column (or a small group of columns) are stored continuously in an array within a Teradata data structure called a “container”. Each container comes with a header indicating the row identifier of the first value within the container, and the row identifiers of every other value in the container can be deduced by adding their relative position within the container to the row identifier of the first value. Each container is automatically compressed using the optimal column-oriented compression format for that data, including run-length encoding the data when possible.
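For example, if a container’s header records a first-value row identifier of 1,000, the value stored at relative position 36 within that container is deduced to belong to row identifier 1,036, without any per-value row identifier being stored.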

 

 

Figure 2:  Column-format storage using containers.

 

However, one disadvantage of not physically storing the row identifier next to each value is that extraction of a value given a row identifier requires more work, since additional calculations must be performed to extract the correct value from the container. In some cases, these additional calculations involve just positional offsetting; however, in some cases, the compressed bits of the container have to be scanned in order to extract the correct value. Therefore Teradata also supports traditional row-format storage within each partition, where the row identifier is explicitly stored alongside any column values associated with that row. When partitions are stored using this “row format”, Teradata’s traditional mechanisms for quickly finding a row given a row identifier can be leveraged.

 

In general, when the rectangular partitioning scheme results in wide rectangles, row format storage is recommended, since the overhead of storing the row id with each row is amortized across the breadth of the row, and the benefits of array-oriented iteration through the data are minimal. But when the partitioning scheme results in narrow rectangles, column-format storage is recommended, in order to get the most out of column-oriented array iteration and compression. Either way --- having a choice between row format and column format for each partition further improves the flexibility of Teradata’s row/columnar hybrid scheme.

 

3: Teradata enables traditional primary indexing for quick row-access even when column-oriented partitioning is used.

 

Many column-stores do not support primary indexes due to the complexity involved in moving around records as a result of new inserts into the index. In contrast, Teradata Database 15.10 supports two types of primary indexing when a table has been partitioned to AMPs (logical servers) by the hash value of the primary index attribute. The first, called CPPI, maintains all row and column partitions on an AMP sorted by the hash value of the primary index attribute. These hash values are stored within the row identifier for the record, which enables each column partition to independently maintain the same sort order without the partitions explicitly communicating with each other. Since the data is sorted by the hash of the primary index attribute, finding particular records for a given value of the primary index attribute is extremely fast. The second, called CPPA, does not sort by the hash of the primary index attribute. The AMP that contains a particular record can still be quickly identified given a value of the primary index attribute; however, further searching is necessary within the AMP to find the particular record. This searching is limited to the non-eliminated, nonempty column and row partitions. Finding a particular record given a row id is extremely fast for both CPPI and CPPA since, in either case, the records are in row id order.
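As a rough, hedged sketch of what the two variants look like in DDL (column lists shortened, and exact options vary by release --- consult the Teradata Database 15.10 documentation for full syntax), column partitioning is simply paired with either a primary index or a primary AMP index:

-- Column-partitioned table with a primary index (CPPI):
CREATE TABLE sales_cppi
  ( store_id INTEGER, revenue DECIMAL(12,2), sale_date DATE )
PRIMARY INDEX (store_id)
PARTITION BY COLUMN;

-- Column-partitioned table with a primary AMP index (CPPA):
CREATE TABLE sales_cppa
  ( store_id INTEGER, revenue DECIMAL(12,2), sale_date DATE )
PRIMARY AMP INDEX (store_id)
PARTITION BY COLUMN;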

 

Combined, these three features make Teradata’s hybrid solution to the row-store vs. column-store tradeoff extremely general and flexible. In fact, it’s possible to argue that there does not exist a more flexible hybrid solution from a major vendor on the market. Teradata has also developed significant flexibility inside its execution engine --- adapting to column-format vs. row-format input automatically, and using optimal query execution methods depending on the format-type that a particular query reads from.   

====================================================================================

 

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 


Teradata Express for VMware Player


Download Teradata Express for VMware, a free, fully functional Teradata database that can be up and running on your system in minutes. For TDE 14.0, please read the launch announcement and the user guide. For previous versions, read the general introduction to the new Teradata Express family, or learn how to install and configure Teradata Express for VMware.

There are multiple versions of Teradata Express for VMware available: for Teradata 13.0, 13.10, and 14.0. More information on each package is available on our main Teradata Express page.

Note that in order to run this VM, you'll need to install VMware Player or VMware Server on your system. Also, please note that your system must have 64-bit support. For more details, see how to install and configure Teradata Express for VMware.

For feedback, discussion, and community support, please visit the Cloud Computing forum.

Download sizes, space requirements and MD5 checksums

Package | Version | Initial Disk Space | Download size (bytes) | MD5 checksum
Teradata Express 14.0 for VMware (4 GB) | 14.00.00.01 | | 3,105,165,305 | F8EFE3BBE29F3A3504B19709F791E17A
Teradata Express 14.0 for VMware (40 GB) | 14.00.00.01 | | 3,236,758,640 | B6C81AA693F8C3FB85CC6781A7487731
Teradata Express 14.0 for VMware (1 TB) | 14.00.00.01 | | 3,484,921,082 | 2D335814C61457E0A27763F187842612
Teradata Express 13.10 for VMware (1 TB) | 13.10.00.10 | 15 GB | 3,002,848,127 | 04e6cb9742f00fe8df34b56733ade533
Teradata Express 13.10 for VMware (40 GB) | 13.10.00.10 | 10 GB | 2,943,708,647 | ab1409d8511b55448af4271271cc9c46
Teradata Express 13.0 for VMware (1 TB) | 13.00.00.19 | 64 GB | 3,072,446,375 | 91665dd69f43cf479558a606accbc4fb
Teradata Express 13.0 for VMware (40 GB) | 13.00.00.19 | 10 GB | 2,018,812,070 | 5cee084224343a01de4ae3879ada9237
Teradata Express 13.0 for VMware (40 GB, Japanese) | 13.00.00.19 | 10 GB | 2,051,920,372 | 8e024743aeed8e5de2ade0c0cd16fda9
Teradata Express 13.0 for VMware (4 GB) | 13.00.00.19 | 10 GB | 2,002,207,401 | 5a4d8754685e80738e90730a0134db9c
Teradata Tools and Utilities 13.10 Windows Client Install Package | 13.10.00.00 | | 409,823,322 | 8e2d5b7aaf5ecc43275e9679ad9598b1

 


Teradata Workload Management Release 15.0/15.10 for SLES 11 Technical Overview

Course Number: 
54856
Training Format: 
Recorded webcast

This webcast provides an overview of Teradata Workload Management, which consists of Teradata Integrated Workload Management and Teradata Active System Management (TASM).

The webcast also highlights the new features available for versions 15.0 and 15.10 on all platforms as well as features available only with TASM.

Version: Teradata 15.0, 15.10 and above

Presenter: Youko Watari - Teradata Corporation

Audience: 
Data Warehouse Administrator, Data Warehouse Technical Specialist, Data Warehouse Project/Program Mgmt
Price: 
$195
Credit Hours: 
1

Data Modeling on NoSQL

Course Number: 
54671
Training Format: 
Recorded webcast

NoSQL is being pulled closer and closer to the core of applications, not just as a reporting and data mining tool.

More and more applications are leveraging the power of NoSQL as a primary means of data storage. This session covers how to successfully model application data on NoSQL storage engines for everyday application use. We explore common design patterns, techniques and tips that help developers leverage the horizontal scalability of NoSQL stores while embracing their inherent limitations.

Topics include: 

  • Denormalization 
  • Intelligent Keys (including avoiding hot-spotting)
  • Counters
  • Data Sharding

These topics are not specific to a particular NoSQL solution, but are broadly applicable to most popular non-relational data stores in use today.

Note: This was a 2015 Teradata Partners Conference session

Presenter: Bryce Cottam - Teradata Corporation

Price: 
$195
Credit Hours: 
2

Simplify User Management using Roles and Profiles

Course Number: 
33452
Training Format: 
Recorded webcast

This session defines what roles and profiles are, why they are important, how they are implemented, and how to use them.

The advantages of using roles, nested roles, and profiles to help administer a Teradata system are described. Examples of access rights issues and how they can be resolved using roles are discussed. The GRANT and REVOKE command formats are shown, along with examples using Teradata Studio. The impact of roles and profiles on users is also explored via examples. This session also covers the newest enhancements (such as Query_Band) associated with roles and profiles in Teradata 15.10. The system views that provide role and profile information are discussed, with sample data drawn from the examples used in the session.
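As a brief, hedged illustration of the kinds of statements the session covers (the database, role, profile, and user names below are hypothetical, and only a subset of the CREATE PROFILE options is shown):

CREATE ROLE sales_read;
GRANT SELECT ON sales_db TO sales_read;        -- the access right is recorded once, for the role
GRANT sales_read TO jsmith;                    -- many users can then share that single right

CREATE PROFILE analyst_prof AS SPOOL = 50e9;   -- one command changes SPOOL for every user assigned the profile
MODIFY USER jsmith AS PROFILE = analyst_prof;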

Key Points:

  • Learn how to implement roles to reduce the number of rows in the DBC.AccessRights table and simplify user access rights management.
  • Learn how to use profiles in order to change a system attribute (e.g., SPOOL, QUERY_BAND) for many users with a single command.
  • Learn about the system views to display roles, profiles, and users associated with them.

This was a 2005 Teradata Partners Conference session, updated April 2016.

Presenter: Larry Carter, Senior Consultant - Teradata Corporation

Price: 
$195
Audience: 
DW Architects, DW Technical Specialists, DBAs, System Admins/Operators
Credit Hours: 
1

Six Habitual Architecture Mistakes and How to Avoid Them

Course Number: 
54690
Training Format: 
Recorded webcast

Is your architecture characterized by excessive costs, supportability issues and business dissatisfaction?

This session examines 6 habitual architecture mistakes observed on numerous client projects …

  1. employing a technology-driven approach
  2. allowing the architecture to accidentally evolve
  3. ignoring organizational constraints
  4. deviating from fundamental principles
  5. reinventing the wheel for common design problems,
  6. straying from engineering discipline. 

The presenter explores four architecture components that are essential for avoiding the mistakes:

  1. a structured architecture framework
  2. architecture principles & advocated positions
  3. design patterns & implementation alternatives
  4. reference architectures.

The session concludes with proven recommendations for maturing architecture capabilities.  You will leave the presentation not only better educated on architecture, but armed with ideas for architecting high-quality solutions in your own environment.

Note: This was a 2015 Teradata Partners Conference session.

Presenter: Eddie Sayer - Teradata Corporation

Price: 
$195
Credit Hours: 
1

The Instant Gratification of Real Time

Course Number: 
55028
Training Format: 
Recorded webcast
The latest buzz in analytics is “Real Time”, and users’ expectations are now set by their phone apps.
Unfortunately, the term means something different to each person who uses it. Because it is used in so many different contexts, it is very difficult to segregate the different use cases and apply the right tools to meet the organization’s requirements.

Are your users demanding that everything be delivered in Real Time? Are dozens of vendors knocking down your door to sell you their shiny new toys? Are Streams something that are not to be crossed? Are expectations for access to data far outstripping current processes and technologies in your organization? If any or all of these are true for you, then come to this session to pull back the curtain on the hype.
 
Presenter: Todd Walter, Teradata Corporation
 
Note: This was a 2015 Teradata Partners Conference Presentation. 
Price: 
$195
Credit Hours: 
1

Why not Columnar with a Primary Index? What's a PA?

Course Number: 
55241
Training Format: 
Recorded webcast
Columnar with a primary index (PI) is supported in Teradata Database 15.10. But what are the caveats in using it?
This presentation discusses this new capability and the pitfalls to avoid.  In addition to columnar with a primary index or no primary index, columnar is supported with the new primary AMP index. This presentation discusses how a primary AMP index avoids some of the pitfalls of using columnar with a primary index while improving join and aggregate performance compared to having no primary index.
 
Presenter: Paul Sinclair, Software Engineer, Teradata Corporation
 
Price: 
$195
Audience: 
Application Developer, Data Analyst, Database Administrator, Design/Architect
Credit Hours: 
1

Planning Extended Teradata System outages with Unity 15.10

Short teaser: 
Using Unity 15.10 to avoid days of planned Teradata system outages.
Attachment: trackUnitySpace.zip (1.25 KB)

Benjamin Franklin once observed "...in this world nothing can be said to be certain, except death and taxes". Planned down time can easily be added to that list. It's a certainty that any given IT system will require some planned down time to address security fixes, make configuration changes, or expand storage and compute capacity. Smart businesses always stay ahead of these demands, and promptly patch or upgrade their systems, but that often comes with a cost in down time. Paradoxically, as business systems become more critical, this cost increases and becomes harder to justify, leading to pressure not to perform necessary updates, thus putting those systems at risk. Fortunately, Teradata has solved this paradox with Unity.

Using Unity 15.10, it is possible to take planned Teradata system outages without business impact. One of Unity’s core benefits is the ability to sustain long planned outages without taking applications off-line. Since Unity virtualizes access across multiple Teradata systems, application workloads can continue to function when one of the Teradata systems is taken off-line by shifting them to a second synchronized Teradata system. This works for a variety of situations, from short outages for simple maintenance work or restarts to longer outages for Teradata system expansions, upgrades, or even full replacements. One Teradata customer has totaled over 3.5 days of avoided outage in one quarter while performing upgrades and expansions of their Teradata system, all without any business down time. This tangible business value provides an obvious return on their business investment.

Unity provides two elements that make this possible. First, it tracks a state for each Teradata system, which can be easily changed to control access to that system. The system state is employed for both passive routing and managed routing. Using the HALT operation, you can close down and quiesce all sessions on a Teradata system. If a user’s routing rule allows them to fail over to a second system, they will do so when the system becomes OUT OF SERVICE. Second, Unity’s recovery of managed sessions is a powerful mechanism that allows Teradata systems to be re-synced automatically.

Unity 15.10 can manage outages on three levels – the entire system, a database, or an individual table. All three are useful in different situations. Most commonly, table or database outages are used to orchestrate daily activity for a specific application or business process. System level outages are commonly done during a Major Database upgrade or to expand or even replace the underlying managed Teradata system. Since these activities normally take 2 to 4 days, they would be very disruptive without Unity to provide high availability.

Build experience before attempting an extended outage

The ability to sustain extended planned outages is one of the most compelling benefits Unity provides, so it is one of the driving factors for adding Unity to a multisystem environment. This does not mean that environments that are new to Unity should immediately attempt a multi-day extended outage as soon as they have Unity in production. Smart shops implement high availability systems carefully and methodically. It is essential that before attempting a multi-day outage, the entire multisystem environment is mature and stable. Operations staff needs time, over a period of months, to build up experience with Unity and the environment before being pushed into a major exercise. Beyond operating and monitoring Unity, a typical system outage will normally involve a complex series of steps across all of the Teradata ecosystem products, so it’s important that they can operate Unity with full confidence and familiarity. As a best practice, organizations should have at least 6-12 months of operational experience before attempting an extended outage. To help build experience, Unity can first be used for shorter outages of a few hours.

Ecosystem planning, Backups and External access

Going into an extended outage, it’s important that a complete schedule is developed that includes all of the components of the ecosystem, such as Data Mover, Ecosystem Manager, backups, Viewpoint, etc. In particular, it’s important that backups are not executed on the Out-Of-Service Teradata system during the outage period or while it is recovering. Attempting to run a backup will cause interrupts to the recovery process and also produce a backup that is inconsistent with the active state of the system.

Allowed Time for a Planned Outage

This section primarily deals with environments that use Unity’s managed routing to perform data synchronization across two or more Teradata systems. Unity also provides passive routing for reporting and sandbox workloads --- but since passive routing doesn’t perform any data synchronization, there is no limit imposed by it on how long a system can be out-of-service, and generally no consideration or planning required for it. For managed sessions, the amount of time a Teradata system can be out-of-service for a planned outage is limited by the volume of data and SQL requests that are executed in managed sessions on the other, still active, Teradata systems.

Measuring the time allowed for a planned outage

In planning for an extended outage that will cover multiple days, it is essential to measure and understand the daily profile of load volumes. It’s tempting to compile data at a finer granularity, but finer detail has little practical value when planning for a multi-day outage. Here’s an example profile taken from one Teradata customer:

There is no visible metric displayed on the Unity viewpoint portlet that tracks daily recovery log or data space use. To collect this information, there is a sample script attached (trackUnitySpace.zip) that can be run on the second Unity server (the one with the standby sequencer) to capture this space usage. These metrics will be added as an enhancement to Unity in a future release.

While it’s a common and natural assumption that it is best to schedule an outage for the Teradata system on the weekend, the above profile clearly shows the weekend actually has a heavier workload than during the week days. If a two or three day outage is required, Tuesday or Wednesday might actually be the best time of the week to start it.

Having this profile is critical for planning, because it allows you to determine how much of the recovery and data space will be consumed during the outage. In this example, if the outage started on Friday, and lasted until Sunday, we would expect roughly 54.7 GB of recovery log space and 3.5 TB of data space to be used.

Maximizing the time allowed for a planned outage

The space available in the recovery log is much smaller (typically 100 to 200 GB) than the space available in the /data file system, which starts at 7 TB and can be expanded by adding Unity expansion servers. In order to maximize the recovery window available for outages, it is important to follow normal database best practices for loading data. Large volumes of data should be loaded via bulk load protocols like Fastload, Multiload, TPT Load or TPT Update, etc. These protocols are designed for much higher data volumes than normal SQL. Using a bulk load protocol for these large loads is a best practice for Teradata that becomes even more important when used with Unity.

ETL developers will sometimes break this best practice (accidentally or out of laziness) when there is an existing load job that normally performs a trickle feed of daily data into the data warehouse. If they reuse that job, without modification, for a one-time, very large load of historical data rather than making it a bulk load job, it can cause an unusually high amount of the recovery log to be consumed and put the recoverability of the Teradata systems at risk. This is because it can fill the Unity recovery log with data that should rightfully be stored in the much larger /data file system, which can drastically reduce the time for which an outage on a Teradata system can be sustained.

Safeguarding against Rogue Load Jobs

In order to protect against load jobs consuming too much of the recovery log (when they should instead use the bulk load space), Unity has several protection mechanisms that should be used. Note that these settings should be tuned based on the size of the recovery log, required recovery window duration, and volume of workloads going through Unity. The commonly recommended setting values are provided only as a rough guide.

Unity has two alarm thresholds that can raise an alert if too much of the recovery log is being used overall or by an individual process. To warn if an individual session is consuming too much of the recovery log, set these settings:

 

Name | Description | Commonly recommended setting
RecoveryLogGrowthSessionAlertRate | Recovery log (bytes) consumed in the last 60 minutes | 5% of recovery log size
RecoveryLogGrowthSessionAlertThreshold | Recovery log (bytes) consumed over the lifetime of the session | 2.5% of recovery log size

To warn if all sessions are using too much of the recovery log, use:

Name | Description | Commonly recommended setting
RecoveryLogGrowthAlertRate | Overall recovery log (bytes) consumed by all sessions in the last 60 minutes | 10% of recovery log size

 

Unity also has two limits that can be used to automatically kill a session that consumes too much of the recovery log.

Name | Description | Commonly recommended setting
RecoveryLogGrowthSessionKillRate | Recovery log (bytes) consumed in the last 60 minutes | 10% of recovery log size
RecoveryLogGrowthSessionKillThreshold | Total recovery log (bytes) consumed over the lifetime of the session | 5% of recovery log size

Starting an extended outage

A planned system outage is started by performing a HALT on a Unity-managed Teradata system. This can be done via viewpoint or the unityadmin command line. The HALT operation will wait for a period of time (controlled by the config setting HaltTimeout) for in-flight transactions to finish on the Teradata system. During this time, new transactions are paused while the operation waits for current transactions to complete. If the timeout passes and the in-flight transactions have not completed, then the HALT operation will fail. In this situation, the DBA can decide to either retry the HALT and wait, or elect to manually abort the in-flight transactions in Unity. To make this decision, it helps to have a sense of how long the transactions involved normally take. Data from the Teradata DBQL tables can be used to find this answer.
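For example, here is a hedged sketch of the kind of DBQL query that can help (the user name is hypothetical, and it assumes default query logging into DBC.DBQLogTbl); comparing StartTime with FirstRespTime gives a feel for how long these requests normally run:

SELECT QueryID, UserName, StartTime, FirstRespTime
FROM DBC.DBQLogTbl
WHERE UserName = 'ETL_USER'
  AND StartTime >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
ORDER BY StartTime DESC;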

 Increasing the HALT timeout will make it more likely the operation will succeed, but will also increase the time that new transactions are paused.

Safeguarding against accidental recovery

During an extended outage, there is a possibility that a network drop or other unexpected events might trigger the system’s dispatcher processes to disconnect and reconnect to the active Unity sequencer. When this happens, the sequencer will automatically begin recovery of the system, and attempt to put it back in service before the right time.

This problem is easy to avoid if you shut down the dispatcher processes for the Teradata system once the system has been made OUT OF SERVICE. Each Teradata system has two dispatcher processes associated with it (an active and a standby). To shut down the dispatchers, you log in (as root or using sudo) to the unity servers and do:

1. Check which unity server is running the standby dispatcher for the system that is OUT OF SERVICE:

unityadmin> status;
Sequencer: region1_seq-unity1(active), Repository: unity1(active)
Sequencer: region2_seq-unity2(standby-synchronized), Repository: unity2(standby - synced)

Endpoint region1_ept: any(not listening):1025
Endpoint region2_ept: any(not listening):1025

System db1(unrecoverable) Tables: OOS 0, standby 0, unrecoverable 4, interrupted 0, restore 0, read-only 0, active 0
Dispatcher status: region1_dsp_db1(active) - up, region2_dsp_db1(standby) - up
Gateway status: db1 - db1cop1(up)

System db2(unrecoverable) Tables: OOS 0, standby 0, unrecoverable 4, interrupted 0, restore 0, read-only 0, active 0
Dispatcher status: region1_dsp_db2(standby) - up, region2_dsp_db2(active) - up
Gateway status: db2 - db2cop1(up)

2. Log in to that unity server with the standby dispatcher and shut it down:

# unity stop [standby dispatcher process name]

3. Now log in to the unity server where the dispatcher is ACTIVE and shut it down.

# unity stop [active dispatcher process name]

STOP! Make sure you ONLY shut down the specific dispatcher process on each unity server. Do not shut down any other processes.

When the outage is complete, and you are ready to bring the system back to the ACTIVE state, you will first start the active dispatcher, and then the standby dispatcher. Recovery of the system will start automatically as soon as the active dispatcher connects to the active sequencer.

Monitoring the time remaining during an outage

While a Teradata system is in the disconnected, interrupted or unrecoverable state, Unity will keep track of how far behind it is, and the time left to return the system to the active state before it can no longer be recovered and will fall to the unrecoverable state.

You can monitor how far behind a system is through the Unity viewpoint portlet:

 

…Or in the unityadmin command line, using the system status command:

---------------------------------------------
System ID              : 2
System TDP ID          : db2
Region ID              : 2
Region Name            : prod2
State                  : Disconnected

ETA till unrecoverable    : > 2 day(s)
  (with maximum workload) : 12 hour(s), 14 minute(s)
Log space remaining       : 91%

Time to Recover a System

When a planned or unplanned outage is complete, a DBA can initiate a full system recovery to bring the system back to the active state by replaying all the requests sent in managed sessions. How long this takes depends on a variety of complex factors, including the size of the Teradata system and the concurrency of the workloads. In a typical environment that includes a mixture of applications doing reads and writes, the recovery of a system should typically take less than the total time the system was down. This is because during recovery the system only needs to replay the write requests that it missed.

In a workload that is very write heavy, with little or no read traffic, the time taken for a system to recover might be longer, and may equal the time of the outage window. This is especially true if the client workload continues to drive the active systems at their full throughput while the recovering system is trying to catch up. While the system is recovering, Unity’s recovery log and data file system provide the recovering system extra capacity to sync up, since they are still storing incoming write requests at the top of the recovery queue as the recovering system is replaying them from the bottom of the queue.

You can monitor the progress of recovery on the Unity viewpoint portlet:

… or on the unityadmin command line, using the system status command. Note the ETA provided in seconds.

---------------------------------------------
System ID              : 2
System TDP ID          : td2
Region ID              : 2
Region Name            : region1
State                  : Restore
System DBS Release     : 14.10.07.01
System DBS Version     : 14.10.07.01

ETA till active           : 7342
Percentage of Log replayed: 19

unityadmin>

 

As recovery progresses, you should see the number of tables left in the RESTORE state drop. It is normal to see recovery progress in spurts; long-running load jobs may make it appear that nothing is happening for long periods of time. However, over the course of hours, you should see the number of tables in the ACTIVE state grow as the number in the RESTORE state drops.

Occasionally, if there are any issues that cause timeouts (again, because of long-running load jobs), you may see the entire system become interrupted. This is a normal part of recovery. After a period of time (controlled by the configuration setting RecoveryInterval), recovery will restart from where it left off, and you will see the number of interrupted tables quickly drop back to their previous levels.

Monitoring system recovery for issues following a long outage

Following an extended outage, it’s important to monitor the progress of the system recovery and respond to any alerts or interrupted sessions. This is because the IDs assigned to client sessions are reused by Unity, and it is these session IDs that are used to sequence requests during recovery. If an issue causes a request on a session ID to fail during recovery with an interrupt, it will block any later sessions in the recovery that re-use the same session ID. Consequently, it’s important to ensure that any issues that appear in the recovery process are addressed in a timely manner.

This could happen for any number of reasons – for example, users mistakenly accessing one Teradata system directly (not through Unity), DDL done as part of the system maintenance that was missed or any other human or process error.  Here are some of the most common issues:

  • Database space issues
  • Missing database grants or users
  • Locked tables because of direct access to the Teradata system

These are such common conditions that DBAs should anticipate them before a long recovery and have a plan prepared to address them should they occur.

The best place to monitor issues that appear during recovery is on the interrupted session screen, or using the unityadmin command ‘session list interrupted’. The interrupted session screen divides sessions that are interrupted into ‘Root causes’ and ‘Secondary causes’. It is normal for sessions to occasionally become interrupted if they are waiting on other sessions to finish their recovery first. These sessions appear as ‘Secondary causes’ and no actions need to be taken to address them.

Skipping Requests

Most Teradata environments have 10 thousand to 100 thousand tables, so if there are 5 or 10 that cannot be successfully recovered automatically, it should not be a major concern.  If for some reason a request repeatedly fails recovery, you can elect to SKIP the request. If there are tables involved, you should use the option to make the tables unrecoverable. It’s important to stay focused on recovering the entire system, rather than being concerned about a small number of tables that are unrecoverable.

Skipping tables

Alternatively, you can mark the tables unrecoverable to have them skipped by the recovery process. Following a prolonged outage, if you know there are data sync issues on specific tables introduced by changes directly on the system, it’s preferable to deactivate those tables pre-emptively to have them skipped in the recovery processes, rather than have them cause interrupts during the recovery process.

Completing the outage

Once the outage is finished, you can list any tables that failed the recovery process using the unityadmin command:

unityadmin> object list unrecoverable;

  State of ds.customer(2064) on system 2 (td2): unrecoverable

  State of ds2.customerLog(2065) on system 2 (td2): unrecoverable

  State of ds2.customerLogDetail(5001) on system 2 (td2): unrecoverable

As a final step, you can use Ecosystem Manager to validate that the tables are, in fact, out-of-sync and then use Ecosystem Manager workflows to run Data Mover jobs to resync them as time permits. There is no great hurry, since business applications are still online and operating on the remaining active copies of the tables on the other Teradata systems.

Conclusion

Maintenance is a fact of life, but it doesn’t have to impact your business users. Unity provides an optimal way to accommodate planned outages without the cost to SLAs. This value will be obvious to both CIOs and CFOs.


Oracle to Teradata 101

Course Number: 
22381
Training Format: 
Recorded webcast
If you've worked with Oracle for years and now find yourself the DBA for a Teradata Data Warehouse, you'll want to get productive as soon as possible.
You already have a good understanding of relational databases and SQL. You just need to learn the ins and outs of this new database. This session reviews the Teradata Database in Oracle terms. Basic architectural differences and the similarities between the two databases are discussed. Database setup and database structures are also covered. Armed with this comparison, you'll have the introductory information you'll need to get started with Teradata.
 

Key Points:

  • Understanding Oracle will give you a head start in learning Teradata.
  • There are some differences between the two databases that you will need to learn.
  • You will be able to use your Oracle knowledge to understand Teradata.
 
Updated and re-recorded in June 2015.
 
Presenter: Dawn McCormick, Senior Database Consultant - Teradata Corporation
Price: 
$195
Credit Hours: 
2

Teradata’s Path to the Cloud - Secure Zones

Short teaser: 
Multi-tenant secure zones enable Cloud paradigms

Database users are becoming increasingly comfortable with storing their data in database systems located in public or private clouds. For example, Amazon RDS (relational database service) is used by approximately half of all customers of Amazon’s AWS cloud. Given that AWS has over a million active customers, this implies that there are over half a million users that are willing to store their data in Amazon’s public cloud database services.

A key feature of cloud computing --- a feature that enables efficient resource utilization and reduced overall costs --- is multi-tenancy. Many different users, potentially from different organizations, share the same physical resources in the cloud. Since many database-backed applications are not able to fully utilize the CPU, memory, and I/O resources of a single server machine 24 hours a day, database users can leverage the multi-tenant nature of the cloud in order to reduce costs.

The most straightforward way to implement database multi-tenancy in the cloud is to acquire a virtual machine in the cloud (e.g. via Amazon EC2), install the database system on the virtual machine, and load data into it and access it as one would any other database system. As an optimization, many cloud providers offer specialized virtual machines with the database preinstalled and preconfigured in order to accelerate the process of setting up the database and making it ready to use. Amazon RDS is one example of a specialized virtual machine of this kind.

The “database system on a virtual machine” approach is a clean, elegant, and general way to implement multi-tenancy. Multiple databases running on different virtual machines can be mapped to the same physical machine, with negligible concern for any security problems arising from the resulting multi-tenancy. This is because the hypervisor effectively shields each virtual machine from being able to access data from other virtual machines located on the same physical machine.

For general cloud environments such as Amazon AWS, this general approach of achieving database multi-tenancy is a good solution, since it dovetails with the EC2 philosophy of giving each user his own virtual machine. However, for specific database-as-a-service / database-in-the-cloud solutions, supporting multi-tenancy via installing multiple database instances in separate virtual machines is inefficient for several reasons. First, storage, memory, and cache space must be consumed for each virtual machine. Second, the same database software must be installed on each virtual machine, thereby duplicating the storage, memory and cache space needed for the redundant instances of the same database software. These redundant copies also reduce instruction cache locality: even when different instances access the same parts of the database system codebase, each instance has its own separate copy of that code, so an instance running on one virtual machine cannot benefit from the fact that the same code is already in the instruction cache of another.

To summarize the above points:

(1)    Allowing multiple database users to share the same physical hardware (“multi-tenancy”) helps optimize resource utilization in the cloud, and therefore reduce costs.

(2)    Secure multi-tenancy can be easily achieved by giving each user a separate virtual machine and mapping multiple virtual machines to the same physical machine.

(3)    When the different virtual machines are all running the same OS and database software, the virtual machine approach results in inefficient redundancy.

If all tenants of a multi-tenant system are using the same software, it is far more efficient to install a single instance of that software on the system, and allow all tenants to share the same software instance. However, a major concern with this approach is security: for example, in a database system, it is totally unacceptable for different tenants to have access to each other’s data. Even metadata should not be visible across tenants --- they should not be aware of each other’s table names and data profiles. In other words, each tenant should have a view of the database as if they are using an instance of that database installed and prepared specifically for that tenant, and any data and metadata associated with other tenants should be totally invisible.

Furthermore, even the performance of database queries, transactions, and other types of requests should not be harmed by the requests of other tenants. For example, if tenant A is running a long and resource-intensive query, tenant B should not observe slow-downs of the requests it is concurrently making of the database system.

Teradata’s recent announcement of its secure zones feature is thus a major step towards a secure, multi-tenant version of Teradata. Each tenant exists within its own “Secure Zone”, and each zone has its own separate set of users that can only access database objects within that zone. The view that a user has of the database is completely local to the zone in which that user is defined --- even the database metadata (“data dictionary tables” in Teradata lingo) is local to the zone, such that user queries of this metadata only return results for the metadata associated with the zone in which the user is defined. Users are not even able to explicitly grant permissions to view database objects of their zone to users of a different zone --- each zone is 100% isolated from the other secure zones[1].

Figure 1: Secure zones contain database users, tables, profiles, and views

 

A key design theme in Teradata’s Secure Zones feature is the separation of administrative duties from access privileges. For example, in order to create a new tenant, there needs to be a way to create a new secure zone for that tenant. Theoretically, the most straightforward mechanism for accomplishing this would be via a “super user” analogous to the linux superuser / root that has access to the entire system and can create new users and data on the system at will. This Teradata superuser could then add and remove new secure zones, create users for those zones, and access data within those zones.

Unfortunately, this straightforward “superuser”   solution is fundamentally antithetical to the general “Secure Zones” goal of isolating zones from each other, since the zone boundaries have no effect on the superuser. In fact, the presence of a superuser would violate regulatory compliance requirements in certain multi-tenant application scenarios.

Therefore, Teradata’s Secure Zones feature includes the concept of a “Zone Administrator” --- a special type of user that can perform high level zone administration duties, but has no discretionary access rights on any objects or data within a zone. For example, the Zone Administrator has the power to create and drop zones, and to grant limited access to the zone for specific types of users. Furthermore, the Zone Administrator determines the root object of a zone. However, the Zone Administrator cannot read or write that root object, nor any of its descendants.

Analogous to a Zone Administrator is a special kind of user called a “DBA User”. Just as a Zone Administrator can perform administrative zone management tasks without discretionary access rights in the zones that it manages, a DBA User can perform administrative tasks for a particular zone without superuser discretionary access rights in that zone. In particular, DBA Users only receive automatic DDL and DCL rights within a zone, along with the power to create and drop users and objects. However, they must be directly assigned DML rights for any objects within a zone that they do not own in order to be able to access them. Thus, if every zone in a Teradata system is managed by a DBA User, then the resulting configuration has complete separation of administrative duties from access privileges --- the Zone Administrator and DBA Users perform the administration without any automatic discretionary access rights on the objects in the system.
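As a purely hypothetical sketch of how this separation of duties might look (the statement forms and names below are assumptions for illustration, not confirmed syntax --- see the Teradata Database 15.10 security administration documentation for the actual Secure Zones DDL):

-- The zone administrator creates a zone for a new tenant but gains no access to its data:
CREATE ZONE tenant_a_zone;

-- A DBA user is granted into the zone; it can then create users and objects visible only
-- inside the zone, but receives no automatic DML rights on objects it does not own:
GRANT ZONE tenant_a_zone TO tenant_a_dba;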

The immediate use case for secure zones is Teradata’s new “Software Defined Warehouse” which is basically a Teradata private cloud within an organization. It consists of a single Teradata system that is able to serve multiple different Teradata database instances from the same system. If the organization develops a new application that can be served from a Teradata database, instead of acquiring the hardware and software package that composes a new Teradata system, the organization can instead serve this application from the software defined warehouse. Multiple existing Teradata database instances can also be consolidated into the software defined warehouse.

Figure 2: Software-Defined Warehouse Workloads

 

The software defined warehouse is currently intended for use cases where all applications / database instances that it is managing belong to the same organization. Nonetheless, in many cases, different parts of an organization are not allowed access to data for other parts of that organization. This is especially true for multinational or conglomerate companies with multiple subsidiaries where access to subsidiary data must be tightly controlled and restricted to users of the subsidiary or citizens of a specific country.  Therefore, each database instance that the software defined warehouse is managing exists within a secure zone.

In addition to secure zones, the other major Teradata feature that makes efficient multi-tenancy possible is Teradata Workload Management. Without workload management, it is possible for system resources to get hogged by a single database instance that is running a particularly resource-intensive task, while users of the other instances see significantly increased latencies and overall degraded performance. For the multiple virtual-machine implementation of the cloud mentioned above, the hypervisor implements workload management --- ensuring that each virtual machine gets a guaranteed amount of important system resources such as CPU and memory. Teradata’s “virtual partitions” work the same way --- the system resources are divided up so that each partition is guaranteed a fixed amount of system resources. By placing each Teradata instance inside its own virtual partition, the Teradata workload manager can thus ensure that the database utilization of one instance does not affect the observed performance of other instances.

When you combine Teradata Secure Zones and Teradata Workload Management, you end up with a cloud-like environment, where multiple different Teradata databases can be served from a single system. Additional database instances can be created “on demand”, backed by this same system, without having to wait for procurement of an additional Teradata system. However, this mechanism of “cloudifying” Teradata is much more efficient than installing the Teradata database software in multiple different virtual machines, since all instances are served from a single version of the Teradata codebase, without redundant operating system and database system installations.

Since I am not a full-time employee of Teradata and have not been briefed on future plans for Teradata in the cloud, I can only speculate about the next steps for Teradata’s plans for cloud. Obviously, Teradata’s main focus for secure zones and virtual partitions has been the software-defined warehouse, so that organizations can implement a private cloud or consolidate multiple Teradata instances onto a single system. However, I do not see any fundamental limitations to prevent Teradata from leveraging these technologies in order to build a public Teradata cloud, where Teradata instances from different organizations share the same physical hardware, just like VMs from different organizations share the same hardware in Amazon’s cloud. Whether or not Teradata chooses to go in this direction is likely a business decision that they will have to make, but it’s interesting to see that with secure zones and workload management, they already have the major technological components to proceed in this direction and build a highly-efficient database-as-a-service offering.

 

[1] There is a concept of a special type of user called a “Zone Guest”, which is not associated with any zone, and can have guest access to objects in multiple zones, but the details of this special type of user are outside the scope of this post.

___________________________________________________________________________________________________________

Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

 

 

 

 


Extract and analyze SHOWBLOCKS/SHOWWHERE data as easy as pie

$
0
0
Short teaser: 
SQL SHOWBLOCKS & SQL SHOWWHERE
Cover Image: 

Tired of parsing the output from Ferret SHOWBLOCKS and SHOWWHERE commands?  You don’t need to do that anymore with the release of the SQL SHOWBLOCKS and SQL SHOWWHERE features. The SQL SHOWBLOCKS feature is available in Teradata Database 15.0, and SQL SHOWWHERE is available in 15.10. These new features allow you to easily extract SHOWBLOCKS and SHOWWHERE data, which can subsequently be queried for analysis using SQL. Not only is this information simpler to view, but the table-based data can easily be exported to third-party tools for better viewing and analysis of the system-level information.

Unlike the Ferret utility, the SQL SHOWBLOCKS and SQL SHOWWHERE macros do not require special console privileges (DBCCONS or CNS supervisor window) to run.  These SQL macros can be run through many user sessions at the same time, whereas the number of sessions that one can initiate through Ferret is limited to the number of CNS Supervisor window sessions that can be started.

SQL SHOWBLOCKS and SQL SHOWWHERE both employ the same two new system macros.  CreateFsysInfoTable creates a target table to hold the file system information, and PopulateFsysInfoTable populates the target table with system-level information for SHOWBLOCKS or SHOWWHERE. Once the target tables are populated with SHOWBLOCKS or SHOWWHERE rows, normal SQL queries can be run on those target tables, and several system-level details can be obtained, such as data block size statistics, cylinder-level and block-level compression statistics, and the temperature and grade of the storage.

The example below shows how to create a target table to hold SHOWBLOCKS information:

EXEC DBC.CreateFsysInfoTable ('SYSTEMINFO', 'SHOWBLOCKS_M', 'PERM', 'Y', 'SHOWBLOCKS', 'M');

-  Creates the permanent target table 'SHOWBLOCKS_M' with fallback in the target database 'SYSTEMINFO' for capturing SHOWBLOCKS medium ('M') display rows.

The example below shows how to populate the target table created above:

EXEC DBC.PopulateFsysInfoTable ('PRODUCTION', 'CALLLOG_2015', 'SHOWBLOCKS', 'M', 'SYSTEMINFO', 'SHOWBLOCKS_M');

-  Populates the target table 'SYSTEMINFO.SHOWBLOCKS_M' with SHOWBLOCKS medium ('M') display rows of the input table 'PRODUCTION.CALLLOG_2015'.

For more details on syntax, invocation, and other requirements, refer to the SQL Functions, Operators, Expressions and Predicates manual and the SQL SHOWBLOCKS and SQL SHOWWHERE Orange Book listed in Appendix A.

Figure 1 shows sample rows from a target table "SYSTEMINFO.SHOWBLOCKS_M", which stored SHOWBLOCKS information for the source table "PRODUCTION.CALLLOG_2015".

Figure 1: Sample SQL SHOWBLOCKS rows from a target table

 

The corresponding SHOWBLOCKS output collected from Ferret is shown in Figure 2.

Figure 2: SHOWBLOCKS /M output from Ferret

 

A sample SQL query for extracting data from the target table, of the kind used to create the graphs shown in Figure 3 and Figure 4, is presented below.
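The query below is illustrative only. The column names used (TheDate, EstCompRatio, MinDBSize, MaxDBSize, AvgDBSize) are assumptions made for the sake of the example; verify the actual names against the DDL generated by CreateFsysInfoTable (for example, with SHOW TABLE SYSTEMINFO.SHOWBLOCKS_M). A per-date summary along these lines produces the kind of data behind the two graphs:

/* Column names are illustrative; verify them against the generated target-table DDL. */
SELECT TheDate
     , AVG(EstCompRatio) AS AvgEstCompRatio
     , MIN(MinDBSize)    AS MinDBSizeSectors
     , MAX(MaxDBSize)    AS MaxDBSizeSectors
     , AVG(AvgDBSize)    AS AvgDBSizeSectors
FROM SYSTEMINFO.SHOWBLOCKS_M
GROUP BY TheDate
ORDER BY TheDate;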

Note that these graphs reflect multiple invocations of the PopulateFsysInfoTable macro over time into the target table 'SYSTEMINFO.SHOWBLOCKS_M' before the extraction SQL was run.

 

      

Figure 3: Estimated Compression Ratio by Date
Figure 4: Min, Max and Average DB Size values (in units of sectors) by Date

 

 

Key Points

  • Compared to the traditional mechanism (the Ferret utility) for extracting SHOWBLOCKS and SHOWWHERE data for analysis of system-level information, the new SQL SHOWBLOCKS and SQL SHOWWHERE methods are much easier and have fewer limitations.
  • System-level data can be captured at multiple points over time, making it easy to collect historical data for long-term analysis.
  • Graphs are easier to produce and data interpretation is more straightforward.

 

Appendix A: Reference Material

  1. Teradata Database Manual “SQL Functions, Operators, Expressions and Predicates”, Release 15.10
  2. Teradata Orange Book “SQL SHOWBLOCKS and SQL SHOWWHERE in Teradata Database”, Book# 541-0010699-A02, April 2015
  3. Teradata Orange Book “Block Level Compression in Teradata 14.0, including Temperature Based and Independent Sub-table Compression”, 2011-10

Teradata Partitioning

$
0
0
Course Number: 
53716
Training Format: 
Recorded webcast

Teradata partitioning was originally released in V2R5.0. This presentation reviews partitioning and the changes to this feature (from DPE to columnar) that have occurred over the last twelve years.

Additional insights into the feature are provided. If you thought you knew all about partitioning from V2R5.0, this presentation brings you up to date and also provides some insight into the usage of the feature.

Presenter: Paul Sinclair - Teradata Corporation

Price: 
$195
Credit Hours: 
2
Channel: 

Teradata Database 15.10 Overview

$
0
0
Course Number: 
54172
Training Format: 
Recorded webcast

Come hear all about the newest features of Teradata Database 15.10!

Learn about the exciting new enhancements to our Columnar Tables offering, the latest In-Memory Optimizations, the new Load Isolation feature, a new table-level locking feature, Partition Level Locking (PLL), and Secure Zones for multi-tenancy.

The Teradata Database 15.10 release represents the next major step in the evolution of Data Warehousing excellence.

Presenter: Richard Charucki - Teradata Corporation

Price: 
$195
Credit Hours: 
2
Channel: 

Teradata Basics: Parallelism

$
0
0
Cover Image: 

This “Teradata Basics” posting describes the current dimensions of parallelism in the Teradata Database, and how they work together.

What is Query Parallelism?

Executing a single SQL statement in parallel means breaking the request into components, and working on all components at the same time, with one single answer delivered. Parallel execution can incorporate all or part of the operations within a query, and can significantly reduce the response time of an SQL statement, particularly if the query reads and analyzes a large amount of data.

With a design goal of eliminating single-threaded operations, the original architects of the Teradata Database parallelized everything, from the entry of SQL statements to the smallest detail of their execution. The database’s entire foundation was constructed around the idea of giving each component in the system many look-alike counterparts. Not knowing where the future bottlenecks might spring up, early developers weeded out all possible single points of control and effectively eliminated the conditions that breed gridlock in a system.

Limitless interconnect pathways and multiple optimizers, host channel connections, gateways, and units of parallelism are supported in Teradata, increasing the flexibility and control over performance that is crucial to large-scale data analytics today.

Teradata’s Unit of Parallelism

As you probably know, Teradata’s basic unit of parallelism is the AMP. From system configuration time forward, all queries, data loads, backups, index builds, in fact everything that happens in a Teradata system, is shared across a pre-defined number of AMPs. The parallelism is predictable and understandable.

Each AMP acts like a microcosm of the database, supporting such things as data loading, reading, writing, journaling and recovery for all the data that it owns. The parallel units also have knowledge of each other and work cooperatively together behind the scenes.  This teamwork among parallel units is an unusual strength of the Teradata Database, driving higher performance with minimal overhead.
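One practical way to see this per-AMP ownership at work is to ask the database how a table’s rows hash across the AMPs. The sketch below uses the standard HASHROW, HASHBUCKET, and HASHAMP functions; the table and column names (a hypothetical Customer table distributed on cust_id) are placeholders only:

/* Counts the rows owned by each AMP; a roughly even distribution means evenly balanced parallelism. */
SELECT HASHAMP(HASHBUCKET(HASHROW(cust_id))) AS Amp_Number
     , COUNT(*)                              AS Row_Count
FROM Customer
GROUP BY 1
ORDER BY 1;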

Teradata’s Dimensions of Query Parallelism

While the AMP is the fundamental unit of apportionment, and delivers basic query parallelism to the work in the system, there are two additional parallel dimensions woven into the Teradata Database, specifically for query performance. These are referred to here as “within-a-step” parallelism, and “multi-step” parallelism. A description of all three dimensions of parallelism that Teradata applies to a query follows:

Query Parallelism. Query parallelism is usually enabled in Teradata by hash-partitioning the data across all the AMPs defined in the system.  One exception is No Primary Index tables, which use other mechanisms for assigning data to AMPs.  Once data is assigned to an AMP, the AMP provides all the database services on its allocation of data blocks. All relational operations such as table scans, index scans, projections, selections, joins, aggregations, and sorts execute in parallel across the AMPs simultaneously. Each operation is performed on an AMP’s data independently of the data associated with the other AMPs.
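The primary index chosen at table creation time is what drives this hash partitioning. As a minimal sketch (the table and column names are made up for illustration), the first DDL statement below hash-distributes rows on cust_id, while the second creates a No Primary Index table whose rows are assigned to AMPs without hashing on a column:

/* Rows are hash-distributed across all AMPs on cust_id. */
CREATE TABLE Sales_Fact
( sale_id   INTEGER NOT NULL
, cust_id   INTEGER NOT NULL
, store_id  INTEGER
, sale_date DATE
, sale_amt  DECIMAL(12,2)
)
PRIMARY INDEX (cust_id);

/* A No Primary Index table: rows are assigned to AMPs without hashing on a column value. */
CREATE TABLE Sales_Stage
( sale_id   INTEGER
, cust_id   INTEGER
, sale_amt  DECIMAL(12,2)
)
NO PRIMARY INDEX;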

Within-a-Step Parallelism. A second dimension of parallelism that will naturally unfold during query execution is an overlapping of selected database operations referred to here as within-a-step parallelism. The optimizer splits an SQL query into a small number of high level database operations called “steps” and dispatches these distinct steps for execution to the AMPs, one after another.

A step can be a small piece or a large chunk of work.  It can be simple, such as "scan a table and return the result" or complex, such as "scan two tables and apply predicates to each, join the two tables, redistribute the join result on specified columns, sort the redistributed rows, and place the redistributed rows in an intermediate table".

Within each of these potentially large chunks of work that we call steps, multiple relational operations can be processed in parallel by pipelining.  While a table scan is taking place, rows that are selected can be pipelined into a join process immediately.  Pipelining is the ability to begin one task before its predecessor task has completed and will take place whenever possible within each distinct step.

This dynamic execution technique, in which a second operation jumps off of a first one to perform portions of the step in parallel, is key to increasing the basic query parallelism. The relational-operator mix of a step is carefully chosen by the Teradata optimizer to avoid stalls within the pipeline.

Multi-Step Parallelism. Multi-step parallelism is enabled by executing multiple “steps” of a query simultaneously, across all the participating units of parallelism in the system. One or more tasks are invoked for each step on each AMP to perform the actual database operation. Multiple steps for the same query can be executing at the same time to the extent that they are not dependent on results of previous steps.

Below is a representation of how these three types of parallelism might appear in a query’s execution.

The above figure shows four AMPs supporting a single query’s execution, and the query has been optimized into 7 steps.  Step 1.2 and Step 2.2 demonstrate within-a-step parallelism, where two different tables are scanned and joined together (three different operations are performed).  The results of those three operations are pipelined into a sort and then a redistribution, all in one step. Steps 1.1 and 1.2 together (as well as 2.1 and 2.2 together) demonstrate multi-step parallelism, as two distinct steps are chosen to execute at the same time, within each AMP.

And Even More Parallel Possibilities

In addition to the three dimensions of parallelism shown above, the Teradata Database offers an SQL extension called a “multi-statement request” that allows several distinct statements to be bundled together and sent from the client to the database as if they were one. These SQL statements will then be executed in parallel.

When the multi-statement request feature is used, any sub-expressions that the different SQL statements have in common will be executed once and the results shared among them.  Known as “common sub-expression elimination,” this means that if six select statements were bundled together and all contained the same subquery, that subquery would only be executed once. Even though these SQL statements are executed in an inter-dependent, overlapping fashion, each query in a multi-statement request will return its own distinct answer set.
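As a small, hypothetical illustration (the table and column names are invented), the two SELECT statements below can be submitted as one multi-statement request. In BTEQ, starting a line with a semicolon continues the current request rather than ending it, so both statements are sent to the database together. Both touch the same rows of Sales_Fact, so common work can be shared, and each statement still returns its own answer set:

/* Two statements bundled into a single multi-statement request (illustrative names). */
SELECT store_id, SUM(sale_amt)
FROM Sales_Fact
WHERE sale_date >= DATE '2015-01-01'
GROUP BY store_id
;SELECT store_id, AVG(sale_amt)
FROM Sales_Fact
WHERE sale_date >= DATE '2015-01-01'
GROUP BY store_id;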

This multi-faceted parallelism is not easy to choreograph unless it is planned for in the early stages of product evolution. An optimizer that generates three dimensions of parallelism for a single query, as described here, must be intimately familiar with all the parallel capabilities that are available and know how and when to use them.  But most importantly, the Teradata Database applies these multiple dimensions of parallelism automatically, without user intervention or special setup.


Hands-On with Teradata QueryGrid™ - Teradata-to-Hadoop

$
0
0
Course Number: 
54300
Training Format: 
Recorded webcast

Today's analytic environments incorporate multiple technologies and systems. Teradata QueryGrid™  Teradata-to-Hadoop allows you to access data and processing on Hadoop from your Teradata data warehouse.

This course gives you an in-depth understanding of how QueryGrid Teradata-to-Hadoop works, including querying metadata, partition pruning, push-down processing, importing data, and joins. Using live queries, we explain the syntax and functionality of QueryGrid in order to combine data and analytics across your analytic ecosystem.

Presenter: Andy Sanderson, Product Manager - Teradata Corporation

Prerequisite:  Course #53556, High Performance Multi-System Analytics using Teradata QueryGrid (Webcast)

Price: 
$195
Credit Hours: 
1
Channel: 

Hands-On with Teradata QueryGrid™ - Teradata-to-Teradata

$
0
0
Course Number: 
54299
Training Format: 
Recorded webcast

Teradata QueryGrid enables high performance multi-system analytics across various data platforms. Integrating the data and analytics of multiple Teradata systems can help organizations to economically scale their Teradata environment by adding systems with different characteristics ...

... Or they can take advantage of existing investments by leveraging idle resources. The QueryGrid Teradata-to-Teradata connector has been available since Q1 2015 and provides bi-directional parallel data transfer, ad-hoc query capabilities, and push-down processing between Teradata databases. This course is a deep dive into both the functionality and example use cases for this connector. We use live queries to explain these capabilities and how to use them, as well as discuss various applications for your analytic environment.

Presenter: Andy Sanderson, Product Marketing Manager - Teradata Corporation

Prerequisite:  Course #53556 High Performance Multi-System Analytics using Teradata QueryGrid (Webcast)

Price: 
$195
Credit Hours: 
1
Channel: 

Workload Management with SLES 11 – Tips and Techniques

$
0
0
Course Number: 
53853
Training Format: 
Recorded webcast

This session covers the main topics of Workload Management on SLES 11 systems. 

Typical concerns that arise while configuring workload management for SLES 11 are discussed and strategies for many of these situations are provided.

Presenter: Dan Fritz - Teradata Corporation

Price: 
$195
Credit Hours: 
2
Channel: 

Implementing Temporal on Teradata

Defense in Depth - Best Practices for Securing a Teradata Data Warehouse

$
0
0
Course Number: 
36675
Training Format: 
Recorded webcast

Defense in depth is the practice of implementing multiple layers of security to protect business-critical information.

This session helps security administrators and database administrators better understand the types of attacks that can be used to compromise the security of a data warehouse. The session describes proven best practices that can be used to build a comprehensive, layered defense against these attacks and demonstrates how to use the security features within the Teradata Database and associated platforms to implement a strategy that balances security requirements with other important cost, performance, and operational considerations.

Key Points:

  • Understand the major security threats to a data warehouse 
  • Proven best practices in 24 key areas for implementation of a defense in depth security strategy 

Presenter: Jim Browning, Enterprise Security Architect - Teradata Corporation
 

Audience: 
Database Administrator, Data Warehouse Architect
Price: 
$195
Credit Hours: 
2
Channel: 