Key DCA performance
Post date: Oct 06, 2013 11:46:13 PM
A high-performance data warehouse solution strongly relies on the database technology and the power of the hardware platform. Three key performance indicators
that influence the success of a data warehouse solution are:
• Scan rate: How quickly the database can read and process data under varying conditions and workloads
• Data load rate: How quickly data can be loaded into the database
• Scalability: How well the system can scale and predictably handle the ever-growing data load and workload requirements
This section provides information on both the scan and the data load rates achieved during performance testing. It also provides the scalability testing results from
expanding from two GPDB Standard Modules to four GPDB Standard Modules, and from four GPDB Standard Modules to eight GPDB Standard Modules.
All tests were performed without performance tuning; therefore, any customer can expect to achieve these results “out of the box.”
Notes
• Benchmark results are highly dependent upon workload, specific application requirements, and system design and implementation.
Relative system performance will vary as a result of these and other factors. Therefore, do not use this workload as a substitute for specific
customer application benchmarking when contemplating critical capacity planning and/or product evaluation decisions.
• All performance data contained in this report was obtained in a rigorously controlled environment. Results obtained in other operating environments may vary significantly.
• EMC Corporation does not warrant or represent that a user can or will achieve similar performance expressed in transactions per minute.
Data analytics customers have unique business and operational requirements, which are reflected in the aspects of performance that are considered during the selection of a data solution. Examples of such performance aspects follow.
Data load rates
The speed at which data can be loaded into a database is important to customers who have a batch load process with a shrinking load window, face a growing volume of
data, or are looking to move toward real-time analysis.
For the DCA, tests were performed on one, two, and four GPDB modules to demonstrate the data load rate capability.
Test objectives: Data load rate
Data load rate test objectives
=======================================================================================================================
Objective                               Description
=======================================================================================================================
Data load rate                          Determine the rate at which data can be loaded by:
                                        • 1 GPDB Standard Module
                                        • 2 GPDB Standard Modules
                                        • 4 GPDB Standard Modules
                                        • 1 GPDB High Capacity Module
                                        • 2 GPDB High Capacity Modules
                                        • 4 GPDB High Capacity Modules
Linear scalability of data load rate    Demonstrate that the data load rate improves in a linear manner as you add modules.
=======================================================================================================================
Test scenario: Data load rate
The test scenario was designed to load several large, flat ASCII files concurrently into the database to simulate a typical ETL operation. The test utilized Greenplum's MPP Scatter/Gather Streaming technology. The source dataset used for data load consisted of multiple separate ASCII data files spread across the ETL server environment. Sufficient bandwidth was provided between the DCA and ETL environment to ensure that this was not a bottleneck to performance.
Test method: Data load rate
To measure the data load rate, the validation team:
1. Created an external table definition for the ASCII dataset files that were located on the ETL environment and connected the ETL environment to the
Interconnect Bus using two 10 GbE LAGs.
2. Initiated the following SQL command on the Master Server, which was then executed on the Segment Servers: INSERT INTO <target-table> SELECT * FROM <external-table>
3. Measured the amount of time required to load the data.
4. Calculated the data load rate (TB/hour) by dividing the total amount of raw data loaded by the data load duration.
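The calculation in step 4 is a simple division. As a minimal sketch in Python: the function name `load_rate_tb_per_hour` and the sample figures are hypothetical illustrations, not measurements from the test, and 1 TB is assumed to be 10^12 bytes.

```python
def load_rate_tb_per_hour(raw_bytes_loaded: float, duration_seconds: float) -> float:
    """Return the data load rate in TB/hour (assuming 1 TB = 10**12 bytes)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    terabytes = raw_bytes_loaded / 1e12   # total raw data loaded, in TB
    hours = duration_seconds / 3600.0     # load duration, in hours
    return terabytes / hours


# Hypothetical example: 2 TB of raw data loaded in 30 minutes.
rate = load_rate_tb_per_hour(2e12, 30 * 60)
print(f"{rate:.1f} TB/hour")
```

Note that the rate is based on the raw (uncompressed source) data volume, as in step 4, not on the space the data occupies in the database after loading.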
Test results: Data load rate
The following tables list the data load rates recorded during testing for GPDB Standard Modules and GPDB High Capacity Modules. The test results demonstrate that the DCA data load rates scale in a
linear manner. For example, expanding from two modules to four modules leads to an effective doubling of the rate at which data can be loaded.
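GPDB Standard Module data load rates (TB/hour)
========================================================
DCA option                      Data load rate
========================================================
1 GPDB Standard Module          3.4 TB/hour
2 GPDB Standard Modules         6.7 TB/hour
4 GPDB Standard Modules         13.4 TB/hour
========================================================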
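GPDB High Capacity Module data load rates (TB/hour)
========================================================
DCA option                      Data load rate
========================================================
1 GPDB High Capacity Module     3.4 TB/hour
2 GPDB High Capacity Modules    6.7 TB/hour
4 GPDB High Capacity Modules    13.4 TB/hour
========================================================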
Note: Testing involved loading data into compressed tables, which is not a disk-intensive operation. When loading data into uncompressed tables, which is a disk-intensive operation, data loading rates for High Capacity Modules might be slower than rates for Standard Modules because High Capacity Modules use larger but slower (lower rpm) disks than Standard Modules use.
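One way to quantify the linear-scaling claim is to compare each measured rate with ideal linear scaling from the one-module result. The sketch below uses the Standard Module data load rates listed above; the helper name `scaling_efficiency` is a hypothetical illustration, not part of any DCA or Greenplum tooling.

```python
# Data load rates (TB/hour) keyed by module count, from the Standard Module table above.
load_rates = {1: 3.4, 2: 6.7, 4: 13.4}


def scaling_efficiency(rates: dict) -> dict:
    """Ratio of each measured rate to ideal linear scaling from the smallest configuration."""
    base_modules = min(rates)
    per_module = rates[base_modules] / base_modules  # rate contributed by one module
    return {n: rate / (n * per_module) for n, rate in sorted(rates.items())}


efficiency = scaling_efficiency(load_rates)
for modules, eff in efficiency.items():
    print(f"{modules} module(s): {eff:.1%} of ideal linear scaling")
```

A value near 100% for every configuration is what "linear" means here. The published figures are rounded to one decimal place, so small deviations from 100% may reflect rounding rather than a real loss of scaling.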
Query performance: This is a common concern for many customers. Query performance relies on four factors:
• Hardware (scan rate)
• Schema structure (table and index)
• Query complexity
• The applications that run on the data warehouse solution
Scan rate:
The performance of data warehouse systems is typically compared in terms of the scan rate, which measures how much throughput the database system can deliver.
Scan rate is an indication of how well the system is able to cope with processing vast volumes of data and the daily end-user workload.
Tests were performed on DCAs with Standard Modules and High Capacity Modules to demonstrate scan rate performance.
Scan rate test objectives
=======================================================================================================================
Objective                          Description
=======================================================================================================================
Scan rate                          Determine the scan rate for both Standard and High Capacity module configurations.
                                   Scan rate is a measure of how quickly the disks can move data (bytes).
Linear scalability of scan rate    Demonstrate that scan rates improve in a linear manner.
=======================================================================================================================
Test scenarios: Scan rate
The scan rate was measured on GPDB Standard Module configurations and on GPDB High Capacity Module configurations.
Test results: Scan rate
The following table lists the scan rates, in GB/s, measured for GPDB Standard Modules.
DCA GPDB Standard Module scan rates (GB/s)
========================================================
DCA option                      Scan rate
========================================================
1 GPDB Standard Module          5.9 GB/s
2 GPDB Standard Modules         11.8 GB/s
4 GPDB Standard Modules         23.6 GB/s
========================================================
The scan rate test results clearly demonstrate that:
• The DCA supports very high scan rates.
• The DCA scan-rate performance scales in a linear manner. Expanding from two modules to four modules leads to a doubling in performance.
DCA GPDB High Capacity Module scan rates (GB/s)
========================================================
DCA option                      Scan rate
========================================================
1 GPDB High Capacity Module     3.5 GB/s
2 GPDB High Capacity Modules    7 GB/s
4 GPDB High Capacity Modules    14 GB/s
========================================================
Operations: This consists of three main areas:
• Backup
• Disaster recovery
• Development and test refresh
The operational area is often overlooked and can become a challenge for other areas of the customer’s business. Therefore, depending on existing operational challenges, most customers select only one performance area (data load, query, or operational). The DCA solution provides customers with distinct advantages for query, data load, and operational performance. It is also important to note that database query performance is driven by three factors:
• System architecture and RDBMS
• Schema design
• Query complexity
By following Greenplum Database best practices for partitioning, parallelism, table design, and query optimization, the DCA can provide the scan rate required for the
processing needs of today's massive data warehouses.
Key results
Testing and validation demonstrated that the DCA handles real-world workloads extremely well, within a range of scenarios and configurations. Because of
Greenplum's true MPP architecture, the behavior of the DCA changes in a predictable and consistent manner, ensuring that customers can depend on the DCA for daily information requirements.
“Out-of-the-box” performance
The results presented here were produced on a standard “out-of-the-box”
DCA with no tuning applied, and indicate the overall performance that a customer can expect to achieve. The DCA also provides the ability to tune the environment to
specific business needs to ensure an even greater level of performance.
Data load rates
The results of the data load rate testing for the DCA with Standard Modules versus the DCA with High Capacity Modules are presented in Table 1. The table shows the
maximum rate achieved for each module type in the four-module configuration.
Table 1. Data load rates (TB/hour)
=======================================================
DCA option                      Data load rate
=============================   ============
4 GPDB Standard Modules         13.4 TB/hour
4 GPDB High Capacity Modules    13.4 TB/hour
=======================================================
Note: Testing involved loading data into compressed tables, which is not a disk-intensive operation. When loading data into uncompressed tables, which is a disk-intensive operation, data loading rates for High-Capacity Modules might be slower than rates for Standard Modules because High-Capacity Modules use larger but slower (lower rpm) disks than Standard Modules use.
DCA performance
Scan rate is a measure expressed as data bandwidth (for example, GB/s). It describes how much data can be read and processed within a certain period of time, and indicates how fast the disk I/O subsystem of the appliance can read data from disk to support the database. Table 2 presents the scan rate results.
Table 2. DCA scan rates (GB/s)
=======================================================
DCA option                      Scan rate
=============================   ============
4 GPDB Standard Modules         23.6 GB/s
4 GPDB High Capacity Modules    14 GB/s
=======================================================
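To put the scan rates in Table 2 into perspective, they can be converted into the time needed to read a dataset once from disk. The sketch below is an idealized estimate only: the function name `full_scan_seconds` and the 10 TB dataset size are hypothetical, 1 TB is assumed to be 1000 GB, and actual query times also depend on compression, schema design, and query complexity.

```python
def full_scan_seconds(dataset_tb: float, scan_rate_gb_per_s: float) -> float:
    """Seconds to read dataset_tb terabytes once at a sustained scan rate (1 TB = 1000 GB)."""
    if scan_rate_gb_per_s <= 0:
        raise ValueError("scan_rate_gb_per_s must be positive")
    return dataset_tb * 1000.0 / scan_rate_gb_per_s


# Scan rates (GB/s) from Table 2; the 10 TB dataset size is a hypothetical example.
scan_rates = {
    "4 GPDB Standard Modules": 23.6,
    "4 GPDB High Capacity Modules": 14.0,
}
for option, rate in scan_rates.items():
    minutes = full_scan_seconds(10, rate) / 60
    print(f"{option}: {minutes:.1f} minutes to scan 10 TB")
```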
DCA scalability
Expanding the data warehouse by upgrading from one module to 24 modules produces predictable performance gains with linear scaling. The four- to eight-module
scalability test results presented in this white paper demonstrate that the DCA is a readily expandable computing platform that can grow seamlessly with a customer’s business requirements.