What are the key features of the DCA?

posted Dec 13, 2013, 5:52 AM by Sachchida Ojha

The base architecture of the DCA is designed with scalability and growth in mind. This enables organizations to easily extend their DW/BI capability in a modular way; linear gains in capacity and performance are achieved as the appliance expands.
The DCA employs a high-speed Interconnect Bus that provides database-level communication between all servers in the DCA. It is designed to accommodate rapid backup and recovery as well as high data load (also known as ingest) rates. Excellent performance is achieved through the combined power of servers, software, network, and storage.
The DCA can be installed and available within 24 hours (or less) of delivery, ready to use for a faster return on investment (ROI).
The DCA uses cutting-edge industry-standard hardware optimized for data analytics.

What are the main components of the DCA?

posted Dec 13, 2013, 5:51 AM by Sachchida Ojha

Greenplum Database: Greenplum Database is an MPP (massively parallel processing) database server based on PostgreSQL open-source technology. It is explicitly designed to support BI applications and large, multi-terabyte data warehouses.
Greenplum Database system: An associated set of Segment Instances and a Master Instance running on an array, which can be composed of one or more hosts.
GPDB Master Servers: The servers responsible for the automatic parallelization of queries.
GPDB Segment Servers: The servers that perform the real work of processing and analyzing the data.

What is the EMC Data Computing Appliance (DCA)?

posted Dec 13, 2013, 5:50 AM by Sachchida Ojha

The DCA is a purpose-built, highly scalable, parallel DW appliance that architecturally integrates database, compute, storage, and network into an enterprise-class, easy-to-implement system. The DCA brings the power of MPP architecture, delivering fast data loading and industry-leading price/performance without the complexity and constraints of proprietary hardware.

The DCA can also be set up in a Unified Analytics Platform (UAP) configuration that is capable of managing, storing, and analyzing large volumes of structured and unstructured data. Greenplum UAP includes Greenplum Database, Greenplum HD, and Greenplum Chorus.

The DCA is offered in multiple-rack appliance configurations to achieve maximum flexibility and scalability for organizations faced with terabyte- to petabyte-scale data opportunities.

Key DCA performance

posted Oct 6, 2013, 4:46 PM by Sachchida Ojha   [ updated Oct 7, 2013, 7:59 AM ]

A high-performance data warehouse solution strongly relies on the database technology and the power of the hardware platform. Three key performance indicators
that influence the success of a data warehouse solution are:

Scan rate—How quickly the database can read and process data under varying conditions and workloads
Data load rate—How quickly data can be loaded into the database
Scalability—How well the system can scale and predictably handle ever-growing data load and workload requirements

This section provides information on both the scan and data load rates achieved during performance testing. It also provides the scalability testing results from expanding from two to four GPDB Standard Modules, and from four to eight GPDB Standard Modules.
All tests were performed without performance tuning; therefore, any customer can expect to achieve these results “out of the box.”

• Benchmark results are highly dependent upon workload, specific application requirements, and system design and implementation.
Relative system performance will vary as a result of these and other factors. Therefore, do not use this workload as a substitute for specific
customer application benchmarking when contemplating critical capacity planning and/or product evaluation decisions.
• All performance data contained in this report was obtained in a rigorously controlled environment. Results obtained in other operating environments may vary significantly.
• EMC Corporation does not warrant or represent that a user can or will achieve similar performance expressed in transactions per minute.

Data analytics customers have unique business and operational requirements, which are reflected in the aspects of performance that are considered during the selection of a data solution. Examples of such performance aspects follow.
Data load rates
The speed at which data can be loaded into a database is important to customers who have a batch load process with a shrinking load window and a growing volume of data, and who are looking to move toward real-time analysis.

For the DCA, tests were performed on one, two, and four GPDB modules to demonstrate the data load rate capability.
Test objectives: Data load rate

Data load rate test objectives
Objective                               Description
Data load rate                          Determine the rate at which data can be loaded by:
                                        • 1 GPDB Standard Module
                                        • 2 GPDB Standard Modules
                                        • 4 GPDB Standard Modules
                                        • 1 GPDB High Capacity Module
                                        • 2 GPDB High Capacity Modules
                                        • 4 GPDB High Capacity Modules
Linear scalability of data load rate    Demonstrate that the data load rate improves in a linear manner as modules are added.

Test scenario: Data load rate
The test scenario was designed to load several large, flat ASCII files concurrently into the database to simulate a typical ETL operation. The test utilized Greenplum's MPP Scatter/Gather Streaming technology. The source dataset used for data load consisted of multiple separate ASCII data files spread across the ETL server environment. Sufficient bandwidth was provided between the DCA and ETL environment to ensure that this was not a bottleneck to performance.
Test method: Data load rate
To measure the data load rate, the validation team:
1. Created an external table definition for the ASCII dataset files located in the ETL environment, and connected the ETL environment to the Interconnect Bus using two 10 GbE LAGs (link aggregation groups).
2. Initiated the following SQL command on the Master Server, which executed it in parallel on the Segment Servers: insert into <target-table> select * from <external-table>
3. Measured the amount of time required to load the data.
4. Calculated the data load rate (TB/hour) by dividing the total amount of raw data loaded by the data load duration.
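The calculation in step 4 is straightforward arithmetic; a minimal sketch (the dataset size and duration below are illustrative values, not the tested ones):

```python
def load_rate_tb_per_hour(raw_data_tb, duration_seconds):
    """Data load rate = raw data volume (TB) / load duration (hours)."""
    hours = duration_seconds / 3600.0
    return raw_data_tb / hours

# Illustrative example: loading 10 TB of raw ASCII data in 3 hours
rate = load_rate_tb_per_hour(10.0, 3 * 3600)
print(f"{rate:.2f} TB/hour")  # ≈ 3.33 TB/hour
```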

Test results: Data load rate
The data load rates recorded during testing for GPDB Standard Modules are shown below. The test results clearly demonstrate that DCA data load rates scale in a linear manner; for example, expanding from two modules to four modules effectively doubles the rate at which data can be loaded.
GPDB Standard Module data load rates (TB/hour)
DCA option                       Data load rate
1 GPDB Standard Module           3.4 TB/hour
2 GPDB Standard Modules          6.7 TB/hour
4 GPDB Standard Modules          13.4 TB/hour

GPDB High Capacity Module data load rates (TB/hour)
DCA option                       Data load rate
1 GPDB High Capacity Module      3.4 TB/hour
2 GPDB High Capacity Modules     6.7 TB/hour
4 GPDB High Capacity Modules     13.4 TB/hour
Note: Testing involved loading data into compressed tables, which is not a disk-intensive operation. When loading data into uncompressed tables, which is disk-intensive, data load rates for High Capacity Modules might be slower than rates for Standard Modules, because High Capacity Modules use larger but slower (lower-rpm) disks than Standard Modules use.
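The linear-scaling claim can be sanity-checked directly from the published figures; a small sketch using the data load rates from the tables above:

```python
# Published data load rates (TB/hour) for GPDB Standard Modules
load_rates = {1: 3.4, 2: 6.7, 4: 13.4}

# Linear scaling predicts a roughly constant per-module rate
for modules, rate in load_rates.items():
    print(f"{modules} module(s): {rate / modules:.2f} TB/hour per module")

# Expanding from 2 to 4 modules roughly doubles the aggregate rate
print(f"ratio: {load_rates[4] / load_rates[2]:.2f}")  # ratio ≈ 2
```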

Query performance: This is a common concern for many customers. Query performance relies on four factors:
• Hardware (scan rate)
• Schema structure (table and index)
• Query complexity
• The applications that run on the data warehouse solution

Scan rate: 
The performance of data warehouse systems is typically compared in terms of the scan rate, which measures how much throughput the database system can deliver.
Scan rate is an indication of how well the system is able to cope with processing vast volumes of data and the daily end-user workload.
Tests were performed on DCAs with Standard Modules and High Capacity modules to demonstrate scan rate performance.

Scan rate test objectives
Objective                          Description
Scan rate                          Determine the scan rate for both Standard and High Capacity Module configurations. Scan rate is a measure of how quickly the disks can move data (bytes).
Linear scalability of scan rate    Demonstrate that scan rates improve in a linear manner.

Test scenarios: Scan rate
The scan rate was measured on GPDB Standard Module configurations and on GPDB High Capacity Module configurations.

Test results: Scan rate

The scan rates recorded for GPDB Standard Modules are shown below.

DCA GPDB Standard Module scan rates (GB/s)
DCA option                       Scan rate
1 GPDB Standard Module           5.9 GB/s
2 GPDB Standard Modules          11.8 GB/s
4 GPDB Standard Modules          23.6 GB/s

The scan rate test results clearly demonstrate that:
• The DCA supports very high scan rates.
• The DCA scan-rate performance scales in a linear manner. Expanding from two modules to four modules leads to a doubling in performance.

DCA GPDB High Capacity Module scan rates (GB/s)
DCA option                       Scan rate
1 GPDB High Capacity Module      3.5 GB/s
2 GPDB High Capacity Modules     7.0 GB/s
4 GPDB High Capacity Modules     14.0 GB/s
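To put these scan rates in context, a quick sketch converting a sustained scan rate into the time needed to read through a dataset of a given size (the 100 TB figure is illustrative, not from the tests):

```python
def scan_time_minutes(dataset_tb, scan_rate_gb_per_s):
    """Minutes to read a dataset at a given sustained scan rate."""
    dataset_gb = dataset_tb * 1024.0  # TB -> GB (binary units assumed)
    return dataset_gb / scan_rate_gb_per_s / 60.0

# Illustrative: a full scan of a 100 TB dataset
print(f"{scan_time_minutes(100, 23.6):.1f} min")  # 4 Standard Modules
print(f"{scan_time_minutes(100, 14.0):.1f} min")  # 4 High Capacity Modules
```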

Operations: This consists of three main areas:
• Backup
• Disaster recovery
• Development and test refresh
The operational area is often overlooked and can become a challenge for other areas of the customer's business. Therefore, depending on existing operational challenges, most customers select only one performance area: data load, query, or operational. The DCA solution provides customers with distinct advantages for query, data load, and operational performance. It is also important to note that database query performance is driven by three factors:
• System architecture and RDBMS
• Schema design
• Query complexity
By following Greenplum Database best practices for partitioning, parallelism, table design, and query optimization, the DCA can provide the scan rate required for the
processing needs of today's massive data warehouses.

Key results

Testing and validation demonstrated that the DCA handles real-world workloads extremely well, within a range of scenarios and configurations. Because of
Greenplum's true MPP architecture, the behavior of the DCA changes in a predictable and consistent manner, ensuring that customers can depend on the DCA for daily information requirements.
“Out-of-the-box” performance
The results presented here were produced on a standard “out-of-the-box” 
DCA with no tuning applied, and indicate the overall performance that a customer can expect to achieve. The DCA also provides the ability to tune the environment to
specific business needs to ensure an even greater level of performance.
Data load rates
The results of the data load rate testing for the DCA with Standard Modules versus the DCA with High Capacity Modules are presented in Table 1. The table shows the maximum achievable rate for each module type.
Table 1. Data load rates (TB/hour)
DCA option                                       Data load rate
========================                 ============
4 GPDB Standard Modules                         13.4 TB/hour
4 GPDB High Capacity Modules                  13.4 TB/hour
Note: Testing involved loading data into compressed tables, which is not a disk-intensive operation. When loading data into uncompressed tables, which is disk-intensive, data load rates for High Capacity Modules might be slower than rates for Standard Modules, because High Capacity Modules use larger but slower (lower-rpm) disks than Standard Modules use.

DCA performance
Scan rate is expressed as data bandwidth (for example, GB/s) and describes how much data can be read and processed within a given period of time. It indicates how fast the disk I/O subsystem of the appliance can read data from disk to support the database. Table 2 presents the scan rate results.
Table 2.  DCA scan rates (GB/s)
DCA option                                               Scan rate
========================                 ============
4 GPDB Standard Modules                         23.6 GB/s
4 GPDB High Capacity Modules                  14 GB/s

DCA scalability
Expanding the data warehouse by upgrading from one module to 24 modules produces predictable performance gains with linear scaling. The four- to eight-module
scalability test results presented in this white paper demonstrate that the DCA is a readily expandable computing platform that can grow seamlessly with a customer’s business requirements.
