DCA-FAQ
What are the key features of the DCA?
The base architecture of the DCA is designed with scalability and growth in mind. This enables organizations to extend their DW/BI capability in a modular way: expanding the appliance yields linear gains in capacity and performance. The DCA employs a high-speed Interconnect Bus that provides database-level communication between all servers in the DCA. It is designed to accommodate access for rapid backup and recovery and for data load rates (also known as ingest). Excellent performance is provided by effective use of the combined power of servers, software, network, and storage. The DCA can be installed and available within 24 hours (or less) of the customer receiving delivery and is ready to use for faster return on investment (ROI). The DCA uses cutting-edge industry-standard hardware optimized for data analytics.
What are the main components of the DCA?
Greenplum Database: An MPP database server based on PostgreSQL open-source technology. It is explicitly designed to support BI applications and large, multi-terabyte data warehouses.
Greenplum Database system: An associated set of Segment Instances and a Master Instance running on an array, which can be composed of one or more hosts.
GPDB Master Servers: The servers responsible for the automatic parallelization of queries.
GPDB Segment Servers: The servers that perform the real work of processing and analyzing the data.
What is the EMC Data Computing Appliance (DCA)?
The DCA is a purpose-built, highly scalable, parallel DW appliance that architecturally integrates database, compute, storage, and network into an enterprise-class, easy-to-implement system. The DCA brings in the power of MPP architecture and delivers the fastest data loading capacity and the best price/performance ratio in the industry without the complexity and constraints of proprietary hardware. The DCA can also be set up in a UAP configuration that is capable of managing, storing, and analyzing large volumes of structured and unstructured data. Greenplum UAP includes Greenplum Database, Greenplum HD, and Greenplum Chorus. The DCA is offered in multiple-rack appliance configurations to achieve the maximum flexibility and scalability for organizations faced with terabyte- to petabyte-scale data opportunities.
Key DCA performance
A high-performance data warehouse solution relies strongly on the database technology and the power of the hardware platform. Three key performance indicators influence the success of a data warehouse solution:
• Scan rate: How quickly the database can read and process data under varying conditions and workloads
• Data load rate: How quickly data can be loaded into the database
• Scalability: How well the system can scale and predictably handle the ever-growing data load and workload requirements

This section provides information on both the scan and the data load rates achieved during performance testing. It also provides the scalability testing results from expanding from two GPDB Standard Modules to four GPDB Standard Modules, and from four GPDB Standard Modules to eight GPDB Standard Modules. All tests were performed without performance tuning; therefore, any customer can expect to achieve these results “out of the box.”

Notes
• Benchmark results are highly dependent upon workload, specific application requirements, and system design and implementation. Relative system performance will vary as a result of these and other factors. Therefore, do not use this workload as a substitute for specific customer application benchmarking when contemplating critical capacity planning and/or product evaluation decisions.
• All performance data contained in this report was obtained in a rigorously controlled environment. Results obtained in other operating environments may vary significantly.
• EMC Corporation does not warrant or represent that a user can or will achieve similar performance expressed in transactions per minute.

Data analytics customers have unique business and operational requirements, which are reflected in the aspects of performance that are considered during the selection of a data solution. Examples of such performance aspects follow.
Data load rates
The speed at which data can be loaded into a database is important to customers who have a batch load process with a shrinking load window and a growing volume of data, and who are looking to move toward real-time analysis. For the DCA, tests were performed on one, two, and four GPDB modules to demonstrate the data load rate capability.

Test objectives: Data load rate

Data load rate test objectives
=======================================================================================================================
Objective                               Description
=======================================================================================================================
Data load rate                          Determine the rate at which data can be loaded by:
Linear scalability of data load rate    Demonstrate that the data load rate improves in a linear manner as you add modules.
=======================================================================================================================

Test scenario: Data load rate
The test scenario was designed to load several large, flat ASCII files concurrently into the database to simulate a typical ETL operation. The test utilized Greenplum's MPP Scatter/Gather Streaming technology. The source dataset used for the data load consisted of multiple separate ASCII data files spread across the ETL server environment. Sufficient bandwidth was provided between the DCA and the ETL environment to ensure that the connection was not a bottleneck to performance.

Test method: Data load rate
To measure the data load rate, the validation team:
1. Created an external table definition for the ASCII dataset files that were located on the ETL environment, and connected the ETL environment to the Interconnect Bus using two 10 GbE LAGs.
2. Initiated the following SQL command on the Master Server and then executed it on the Segment Servers:
   INSERT INTO <target-table> SELECT * FROM <external-table>
3. Measured the amount of time required to load the data.
4. Calculated the data load rate (TB/hour) by dividing the total amount of raw data loaded by the data load duration.

Test results: Data load rate
The tables in this section list the data load rates recorded during testing for GPDB Standard Modules and GPDB High Capacity Modules. The test results demonstrate that the DCA data load rates scale in a linear manner. For example, expanding from two modules to four modules leads to an effective doubling of the rate at which data can be loaded.
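The calculation in step 4 of the test method can be sketched in a few lines of Python. This is only an illustration, not the validation team's tooling: the function name is hypothetical, and the sketch assumes decimal terabytes (10^12 bytes), which the source does not specify.

```python
def data_load_rate_tb_per_hour(raw_bytes: int, duration_seconds: float) -> float:
    """Return the load rate in TB/hour for raw_bytes loaded in duration_seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    terabytes = raw_bytes / 1e12        # assumption: 1 TB = 10**12 bytes
    hours = duration_seconds / 3600.0   # convert the measured duration to hours
    return terabytes / hours


# Hypothetical run: 6.7 TB of raw data loaded in exactly one hour.
example_rate = data_load_rate_tb_per_hour(6_700_000_000_000, 3600.0)
print(f"{example_rate:.1f} TB/hour")
```

The rate is based on the raw (uncompressed source) data volume, as the method describes, so the on-disk size of the compressed target table does not enter the calculation.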
GPDB Standard Module data load rates (TB/hour)
========================================================
DCA option                      Data load rate
========================================================
1 GPDB Standard Module          3.4 TB/hour
2 GPDB Standard Modules         6.7 TB/hour
4 GPDB Standard Modules         13.4 TB/hour
========================================================

GPDB High Capacity Module data load rates (TB/hour)
========================================================
DCA option                      Data load rate
========================================================
1 GPDB High Capacity Module     3.4 TB/hour
2 GPDB High Capacity Modules    6.7 TB/hour
4 GPDB High Capacity Modules    13.4 TB/hour
========================================================

Note: Testing involved loading data into compressed tables, which is not a disk-intensive operation. When loading data into uncompressed tables, which is a disk-intensive operation, data loading rates for High Capacity Modules might be slower than rates for Standard Modules because High Capacity Modules use larger but slower (lower rpm) disks than Standard Modules use.

Query performance: This is a common concern for many customers. Query performance relies on four factors:
• Hardware (scan rate)
• Schema structure (table and index)
• Query complexity
• The applications that run on the data warehouse solution

Scan rate: The performance of data warehouse systems is typically compared in terms of the scan rate, which measures how much throughput the database system can deliver. Scan rate is an indication of how well the system is able to cope with processing vast volumes of data and the daily end-user workload. Tests were performed on DCAs with Standard Modules and High Capacity Modules to demonstrate scan rate performance.

Scan rate test objectives
=======================================================================================================================
Objective                          Description
=======================================================================================================================
Scan rate                          Determine the scan rate for both Standard and High Capacity Module configurations.
                                   Scan rate is a measure of how quickly the disks can move data (bytes).
Linear scalability of scan rate    Demonstrate that scan rates improve in a linear manner.
=======================================================================================================================

Test scenarios: Scan rate
The scan rate was measured on GPDB Standard Module configurations and on GPDB High Capacity Module configurations.

Test results: Scan rate
The following tables list the scan rates, in GB/s, for GPDB Standard Modules and GPDB High Capacity Modules.

DCA GPDB Standard Module scan rates (GB/s)
========================================================
DCA option                      Scan rate
========================================================
1 GPDB Standard Module          5.9 GB/s
2 GPDB Standard Modules         11.8 GB/s
4 GPDB Standard Modules         23.6 GB/s
========================================================

The scan rate test results demonstrate that:
• The DCA supports very high scan rates.
• The DCA scan-rate performance scales in a linear manner. Expanding from two modules to four modules leads to a doubling in performance.

DCA High Capacity Module scan rates (GB/s)
========================================================
DCA option                      Scan rate
========================================================
1 GPDB High Capacity Module     3.5 GB/s
2 GPDB High Capacity Modules    7 GB/s
4 GPDB High Capacity Modules    14 GB/s
========================================================

Operations: This consists of three main areas:
• Backup
• Disaster recovery
• Development and test refresh
The operational area is often overlooked and can become a challenge for other areas of the customer’s business. Therefore, depending on existing operational challenges, most customers select only one performance area: data load, query, or operational. The DCA solution provides customers with distinct advantages for query, data load, and operational performance.
It is also important to note that database query performance is driven by three factors:
• System architecture and RDBMS
• Schema design
• Query complexity
By following Greenplum Database best practices for partitioning, parallelism, table design, and query optimization, the DCA can provide the scan rate required for the processing needs of today's massive data warehouses.

Key results
With Greenplum's true MPP architecture, the behavior of the DCA changes in a predictable and consistent manner, ensuring that customers can depend on the DCA for daily information requirements.

“Out-of-the-box” performance
The results presented here were produced on a standard “out-of-the-box” DCA with no tuning applied, and indicate the overall performance that a customer can expect to achieve. The DCA also provides the ability to tune the environment to specific business needs to ensure an even greater level of performance.

Data load rates
The results of the data load rate testing for the DCA with Standard Modules versus the DCA with High Capacity Modules are presented in Table 1. The table shows the maximum achievable rate for each module.

Table 1. Data load rates (TB/hour)
=======================================================
DCA option                      Data load rate
=======================================================
4 GPDB Standard Modules         13.4 TB/hour
4 GPDB High Capacity Modules    13.4 TB/hour
=======================================================

Note: Testing involved loading data into compressed tables, which is not a disk-intensive operation. When loading data into uncompressed tables, which is a disk-intensive operation, data loading rates for High Capacity Modules might be slower than rates for Standard Modules because High Capacity Modules use larger but slower (lower rpm) disks than Standard Modules use.

DCA performance
Scan rate is a measure expressed as data bandwidth (for example, GB/s). It describes how much data can be read and processed within a certain period of time.
Scan rate indicates how fast the disk I/O subsystem of the appliance can read data from the disk to support the database. Table 2 presents the scan rate results.

Table 2. DCA scan rates (GB/s)
=======================================================
DCA option                      Scan rate
=======================================================
4 GPDB Standard Modules         23.6 GB/s
4 GPDB High Capacity Modules    14 GB/s
=======================================================

DCA scalability
Expanding the data warehouse by upgrading from one module to 24 modules produces predictable performance gains with linear scaling. The four- to eight-module scalability test results presented in this white paper demonstrate that the DCA is a readily expandable computing platform that can grow seamlessly with a customer’s business requirements.
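One way to read the linear-scaling claim against the one-, two-, and four-module figures reported above is to compare each measured rate with the rate that perfect scaling from the smallest configuration would give. The following Python sketch does that; the function name and dictionary layout are illustrative assumptions, and only the numeric rates come from this document.

```python
def scaling_efficiency(rates_by_modules: dict[int, float]) -> dict[int, float]:
    """Map module count -> measured rate / (module count * per-module baseline).

    The baseline is the per-module rate of the smallest configuration,
    so a value of 1.0 means perfectly linear scaling.
    """
    base_modules = min(rates_by_modules)
    per_module = rates_by_modules[base_modules] / base_modules
    return {n: rate / (n * per_module)
            for n, rate in sorted(rates_by_modules.items())}


# Rates as reported in this document (module count -> rate).
LOAD_TB_PER_HOUR = {1: 3.4, 2: 6.7, 4: 13.4}
SCAN_STANDARD_GB_PER_S = {1: 5.9, 2: 11.8, 4: 23.6}
SCAN_HIGH_CAPACITY_GB_PER_S = {1: 3.5, 2: 7.0, 4: 14.0}

for label, rates in [
    ("Data load", LOAD_TB_PER_HOUR),
    ("Scan (Standard)", SCAN_STANDARD_GB_PER_S),
    ("Scan (High Capacity)", SCAN_HIGH_CAPACITY_GB_PER_S),
]:
    for modules, efficiency in scaling_efficiency(rates).items():
        print(f"{label}: {modules} module(s) -> {efficiency:.1%} of ideal")
```

With the reported figures, the scan rates come out at the ideal value for two and four modules, while the data load rates sit slightly below it relative to the single-module result (6.7 versus 2 × 3.4 TB/hour). The reported values are rounded to one decimal place, so small deviations of this size may simply reflect rounding.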