DCA-to-DCA Transfer of Data in a Greenplum Environment

Problem description:

We have two or more DCAs in our environment, and each DCA hosts different databases. The databases and DCAs should be able to operate independently, without having to share hardware resources, but they should also be able to share data in individual tables quickly, as and when needed.

Inter-Data Center DCA-to-DCA transfer options:

1. Data Domain Replication - connecting multiple DCAs to a shared Data Domain to move data.

2. Bulk Flat File Unload - using writable/readable external tables to unload and load data.

3. Backup Transfer - using the regular backup and restore facilities.

4. Greenplum Replication - future implementation; not yet available in GP 4.2.x.

Solution:

We propose to introduce a commodity server (or a set of commodity servers) in the middle to act as a quick data-transfer platform between the DCAs. Each commodity server will be connected to each DCA through a 10Gb switch that connects to the Brocade switch on the DCAs. The VLAN interface on the DCAs will be used to bond the commodity server to the 10Gb backplane of each DCA. A rough diagram is shown below.

Once this is set up, the commodity server should be available to each of the DCAs as a gpfdist host. We can then use a combination of writable and readable external tables to transfer data between tables in databases on different DCAs through gpfdist, Greenplum's fast data load/unload utility.

Use Case:

We have a table T1 in database prod1 on DCA1 and need to transfer some or all of its data to database prod2 on DCA2. The steps would be as follows.

1. Locate a directory on the commodity server which is hosted on a drive with sufficient capacity to hold the data. Let us say /data/transfer is our directory.
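
As a quick sanity check, the free space on that mount can be verified before the transfer; a minimal sketch, assuming /data/transfer is the directory chosen above:

df -h /data/transfer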

2. Start two gpfdist instances on the commodity server like below.

gpfdist -d /data/transfer -l /data/transfer/gpfdist_writer.log -p 8050 &

gpfdist -d /data/transfer -l /data/transfer/gpfdist_reader.log -p 8051 &
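
Since gpfdist serves files over HTTP, reachability from each DCA master can be sanity-checked with a plain HTTP client before defining the external tables. A rough sketch, assuming wget is available on the master host and that a small test file (test.txt, a hypothetical name) has been placed in /data/transfer:

wget http://etl1:8050/test.txt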

3. Create a writable external table on database prod1 on DCA1 like below.

create writable external table ext_t1 (like t1)

location ('gpfdist://etl1:8050/t1.dat')

format 'text';

4. Create a readable external table on database prod2 on DCA2 like below:

create external table ext_t1 (like t1) -- or supply the actual definition

location ('gpfdist://etl1:8051/t1.dat')

format 'text';

5. Dump the data from DCA1. From prod1 on DCA1, run:

insert into ext_t1 select * from t1; -- optional where clause

6. Read the data in DCA2. From prod2 on DCA2, run:

insert into [SOME TABLE] select * from ext_t1; -- optional where clause

7. Drop the data file on the commodity server.
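
For example, assuming the file name used in the external table definitions above:

rm /data/transfer/t1.dat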

Note: It is possible to wrap all of the above steps into a single program (Java, Python, etc.), and even to expose them as stored procedures on either database.
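
A minimal shell sketch of such a wrapper is shown below. The master host names (dca1-mdw, dca2-mdw), the target table name t1_copy, the reliance on .pgpass for authentication, and SSH access to the commodity server are all assumptions for illustration, not part of the setup described above.

#!/bin/bash
# Hypothetical wrapper for steps 5-7; assumes the external tables from steps 3-4
# already exist and that the two gpfdist instances on etl1 are running.
set -e
TABLE=t1
# Step 5: unload from prod1 on DCA1 (dca1-mdw is an assumed master host name)
psql -h dca1-mdw -d prod1 -c "insert into ext_${TABLE} select * from ${TABLE};"
# Step 6: load into prod2 on DCA2 (t1_copy is an assumed target table name)
psql -h dca2-mdw -d prod2 -c "insert into ${TABLE}_copy select * from ext_${TABLE};"
# Step 7: remove the intermediate data file on the commodity server
ssh etl1 "rm /data/transfer/${TABLE}.dat"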

Reflections:

It is possible to use a Data Domain server in the middle as opposed to a commodity Linux server. The following considerations apply:

1. The DD server is likely to have more available space to transfer entire databases.

2. The DD server only works with backups/restores, so it would offer much less programmatic flexibility than external tables, would need DBA involvement (backups/restores are typically a DBA privilege), and would take extensive locks on the tables during backup, impeding other operations against the table being copied.

3. Cost is an open consideration (a DD server versus commodity servers).

4. Multiple commodity servers (each with one or more 10Gb NICs) can be used; it is not clear whether multiple Data Domains would work the same way.

5. The commodity servers, thus connected in a shared manner, will form a common platform for ETL and other processes that can serve multiple DCAs simultaneously. For example, one server with one Informatica license would be able to load data to multiple DCAs.

Pros:

Multiple commodity servers can be used, and, as noted above, they form a common platform for ETL and other processes that can serve multiple DCAs simultaneously (e.g. one server with one Informatica license could load data into multiple DCAs). Each DCA controls the traffic within its own environment; the sharing is only for loading/unloading.

Cons:

Assumes the DCAs are in the same data center and near each other.