Greenplum and Informatica integration

Informatica PowerExchange for Greenplum

PowerExchange for Greenplum provides connectivity between PowerCenter and the Greenplum Database to load data. When you run a Greenplum session, the PowerCenter Integration Service writes data to a named pipe and invokes the Greenplum gpload bulk loading utility. The gpload utility reads data from the named pipe and loads data into the Greenplum target.

================================================

See also

PWX Connector for Informatica and Greenplum database

Power Exchange for Greenplum

EMC Greenplum Data Integration Accelerator

================================================

Installing and Configuring PowerExchange for Greenplum

¨ PowerExchange for Greenplum Installation and Configuration Overview

¨ Upgrading PowerExchange for Greenplum

¨ Registering the Plug-in

You can install PowerExchange for Greenplum on Windows, Red Hat Enterprise Linux, Solaris, or on AIX machines. When you install PowerExchange for Greenplum, you complete the following tasks that allow PowerCenter to access the Greenplum database:

Install the server component. Enables the PowerCenter Repository Service to access the Greenplum metadata in the repository and the PowerCenter Integration Service to run Greenplum sessions.

Before you install PowerExchange for Greenplum, complete the following tasks:

1. Install or upgrade PowerCenter.

2. Install the Greenplum connectivity package to set up the database client libraries on the machine where the PowerCenter client runs. The connectivity package installs the Greenplum ODBC driver to connect to the Greenplum Database.

3. Install the Greenplum loaders package on the machine where the PowerCenter Integration Service runs. The loaders package contains the gpload utility.

For more information about setting up the gpload utility, see Greenplum documentation.

4. Verify that you can connect to the Greenplum Database using the gpload utility.

5. Verify that the following environment variables are set after you install the loaders package:

==========================================================================================

Environment Variable 

Description

==========================================================================================

PATH 

- Set to the location where the gpfdist utility is installed.

- Set to the location where Python is installed.

GPHOME_LOADERS 

Set to the installation directory of the Greenplum loaders.

For example: C:\Program Files\Greenplum\greenplumloaders-<

version>-build-<number>\

PYTHONPATH 

Set to the location where the pyyaml and pg8000 Python

modules are installed.

LIBRARY_PATH 

Set the shared environment variable to the lib directory of

the Greenplum loaders.

Use one of the following variables based on the operating

system:

- Solaris and Linux. Use the LD_LIBRARY_PATH variable.

- AIX. Use the LIBPATH variable.

Note: You can use the following scripts present in the

Greenplum loader installation directory to configure the

Greenplum load path:

- greenplum_loaders_path.bat for Windows

- greenplum_loaders_path.sh for UNIX

==========================================================================================

6. Restart both the PowerCenter Repository Service and the PowerCenter Integration Service after you update

the environment variables.

7. Verify that you have read, write, and execute permissions on the following directories:

<PowerCenter Installation Directory>/server/bin

<PowerCenter Installation Directory>/server/bin/Plugin

Installing the Server Component

The PowerExchange for Greenplum Server component installs the PowerCenter Integration Service and PowerCenter Repository Service components.

If the PowerCenter Integration Service or PowerCenter Repository Service is configured to run on primary and backup nodes, install the PowerExchange for Greenplum Server component on each node configured to run the PowerCenter Integration Service or PowerCenter Repository Service.

If the PowerCenter Integration Service is configured to run on a grid, install the PowerCenter for Greenplum Server component on each node configured to run on the grid. If you cannot install the PowerCenter for Greenplum Server component on each node on the grid, create a resource in the domain and assign it to each

node where you installed the PowerCenter for Greenplum Server component. When you create a session, configure the session to use the resource.

Installing the Server Component on Windows

1. Run install.bat from the installation package.

2. Click Next.

3. Select the Informatica installation directory.

By default, the server components are installed in the following location: C:\Informatica\<version folder>

4. Click Next.

5. Click Install to begin the installation.

6. Click Done when the installation is complete.

The PowerCenter Integration Service and PowerCenter Repository Service components are installed.

Installing the Server Component on UNIX

Install the PowerExchange for Greenplum Server component on UNIX when the PowerCenter Integration Service

or the PowerCenter Repository Service runs on UNIX.

1. Enter sh install.sh at the prompt.

2. Enter the installation directory.

By default, the server components are installed in the following location:

<User Home Directory>/Informatica/<version folder>

The PowerCenter Integration Service and PowerCenter Repository Service components are installed.

Note: Follow the same steps to install PowerExchange for Greenplum Server component on AIX, Linux, and Solaris operating systems.

Upgrading PowerExchange for Greenplum

When you upgrade PowerExchange for Greenplum, complete the following tasks:

1. Upgrade PowerCenter.

2. Register the plug-in with the repository.

Registering the Plug-in

After you install PowerExchange for Greenplum, register the PowerExchange for Greenplum plug-in with the PowerCenter repository. A plug-in is an XML file that defines the functionality of PowerExchange for Greenplum.To register the plug-in, the repository must be running in exclusive mode. Use the Administrator tool or the pmrep RegisterPlugin command to register the plug-in.

The plug-in file for PowerExchange for Greenplum is GreenplumConnector.xml. When you install PowerExchange for Greenplum, the installer copies GreenplumConnector.xml to the following directory:

<PowerCenter Installation Directory>\server\bin\Plugin

Note: If you do not have the Manage Services privilege to register the plug-in, contact the user who manages the PowerCenter Repository Service.

Greenplum Targets 

Greenplum target definitions represent metadata for Greenplum tables.

You can import Greenplum tables from the Greenplum database into the Designer before loading the table. You cannot use the Designer to create new tables in the Greenplum database. When you import Greenplum tables as target definitions, the ODBC data types corresponding to Greenplum data types appear in the Designer.

Importing Greenplum Target Definitions

You must import the Greenplum target definition into the Target Designer. Use an ODBC data source to import a Greenplum target definition.

Use the Greenplum drivers to create DSN connection to the Greenplum database.

1. In the Target Designer, click Targets > Import from Database.

2. Select the Greenplum data source used to connect to the target database.

If you need to create or change a Greenplum data source, click the Browse button to open the ODBC Administrator. Create the Greenplum data source and click OK. Select the Greenplum data source.

3. Enter the user name and password to connect to the database, and click Connect.

If you are not the owner of the table that you want to use as a target, specify the owner name.

4. Drill down through the list of database objects to view the available tables as targets.

5. Select the Greenplum table or tables to import into the repository.

You can press and hold the Shift key to select a block of tables, or press and hold the Ctrl key to make nonconsecutive selections. You can also use the Select All and Select None buttons to select or clear all available targets

6. Click OK.

The selected target definitions appear in the Navigator under the Targets node.

Greenplum Sessions and Workflows

1. Greenplum Sessions and Workflows

2. Configuring a PowerExchange for Greenplum Connection

3. Session Configuration with a Greenplum Target

4. Parameterization

5. Error Handling for Greenplum Targets

Greenplum Sessions and Workflows

In the Workflow Manager, create a session and associate it with a Greenplum mapping. Configure a Greenplum relational connection before you run the workflow. You can also define properties in a session to determine how the gpload utility writes data to a Greenplum target. When you run a session, the PowerCenter Integration Service invokes the gpload utility to load data into the Greenplum database.

Configuring a PowerExchange for Greenplum Connection

Before you run a Greenplum workflow, configure a Greenplum relational connection in the Workflow Manager. When you configure a Greenplum relational connection, you define the connection attributes that the gpload utility uses to connect to the Greenplum database.

1. In the Workflow Manager, connect to a repository. 

2. Click Connections > Relational.

The Relational Connection Browser dialog box appears.

3. Click New.

The Select Subtype dialog box appears.

4. Select Greenplum Connection from the Select Subtype list.

5. Click OK.

The Connection Object Definition dialog box appears.

6. Enter the following connection information:

Connection Attribute 

Description

Name 

Name for the Greenplum relational connection.

User Name 

User name with permissions to access the Greenplum database.

Password 

Password to connect to the Greenplum database.

Host Name 

Host name or IP address of the Greenplum server.

Port 

Greenplum server port number. If you enter 0, the gpload utility reads from the environment variable $PGPORT. Default is 5432.

Database 

Name of the database.

Schema 

Name of the schema that contains the metadata for Greenplum targets. Default is public.

7. Click OK.

The database connection appears in the Connection Browser list.

Session Configuration with a Greenplum Target

You can configure the session properties for Greenplum targets in the Transformations view on the Mapping tab.

Define the properties for each target instance in the session.

The following table describes the session properties that you can configure for Greenplum targets:

==============================================================================================================

Target 

Property Description

==============================================================================================================

Method 

Determines how the gpload utility processes the data from the

named pipe:

- Insert. Inserts rows into the target.

- Update. Updates rows in the target.

- Merge. If the rows exist in the target, updates the existing

rows. If the rows do not exist in the target, inserts the rows

into the target.

Match Columns 

Matches rows based on the comma-separated list of column

names. Enclose the column names in double quotes.

Update Columns 

Updates the columns specified in the comma-separated list of

column names. Enclose the column names in double quotes.

Update Condition 

Updates a row based on the specified condition. The gpload utility performs an update or merge operation based on the

update condition specified 

Format 

The PowerCenter Integration Service writes data in a format that is compatible with the gpload utility. Select one of the

following values:

- Text. In the text format, the PowerCenter Integration Service separates data using the delimiter character specified in the session properties. If the data contains the delimiter or escape characters specified in the session properties, you can clear the Skip Escaping check box or specify delimiter and escape character values that are not a part of the data.

- CSV. In the CSV format, the PowerCenter Integration Service encloses the data with the quote character specified in the session properties. The PowerCenter

Integration Service also separates the data using the delimiter character specified in session properties. If the data contains the quote or escape characters specified in the session properties, you can clear the Skip Escaping check box or specify quote and escape character values that are not a part of the data.

Default is Text.

Note: If the data contains newline characters, you must use the CSV format. If you use the Text format and the data contains newline characters, the data after the newline character is treated as a new record. In such situations, the gpload utility might reject or insert incorrect data into the tables.

Delimiter 

Delimiter separates successive input fields. For data in the text format, use any 7-bit ASCII value except a-z, A-Z, and 0-9. For data in the CSV format, use any 7-bit ASCII value except \n, \r, and \\. Default is pipe (|). You can also specify a non-printable ASCII character through an escape sequence using the decimal representation of the ASCII character. For example, \014 represents the shift out character.

Escape 

Character that treats special characters in the data as regular characters. In the text format, special characters comprise delimiter and escape characters. In the CSV format, special characters comprise quotes and escape characters. Use any 7-bit ASCII value as an escape character. Default is backslash (\).

Note: You can improve the session performance if the data does not contain escape characters.

Skip Escaping 

Skips escaping special characters in the data. Clear the Skip Escaping check box to treat special characters in the data as  regular characters.

Null As 

String that represents a null value. In the source data, any data item that matches the string is treated as a null value. Default is backslash N (\N).

Quote 

Character that encloses the data in the CSV format. The PowerCenter Integration Service encloses data by the specified character and passes the data to the gpload utility. The quote character is ignored for data in the text format. Use any 7-bit ASCII value that is not equal to the delimiter or null value. Default is double quotes (“).

Error Table 

Name of the error table where the gpload utility logs rejected rows when reading data that is processed by the PowerCenter Integration Service. The naming convention for the table name is <schema name>.<table name>, where schema name is the name of the schema that contains the table.

Error Limit 

For each Greenplum segment, the number of rows that the gpload utility discards or logs in the error table because of format errors. The gpload utility fails the session if the error limit is reached for any Greenplum segment. Default is zero. The maximum error limit is 2,147,483,647.

Greenplum Pre SQL T

he SQL command to run before loading data to the target.

Greenplum Post SQL 

The SQL command to run after loading data to the target.

Truncate Target Option 

Truncates the Greenplum target table before loading data to the target.

Greenplum Target Table 

Overrides the default target table name.

Greenplum Loader Logging 

Sets the logging level for the gpload utility. You can select one of the following values:

- None

- Verbose

- Very Verbose

Default is None.

Greenplum gpfdist Timeout 

The number of seconds that elapse before the gpfdist process times out when attempting to connect to the target. The default value is 30 seconds.

Window Pipe Buffer Size 

The size (in bytes) that the PowerCenter Integration Service allocates to buffer data before writing to the Greenplum bulk

loader. Enter a value between 1 and 131072. The default value is 2048. You might need to test different settings for optimal performance. This attribute is applicable for Informatica servers running on Windows.

Guidelines for Configuring Greenplum Session Properties

Use the following guidelines when you configure session properties for loading data into a Greenplum target:

1.  Use the recommended values for the delimiter, escape, and quote characters. If you specify values that are not valid, the gpload utility fails the session.

2.  Use the default values for the delimiter, escape, and quote characters to improve the session performance. When you use the default values, the PowerCenter Integration Service validates only the character and unichar datatypes for special characters, and ignores the other datatypes.

3. Verify that the column names specified in the Match Columns and Update Columns session properties do not have leading or trailing spaces. When you import target definitions, the Designer strips leading and trailing spaces from the column names. Therefore, if the Match Columns and Update Columns session properties have leading or trailing spaces, the gpload utility logs an error and the session fails. The gpload utility cannot load data into the Greenplum target because the column names in the target definition do not match the column names specified in the session properties.

Parameterization

You can parameterize Greenplum session properties to override the properties during run-time. You can create a parameter file with session properties that overrides the properties at runtime. You can parameterize the following session properties:

Error Handling for Greenplum Targets

You can set the error threshold for a session by configuring the Stop on Errors session property. In the Stop on Errors session property, enter the number of nonfatal errors that the PowerCenter Integration Service can encounter before it fails the session.

The error threshold does not include database errors that the gpload utility might encounter when loading data into the Greenplum database tables. The default value is 0. By default, the PowerCenter Integration Service does not stop a session when it encounters nonfatal errors.

You can set the error limit for a Greenplum segment to specify the number of rows that the gpload utility can discard before it fails a session. If you specify an error table, the gpload utility logs the discarded rows in the error table. The error limit includes rows with format errors. The default value is 0. By default, the gpload utility stops a session when it encounters a row with format errors.

Use the following naming convention for the error table name: <schema name>.<table name>

If you do not specify a schema name, the gpload utility creates the error table in the public schema. The error table format is predefined in the Greenplum database. If the table does not exist, the gpload utility creates the table based on the predefined format. If the specified table exists in the schema, but the table is not in the

prescribed format, the session fails. If a session fails, see the error table for more information about the errors. If you run the session again, the gpload utility appends the discarded rows to the error table.

For more information about the error tables, see the Greenplum documentation.

You can view load statistics in the session log. The gpload utility writes the error messages to the gpload log. The PowerCenter Integration Service reads the gpload log and writes the errors to the session log. The gpload utility writes the error messages to the gpload log at the following location:

$INFA_HOME/server/infa_shared/temp