Transforming XML Data in Greenplum

One of the advanced feature of greenplum database is that In addition to loading and extracting data in TEXT and CSV formats, the Greenplum Database data loader gpfdist now provides transformation features to load XML data into a table and to write data from the Greenplum Database to XML files. The following diagram shows gpfdist performing an XML transform.
The gadget spec URL could not be found

External Tables using XML Transformations

The best practice to load or extract XML data is the following five-part process.
  1. Determine the Transformation Schema
  2. Write a Transform
  3. Write the gpfdist Configuration
  4. Load the Data
  5. Transfer and Store the Data
The first three steps compromise most of the development effort. The last two steps are straightforward and repeatable, suitable for production.

Determine the Transformation Schema
Before creating the transform to read the XML data, there is some preparation work. First, determine the goal of the project. Some possibilities are indexing data, analyzing data, combining data, etc.

Then, examine the XML file and note the file structure and element names. Choose the elements to import and decide if any other limits are appropriate.
For example, the following XML file prices.xml is a simple, short file that contains price records. Each price record contains two fields, an item number and a price.

<?xml version="1.0" encoding="ISO-8859-1" ?>

The goal is import all the data into a Greenplum Database table with an integer itemnumber column and a decimal price column.
The gadget spec URL could not be found
Write a Transform
The transform specifies what to extract from the data.You can use any authoring environment and language appropriate for a project. For XML transformations Greenplum suggests choosing an appropriate technology such as XSLT, Joost (STX), Java, Python, or Perl based on the goals and scope of the project.
In the price example, the next step is to transform the XML data into a simple two-column delimited format.

This following STX transform completes the data transformation.

<?xml version="1.0"?>
<stx:transform version="1.0"
<!-- declare variables -->
<stx:variable name="itemnumber"/>
<stx:variable name="price"/>
<!-- match and output prices as columns delimited by | -->
<stx:template match="/prices/pricerecord">
<stx:value-of select="$itemnumber"/> <stx:text>|</stx:text>
<stx:value-of select="$price"/> <stx:text>
<stx:template match="itemnumber">
<stx:assign name="itemnumber" select="."/>
<stx:template match="price">
<stx:assign name="price" select="."/>

This STX transform declares two temporary variables, itemnumber and price, and the following three rules.

1.When an element satisfying the XPath expression /prices/pricerecord is found, examine the child elements and generate output containing the value of the itemnumber variable, a | character, the value of the price variable, and finally a newline.

2.When an <itemnumber> element is found, store the content of that element in the variable itemnumber.

3.When a <price> element is found, store the content of that element in the variable price.

Write the gpfdist Configuration
The gpfdist configuration is specified as a YAML 1.1 document. It specifies rules that gpfdist will use to select a Transform to apply when loading or extracting data.
Aside from the ordinary YAML rules, such as starting the document with three dashes (---), a gpfdist configuration must adhere to the following restrictions:

1.a VERSION setting must be present with the value

2.a TRANSFORMATIONS setting must be present and contain one or more mappings.

3.Each mapping in the TRANSFORMATION must contain both
  • a TYPE with the value 'input' or 'output'
  • a COMMAND indicating how the transform is run.
4.Each mapping in the TRANSFORMATION can contain optional CONTENT, SAFE, and STDERR settings.

The following gpfdist configuration is applicable to the prices example. Note that the initial indentation on each line is significant and reflects the hierarchical nature of the specification.
TYPE: input
COMMAND: /bin/bash %filename%

The COMMAND setting uses a wrapper script called with a %filename% placeholder. When gpfdist runs the prices_input transform, it invokes with /bin/bash and replaces the %filename% placeholder with the path to the input file to transform.The wrapper script contains the logic to invoke the STX transformation and return the output.

# - sample input transformation,
# demonstrating use of Java and Joost STX to convert XML into
# text to load into Greenplum Database.
# java arguments:
# -jar joost.jar joost STX engine
# -nodecl don't generate a <?xml?> declaration
# $1 filename to process
# input_transform.stx the STX transformation
# the AWK step eliminates a blank line joost emits at the end
java \
-jar joost.jar \
-nodecl \
$1 \
input_transform.stx \
| awk 'NF>0

The file makes use of the Joost STX engine along with the AWK interpreter. The following diagram shows the processing flow when gpfdist runs this transformation.

To summarize, this example gpfdist configuration contains the following three items:

  1. The config.yaml file defining TRANSFORMATIONS
  2. The wrapper script, referenced in the config.yaml file
  3. The input_transform.stx joost transformation, called from
Load the Data
Create the tables with SQL statements based on the appropriate schema.

There are no special requirements for the Greenplum Database tables used to hold loaded data. In the prices example, the following command creates an appropriate table.

itemnumber integer,
price decimal
DISTRIBUTED BY (itemnumber);

Transfer and Store the Data
There are two approaches to transforming the data with gpfdist.

1.GPLOAD : GPLOAD is easier to implement in many cases, but it only supports input transformations.
2. INSERT INTO SELECT FROM : INSERT INTO SELECT FROM supports both input and output transformations; however, it exposes more details.

Transforming with GPLOAD
When transforming data with GPLOAD, the two special settings TRANSFORM and TRANSFORM_CONFIG. must appear in the INPUT section of the GPLOAD control file. 

•TRANSFORM_CONFIG specifies the name of the gpfdist configuration file, as shown in the following diagram.
•The TRANSFORM setting indicates the name of the transformation that is described in the file named in TRANSFORM_CONFIG.

The transformation name must appear in two places: in the TRANSFORM setting of the of the gpfdist configuration file and in the TRANSFORMATIONS section of the file named in the TRANSFORM_CONFIG section.

In the GPLOAD control file the optional parameter MAX_LINE_LENGTH is an integer that specifies the maximum length of a line in the XML transformation data that is passed to gpload.

The following diagram shows the relationships between the GPLOAD control file, the gpfdist configuration file, and the XML data file.

When using an INSERT INTO SELECT FROM statement directly, the name of the transformation appears in the query fragment of the LOCATION url of the corresponding EXTERNAL TABLE definition as shown in bold in the following command.

LOCATION ('gpfdist://hostname:8080/prices.xml#transform=prices_input')

With this external table, running the following query loads data into the prices table.

INSERT INTO prices SELECT * FROM prices_readable;

Configuration File Format
The gpfdist configuration file uses the YAML 1.1 document format and implements a schema for defining the parameters of the transformation. The configuration file must be a valid YAML document.

The gpfdist program processes the document in order and uses indentation (spaces) to determine the document hierarchy and relationships of the sections to one another. The use of white space is significant. White space should not be used simply for formatting purposes and tabs should not be used at all.

The basic structure of a configuration file is shown in the below.---
TYPE: input | output
COMMAND: command
CONTENT: data | paths
SAFE: posix-regex
STDERR: server | console
TYPE: input | output
COMMAND: command
Required. The version of the gpfdist configuration file schema. The current version is
Required. Begins the transformation specification section. A configuration file must have at least one transformation. When gpfdist is requested to perform a transformation, it looks in this section for an entry with the matching transformation name.

Required. Indicates the direction of transformation. The legal values are input or output.
•A value of input instructs gpfdist to treat the standard output of the transformation process as a stream of records to be loaded into Greenplum Database.
•A value of output instructs gpfdist to treat the standard input of the transformation process as a stream of records from Greenplum Database to be transformed and written to the appropriate output.

Required. Indicates the command gpfdist will execute to perform the transformation.

For input transformations, gpfdist will invoke this command as indicated by the CONTENT setting. The command is expected to open the underlying file(s) as appropriate and produce one line of TEXT for each row to be loaded into Greenplum Database. The input transform determines whether the entire content should be converted to one row or to multiple rows.

For output transformations, gpfdist will invoke this command as indicated by the CONTENT setting. The output command is expected to open and write to the underlying file(s) as appropriate. The output transformation determines the final placement of the converted output.

Optional. The legal values are data and paths. The default value is data.
•When CONTENT specifies data, the text %filename% in the COMMAND section of this file is replaced by the path to the file to read or write.
•When CONTENT specifies paths, the text %filename% in the COMMAND section of this file is replaced by the path to the temporary file containing the list of files to read or write.

The following line shows an example of a COMMAND section showing the text %filename% that is replaced.
COMMAND: /bin/bash %filename%

Optional. A POSIX regular expression which the paths must match in order to be passed to the transformation. Specifying SAFE is useful when there is a concern about injection or improper interpretation of paths passed to the command. The default is no restriction on paths.

Optional.The legal values are server and console.

This setting indicates how standard error output from the transformation is handled. The default is server, indicating that gpfdist will capture the standard error output from the transformation in a temporary file and send the first 8k of that file to Greenplum Database as an error message which will subsequently appear as a SQL error. If the value is set to console, gpfdist takes no special steps to redirect or transmit the standard error output from the transformation.