Transforming XML Data in Greenplum

One of the advanced feature of greenplum database is that In addition to loading and extracting data in TEXT and CSV formats, the Greenplum Database data loader gpfdist now provides transformation features to load XML data into a table and to write data from the Greenplum Database to XML files. The following diagram shows gpfdist performing an XML transform.

External Tables using XML Transformations

The best practice to load or extract XML data is the following five-part process.

The first three steps compromise most of the development effort. The last two steps are straightforward and repeatable, suitable for production.

Determine the Transformation Schema

Before creating the transform to read the XML data, there is some preparation work. First, determine the goal of the project. Some possibilities are indexing data, analyzing data, combining data, etc.

Then, examine the XML file and note the file structure and element names. Choose the elements to import and decide if any other limits are appropriate.

For example, the following XML file prices.xml is a simple, short file that contains price records. Each price record contains two fields, an item number and a price.

<?xml version="1.0" encoding="ISO-8859-1" ?>

<prices>

<pricerecord>

<itemnumber>708421</itemnumber>

<price>19.99</price>

</pricerecord>

<pricerecord>

<itemnumber>708466</itemnumber>

<price>59.25</price>

</pricerecord>

<pricerecord>

<itemnumber>711121</itemnumber>

<price>24.99</price>

</pricerecord>

</prices>

The goal is import all the data into a Greenplum Database table with an integer itemnumber column and a decimal price column.

Write a Transform

The transform specifies what to extract from the data.You can use any authoring environment and language appropriate for a project. For XML transformations Greenplum suggests choosing an appropriate technology such as XSLT, Joost (STX), Java, Python, or Perl based on the goals and scope of the project.

In the price example, the next step is to transform the XML data into a simple two-column delimited format.

708421|19.99

708466|59.25

711121|24.99

This following STX transform completes the data transformation.

<?xml version="1.0"?>

<stx:transform version="1.0"

xmlns:stx="http://stx.sourceforge.net/2002/ns"

pass-through="none">

<!-- declare variables -->

<stx:variable name="itemnumber"/>

<stx:variable name="price"/>

<!-- match and output prices as columns delimited by | -->

<stx:template match="/prices/pricerecord">

<stx:process-children/>

<stx:value-of select="$itemnumber"/> <stx:text>|</stx:text>

<stx:value-of select="$price"/> <stx:text>

</stx:text>

</stx:template>

<stx:template match="itemnumber">

<stx:assign name="itemnumber" select="."/>

</stx:template>

<stx:template match="price">

<stx:assign name="price" select="."/>

</stx:template>

</stx:transform>

This STX transform declares two temporary variables, itemnumber and price, and the following three rules.

1.When an element satisfying the XPath expression /prices/pricerecord is found, examine the child elements and generate output containing the value of the itemnumber variable, a | character, the value of the price variable, and finally a newline.

2.When an <itemnumber> element is found, store the content of that element in the variable itemnumber.

3.When a <price> element is found, store the content of that element in the variable price.

Write the gpfdist Configuration

The gpfdist configuration is specified as a YAML 1.1 document. It specifies rules that gpfdist will use to select a Transform to apply when loading or extracting data.

Aside from the ordinary YAML rules, such as starting the document with three dashes (---), a gpfdist configuration must adhere to the following restrictions:

1.a VERSION setting must be present with the value 1.0.0.1.

2.a TRANSFORMATIONS setting must be present and contain one or more mappings.

3.Each mapping in the TRANSFORMATION must contain both

4.Each mapping in the TRANSFORMATION can contain optional CONTENT, SAFE, and STDERR settings.

The following gpfdist configuration is applicable to the prices example. Note that the initial indentation on each line is significant and reflects the hierarchical nature of the specification.

---

VERSION: 1.0.0.1

TRANSFORMATIONS:

prices_input:

TYPE: input

COMMAND: /bin/bash input_transform.sh %filename%

The COMMAND setting uses a wrapper script called input_transform.sh with a %filename% placeholder. When gpfdist runs the prices_input transform, it invokes input_transform.sh with /bin/bash and replaces the %filename% placeholder with the path to the input file to transform.The input_transform.sh wrapper script contains the logic to invoke the STX transformation and return the output.

#!/bin/bash

# input_transform.sh - sample input transformation,

# demonstrating use of Java and Joost STX to convert XML into

# text to load into Greenplum Database.

# java arguments:

# -jar joost.jar joost STX engine

# -nodecl don't generate a <?xml?> declaration

# $1 filename to process

# input_transform.stx the STX transformation

#

# the AWK step eliminates a blank line joost emits at the end

java \

-jar joost.jar \

-nodecl \

$1 \

input_transform.stx \

| awk 'NF>0

The input_transform.sh file makes use of the Joost STX engine along with the AWK interpreter. The following diagram shows the processing flow when gpfdist runs this transformation.

To summarize, this example gpfdist configuration contains the following three items:

Load the Data

Create the tables with SQL statements based on the appropriate schema.

There are no special requirements for the Greenplum Database tables used to hold loaded data. In the prices example, the following command creates an appropriate table.

CREATE TABLE prices (

itemnumber integer,

price decimal

)

DISTRIBUTED BY (itemnumber);

Transfer and Store the Data

There are two approaches to transforming the data with gpfdist.

1.GPLOAD : GPLOAD is easier to implement in many cases, but it only supports input transformations.

2. INSERT INTO SELECT FROM : INSERT INTO SELECT FROM supports both input and output transformations; however, it exposes more details.

Transforming with GPLOAD

When transforming data with GPLOAD, the two special settings TRANSFORM and TRANSFORM_CONFIG. must appear in the INPUT section of the GPLOAD control file. 

•TRANSFORM_CONFIG specifies the name of the gpfdist configuration file, as shown in the following diagram.

•The TRANSFORM setting indicates the name of the transformation that is described in the file named in TRANSFORM_CONFIG.

The transformation name must appear in two places: in the TRANSFORM setting of the of the gpfdist configuration file and in the TRANSFORMATIONS section of the file named in the TRANSFORM_CONFIG section.

In the GPLOAD control file the optional parameter MAX_LINE_LENGTH is an integer that specifies the maximum length of a line in the XML transformation data that is passed to gpload.

The following diagram shows the relationships between the GPLOAD control file, the gpfdist configuration file, and the XML data file.

Transforming with INSERT INTO SELECT FROM

When using an INSERT INTO SELECT FROM statement directly, the name of the transformation appears in the query fragment of the LOCATION url of the corresponding EXTERNAL TABLE definition as shown in bold in the following command.

CREATE READABLE EXTERNAL TABLE prices_readable (LIKE prices)

LOCATION ('gpfdist://hostname:8080/prices.xml#transform=prices_input')

FORMAT 'TEXT' (DELIMITER '|')

LOG ERRORS INTO prices_errortable SEGMENT REJECT LIMIT 10;

With this external table, running the following query loads data into the prices table.

INSERT INTO prices SELECT * FROM prices_readable;

Configuration File Format

The gpfdist configuration file uses the YAML 1.1 document format and implements a schema for defining the parameters of the transformation. The configuration file must be a valid YAML document.

The gpfdist program processes the document in order and uses indentation (spaces) to determine the document hierarchy and relationships of the sections to one another. The use of white space is significant. White space should not be used simply for formatting purposes and tabs should not be used at all.

The basic structure of a configuration file is shown in the below.---

VERSION: 1.0.0.1

TRANSFORMATIONS:

transformation_name1:

TYPE: input | output

COMMAND: command

CONTENT: data | paths

SAFE: posix-regex

STDERR: server | console

transformation_name2:

TYPE: input | output

COMMAND: command

...

VERSION

Required. The version of the gpfdist configuration file schema. The current version is 1.0.0.1.

TRANSFORMATIONS

Required. Begins the transformation specification section. A configuration file must have at least one transformation. When gpfdist is requested to perform a transformation, it looks in this section for an entry with the matching transformation name.

TYPE

Required. Indicates the direction of transformation. The legal values are input or output.

•A value of input instructs gpfdist to treat the standard output of the transformation process as a stream of records to be loaded into Greenplum Database.

•A value of output instructs gpfdist to treat the standard input of the transformation process as a stream of records from Greenplum Database to be transformed and written to the appropriate output.

COMMAND

Required. Indicates the command gpfdist will execute to perform the transformation.

For input transformations, gpfdist will invoke this command as indicated by the CONTENT setting. The command is expected to open the underlying file(s) as appropriate and produce one line of TEXT for each row to be loaded into Greenplum Database. The input transform determines whether the entire content should be converted to one row or to multiple rows.

For output transformations, gpfdist will invoke this command as indicated by the CONTENT setting. The output command is expected to open and write to the underlying file(s) as appropriate. The output transformation determines the final placement of the converted output.

CONTENT

Optional. The legal values are data and paths. The default value is data.

•When CONTENT specifies data, the text %filename% in the COMMAND section of this file is replaced by the path to the file to read or write.

•When CONTENT specifies paths, the text %filename% in the COMMAND section of this file is replaced by the path to the temporary file containing the list of files to read or write.

The following line shows an example of a COMMAND section showing the text %filename% that is replaced.

COMMAND: /bin/bash input_transform.sh %filename%

SAFE

Optional. A POSIX regular expression which the paths must match in order to be passed to the transformation. Specifying SAFE is useful when there is a concern about injection or improper interpretation of paths passed to the command. The default is no restriction on paths.

STDERR

Optional.The legal values are server and console.

This setting indicates how standard error output from the transformation is handled. The default is server, indicating that gpfdist will capture the standard error output from the transformation in a temporary file and send the first 8k of that file to Greenplum Database as an error message which will subsequently appear as a SQL error. If the value is set to console, gpfdist takes no special steps to redirect or transmit the standard error output from the transformation.