Installing the Greenplum Database

See also

Building an Infrastructure to Support Data Science Projects (Part 1 of 3) – Creating a Virtualized Environment.

Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib

Building an Infrastructure to Support Data Science Projects (Part 3 of 3) – Installing and Configuring R / RStudio with Pivotal Greenplum Integration


Configuring Your Systems and Installing Greenplum

Before we begin the install process, we need to configure our systems.

A) Pre-Install 

1. Make sure your systems meet the System Requirements

2. Setting the Greenplum Recommended OS Parameters

B) Install 

3. (Master only) Running the Greenplum Installer

4. Installing and Configuring Greenplum on all Hosts

5. (Optional) Installing Oracle Compatibility Functions

6. (Optional) Installing Greenplum Database Extensions

7. Creating the Data Storage Areas

8. Synchronizing System Clocks

C) Post-Install

9. Validating Your Systems

10. Initializing a Greenplum Database System

A) Pre-Install

1. Minimum recommended specifications for servers intended to support Greenplum Database in a production environment:

Operating System

SUSE Linux SLES 10.2 or higher

CentOS 5.0 or higher

RedHat Enterprise Linux 5.0 or higher

Oracle Unbreakable Linux 5.5

Solaris x86 v10 update 7

File Systems

- xfs required for data storage on SUSE Linux and Red Hat (ext3 supported for root file system)

- zfs required for data storage on Solaris (ufs supported for root file system)

Minimum CPU

Pentium Pro compatible (P3/Athlon and above)

Minimum Memory

16 GB RAM per server

Disk Requirements

- 150MB per host for the Greenplum installation

- Approximately 300MB per segment instance for metadata

- Appropriate free space for data, with disks at no more than 70% capacity

- High-speed, local storage

Network Requirements

Gigabit Ethernet within the array

Dedicated, non-blocking switch

Software and Utilities

bash shell

GNU tar

GNU zip

GNU readline (Solaris only)

On Solaris platforms, you must have GNU Readline in your environment to support interactive Greenplum administrative utilities such as gpssh. Certified readline packages are available for download from the EMC Download Center.

2. Setting the Greenplum Recommended OS Parameters

Greenplum requires that certain operating system (OS) parameters be set on all hosts in your Greenplum Database system (masters and segments).

1. Linux System Settings

2. Solaris System Settings

3. Mac OS X System Settings

In general, the following categories of system parameters need to be altered:

Shared Memory - A Greenplum Database instance will not work unless the shared memory segment for your kernel is properly sized. Most default OS installations have the shared memory values set too low for Greenplum Database. On Linux systems, you must also disable the OOM (out of memory) killer.

Network - On high-volume Greenplum Database systems, certain network-related tuning parameters must be set to optimize network connections made by the Greenplum interconnect.

User Limits - User limits control the resources available to processes started by a user's shell. Greenplum Database requires a higher limit on the allowed number of file descriptors that a single process can have open. The default settings may cause some Greenplum Database queries to fail because they will run out of file descriptors needed to process the query.

Linux System Settings

Set the following parameters in the /etc/sysctl.conf file and reboot:

xfs_mount_options = rw,noatime,inode64,allocsize=16m

sysctl.kernel.shmmax = 500000000

sysctl.kernel.shmmni = 4096

sysctl.kernel.shmall = 4000000000

sysctl.kernel.sem = 250 512000 100 2048

sysctl.kernel.sysrq = 1

sysctl.kernel.core_uses_pid = 1

sysctl.kernel.msgmnb = 65536

sysctl.kernel.msgmax = 65536

sysctl.kernel.msgmni = 2048

sysctl.net.ipv4.tcp_syncookies = 1

sysctl.net.ipv4.ip_forward = 0

sysctl.net.ipv4.conf.default.accept_source_route = 0

sysctl.net.ipv4.tcp_tw_recycle = 1

sysctl.net.ipv4.tcp_max_syn_backlog = 4096

sysctl.net.ipv4.conf.all.arp_filter = 1

sysctl.net.ipv4.ip_local_port_range = 1025 65535

sysctl.net.core.netdev_max_backlog = 10000

sysctl.vm.overcommit_memory = 2

For RHEL version 6.x platforms, the above parameters do not include the sysctl. prefix, as follows:

xfs_mount_options = rw,noatime,inode64,allocsize=16m

kernel.shmmax = 500000000

kernel.shmmni = 4096

kernel.shmall = 4000000000

kernel.sem = 250 512000 100 2048

kernel.sysrq = 1

kernel.core_uses_pid = 1

kernel.msgmnb = 65536

kernel.msgmax = 65536

kernel.msgmni = 2048

net.ipv4.tcp_syncookies = 1

net.ipv4.ip_forward = 0

net.ipv4.conf.default.accept_source_route = 0

net.ipv4.tcp_tw_recycle = 1

net.ipv4.tcp_max_syn_backlog = 4096

net.ipv4.conf.all.arp_filter = 1

net.ipv4.ip_local_port_range = 1025 65535

net.core.netdev_max_backlog = 10000

vm.overcommit_memory = 2
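If you prefer to apply the new kernel settings without waiting for a reboot, most of them can be loaded from /etc/sysctl.conf with the sysctl utility (a reboot is still the most reliable way to be sure every setting is active). For example:

# sysctl -p

You can also spot-check an individual parameter, for example:

# sysctl kernel.shmmax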

Set the following parameters in the /etc/security/limits.conf file:

* soft nofile 65536

* hard nofile 65536

* soft nproc 131072

* hard nproc 131072
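To confirm that the new limits are in effect, log out and log back in (the limits.conf settings apply to new login sessions), then check the values reported by the shell. For example:

$ ulimit -n

$ ulimit -u

ulimit -n reports the open file descriptor limit (nofile) and ulimit -u the process limit (nproc) for the current session.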

1. XFS is the preferred file system on Linux platforms for data storage. Greenplum recommends the following xfs mount options:

rw,noatime,inode64,allocsize=16m
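For example, a hypothetical /etc/fstab entry for a dedicated xfs data volume using these options might look like the following (the device name and mount point are placeholders; substitute the ones used on your hosts):

/dev/sdb1 /data xfs rw,noatime,inode64,allocsize=16m 0 0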

2. The Linux disk I/O scheduler for disk access supports different policies, such as CFQ, AS, and deadline. 

Greenplum recommends the following scheduler option: deadline

To specify a scheduler, run the following:

# echo schedulername > /sys/block/devname/queue/scheduler

For example:

# echo deadline > /sys/block/sdb/queue/scheduler

3. Each disk device file should have a read-ahead (blockdev) value of 16384.

To verify the read-ahead value of a disk device:

# /sbin/blockdev --getra devname

For example:

# /sbin/blockdev --getra /dev/sdb

To set blockdev (read-ahead) on a device:

# /sbin/blockdev --setra bytes devname

For example:

# /sbin/blockdev --setra 16384 /dev/sdb
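Note that a read-ahead value set with blockdev --setra does not survive a reboot. One common approach (an example, not a Greenplum requirement) is to reapply it at boot time, for instance by adding a line such as the following to /etc/rc.local on each host:

/sbin/blockdev --setra 16384 /dev/sdb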

4. Edit the /etc/hosts file and make sure that it includes the host names and all interface address names for every machine participating in your Greenplum Database system.
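For illustration only, an /etc/hosts fragment for a master and one segment host with two interfaces each might look like this (the IP addresses and names below are placeholders; use the addresses and naming convention of your own environment):

192.0.2.10      mdw mdw-1
198.51.100.10   mdw-2
192.0.2.21      sdw1 sdw1-1
198.51.100.21   sdw1-2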

Solaris System Settings

Set the following parameters in /etc/system:

set rlim_fd_cur=65536

set zfs:zfs_arc_max=0x600000000

set pcplusmp:apic_panic_on_nmi=1

set nopanicdebug=1

Change the following line in the /etc/project file from:

default:3::::

to:

default:3:default project:::project.max-sem-ids=(priv,1024,deny);process.max-file-descriptor=(priv,252144,deny)

Add the following line to /etc/user_attr:

gpadmin::::defaultpriv=basic,dtrace_user,dtrace_proc

Edit the /etc/hosts file and make sure that it includes all host names and interface address names for every machine participating in your Greenplum Database system.

Mac OS X System Settings

•Add the following to /etc/sysctl.conf:

kern.sysv.shmmax=2147483648

kern.sysv.shmmin=1

kern.sysv.shmmni=64

kern.sysv.shmseg=16

kern.sysv.shmall=524288

kern.maxfiles=65535

kern.maxfilesperproc=65535

net.inet.tcp.msl=60

•Add the following line to /etc/hostconfig:

HOSTNAME="your_hostname"

B) Install

Running the Greenplum Installer

To configure your systems for Greenplum Database, you will need certain utilities found in $GPHOME/bin of your installation. Log in as root and run the Greenplum installer on the machine that will be your master host.

To install the Greenplum binaries on the master host

1.Download or copy the installer file to the machine that will be the Greenplum Database master host. Installer files are available from Greenplum for RedHat (32-bit and 64-bit), Solaris 64-bit and SuSe Linux 64-bit platforms.

2.Unzip the installer file where PLATFORM is either RHEL5-i386 (RedHat 32-bit), RHEL5-x86_64 (RedHat 64-bit), SOL-x86_64 (Solaris 64-bit) or SuSE10-x86_64 (SuSe Linux 64 bit). For example:

# unzip greenplum-db-4.2.x.x-PLATFORM.zip

3.Launch the installer using bash. For example:

# /bin/bash greenplum-db-4.2.x.x-PLATFORM.bin

4.The installer will prompt you to accept the Greenplum Database license agreement. Type yes to accept the license agreement.

5.The installer will prompt you to provide an installation path. Press ENTER to accept the default install path (/usr/local/greenplum-db-4.2.x.x), or enter an absolute path to an install location. You must have write permissions to the location you specify.

6.Optional. The installer will prompt you to provide the path to a previous installation of Greenplum Database. For example: /usr/local/greenplum-db-4.2.x.x

This installation step will migrate any Greenplum Database add-on modules (postgis, pgcrypto, etc.) from the previous installation path to the path of the version currently being installed. This step is optional and can be performed manually at any point after the installation using the gppkg utility with the -migrate option.

Press ENTER to skip this step.
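If you skip this step, a later migration can be done with the gppkg utility. As a rough sketch (the source and destination paths below are examples; check gppkg --help on your installation for the exact option syntax):

$ gppkg --migrate /usr/local/greenplum-db-4.2.x.x /usr/local/greenplum-db-4.2.y.y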

7.The installer will install the Greenplum software and create a greenplum-db symbolic link one directory level above your version-specific Greenplum installation directory. The symbolic link is used to facilitate patch maintenance and upgrades between versions. The installed location is referred to as $GPHOME.

8.To perform additional required system configuration tasks and to install Greenplum Database on other hosts, go to the next task Installing and Configuring Greenplum on all Hosts.

About Your Greenplum Database Installation

The Greenplum Database installer installs the following files and directories:

•greenplum_path.sh — This file contains the environment variables for Greenplum Database. 

•GPDB-LICENSE.txt — Greenplum license agreement.

•bin — This directory contains the Greenplum Database management utilities. This directory also contains the PostgreSQL client and server programs, most of which are also used in Greenplum Database.

•demo — This directory contains the Greenplum demonstration programs.

•docs — The Greenplum Database documentation (PDF files).

•etc — Sample configuration file for OpenSSL.

•ext — Bundled programs (such as Python) used by some Greenplum Database utilities.

•include — The C header files for Greenplum Database.

•lib — Greenplum Database and PostgreSQL library files.

•sbin — Supporting/Internal scripts and programs.

•share — Shared files for Greenplum Database.

Installing and Configuring Greenplum on all Hosts

When run as root, gpseginstall copies the Greenplum Database installation from the current host and installs it on a list of specified hosts, creates the Greenplum system user (gpadmin), sets the system user’s password (default is changeme), sets the ownership of the Greenplum Database installation directory, and exchanges ssh keys between all specified host address names (both as root and as the specified system user).

About gpadmin

When a Greenplum Database system is first initialized, the system contains one predefined superuser role (also referred to as the system user), gpadmin. This is the user who owns and administers the Greenplum Database.

Note: If you are setting up a single node system, you can still use gpseginstall to perform the required system configuration tasks on the current host. In this case, the hostfile_exkeys would just have the current host name only.

To install and configure Greenplum Database on all specified hosts

1.Log in to the master host as root:

$ su -

2.Source the path file from your master host’s Greenplum Database installation directory:

# source /usr/local/greenplum-db/greenplum_path.sh

3.Create a file called hostfile_exkeys that has the machine configured host names and host addresses (interface names) for each host in your Greenplum system (master, standby master and segments). Make sure there are no blank lines or extra spaces. For example, if you have a master, standby master and three segments with two network interfaces per host, your file would look something like this:

mdw

mdw-1

mdw-2

smdw

smdw-1

smdw-2

sdw1

sdw1-1

sdw1-2

sdw2

sdw2-1

sdw2-2

sdw3

sdw3-1

sdw3-2

Note: Check your systems’ /etc/hosts files for the correct host names to use for your environment.

4.Run the gpseginstall utility referencing the hostfile_exkeys file you just created. Use the -u and -p options to create the Greenplum system user (gpadmin) on all hosts and set the password for that user on all hosts. For example:

# gpseginstall -f hostfile_exkeys -u gpadmin -p P@$$word

Recommended security best practice: do not keep the default gpadmin password (changeme) or an easily guessed example password like the one above in a production environment; change the gpadmin password on all hosts immediately after installation.

Confirming Your Installation

To make sure the Greenplum software was installed and configured correctly, run the following confirmation steps from your Greenplum master host. If necessary, correct any problems before continuing on to the next task.

1.Log in to the master host as gpadmin:

$ su - gpadmin

2.Source the path file from the Greenplum Database installation directory:

$ source /usr/local/greenplum-db/greenplum_path.sh

3.Use the gpssh utility to see if you can log in to all hosts without a password prompt, and to confirm that the Greenplum software was installed on all hosts. Use the hostfile_exkeys file you used for installation. For example:

$ gpssh -f hostfile_exkeys -e ls -l $GPHOME

If the installation was successful, you should be able to log in to all hosts without a password prompt. All hosts should show that they have the same contents in their installation directories, and that the directories are owned by the gpadmin user.

If you are prompted for a password, run the following command to redo the ssh key exchange:

$ gpssh-exkeys -f hostfile_exkeys

Installing Oracle Compatibility Functions

Optional. Many Oracle Compatibility SQL functions are available in Greenplum Database. These functions target PostgreSQL.

Before using any Oracle Compatibility Functions, you need to run the installation script $GPHOME/share/postgresql/contrib/orafunc.sql once for each database. For example, to install the functions in database testdb, use the command 

$ psql -d testdb -f \

$GPHOME/share/postgresql/contrib/orafunc.sql

To uninstall Oracle Compatibility Functions, use the script:

$GPHOME/share/postgresql/contrib/uninstall_orafunc.sql.

Note: The following functions are available by default and can be accessed without running the Oracle Compatibility installer: sinh, tanh, cosh and decode.
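As a quick sanity check (assuming a database named testdb exists), you can call one of these default-available functions from psql without running the orafunc installer:

$ psql -d testdb -c "SELECT decode(2, 1, 'one', 2, 'two', 'other');"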

For more information about Greenplum’s Oracle compatibility functions, see the Oracle Compatibility Functions appendix of the Greenplum Database Administrator Guide.

Installing Greenplum Database Extensions

Optional. Use the Greenplum package manager (gppkg) to install Greenplum Database extensions such as pgcrypto, PL/R, PL/Java, PL/Perl, and PostGIS, along with their dependencies, across an entire cluster. The package manager also integrates with existing scripts so that any packages are automatically installed on any new hosts introduced into the system following cluster expansion or segment host recovery.
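For example, to install a downloaded extension package across the cluster (the package file name below is hypothetical; use the actual .gppkg file provided for your platform and Greenplum Database version):

$ gppkg -i plr-1.0-rhel5-x86_64.gppkg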

Creating the Data Storage Areas

Every Greenplum Database master and segment instance has a designated storage area on disk that is called the data directory location. This is the file system location where the directories that store segment instance data will be created. The master host needs a data storage location for the master data directory. Each segment host needs a data directory storage location for its primary segments, and another for its mirror segments.

To create the data directory location on the master

The data directory location on the master is different from those on the segments. The master does not store any user data; only the system catalog tables and system metadata are stored on the master instance, so you do not need to designate as much storage space as on the segments.

1.Create or choose a directory that will serve as your master data storage area. This directory should have sufficient disk space for your data and be owned by the gpadmin user and group. For example, run the following commands as root:

# mkdir /data/master

2.Change ownership of this directory to the gpadmin user. For example:

# chown gpadmin /data/master

3.Using gpssh, create the master data directory location on your standby master as well. For example:

# source /usr/local/greenplum-db-4.2.x.x/greenplum_path.sh

# gpssh -h smdw -e 'mkdir /data/master'

# gpssh -h smdw -e 'chown gpadmin /data/master'

To create the data directory locations on all segment hosts

1.On the master host, log in as root:

$ su -

2.Create a file called hostfile_gpssh_segonly. This file should have only one machine configured host name for each segment host. For example, if you have three segment hosts:

sdw1

sdw2

sdw3

3.Using gpssh, create the primary and mirror data directory locations on all segment hosts at once using the hostfile_gpssh_segonly file you just created. For example:

# source /usr/local/greenplum-db-4.2.x.x/greenplum_path.sh

# gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/primary'

# gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/mirror'

# gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/primary'

# gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/mirror'

Synchronizing System Clocks

Greenplum recommends using NTP (Network Time Protocol) to synchronize the system clocks on all hosts that comprise your Greenplum Database system. See www.ntp.org for more information about NTP.

NTP on the segment hosts should be configured to use the master host as the primary time source, and the standby master as the secondary time source. On the master and standby master hosts, configure NTP to point to your preferred time server.

To configure NTP

1.On the master host, log in as root and edit the /etc/ntp.conf file. Set the server parameter to point to your data center’s NTP time server. For example (if 10.6.220.20 was the IP address of your data center’s NTP server):

server 10.6.220.20

2.On each segment host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the master host, and the second server parameter to point to the standby master host. For example:

server mdw prefer

server smdw

3.On the standby master host, log in as root and edit the /etc/ntp.conf file. Set the first server parameter to point to the primary master host, and the second server parameter to point to your data center’s NTP time server. For example:

server mdw prefer

server 10.6.220.20

4.On the master host, use the NTP daemon to synchronize the system clocks on all Greenplum hosts. For example, using gpssh:

# gpssh -f hostfile_gpssh_allhosts -v -e 'ntpd'
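To confirm that the hosts are synchronizing, you can query the NTP peer status on every host (a quick check that assumes the ntpq utility is installed on your hosts):

# gpssh -f hostfile_gpssh_allhosts -v -e 'ntpq -p'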

C) Post-Install

Validating Your Systems

Greenplum provides the following utilities to validate the configuration and performance of your systems:

•gpcheck

•gpcheckperf 

Note: These utilities can be found in $GPHOME/bin of your Greenplum installation.

The following tests should be run prior to initializing your Greenplum Database system.

•Validating OS Settings

•Validating Hardware Performance

Validating OS Settings

Greenplum provides a utility called gpcheck that can be used to verify that all hosts in your array have the recommended OS settings for running a production Greenplum Database system. To run gpcheck:

1.Log in on the master host as the gpadmin user.

2.Source the greenplum_path.sh path file from your Greenplum installation. For example:

$ source /usr/local/greenplum-db/greenplum_path.sh

3.Create a file called hostfile_gpcheck that has the machine-configured host names of each Greenplum host (master, standby master and segments), one host name per line. Make sure there are no blank lines or extra spaces. This file should just have a single host name per host. For example:

mdw

smdw

sdw1

sdw2

sdw3

4.Run the gpcheck utility using the host file you just created. For example:

$ gpcheck -f hostfile_gpcheck -m mdw -s smdw

5.After gpcheck finishes verifying the OS parameters on all hosts (masters and segments), review its output; the utility may report OS parameters that you need to modify before initializing your Greenplum Database system.

Validating Hardware Performance

Greenplum provides a management utility called gpcheckperf, which can be used to identify hardware and system-level issues on the machines in your Greenplum Database array. gpcheckperf starts a session on the specified hosts and runs the following performance tests:

•Network Performance (gpnetbench*)

•Disk I/O Performance (dd test)

•Memory Bandwidth (stream test)

Before using gpcheckperf, you must have a trusted host setup between the hosts involved in the performance test. You can use the utility gpssh-exkeys to update the known host files and exchange public keys between hosts if you have not done so already. Note that gpcheckperf calls gpssh and gpscp, so these Greenplum utilities must be in your $PATH.

Validating Network Performance

To test network performance, run gpcheckperf with one of the network test run options: parallel pair test (-r N), serial pair test (-r n), or full matrix test (-r M). The utility runs a network benchmark program that transfers a 5 second stream of data from the current host to each remote host included in the test. By default, the data is transferred in parallel to each remote host and the minimum, maximum, average and median network transfer rates are reported in megabytes (MB) per second. If the summary transfer rate is slower than expected (less than 100 MB/s), you can run the network test serially using the -r n option to obtain per-host results. To run a full-matrix bandwidth test, you can specify -r M which will cause every host to send and receive data from every other host specified. This test is best used to validate if the switch fabric can tolerate a full-matrix workload.

Most systems in a Greenplum Database array are configured with multiple network interface cards (NICs), each NIC on its own subnet. When testing network performance, it is important to test each subnet individually. For example, consider the following network configuration of two NICs per host:

Example Network Interface Configuration

Greenplum Host    Subnet1 NICs    Subnet2 NICs
Segment 1         sdw1-1          sdw1-2
Segment 2         sdw2-1          sdw2-2
Segment 3         sdw3-1          sdw3-2

You would create two distinct host files for use with the gpcheckperf network test:

Example Network Test Host File Contents

hostfile_gpchecknet_ic1    hostfile_gpchecknet_ic2
sdw1-1                     sdw1-2
sdw2-1                     sdw2-2
sdw3-1                     sdw3-2

You would then run gpcheckperf once per subnet. For example (if testing an even number of hosts, run in parallel pairs test mode):

$ gpcheckperf -f hostfile_gpchecknet_ic1 -r N -d /tmp > subnet1.out

$ gpcheckperf -f hostfile_gpchecknet_ic2 -r N -d /tmp > subnet2.out

If you have an odd number of hosts to test, you can run in serial test mode (-r n).

Validating Disk I/O and Memory Bandwidth

To test disk and memory bandwidth performance, run gpcheckperf with the disk and stream test run options (-r ds). The disk test uses the dd command (a standard UNIX utility) to test the sequential throughput performance of a logical disk or file system. The memory test uses the STREAM benchmark program to measure sustainable memory bandwidth. Results are reported in MB per second (MB/s).

To run the disk and stream tests

1.Log in on the master host as the gpadmin user.

2.Source the greenplum_path.sh path file from your Greenplum installation. For example:

$ source /usr/local/greenplum-db/greenplum_path.sh

3.Create a host file named hostfile_gpcheckperf that has one host name per segment host. Do not include the master host. For example:

sdw1

sdw2

sdw3

sdw4

4.Run the gpcheckperf utility using the hostfile_gpcheckperf file you just created. Use the -d option to specify the file systems you want to test on each host (you must have write access to these directories). You will want to test all primary and mirror segment data directory locations. For example:

$ gpcheckperf -f hostfile_gpcheckperf -r ds -D \

-d /data1/primary -d /data2/primary \

-d /data1/mirror -d /data2/mirror

5.The utility may take a while to perform the tests as it is copying very large files between the hosts. When it is finished you will see the summary results for the Disk Write, Disk Read, and Stream tests.

Configuring Localization Settings

Greenplum Database supports localization with two approaches:

•Using the locale features of the operating system to provide locale-specific collation order, number formatting, and so on.

•Providing a number of different character sets defined in the Greenplum Database server, including multiple-byte character sets, to support storing text in all kinds of languages, and providing character set translation between client and server.

Locale support refers to an application respecting cultural preferences regarding alphabets, sorting, number formatting, etc. Greenplum Database uses the standard ISO C and POSIX locale facilities provided by the server operating system. For additional information refer to the documentation of your operating system.

Locale support is automatically initialized when a Greenplum Database system is initialized. The initialization utility, gpinitsystem, will initialize the Greenplum array with the locale setting of its execution environment by default, so if your system is already set to use the locale that you want in your Greenplum Database system then there is nothing else you need to do.

When you are ready to initialize Greenplum Database and you want to use a different locale (or you are not sure which locale your system is set to), you can instruct gpinitsystem exactly which locale to use by specifying the -n locale option. For example:

$ gpinitsystem -c gp_init_config -n sv_SE

The example above sets the locale to Swedish (sv) as spoken in Sweden (SE). Other possibilities might be en_US (U.S. English) and fr_CA (French Canadian). If more than one character set can be useful for a locale then the specifications look like this: cs_CZ.ISO8859-2. What locales are available under what names on your system depends on what was provided by the operating system vendor and what was installed. On most systems, the command locale -a will provide a list of available locales.

Occasionally it is useful to mix rules from several locales, for example use English collation rules but Spanish messages. To support that, a set of locale subcategories exist that control only a certain aspect of the localization rules:

•LC_COLLATE — String sort order

•LC_CTYPE — Character classification (What is a letter? Its upper-case equivalent?)

•LC_MESSAGES — Language of messages

•LC_MONETARY — Formatting of currency amounts

•LC_NUMERIC — Formatting of numbers

•LC_TIME — Formatting of dates and times

If you want the system to behave as if it had no locale support, use the special locale C or POSIX.

The nature of some locale categories is that their value has to be fixed for the lifetime of a Greenplum Database system. That is, once gpinitsystem has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. Greenplum Database enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by gpinitsystem. The server automatically adopts those two values based on the locale that was chosen at initialization time.

The other locale categories can be changed as desired whenever the server is running by setting the server configuration parameters that have the same name as the locale categories (see the Greenplum Database Administrator Guide for more information on setting server configuration parameters). The defaults that are chosen by gpinitsystem are written into the master and segment postgresql.conf configuration files to serve as defaults when the Greenplum Database system is started. If you delete these assignments from the master and each segment postgresql.conf files then the server will inherit the settings from its execution environment.

Note that the locale behavior of the server is determined by the environment variables seen by the server, not by the environment of any client. Therefore, be careful to configure the correct locale settings on each Greenplum Database host (master and segments) before starting the system. A consequence of this is that if client and server are set up in different locales, messages may appear in different languages depending on where they originated.

Inheriting the locale from the execution environment means the following on most operating systems: For a given locale category, say the collation, the following environment variables are consulted in this order until one is found to be set: LC_ALL, LC_COLLATE (the variable corresponding to the respective category), LANG. If none of these environment variables are set then the locale defaults to C.

Some message localization libraries also look at the environment variable LANGUAGE which overrides all other locale settings for the purpose of setting the language of messages. If in doubt, please refer to the documentation of your operating system, in particular the documentation about gettext, for more information.

Native language support (NLS), which enables messages to be translated to the user’s preferred language, is not enabled in Greenplum Database for languages other than English. This is independent of the other locale support.

Locale Behavior

The locale settings influence the following SQL features:

•Sort order in queries using ORDER BY on textual data

•The ability to use indexes with LIKE clauses

•The upper, lower, and initcap functions

•The to_char family of functions

The drawback of using locales other than C or POSIX in Greenplum Database is the performance impact. Locale support slows character handling and prevents ordinary indexes from being used by LIKE. For this reason, use locales only if you actually need them.

Troubleshooting Locales

If locale support does not work as expected, check that the locale support in your operating system is correctly configured. To check what locales are installed on your system, you may use the command locale -a if your operating system provides it.

Check that Greenplum Database is actually using the locale that you think it is. LC_COLLATE and LC_CTYPE settings are determined at initialization time and cannot be changed without redoing gpinitsystem. Other locale settings including LC_MESSAGES and LC_MONETARY are initially determined by the operating system environment of the master and/or segment host, but can be changed after initialization by editing the postgresql.conf file of each Greenplum master and segment instance. You can check the active locale settings of the master host using the SHOW command. Note that every host in your Greenplum Database array should be using identical locale settings.
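For example, to check the two fixed categories on the master (assuming you can connect to a database such as template1):

$ psql -d template1 -c 'SHOW lc_collate;'

$ psql -d template1 -c 'SHOW lc_ctype;'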

Initializing a Greenplum Database System

Because Greenplum Database is distributed, the process for initializing a Greenplum Database management system (DBMS) involves initializing several individual PostgreSQL database instances (called segment instances in Greenplum).

Each database instance (the master and all segments) must be initialized across all of the hosts in the system in such a way that they can all work together as a unified DBMS. Greenplum provides its own version of initdb called gpinitsystem, which takes care of initializing the database on the master and on each segment instance, and starting each instance in the correct order.

After the Greenplum Database system has been initialized and started, you can then create and manage databases as you would in a regular PostgreSQL DBMS by connecting to the Greenplum master.

Initializing Greenplum Database

These are the high-level tasks for initializing Greenplum Database:

1.Make sure you have completed all of the installation tasks described in “Configuring Your Systems and Installing Greenplum”.

2.Create a host file that contains the host addresses of your segments. 

3.Create your Greenplum Database system configuration file. 

4.By default, Greenplum Database will be initialized using the locale of the master host system. Make sure this is the correct locale you want to use, as some locale options cannot be changed after initialization. 

5.Run the Greenplum Database initialization utility on the master host. 

Creating the Initialization Host File

The gpinitsystem utility requires a host file that contains the list of addresses for each segment host. The number of primary segment instances created per host is determined by the number of data directory locations specified in the gpinitsystem_config file; the segments are spread evenly across the host addresses (interfaces) listed for each host.

This file should only contain segment host addresses (not the master or standby master). For segment machines with more than one network interface, this file should list the host address names for each interface — one per line.

To create the initialization host file

1.Log in as gpadmin.

$ su - gpadmin

2.Create a file named hostfile_gpinitsystem. In this file add the host address name(s) of your segment host interfaces, one name per line, no extra lines or spaces. For example, if you have four segment hosts with two network interfaces each:

sdw1-1

sdw1-2

sdw2-1

sdw2-2

sdw3-1

sdw3-2

sdw4-1

sdw4-2

3.Save and close the file.

Note: If you are not sure of the host names and/or interface address names used by your machines, look in the /etc/hosts file.

Creating the Greenplum Database Configuration File

Your Greenplum Database configuration file tells the gpinitsystem utility how you want to configure your Greenplum Database system. An example configuration file can be found in $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_config.

To create a gpinitsystem_config file

1.Log in as gpadmin.

$ su - gpadmin

2.Make a copy of the gpinitsystem_config file to use as a starting point. For example:

$ cp $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_config /home/gpadmin/gpconfigs/gpinitsystem_config

3.Open the file you just copied in a text editor.

Set all of the required parameters according to your environment. A Greenplum Database system must contain a master instance and at least two segment instances (even if setting up a single node system).

The DATA_DIRECTORY parameter is what determines how many segments per host will be created. If your segment hosts have multiple network interfaces, and you used their interface address names in your host file, the number of segments will be evenly spread over the number of available interfaces.

Here is an example of the required parameters in the gpinitsystem_config file:

ARRAY_NAME="EMC Greenplum DW"

SEG_PREFIX=gpseg

PORT_BASE=40000

declare -a DATA_DIRECTORY=(/data1/primary /data1/primary /data1/primary /data2/primary /data2/primary /data2/primary)

MASTER_HOSTNAME=mdw

MASTER_DIRECTORY=/data/master

MASTER_PORT=5432

TRUSTED_SHELL=ssh

CHECK_POINT_SEGMENT=8

ENCODING=UNICODE

4.(optional) If you want to deploy mirror segments, uncomment and set the mirroring parameters according to your environment. Here is an example of the optional mirror parameters in the gpinitsystem_config file:

MIRROR_PORT_BASE=50000

REPLICATION_PORT_BASE=41000

MIRROR_REPLICATION_PORT_BASE=51000

declare -a MIRROR_DATA_DIRECTORY=(/data1/mirror /data1/mirror /data1/mirror /data2/mirror /data2/mirror /data2/mirror)

Note: You can initialize your Greenplum system with primary segments only and deploy mirrors later using the gpaddmirrors utility.

5.Save and close the file.

Running the Initialization Utility

The gpinitsystem utility will create a Greenplum Database system using the values defined in the configuration file. 

To run the initialization utility

1.Run the following command referencing the path and file name of your initialization configuration file (gpinitsystem_config) and host file (hostfile_gpinitsystem). For example:

$ cd ~

$ gpinitsystem -c gpconfigs/gpinitsystem_config -h gpconfigs/hostfile_gpinitsystem

For a fully redundant system (with a standby master and a spread mirror configuration) include the -s and -S options. For example:

$ gpinitsystem -c gpconfigs/gpinitsystem_config -h gpconfigs/hostfile_gpinitsystem -s standby_master_hostname -S

2.The utility will verify your setup information and make sure it can connect to each host and access the data directories specified in your configuration. If all of the pre-checks are successful, the utility will prompt you to confirm your configuration. For example:

=> Continue with Greenplum creation? Yy/Nn

3.Press y to start the initialization.

4.The utility will then begin setup and initialization of the master instance and each segment instance in the system. Each segment instance is set up in parallel. Depending on the number of segments, this process can take a while.

5.At the end of a successful setup, the utility will start your Greenplum Database system. You should see:

=> Greenplum Database instance successfully created.

Troubleshooting Initialization Problems

If the utility encounters any errors while setting up an instance, the entire process will fail, and could possibly leave you with a partially created system. Refer to the error messages and logs to determine the cause of the failure and where in the process the failure occurred. Log files are created in ~/gpAdminLogs.

Depending on when the error occurred in the process, you may need to clean up and then try the gpinitsystem utility again. For example, if some segment instances were created and some failed, you may need to stop postgres processes and remove any utility-created data directories from your data storage area(s). A backout script is created to help with this cleanup if necessary.

Using the Backout Script

If the gpinitsystem utility fails, it will create the following backout script if it has left your system in a partially installed state:

~/gpAdminLogs/backout_gpinitsystem_<user>_<timestamp>

You can use this script to clean up a partially created Greenplum Database system. This backout script will remove any utility-created data directories, postgres processes, and log files. After correcting the error that caused gpinitsystem to fail and running the backout script, you should be ready to retry initializing your Greenplum Database array.

The following example shows how to run the backout script:

$ sh backout_gpinitsystem_gpadmin_20071031_121053

Setting Greenplum Environment Variables

You must configure your environment on the Greenplum Database master (and standby master). A greenplum_path.sh file is provided in your $GPHOME directory with environment variable settings for Greenplum Database. You can source this file in the gpadmin user’s startup shell profile (such as .bashrc).

The Greenplum Database management utilities also require that the MASTER_DATA_DIRECTORY environment variable be set. This should point to the directory created by the gpinitsystem utility in the master data directory location.

To set up your user environment for Greenplum

1.Make sure you are logged in as gpadmin:

$ su - gpadmin

2.Open your profile file (such as .bashrc) in a text editor. For example:

$ vi ~/.bashrc

3.Add lines to this file to source the greenplum_path.sh file and set the MASTER_DATA_DIRECTORY environment variable. For example:

source /usr/local/greenplum-db/greenplum_path.sh

export MASTER_DATA_DIRECTORY=/data/master/gpseg-1

4.(optional) You may also want to set some client session environment variables such as PGPORT, PGUSER and PGDATABASE for convenience. For example:

export PGPORT=5432

export PGUSER=gpadmin

export PGDATABASE=default_login_database_name

5.Save and close the file.

6.After editing the profile file, source it to make the changes active. For example:

$ source ~/.bashrc

7.If you have a standby master host, copy your environment file to the standby master as well. For example:

$ cd ~

$ scp .bashrc standby_hostname:`pwd`

Note: The .bashrc file should not produce any output. If you wish to have a message display to users upon logging in, use the .profile file instead.