We have migrated to new Google sites!
Greenplum database 4.3.0 introduced 3 new gp_workfile_* views in the gp_toolkit schema. If you recall these views are not available before 4.3.0. You have to run a script
gp_workfile_mgr.sql (available in /usr/local/greenplum-db/share/postgresql/contrib) manually in order to create these views in older versions.
Workfile Disk Spill Space (4.3.0.0, 4.3.1.0, 4.3.2.0)
Server Configuration Parameters for gp_workfiles
1. gp_workfile_compress
_algorithm
2. gp_workfile_limit_files
_per_query
3. gp_workfile_limit_per
_query
4. gp_workfile_limit_per
_segment
5. gp_workfile
_checksumming
Greenplum Database creates spill files, also known as workfiles, on disk if it does not have sufficient memory to execute an SQL query in memory.
By default maximum no of spill files that can be created is 100,000 which is sufficient for the majority of queries. However, If a query creates more than the specified number of spill files, Greenplum Database returns error:
ERROR: number of workfiles per query limit exceeded
Reasons behind large number of spill files to be generated include:
Data skew is present in the queried data.
The amount memory allocated for the query is too low.
when you get this error there are many ways you solve this problem.
By changing the query, changing the data distribution,
By changing the system memory configuration.
Note: You can control the maximum amount of memory that can used by a query with the Greenplum Database server configuration parameters max_statement_mem,statement_mem or through resource queues.
Monitoring spill file usage is done by looking at the information available using gp_workfile_* views
sachi=# \dv gp_toolkit.gp_workfile_*
List of relations
Schema | Name | Type | Owner | Storage
------------+-------------------------------+------+---------+---------
gp_toolkit | gp_workfile_entries | view | gpadmin | none
gp_toolkit | gp_workfile_usage_per_query | view | gpadmin | none
gp_toolkit | gp_workfile_usage_per_segment | view | gpadmin | none
(3 rows)
Lets look at the definition of these views.
1. gp_workfile_entries: This view contains one row for each operator using disk space for workfiles on a segment at the current time. The view is accessible to all users, however non-superusers only to see information for the databases that they have permission to access.
2. gp_workfile_usage_per_query: This view contains one row for each query using disk space for workfiles on a segment at the current time. The view is accessible to all users, however non-superusers only to see information for the databases that they have permission to access.
3. gp_workfile_usage_per_segment:This view contains one row for each segment. Each row displays the total amount of disk space used for workfiles on the segment at the current time. The view is accessible to all users, however non-superusers only to see information for the databases that they have permission to access.
gp_workfile_usage_per_segment (one row for each segment)
||
||
V
gp_workfile_usage_per_query ( one row for each query using disk space for workfiles on a segment at the current time)
||
||
V
gp_workfile_entries (one row for each operator using disk space for workfiles on a segment at the current time)
sachi=# \d gp_toolkit.gp_workfile_entries
View "gp_toolkit.gp_workfile_entries"
Column | Type | Modifiers
---------------+---------+-----------
datname | name |
procpid | integer |
sess_id | integer |
command_cnt | integer |
usename | name |
current_query | text |
segid | integer |
slice | integer |
optype | text |
workmem | integer |
size | bigint |
numfiles | integer |
directory | text |
state | text |
utility | integer |
View definition:
WITH all_entries AS (
SELECT c.segid, c.path, c.hash, c.size, c.utility, c.state, c.workmem, c.optype, c.slice, c.sessionid, c.commandid, c.query_start, c.numfiles
FROM ONLY gp_toolkit.__gp_localid, gp_toolkit.__gp_workfile_entries_f() c(segid integer, path text, hash integer, size bigint, utility integer, state integer, workmem integer, optype text, slice integer, sessionid integer, commandid integer, query_start timestamp with time zone, numfiles integer)
UNION ALL
SELECT c.segid, c.path, c.hash, c.size, c.utility, c.state, c.workmem, c.optype, c.slice, c.sessionid, c.commandid, c.query_start, c.numfiles
FROM ONLY gp_toolkit.__gp_masterid, gp_toolkit.__gp_workfile_entries_f() c(segid integer, path text, hash integer, size bigint, utility integer, state integer, workmem integer, optype text, slice integer, sessionid integer, commandid integer, query_start timestamp with time zone, numfiles integer)
)
SELECT s.datname,
CASE
WHEN c.state = 1 THEN s.procpid
ELSE NULL::integer
END AS procpid, c.sessionid AS sess_id, c.commandid AS command_cnt, s.usename,
CASE
WHEN c.state = 1 THEN s.current_query
ELSE NULL::text
END AS current_query, c.segid, c.slice, c.optype, c.workmem, c.size, c.numfiles, c.path AS directory,
CASE
WHEN c.state = 1 THEN 'RUNNING'::text
WHEN c.state = 2 THEN 'CACHED'::text
WHEN c.state = 3 THEN 'DELETING'::text
ELSE 'UNKNOWN'::text
END AS state, c.utility
FROM all_entries c
LEFT JOIN pg_stat_activity s ON c.sessionid = s.sess_id;
sachi=# \d gp_toolkit.gp_workfile_usage_per_query
View "gp_toolkit.gp_workfile_usage_per_query"
Column | Type | Modifiers
---------------+---------+-----------
datname | name |
procpid | integer |
sess_id | integer |
command_cnt | integer |
usename | name |
current_query | text |
segid | integer |
state | text |
size | numeric |
numfiles | bigint |
View definition:
SELECT gp_workfile_entries.datname, gp_workfile_entries.procpid, gp_workfile_entries.sess_id, gp_workfile_entries.command_cnt, gp_workfile_entries.usename, gp_workfile_entries.current_query, gp_workfile_entries.segid, gp_workfile_entries.state, sum(gp_workfile_entries.size) AS size, sum(gp_workfile_entries.numfiles) AS numfiles
FROM gp_toolkit.gp_workfile_entries
GROUP BY gp_workfile_entries.datname, gp_workfile_entries.procpid, gp_workfile_entries.sess_id, gp_workfile_entries.command_cnt, gp_workfile_entries.usename, gp_workfile_entries.current_query, gp_workfile_entries.segid, gp_workfile_entries.state;
sachi=# \d gp_toolkit.gp_workfile_usage_per_segment
View "gp_toolkit.gp_workfile_usage_per_segment"
Column | Type | Modifiers
----------+----------+-----------
segid | smallint |
size | numeric |
numfiles | bigint |
View definition:
SELECT gpseg.content AS segid, COALESCE(sum(wfe.size), 0::numeric) AS size, sum(wfe.numfiles) AS numfiles
FROM ( SELECT gp_segment_configuration.content
FROM gp_segment_configuration
WHERE gp_segment_configuration.role = 'p'::"char") gpseg
LEFT JOIN gp_toolkit.gp_workfile_entries wfe ON gpseg.content = wfe.segid
GROUP BY gpseg.content;
sachi=#