Performance Tuning Approach
- Data Types / Byte Alignment
- Distribution Analysis
- Partitioning Analysis
- Indexing Strategies (if any)
- Explain Plans
DISTRIBUTED BY must be balanced. That is, match its data types, and use columns that either came from a clustering index or have been selected as UNIQUE NOT NULL values matching all other joined tables, so that a LOCAL JOIN is available.
Only build partitions when they are truly necessary. If the distribution alone gives you the results you need in terms of response time, avoid creating partitions.
Indexing is a bad word in Greenplum. Although indexes are supported, they are a last resort for inefficiently written queries, which usually stem from an unbalanced distribution key selection and/or some other related predicate that should have been coded differently (for example, the distribution keys not being used in the JOIN).
- Be consistent with the execution of ANALYZE for tables that only have INSERTs executed against them
- Help your DBA find distribution mismatches to make your processing run faster
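As a sketch of the distribution guidance above (all table and column names here are hypothetical), two tables that share the same distribution key column name and data type allow a LOCAL JOIN with no motion:

```sql
-- Hypothetical example: both tables share the same distribution key
-- (same column, same data type), so a join on customer_id can execute
-- locally on each segment without redistributing rows.
CREATE TABLE customers (
    customer_id   integer NOT NULL,
    customer_name varchar(100)
) DISTRIBUTED BY (customer_id);

CREATE TABLE orders (
    order_id    integer NOT NULL,
    customer_id integer NOT NULL,
    order_date  date
) DISTRIBUTED BY (customer_id);

-- Joining on the shared distribution key avoids broadcast/redistribute motions:
SELECT c.customer_name, o.order_date
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;
```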
- Greenplum Database devises a query plan for each query it is given.
- Choosing the right query plan to match the query and data structure is absolutely critical for good performance.
- A query plan defines how the query will be executed in Greenplum Database’s parallel execution environment.
- By examining the query plans of poorly performing queries, you can identify possible performance tuning opportunities.
Based on Greenplum’s Architecture:
- Distribute by JOIN (For LOCAL JOIN Practice)
- Partition by Predicate when truly necessary
- Index only when truly necessary (not encouraged)
- As a DBA you must analyze and make sure that all of the tables associated with one another possess the same set of DISTRIBUTION KEY components (or columns)
- A LOCAL JOIN is that join between two or more tables that share the same Distribution column values, just as a Clustering Index does in a conventional RDBMS.
- A LOCAL JOIN executes faster than a conventional Clustering INDEX.
The query planner uses the database statistics it has to choose a query plan with the lowest possible cost. Cost is measured in disk I/O and CPU effort (shown as units of disk page fetches). The goal is to minimize the total execution cost for the plan. You can view the plan for a given query using the EXPLAIN command. This will show the query planner’s estimated plan for the query. For example:
sachi=> select * from employees where employee_id=198;
 employee_id | first_name | last_name |  email   | phone_number |      hire_date      |  job_id  | salary  | commission_pct | manager_id | department_id
-------------+------------+-----------+----------+--------------+---------------------+----------+---------+----------------+------------+---------------
         198 | Donald     | OConnell  | DOCONNEL | 650.507.9833 | 2007-06-21 00:00:00 | SH_CLERK | 2600.00 |                |        124 |            50
(1 row)
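For the same query, a plain EXPLAIN shows only the estimated plan without executing the statement (the output below is illustrative; actual cost numbers depend on your statistics):

```sql
EXPLAIN select * from employees where employee_id=198;
-- Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..3.34 rows=1 width=85)
--   ->  Seq Scan on employees  (cost=0.00..3.34 rows=1 width=85)
--         Filter: employee_id = 198::numeric
```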
EXPLAIN ANALYZE causes the statement to be actually executed, not only planned. This is useful for seeing whether the planner’s estimates are close to reality. For example:
sachi=> EXPLAIN ANALYZE select * from employees where employee_id=198;
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..3.34 rows=1 width=85)
Rows out: 1 rows at destination with 1.252 ms to first row, 1.253 ms to end, start offset by 8.761 ms.
-> Seq Scan on employees (cost=0.00..3.34 rows=1 width=85)
Filter: employee_id = 198::numeric
Rows out: 1 rows with 0.147 ms to first row, 0.162 ms to end, start offset by 9.741 ms.
(slice0) Executor memory: 183K bytes.
(slice1) Executor memory: 201K bytes (seg1).
Memory used: 128000K bytes
Total runtime: 10.188 ms
On an active database, you're often running larger tasks that can impact the performance of the database, which is bad for a fast database-backed Web site. For example, pg_dump to make backups, COPY/CREATE INDEX as in the above examples, etc. Not to fear; there are some simple things you can do to lessen the impact, e.g.:
nice -n 20 ionice -n 4 pg_dump
nice -n 20 ionice -n 4 psql
This wraps the command to have the lowest CPU priority (via nice) and I/O priority (via ionice). Additionally I will often also use
cpulimit -e pg_dump -l 20 -z
which will limit pg_dump (in this case) to 20% of CPU while it is running and then exit (via cpulimit). Of course, it also usually makes sense to run these things at off-peak times using cron, etc.
These GUCs are useful for queries that finish in less than 5 seconds, and they are recommended for OLTP-like workloads. They are especially helpful for systems that have a large number of segments. They need to be tuned carefully in a high-concurrency system. A system is considered high concurrency if resource queues are configured to run 50 or more queries concurrently.
- A value larger than or equal to the number of primary segments means that each slice in a query plan will get its own thread when dispatching to segments
- Lower values will use more threads, which utilizes more resources on the master
- Reducing this value improves the performance of queries that run for a couple of seconds
- Enables or disables the dispatching of targeted query plans for queries that access data on a single segment
- This significantly reduces the response time of qualifying queries as there is no interconnect setup involved
- Direct dispatch requires more CPU utilization on the master
- Improves performance of queries that have a filter on the distribution keys
- This needs to be accounted for when deciding on distribution keys for tables
- Especially helpful in high concurrency environments
- A higher setting may improve performance for power-users that want to issue many complex queries in a row
- Helpful in high concurrency environments
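As an illustration of direct dispatch (the table name here is hypothetical), a query whose WHERE clause pins the distribution key to a single literal value lets the planner target only the one segment that holds those rows:

```sql
-- Assume sales is DISTRIBUTED BY (sale_id).
-- An equality filter on the distribution key qualifies for direct
-- dispatch: no interconnect setup, only one segment does the work.
SELECT * FROM sales WHERE sale_id = 12345;
```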
1. random_page_cost (master/session/reload) Default value: 100
Sets the planner’s estimate of the cost of a non-sequentially fetched disk page
A lower value increases the chances of an index scan being picked
2. enable_indexscan (master/session/reload) Default value: on
Enables or disables the query planner’s use of index-scan plan types
3. enable_nestloop (master/session/reload) Default value: off
Enables or disables the query planner’s use of nested-loop join plans
This should be enabled to allow the use of an index in nested-loop joins
4. enable_bitmapscan (master/session/reload) Default value: on
Enables or disables the query planner’s use of bitmap-scan plan types.
Generally a bitmap scan provides faster access; however, you can try disabling it, specifically if you are getting very few rows out of the index
5. enable_seqscan (master/session/reload) Default value: on
Disabling enable_seqscan forces the planner to use an index where possible
Use this parameter very carefully and only as a last resort
Iterative tuning steps to favor index usage
Start by setting or confirming the following GUC settings:
Set enable_indexscan to on
For joins via index lookup, set enable_nestloop to on
Start by lowering random_page_cost
Set to 20
If still not using the index, then set it to 10
If still not using the index, increase seq_page_cost
Set to 10
If still not using the index, then set it to 15
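The iterative steps above can be sketched as a sequence of session-level settings (a sketch only; the right values depend on your data and hardware):

```sql
-- Step 1: confirm the prerequisites
SET enable_indexscan = on;
SET enable_nestloop = on;     -- for joins via index lookup

-- Step 2: lower the random page cost in stages
SET random_page_cost = 20;    -- then 10 if the index is still not used

-- Step 3: if the index is still not chosen, raise the sequential cost
SET seq_page_cost = 10;       -- then 15 if necessary
```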
- The types smallint, integer, and bigint store whole numbers, that is, numbers without fractional components, of various ranges. Attempts to store values outside of the allowed range will result in an error.
- The type integer is the common choice, as it offers the best balance between range, storage size, and performance. The smallint type is generally only used if disk space is at a premium. The bigint type should only be used if the integer range is insufficient, because the latter is definitely faster.
- On very minimal operating systems the bigint type might not function correctly, because it relies on compiler support for eight-byte integers. On such machines, bigint acts the same as integer, but still takes up eight bytes of storage. (We are not aware of any modern platform where this is the case.)
- SQL only specifies the integer types integer (or int), smallint, and bigint. The type names int2, int4, and int8 are extensions, which are also used by some other SQL-based DBMSs.
- CHAR is allowed, but it must have a reasonable length with high cardinality to mimic the same distribution effect as that of DATE.
- VARCHAR is allowed, but it must have a reasonable length with high cardinality to mimic the same distribution effect as that of DATE.
- NUMERIC has no limit but it can store numbers with up to 1000 digits of precision.
- BYTEA is variable-length: 1 or 4 bytes plus the actual binary string.
- TIMESTAMP behaves just as DATE does
- TIME behaves just as DATE and TIMESTAMP do
- DATE fields are also favorable, for they act as if the data is evenly distributed, simulating the effect of a partition. Do not use DATE alone in the distribution key.
The columns with the following data types should not be part of any DISTRIBUTION key selected for any given table:
- TEXT is a data type with unlimited length; a very noxious data type for distribution.
- DECIMAL is not to be used at all in the DISTRIBUTION key
- DOUBLE PRECISION (floating-point binary64)
- REAL (real numbers)
- SERIAL mimics a good sequential number but penalizes a singleton-select JOIN practice
- MONEY, which stores monetary data (it has been deprecated)
- Boolean types (0/1, true/false, etc.)
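A hedged illustration of the data type guidance above (the table and columns are hypothetical): keep the distribution key on fixed-width, high-cardinality columns, and leave unbounded or floating-point types out of it.

```sql
-- Good: integer join column combined with a high-cardinality date column.
-- TEXT, NUMERIC/DECIMAL, REAL, DOUBLE PRECISION, SERIAL, MONEY, and
-- boolean columns stay out of the DISTRIBUTED BY clause.
CREATE TABLE fact_sales (
    account_id integer NOT NULL,
    sale_date  date NOT NULL,
    amount     numeric(12,2),   -- fine as a column, never in the key
    notes      text             -- fine as a column, never in the key
) DISTRIBUTED BY (account_id, sale_date);
```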
select gp_segment_id,
       count(*),
       abs(count(*) - avg(count(*)) over (order by gp_segment_id
           rows between unbounded preceding and unbounded following)) / count(*) as skew
from bloattest
group by gp_segment_id
order by gp_segment_id;
Problem: If I load one file to a table (1 single Informatica session, 1 GP connection), the load takes about 7 minutes. Now if I change and do, say, 15 loads all to that same table via 15 individual Informatica sessions, each with their own external table, the same volume of data for each session, and a GP connection for each session, the load time jumps to about 15-20 minutes. Prior to the loads, I truncate the target table manually and then start the loads. Now once that first set of loads completes, if I repeat the 15 loads without truncating, the loads complete in 7 minutes just as a single load would. What would cause the original 15 to take so much longer? There is no vacuum or analyze after the loads, just a data dump and commit.
SELECT relation::regclass, * FROM pg_locks WHERE NOT granted;
relation | locktype | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid | mode | granted | mppsessionid | mppiswriter | gp_segment_id
salesdata | relation | 21635 | 710899 | | | | | | | 5038006 | 4076 | ShareUpdateExclusiveLock | f | 178315 | t | -1
salesdata | relation | 21635 | 710899 | | | | | | | 5038012 | 4117 | ShareUpdateExclusiveLock | f | 178316 | t | -1
salesdata | relation | 21635 | 710899 | | | | | | | 5038015 | 4140 | ShareUpdateExclusiveLock | f | 178317 | t | -1
salesdata | relation | 21635 | 710899 | | | | | | | 5038017 | 4169 | ShareUpdateExclusiveLock | f | 178319 | t | -1
salesdata | relation | 21635 | 710899 | | | | | | | 5038016 | 4152 | ShareUpdateExclusiveLock | f | 178318 | t | -1
Solution: Greenplum has an auto-analyze feature that kicks off when data is inserted into a table for the first time (after a create or truncate). This feature is controlled by the setting gp_autostats_mode, which is typically set to ON_NO_STATS.
Analyzing a table is a fairly costly operation (multiple SQLs are run per column of the table), and the analyze acquires a ShareUpdateExclusiveLock on the table, which ensures that multiple analyze statements on a single table cannot run in parallel. Also, since analyze only samples the table, its speed is not impacted so much by the number of rows as by the number of columns and partitions.
Now, think of fifteen parallel initial inserts. Greenplum will assign all fifteen of them the "analyze after insert" task because the table was empty before each of the inserts. After the insert itself which should finish quickly, all fifteen threads will analyze the table. Because of the ShareUpdateExclusiveLock, the analyze operations will run serially, not in parallel. This will cause the operation to slow down a lot.
In case of non-parallel inserts, one insert and one analyze operation will happen, which is faster.
In case of subsequent loads, the table is non-empty, so the analyze operation is not triggered at all.
Hope the above explains the situation you faced. Given this, I would recommend the following:
1. Try not to parallelize load (insert, update, delete) operations to Greenplum. The MPP system already runs many parallel threads for you for a single statement, so it is often not effective to implement external parallelization. I believe Informatica has an option to parallelize its internal operations while still submitting a single load operation to Greenplum.
2. If for some reason, such parallel operations are not avoidable (they should always be, I hope), set the GP_AUTOSTATS_MODE to NONE and analyze the table after all load threads are complete. GP_AUTOSTATS_MODE can be changed at session level, but it would be more challenging to achieve it through PWX (gpload). Setting GP_AUTOSTATS_MODE at global level might impact other users who may be expecting the auto-analyze behavior.
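The second recommendation can be sketched at session level (table name taken from the example above):

```sql
-- Per-session: disable auto-analyze for the parallel load sessions,
-- then analyze once after all load threads are complete.
SET gp_autostats_mode = 'none';
-- ... run the load ...
ANALYZE salesdata;
```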
- Query plans are a right-to-left tree of plan nodes that are read from bottom to top, with each node passing its result to the node directly above it. There is one line for each node in the plan tree, and each node represents a single operation, for example a scan, join, aggregation, or sort operation. The node will identify the method used to perform the operation. For example, a scan operation may perform a sequential scan or index scan, and a join operation may perform a hash join or nested loop join.
- The query plan will also include motion nodes. The motion operations are responsible for moving rows between segments as required to process the query. The node will identify the method used to perform the motion operation, for example a redistribute motion or broadcast motion. In a gather motion, segments send the resulting rows to the master host. The last operation for most query plans will be a gather motion.
- Scan operators include Seq Scan on heap tables, Append-Only Scan on row oriented AO tables, Append-Only Columnar Scan on column oriented AO tables, Index scan, and Bitmap Append-Only Row-Oriented Scan.
- Join operators include hash join, nested loop join, and merge join. Hash joins are typically the fastest joins in Greenplum. A nested loop join requires each outer tuple to be compared with each inner tuple that might join to it, and requires one of the tables to be broadcast so that all rows in one table can be compared to all rows in the other. Nested loop joins perform well for small tables or tables limited by index use; they are also used for Cartesian joins and range joins. There are performance implications when using a nested loop join with large tables. For plan nodes that contain a nested loop join operator, validate the SQL and ensure that the results are what is intended.
- Poorly written or incorrect SQL is often the primary offender affecting query performance. Also, joining two keys with different data types can result in a nested loop join. A merge join sorts both datasets and merges them together. A merge join is fast for pre-ordered data but is very rare in the real world.
- Motion operators include broadcast motion, redistribute motion, and gather motion. In a broadcast motion, every segment broadcasts (sends) its own individual rows to all other segments. This results in every segment instance having its own complete, local copy of the entire table.
- A broadcast motion may not be as optimal as a redistribute motion therefore the optimizer typically only selects a broadcast motion for small tables. A broadcast motion is not acceptable for large tables. In the case where data was not distributed on the join key, a dynamic redistribution of the needed rows from one of the tables to another segment will be performed.
- A gather motion is the last operation for most query plans.
- To analyze query plans first identify plan nodes where the estimated cost to perform the operation is very high. Determine if the estimated number of rows seems reasonable and the cost relative to the number of rows for the operation performed.
- If using partitioning, validate that partition elimination is achieved. To achieve partition elimination, the query predicate (WHERE clause) must match the partitioning criteria. It is also important to note that in order to eliminate partitions, the WHERE clause must contain an explicit value.
- The WHERE clause cannot contain a subquery if partitions are to be eliminated.
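A hedged sketch of the partition elimination rules above (the table and partition column are hypothetical):

```sql
-- Assume fact_sales is partitioned by sale_date.
-- An explicit literal in the predicate allows partition elimination:
SELECT count(*) FROM fact_sales WHERE sale_date = date '2012-01-15';

-- A subquery in the predicate prevents partition elimination:
-- SELECT count(*) FROM fact_sales
-- WHERE sale_date = (SELECT max(load_date) FROM load_control);
```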
- Review the execution order of the query plan tree. Review the estimated number of rows. We want the execution order to build on the smaller tables or hash join result and probe with larger tables.
- Optimally the largest table is used for the final join or probe to reduce the number of rows being passed up the tree to the topmost plan nodes. If the analysis reveals that the order of execution builds and/or probes is not optimal, ensure that database statistics are up to date; running ANALYZE will likely address this and produce an optimal query plan.
- Identify plan nodes where a sort or aggregate operation is performed. Hidden inside an aggregate operation is a sort. If the sort or aggregate operation involves a large number of rows, there is an opportunity to improve query performance. A HashAggregate operation is preferred over sort-and-aggregate operations when a large number of rows are required to be sorted. Usually a sort operation is chosen by the optimizer due to the SQL construct, that is, due to the way the SQL is written. Most sort operations can be replaced with a HashAggregate if the query is rewritten. To favor a HashAggregate operation over a sort-and-aggregate operation, ensure the enable_groupagg GUC is set to ON.
- Eliminate large-table broadcast motions in favor of redistribute motions. If the number of rows is large, consider using the gp_segments_for_planner GUC to increase the cost of the motion. This sets the number of primary segments for the planner to assume in its cost and size estimates. If gp_segments_for_planner is set to zero (0), the default, then the value used is the actual number of primary segments. This variable affects the query planner’s estimates of the number of rows handled by each sending and receiving process in motion operators. Increasing the assumed number of primary segments will increase the cost of the motion, thereby favoring a redistribute motion over a broadcast motion. For example, setting gp_segments_for_planner = 100000 tells the planner that there are 100000 segments. Conversely, to influence the optimizer to broadcast a table and not redistribute it, set gp_segments_for_planner to a low number, for example two (2).
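The gp_segments_for_planner technique above can be sketched as a pair of session-level settings:

```sql
-- Inflate the assumed segment count to make motions look expensive,
-- favoring a redistribute motion over a broadcast for a large table:
SET gp_segments_for_planner = 100000;

-- Or shrink it to favor broadcasting a small table instead:
SET gp_segments_for_planner = 2;

-- Reset to the default (use the actual number of primary segments):
SET gp_segments_for_planner = 0;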