Concurrency Control in Greenplum Database

Post date: Nov 02, 2012 4:51:24 PM

Unlike Oracle database, which uses locks and latches for concurrency control, Greenplum Database (as does PostgreSQL) maintains data consistency by using a multiversion model (Multiversion Concurrency Control, MVCC). This means that while querying a database, each transaction sees a snapshot of data which protects the transaction from viewing inconsistent data that could be caused by (other) concurrent updates on the same data rows. This provides transaction isolation for each database session. In a nutshell, readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.

Greenplum differs from PostgreSQL for update and delete commands. It acquires an Exclusive Lock on a table. However, this lock doesn’t block readers. SQL queries will return data and users will see a snapshot of the data that isn’t dirty. This Exclusive Lock does however block other update and delete commands which is different from PostgreSQL.

MVCC, by eschewing explicit locking methodologies of traditional database systems, minimizes lock contention in order to allow for reasonable performance in multiuser environments. The main advantage to using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading.

Greenplum is a shared nothing database where no two nodes share the same data. Data is spread across multiple servers based on a distribution key defined on each table. A good key is typically a unique identifier in a table and this can be a single or multiple columns. If you pick a good key, each segment will have roughly the same number of rows and at Greenplum we call this the “skew”.

Greenplum Database provides various lock modes to control concurrent access to data in tables. Most Greenplum Database SQL commands automatically acquire locks of appropriate modes to ensure that referenced tables are not dropped or modified in incompatible ways while the command executes. For applications that cannot adapt easily to MVCC behavior, the LOCK command can be used to acquire explicit locks. However, proper use of MVCC will generally provide better performance than locks.

Transaction Isolation Levels

The SQL standard defines four transaction isolation levels. In Greenplum Database, you can request any of the four standard transaction isolation levels. But internally, there are only two distinct isolation levels — read committed and serializable:

1. read committed — When a transaction runs on this isolation level, a SELECT query sees only data committed before the query began. It never sees either uncommitted data or changes committed during query execution by concurrent transactions. However, the SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. In effect, a SELECT query sees a snapshot of the database as of the instant that query begins to run. Notice that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes during execution of the first SELECT. UPDATE and DELETE commands behave the same as SELECT in terms of searching for target rows. They will only find target rows that were committed as of the command start time. However, such a target row may have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. The partial transaction isolation provided by read committed mode is adequate for many applications, and this mode is fast and simple to use. However, for applications that do complex queries and updates, it may be necessary to guarantee a more rigorously consistent view of the database than the read committed mode provides.

2. serializable — This is the strictest transaction isolation. This level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently. Applications using this level must be prepared to retry transactions due to serialization failures. When a transaction is on the serializable level, a SELECT query sees only data committed before the transaction began. It never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. However, the SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. Successive SELECT commands within a single transaction always see the same data. UPDATE and DELETE commands behave the same as SELECT in terms of searching for target rows. They will only find target rows that were committed as of the transaction start time. However, such a target row may have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the serializable transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the serializable transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the serializable transaction will be rolled back.

3. read uncommitted — Treated the same as read committed in Greenplum Database.

4. repeatable read — Treated the same as serializable in Greenplum Database.

The default transaction isolation level in Greenplum Database is read committed. To change the isolation level for a transaction, you can declare the isolation level when you BEGIN the transaction, or else use the SET TRANSACTION command after the transaction is started.