Wednesday, March 12, 2014

Learning PostgreSql: READ COMMITTED and Data Integrity

As we have just discussed, READ COMMITTED isolation level behaves very much like Sql Server's READ_COMMITTED_SNAPSHOT. As such, we need to be very careful with data integrity - lots of code that just works on Sql Server under its default isolation level, READ COMMITTED, does not work on PostgreSql under its default isolation level, which is also READ COMMITTED, but behaves differently.

Note: whenever we use READ_COMMITTED_SNAPSHOT on Sql Server, we need to be just as careful with data integrity  - otherwise we can introduce lots of subtle bugs. Hugo Kornelis described these bugs in his blog series entitled "Snapshot isolation: a threat for integrity". Also I described the same effects in the book entitled "Defensive Database Programming". This is why we are going to discuss how to enforce data integrity when readers do not block writers.

I am well aware that there is a lot of advice to just enable READ_COMMITTED_SNAPSHOT and boost performance, without mentioning the side effects impacting data integrity. We need to be very careful with such advice: performance is important, of course, but I am usually working on systems where sacrificing data integrity to boost performance is not acceptable. 

Demonstrating the problem.

Suppose that we have the following business rule: high priority tickets cannot be assigned to developers who are on vacation. Also suppose that to enforce this business rule, we are using subqueries whenever we modify data in Developers and Tickets tables. It does not matter if we add subqueries directly to our DML commands or wrap them in triggers - the problem stays the same.
Let us set up the tables and some test data:

CREATE TABLE Developers(
  
Name VARCHAR NOT NULL PRIMARY KEY,
  
Status VARCHAR NOT NULL
  );
INSERT INTO Developers(NameStatus)VALUES('Jeff''Active');  CREATE TABLE Tickets(
  
Problem VARCHAR NOT NULL PRIMARY KEY,
  
Priority VARCHAR NOT NULL,
  
AssignedTo VARCHAR NULL,
  
CONSTRAINT FK_devlopers FOREIGN KEY(AssignedTo)
  
REFERENCES Developers(Name),
  
CONSTRAINT CHK_high_priority_always_assigned
    
CHECK((Priority='high' AND AssignedTo IS NOT NULL)
      OR 
Priority<>'high')
  );
INSERT INTO Tickets(ProblemPriorityAssignedTo)VALUES('TPS report hangs','low','Jeff'); 

We are ready to reproduce the problem. In one session, let us begin a transaction, update Jeff's status to "Vacation", but not commit it yet:

BEGIN TRANSACTION;UPDATE Developers SET Status='Vacation'WHERE Name='Jeff'AND NOT EXISTS(SELECT FROM Tickets AS t
WHERE t.AssignedTo Developers.Name
   AND t.Priority='high');
 

In another session, let us begin a transaction, increase the ticket's priority to high, but not commit it yet:

BEGIN TRANSACTION;UPDATE Tickets SET Priority='high'WHERE Problem='TPS report hangs'AND (SELECT Status FROM Developers WHERE Name=Tickets.AssignedTo)<>'Vacation'; 

Apparently the update succeeds:

SELECT FROM Tickets;
"TPS report hangs";"high";"Jeff" 

The reason why the update succeeded is simple: this session does not see the uncommitted data from the other session, and it is not blocked, because writers do not block readers. This session read Jeff's status, which is still active:

SELECT FROM Developers ;
"Jeff";"Active" 

We can commit both transactions. Now we have data that violates our data integrity rules: a high priority ticket is assigned to a developer who is on vacation. Let us restore the original data:

UPDATE Developers SET Status='Active'WHERE Name='Jeff';
UPDATE Tickets SET Priority='low'WHERE Problem='TPS report hangs'; 

Let us try out a higher isolation level. Our attempts to use REPEATABLE READ and even SERIALIZABLE make no difference. However, after a short investigation, we realize that the following script works differently on PostgreSql and Sql Server:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
On PostgreSql, the SET command does not affect the isolation level of the following transaction. 
The following section explains why.

SET TRANSACTION works differently

Unlike in SQL Server, a standalone SET TRANSACTION command has no effect whatsoever. The following script shows that:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;select current_setting('transaction_isolation'); 
"read committed" 

The reason is simple: in PostgreSql, SET TRANSACTION command affects only the current transaction. The standalone SET TRANSACTION runs in its own transaction, so when it completes, its transaction completes too, restoring the default isolation level. The correct way to set isolation level is as follows:

BEGIN TRANSACTION;SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;select current_setting('transaction_isolation');     
"repeatable read" 

Also we can use an equivalent, but more concise, method:

BEGIN ISOLATION LEVEL REPEATABLE READ;select current_setting('transaction_isolation');     
"repeatable read"

Trying out higher isolation levels

Now that we know how to set isolation levels properly, let us rerun the same scenario under higher isolation levels.
Under REPEATABLE READ, we still get the same problem, and we can commit both transactions.
After restoring the original data, let us replay the scenario under SERIALIZABLE isolation level. As under lower isolation levels, both modifications complete - this is completely different from Sql Server, where one modification would be blocked. However, when we try to commit both transactions, only one will succeed, while another will fail with the following error message:

ERROR:  could not serialize access due to read/write dependencies among transactions

So, essentially PostreSql implementation is optimistic - it only detects conflicts at commit time. 

Mimicking Sql Server's behavior

We can make sure that subqueries that enforce data integrity are blocked by writers, which mimics Sql Server's implementation of isolation levels. The following scripts shows how to do that - note the FOR UPDATE clause in the subquery. In one session, run this:

--restore original dataUPDATE Developers SET Status='Active'WHERE Name='Jeff';
UPDATE Tickets SET Priority='low'WHERE Problem='TPS report hangs';
-- enforce data integrity Sql Server styleBEGIN TRANSACTION;UPDATE Developers SET Status='Vacation'WHERE Name='Jeff'AND NOT EXISTS(SELECT FROM Tickets AS WHERE t.AssignedTo Developers.Name
   
AND t.Priority='high' FOR UPDATE);  

In another session, run this:

BEGIN TRANSACTION;UPDATE Tickets SET Priority='high'WHERE Problem='TPS report hangs'AND (SELECT Status FROM Developers WHERE Name=Tickets.AssignedTo FOR UPDATE)<>'Vacation' 

This command hangs, exactly like it does on SQL Server. Let us return to the first session and commit. The update in the second session immediately completes, but the ticket's priority is not updated:

SELECT FROM Developers;"Jeff""Vacation" SELECT FROM Tickets;"TPS report hangs""low""Jeff"  

As we have seen, FOR UPDATE clause allows us to replicate SQL Server's pessimistic implementation of isolation levels on PostgreSql.

Conclusion

As we have seen, PostgreSql implementation of isolation levels is substantially different. As such, we need to be very careful whenever we port DML that uses subqueries to enforce data integrity - otherwise we may end up with dirty data, just like when we turn on READ_COMMITTED_SNAPSHOT on Sql Server. There is much more about the differences in the implementation of isolation levels - we shall continue next time.
Note that we could use constraints to enforce this business rule and save ourselves a lot of trouble. I've described the solution in my free eBook "Defensive Database Programming", in the chapter entitled "Advanced Use of Constraints".

No comments:

Post a Comment