Wednesday, March 12, 2014

Learning PostgreSql: Fun with REPEATABLE READ

In this post we shall run some examples under REPEATABLE READ, and see how they behave differently. 

Setting up test data

On SQL Server, run the following:

CREATE TABLE Tickets(
   ID INT NOT NULL,
   Problem VARCHAR(100) NOT NULL,
   SpaceFiller CHAR(200) NOT NULL
);
INSERT INTO Tickets(
   ID ,
   Problem,
   SpaceFiller)SELECT Number*10, 'Problem '+CAST([Number] AS VARCHAR(10))+'0', 'Space Filler'FROM data.Numbers;
ALTER TABLE Tickets ADD PRIMARY KEY(ID);
UPDATE Tickets SET Problem = 'Add 16Gb Ram to Kenny''s workstation'WHERE ID = 90090;

On PostgreSql, we do not need a Numbers table - there is a built in function generate_series, which is very useful:

CREATE TABLE Tickets(
   ID INT NOT NULL,
   Problem VARCHAR NOT NULL
);
TRUNCATE TABLE Tickets;
INSERT INTO Tickets(
   ID ,
   Problem)SELECT generate_series*10, 'Problem ' || CAST(generate_series AS VARCHAR) FROM generate_series(1, 16000);
   ALTER TABLE Tickets ADD PRIMARY KEY(ID);
UPDATE Tickets SET Problem = 'Add 16Gb Ram to Kenny''s workstation'WHERE ID = 90090; 

Identical selects running under REPEATABLE READ on Sql Server may return different results 

The following scripts demonstrate that reads are not completely repeatable, despite the isolation level's name being REPEATABLE READ and as such claiming otherwise.
In one tab, run this:

BEGIN TRANSACTION;UPDATE Tickets SET Problem = 'Replace printer in NW corner'WHERE ID = 49000;

In another tab, run this:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;BEGIN TRANSACTION;SELECT COUNT(*) AS TotalProblems, SUM(CASE WHEN Problem LIKE '%Kenny%' THEN 1 END) AS [Kenny's Problems],SUM(CASE WHEN Problem LIKE '%printer%' THEN 1 END) AS [Printer Problems]
FROM Tickets WHERE ID BETWEEN 0 AND 100000;

This select is blocked by the uncommitted update in the first tab and as such it does not complete. Let us get back to the first tab and run the following:

UPDATE Tickets SET ID=101WHERE ID = 90090;
COMMIT;

This update physically moves Kenny's ticket from a page that has not been read by the select yet to a page which has already been read. Once the transaction in the first tab commits, the select in the second tab completes with the following results: 

TotalProblems Kenny's Problems Printer Problems------------- ---------------- ----------------10000         NULL             1

Clearly Kenny's ticket has not been read at all. If we rerun this select, in the same transaction, we shall get different results, as follows:

TotalProblems Kenny's Problems Printer Problems------------- ---------------- ----------------10001         1            1

Note that Kenny's problem is included in the second result set, and the total in TotalProblems column is now correct as well.
As we have seen,
  • when we rerun queries in the same transaction, we may get different results 
  • selects may return rows inserted after the select started - such as the only ticket for "Printer problem"
  • selects may fail to return some rows that were committed before the transaction started - note that the ticket for "Kenny's problem" is completely missing from the first result set. This effect may also cause incorrect totals. Note that TotalProblems column is different in the first and second result sets.
If we rerun this example on PostgreSql, the first select is not blocked by the modification, and consequent selects return the same results as the first one.

REPEATABLE READ on Sql Server does not prevent lost updates

The following scenario demonstrates how an update can be lost. To begin the scenario, run the following script in one tab:

BEGIN TRANSACTION;
UPDATE TicketsSET Problem = 'Water cooler making funny noises'WHERE ID = 4010;

In another tab, run this:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;BEGIN TRANSACTION;
UPDATE TicketsSET Problem = 'Printer problem ' + CAST(ID AS VARCHAR(5))WHERE ID BETWEEN 4000 AND 4100; 

The update is waiting on the uncommitted change. As soon as we commit the transaction in the first tab, the update completes. Note that it overwrites the change from the first tab:

SELECT Problem FROM TicketsWHERE ID BETWEEN 4000 AND 4100;
Problem----------------------------------------------------------------------------------------------------(snip)Printer problem 4010(snip) 
Similarly, a modification running under REPEATABLE READ can update or delete rows that were inserted after the transaction started. 
To reproduce, let us begin archiving tickets in one tab: 
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;BEGIN TRANSACTION;
SELECT * INTO ArchivedTickersFROM TicketsWHERE ID BETWEEN 4000 AND 4100;
In another tab, let us add a new ticket: 
INSERT Tickets(ID, Problem, SpaceFiller)VALUES(4005, 'Water cooler making funny noises', 'Test value'); 
Let us get back to the first tab and complete the archiving:
DELETE FROM TicketsWHERE ID BETWEEN 4000 AND 4100;COMMIT;
SELECT COUNT(*) FROM TicketsWHERE ID BETWEEN 4000 AND 4100;
-----------0

Clearly the new ticket number 4005 was deleted along with the archived ones. This is a textbook example of a lost update.

REPEATABLE READ on PostgreSql prevents lost updates

Let us rerun the previous scenario - we shall observe a very different behavior. In one session, let us start archiving a range of tickets:

BEGIN ISOLATION LEVEL REPEATABLE READ;;
CREATE TABLE archived_ticketsAS 
SELECT *FROM TicketsWHERE ID BETWEEN 4000 AND 4100;

In another session, let us add a new ticket:

INSERT INTO Tickets(ID, Problem)VALUES(4005, 'Water cooler making funny noises'); 

In the first session, let us complete the archiving. The unarchived ticket number 4005 has not been deleted:

DELETE FROM TicketsWHERE ID BETWEEN 4000 AND 4100;
COMMIT;
SELECT Problem FROM TicketsWHERE ID BETWEEN 4000 AND 4100;
"Water cooler making funny noises"

Let us rerun the scenario that updates an existing ticket. In one session, run this:

BEGIN TRANSACTION;
UPDATE TicketsSET Problem = 'Add index on Shipments.Height'WHERE ID = 4010;

In another session run this:

BEGIN ISOLATION LEVEL REPEATABLE READ;;
UPDATE TicketsSET Problem = 'Printer problem ' || CAST(ID AS VARCHAR(5))WHERE ID BETWEEN 4000 AND 4100;

Commit the transaction in the first session, 
and the second transaction detects a lost update and blows up, as follows: "ERROR:  could not serialize access due to concurrent update". 
As we have seen, PostgreSql has detected and prevented both lost updates. As such, we do not need to troubleshoot two subtle bugs that are rather difficult to reproduce.

Conclusion

As we have seen REPEATABLE READ is implemented very differently on PostgreSql, so we need to be very careful when we port T-SQL that uses REPEATABLE READ. As we develop against PostgreSql, we can and should take advantage of its optimistic implementation of isolation levels, rather than trying to exactly replicate Sql Server's behavior.

No comments:

Post a Comment