W dniu 21.06.2011 07:30, Michael Hudson-Doyle pisze:
Hi guys (who else should I be emailing this sort of thing to?),
After making rapid progress on the scheduler last week, today I've hit a bit of a stumbling block. Basically, we're implementing a queue in SQL with this sort of pseudo-code:
- look for a job we could run
- mark it as running
- commit
There's an obvious race condition here if two requests overlap which could result in the same job running on two boards. I know how to prevent this in two ways for postgres:
- This way: http://johtopg.blogspot.com/2010/12/queues-in-sql.html
- Relying on SERIALIZABLE isolation level and retrying request that fail to commit (this is what Launchpad does)
3) I never tried this solution and I don't know how this is different from 2) but let's try this. Hopefully it would still work on SQLite.
A Queue has to operations: get() and put(). Put appends an item to the queue, get removes an item from the front of the list.
CREATE TABLE queue id PRIMARY KEY, aid UNIQUE INTEGER, ...
Put is trivial, as in the code you linked to, just insert, keep the helper 'aid' (allocation ID) NULL.
Get is trickier as we need to make sure that no two consumers allocate the same queue item. Regardless of isolation levels you use a unique constraint will always work, only one client would succeed in getting the UPDATE query below to work. The other will get an IntegrityError and can try again (to fetch another item).
# First we need to get an allocation id. We could use something different here (like a sequence or some other good identifier). In this example I'll just take next largest existing 'aid'. SELECT IFNULL(MAX(aid)+1, 1) AS new_aid FROM QUEUE;
# Then we need to allocate a row for processing. We do that by trying # to update it with the 'aid' we computed above. We only update one row # - the oldest one (here designated by a row with the smallest ID) that # is still not allocated.
UPDATE queue SET aid=new_aid WHERE id=(SELECT MIN(id) FROM queue WHERE aid IS NULL);
The problem is that in tests we run against SQLite, and I don't know if either technique really applies to django+sqlite (the first approach has postgres specific syntax and for the second, django appears to really really want to run in autocommit mode).
Fortunately django almost never runs in autocommit mode. There is an implicit transaction around the whole request. It is easy to control the transaction processing around a piece of code, see [1]. On PostgreSQL we also need to properly implement handling of IngegrityError as it differs significantly from SQLite [2]
This will probably work out OK, but we won't be able to test the postgres variant.
The test script that was in dashboard tree tested the app in sqlite and postgresql. If you want to test that the queue is indeed working let's move it to a dedicated app (django reusability :-) setup a small instance on postgress and bombard it with queries.
Does this sound sane to you guys?
I would rather have a single solution for both databases if possible. If not then perhaps SQL is not the right way to implement a queue but that would significantly complicate our work.
CCing to linaro-dev, somebody might be interested.
Best regards ZK
[1]: https://docs.djangoproject.com/en/dev/topics/db/transactions/?from=olddocs#c... [2]: https://docs.djangoproject.com/en/dev/topics/db/transactions/?from=olddocs#h...
On Tue, 21 Jun 2011 10:17:41 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
W dniu 21.06.2011 07:30, Michael Hudson-Doyle pisze:
Hi guys (who else should I be emailing this sort of thing to?),
After making rapid progress on the scheduler last week, today I've hit a bit of a stumbling block. Basically, we're implementing a queue in SQL with this sort of pseudo-code:
- look for a job we could run
- mark it as running
- commit
There's an obvious race condition here if two requests overlap which could result in the same job running on two boards. I know how to prevent this in two ways for postgres:
- This way: http://johtopg.blogspot.com/2010/12/queues-in-sql.html
- Relying on SERIALIZABLE isolation level and retrying request that fail to commit (this is what Launchpad does)
- I never tried this solution and I don't know how this is different
from 2) but let's try this. Hopefully it would still work on SQLite.
I realized belatedly that this sort of thing isn't an issue on SQLite at all because any writer blocks all other writers in SQLite (the source of the infamous "database is locked" error message).
A Queue has to operations: get() and put(). Put appends an item to the queue, get removes an item from the front of the list.
CREATE TABLE queue id PRIMARY KEY, aid UNIQUE INTEGER, ...
Put is trivial, as in the code you linked to, just insert, keep the helper 'aid' (allocation ID) NULL.
Get is trickier as we need to make sure that no two consumers allocate the same queue item. Regardless of isolation levels you use a unique constraint will always work, only one client would succeed in getting the UPDATE query below to work. The other will get an IntegrityError and can try again (to fetch another item).
I guess this is not very different from the trick that the blog post I link to uses.
# First we need to get an allocation id. We could use something different here (like a sequence or some other good identifier). In this example I'll just take next largest existing 'aid'. SELECT IFNULL(MAX(aid)+1, 1) AS new_aid FROM QUEUE;
A sequence would make sense here, but sqlite doesn't support them.
# Then we need to allocate a row for processing. We do that by trying # to update it with the 'aid' we computed above. We only update one row # - the oldest one (here designated by a row with the smallest ID) that # is still not allocated.
UPDATE queue SET aid=new_aid WHERE id=(SELECT MIN(id) FROM queue WHERE aid IS NULL);
Yeah, OK, this seems to work. It's probably a bit inefficient in that it uses three queries where you can get by with one using postgres-specific syntax but it'll work.
The problem is that in tests we run against SQLite, and I don't know if either technique really applies to django+sqlite (the first approach has postgres specific syntax and for the second, django appears to really really want to run in autocommit mode).
Fortunately django almost never runs in autocommit mode. There is an implicit transaction around the whole request. It is easy to control the transaction processing around a piece of code, see [1].
I got very confused by this, I admit. In general, calling .save() on a model does a COMMIT, unless you're managing the transactions yourself (or using TransactionMiddleware)?
And of course, the way this is being done in the scheduler at the moment is not part of the web app, so the "usual" stuff doesn't always apply.
On PostgreSQL we also need to properly implement handling of IngegrityError as it differs significantly from SQLite [2]
It's just that it dooms transactions, right? The approach we'll take on integrity errors is to rollback and retry so that's easy.
(it's confusing that https://docs.djangoproject.com/en/dev/topics/db/transactions/#transaction-ro... says that the changes made by a.save() will be lost when the default is for .save() to commit, so they must be assuming that you're using TransactionMiddleware).
This will probably work out OK, but we won't be able to test the postgres variant.
The test script that was in dashboard tree tested the app in sqlite and postgresql. If you want to test that the queue is indeed working let's move it to a dedicated app (django reusability :-) setup a small instance on postgress and bombard it with queries.
Does this sound sane to you guys?
I would rather have a single solution for both databases if possible. If not then perhaps SQL is not the right way to implement a queue but that would significantly complicate our work.
Well, SQL isn't the right way to implement a queue in some ways, but I think we can make it work.
Cheers, mwh
On Wed, 22 Jun 2011 10:21:05 +1200, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Well, SQL isn't the right way to implement a queue in some ways, but I think we can make it work.
To be clear, I'm not saying we should stop modelling our job queue in the database.
In other news, I found a nicely simple way of making my problem go away: what we want to prevent is the same job being dispatched to multiple boards. So if we add a "current_job" column to the Device table and make it UNIQUE, then the problem goes away (this is a bit different from a solution to the full queuing problem as in principle overlapping requests for the same board could return the same job but that would be an application-level bug in our system).
Cheers, mwh
W dniu 22.06.2011 00:21, Michael Hudson-Doyle pisze:
Fortunately django almost never runs in autocommit mode. There is an implicit transaction around the whole request. It is easy to control the transaction processing around a piece of code, see [1].
I got very confused by this, I admit. In general, calling .save() on a model does a COMMIT, unless you're managing the transactions yourself (or using TransactionMiddleware)?
Something like that.
It's easier to read the source than read my explanations. django.db.models.base:Model.base_save() calls transaction.commit_unless_managed().
In practice, django is almost always managed so save is never commits. The transaction middleware you mentioned makes django views "managed" using commit_on_success decorator on the request method. You need to disable that manually and I don't see any reason why someone would such a thing.
And of course, the way this is being done in the scheduler at the moment is not part of the web app, so the "usual" stuff doesn't always apply.
On PostgreSQL we also need to properly implement handling of IngegrityError as it differs significantly from SQLite [2]
It's just that it dooms transactions, right?
Not really, it just makes supporting them a little trickier. I do that with deserialization of bundles. In practice it's quite easy if you remember to use save points correctly. See my 1-4 points below for explanation why.
The approach we'll take on
integrity errors is to rollback and retry so that's easy.
You can rollback to a savepoint.
(it's confusing that https://docs.djangoproject.com/en/dev/topics/db/transactions/#transaction-ro... says that the changes made by a.save() will be lost when the default is for .save() to commit, so they must be assuming that you're using TransactionMiddleware).
It's not only transaction middleware. Any code that earlier (in the call history of a particular function) enabled transactions will behave this way.
In reality this is all manageable:
1) Proper code will work regardless autocommit mode :-) 2) PostgreSQL does not break things, you just need to take account for savepoints. 3) Handling transactions manually implies you use savepoints and thus get 1 and 2 right. 4) Handling transactions automatically (around a view) implies you get shielded from most of the complexity, if your actual application logic is happy not to use explicit manual transactions in such cases.
Best regards ZK
On Thu, 23 Jun 2011 03:59:49 +0200, Zygmunt Krynicki zygmunt.krynicki@linaro.org wrote:
W dniu 22.06.2011 00:21, Michael Hudson-Doyle pisze:
Fortunately django almost never runs in autocommit mode. There is an implicit transaction around the whole request. It is easy to control the transaction processing around a piece of code, see [1].
I got very confused by this, I admit. In general, calling .save() on a model does a COMMIT, unless you're managing the transactions yourself (or using TransactionMiddleware)?
Something like that.
It's easier to read the source than read my explanations. django.db.models.base:Model.base_save() calls transaction.commit_unless_managed().
OK.
In practice, django is almost always managed so save is never commits.
Really? TransactionMiddleware isn't installed by the settings file that django-admin startapp creates. Is this just a crummy default that everyone changes?
The transaction middleware you mentioned makes django views "managed" using commit_on_success decorator on the request method. You need to disable that manually and I don't see any reason why someone would such a thing.
See above.
And of course, the way this is being done in the scheduler at the moment is not part of the web app, so the "usual" stuff doesn't always apply.
On PostgreSQL we also need to properly implement handling of IngegrityError as it differs significantly from SQLite [2]
It's just that it dooms transactions, right?
Not really, it just makes supporting them a little trickier. I do that with deserialization of bundles. In practice it's quite easy if you remember to use save points correctly. See my 1-4 points below for explanation why.
The approach we'll take on
integrity errors is to rollback and retry so that's easy.
You can rollback to a savepoint.
Yeah, but there's no reason to do that in getJobForBoard.
(it's confusing that https://docs.djangoproject.com/en/dev/topics/db/transactions/#transaction-ro... says that the changes made by a.save() will be lost when the default is for .save() to commit, so they must be assuming that you're using TransactionMiddleware).
It's not only transaction middleware. Any code that earlier (in the call history of a particular function) enabled transactions will behave this way.
Sure, but it's a bit disingenuous to not mention that in the documentation I think. I guess I should file a ticket.
In reality this is all manageable:
- Proper code will work regardless autocommit mode :-)
- PostgreSQL does not break things, you just need to take account for
savepoints. 3) Handling transactions manually implies you use savepoints and thus get 1 and 2 right.
s/savepoints/savepoints or transactions/ -- there's no requirement to involve savepoints if you don't need to.
- Handling transactions automatically (around a view) implies you get
shielded from most of the complexity, if your actual application logic is happy not to use explicit manual transactions in such cases.
Yeah, it's very good default for webapps. Launchpad does that, but goes even further: it automatically does a rollback when a GET method completes, and uses SERIALIZABLE isolation level, retrying a method if the commit fails due to a serialization violation.
However the code I'm working on is not part of a webapp :)
https://code.launchpad.net/~mwhudson/lava-scheduler/daemon-v1/+merge/65616 btw.
Cheers, mwh