Andy Doan <andy.doan@linaro.org> writes:
On 01/28/2013 02:27 AM, Dave Pigott wrote:
Looks like we need to plan for some downtime while this happens. I'll make sure I take the boards offline. Can we run something outside the lab to keep availability of job submission?
This seems to be an idea worth at least discussing.
In fact, I think we have discussed it before :-)
It seems like it would be really cool to create some sort of simple scheduler service that basically accepts all job requests and saves them to disk.
Yeah, feels like it should be fairly simple. It could run in EC2 or whatever.
Maybe it's preseeded with a job ID so that we can return unique job IDs back to the caller.
I think this is essential. It might help to move away from using the database-assigned primary key as the ID we present to the user. One way this could work, though, is that people _always_ submit to this simple service in the cloud, in which case it gets to assign the IDs.
We then create some type of import tool that, once the service is back online, can suck in this data and execute the jobs.
Right.
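To make the idea above concrete, here is a minimal sketch of such a spooling service, assuming jobs are JSON-serialisable dictionaries. The names `JobSpool`, `submit`, `drain` and the `next_id` counter file are hypothetical and not part of any existing code; the sketch is single-process only and ignores locking, authentication and crash safety.

```python
import json
from pathlib import Path


class JobSpool:
    """Hypothetical spool: accept job requests and keep them on disk."""

    def __init__(self, spool_dir, first_job_id):
        self.spool_dir = Path(spool_dir)
        self.spool_dir.mkdir(parents=True, exist_ok=True)
        # The counter is preseeded once, so IDs handed out here do not
        # depend on the main database's primary keys.
        self.counter_path = self.spool_dir / "next_id"
        if not self.counter_path.exists():
            self.counter_path.write_text(str(first_job_id))

    def submit(self, job_definition):
        """Save the job to disk and return the ID given to the caller."""
        job_id = int(self.counter_path.read_text())
        self.counter_path.write_text(str(job_id + 1))
        record = {"id": job_id, "definition": job_definition}
        (self.spool_dir / f"job-{job_id}.json").write_text(json.dumps(record))
        return job_id

    def drain(self, execute):
        """Import tool: hand each spooled job to `execute` in ID order."""
        paths = sorted(
            self.spool_dir.glob("job-*.json"),
            key=lambda p: int(p.stem.split("-")[1]),
        )
        imported = []
        for path in paths:
            record = json.loads(path.read_text())
            execute(record["id"], record["definition"])
            path.unlink()  # remove only after the job has been handed over
            imported.append(record["id"])
        return imported
```

Once the lab is back online, something like `spool.drain(schedule_job)` would replay the queued submissions, where `schedule_job` stands in for whatever the real scheduler exposes.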
Thoughts? Am I solving the wrong problem?
Well. The thing that occurs to me is that what we are doing here is building a system that aims to be available for writes in the face of network partitions, and other people have already built systems that have this property -- it is basically the whole principle behind Amazon's famous Dynamo [1] and the systems it inspired, like Riak and Cassandra. It seems unlikely that we'd do a better job than them.
One thing that I don't completely understand how to replicate, if we have a simple job-accepting scheduler in the cloud, is the sanity check that the submitting user is able to submit results to the stream specified in the job -- or even whether the token provided while submitting the job is valid, come to think of it!
Cheers, mwh
[1] Everyone in computing should take the 40 minutes or so it takes to read this paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf if only for this quote:
"For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados."