IPerity Blog

Distributed cron: Design & Implementation (3/3)

In the last blog post we discussed the timing issues that arise when implementing cron, and we specified the interval paradigms. You can read it here: http://www.iperity.com/distributed-cron-challenges-in-cron-timing-23/

The design

All these considerations have led us to the following implementation. Every cronjob in the database has an “allocated_until” field, indicating until what timestamp that cronjob is allocated to a certain server. The executing server updates this field every few minutes. If that server fails, the field will after a while no longer contain a timestamp in the future, and the job thus becomes eligible for execution by another system again.
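The lease semantics of the “allocated_until” field can be sketched in a few lines. This is a minimal illustration, not the actual code: the field names, the renewal interval, and the in-memory dictionary standing in for a database row are all assumptions.

```python
import time

LEASE_SECONDS = 300  # the executing server renews well within this window


def is_claimable(allocated_until: float, now: float) -> bool:
    """A job is eligible for (re-)allocation once its lease has lapsed."""
    return allocated_until <= now


def renew_lease(job: dict, now: float) -> None:
    """Called every few minutes by the executing server; in practice
    this would be an UPDATE against the jobs table."""
    job["allocated_until"] = now + LEASE_SECONDS


job = {"name": "cleanup", "allocated_until": 0.0}
now = time.time()
assert is_claimable(job["allocated_until"], now)      # never claimed: free
renew_lease(job, now)
assert not is_claimable(job["allocated_until"], now)  # lease currently held
```

If the executing server dies, renewals stop, the timestamp ages past `now`, and any other system can pick the job up again.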

That was the easy part. Now how should we determine which job is run when, and where? Some form of election will have to take place. Within the platform, we use IceStorm[3], the message/event bus of ZeroC’s Ice[2] middleware. Every system refreshes all the cronjobs from the database every hour.

A couple of minutes before a job is due for execution, all systems will calculate when to send out a cronjob claim message. This calculation is based on the current load of the system: the higher the load, the later the claim message will be sent. When a system receives a claim message, it will mark that job as allocated to another system and will not send out its own claim message anymore.
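The post does not give the exact load-to-delay mapping, but any monotonic function works: the point is only that busier systems claim later. A simple linear sketch, with illustrative constants:

```python
def claim_delay(load_average: float,
                base_delay: float = 1.0,
                per_load_unit: float = 2.0) -> float:
    """Seconds to wait before sending our claim message. Busier systems
    wait longer, so an idle system's claim usually arrives first. The
    linear formula and both constants are illustrative assumptions; the
    post only states that a higher load means a later claim message."""
    return base_delay + per_load_unit * load_average


# An idle system (load 0.0) claims after 1 s; a busy one (load 4.0) after 9 s.
assert claim_delay(0.0) < claim_delay(4.0)
```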

Simultaneous claim messages?

Because the systems are located across the country, it is of course likely that systems with a similar load will simultaneously send out a claim message. The claim message therefore also contains a randomly determined priority. If a system receives a claim message from another system after it has just sent out its own claim message, it will compare priorities to determine which system will actually run the job.

This means there is no need to re-elect in case of race conditions caused by network latency. When a system has finished executing a job, it sends a release message for that job, signifying that the job may be executed again by any system.
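The priority-based tie-break can be sketched as follows. The message layout is an assumption, and breaking an exact priority tie on the system id is also an assumption made only for this sketch; the post deals with that rare case later, in its state machine.

```python
import random


def make_claim(system_id: str) -> dict:
    # Every claim message carries a randomly determined priority.
    return {"system": system_id, "priority": random.random()}


def winner(mine: dict, theirs: dict) -> dict:
    """Compare a claim we receive against the one we just sent out.
    The higher priority wins; the system-id tie-break for equal
    priorities is an illustrative assumption, not the post's method."""
    if mine["priority"] != theirs["priority"]:
        return mine if mine["priority"] > theirs["priority"] else theirs
    return mine if mine["system"] > theirs["system"] else theirs
```

Both systems run the same deterministic comparison on the same pair of claims, so they agree on the winner without exchanging further messages.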

This seemingly simple solution covers most problematic situations: it handles network latency and simultaneous cronjob claims, it prefers to run jobs on idle systems, and it prevents simultaneous runs of a single cronjob when its execution takes longer than its interval.

Regularly updating the “allocated_until” field in the database ensures that when a system which has claimed a job breaks down and thus no longer updates this field, the other systems will eventually reset the job status from “allocated to another system until we receive a release message” (which will of course never be sent) to “freely available for execution”.
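During the hourly refresh from the database, each system can apply this reset locally. A sketch, in which the state names and fields are illustrative (the actual states are listed further on in the post):

```python
def refresh_job(job: dict, now: float) -> dict:
    """Hourly refresh sketch: a job marked as allocated to another
    system whose lease has lapsed is reset to freely schedulable,
    even though no release message will ever arrive for it.
    State and field names are assumptions for this sketch."""
    if job["state"] == "allocated_elsewhere" and job["allocated_until"] <= now:
        job["state"] = "scheduled"
    return job


stale = {"state": "allocated_elsewhere", "allocated_until": 50.0}
assert refresh_job(stale, now=100.0)["state"] == "scheduled"
```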

The implementation

Implementation proved complicated. What if a system re-reads the list of cron jobs from the database and the “allocated_until” field contains a timestamp in the past, but that same system is executing that job? What if a system receives a cron job release message from another, but was actually executing that job itself?

In the rare case that two systems send out a claim message simultaneously AND have the same priority, what should be done? What should a system do when it refreshes the job list from the DB and finds out it is running a job that no longer exists? Or when it receives a claim message for a new job that it does not know anything about?


All these more or less realistic scenarios come into play when implementing distributed cron. After two iterations of implementation, we decided to implement every cron job as a state machine. The number of internal states was finally reduced to five:
* Loaded in memory only (and thus allocated to another system)
* Scheduled (start election a few minutes before next execution time)
* Electing (in election phase, but didn’t send claim message yet)
* Elected (in election phase, sent claim message)
* Executing (currently executing cronjob)

The number of events was set at seven:
* Claimed (received a claim message from another system)
* Done (the internally running cronjob has finished executing)
* Readnew (read a previously unknown cronjob from the database)
* Refresh (re-read a known cronjob from the DB, possibly with a new interval specification)
* Release (another system has released a cronjob)
* Remove (a cronjob has been removed from the database)
* Timer (the cronjob timer has fired; the reason depends on the state)

Envision a five-by-seven (35-cell) matrix with these states and events. Every combination requires a set of checks to perform and actions to execute. In the end, describing this matrix was the most time-consuming part of the implementation. Fortunately, when we were done, we knew we had determined what should be done in every possible scenario.
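The five states and seven events above map naturally onto a transition table. A minimal Python sketch, with only a handful of the 35 cells filled in; the chosen transitions are illustrative readings of the post, not the actual table:

```python
from enum import Enum, auto


class State(Enum):
    LOADED = auto()     # in memory only, allocated to another system
    SCHEDULED = auto()  # election starts shortly before next execution
    ELECTING = auto()   # in election phase, no claim message sent yet
    ELECTED = auto()    # in election phase, claim message sent
    EXECUTING = auto()  # currently executing the cronjob


class Event(Enum):
    CLAIMED = auto()
    DONE = auto()
    READNEW = auto()
    REFRESH = auto()
    RELEASE = auto()
    REMOVE = auto()
    TIMER = auto()


# A few illustrative cells of the 35-cell matrix; the real table must
# define checks and actions for every (state, event) pair.
TRANSITIONS = {
    (State.SCHEDULED, Event.TIMER):   State.ELECTING,   # election window opens
    (State.ELECTING,  Event.TIMER):   State.ELECTED,    # our claim delay fired
    (State.ELECTING,  Event.CLAIMED): State.LOADED,     # another system claimed
    (State.ELECTED,   Event.TIMER):   State.EXECUTING,  # we won the election
    (State.EXECUTING, Event.DONE):    State.SCHEDULED,  # send release, reschedule
    (State.LOADED,    Event.RELEASE): State.SCHEDULED,  # job is free again
}


def step(state: State, event: Event) -> State:
    # Unknown combinations keep the current state in this sketch; the
    # actual implementation specifies behaviour for all 35 cells.
    return TRANSITIONS.get((state, event), state)
```

Encoding the matrix as a lookup table keeps every (state, event) decision in one place, which is what made it possible to reason about all the edge cases above exhaustively.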

The unit tests for distributed cron were in some cases even more code than the actual implementation. As expected, implementation took more time than initially estimated. However, the developers now have the comfort of delivering a robust and reliable cron implementation in the platform.

If you are considering implementing something similar, give us a call. We’re happy to help.

[1] http://en.wikipedia.org/wiki/Cron#Configuration_file
[2] http://www.zeroc.com/
[3] https://doc.zeroc.com/display/Ice/IceStorm