In the last blog post we talked about what cron looks like in telephony. We also discussed why we need a distributed cron setup. You can read about it here: http://www.iperity.com/distributed-cron-telephony-cron-13/ That post ended with the following section:
What we designed
We designed a distributed cron system that runs cronjobs at the right time, on at most one system at any given time. A job is also much more likely to run on the system that is most idle. In addition, if a system fails while running a cronjob, this is detected and the cronjob is run elsewhere at the next interval.
The steep learning curve of ever-increasing complexity
This might not sound too hard, but during design we came across a couple of challenges. What if a cronjob happens to run longer than its execution interval? For example, CDR aggregation runs every 5 minutes, but what if a single aggregation takes 7 minutes because of a peak load a while ago? We don’t want two instances of the aggregation job running simultaneously.
How do we ensure that, when it is time to run a job, all servers agree on which one will execute it? We could use some sort of election, but the servers are spread across the country, so network delays make race conditions a realistic risk.
We don’t want “job fights” between servers, with traffic for every job going to all servers “discussing” who will run it. Also, if a certain system is executing the cron job but breaks down, how will the other servers know that the system no longer holds its claim on that cron job and that the job is “freely available for execution” again?
The cron daemons available on most UNIX systems follow a seemingly simple interval specification for cron jobs: five fields specifying the minute, the hour, the day of the month, the month and the day of the week. Implementing this format in a cron daemon is also intuitive. Every minute, check which of the given interval specifications match the current minute. If a specification matches, run the job.
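To make the "check every minute" approach concrete, here is a minimal sketch in Python. It is not taken from any real cron daemon. The names parse_field and matches are made up for illustration, and the sketch makes simplifying assumptions: only numbers, lists, ranges, steps and * are supported, and all five fields must match (real cron daemons treat day of the month and day of the week specially when both are restricted).

```python
from datetime import datetime

def parse_field(field, lo, hi):
    """Expand one cron field (e.g. '8-22/2', '1,3,5', '*') into a set of ints.

    Hypothetical helper: names like 'mon' or 'jan' are not supported.
    """
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step))
    return values

def matches(spec, now):
    """Return True if the five-field spec matches the minute of 'now'.

    Simplification: every field must match (logical AND).
    """
    minute, hour, dom, month, dow = spec.split()
    # cron counts 0 = Sunday ... 6 = Saturday; isoweekday() has Sunday = 7
    cron_dow = now.isoweekday() % 7
    return (now.minute in parse_field(minute, 0, 59)
            and now.hour in parse_field(hour, 0, 23)
            and now.day in parse_field(dom, 1, 31)
            and now.month in parse_field(month, 1, 12)
            and cron_dow in parse_field(dow, 0, 6))
```

A daemon built this way would wake up once a minute and call matches(spec, datetime.now()) for every job. Note that it only answers "should this run right now?", which is exactly what stops being enough in the distributed case.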
However, in distributed cron, an election needs to take place BEFORE the job is executed. This calls for a whole different approach to interval specifications. Instead of determining whether the current time matches a specification, we have to determine when the NEXT matching time will be. This proved to be rather difficult, but we managed.
If you’re a programmer, think about the following task: a cronjob has an interval specification of every two hours between 8 o’clock in the morning and 10 o’clock in the evening, at 20 minutes past the hour, but only on Mondays, Wednesdays and Fridays during the summer. In “cron language”, this is:
20 8-22/2 * 6-8 1,3,5
So, how do you build an algorithm which can read the above specification and then determine when “two minutes before the earliest future execution” is, given your current time? For the sake of simplicity, we have not even considered the fact that different companies on the platform reside in different time zones…
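As one possible answer, here is a minimal sketch in Python. It is not the algorithm we implemented, just an illustration of the idea. It walks forward one day at a time and, on the first day whose month, day of the month and day of the week match, picks the earliest matching hour and minute. The names expand, next_run and election_time, the two-minute default lead and the five-year search horizon are all assumptions made for this example. It uses naive datetimes, so time zones are ignored, and it requires all five fields to match.

```python
from datetime import datetime, timedelta

def expand(field, lo, hi):
    """Expand one cron field ('*', '6-8', '8-22/2', '1,3,5') into a set of ints."""
    out = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            start, end = map(int, rng.split("-"))
        else:
            start = end = int(rng)
        out.update(range(start, end + 1, int(step or 1)))
    return out

def next_run(spec, now, horizon_days=366 * 5):
    """Return the earliest matching minute strictly after 'now', or None."""
    minute_f, hour_f, dom_f, month_f, dow_f = spec.split()
    minutes = sorted(expand(minute_f, 0, 59))
    hours = sorted(expand(hour_f, 0, 23))
    doms = expand(dom_f, 1, 31)
    months = expand(month_f, 1, 12)
    dows = expand(dow_f, 0, 6)  # cron style: 0 = Sunday

    # The earliest candidate is the next whole minute after 'now'.
    start = now.replace(second=0, microsecond=0) + timedelta(minutes=1)
    day = start.date()
    for _ in range(horizon_days):
        if (day.month in months and day.day in doms
                and day.isoweekday() % 7 in dows):
            # Hours and minutes are sorted, so the first hit is the earliest.
            for h in hours:
                for m in minutes:
                    candidate = datetime(day.year, day.month, day.day, h, m)
                    if candidate >= start:
                        return candidate
        day += timedelta(days=1)
    return None  # nothing matched within the search horizon

def election_time(spec, now, lead=timedelta(minutes=2)):
    """Moment at which an election for the next run could start."""
    run = next_run(spec, now)
    return None if run is None else run - lead
```

For example, next_run("20 8-22/2 * 6-8 1,3,5", datetime(2024, 6, 3, 8, 21)) skips the 8:20 slot that has just passed and lands on 10:20 the same Monday, so the election would start at 10:18. If the current time is already within the lead before the next run, the returned election time lies in the past, which a real scheduler would have to handle.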