When developing a highly available, multi-datacentre (and thus multi-city) telephony platform, running system administration jobs at set intervals takes far more thought than cron on a single system. With one server, everything is simple: your storage is in one location, regular cron jobs run on that system, and there is no risk of split-brain scenarios.
But with Compass, we have to take all of those things into consideration when we run cron jobs. What kind of tasks would a telephony platform have to run at set intervals, you might ask? Our customers provided the answers in the form of feature requests. Two of the many possibilities are described below.
Telephony cron jobs
Since our platform is designed with location-independent employees in mind, every user logs on to a phone. A phone cannot do anything by itself; it only works once someone is logged in. The phone then assumes the identity (or identities) of that user, ringing when that user’s extension is dialled and using that user’s caller ID when dialling out.
When people switch desks almost daily, it is convenient to log all users off from the phones they used that day. This could be done at midnight, for example.
Another task is the aggregation of call detail records. We keep logs of everything that happens to each call, such as entering a queue, playing a sound file, call transfers and the like. Some customers only need a simple overview of how many calls were placed that day, not what exactly happened with every call. For these customers, we aggregate all call events. This is done every 5 minutes, so the aggregated data never lags far behind the current time.
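To make the aggregation idea concrete, here is a minimal sketch of collapsing per-call events into a calls-per-day overview. The event layout (call ID, timestamp, event type), the event names and the function name `calls_per_day` are assumptions made for illustration; they are not the actual Compass data model.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event records: (call_id, timestamp, event_type).
events = [
    ("call-1", datetime(2024, 1, 1, 9, 0, 5), "enter_queue"),
    ("call-1", datetime(2024, 1, 1, 9, 0, 40), "play_sound"),
    ("call-2", datetime(2024, 1, 1, 9, 3, 0), "enter_queue"),
    ("call-2", datetime(2024, 1, 1, 9, 4, 10), "transfer"),
    ("call-3", datetime(2024, 1, 2, 8, 0, 0), "enter_queue"),
]


def calls_per_day(events):
    """Collapse per-call events into a count of distinct calls per day."""
    first_seen = {}
    for call_id, ts, _event_type in events:
        # Attribute each call to the day of its earliest event.
        if call_id not in first_seen or ts < first_seen[call_id]:
            first_seen[call_id] = ts
    return Counter(ts.date().isoformat() for ts in first_seen.values())


summary = calls_per_day(events)
```

A job like this would be rerun every 5 minutes over the most recent events, which is why it matters later on that only one system performs it at a time.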
Two types of cron jobs are described here: one that is platform wide (aggregation is done for all customers) and one that is company specific (log off all users at midnight).
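One way to picture the two job types is as entries in a shared job list, where an optional company identifier marks the scope. This is only a sketch: the class `CronJob`, its fields, the command names and the company ID `acme` are all hypothetical. The schedules use the standard five-field cron syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CronJob:
    name: str
    schedule: str                      # standard five-field cron expression
    command: str                       # hypothetical command name
    company_id: Optional[str] = None   # None means the job is platform wide

    @property
    def is_platform_wide(self) -> bool:
        return self.company_id is None


jobs = [
    # Platform wide: aggregate call events every 5 minutes.
    CronJob("aggregate-call-events", "*/5 * * * *", "aggregate_events"),
    # Company specific: log off all users of one company at midnight.
    CronJob("logoff-all-users", "0 0 * * *", "logoff_users", company_id="acme"),
]
```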
From a system management point of view, one would just write some lines in a cron table on one system with the right time intervals and commands. But… what if that server becomes unresponsive? Or is rather busy? In either of these cases, we don’t want the cron jobs to be run on that single system.
The distributed cron was born. We want a list of cron jobs that are executed at the right intervals and preferably on the system that is most idle. However, since some cron jobs write to a platform wide shared storage, running a single job on more than one system simultaneously would wreak havoc.
We designed a distributed cron system that runs cron jobs at the right time, on at most one system at any given time. A job is also much more likely to run on the system that is most idle. In addition, if a system fails while running a cron job, this is detected and the job is run elsewhere at the next interval.
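The three properties above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of how Compass implements it: it assumes shared storage that offers an atomic "claim" per job and interval, and it assumes busier nodes wait longer before attempting that claim. `SharedStore`, `claim_delay`, `run_slot` and the node names are hypothetical.

```python
import threading


class SharedStore:
    """Stand-in for platform-wide shared storage with an atomic claim."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claims = {}  # (job, slot) -> name of the node that claimed it

    def claim(self, job, slot, node):
        """Atomically claim one run of a job; True only for the first caller."""
        with self._lock:
            if (job, slot) in self._claims:
                return False
            self._claims[(job, slot)] = node
            return True


def claim_delay(load, max_delay=2.0):
    """Busier nodes wait longer before claiming, so idle nodes tend to win."""
    return max(0.0, min(1.0, load)) * max_delay


def run_slot(store, job, slot, nodes):
    """Simulate one interval. nodes maps node name -> (load, alive)."""
    # Living nodes attempt the claim in order of their load-based delay.
    attempts = sorted(
        (claim_delay(load), name)
        for name, (load, alive) in nodes.items()
        if alive
    )
    winner = None
    for _delay, name in attempts:
        if store.claim(job, slot, name):
            winner = name  # every later attempt for this slot is rejected
    return winner


store = SharedStore()
nodes = {"node-a": (0.9, True), "node-b": (0.1, True), "node-c": (0.5, True)}

first = run_slot(store, "aggregate", 0, nodes)   # the most idle node wins
nodes["node-b"] = (0.1, False)                   # that node then fails
second = run_slot(store, "aggregate", 1, nodes)  # next interval runs elsewhere
```

In this simulation the ordering is deterministic. On real systems, clock differences and network latency would make the most idle node only more likely to win, not certain to. Because each interval has its own claim, a node that dies mid-run does not block the job at the next interval.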
Next chapter: Challenges in cron timing
In my next post I will explain what challenges we encountered regarding timing issues.