Using the Grid/ToPoS
ToPoS - A Token Pool Server for Pilot Jobs
When you submit tens or hundreds of jobs to the Grid, you may find yourself writing software that checks the status of all those jobs and tries to find out which ones have succeeded and which ones should be resubmitted.
In addition, when submitting a large number of jobs you will often find that they need to operate on the same database or some other static data source. For every job submission this database then has to be copied to the node the job is running on. This introduces a large overhead per job.
A solution to these problems is to use a pilot job framework. Such frameworks start by submitting a number of pilot jobs to the Grid. Pilot jobs are like normal jobs, but instead of executing a task directly they contact a central server once they are running on a worker node. Then, and only then, are they assigned a task, after which they get their data and start executing. The central server handles the requests from pilot jobs and keeps a log of which tasks are being handled, which are finished, and which can still be handed out. When a pilot job contacts the central task server, it can request a new task, report the successful completion of a task it has just executed, or report that it is still working on the task it received previously. When the task server doesn't hear about the progress or completion of a task it has distributed, it assumes that the pilot job is dead or that the job has failed. As a result, the task will be assigned to another pilot job once that job requests a new task.
A pilot job does not die after it has successfully completed a task, but immediately asks for another one. Pilot jobs only die when there is nothing left to do or when their wall clock time is up. This reduces the overhead of job submission considerably.
ToPoS is the system that implements all this. It is a very simple system, which can be very effective. The name ToPoS refers to Token Pools. The idea is that all a task server needs to provide to pilot jobs is a token which uniquely identifies a task. This token can be as simple as a unique number. All the pilot job has to do is map the token to a task, execute it and report back to the token pool server. Grid computing then comes down to creating lists of tasks and presenting them as tokens in a pool.
The situation is depicted in the above figure. The user has to create tokens and submit pilot jobs. The pilot jobs contact the Token Pool Server and request a token. The Token Pool Server hands out a unique token to each requesting job. Of course, a pilot job has to know what to do with the token. The simplest form a token can take is a number. In that case, the pilot job has to map the number to the specific task it has to compute. When it finishes this task, it deletes the token from the Token Pool on the server.
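The request, map, execute, delete cycle described above can be sketched in a few lines of Python. This is a minimal in-memory sketch, not the ToPoS client: the pool is a plain set of numbers, the names `make_pool`, `task_for` and `pilot_job` are hypothetical, and squaring the number stands in for a real computation.

```python
def make_pool(n):
    """Create a pool holding the numeric tokens 1..n."""
    return set(range(1, n + 1))


def task_for(token):
    """Map a numeric token to a task (here: a stand-in that squares it)."""
    return lambda: token * token


def pilot_job(pool, results):
    """Process tokens until the pool is empty, then stop (the pilot 'dies')."""
    processed = 0
    while pool:
        token = min(pool)                  # "request" a token from the pool
        results[token] = task_for(token)() # map the token to a task and run it
        pool.discard(token)                # delete the token only after success
        processed += 1
    return processed


pool = make_pool(5)
results = {}
processed = pilot_job(pool, results)
print(processed, results)
```

Because a token is only deleted after its task has finished, a pilot job that crashes halfway leaves its token in the pool for another pilot job to pick up.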
The idea of tokens in a pool can be extended in several ways. Tokens can be files, parameters, numbers, etc. What is important about the concept of a token is that it somehow identifies a unique task to be computed.
For example, consider a researcher who wants to align 100 genome sequences against a database using a certain program P. ToPoS can be used as follows. The 100 genome sequences are not very large and can be uploaded as tokens in ToPoS. The researcher can do this with an Internet browser. The database can be expected to be large, and it is usually recommended to upload it to a Storage Element and then replicate it several times. See the section about Grid storage. The researcher then submits about a hundred or more pilot jobs, each containing the program P and a reference to the database. Each job also contains a command to request a token from ToPoS.
ToPoS receives requests for tokens (in this case genome sequences) from running jobs and deals them out uniquely. When a job finishes its computation, it tells ToPoS to delete the token and then requests a new one. ToPoS keeps dealing out tokens as long as there are tokens in the pool.
In this scenario, it is possible for several jobs to receive the same token, so that a computation is performed more than once. In general, this is not a problem. However, it is possible to "lock" tokens. When tokens are locked, they are only dealt out once. Each lock comes with a timeout, which means that a job should confirm possession of the token before this timeout expires. If the timeout expires, the lock is removed and the token is free for distribution again.
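The locking behaviour can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the ToPoS implementation: the class `LockingPool` and its methods are hypothetical, and time is passed in explicitly as a number so that the example is deterministic.

```python
class LockingPool:
    """In-memory stand-in for a token pool whose tokens can be locked."""

    def __init__(self, tokens):
        self.tokens = dict(tokens)  # token name -> token content
        self.locks = {}             # token name -> time at which the lock expires

    def next_token(self, now, timeout):
        """Lock and return the first token that is not locked at time `now`."""
        for name in sorted(self.tokens):
            if self.locks.get(name, float("-inf")) <= now:  # free or expired
                self.locks[name] = now + timeout
                return name
        return None  # every remaining token is currently locked

    def refresh(self, name, now, timeout):
        """Confirm possession of a token before its lock expires."""
        if name in self.tokens and self.locks.get(name, float("-inf")) > now:
            self.locks[name] = now + timeout
            return True
        return False

    def delete(self, name):
        """Remove a finished token together with its lock."""
        self.tokens.pop(name, None)
        self.locks.pop(name, None)


pool = LockingPool({"seq1": "ACGT", "seq2": "GGCA"})
a = pool.next_token(now=0, timeout=10)          # first job locks seq1 until t=10
b = pool.next_token(now=1, timeout=10)          # second job locks seq2 until t=11
c = pool.next_token(now=2, timeout=10)          # nothing free for a third job
kept = pool.refresh(a, now=8, timeout=10)       # first job confirms possession
d = pool.next_token(now=12, timeout=10)         # seq2 was never confirmed
pool.delete(a)                                  # first job finished seq1
print(a, b, c, kept, d)
```

In this run the second job never confirms possession of its token, so its lock expires and the token is handed out again, while the first job keeps its token by refreshing the lock in time.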
Advantages
Notice that when you use ToPoS you do not have to worry about job failures. When jobs fail, they will not request and process tokens. The user should not look at the pilot jobs he/she has submitted but at the number of tokens being processed. There is no need to keep track of submitted jobs and resubmit those (and only those) that failed. Remember that a pilot job will die once all tokens have been processed.
A second advantage of using ToPoS is the reduced overhead of job submission. Once a pilot job is in place it will consume tokens as long as there are any, or until its wall clock time (the maximum time a job is allowed to run) is up. This can also be advantageous for jobs that all need to transfer the same large data file. In the last example, a database has to be downloaded from an SE to the Worker Node where the job is running. Note that this is the same database for all jobs. A pilot job that is consuming token after token only needs to download the database once.
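A rough calculation makes the saving concrete. The numbers below (1000 tasks, 50 pilot jobs, a 2 GB database) are assumptions chosen for illustration and do not come from a real run; `total_download_gb` is a hypothetical helper.

```python
def total_download_gb(num_downloads, database_gb):
    """Total data volume when the database is downloaded `num_downloads` times."""
    return num_downloads * database_gb


TASKS = 1000      # assumed number of tasks (tokens)
PILOTS = 50       # assumed number of pilot jobs
DATABASE_GB = 2   # assumed size of the shared database

one_job_per_task = total_download_gb(TASKS, DATABASE_GB)  # one download per job
with_pilot_jobs = total_download_gb(PILOTS, DATABASE_GB)  # one download per pilot

print(one_job_per_task, with_pilot_jobs)
```

Under these assumptions the database traffic drops from 2000 GB to 100 GB, because the number of downloads follows the number of pilot jobs rather than the number of tasks.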
HTTP server
In ToPoS, the server is an HTTP server, which is not necessarily tied to Grid infrastructure. In fact, you can access a ToPoS server simply with your Internet browser or with HTTP command-line clients such as wget and curl.
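Since the server speaks plain HTTP, a pilot job needs nothing more than a standard HTTP client. The sketch below builds the requests with Python's standard library. The server address `topos.example.org` and the path layout are assumptions made for illustration, not the documented ToPoS interface, and `next_token_request`, `delete_token_request` and `send` are hypothetical helper names.

```python
import urllib.request
from urllib.parse import quote

# Assumed server location and path layout, for illustration only.
BASE_URL = "http://topos.example.org"


def next_token_request(pool):
    """Build the GET request a pilot job could send to ask for a token."""
    url = f"{BASE_URL}/pools/{quote(pool)}/nextToken"
    return urllib.request.Request(url, method="GET")


def delete_token_request(pool, token):
    """Build the DELETE request that removes a finished token."""
    url = f"{BASE_URL}/pools/{quote(pool)}/tokens/{quote(token)}"
    return urllib.request.Request(url, method="DELETE")


def send(request):
    """Send a request and return the response body (needs a reachable server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


# Only build and show the requests here; nothing is sent over the network.
for req in (next_token_request("genomes"), delete_token_request("genomes", "17")):
    print(req.get_method(), req.full_url)
```

The example only constructs the requests; calling `send` on them would require a real server at the assumed address.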
The concept of token pools can also be used to create workflows. A task server can deal out tokens from different pools in succession and pilot jobs can remove tokens from pools (by processing them) and create tokens in other pools.
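A two-stage workflow of this kind can be sketched with pools modelled as dictionaries. This is an in-memory illustration only: `run_stage` is a hypothetical helper, and converting to upper case and measuring length stand in for real processing steps.

```python
def run_stage(source, target, work):
    """Consume every token in `source` and create a result token in `target`."""
    while source:
        name = sorted(source)[0]            # take the next token from the pool
        target[name] = work(source[name])   # create a token in the next pool
        del source[name]                    # remove it once the result is stored


raw = {"a": "acgt", "b": "ttgaca"}  # first pool: input tokens
upper = {}                           # second pool: output of stage one
lengths = {}                         # final results of stage two

run_stage(raw, upper, str.upper)     # stage one fills the second pool
run_stage(upper, lengths, len)       # stage two consumes the second pool

print(lengths)
```

Each stage only needs to know the pool it reads from and the pool it writes to, so stages can be run by different sets of pilot jobs.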
ToPoS was created by SARA (Pieter van Beek). If you want to know more or would like to work with it, please contact grid.support@sara.nl.
You can have a peek at ToPoS here: http://topos.grid.sara.nl/4.1/ or here: http://purl.org/sara/topos/latest/
Client libraries
While 'wget' and 'curl' can be used in shell scripts or even from other languages, sending 'raw' HTTP commands can get tedious and error-prone. Client libraries have been written that make accessing ToPoS easier. One such library is for shell scripts and is included in the example below. Other client libraries are:
- PerlToPoS: a ToPoS client library written in and for Perl scripts
Working example
To help you get familiar with using the Grid with ToPoS, we have provided a working example, including a set of easy-to-use client tools.