Pilots and Jobs
Pilots are created completely independently of the server and contact the server via HTTP once they are initiated. The server searches its database of "defined jobs", sorts the jobs by priority and sends a reference to a job to the pilot which sent the request. In the current version, the job reference contains the path to the executable, its parameters and also the part of the job environment which helps to reference the data both in the input and the output by using environment variables.
- finished (completion of a job)
- stopped (timeout w/o getting a job)
- dispatched (sent to a pilot for execution)
Matching jobs to pilots
Jobs are created in the "template" state and won't be matched
DAG as a template
- DAG describes the topology and general properties of a workflow, and serves as a template for workflows. The system stores multiple templates referred to by their names.
- Vertex and Edge tables: vertices are jobs and the edges are data. The edge class has the following attributes at a minimum
- Source node
- Target node
The ID is important if we want to support multiple edges between same nodes (e.g. files of different types produced by one job and consumed by another). This type of DAG is sometimes called MultiDiGraph but terminology may vary. Of course other useful parameters are implemented (path, state etc) as the edges refer to actual data.
- Leaves of a DAG: can only be a job, not data (since all data are edges and not vertices). This also has the benefit of not having final data unaccounted for - it must me either flushed or moved to permanent storage in most cases. The job/task responsible for either of these operations forms the leaf. If the data source for the root of the tree is purely external, such DAG nodes is assigned a special type "NOOP" and is handled accordingly (e.g. no pilots are necessary for its execution and it's essentially a pass-through)