Difference between revisions of "P3s"

From DUNE
Jump to navigation Jump to search
 
(30 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''!!!UNDER CONSTRUCTION!!!'''
+
= Documentation =
=States of Pilots and Jobs=
+
This page does not contain documentation for end users. For that, please consult the
 +
documents kept and updated on GitHub, in the "documents"
 +
folder of the p3s repo:
 +
* https://github.com/DUNE/p3s.
 +
 
 +
The contents of the repository should be considered definitive and
 +
overriding other information.
 +
 
 +
The remainder of this page is a collection of developer's notes for p3s of
 +
various degrees of obsoletion and usefulness. General technical notes on deployment
 +
(subject to change) can be found on a separate page: "[[P3s notes]]".
 +
 
 +
=Current Development=
 +
 
 +
==User Management==
 +
Currently a list of users is static and read by the server at initialization time.
 +
It works but we may want to add more functionality.
 +
 
 +
==Priorities==
 +
Deactivated now, do we need to reconsider?
 +
 
 +
==Killing active jobs==
 +
Need more testing
 +
 
 +
==Workflows==
 +
Need more testing
 +
 
 +
=Pilots and Jobs=
 +
==The Pilot Framework==
 +
Job descriptions are sent to the server by users or agents using
 +
a specialized client. As a result, a DB record is created for is
 +
requested job. Nothing happens until an active process running on
 +
a WN (the "pilot") sends request to the server at which point if
 +
any jobs are in the queue, the top one on the stack will be
 +
picked and information sent back to the pilot as to how to execute
 +
this job.
 +
 
 
==Pilots==
 
==Pilots==
 +
Pilots are created completely independently of the server and contact the server via HTTP once they are initiated. The server searches its database of "defined jobs", sorts the jobs by priority and sends a reference to a job to the pilot which sent the request. In the current version, the job reference contains the path to the executable, its parameters and also the part of the job environment which helps to reference the data both in the input and the output by using environment variables.
 +
 
States:
 
States:
 
* active
 
* active
Line 8: Line 46:
 
* finished (completion of a job)
 
* finished (completion of a job)
 
* stopped (timeout w/o getting a job)
 
* stopped (timeout w/o getting a job)
 +
 +
Status:
 +
* OK
 +
* FAIL
  
 
==Jobs==
 
==Jobs==
 
States:
 
States:
 +
* template
 
* defined
 
* defined
* dispatched (sent to a pilot)
+
* dispatched (sent to a pilot for execution)
 
* running
 
* running
 
* finished
 
* finished
 +
 +
Events:
 +
* jobstart
 +
* jobstop
 +
 +
==Matching jobs to pilots==
 +
Jobs are created in the "template" state and won't be matched
  
 
=DAG=
 
=DAG=
==DAG Model in p3s==
+
==DAG as a template==
* Vertex and Edge tables: vertices are jobs and the edges are data
+
* DAG describes the topology and general properties of a workflow, and serves as a template for workflows. The system stores multiple templates referred to by their names.
* Leaves of a DAG: can only be a job, not data (since all data are edges and not vertices). This also has the benefit of not having final data unaccounted for - it must me either flushed or moved to permanent storage in most cases. The job/task responsible for either of these operations forms the leaf.
+
 
 +
* Vertex and Edge tables: vertices are jobs and the edges are data. The edge class has the following attributes at a minimum
 +
** ID
 +
** Source node
 +
** Target node
 +
 
 +
The ID is important if we want to support multiple edges between same nodes (e.g. files of different types produced by one job and consumed by another). This type of DAG is sometimes called MultiDiGraph but terminology may vary. Of course other useful parameters are implemented (path, state etc) as the edges refer to actual data.
 +
 
 +
* Leaves of a DAG: can only be a job, not data (since all data are edges and not vertices). This also has the benefit of not having final data unaccounted for - it must me either flushed or moved to permanent storage in most cases. The job/task responsible for either of these operations forms the leaf. If the data source for the root of the tree is purely external, such DAG nodes is assigned a special type "NOOP" and is handled accordingly (e.g. no pilots are necessary for its execution and it's essentially a pass-through)
 +
 
 +
<hr/>
 +
=P3S walk-through=
 +
Below is a scenario of the actual operation of the p3s prototype as it is tested:
 +
 
 +
* DAGs which are templates for actual workflows are defined in XML and sent to the p3s server where they are stored, as needed. There is no limit on how many templates can be in the system, and they can be updated or deleted at will. They are identified by a unique name and I propose to use this for versioning too, for simplicity.
 +
 
 +
* Also for simplicity, payloads are fixed in a particular DAG - which means DAGs of same topology but with different node content (or with different data types corresponding to edges) will be considered different. Doing proper inheritance quickly became too complex so I skipped it. To illustrate, if you change an executable in a single node or a single file format in a DAG, you will have to create a new DAG (and under a new name, presumably).
 +
 
 +
* a watcher script detects existence of a file (configurable) in a directory (configurable), it will loop and sleep as configured. If triggered, it's configured to create a workflow based on a specific DAG template.
 +
 
 +
* as per the above statement, the server is prompted via HTTP to create a workflow based on a pre-loaded template of a certain kind; the edges in the DAG which served as placeholders are populated with the actual file path information provided by the watcher (or rather its instance, which can be many). Multiple files can be plugged into a DAG if needed via the same interface.
 +
 
 +
* if the first node in the graph is a NOOP it's automatically toggled to "finished" so the rest of the DAG can proceed. This covers the scenario where the source of the data is purely external - however it doesn't need to be and the file finding jobs can be the first node (I think you suggested this once). The "first node" is picked via topological sorting, so it's scientific.
 +
 
 +
* the first unfinished job in the graph which is not dummy (NOOP) is automatically set to "defined" state, which means it can be matched to a pilot. Other jobs remain in the "template" state until their ancestors are executed.
 +
 
 +
* in the meantime, an independent script on some worker node is creating pilots (again, looping and sleeping as configured)
 +
 
 +
* brokerage (matching) happens and jobs get executed; their children in the graph are then toggled to "defined" state so they in turn can be picked by the pilots.
 +
 
 +
* every time a job is finished we check if it was the last one in a DAG, in which case the whole workflow is toggled to "finished" state.
 +
 
 +
 
 +
=LArSoft=
 +
<pre>
 +
you will find larsoft in:
 +
/mnt/nas01/software/dqm/larsoft/
 +
 
 +
To setup cvfms:
 +
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
 +
source /cvmfs/fermilab.opensciencegrid.org/products/larsoft/setups
 +
 
 +
then in the directory where larsoft is:
 +
cd /mnt/nas01/software/dqm/larsoft/v06
 +
source localProducts_larsoft_v06_33_00_e14_prof/setup
 +
mrbslp
 +
 
 +
To run one example from Voica's modules:
 +
lar -c job/onlinemonitorprotodune.fcl /mnt/nas01/users/radescu/Feb2017_v22/inputs/detsim_single_DistONSuppOFF_100.root
 +
 
 +
if you want only one event then:
 +
lar -c job/onlinemonitorprotodune.fcl /mnt/nas01/users/radescu/Feb2017_v22/inputs/detsim_single_DistONSuppOFF_100.root -n1
 +
 
 +
at the end it should create:
 +
onlinemonit.root
 +
and three text files (Voica knows what is there)
 +
 
 +
 
 +
lar -c /mnt/nas01/software/dqm/larsoft/v06/job/onlinemonitorprotodune.fcl /mnt/nas01/users/radescu/Feb2017_v22/inputs/detsim_single_DistONSuppOFF_100.root -n1
 +
</pre>
 +
 
 +
= Archive =
 +
==Online Hardware==
 +
P3S is not using this hardware which was initially planned for deployment in EHN1, so this is of purely
 +
historical interest
 +
 
 +
Cooling power (water) for 18 racks, each consuming up to 18KW
 +
* 6 cooled racks to the DAQ (3 NP04 and 3 NP02) in one counting room
 +
* 12 cooled racks for computing farms​ in a second counting room
 +
 
 +
==AutoPyFactory==
 +
* https://twiki.grid.iu.edu/bin/view/Documentation/Release3/AutoPyFactory
 +
* https://twiki.grid.iu.edu/bin/view/Documentation/Release3/AutoPyFactoryDeployment
 +
* https://twiki.grid.iu.edu/bin/view/Documentation/Release3/AutoPyFactory
 +
 
 +
<pre>
 +
[CERN_COMPASS_PROD]
 +
enabled = True
 +
batchqueue = CERN_COMPASS_PROD
 +
wmsqueue = CERN_COMPASS_PROD
 +
batchsubmitplugin = CondorOSGCE
 +
batchsubmit.condorosgce.gridresource = ce503.cern.ch
 +
#batchsubmit.condorosgce.gridresource = condorce01.cern.ch
 +
sched.maxtorun.maximum = 9999
 +
batchsubmit.condorosgce.condor_attributes = periodic_remove = (JobStatus == 2 && (CurrentTime - EnteredCurrentStatus) > 604800)
 +
batchsubmit.condorosgce.condor_attributes.+maxMemory = 1900
 +
batchsubmit.condorosgce.condor_attributes.+xcount = 1
 +
batchsubmit.condorosgce.proxy = compass-production
 +
executable.arguments = %(executable.defaultarguments)s
  
==DAG as a state machine in p3s==
+
executable = /home/autopyfactory/runpilot3-wrapper-compass.sh
A design idea:
+
executable.defaultarguments = -F COMPASS -s %(wmsqueue)s -h %(batchqueue)s -I vm127.jinr.ru -p 943 -w https://vm127.jinr.ru
* have an attribute in the Job class which specifies whether children of that particular node have been created (can be one child, of course)
+
</pre>
* this assumes that the data which is input to each child have been created and is already available
 
* special (but trivial) case is the leaf of a DAG - no further jobs need to be generated. Finding whether a node is a leaf can be done by looking for edges with "source" attribute corresponding to this node - the leaf condition is ascertained when there are no such edges.
 

Latest revision as of 18:08, 6 February 2018

Documentation

This page does not contain documentation for end users. For that, please consult the documents kept and updated on GitHub, in the "documents" folder of the p3s repo:

The contents of the repository should be considered definitive and overriding other information.

The remainder of this page is a collection of developer's notes for p3s of various degrees of obsoletion and usefulness. General technical notes on deployment (subject to change) can be found on a separate page: "P3s notes".

Current Development

User Management

Currently a list of users is static and read by the server at initialization time. It works but we may want to add more functionality.

Priorities

Deactivated now, do we need to reconsider?

Killing active jobs

Need more testing

Workflows

Need more testing

Pilots and Jobs

The Pilot Framework

Job descriptions are sent to the server by users or agents using a specialized client. As a result, a DB record is created for is requested job. Nothing happens until an active process running on a WN (the "pilot") sends request to the server at which point if any jobs are in the queue, the top one on the stack will be picked and information sent back to the pilot as to how to execute this job.

Pilots

Pilots are created completely independently of the server and contact the server via HTTP once they are initiated. The server searches its database of "defined jobs", sorts the jobs by priority and sends a reference to a job to the pilot which sent the request. In the current version, the job reference contains the path to the executable, its parameters and also the part of the job environment which helps to reference the data both in the input and the output by using environment variables.

States:

  • active
  • dispatched
  • running
  • finished (completion of a job)
  • stopped (timeout w/o getting a job)

Status:

  • OK
  • FAIL

Jobs

States:

  • template
  • defined
  • dispatched (sent to a pilot for execution)
  • running
  • finished

Events:

  • jobstart
  • jobstop

Matching jobs to pilots

Jobs are created in the "template" state and won't be matched

DAG

DAG as a template

  • DAG describes the topology and general properties of a workflow, and serves as a template for workflows. The system stores multiple templates referred to by their names.
  • Vertex and Edge tables: vertices are jobs and the edges are data. The edge class has the following attributes at a minimum
    • ID
    • Source node
    • Target node

The ID is important if we want to support multiple edges between same nodes (e.g. files of different types produced by one job and consumed by another). This type of DAG is sometimes called MultiDiGraph but terminology may vary. Of course other useful parameters are implemented (path, state etc) as the edges refer to actual data.

  • Leaves of a DAG: can only be a job, not data (since all data are edges and not vertices). This also has the benefit of not having final data unaccounted for - it must me either flushed or moved to permanent storage in most cases. The job/task responsible for either of these operations forms the leaf. If the data source for the root of the tree is purely external, such DAG nodes is assigned a special type "NOOP" and is handled accordingly (e.g. no pilots are necessary for its execution and it's essentially a pass-through)

P3S walk-through

Below is a scenario of the actual operation of the p3s prototype as it is tested:

  • DAGs which are templates for actual workflows are defined in XML and sent to the p3s server where they are stored, as needed. There is no limit on how many templates can be in the system, and they can be updated or deleted at will. They are identified by a unique name and I propose to use this for versioning too, for simplicity.
  • Also for simplicity, payloads are fixed in a particular DAG - which means DAGs of same topology but with different node content (or with different data types corresponding to edges) will be considered different. Doing proper inheritance quickly became too complex so I skipped it. To illustrate, if you change an executable in a single node or a single file format in a DAG, you will have to create a new DAG (and under a new name, presumably).
  • a watcher script detects existence of a file (configurable) in a directory (configurable), it will loop and sleep as configured. If triggered, it's configured to create a workflow based on a specific DAG template.
  • as per the above statement, the server is prompted via HTTP to create a workflow based on a pre-loaded template of a certain kind; the edges in the DAG which served as placeholders are populated with the actual file path information provided by the watcher (or rather its instance, which can be many). Multiple files can be plugged into a DAG if needed via the same interface.
  • if the first node in the graph is a NOOP it's automatically toggled to "finished" so the rest of the DAG can proceed. This covers the scenario where the source of the data is purely external - however it doesn't need to be and the file finding jobs can be the first node (I think you suggested this once). The "first node" is picked via topological sorting, so it's scientific.
  • the first unfinished job in the graph which is not dummy (NOOP) is automatically set to "defined" state, which means it can be matched to a pilot. Other jobs remain in the "template" state until their ancestors are executed.
  • in the meantime, an independent script on some worker node is creating pilots (again, looping and sleeping as configured)
  • brokerage (matching) happens and jobs get executed; their children in the graph are then toggled to "defined" state so they in turn can be picked by the pilots.
  • every time a job is finished we check if it was the last one in a DAG, in which case the whole workflow is toggled to "finished" state.


LArSoft

you will find larsoft in:
/mnt/nas01/software/dqm/larsoft/

To setup cvfms:
source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
source /cvmfs/fermilab.opensciencegrid.org/products/larsoft/setups

then in the directory where larsoft is:
cd /mnt/nas01/software/dqm/larsoft/v06
source localProducts_larsoft_v06_33_00_e14_prof/setup
mrbslp

To run one example from Voica's modules:
lar -c job/onlinemonitorprotodune.fcl /mnt/nas01/users/radescu/Feb2017_v22/inputs/detsim_single_DistONSuppOFF_100.root

if you want only one event then:
lar -c job/onlinemonitorprotodune.fcl /mnt/nas01/users/radescu/Feb2017_v22/inputs/detsim_single_DistONSuppOFF_100.root -n1

at the end it should create:
onlinemonit.root
and three text files (Voica knows what is there)


lar -c /mnt/nas01/software/dqm/larsoft/v06/job/onlinemonitorprotodune.fcl /mnt/nas01/users/radescu/Feb2017_v22/inputs/detsim_single_DistONSuppOFF_100.root -n1

Archive

Online Hardware

P3S is not using this hardware which was initially planned for deployment in EHN1, so this is of purely historical interest

Cooling power (water) for 18 racks, each consuming up to 18KW

  • 6 cooled racks to the DAQ (3 NP04 and 3 NP02) in one counting room
  • 12 cooled racks for computing farms​ in a second counting room

AutoPyFactory

[CERN_COMPASS_PROD]
enabled = True
batchqueue = CERN_COMPASS_PROD
wmsqueue = CERN_COMPASS_PROD
batchsubmitplugin = CondorOSGCE
batchsubmit.condorosgce.gridresource = ce503.cern.ch
#batchsubmit.condorosgce.gridresource = condorce01.cern.ch
sched.maxtorun.maximum = 9999
batchsubmit.condorosgce.condor_attributes = periodic_remove = (JobStatus == 2 && (CurrentTime - EnteredCurrentStatus) > 604800)
batchsubmit.condorosgce.condor_attributes.+maxMemory = 1900
batchsubmit.condorosgce.condor_attributes.+xcount = 1
batchsubmit.condorosgce.proxy = compass-production
executable.arguments = %(executable.defaultarguments)s

executable = /home/autopyfactory/runpilot3-wrapper-compass.sh
executable.defaultarguments = -F COMPASS -s %(wmsqueue)s -h %(batchqueue)s -I vm127.jinr.ru -p 943 -w https://vm127.jinr.ru