File Metadata Schema
The file metadata schema describes the excerpted, summary information that we store about each file, outside of the file itself.
The metadata schema is something that multiple systems rely on:
- detector DAQ systems must ultimately produce it
- end-user analyses rely on datasets defined in terms of it
- production data processing systems need to query it
Currently, the schema describing data from the vertical slice test and the 35t prototype test run is paramount.
The de facto choice for a file metadata system is "SAM" from Fermilab. It has a required/default schema. The table in this Redmine page gives a summary. It is up to the experiment to extend this schema and, for some elements, to interpret it. This extension and interpretation need to be carefully nailed down and allowed to evolve in a well-controlled manner.
See below for issues on this topic related to 35t.
SAM supports both subrun and file numbers. Two of its fields apply on a per-file basis:
runs = [ [<runnumber>, <subrunnumber>, <runtype>], ]
file_partition = <filenumber>
The "runs" variable is a list of 2-tuples or 3-tuples. In the former case, the subrunnumber is dropped. Because it is a list, a single file may be recorded as spanning multiple runs and/or subruns. The interpretation of file_partition is left to the experiment (as are the other numbers).
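As an illustration of the two forms of a "runs" entry, here is a minimal Python sketch. It is not SAM client code: the helper names (split_run_entry, runs_spanned), the example numbers, and the run type string "test" are all hypothetical. Only the field names runs and file_partition, and the 2-tuple/3-tuple layout, come from the description above.

```python
def split_run_entry(entry):
    """Return (run, subrun, run_type) for one element of the runs list.

    The subrun is None when the entry uses the 2-tuple form,
    which is assumed here to be [run, run_type].
    """
    if len(entry) == 3:
        run, subrun, run_type = entry
        return run, subrun, run_type
    if len(entry) == 2:
        run, run_type = entry
        return run, None, run_type
    raise ValueError("a runs entry must have 2 or 3 elements: %r" % (entry,))


def runs_spanned(metadata):
    """Return the sorted, de-duplicated run numbers recorded for one file."""
    return sorted({split_run_entry(entry)[0] for entry in metadata["runs"]})


# Hypothetical per-file record: one file spanning two subruns of one run.
example_metadata = {
    "runs": [[1234, 1, "test"], [1234, 2, "test"]],
    "file_partition": 7,
}

if __name__ == "__main__":
    for entry in example_metadata["runs"]:
        print(split_run_entry(entry))
    print(runs_spanned(example_metadata))
```

Normalizing both forms to a single 3-element result lets downstream code treat a missing subrun explicitly rather than guessing from the tuple length at every call site.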
Proposed Schema Evolution Policy and Procedure
A proposal for schema evolution policy and procedure is in DocDB #9888. It contains a description of the existing file metadata schema.
35t file numbers
Some email ca. July 2014 between online and offline people discussed this. The summary:
- artDAQ writes files to disk and enstore and is the initial source of metadata (Kurt Biery, John Freeman)
- Monotonically increasing "run number" (likely not starting at zero/one, and tracked/asserted by run control). Subrun and/or file numbers are generated by artDAQ (Erik Blaufus)
- artDAQ will set subrun number and will keep one file per subrun (Kurt Biery)
In terms of SAM, this means either dropping subrunnumber, using the 2-tuple form of runs, and using file_partition to count the subrun, OR keeping the 3-tuple form and either ignoring file_partition or filling it with a redundant count. My (bv) recommendation is to adopt the latter.
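The two alternatives can be compared side by side in a short Python sketch. This is only an illustration under assumptions: the function names and example values are hypothetical, "ignoring file_partition" is modeled as simply leaving the key out, and the "redundant count" is assumed to equal the subrun number because artDAQ keeps one file per subrun.

```python
def metadata_two_tuple(run, subrun, run_type):
    """Option 1: drop the subrun from runs and count it in file_partition."""
    return {"runs": [[run, run_type]], "file_partition": subrun}


def metadata_three_tuple(run, subrun, run_type, redundant_partition=True):
    """Option 2: keep the subrun in runs.

    file_partition is either filled with a redundant copy of the subrun
    number or left out entirely (our stand-in for "ignored").
    """
    metadata = {"runs": [[run, subrun, run_type]]}
    if redundant_partition:
        metadata["file_partition"] = subrun
    return metadata


if __name__ == "__main__":
    # Hypothetical values: run 1234, subrun 3, run type "test".
    print(metadata_two_tuple(1234, 3, "test"))
    print(metadata_three_tuple(1234, 3, "test"))
    print(metadata_three_tuple(1234, 3, "test", redundant_partition=False))
```

With the 3-tuple form, the subrun stays attached to its run inside runs, so a query on runs alone recovers it without needing an experiment-specific reading of file_partition.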