Difference between revisions of "File Metadata Schema"
BrettViren
Revision as of 19:56, 7 November 2014
The File Metadata Schema describes what excerpted, summary information we store about each file, separately from the file itself.
Documents
- DocDB #8093: A Proposal for LBNE Metadata in SAM
- LBNE File Metadata Schema Evolution Policy and Procedure
- Review of Initial LBNE File Metadata Schema Proposal
- t.b.d. "Proposed change to file metadata in support of 35t raw data"
Drivers
Multiple systems rely on the metadata schema:
- detector DAQ systems must ultimately produce it
- end-user analyses rely on datasets defined in terms of it
- production data processing systems need to query it
Currently, the schema to describe data from the vertical slice test and the 35t prototype test run is paramount.
SAM
The de facto choice for a file metadata system is "SAM" from Fermilab. It has a required/default schema. The table in this Redmine page gives a summary. It is up to the experiment to extend and, for some elements, interpret this schema. This extension and interpretation need to be carefully nailed down and allowed to evolve in a well-controlled manner.
File numbers
See below for issues on this topic related to 35t.
SAM supports both subrun and file numbers. It has two fields that pertain on a per-file basis:
runs = [ [<runnumber>, <subrunnumber>, <runtype>], ]
file_partition = <filenumber>
The "runs" variable is a list of 2-tuples or 3-tuples. In the former case, the subrunnumber is dropped. Because it is a list, a single file may be recorded as spanning runs and/or subruns. The file_partition is open to interpretation by the experiment (as are the other numbers).
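As a rough illustration of the two fields described above, the following Python sketch builds a per-file record and reads the run and subrun numbers back out. Only the field names runs and file_partition come from the text; the helper functions, the run type "test", and the numeric values are hypothetical and chosen only for the example.

```python
# Hypothetical per-file record. Only the keys "runs" and "file_partition"
# come from the SAM description above; all values are made up.
metadata = {
    "runs": [
        [100, 2, "test"],  # 3-tuple form: run number, subrun number, run type
        [101, 1, "test"],  # a second entry: the file spans two runs
    ],
    "file_partition": 7,   # file number; interpretation is up to the experiment
}


def run_numbers(record):
    """Return the distinct run numbers a file is associated with, in order."""
    seen = []
    for entry in record["runs"]:
        if len(entry) not in (2, 3):
            raise ValueError(f"expected a 2- or 3-element entry, got {entry!r}")
        if entry[0] not in seen:
            seen.append(entry[0])
    return seen


def subrun_of(entry):
    """Return the subrun number, or None for the 2-tuple form (subrun dropped)."""
    return entry[1] if len(entry) == 3 else None
```

In both forms the run number is the first element, so run_numbers does not need to know which form a given entry uses.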
35 ton
This section collects info specific to metadata for 35t raw data file production.
File numbering
Some email ca. July 2014 between online and offline people discussed this. The summary:
- artDAQ writes files to disk and enstore and is the initial source of metadata (Kurt Biery, John Freeman)
- Monotonically increasing "run number" (likely not starting at zero/one, and tracked/asserted by run control). Subrun and/or file numbers generated by artDAQ (Erik Blaufus)
- artDAQ will set subrun number and will keep one file per subrun (Kurt Biery)
In terms of SAM, this means either (a) dropping subrunnumber, using the 2-tuple form of runs, and using file_partition to count the subrun, or (b) keeping the 3-tuple form and either ignoring file_partition or filling it with a redundant count. My (bv) recommendation is to adopt the latter.
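The two alternatives can be sketched side by side in Python. This is only an illustration under stated assumptions: the function names and the run type "test" are hypothetical, and using the subrun number itself as the "redundant count" relies on the one-file-per-subrun statement above.

```python
def record_2tuple(run, subrun, run_type):
    """First option: drop the subrun from `runs` and count it in file_partition."""
    return {"runs": [[run, run_type]], "file_partition": subrun}


def record_3tuple(run, subrun, run_type, fill_partition=True):
    """Second option: keep the subrun in `runs`.

    file_partition is either left out or filled with a redundant count
    (assumed here to equal the subrun number, given one file per subrun).
    """
    record = {"runs": [[run, subrun, run_type]]}
    if fill_partition:
        record["file_partition"] = subrun
    return record
```

For example, record_3tuple(100, 2, "test") keeps the subrun visible in runs, so queries on run and subrun do not need to know how file_partition was filled.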
Desired metadata schema
This section collects the metadata schema desired for the 35t, parameter names, types, etc. This will become the fodder for a formal change proposal.
Parameter types:
- extenum
- an extensible enumeration. This type takes values from a fixed but extensible set of strings. The set is not allowed to grow unrestricted or via one individual's action. Some consensus among a larger group is required to add to the set.
- verstr
- a string left up to the experts who determine the version of something. The values such a type may take should be governed by some pattern. See DocDB #9888 section 3.2 for a recommended interpretation of version numbers.
- list(type)
- an ordered sequence of elements of type type
- N-tuple
- an ordered sequence of exactly N elements of unspecified types.
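One way to picture these four parameter types is as small validation predicates. The Python sketch below is not part of any proposal: the function names are hypothetical, the allowed file_type values are just the examples from this page's parameter table, and the version pattern (dot-separated integers) is an assumption rather than the interpretation recommended in DocDB #9888.

```python
import re

# Example values from this page's parameter table; the real set would be
# extended only by consensus, not by editing this constant.
FILE_TYPES = frozenset({"raw35t", "log"})

# Assumed pattern: dot-separated integers such as "1.0" or "1.1".
VERSTR_PATTERN = re.compile(r"\d+(\.\d+)*")


def is_extenum(value, allowed):
    """extenum: a string drawn from a fixed but extensible set."""
    return isinstance(value, str) and value in allowed


def is_verstr(value):
    """verstr: a string matching the (assumed) version pattern."""
    return isinstance(value, str) and VERSTR_PATTERN.fullmatch(value) is not None


def is_list_of(value, check):
    """list(type): an ordered sequence whose elements all pass `check`."""
    return isinstance(value, list) and all(check(item) for item in value)


def is_ntuple(value, n):
    """N-tuple: an ordered sequence of exactly n elements of any types."""
    return isinstance(value, (tuple, list)) and len(value) == n
```

The predicates compose: is_list_of(value, lambda e: is_ntuple(e, 3)) checks a value of type list(3-tuple).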
name | type | example | comment |
---|---|---|---|
Basic Parameters (types pre-defined by SAM schema) | |||
file_type | extenum | raw35t,log | indicates category of use; any application that can read files of a given file_type should be able to read all such files. file_type is different from file_format |
file_format | extenum | rawdaq, dk2nu, detsim | defines the low-level format (e.g., ASCII, ROOT) and the general schema which the content follows |
runs | list(3-tuple) | [(100,2,"label"),] | Following SAM requirements the 3-tuple holds: (run-number, subrun-number, "label"). TBD: specify expectation for label. Note, this is a list; a given file may have multiple run/subrun numbers associated. |
... | |||
Schema Extensions (parameters not directly defined by SAM) | |||
file_format_version | verstr | 1.0, 1.1 | See DocDB #9888 section 3.2 for a recommended interpretation of version numbers |
... |
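For concreteness, a record using only the four parameters listed in the table might look like the Python sketch below. The values are the table's own examples; pairing them in a single record and the use of JSON are assumptions made purely for illustration, and the elided ("...") parameters are left out.

```python
import json

# One record built from the example column of the table above.
record = {
    "file_type": "raw35t",
    "file_format": "rawdaq",
    "runs": [(100, 2, "label")],
    "file_format_version": "1.0",
}

# Round-trip through JSON to show what a consumer would read back.
encoded = json.dumps(record, sort_keys=True)
decoded = json.loads(encoded)
```

JSON has no tuple type, so after the round trip each entry of runs comes back as a list; code that consumes such records should accept either form.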