Latest revision as of 14:45, 16 December 2014

The File Metadata Schema describes the excerpted, summary information that we store about each file, outside the file itself.

Documents

Relevant LBNE DocDB entries:

LBNE File Metadata Schema Evolution Policy and Procedure (DocDB #9888)
- changes to the file metadata schema should follow this

A Proposal for LBNE Metadata in SAM (DocDB #8093)
- v2 including lbne_data extension
- v1 including lbne_MC extension
- not yet ratified

Review of Initial LBNE File Metadata Schema Proposal (DocDB #9950)

Raw Data File Name Discussion (v2) (DocDB #9967)

t.b.d. "Proposed change to file metadata in support of 35t raw data"

Other links:

SAM file metadata (parameters supported by default): https://cdcvs.fnal.gov/redmine/projects/sam-web/wiki/Metadata_format

Drivers

The metadata schema is something that multiple systems rely on:

detector DAQ systems must ultimately produce it
end-user analyses rely on datasets defined in terms of it
production data processing systems need to query it

Currently, the schema describing data from the vertical slice test and the 35t prototype test run is paramount.

File names

This section describes the naming convention for files. There are two types of files:

collaboration: raw data files produced by the detectors or by production processing or simulation
personal: files produced by individuals or groups but not in an official production manner.

Collaboration files

Collaboration files are produced in an "official" manner, either by the detectors, through production processing of detector files, or through the production and processing of simulation. If any files are so produced and their file name conventions are not documented in this section, please add them or contact the S&C group.

35t DAQ raw data

The 35t DAQ raw data files are named following this pattern:

35t_r<run_number>_s<subrun_number>_<run_mode>_raw.root

For example:

35t_r0000001_s0001_test_raw.root

With:

run_number is a 7-digit, zero-padded, monotonically increasing integer labeling a "run" of the DAQ where all parameters of the data taking are expected to remain fixed.
subrun_number is a 4-digit, zero-padded integer that increases monotonically within one run and labels a "subrun" of the DAQ. Subruns indicate simple partitions of a run in order to keep file sizes manageable.
run_mode is an alphanumeric word describing the selection that the DAQ places on the data (aka, "trigger" but it may not be implemented as a traditional trigger).
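
The naming pattern above can be checked mechanically. The following is a minimal sketch in Python, assuming only what the text states (7-digit run number, 4-digit subrun number, alphanumeric run mode). The function names and the regular expression are illustrative and not part of any DAQ or SAM tool.

```python
import re

# Illustrative regex for the 35t raw data naming pattern described above.
# The digit widths come from the text; the names here are hypothetical.
RAW_NAME_RE = re.compile(
    r"35t_r(?P<run_number>\d{7})_s(?P<subrun_number>\d{4})"
    r"_(?P<run_mode>[A-Za-z0-9]+)_raw\.root"
)


def format_raw_name(run_number, subrun_number, run_mode):
    """Build a 35t raw data file name from its three fields."""
    name = "35t_r%07d_s%04d_%s_raw.root" % (run_number, subrun_number, run_mode)
    # Re-check the result so out-of-range numbers or a non-alphanumeric
    # run_mode are rejected instead of silently producing a bad name.
    if RAW_NAME_RE.fullmatch(name) is None:
        raise ValueError("fields do not fit the naming pattern: %r" % name)
    return name


def parse_raw_name(name):
    """Return (run_number, subrun_number, run_mode) or raise ValueError."""
    match = RAW_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError("not a 35t raw data file name: %r" % name)
    return (
        int(match.group("run_number")),
        int(match.group("subrun_number")),
        match.group("run_mode"),
    )
```

For instance, format_raw_name(1, 1, "test") gives the example name shown above, and parse_raw_name recovers the three fields from it.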

Personal Files

Personal files are those produced by an individual or group but not as part of an organized "official" production processing. Naming conventions for any such files that are shared beyond the individual or group that produced them should be documented in this section, either directly or through links to external documentation. If you know of such files which are not so documented, please document them or contact the S&C group.

SAM

The de facto choice for a file metadata system is "SAM" from Fermilab. It has a required/default schema. The table in this Redmine page gives a summary. It is up to the experiment to extend this schema and, for some elements, to interpret it. This extension and interpretation need to be carefully nailed down and allowed to evolve in a well-controlled manner.

File numbers

See below for issues on this topic related to 35t.

SAM supports both subrun and file numbers. Two of its fields apply on a per-file basis:

runs = [ [<runnumber>, <subrunnumber>, <runtype>], ]
file_partition = <filenumber>

The "runs" variable is a list of 2-tuples or 3-tuples. In the former case, the subrunnumber is dropped. This means that a single file may be recorded as spanning runs and/or subruns. The file_partition is open to interpretation by the experiment (as are the other numbers).
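
To illustrate how the two forms of a "runs" entry might be read, here is a minimal sketch in Python. It assumes the metadata has already been loaded into plain Python lists. The helper names are hypothetical and are not part of the SAM client API, and the runtype value "test" in the usage note is a placeholder.

```python
def normalize_runs(runs):
    """Expand every "runs" entry to a (run, subrun, runtype) triple.

    A 2-element entry is read as [runnumber, runtype], with the
    subrun number dropped (represented here as None).
    """
    normalized = []
    for entry in runs:
        if len(entry) == 3:
            run, subrun, runtype = entry
        elif len(entry) == 2:
            run, runtype = entry
            subrun = None
        else:
            raise ValueError("expected a 2-tuple or 3-tuple, got %r" % (entry,))
        normalized.append((run, subrun, runtype))
    return normalized


def spans_multiple(runs):
    """True if the file is recorded against more than one (run, subrun) pair."""
    pairs = {(run, subrun) for run, subrun, _ in normalize_runs(runs)}
    return len(pairs) > 1
```

A file recorded with runs = [[100, 1, "test"], [100, 2, "test"]] spans two subruns of run 100, so spans_multiple returns True for it.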

35 ton

This section collects info specific to metadata for 35t raw data file production.

File numbering

Some email ca. July 2014 between online and offline people discussed this. The summary:

artDAQ writes files to disk and enstore and is the initial source of metadata (Kurt Biery, John Freeman)
Monotonically increasing "run number" (likely not starting at zero/one, and tracked/asserted by run control). Subrun and/or file numbers generated by artDAQ (Erik Blaufus)
artDAQ will set subrun number and will keep one file per subrun (Kurt Biery)

In terms of SAM, this means either dropping subrunnumber, using the 2-tuple form of runs, and using file_partition to count the subrun OR keeping the 3-tuple form and either ignoring file_partition or filling it with a redundant count. My (bv) recommendation is to adopt the latter.
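
The two alternatives can be written out side by side for a file that holds exactly one subrun. This is only a sketch in Python: the function names are hypothetical, the metadata is shown as a plain dict with just the two fields discussed here, and the runtype "test" in the usage note is a placeholder.

```python
def runs_two_tuple_option(run, subrun, runtype):
    """2-tuple option: subrun dropped from "runs", counted in file_partition."""
    return {"runs": [[run, runtype]], "file_partition": subrun}


def runs_three_tuple_option(run, subrun, runtype, fill_partition=True):
    """3-tuple option: subrun kept in "runs".

    file_partition is either filled with a redundant copy of the
    subrun number or left out entirely.
    """
    metadata = {"runs": [[run, subrun, runtype]]}
    if fill_partition:
        metadata["file_partition"] = subrun
    return metadata
```

With run 100 and subrun 2, the 2-tuple option yields {"runs": [[100, "test"]], "file_partition": 2}, while the 3-tuple option keeps the subrun inside "runs" as [[100, 2, "test"]].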

Desired metadata schema

This section collects the metadata schema desired for the 35t, parameter names, types, etc. This will become the fodder for a formal change proposal.

Parameter types:

extenum: an extensible enumeration. This type takes values from a fixed but extensible set of strings. The set is not allowed to grow unrestricted or via one individual's action. Some consensus among a larger group is required to add to the set.

verstr: a string left up to the experts who determine the version of something. The values such a type may take should be governed by some pattern. See DocDB #9888 section 3.2 for a recommended interpretation of version numbers.

list(type): an ordered sequence of elements of type type

N-tuple: an ordered sequence of exactly N elements of unspecified types.
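
As a rough illustration of how these parameter types could be checked, here is a sketch in Python. Everything in it is an assumption made for illustration: the class and function names are hypothetical, the dotted-number pattern for verstr is a stand-in (the recommended interpretation is in DocDB #9888 section 3.2 and is not reproduced here), and the "at least two approvers" rule merely stands in for the group consensus that extending an extenum requires.

```python
import re

# Hypothetical stand-in pattern for verstr, e.g. "1.0" or "1.1".
VERSTR_RE = re.compile(r"\d+(\.\d+)*")


class ExtEnum:
    """A fixed but extensible set of allowed strings."""

    def __init__(self, name, values):
        self.name = name
        self._values = set(values)

    def is_valid(self, value):
        return value in self._values

    def extend(self, value, approved_by):
        """Add a value; approved_by stands in for consensus among a group."""
        if len(set(approved_by)) < 2:  # illustrative threshold, an assumption
            raise ValueError("one individual may not extend %s" % self.name)
        self._values.add(value)


def is_verstr(value):
    """True if value is a string matching the stand-in version pattern."""
    return isinstance(value, str) and VERSTR_RE.fullmatch(value) is not None


def is_list_of(value, check):
    """list(type): an ordered sequence whose elements all pass check."""
    return isinstance(value, list) and all(check(item) for item in value)


def is_n_tuple(value, n):
    """N-tuple: exactly n elements, element types unchecked."""
    return isinstance(value, (tuple, list)) and len(value) == n
```

A value such as [(100, 2, "label")] then passes is_list_of(value, lambda item: is_n_tuple(item, 3)).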

name	type	example	comment
Basic Parameters (types pre-defined by SAM schema)
file_type	`extenum`	raw35t, log	indicates category of use; any application that can read files of a given file_type should be able to read all such files. file_type is different from file_format
file_format	`extenum`	rawdaq, dk2nu, detsim	defines the low-level format (e.g., ASCII, ROOT) and the general schema which the content follows
runs	`list(3-tuple)`	[(100,2,"label"),]	Following SAM requirements, the 3-tuple holds: (run-number, subrun-number, "label"). TBD: specify expectation for label. Note that this is a list; a given file may have multiple run/subrun numbers associated.
...
Schema Extensions (parameters not directly defined by SAM)
file_format_version	`verstr`	1.0, 1.1	See DocDB #9888 section 3.2 for a recommended interpretation of version numbers
...

