Searching Datasets for Analysis
The most common use of the grid for analysis is the submission of jobs to run over input files from the official Belle II data and MC.
The files on the grid are distributed on the available storage sites around the world. Fortunately, as a user you do not have to worry about their physical location. A file catalog keeps a record of where the files are located and you just need to provide a logical identifier for the samples of interest for your analysis.
Datasets and Datablocks
Each file on the grid is located using a Logical File Name (LFN), which always starts with /belle:
/belle/data_type/some_more_directories/dataset/datablock/file.root
When an LFN is provided to any of the gbasf2 tools, a replica is retrieved from the file catalog in the backend. You only need to provide the LFN(s) relevant for your analysis, without dealing with the physical location of the samples.
Files are collected within datasets, which can be located using a Logical Path Name (LPN):
/belle/data_type/some_more_directories/dataset
A dataset may correspond to a run for a specific experiment number in data, or to an MC type in an MC campaign.
By design, a directory on the grid can contain 1000 files at most. For this reason, the concept of a datablock was introduced. Each dataset is subdivided into directories with names subXX, where the last two digits are sequentially iterated when the dataset is expanded to hold more files (sub00, sub01, …).
In gbasf2, the data handling unit is the datablock.
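The path layout described above can be illustrated with a short Python sketch. It is not part of gbasf2: the helper names split_lfn and min_datablock_names are hypothetical, and the second helper assumes that datablocks are filled up to the 1000-file limit, which a real dataset need not do.

```python
import math
import posixpath

MAX_FILES_PER_DIRECTORY = 1000  # directory limit quoted in the text above


def split_lfn(lfn):
    """Split an LFN into (dataset LPN, datablock name, file name)."""
    if not lfn.startswith("/belle/"):
        raise ValueError(f"not a Belle II LFN: {lfn!r}")
    datablock_path, file_name = posixpath.split(lfn)
    dataset, datablock = posixpath.split(datablock_path)
    return dataset, datablock, file_name


def min_datablock_names(n_files):
    """Names of the fewest datablocks able to hold n_files files.

    Assumption: every datablock is filled to the limit before the next
    one is opened. Real datasets may use more datablocks than this.
    """
    n_blocks = max(1, math.ceil(n_files / MAX_FILES_PER_DIRECTORY))
    return [f"sub{i:02d}" for i in range(n_blocks)]


if __name__ == "__main__":
    lfn = (
        "/belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/"
        "e0010/4S/r04275/18530200/udst/sub00/"
        "udst_000001_prod00016031_task94274000001.root"
    )
    dataset, datablock, file_name = split_lfn(lfn)
    print(dataset)
    print(datablock)
    print(file_name)
    print(min_datablock_names(2500))
```

Using posixpath rather than os.path keeps the splitting independent of the local operating system, since LFNs always use forward slashes.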
The Dataset Searcher
The Dataset Searcher is an application on the DIRAC web portal (https://dirac.cc.kek.jp:8443/DIRAC/). You can find it under Menu / BelleDIRAC Apps / Dataset Searcher.
The categories available to search for datasets are:
- Data type
MC or Data.
- Background level
The beam background configuration used in simulation: BGx1, BGx0, etc (only for MC).
- Beam energy
Energy point in collisions: Y(4S), Y(5S), etc. Off-resonance collisions are specified by the energy in MeV.
- Data Levels
Indicator of the level of processing: mdst, udst.
- Campaigns
MC or data processing campaign name.
- Global Tags
Identifier of the database global tag.
- General Skim Names
First stage skim name (only defined for Data). ‘All’ means no HLT skim.
- Skim Types
Physics skim decay mode. See the software manual for details and the code of the available skims.
- Releases
Basf2 release version used to produce the dataset.
- MC Event Types
Physics EventType generated (only defined for MC). For generic samples use charged, mixed, uubar, ee, etc.
- Experiment High
Upper boundary of the experiment number in the datasets.
- Experiment Low
Lower boundary of the experiment number in the datasets.
- Run High
Upper boundary of the run number in the datasets.
- Run Low
Lower boundary of the run number in the datasets.
Search for the samples you want (e.g. Campaigns: proc11, Beam Energies: 4S, etc) and click on ‘Search’. A list of LFNs matching your selection will be provided.
At the bottom of the page, there is a button “Download.txt file” that can be used to get a list of LFNs in a local file.
The list of LFNs can be used to submit your jobs with gbasf2 (see Submitting jobs to the grid).
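Before passing a downloaded list on to other tools, it can be useful to sanity-check it. The following Python sketch assumes that the text file contains one entry per line (the exact layout of the downloaded file is not specified here) and relies only on the rule that every LFN starts with /belle. The function name read_lfn_list and the file name lfns.txt are hypothetical.

```python
def read_lfn_list(path):
    """Return the unique LFNs listed in a text file, in original order.

    Assumption: the file holds one LFN per line. Blank lines are skipped
    and any entry that does not start with /belle raises an error.
    """
    lfns = []
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            lfn = line.strip()
            if not lfn:
                continue  # ignore blank lines
            if not lfn.startswith("/belle"):
                raise ValueError(f"unexpected entry: {lfn!r}")
            if lfn not in seen:  # drop duplicates, keep first occurrence
                seen.add(lfn)
                lfns.append(lfn)
    return lfns


if __name__ == "__main__":
    # "lfns.txt" is a placeholder for the file saved from the web page.
    for lfn in read_lfn_list("lfns.txt"):
        print(lfn)
```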
Note
The Data Production WebHome in Confluence is the entry point to find additional information about the samples that are available.
Searching for datasets via the command-line
Another way to interact with the Dataset Searcher is using the command-line tool gb2_ds_search. It provides options to search available metadata and datasets matching the query.
Searching available metadata
gb2_ds_search metadata will return the available values for the metadata specified:
$ gb2_ds_search metadata --table <metadata>
For example:
$ gb2_ds_search metadata --table MCEventTypes
"charged"
"uubar"
"ee"
"3960640000"
"4190020000"
...
The available options are:
- --help
Show the help message and exit
- --table
Table to search values from: Releases, GlobalTags, MCEventTypes, GeneralSkimNames, SkimDecayModes, Campaigns, DataTypes, DataLevels, BeamEnergies, BkgLevels.
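If you capture the output of gb2_ds_search metadata in a script, the quoted values can be unpacked with a few lines of Python. This is only a sketch based on the output format shown in the example above (one double-quoted value per line). The function names are hypothetical, and treating all-digit values as numeric event type codes is an assumption made for illustration.

```python
def parse_metadata_values(output):
    """Extract the values from lines of the form "value"."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        # Keep only lines fully wrapped in double quotes.
        if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
            values.append(line[1:-1])
    return values


def split_event_types(values):
    """Separate named samples from all-digit event type codes."""
    named = [v for v in values if not v.isdigit()]
    codes = [v for v in values if v.isdigit()]
    return named, codes


if __name__ == "__main__":
    sample = '"charged"\n"uubar"\n"ee"\n"3960640000"\n"4190020000"\n'
    named, codes = split_event_types(parse_metadata_values(sample))
    print(named)
    print(codes)
```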
Searching datasets
Similarly to the Dataset Searcher on the web, gb2_ds_search dataset will return the list of datasets that match the query. The usage is:
$ gb2_ds_search dataset [-h] [-o OUTPUT_FILE] [--campaign CAMPAIGN]
[--data_type DATA_TYPE] [--data_level DATA_LEVEL]
[--run_high RUN_HIGH] [--exp_high EXP_HIGH]
[--run_low RUN_LOW] [--exp_low EXP_LOW]
[--mc_event MC_EVENT] [--skim_decay SKIM_DECAY]
[--general_skim GENERAL_SKIM]
[--beam_energy BEAM_ENERGY]
[--global_tag GLOBAL_TAG] [--release RELEASE]
[--bkg_level BKG_LEVEL]
A full description of the options is available at Dataset Searcher tools.
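When the same query has to be repeated from a script, the command line can be assembled programmatically. The Python sketch below only builds the argument list. It uses the option names from the usage message above, while the helper name build_search_command and the example values (taken from the search categories described earlier) are illustrative assumptions, not verified inputs.

```python
import shlex

# Long options listed in the usage message of "gb2_ds_search dataset".
DATASET_OPTIONS = (
    "campaign", "data_type", "data_level", "run_high", "exp_high",
    "run_low", "exp_low", "mc_event", "skim_decay", "general_skim",
    "beam_energy", "global_tag", "release", "bkg_level",
)


def build_search_command(output_file=None, **criteria):
    """Return the argument list for a gb2_ds_search dataset query."""
    command = ["gb2_ds_search", "dataset"]
    if output_file is not None:
        command += ["-o", output_file]
    for name, value in criteria.items():
        if name not in DATASET_OPTIONS:
            raise ValueError(f"unknown search option: {name}")
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    # Example values are assumptions; check them with the metadata search.
    command = build_search_command(
        campaign="proc11", beam_energy="4S", data_level="mdst"
    )
    print(shlex.join(command))
```

The resulting list can be handed to subprocess.run in an environment where the gbasf2 tools are set up. Passing a list rather than a single string avoids shell quoting problems.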
Datasets and files metadata
Additional information related to the datasets and files located on the grid is stored in the metadata catalog. Metadata is information that describes the data. It is essential for correctly identifying the input data for any kind of processing, such as mdst production, skimming, or physics analysis.
Files, datablocks and datasets each use a different metadata schema, so the information obtained from each one is different. See Computing Metadata in Confluence for details.
To retrieve the metadata of datasets, use the command gb2_ds_query_dataset.
For example:
$ gb2_ds_query_dataset -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst
udst
dataset: /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst
creationDate: 2020-11-25 14:53:17
lastUpdate: 2020-11-27 02:11:42
nFiles: 1
size: 105046
status: good
productionId: 16031
transformationId: 348793
owner: g:belle_skim
mc: SkimP11x1
stream:
dataType: data
dataLevel: udst
beamEnergy: 4S
mcEventType:
generalSkimName:
skimDecayMode: 18530200
release: release-05-01-03
dbGlobalTag: DB00001363
sourceCode:
sourceCodeRevision:
steeringFile: skim/SkimP11x1/release-05-01-03/SkimScripts/singleTagPseudoScalar_Skim_Standalone.py
steeringFileRevision:
experimentLow: 10
experimentHigh: 10
runLow: 4275
runHigh: 4275
logLfn:
parentDatasets: /belle/Data/proc/release-04-02-02/DB00000938/proc11/prod00013369/e0010/4S/r04275/mdst
description: SkimP11x1 singleTagPseudoScalar skim on proc11_exp10r2_b1.
where the -l option displays the metadata in an easily readable format. Notice that the metadata contains useful information such as the parent datasets, a description, the steering file, etc.
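The readable output is also simple to post-process. As a sketch, and assuming the "key: value" layout shown in the example above stays stable, the following hypothetical Python helper turns captured output into a dictionary. Lines without a colon, such as the leading dataset name, are ignored.

```python
def parse_metadata(output):
    """Parse lines of the form "key: value" into a dictionary."""
    metadata = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as timestamps
        # or "g:belle_skim" keep their own colons.
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not a key/value line
        metadata[key.strip()] = value.strip()
    return metadata


if __name__ == "__main__":
    sample = """udst
dataset: /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst
creationDate: 2020-11-25 14:53:17
nFiles: 1
size: 105046
owner: g:belle_skim
stream:
"""
    metadata = parse_metadata(sample)
    print(metadata["creationDate"])
    print(int(metadata["nFiles"]), int(metadata["size"]))
```

All values are kept as strings. Fields that are empty in the output, such as stream in this example, map to an empty string.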
Analogously, the metadata of datablocks and files is obtained with gb2_ds_query_datablock and gb2_ds_query_file.
Using the same LPN as the previous example:
$ gb2_ds_query_datablock -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst
/belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00
lastUpdate: 2020-11-26 04:03:35
nFiles: 1
status: good
creationDate: 2020-11-26 03:45:54
size: 105046
You can easily see that the dataset contains a single datablock, and that the datablock contains only one file. The metadata of the file can be retrieved by providing the LPN of the parent datablock:
$ gb2_ds_query_file -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00
/belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00/udst_000001_prod00016031_task94274000001.root
jobId: 174693183
checksumType: Adler32
runHigh: 4275
experimentHigh: 10
eventLow: 325
guid: B6D9ED03-EEA5-42B5-5DDF-550F9716E91C
status: good
site: LCG.Napoli.it
eventHigh: 1508
checksum: d9fd3a15
parentGuids: 4A487D03-0B18-2DB0-96F4-D9E2CB78EE21
nEvents: 11
runLow: 4275
experimentLow: 10
date: 2020-11-26 03:50:35
size: 105046
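The size, checksumType and checksum fields can be used to verify a local copy of a file. The sketch below is not a gbasf2 tool. It assumes that size is given in bytes and that checksum is the Adler-32 of the file contents written as eight lowercase hexadecimal digits, which matches the example above but is not confirmed here. The function names and the local file name are hypothetical.

```python
import os
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def matches_catalog(path, expected_size, expected_checksum):
    """Compare a local file against the catalog size and checksum."""
    if os.path.getsize(path) != int(expected_size):
        return False
    return adler32_hex(path) == expected_checksum.lower()


if __name__ == "__main__":
    # Placeholder file name; size and checksum come from the example above.
    local_copy = "udst_000001_prod00016031_task94274000001.root"
    print(matches_catalog(local_copy, 105046, "d9fd3a15"))
```

Checking the size first is cheap and avoids reading the whole file when the copy is obviously incomplete.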
A full description of the usage and options for the command line tools is available at Dataset management tools.