Searching Datasets for Analysis

The most common use of the grid for analysis is the submission of jobs to run over input files from the official Belle II data and MC.

The files on the grid are distributed across the available storage sites around the world. Fortunately, as a user you do not have to worry about their physical location. A file catalog keeps a record of where the files are located, and you just need to provide a logical identifier for the samples of interest for your analysis.

Datasets and Datablocks

Each file on the grid is located using a Logical File Name (LFN), which always starts with /belle:

/belle/data_type/some_more_directories/dataset/datablock/file.root

When an LFN is provided to any of the gbasf2 tools, a replica is retrieved from the catalog in the backend. You only need to provide the LFN(s) relevant for your analysis, without dealing with the physical location of the samples.

Files are collected within datasets, which can be located using a Logical Path Name (LPN):

/belle/data_type/some_more_directories/dataset

A dataset may correspond to a run for a specific experiment number in data, or to an MC type in an MC campaign.
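The relation between an LFN and the LPN of its dataset can be illustrated with a short Python sketch. It is not part of gbasf2: the helper name split_lfn and the LfnParts container are hypothetical, and the only assumption is the layout shown above (dataset LPN, then datablock directory, then file name).

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class LfnParts(NamedTuple):
    dataset: str    # LPN of the dataset
    datablock: str  # name of the datablock directory
    filename: str   # name of the file itself


def split_lfn(lfn: str) -> LfnParts:
    """Split an LFN of the form <dataset LPN>/<datablock>/<file>."""
    path = PurePosixPath(lfn)
    # parts looks like ('/', 'belle', ..., dataset, datablock, file)
    if len(path.parts) < 5 or path.parts[:2] != ("/", "belle"):
        raise ValueError(f"not a complete /belle LFN: {lfn!r}")
    return LfnParts(
        dataset=str(path.parent.parent),
        datablock=path.parent.name,
        filename=path.name,
    )


if __name__ == "__main__":
    parts = split_lfn("/belle/data_type/some_more_directories/dataset/datablock/file.root")
    print(parts.dataset)
    print(parts.datablock)
    print(parts.filename)
```

The function only manipulates the string; it does not contact the file catalog, so it cannot tell whether the LFN actually exists.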

By design, a directory on the grid can contain 1000 files at most. For this reason, the concept of a datablock was introduced. Each dataset is subdivided into directories with names subXX, where the last two digits are sequentially iterated when the dataset is expanded to hold more files (sub00, sub01, …).

In gbasf2, the data handling unit is the datablock.
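As a rough illustration of the subXX naming, the following Python sketch computes the smallest set of datablock names able to hold a given number of files under the 1000-file limit. The function name datablock_names is hypothetical, and real datasets may spread their files over more datablocks than this lower bound.

```python
from typing import List

MAX_FILES_PER_DIRECTORY = 1000  # directory limit quoted in the text


def datablock_names(n_files: int) -> List[str]:
    """Return the minimum list of subXX names needed to hold n_files files."""
    if n_files < 0:
        raise ValueError("n_files must be non-negative")
    # Ceiling division without floating point
    n_blocks = -(-n_files // MAX_FILES_PER_DIRECTORY)
    return [f"sub{index:02d}" for index in range(n_blocks)]


if __name__ == "__main__":
    print(datablock_names(2500))  # ['sub00', 'sub01', 'sub02']
```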

The Dataset Searcher

The Dataset Searcher is an application on the DIRAC web portal (https://dirac.cc.kek.jp:8443/DIRAC/). You can find it under Menu / BelleDIRAC Apps / Dataset Searcher.

[Figure: the Dataset Searcher application on the DIRAC web portal]

The categories available to search for datasets are:

Data type

MC or Data.

Background level

The beam background configuration used in simulation: BGx1, BGx0, etc. (only for MC).

Beam energy

Energy point in collisions: Y(4S), Y(5S), etc. Off-resonance collisions are specified by the energy in MeV.

Data Levels

Indicator of the level of processing: mdst, udst.

Campaigns

MC or data processing campaign name.

Global Tags

Identifier of the database global tag.

General Skim Names

First stage skim name (only defined for Data). ‘All’ means no HLT skim.

Skim Types

Physics skim decay mode. See the software manual for details and the code of the available skims.

Releases

Basf2 release version used to produce the dataset.

MC Event Types

Physics EventType generated (only defined for MC). For generic samples use charged, mixed, uubar, ee, etc.

Experiment High

Upper boundary of the experiment number in the datasets.

Experiment Low

Lower boundary of the experiment number in the datasets.

Run High

Upper boundary of the run number in the datasets.

Run Low

Lower boundary of the run number in the datasets.

Search for the samples you want (e.g. Campaigns: proc11, Beam Energies: 4S, etc.) and click on ‘Search’. A list of LFNs matching your selection will be provided.

At the bottom of the page, there is a button “Download.txt file” that can be used to get a list of LFNs in a local file. The list of LFNs can be used to submit your jobs with gbasf2 (see Submitting jobs to the grid).
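Before using the downloaded list, it can be convenient to clean it up. The Python sketch below assumes that the text file contains one path per line (an assumption about the download format, not something stated here); the helper name read_lfn_list and the file name in the usage comment are hypothetical.

```python
from typing import Iterable, List


def read_lfn_list(lines: Iterable[str]) -> List[str]:
    """Return the unique, non-empty paths from an iterable of text lines.

    Assumes one path per line; the original order is preserved.
    """
    seen = set()
    paths = []
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        if not path.startswith("/belle/"):
            raise ValueError(f"unexpected entry in list: {path!r}")
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths


# Usage with a downloaded file (hypothetical file name):
# with open("dataset_list.txt") as handle:
#     lfns = read_lfn_list(handle)
```

Rejecting lines that do not start with /belle/ catches the case where a different file was passed by mistake.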

Note

The Data Production WebHome in Confluence is the entry point to find additional information about the samples that are available.

Searching for datasets via the command-line

Another way to interact with the Dataset Searcher is the command-line tool gb2_ds_search. It can list the available metadata values and search for datasets matching a query.

Searching available metadata

gb2_ds_search metadata returns the available values for the specified metadata table:

$ gb2_ds_search metadata --table <metadata>

For example:

$ gb2_ds_search metadata --table MCEventTypes

"charged"
"uubar"
"ee"
"3960640000"
"4190020000"
...

The available options are:

--help

Show the help message and exit

--table

Table to search values from: Releases, GlobalTags, MCEventTypes, GeneralSkimNames, SkimDecayModes, Campaigns, DataTypes, DataLevels, BeamEnergies, BkgLevels.
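If the metadata values are needed inside a script, the command can be wrapped from Python. The sketch below is not part of gbasf2: the function names are hypothetical, it assumes that the tool prints one double-quoted value per line as in the example above, and query_metadata only works in an environment where gb2_ds_search is available and a valid grid proxy is set up.

```python
import subprocess
from typing import List

# Table names taken from the --table option described above
KNOWN_TABLES = {
    "Releases", "GlobalTags", "MCEventTypes", "GeneralSkimNames",
    "SkimDecayModes", "Campaigns", "DataTypes", "DataLevels",
    "BeamEnergies", "BkgLevels",
}


def parse_metadata_values(output: str) -> List[str]:
    """Turn lines such as '"charged"' into plain strings."""
    values = []
    for line in output.splitlines():
        value = line.strip().strip('"')
        if value:
            values.append(value)
    return values


def query_metadata(table: str) -> List[str]:
    """Run 'gb2_ds_search metadata --table <table>' and parse its output."""
    if table not in KNOWN_TABLES:
        raise ValueError(f"unknown metadata table: {table!r}")
    result = subprocess.run(
        ["gb2_ds_search", "metadata", "--table", table],
        capture_output=True, text=True, check=True,
    )
    return parse_metadata_values(result.stdout)
```

Checking the table name before launching the command gives an immediate error for typos instead of a failed grid call.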

Searching datasets

Similarly to the Dataset Searcher on the web, gb2_ds_search dataset returns the list of datasets that match the query. The usage is:

$ gb2_ds_search dataset [-h] [-o OUTPUT_FILE] [--campaign CAMPAIGN]
                        [--data_type DATA_TYPE] [--data_level DATA_LEVEL]
                        [--run_high RUN_HIGH] [--exp_high EXP_HIGH]
                        [--run_low RUN_LOW] [--exp_low EXP_LOW]
                        [--mc_event MC_EVENT] [--skim_decay SKIM_DECAY]
                        [--general_skim GENERAL_SKIM]
                        [--beam_energy BEAM_ENERGY]
                        [--global_tag GLOBAL_TAG] [--release RELEASE]
                        [--bkg_level BKG_LEVEL]

A full description of the options is available at Dataset Searcher tools.
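When many queries have to be issued, the command line can be assembled programmatically. The following Python sketch builds the argument list from keyword arguments, using only the option names listed in the usage above. The helper build_dataset_query is hypothetical, and the values in the example (proc11, 4S) are the ones quoted for the web interface; the values accepted by the command line should be checked with gb2_ds_search metadata.

```python
import shlex
from typing import List, Optional

# Long options listed in the usage of 'gb2_ds_search dataset'
DATASET_OPTIONS = {
    "campaign", "data_type", "data_level", "run_high", "exp_high",
    "run_low", "exp_low", "mc_event", "skim_decay", "general_skim",
    "beam_energy", "global_tag", "release", "bkg_level",
}


def build_dataset_query(output_file: Optional[str] = None, **criteria) -> List[str]:
    """Build the argument list for 'gb2_ds_search dataset'."""
    unknown = set(criteria) - DATASET_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    command = ["gb2_ds_search", "dataset"]
    if output_file is not None:
        command += ["-o", str(output_file)]
    for name in sorted(criteria):  # sorted for a reproducible command line
        command += [f"--{name}", str(criteria[name])]
    return command


if __name__ == "__main__":
    command = build_dataset_query(campaign="proc11", beam_energy="4S")
    print(shlex.join(command))
```

The resulting list can be passed directly to subprocess.run in an environment where gbasf2 is set up.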

Datasets and files metadata

Additional information related to the datasets and files located on the grid is stored in the metadata catalog. Metadata is information that describes the data itself. It is essential for correctly identifying the input data for any kind of processing, such as mdst production, skimming, or physics analysis.

Files, datablocks and datasets each use a different metadata schema, so the information obtained from each one is different. See Computing Metadata in Confluence for details.

To retrieve metadata of datasets, use the command gb2_ds_query_dataset. For example:

$ gb2_ds_query_dataset -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst

udst
    dataset: /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst
    creationDate: 2020-11-25 14:53:17
    lastUpdate: 2020-11-27 02:11:42
    nFiles: 1
    size: 105046
    status: good
    productionId: 16031
    transformationId: 348793
    owner: g:belle_skim
    mc: SkimP11x1
    stream:
    dataType: data
    dataLevel: udst
    beamEnergy: 4S
    mcEventType:
    generalSkimName:
    skimDecayMode: 18530200
    release: release-05-01-03
    dbGlobalTag: DB00001363
    sourceCode:
    sourceCodeRevision:
    steeringFile: skim/SkimP11x1/release-05-01-03/SkimScripts/singleTagPseudoScalar_Skim_Standalone.py
    steeringFileRevision:
    experimentLow: 10
    experimentHigh: 10
    runLow: 4275
    runHigh: 4275
    logLfn:
    parentDatasets: /belle/Data/proc/release-04-02-02/DB00000938/proc11/prod00013369/e0010/4S/r04275/mdst
    description: SkimP11x1 singleTagPseudoScalar skim on proc11_exp10r2_b1.

where the -l option displays the metadata in an easily readable format. Notice that the metadata contains useful information such as the parent datasets, a description, the steering file, etc.

Analogously, metadata of datablocks and files is obtained with gb2_ds_query_datablock and gb2_ds_query_file. Using the same LPN as the previous example:

$ gb2_ds_query_datablock -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst

/belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00
    lastUpdate: 2020-11-26 04:03:35
    nFiles: 1
    status: good
    creationDate: 2020-11-26 03:45:54
    size: 105046

You can easily see that the dataset contains a single datablock, and that the datablock contains only one file. The metadata of the file can be retrieved by providing the LPN of the parent datablock:

$ gb2_ds_query_file -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00

/belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00/udst_000001_prod00016031_task94274000001.root
    jobId: 174693183
    checksumType: Adler32
    runHigh: 4275
    experimentHigh: 10
    eventLow: 325
    guid: B6D9ED03-EEA5-42B5-5DDF-550F9716E91C
    status: good
    site: LCG.Napoli.it
    eventHigh: 1508
    checksum: d9fd3a15
    parentGuids: 4A487D03-0B18-2DB0-96F4-D9E2CB78EE21
    nEvents: 11
    runLow: 4275
    experimentLow: 10
    date: 2020-11-26 03:50:35
    size: 105046
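The -l listings above share a simple layout: an unindented header line followed by indented "key: value" lines. A minimal Python sketch for turning such text into dictionaries is given below. It assumes that the real output is indented exactly as in these examples, and the function name parse_metadata_listing is hypothetical.

```python
from typing import Dict


def parse_metadata_listing(text: str) -> Dict[str, Dict[str, str]]:
    """Parse '-l' style output into {header: {key: value}}."""
    records: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        if raw[0].isspace():
            # Indented line: a field of the current record
            if current is None:
                raise ValueError(f"field before any header: {raw!r}")
            # Split on the first colon only, since values may contain colons
            key, sep, value = raw.strip().partition(":")
            if not sep:
                raise ValueError(f"not a 'key: value' line: {raw!r}")
            current[key.strip()] = value.strip()
        else:
            # Unindented line: starts a new record
            current = {}
            records[raw.strip()] = current
    return records
```

All values are kept as strings; fields that are empty in the listing, such as stream in the dataset example, become empty strings.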

A full description of the usage and options for the command line tools is available at Dataset management tools.