.. rst-class:: dl-parameters .. _dataset-searcher: Searching Datasets for Analysis ******************************* The most common use of the grid for analysis is the submission of jobs to run over input files from the official Belle II data and MC. The files on the grid are distributed on the available storage sites around the world. Fortunately, as a user you do not have to worry about their physical location. A file catalog keeps a record of where the files are located and you just need to provide a logical identifier for the samples of interest for your analysis. Datasets and Datablocks ----------------------- Each file on the grid is located using a **Logical File Name (LFN)**, which always starts with ``/belle``:: /belle/data_type/some_more_directories/dataset/datablock/file.root When the LFN is provided to any of the gbasf2 tools, a replica is retrieved from catalog in the backend. You only need to provide the LFN(s) relevant for your analysis, without dealing with the physical location of the samples. Files are collected within datasets, which can be located using a **Logical Path Name (LPN)**:: /belle/data_type/some_more_directories/dataset A dataset may correspond to a run for an specific experiment number in data, or a MC type in a MC campaign. By design, a directory on the grid can contain 1000 files at most. For this reason, the concept of a datablock was introduced. Each dataset is subdivided into directories with names ``subXX``, where the last two digits are sequentially iterated when the dataset is expanded to hold more files (``sub00``, ``sub01``, ...). In gbasf2, the data handling unit is the datablock. The Dataset Searcher -------------------- The Dataset Searcher is an application on the DIRAC web portal (https://dirac.cc.kek.jp:8443/DIRAC/). You can find it from the Menu / BelleDIRAC Apps / Dataset Searcher. .. # TODO: We must update this part since tabs view is being enabled in production. .. image:: _static/DatasetSearcher.png :width: 600 The categories available to search for datasets are: **Data type** MC or Data. **Background level** The beam background configuration used in simulation: BGx1, BGx0, etc (only for MC). **Beam energy** Energy point in collisions: Y(4S), Y(5S), etc. Off-resonance collisions are specified by the energy in MeV. **Data Levels** Indicator of the level of processing: mdst, udst. **Campaigns** MC or data processing campaign name. **Global Tags** Identifier of the database global tag. **General Skim Names** First stage skim name (only defined for Data). 'All' means no HLT skim. **Skim Types** Physics skim decay mode. See the `software manual `_ for details and the code of the available skims. **Releases** Basf2 release version used to produce the dataset. **MC Event Types** Physics `EventType `_ generated (only defined for MC). For generic samples use charged, mixed, uubar, ee, etc. **Experiment High** Upper boundary of the experiment number in the datasets. **Experiment Low** Lower boundary of the experiment number in the datasets. **Run High** Upper boundary of the run number in the datasets. **Run Low** Lower boundary of the run number in the datasets. Search for the samples you want (e.g. Campaigns: proc11, Beam Energies: 4S, etc) and click on 'Search'. A list of LFNs matching your selection will be provided. At the bottom of the page, there is a button “Download.txt file” that can be used to get a list of LFNs in a local file. The list of LFNs can be used to submit your jobs with ``gbasf2`` (see :ref:`running-jobs`). .. note:: The `Data Production WebHome `_ in Confluence is the entry point to find additional information about the samples that are available. Searching for datasets via the command-line ------------------------------------------- Another way to interact with the Dataset Searcher is using the command line tool ``gb2_ds_search``. It provides options to search available metadata and datasets matching the query. Searching available metadata ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``gb2_ds_search metadata`` will return available values for the metadata specified:: $ gb2_ds_search metadata --table For example:: $ gb2_ds_search metadata --table MCEventTypes "charged" "uubar" "ee" "3960640000" "4190020000" ... The list of available options are --help Show the help message and exit --table Table to search values from: Releases, GlobalTags, MCEventTypes, GeneralSkimNames, SkimDecayModes, Campaigns, DataTypes, DataLevels, BeamEnergies, BkgLevels. Searching datasets ~~~~~~~~~~~~~~~~~~ In a similar way as the Dataset Searcher on web, ``gb2_ds_search dataset`` will return the list of datasets that matches the query. The usage is:: $ gb2_ds_search dataset [-h] [-o OUTPUT_FILE] [--campaign CAMPAIGN] [--data_type DATA_TYPE] [--data_level DATA_LEVEL] [--run_high RUN_HIGH] [--exp_high EXP_HIGH] [--run_low RUN_LOW] [--exp_low EXP_LOW] [--mc_event MC_EVENT] [--skim_decay SKIM_DECAY] [--general_skim GENERAL_SKIM] [--beam_energy BEAM_ENERGY] [--global_tag GLOBAL_TAG] [--release RELEASE] [--bkg_level BKG_LEVEL] A full description of the options is available at :ref:`bin_dssearcher` . Datasets and files metadata --------------------------- Additional information related to the datasets and files located on the grid is stored in the metadata catalog. Metadata is the information about the data in concern. It is essential to correctly identify the input data for any kind of processing, like mdst production, skimming, or physics analysis. Files, datablocks and datasets use a different metadata schema and therefore, information obtained from each one is different. See `Computing Metadata `_ in Confluence for details. To retrieve metadata of datasets, use the command ``gb2_ds_query_dataset``. For example:: $ gb2_ds_query_dataset -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst udst dataset: /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst creationDate: 2020-11-25 14:53:17 lastUpdate: 2020-11-27 02:11:42 nFiles: 1 size: 105046 status: good productionId: 16031 transformationId: 348793 owner: g:belle_skim mc: SkimP11x1 stream: dataType: data dataLevel: udst beamEnergy: 4S mcEventType: generalSkimName: skimDecayMode: 18530200 release: release-05-01-03 dbGlobalTag: DB00001363 sourceCode: sourceCodeRevision: steeringFile: skim/SkimP11x1/release-05-01-03/SkimScripts/singleTagPseudoScalar_Skim_Standalone.py steeringFileRevision: experimentLow: 10 experimentHigh: 10 runLow: 4275 runHigh: 4275 logLfn: parentDatasets: /belle/Data/proc/release-04-02-02/DB00000938/proc11/prod00013369/e0010/4S/r04275/mdst description: SkimP11x1 singleTagPseudoScalar skim on proc11_exp10r2_b1. where the ``-l`` option displays the metadata to be easily readable. Notice that the metadata contains useful information such as the parent datasets, a description, the steering file, etc. Analogously, metadata of datablocks and files is obtained with ``gb2_ds_query_datablock`` and ``gb2_ds_query_file``. Using the same LPN as the previous example:: $ gb2_ds_query_datablock -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00 lastUpdate: 2020-11-26 04:03:35 nFiles: 1 status: good creationDate: 2020-11-26 03:45:54 size: 105046 You can easily see that the dataset contains an unique datablock, and the datablock contains only one file. Retrieving the metadata of the file can be done providing the LPN of the parent datablock:: $ gb2_ds_query_file -l /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00 /belle/Data/release-05-01-03/DB00001363/SkimP11x1/prod00016031/e0010/4S/r04275/18530200/udst/sub00/udst_000001_prod00016031_task94274000001.root jobId: 174693183 checksumType: Adler32 runHigh: 4275 experimentHigh: 10 eventLow: 325 guid: B6D9ED03-EEA5-42B5-5DDF-550F9716E91C status: good site: LCG.Napoli.it eventHigh: 1508 checksum: d9fd3a15 parentGuids: 4A487D03-0B18-2DB0-96F4-D9E2CB78EE21 nEvents: 11 runLow: 4275 experimentLow: 10 date: 2020-11-26 03:50:35 size: 105046 A full description of the usage and options for the command line tools is available at :ref:`bin_ds`.