.. highlight:: shell

.. _running-jobs:

Submitting jobs to the grid
***************************

gbasf2 is the extension of basf2 that takes your jobs from your desktop to the grid. The same steering files used with basf2 in your local environment can be used with gbasf2 on the grid. The usual workflow is developing a basf2 steering file, testing it locally, and then submitting the jobs to the grid with the same steering file.

.. note::

  Before starting, please understand the following:

  * The grid is NOT a local computing system like KEKCC.
  * Once you submit jobs, they will be assigned to computing systems around the world.
  * If your job is problematic, it will be distributed around the world and all sites will be affected.
  * **Therefore, you must check your jobs on a local computing system carefully before you submit them to the grid!**

.. note::

  If any issues occur, contact the `users forum `_ for assistance. To receive assistance as quickly as possible, before posting in the forum, you can:

  * Check the `gbasf2 troubleshooting `_ page for solutions to the issue.
  * Look to see if your issue has already been posted on the `gbasf2 FAQ `_ page or at `questions.belle2.org `_.

  You should also include all the details the experts will need to diagnose the problem, such as your user name, project name, etc.

Job submission
==============

A command-line client, ``gbasf2``, is used for submitting grid-based basf2 jobs. The basic usage is::

    $ gbasf2 <steering_file> -p <project_name> -s <release>

where ``<project_name>`` is a name assigned by you, and ``<release>`` is the available basf2 software version to use.

.. note::

  Your project name must be unique and cannot be reused, even if the project is deleted.

.. note::

  Please do not use special characters in the project names (``$, #, %, /,`` etc.), as they could create problems with file names at some sites and in the databases.

You can always use the flags ``-h`` and ``--usage`` to see a full list of available options and examples.
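The project-name restrictions in the note above can be checked before submitting. A minimal sketch (the helper name is hypothetical; the character set is taken from the note above, and you may want to extend it):

```python
# Characters called out above as problematic for file names and databases.
FORBIDDEN_CHARACTERS = set("$#%/")


def is_valid_project_name(name: str) -> bool:
    """Return True if the project name is non-empty and avoids the
    special characters listed in the note above."""
    return bool(name) and not (set(name) & FORBIDDEN_CHARACTERS)


print(is_valid_project_name("myproject_v1"))  # True
print(is_valid_project_name("my$project"))    # False
```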
If the submission is correct, you will get a summary of your project with information about the number of jobs, etc.

.. warning::

  Once again: before submitting jobs to the grid, be sure that your script works well on a local computer!

.. warning::

  If you do not set the CPU time or event throughput of your jobs manually, ``gbasf2`` sets a default value as the CPU time for the jobs. Usually the estimated time is much larger than the actual time required for your jobs, since it needs to cover heavier use cases. This may prevent your jobs from being started at some sites! So you should consider using either the ``--cputime`` or the ``--evtpersec`` option.

Jobs with input files
=====================

Submitting jobs with a single dataset/datablock is performed using the argument ``-i``::

    $ gbasf2 <steering_file> -p <project_name> -s <release> -i <dataset_path>

For example::

    $ gbasf2 myscript.py -p myproject_v1 -s light-2110-tartarus -i /belle/MC/release-05-02-11/DB00001363/SkimM14ri_ax1/prod00020320/e1003/4S/r00000/charged/18360100/udst

List of datasets as input
-------------------------

If you want to use a list of datasets, like the ones obtained with the :ref:`dataset-searcher`, you can store the list in a file and submit jobs with ``--input_dslist``::

    $ gbasf2 <steering_file> -p myproject -s <release> --input_dslist <dataset_list_file>

.. note::

  If gbasf2 warns that there is no input data, it may indicate that all of the input files you specified are marked as "bad" (not part of the good run list).

Input from the Dataset Searcher
-------------------------------

If the metadata of the desired datasets is known, an additional possibility is to query the Dataset Searcher directly during the gbasf2 submission. The metadata can be specified with ``--input_ds_search``::

    $ gbasf2 <steering_file> -p myproject -s <release> --input_ds_search='metadata1=value;metadata2=value;exp=expLow:expHigh;run=runLow:runHigh'

gbasf2 will search for and use the datasets matching the query as input.
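The ``--input_ds_search`` query string above is a semicolon-separated list of ``attribute=value`` pairs, so it can be assembled programmatically. A minimal sketch (the helper name and the example attribute values are illustrative, not a verified dataset):

```python
def build_ds_query(metadata: dict) -> str:
    """Join attribute/value pairs into the semicolon-separated query
    string accepted by --input_ds_search."""
    return ";".join(f"{key}={value}" for key, value in metadata.items())


query = build_ds_query({
    "dataType": "mc",          # hypothetical attribute/value pairs
    "exp": "1003:1003",        # expLow:expHigh range, as in the text above
})
print(query)  # dataType=mc;exp=1003:1003
```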
The available attributes and values for performing the queries are the same as those described in :ref:`bin_dssearcher`.

Dataset Collections
===================

A collection path works the same as any other input path, except that only a single collection can be used in a single project. Submitting jobs with a collection is performed using the argument ``-i``::

    $ gbasf2 <steering_file> -p <project_name> -s <release> -i <collection_path>

Additional Options
==================

Additional options for advanced usage of gbasf2 are described here, such as adding files to the input sandbox or selecting the environment for the execution. The full list of available options for gbasf2 (and any gb2 tools) can always be retrieved with ``--help`` and ``--usage``. You can also check the command-line reference :ref:`bin_gbasf2`.

Submit jobs with multiple input files per job
---------------------------------------------

The option ``-n`` specifies the number of input files per job. You may specify the number of input files to be fed to each job within this limit, as long as the job finishes within ~10 hours on a standard node. There is a limit of 5 GB on the total input file size. The suggested maximum number of input files is 10; otherwise your project could hammer a grid site with heavy file access.

.. note::

  Be aware that the meaning of the gbasf2 option ``-n`` is different from that of the basf2 option (number of events).

.. warning::

  If the input files come from multiple campaigns (like proc12 + bucket21, for example), DON'T use the ``-n`` option for submission. See `comp-users-forum:2255 `_ for details.

Passing basf2 parameters
------------------------

If you need to pass arguments to the basf2 steering file, the option ``--basf2opt`` is available. For example::

    $ gbasf2 --basf2opt='-n 100'

will process only 100 events per job.

Adding files to the input sandbox
---------------------------------

The input sandbox is delivered to the sites that will execute the jobs of your project.
It contains all the files required during the execution, such as your steering file and additional dependencies. If you need to attach a file to the input sandbox, like a ``.dec`` file or a required library, you can use the option ``-f`` (``--input_sandboxfiles``). The project summary will display the attached files for confirmation.

Submit jobs with my own basf2 module
------------------------------------

Executing basf2 on the grid with your own module is possible by attaching the required libraries to the input sandbox. Create your own module and compile it following the instructions in the `basf2 manual `_. Once compiled, find the ``.so`` and ``.b2mod`` files (usually below ``modules/Linux_x86_64/opt/``) and copy them into your local directory.

You need to add a reference to your module in your basf2 steering file, like::

    import basf2 as b2

    path = b2.create_path()
    b2.register_module('myModule', shared_lib_path='./myModule.so')
    path.add_module('myModule')

and include the ``.so`` and ``.b2mod`` files in the input sandbox using the option ``-f`` during the gbasf2 submission::

    $ gbasf2 -f="myModule.so, myModule.b2mod" ...

.. note::

  Submitting jobs with compiled modules requires specifying the platform on which the libraries were compiled, using ``--resource_tag``. For example, for EL7::

    $ gbasf2 --resource_tag EL7 ...

.. note::

  If you have written a new module or variable, please consider sharing it with other collaborators by `submitting a pull request `_. Then the new feature will be available in upcoming basf2 releases.

Setting the CPU time
--------------------

To prevent your jobs from being stuck waiting because of an overestimated CPU time, you can set either the ``--cputime`` or the ``--evtpersec`` option.

* The option ``--cputime`` sets the expected CPU time consumption of the individual jobs, in minutes in the normalized unit, common to all the jobs in the project. Good for processing run-independent data.
* The option ``--evtpersec`` sets the expected throughput, that is, the number of events processed per second in the normalized unit. The CPU time of each job is then calculated from the average number of events in the input (the total number of events divided by the number of input files).

To get a proper estimate of the required CPU time, one has to multiply the job's runtime on KEKCC by the normalization factor of the KEKCC nodes, which is 20, e.g.::

    cputime = 20 * <runtime on KEKCC in minutes>

So if a job is expected to run 1 hour on KEKCC, you should specify 60 min. * 20 = 1200::

    $ gbasf2 --cputime 1200

The same normalization factor of 20 needs to be applied to the value for the option ``--evtpersec``. This value can be calculated in a similar way::

    evtpersec = nevents / (20 * <runtime on KEKCC in seconds>)

.. note::

  The run time on KEKCC can be estimated by copying one of the input files to KEKCC and running your script over it locally.
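The two formulas above can be captured in a small helper. A minimal sketch (the function names are illustrative; the factor 20 and the formulas are taken directly from the text above):

```python
KEKCC_NORMALIZATION = 20  # normalization factor of the KEKCC nodes (see above)


def grid_cputime(kekcc_runtime_min: float) -> float:
    """Value for --cputime (normalized minutes) from a measured KEKCC runtime."""
    return KEKCC_NORMALIZATION * kekcc_runtime_min


def grid_evtpersec(nevents: int, kekcc_runtime_sec: float) -> float:
    """Value for --evtpersec from the number of events and the KEKCC runtime."""
    return nevents / (KEKCC_NORMALIZATION * kekcc_runtime_sec)


# A job expected to run 1 hour on KEKCC -> --cputime 1200, as in the example above.
print(grid_cputime(60))  # 1200
# A job processing 100000 events in that hour:
print(round(grid_evtpersec(100000, 3600), 2))  # 1.39
```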