.. highlight:: shell

.. _gb2_diagnostic:

Diagnostic tools
****************

The diagnostic tools are grouped in a single command-line application designed to assist users in diagnosing and
troubleshooting issues with their work on the grid. The tool provides diagnostic tests and information retrieval
capabilities to help identify and resolve problems efficiently. It accepts command-line arguments and options
that select specific diagnostic tasks.

.. note::
   The output of the tool is intended to be shared with the comp-users-forum when asking for help after
   an error message is observed. Please copy-paste the output when relevant.

gb2_diagnostic
--------------

.. automodule:: BelleDIRAC.Client.gb2_scripts.gb2_diagnostic
   :members:

.. argparse::
   :filename: ../../Client/gb2_scripts/gb2_diagnostic.py
   :func: getOpt
   :prog: gb2_diagnostic

Basic tests
-----------

Basic connection tests are performed by default when no arguments are provided to the diagnostic tool. For example::

    $ gb2_diagnostic
    INFO ########### General Info ######################
    INFO Timestamp: 2023-06-29 09:51:06.848708 UTC
    INFO proxyInfo : OK
    INFO DIRAC username: username
    INFO DIRAC groupname: belle
    INFO User DN: /DC=org/DC=terena/DC=tcs/C=DE/O=Deutsches Elektronen-Synchrotron DESY/CN=My Name
    INFO Rucio ping: {'version': '1.28.5'}
    INFO ################################################

Any error message after the execution of the command indicates a problem with the connection to the grid.
In this case, send a message to the comp-users-forum that includes the output of the diagnostic tool.

Failed jobs
-----------

To retrieve information about failed jobs, use the ``--failed_job`` option. For example::

    $ gb2_diagnostic --failed_job
    diagnostic of Failed Job .....
    INFO ################ Job Summary ##################
    INFO Failed JobID:
    INFO Minor status: Application Finished With Errors
    INFO Application Status: Unknown error 255 ( 255 : basf2helper.py Exited With Status 255)
    INFO ###############################################
    ...
    JobWrapper Completing Application Finished With Errors Unknown error 255 ( 255 : basf2helper.py Exited With Status 255) 2023-05-23 14:28:10
    JobWrapper Failed Application Finished With Errors Unknown error 255 ( 255 : basf2helper.py Exited With Status 255) 2023-05-23 14:28:24
    ...
    INFO ############### Error in std.out ###############
    ERROR [ERROR] The required object 'gamma:eff_corr' (durability: event) does not exist. Maybe you forgot the module that registers it? { module: PListCutAndCopy_gamma:brems }

The tool reports the job status and the error message. Sometimes the problem is caused by an error in the
basf2 steering file. In this case, fix the steering file and submit a new project with a different name.

.. note::
   When it is not possible to identify the problem, send a message to the comp-users-forum that includes
   the output of the tool.

Jobs in waiting status for long time
------------------------------------

.. note::
   Close to major conferences, the demand for computing resources is higher than usual.
   Check the B2Monitoring Display for User in the DIRAC web portal
   to see the current number of waiting jobs from all users. If the number is high (> 500K), your jobs may
   take several days to be executed.

When jobs are in waiting status for a long time (> 24 hours), the problem may be related to
the CPU time requested by the user versus the availability of sites for execution.
Use the ``--waiting_job`` option to compare the parameters of the queues at the sites available to execute your job
with those requested for the job::

    $ gb2_diagnostic --waiting_job
    INFO ############ JDL Info #######################
    INFO At which Sites to be executed: ['ARC.SIGNET.si', 'LCG.KIT-TARDIS.de', 'OSG.BNL.us', 'LCG.SIGNET.si', 'LCG.SITE.jp']
    INFO Requested normalized CPU time: 13896
    INFO Input ['/belle/Data/release-06-00-12/DB00002392/proc13/prod00027791/e0018/4S/r02231/hadron/10601300/mdst/sub00/mdst_000001_prod00027791_task172230000001.root']
    INFO ################################################
    ...
    INFO ##### site: OSG.BNL.us #####
    INFO site status: Active
    INFO ##### site: LCG.SITE.xx #####
    ERROR site status: Banned
    ...

If only banned sites are shown, your job will not be executed. Contact the comp-users-forum to request
replication of the input somewhere else.

Statistics of the pilots on the sites are also shown::

    INFO ############### Pilot Status #################
    INFO Pilot count (last day) for OSG.UMiss.us : {'Deleted': 6, 'Done': 200, 'Running': 8}
    INFO Pilot count (last day) for ARC.SIGNET.si : {'Aborted': 8, 'Done': 12515, 'Failed': 1, 'Running': 529, 'Unknown': 7, 'Waiting': 1}
    INFO Pilot count (last day) for LCG.KIT-TARDIS.de : {'Aborted': 921, 'Deleted': 4, 'Done': 2140, 'Running': 377}
    INFO Pilot count (last day) for OSG.BNL.us : {'Aborted': 98, 'Deleted': 4, 'Done': 12805, 'Running': 3327, 'Submitted': 1356, 'Unknown': 63}
    INFO Pilot count (last day) for LCG.SIGNET.si : {}
    ...

A site with no 'Done' pilots in the last day is not available for execution.
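
When the pilot summary is long, the check for sites without 'Done' pilots can be automated. The following is a
minimal sketch, not part of ``gb2_diagnostic``: the function name ``sites_without_done_pilots`` is hypothetical,
and the parsing assumes the output keeps the ``Pilot count (last day) for <site> : {...}`` layout shown in the
pilot status listing.

.. code-block:: python

    import ast
    import re

    # Assumed line layout, taken from the pilot status output:
    #   INFO Pilot count (last day) for OSG.BNL.us : {'Done': 12805, 'Running': 3327}
    PILOT_LINE = re.compile(
        r"Pilot count \(last day\) for (?P<site>\S+)\s*:\s*(?P<counts>\{.*\})"
    )


    def sites_without_done_pilots(output):
        """Return the sites whose pilot summary reports no 'Done' pilots."""
        flagged = []
        for line in output.splitlines():
            match = PILOT_LINE.search(line)
            if match is None:
                continue  # not a pilot count line
            # The counts are printed as a Python dict literal, so literal_eval can read them.
            counts = ast.literal_eval(match.group("counts"))
            if counts.get("Done", 0) == 0:
                flagged.append(match.group("site"))
        return flagged


    sample = """\
    INFO Pilot count (last day) for OSG.UMiss.us : {'Deleted': 6, 'Done': 200, 'Running': 8}
    INFO Pilot count (last day) for LCG.SIGNET.si : {}
    """
    print(sites_without_done_pilots(sample))  # ['LCG.SIGNET.si']

The sample lines are copied from the listing; in practice the saved output of the tool would be passed instead.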
Finally, CPU time checks compare the queues on the sites with the CPU time requested for the job::

    INFO ############### CPUTime check ################
    INFO OSG.UMiss.us : JDL cputime (13896) < minCPUTimeAllQueue(86400.0)
    INFO ARC.SIGNET.si : JDL cputime (13896) < minCPUTimeAllQueue(172800.0)
    INFO LCG.KIT-TARDIS.de : JDL cputime (13896) < minCPUTimeAllQueue(72000.0)
    INFO OSG.BNL.us : JDL cputime (13896) < minCPUTimeAllQueue(345600.0)
    INFO LCG.SIGNET.si : JDL cputime (13896) < minCPUTimeAllQueue(59999999940.0)

The JDL cputime is the parameter defined for your project. If few or no sites have a queue offering enough
CPU time, the job may stay in waiting status for a long time. In this case, consider resubmitting the project
with a proper estimate of the CPU time, or reduce the complexity of your steering file.

.. note::
   The CPUTime used in the comparison comes from the DIRAC Configuration, which works as a reasonably fair test.
   When the values defined in the JDL are too close to the limit, the actual CPUTime comparison should be done
   with information from the Pilot Monitor. Send the output to the experts and ask them to look at the CPUTime
   in the Pilot output.

Failed downloads
----------------

The most common causes of unavailable input files are:

* Problem with the user certificate
* Problem with the local gbasf2 installation
* Problem with the storage element (SE) where the file is stored

To retrieve information about failed downloads, use the ``--failed_download`` option. In addition to the basic tests,
it shows information on the availability of the storage element::

    $ gb2_diagnostic --failed_download
    Diagnostic for download.....
    INFO File to check: LFN
    INFO Replicas at SEs : ['Napoli-TMP-SE']
    INFO Read Access status for Napoli-TMP-SE: Active
    INFO Get information file at Napoli-TMP-SE: OK

Share the output of the tool with the comp-users-forum to get help when your files are not available for download.
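
Because the output is meant to be shared, it can be convenient to save it to a file instead of copying it from the
terminal. The following is a minimal sketch using only the Python standard library; the helper
``save_command_output`` is hypothetical and not part of gbasf2. The demonstration runs a stand-in command so that
it works anywhere; on a machine with gbasf2 set up, one would pass ``["gb2_diagnostic"]`` followed by the option
under investigation.

.. code-block:: python

    import os
    import subprocess
    import sys
    import tempfile


    def save_command_output(command, path):
        """Run command, write its combined stdout and stderr to path, return the exit code."""
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages next to the regular output
            text=True,
            check=False,  # a failing diagnostic is exactly what should be recorded
        )
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(result.stdout)
        return result.returncode


    if __name__ == "__main__":
        # Stand-in command (assumption): replace with ["gb2_diagnostic"] on a gbasf2 setup.
        stand_in = [sys.executable, "-c", "print('diagnostic output')"]
        with tempfile.TemporaryDirectory() as workdir:
            log_path = os.path.join(workdir, "diagnostic.log")
            exit_code = save_command_output(stand_in, log_path)
            with open(log_path, encoding="utf-8") as handle:
                print(exit_code, handle.read().strip())

Merging the error stream into the standard output keeps the messages in the order they were printed, which is
usually what the experts need to see.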