Diagnostic tools
The diagnostic tools are grouped in a single command-line application designed to assist users in diagnosing and troubleshooting issues with their work on the grid. This tool provides various diagnostic tests and information retrieval capabilities to help identify and resolve problems efficiently.
It accepts various command-line arguments and options to perform specific diagnostic tasks.
Note
The output of the tool is intended to be shared with the comp-users-forum when asking for help if an error message is observed. Please copy-paste the output when relevant.
gb2_diagnostic
Diagnostic for different operations. By default, checks general information (OS, proxy, DN, rucio Ping).
Examples:
$ gb2_diagnostic
$ gb2_diagnostic --no_color
$ gb2_diagnostic --failed_download <LFN>
$ gb2_diagnostic --failed_job <JobID>
$ gb2_diagnostic --waiting_job <JobID> -o logs.txt
usage: gb2_diagnostic [-h] [-v] [--usage] [--failed_job FAILED_JOB] [--failed_download FAILED_DOWNLOAD] [--waiting_job WAITING_JOB] [-u USER] [--no_color] [-o OUTPUT_FILE]
Named Arguments
- -v, --verbose
increase verbosity (up to -vv)
Default:
0
- --usage
show detailed usage
- --failed_job
JobID of a failed job or project name with failed jobs
- --failed_download
LFN of failed files
- --waiting_job
JobID of a waiting job or project name with waiting jobs
- -u, --user
specify user name
- --no_color
No colored terminal output
Default:
False
- -o, --output_file
Output file to write the log
Basic tests
Basic connection tests are performed by default when no arguments are provided to the diagnostic tool. For example:
$ gb2_diagnostic
INFO ########### General Info ######################
INFO Timestamp: 2023-06-29 09:51:06.848708 UTC
INFO proxyInfo : OK
INFO DIRAC username: username
INFO DIRAC groupname: belle
INFO User DN: /DC=org/DC=terena/DC=tcs/C=DE/O=Deutsches Elektronen-Synchrotron DESY/CN=My Name
INFO Rucio ping: {'version': '1.28.5'}
INFO ################################################
Any error message after the execution of the command indicates a problem with the connection to the grid. In this case, send a message to the comp-users-forum that includes the output of the diagnostic tool.
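As a minimal sketch of this check, the following Python snippet scans saved output (for example, a file written with -o) for lines that begin with ERROR. It assumes that every line starts with its log level, as in the examples on this page. The function name find_error_lines is hypothetical and not part of gbasf2.

```python
from typing import List


def find_error_lines(output: str) -> List[str]:
    """Return the lines of a diagnostic log whose level prefix is ERROR."""
    # Assumption: each log line starts with its level (INFO, ERROR, ...),
    # as in the example output shown on this page.
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip().startswith("ERROR")
    ]


# Lines copied from different examples on this page, for illustration only.
sample = """\
INFO proxyInfo : OK
INFO DIRAC groupname: belle
ERROR site status: Banned
"""

for line in find_error_lines(sample):
    print(line)
```

An empty result only means that no ERROR-prefixed line was found; it does not replace reading the full output.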
Failed jobs
To retrieve information about failed jobs, use the --failed_job option. For example:
$ gb2_diagnostic --failed_job <job_ID>
diagnostic of Failed Job .....
INFO ################ Job Summary ##################
INFO Failed JobID: <job_ID>
INFO Minor status: Application Finished With Errors
INFO Application Status: Unknown error 255 ( 255 : basf2helper.py Exited With Status 255)
INFO ###############################################
...
JobWrapper Completing Application Finished With Errors Unknown error 255 ( 255 : basf2helper.py Exited With Status 255) 2023-05-23 14:28:10
JobWrapper Failed Application Finished With Errors Unknown error 255 ( 255 : basf2helper.py Exited With Status 255) 2023-05-23 14:28:24
...
INFO ############### Error in std.out ###############
ERROR [ERROR] The required object 'gamma:eff_corr' (durability: event) does not exist.
Maybe you forgot the module that registers it? { module: PListCutAndCopy_gamma:brems }
The tool provides useful information about the job status and the error message. Sometimes, the problem is caused by an error in the basf2 steering file. In this case, fix the steering file and submit a new project with a different name.
Note
When it is not possible to identify the problem, send a message to the comp-users-forum that includes the output of the tool.
Jobs in waiting status for a long time
Note
Close to major conferences, the demand for computing resources is higher than usual. Check the B2Monitoring Display for User in the DIRAC web portal to see the current number of waiting jobs from all users. If the number is high (> 500K), your jobs may take several days to be executed.
When jobs stay in waiting status for a long time (> 24 hours), the problem may be related to the CPU time requested by the user versus the availability of sites for execution. Use the --waiting_job option to compare the parameters of the queues at the sites available to execute your jobs with those requested for your job:
$ gb2_diagnostic --waiting_job <job_ID>
INFO ############ JDL Info #######################
INFO At which Sites to be executed: ['ARC.SIGNET.si', 'LCG.KIT-TARDIS.de', 'OSG.BNL.us', 'LCG.SIGNET.si', 'LCG.SITE.jp']
INFO Requested normalized CPU time: 13896
INFO Input ['/belle/Data/release-06-00-12/DB00002392/proc13/prod00027791/e0018/4S/r02231/hadron/10601300/mdst/sub00/mdst_000001_prod00027791_task172230000001.root']
INFO ################################################
...
INFO ##### site: OSG.BNL.us #####
INFO site status: Active
INFO ##### site: LCG.SITE.xx #####
ERROR site status: Banned
...
If only banned sites are shown, your job will not be executed. Contact the comp-users-forum to request replication of the input somewhere else.
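The "only banned sites" condition can be checked on saved output with a short script. This is a sketch that assumes the site header and "site status" lines look exactly like the example above. The names parse_site_status and all_sites_banned are hypothetical, and status values other than Active and Banned are not covered by this page.

```python
import re
from typing import Dict

# Assumed line formats, taken from the example output on this page.
SITE_LINE = re.compile(r"#+\s*site:\s*(\S+)\s*#+")
STATUS_LINE = re.compile(r"site status:\s*(\S+)")


def parse_site_status(output: str) -> Dict[str, str]:
    """Map each site name to the status reported on the following line."""
    statuses = {}
    current_site = None
    for line in output.splitlines():
        site_match = SITE_LINE.search(line)
        if site_match:
            current_site = site_match.group(1)
            continue
        status_match = STATUS_LINE.search(line)
        if status_match and current_site is not None:
            statuses[current_site] = status_match.group(1)
            current_site = None
    return statuses


def all_sites_banned(statuses: Dict[str, str]) -> bool:
    """True only if at least one site was found and every one is Banned."""
    return bool(statuses) and all(s == "Banned" for s in statuses.values())


sample = """\
INFO ##### site: OSG.BNL.us #####
INFO site status: Active
INFO ##### site: LCG.SITE.xx #####
ERROR site status: Banned
"""

statuses = parse_site_status(sample)
print(statuses)
print("All sites banned:", all_sites_banned(statuses))
```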
Statistics of the pilots on sites are also shown:
INFO ############### Pilot Status #################
INFO Pilot count (last day) for OSG.UMiss.us : {'Deleted': 6, 'Done': 200, 'Running': 8}
INFO Pilot count (last day) for ARC.SIGNET.si : {'Aborted': 8, 'Done': 12515, 'Failed': 1, 'Running': 529, 'Unknown': 7, 'Waiting': 1}
INFO Pilot count (last day) for LCG.KIT-TARDIS.de : {'Aborted': 921, 'Deleted': 4, 'Done': 2140, 'Running': 377}
INFO Pilot count (last day) for OSG.BNL.us : {'Aborted': 98, 'Deleted': 4, 'Done': 12805, 'Running': 3327, 'Submitted': 1356, 'Unknown': 63}
INFO Pilot count (last day) for LCG.SIGNET.si : {}
...
No ‘Done’ pilots in the last day indicates that the site is not available for execution.
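To spot such sites quickly in saved output, the pilot statistics can be parsed with a few lines of Python. This sketch assumes that each "Pilot count" line ends with a Python-style dictionary literal, as in the example above. The function names are hypothetical and not part of gbasf2.

```python
import ast
import re
from typing import Dict, List

# Assumed line format, taken from the example output on this page.
PILOT_LINE = re.compile(r"Pilot count \(last day\) for (\S+)\s*:\s*(\{.*\})")


def parse_pilot_counts(output: str) -> Dict[str, Dict[str, int]]:
    """Map each site to its dictionary of pilot counts by status."""
    counts = {}
    for line in output.splitlines():
        match = PILOT_LINE.search(line)
        if match:
            # literal_eval safely parses the dictionary printed by the tool.
            counts[match.group(1)] = ast.literal_eval(match.group(2))
    return counts


def sites_without_done_pilots(counts: Dict[str, Dict[str, int]]) -> List[str]:
    """Return sites that report no 'Done' pilots in the last day."""
    return sorted(site for site, c in counts.items() if c.get("Done", 0) == 0)


sample = """\
INFO Pilot count (last day) for OSG.UMiss.us : {'Deleted': 6, 'Done': 200, 'Running': 8}
INFO Pilot count (last day) for LCG.SIGNET.si : {}
"""

print(sites_without_done_pilots(parse_pilot_counts(sample)))
```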
Finally, CPU time checks are performed comparing the queues on the sites vs the requested CPU time for the job:
INFO ############### CPUTime check ################
INFO OSG.UMiss.us : JDL cputime (13896) < minCPUTimeAllQueue(86400.0)
INFO ARC.SIGNET.si : JDL cputime (13896) < minCPUTimeAllQueue(172800.0)
INFO LCG.KIT-TARDIS.de : JDL cputime (13896) < minCPUTimeAllQueue(72000.0)
INFO OSG.BNL.us : JDL cputime (13896) < minCPUTimeAllQueue(345600.0)
INFO LCG.SIGNET.si : JDL cputime (13896) < minCPUTimeAllQueue(59999999940.0)
The JDL cputime is the parameter defined for your project. If few or no sites have a queue with enough CPU time, the job may stay in waiting status for a long time. In this case, consider resubmitting the project with a proper estimate of the CPU time, or reduce the complexity of your steering file.
Note
The CPUTime used in the comparison comes from the DIRAC Configuration, which works as a reasonably fair test. When the values defined in the JDL are too close to the limit, the actual CPUTime comparison should be done with information from the Pilot Monitor. Send the output to the experts and ask them to look at the CPUTime in the pilot output.
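The CPUTime check lines can also be read by a script to see how close the requested value is to each site's limit. The sketch below assumes the line format of the example output; the comparison symbol printed when the request exceeds the limit is not shown on this page, so the pattern accepts any of <, > or =. The function names and the 0.8 threshold are hypothetical choices for illustration, not values defined by gbasf2.

```python
import re
from typing import Dict, List, Tuple

# Assumed line format, taken from the example output on this page.
CPU_LINE = re.compile(
    r"(\S+)\s*:\s*JDL cputime \(([\d.]+)\)\s*[<>=]+\s*"
    r"minCPUTimeAllQueue\(([\d.]+)\)"
)


def parse_cpu_checks(output: str) -> Dict[str, Tuple[float, float]]:
    """Map each site to (requested CPU time, minCPUTimeAllQueue)."""
    checks = {}
    for line in output.splitlines():
        match = CPU_LINE.search(line)
        if match:
            site, requested, limit = match.groups()
            checks[site] = (float(requested), float(limit))
    return checks


def sites_close_to_limit(
    checks: Dict[str, Tuple[float, float]], fraction: float = 0.8
) -> List[str]:
    """Return sites where the request reaches `fraction` of the limit."""
    # The default fraction is an arbitrary illustration, not an official value.
    return sorted(
        site for site, (requested, limit) in checks.items()
        if requested >= fraction * limit
    )


sample = """\
INFO OSG.UMiss.us : JDL cputime (13896) < minCPUTimeAllQueue(86400.0)
INFO LCG.KIT-TARDIS.de : JDL cputime (13896) < minCPUTimeAllQueue(72000.0)
"""

checks = parse_cpu_checks(sample)
print(checks)
print("Close to the limit:", sites_close_to_limit(checks))
```

Sites flagged this way are the ones for which a check against the Pilot Monitor information is most relevant.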
Failed downloads
The most common causes of unavailable input files are:
- Problem with the user certificate
- Problem with the local gbasf2 installation
- Problem with the storage element (SE) where the file is stored
To retrieve information about failed downloads, use the --failed_download option. In addition to the basic tests, it will show information on the availability of the Storage Element:
$ gb2_diagnostic --failed_download <LFN>
Diagnostic for download.....
INFO File to check: LFN
INFO Replicas at SEs : ['Napoli-TMP-SE']
INFO Read Access status for Napoli-TMP-SE: Active
INFO Get information file at Napoli-TMP-SE: OK
Share the output of the tool with the comp-users-forum to get help when your files are not available for download.
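Before posting, the download check can be summarized per Storage Element with a short script. This is a sketch that assumes the three line formats of the example above and treats any value other than Active (read access) or OK (file information) as a problem. Other possible values are not listed on this page, and the function names are hypothetical.

```python
import ast
import re
from typing import Dict, List

# Assumed line formats, taken from the example output on this page.
REPLICAS_LINE = re.compile(r"Replicas at SEs\s*:\s*(\[.*\])")
ACCESS_LINE = re.compile(r"Read Access status for (\S+?):\s*(\S+)")
FILE_INFO_LINE = re.compile(r"Get information file at (\S+?):\s*(\S+)")


def summarize_download_check(output: str) -> Dict[str, Dict[str, str]]:
    """Collect read-access and file-information results per Storage Element."""
    summary = {}
    for line in output.splitlines():
        match = REPLICAS_LINE.search(line)
        if match:
            # Register every SE that holds a replica, even without details.
            for se in ast.literal_eval(match.group(1)):
                summary.setdefault(se, {})
            continue
        match = ACCESS_LINE.search(line)
        if match:
            summary.setdefault(match.group(1), {})["read_access"] = match.group(2)
            continue
        match = FILE_INFO_LINE.search(line)
        if match:
            summary.setdefault(match.group(1), {})["file_info"] = match.group(2)
    return summary


def problematic_storage_elements(summary: Dict[str, Dict[str, str]]) -> List[str]:
    """Return SEs whose read access is not Active or file info is not OK."""
    return sorted(
        se for se, info in summary.items()
        if info.get("read_access") != "Active" or info.get("file_info") != "OK"
    )


sample = """\
INFO Replicas at SEs : ['Napoli-TMP-SE']
INFO Read Access status for Napoli-TMP-SE: Active
INFO Get information file at Napoli-TMP-SE: OK
"""

summary = summarize_download_check(sample)
print(summary)
print("Problematic SEs:", problematic_storage_elements(summary))
```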