Diagnostic tools

The diagnostic tools are grouped in a single command-line application designed to help users diagnose and troubleshoot issues with their work on the grid. It provides diagnostic tests and information retrieval capabilities to help identify and resolve problems efficiently.

It accepts various command-line arguments and options to perform specific diagnostic tasks.

Note

If you observe an error message, share the output of the tool with the comp-users-forum when asking for help. Please copy and paste the relevant output.

gb2_diagnostic

Diagnostic for different operations. By default, checks general information (OS, proxy, DN, rucio Ping).

Examples:

$ gb2_diagnostic
$ gb2_diagnostic --no_color
$ gb2_diagnostic --failed_download <LFN>
$ gb2_diagnostic --failed_job <JobID>
$ gb2_diagnostic --waiting_job <JobID> -o logs.txt

usage: gb2_diagnostic [-h] [-v] [--usage] [--failed_job FAILED_JOB] [--failed_download FAILED_DOWNLOAD] [--waiting_job WAITING_JOB] [-u USER] [--no_color] [-o OUTPUT_FILE]

Named Arguments

-v, --verbose

increase verbosity (up to -vv)

Default: 0

--usage

show detailed usage

--failed_job

JobID of a failed job or project name with failed jobs

--failed_download

LFN of failed files

--waiting_job

JobID of a waiting job or project name with waiting jobs

-u, --user

specify user name

--no_color

No Colored terminal output

Default: False

-o, --output_file

Output file to write the log

Basic tests

Basic connection tests are performed by default when no arguments are provided to the diagnostic tool. For example:

$ gb2_diagnostic

INFO    ########### General Info ######################
INFO    Timestamp: 2023-06-29 09:51:06.848708 UTC
INFO    proxyInfo : OK
INFO    DIRAC username: username
INFO    DIRAC groupname: belle
INFO    User DN: /DC=org/DC=terena/DC=tcs/C=DE/O=Deutsches Elektronen-Synchrotron DESY/CN=My Name
INFO    Rucio ping: {'version': '1.28.5'}
INFO    ################################################

Any error message after the execution of the command indicates a problem with the connection to the grid. In this case, send a message to the comp-users-forum and include the output of the diagnostic tool.
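Checking the output for error messages can also be scripted, for example before posting a log to the forum. The following is only a minimal sketch: it assumes that each log line starts with a level keyword such as INFO or ERROR, as in the sample output above, and the function name find_error_lines is hypothetical and not part of gbasf2.

```python
def find_error_lines(diagnostic_output):
    """Return the lines of a diagnostic log whose level keyword is ERROR."""
    errors = []
    for line in diagnostic_output.splitlines():
        fields = line.split()
        # Assumption: the first whitespace-separated token is the log level.
        if fields and fields[0] == "ERROR":
            errors.append(line.strip())
    return errors


# Excerpt of the sample output shown above (no ERROR lines expected).
sample = """\
INFO    ########### General Info ######################
INFO    proxyInfo : OK
INFO    DIRAC groupname: belle
INFO    Rucio ping: {'version': '1.28.5'}
INFO    ################################################
"""

errors = find_error_lines(sample)
if errors:
    print("Problems found, share the full output with the forum:")
    for line in errors:
        print(line)
else:
    print("No ERROR lines found in the basic tests.")
```

The text to scan could come from a file written with the -o option described above.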

Failed jobs

To retrieve information about failed jobs, use the --failed_job option. For example:

$ gb2_diagnostic --failed_job <job_ID>

diagnostic of Failed Job .....
    INFO    ################ Job Summary ##################
    INFO    Failed JobID: <job_ID>
    INFO    Minor status: Application Finished With Errors
    INFO    Application Status: Unknown error 255 ( 255 : basf2helper.py Exited With Status 255)
    INFO    ###############################################
...
JobWrapper           Completing  Application Finished With Errors  Unknown error 255 ( 255 : basf2helper.py Exited With Status 255)  2023-05-23 14:28:10
JobWrapper           Failed      Application Finished With Errors  Unknown error 255 ( 255 : basf2helper.py Exited With Status 255)  2023-05-23 14:28:24
...
INFO        ############### Error in std.out ###############
ERROR   [ERROR] The required object 'gamma:eff_corr' (durability: event) does not exist.
Maybe you forgot the module that registers it?  { module: PListCutAndCopy_gamma:brems }

The tool provides useful information about the job status and the error message. Sometimes the problem is caused by an error in the basf2 steering file, as in the example above. In this case, fix the steering file and submit a new project with a different name.

Note

When it is not possible to identify the problem, send a message to the comp-users-forum and include the output of the tool.

Jobs in waiting status for a long time

Note

Close to major conferences, the demand for computing resources is higher than usual. Check the B2Monitoring Display for User in the DIRAC web portal to see the current number of waiting jobs from all users. If the number is high (> 500K), your jobs may take several days to be executed.

When jobs stay in waiting status for a long time (> 24 hours), the problem may be related to the CPU time requested by the user versus the availability of sites for execution. Use the --waiting_job option to check the parameters of the queues at the sites available to execute your jobs, compared with the parameters requested for your job:

$ gb2_diagnostic --waiting_job <job_ID>

INFO        ############ JDL Info #######################
    INFO    At which Sites to be executed: ['ARC.SIGNET.si', 'LCG.KIT-TARDIS.de', 'OSG.BNL.us', 'LCG.SIGNET.si', 'LCG.SITE.jp']
    INFO    Requested normalized CPU time: 13896
    INFO    Input ['/belle/Data/release-06-00-12/DB00002392/proc13/prod00027791/e0018/4S/r02231/hadron/10601300/mdst/sub00/mdst_000001_prod00027791_task172230000001.root']
INFO        ################################################
...
INFO        ##### site: OSG.BNL.us #####
    INFO    site status: Active
    INFO    ##### site: LCG.SITE.xx #####
    ERROR   site status: Banned
...

If only banned sites are shown, your job will not be executed. Contact the comp-users-forum to request replication of the input somewhere else.
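The check for banned sites can be sketched in a few lines of Python. This is an illustration only: it assumes the log keeps the "site:" header followed by a "site status:" line, as in the sample output above, and the function name site_statuses is hypothetical.

```python
import re


def site_statuses(diagnostic_output):
    """Map each site name to the status reported on the following line."""
    statuses = {}
    current_site = None
    for line in diagnostic_output.splitlines():
        site_match = re.search(r"#+\s*site:\s*(\S+)\s*#+", line)
        if site_match:
            current_site = site_match.group(1)
            continue
        status_match = re.search(r"site status:\s*(\S+)", line)
        if status_match and current_site is not None:
            statuses[current_site] = status_match.group(1)
    return statuses


# Lines taken from the sample output above.
sample = """\
INFO        ##### site: OSG.BNL.us #####
    INFO    site status: Active
    INFO    ##### site: LCG.SITE.xx #####
    ERROR   site status: Banned
"""

statuses = site_statuses(sample)
# True only when at least one site is listed and all of them are banned.
only_banned = bool(statuses) and all(s == "Banned" for s in statuses.values())
print(statuses)
print("Only banned sites:", only_banned)
```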

Statistics of the pilots on sites are also shown:

INFO        ############### Pilot Status #################
    INFO    Pilot count (last day) for OSG.UMiss.us : {'Deleted': 6, 'Done': 200, 'Running': 8}
    INFO    Pilot count (last day) for ARC.SIGNET.si : {'Aborted': 8, 'Done': 12515, 'Failed': 1, 'Running': 529, 'Unknown': 7, 'Waiting': 1}
    INFO    Pilot count (last day) for LCG.KIT-TARDIS.de : {'Aborted': 921, 'Deleted': 4, 'Done': 2140, 'Running': 377}
    INFO    Pilot count (last day) for OSG.BNL.us : {'Aborted': 98, 'Deleted': 4, 'Done': 12805, 'Running': 3327, 'Submitted': 1356, 'Unknown': 63}
    INFO    Pilot count (last day) for LCG.SIGNET.si : {}
...

No ‘Done’ pilots in the last day indicates that the site is not available for execution.
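As a rough sketch, sites without 'Done' pilots can be picked out of the pilot statistics automatically. The code assumes the line layout of the sample output above, where the counts are printed as a Python dictionary; the function name sites_without_done_pilots is hypothetical.

```python
import ast
import re

# Assumption: each line looks like
# "Pilot count (last day) for <site> : {<status>: <count>, ...}"
PILOT_LINE = re.compile(r"Pilot count \(last day\) for (\S+) : (\{.*\})")


def sites_without_done_pilots(diagnostic_output):
    """Return the sites whose pilot statistics contain no 'Done' pilots."""
    idle_sites = []
    for line in diagnostic_output.splitlines():
        match = PILOT_LINE.search(line)
        if not match:
            continue
        site = match.group(1)
        # literal_eval safely parses the printed dictionary.
        counts = ast.literal_eval(match.group(2))
        if counts.get("Done", 0) == 0:
            idle_sites.append(site)
    return idle_sites


# Lines taken from the sample output above.
sample = """\
    INFO    Pilot count (last day) for OSG.UMiss.us : {'Deleted': 6, 'Done': 200, 'Running': 8}
    INFO    Pilot count (last day) for LCG.KIT-TARDIS.de : {'Aborted': 921, 'Deleted': 4, 'Done': 2140, 'Running': 377}
    INFO    Pilot count (last day) for LCG.SIGNET.si : {}
"""

print(sites_without_done_pilots(sample))
```

For the sample lines used here, only LCG.SIGNET.si is reported, because its dictionary is empty.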

Finally, CPU time checks are performed comparing the queues on the sites vs the requested CPU time for the job:

INFO        ############### CPUTime check ################
    INFO    OSG.UMiss.us : JDL cputime (13896) < minCPUTimeAllQueue(86400.0)
    INFO    ARC.SIGNET.si : JDL cputime (13896) < minCPUTimeAllQueue(172800.0)
    INFO    LCG.KIT-TARDIS.de : JDL cputime (13896) < minCPUTimeAllQueue(72000.0)
    INFO    OSG.BNL.us : JDL cputime (13896) < minCPUTimeAllQueue(345600.0)
    INFO    LCG.SIGNET.si : JDL cputime (13896) < minCPUTimeAllQueue(59999999940.0)

The JDL cputime is the parameter defined for your project. If few or no sites have a queue with enough CPU time, the job may stay in waiting status for a long time. In this case, consider resubmitting the project with a proper estimate of the CPU time, or reducing the complexity of your steering file.
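The comparison can be reproduced from the log as a quick cross-check. This sketch assumes that the value in minCPUTimeAllQueue is the CPU time limit of the queues at the site, expressed in the same units as the JDL cputime, and that the lines keep the layout of the sample output above. The function names parse_cpu_checks and sites_with_enough_cpu_time are hypothetical.

```python
import re

# Assumption: "<site> : JDL cputime (<requested>) <sign> minCPUTimeAllQueue(<limit>)"
CPU_LINE = re.compile(
    r"(\S+) : JDL cputime \(([\d.]+)\) \S+ minCPUTimeAllQueue\(([\d.]+)\)"
)


def parse_cpu_checks(diagnostic_output):
    """Map each site to a (requested CPU time, queue CPU time limit) pair."""
    checks = {}
    for line in diagnostic_output.splitlines():
        match = CPU_LINE.search(line)
        if match:
            site, requested, limit = match.groups()
            checks[site] = (float(requested), float(limit))
    return checks


def sites_with_enough_cpu_time(checks):
    """Return the sites whose queue limit exceeds the requested CPU time."""
    return sorted(
        site for site, (requested, limit) in checks.items() if requested < limit
    )


# Lines taken from the sample output above.
sample = """\
    INFO    OSG.UMiss.us : JDL cputime (13896) < minCPUTimeAllQueue(86400.0)
    INFO    ARC.SIGNET.si : JDL cputime (13896) < minCPUTimeAllQueue(172800.0)
    INFO    LCG.KIT-TARDIS.de : JDL cputime (13896) < minCPUTimeAllQueue(72000.0)
"""

checks = parse_cpu_checks(sample)
print(sites_with_enough_cpu_time(checks))
```

An empty or very short list would point to the situation described above, where the requested CPU time leaves few sites able to run the job.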

Note

The CPUTime used in the comparison comes from the DIRAC Configuration, which works as a reasonably fair test. When the values defined in the JDL are too close to the limit, the actual CPUTime comparison should be done with information from the Pilot Monitor. Send the output to the experts and ask them to look at the CPUTime in the pilot output.

Failed downloads

The most common causes of unavailability of input files are:

  • Problem with the user certificate

  • Problem with the local gbasf2 installation

  • Problem with the storage element (SE) where the file is stored

To retrieve information about failed downloads, use the --failed_download option. In addition to the basic tests, it will show information on the availability of the Storage Element:

$ gb2_diagnostic --failed_download <LFN>

Diagnostic for download.....
    INFO    File to check: LFN
    INFO    Replicas at SEs : ['Napoli-TMP-SE']
    INFO    Read Access status for Napoli-TMP-SE: Active
    INFO    Get information file at Napoli-TMP-SE: OK

Share the output of the tool with the comp-users-forum to get help when your files are not available for download.