Collecting & Reporting Site-Specific Job Metrics

One of the many requests I get from customers is the ability to inject site-specific metrics, associated with the job, into the PBS Professional accounting logs at the time of job completion. This might not seem like an obvious request to some, but it is a very handy capability for performing analytics on job performance and system utilization, and for generating charge-back reports that are specific to the site. It can also simplify the site’s infrastructure by eliminating the need for a separate database to store metrics and another reporting program to collect them from various data sources.

Well, I am happy to say that you can fulfill this requirement by utilizing PBS Professional’s plugin framework. PBS Professional has many plugin events, as seen in Figure 1 and Figure 2, below. I am not going to talk about all of these plugin/hook events in this article, but I recommend you review the PBS Professional Administrator Guide.

Figure 1: Admission Control and Management

Figure 2: Job Execution

I had the privilege to work with Cray on a project to integrate their Resource Utilization Reporting (RUR) with PBS Professional to report on various job metrics collected on the Cray compute nodes. For those who don’t know, the PBS MOM daemon (pbs_mom) does not execute on the Cray compute node. Instead, jobs are scheduled to a login node and then the job script calls a command called aprun, which is responsible for launching the application on the compute nodes. From here, RUR is responsible for collecting the statistics on how the compute nodes are used.

Although this project focuses on an aspect of Cray systems, the same capability is hardware- and OS-agnostic and can be used with any system.

The “How”

The implementation was fairly easy once I became familiar with the moving parts of RUR and its plugin framework. Below are snippets of the Python code, along with descriptions of what was accomplished.

First, create a custom RUR output plugin that dumps a unique RUR data file for each job into a relatively reliable location. You can get more information in Cray’s documentation by searching for “Resource Utilization Reporting”. The snippet of code below creates the unique RUR data file in PBS_HOME/spool/, naming it rur.PBS_JOBID.

[Code snippet: custom RUR output plugin writing PBS_HOME/spool/rur.PBS_JOBID]

The PBS_HOME variable was obtained by reading in the /etc/pbs.conf file.
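To give a feel for this step, here is a minimal sketch along the same lines. It is not the original plugin code: the function names, the way RUR hands metrics to the plugin, and the minimal error handling are all assumptions for illustration.

```python
import os

# Hypothetical sketch of a custom RUR output plugin (function names and
# calling convention are assumptions, not Cray's documented API).

def pbs_home(conf_path="/etc/pbs.conf"):
    """Return PBS_HOME as defined in the PBS configuration file."""
    with open(conf_path) as conf:
        for line in conf:
            key, _, value = line.strip().partition("=")
            if key == "PBS_HOME":
                return value
    raise RuntimeError("PBS_HOME not defined in " + conf_path)

def write_rur_data(jobid, metrics, home=None):
    """Append the collected RUR metrics to PBS_HOME/spool/rur.<jobid>."""
    home = home or pbs_home()
    path = os.path.join(home, "spool", "rur." + jobid)
    with open(path, "a") as out:
        out.write(metrics + "\n")
    return path
```

Dropping the file into PBS_HOME/spool/ is convenient because that directory already exists on every MOM host and is writable by the PBS daemons, so no extra shared storage is required.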

Second, create the custom resources in PBS_HOME/server_priv/resourcedef, representing all the site-specific metrics you want to collect. I recommend you review the PBS Professional Administrator Guide for the various resource types and how they are defined.
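For example, resourcedef entries might look like the following. The metric names and types here are illustrative, not the exact ones used in the project:

```
energy_used    type=long
utime          type=long
stime          type=long
max_rss        type=size
```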

Finally, create and import an execjob_epilogue hook, which is responsible for extracting the RUR metrics from the RUR data file (PBS_HOME/spool/rur.PBS_JOBID) and recording the new job metrics with the PBS Server. Refer to the PBS Professional Administrator Guide for the gory details of how to work with hooks. The snippet of code below uses a Python for-loop that iterates through a dynamically created dictionary of RUR metrics and associates each metric’s name and value with the job’s resources_used attribute.

[Code snippet: execjob_epilogue hook recording RUR metrics in resources_used]
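As a hedged sketch of what that hook logic can look like (the exact RUR record format varies by plugin, so the simple name/value parsing below is an assumption, and this is not the original code):

```python
# Hypothetical sketch of the execjob_epilogue hook logic. Assumes the
# RUR data file holds one "name: value" pair per line.

def parse_rur_metrics(text):
    """Parse "name: value" lines into a dict of metric names to values."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            metrics[name.strip()] = value.strip()
    return metrics

# Inside the hook itself (which runs under the PBS Python environment,
# where the pbs module is available):
#
#   import os, pbs
#   e = pbs.event()
#   job = e.job
#   path = os.path.join(PBS_HOME, "spool", "rur." + job.id)
#   with open(path) as f:
#       for name, value in parse_rur_metrics(f.read()).items():
#           job.resources_used[name] = value
#   e.accept()
```

Because the hook sets resources_used before the job record is written, the metrics land in the accounting log’s end-of-job (E) record with no extra reporting machinery.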

The “Results”

Pretty easy, right?! So let’s see what it looks like.

Below is a job record that includes the work we did. Now, I don’t claim that all of these new metrics are necessary, but this example illustrates the flexibility and extensibility of what can be done.

[Sample accounting log job record showing the custom metrics]

I want to highlight a few aspects of the job record. The attributes in blue bold font record which Cray compute nodes were allocated to the job:

[Job record excerpt: attributes recording the allocated Cray compute nodes]

For those who don’t know, as of PBS Professional 11.0 (which was a huge release for Cray), Altair completely re-architected PBS and the way it supports Cray systems.

The attributes in red bold illustrate all the new custom metrics we recorded in the execjob_epilogue hook. Take care to note the units of these metrics, as defined by RUR, when performing analytics on the jobs.

[Job record excerpt: custom RUR metrics recorded in resources_used]

In Closing

PBS Professional has evolved into a highly configurable and extremely scalable piece of software for job scheduling and resource management in a high-performance computing (HPC) environment. Although this example focused on an aspect of Cray systems, the same capability is hardware- and OS-agnostic and can be used with any system. I hope this example has given you some ideas on how to customize your site and fulfill your users’ evolving demands to report on site-specific metrics. For more information, a copy of the hooks, or follow-up questions, please feel free to leave a comment below.

Scott Suchyta

About Scott Suchyta

Scott Suchyta is the Director of Partner Solutions and Integration at Altair Engineering, Inc. He is responsible for managing the technical relationships with the Altair PBS Works division’s hardware and software partners. Scott supports partners’ pre/post-sales engagements, helps resolve technical issues, ensures roadmap alignment, and participates in technical marketing activities. Additionally, he identifies and drives integration opportunities with partners’ products that result in differentiated joint solutions. Scott is a 10+ year veteran of Altair, having previously worked as the PBS Works product manager responsible for the PBS Professional, PBS Application Services, and PBS Portal products. Scott started his career in Altair’s PBS Works division as an application engineer implementing custom solutions using PBS Professional.