Monitoring

Our instant clusters come with dashboards and customizable alerts for monitoring the state of the cluster.

To access the dashboard, navigate to the cluster dropdown and select View metric dashboard.

To access the Grafana portal, follow the provided instructions to obtain the address and login details. When prompted with a certificate warning (common with self-signed certificates), select Advanced and then Proceed to … to continue. The password can be retrieved from the jump host.

Once logged in, navigate to the Dashboards section in the side menu. Four pre-configured dashboards are available:

  • GPU Overview – General GPU monitoring.

  • NVIDIA DCGM – Metrics from the DCGM exporter.

  • Node Exporter Full – Detailed hardware and OS-level system metrics.

  • SLURM Dashboard – Monitoring of SLURM job activity.
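
The same list can also be checked outside the browser. The following is a minimal sketch, not an official client. It assumes the Grafana HTTP API is reachable at the address from the provided instructions, that basic authentication with the Grafana login is accepted, and that the certificate is self-signed, which is why verification is disabled. The base URL, user name, and password are placeholders, and all helper names are hypothetical.

```python
import base64
import json
import ssl
import urllib.request


def insecure_context() -> ssl.SSLContext:
    """TLS context that skips certificate checks (assumes a self-signed certificate)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be turned off before relaxing verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def dashboard_titles(search_results: list) -> list:
    """Return sorted dashboard titles from a Grafana /api/search response body."""
    return sorted(
        item["title"] for item in search_results if item.get("type") == "dash-db"
    )


def list_dashboards(base_url: str, user: str, password: str) -> list:
    """Query Grafana's search endpoint and return the dashboard titles."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/search?type=dash-db",
        headers={"Authorization": basic_auth_header(user, password)},
    )
    with urllib.request.urlopen(request, context=insecure_context(), timeout=10) as resp:
        return dashboard_titles(json.load(resp))


# Example call (placeholders, replace with the address and login from the jump host):
# print(list_dashboards("https://<grafana-address>", "<user>", "<password>"))
```

Disabling certificate verification is only reasonable here because the certificate is known to be self-signed. If a trusted certificate is installed later, use the default context instead.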

The cluster is also pre-configured with several alerting rules, which can be viewed under the Alerts tab. Hardware-related alerts are automatically forwarded to DataCrunch for faster resolution. Additional alerts can be created and customized, and their notifications are delivered through Grafana’s contact points. Edit the grafana-default-email contact point to change where notifications are sent. Customer-specific alerts can be routed to any contact point that the customer defines directly within the Grafana UI.
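
For teams that prefer to script this change, the sketch below shows what an updated email contact point could look like. It assumes the cluster's Grafana version exposes the alerting provisioning API (`/api/v1/provisioning/contact-points`), where an email contact point carries its recipients as a single separator-delimited string in `settings.addresses`. The sample contact point, its `uid`, and the helper name are hypothetical. Whether contact points changed through this API remain editable in the UI depends on the Grafana configuration, so the Grafana UI remains the documented path.

```python
import copy


def with_email_addresses(contact_point: dict, addresses: list) -> dict:
    """Return a copy of an email contact point with its recipient list replaced."""
    if contact_point.get("type") != "email":
        raise ValueError("expected an email contact point")
    updated = copy.deepcopy(contact_point)  # leave the caller's object untouched
    # Grafana accepts several addresses in one string, separated by ';'.
    updated.setdefault("settings", {})["addresses"] = ";".join(addresses)
    return updated


# Hypothetical contact point, shaped like an entry returned by
# GET /api/v1/provisioning/contact-points.
default_contact_point = {
    "uid": "example-uid",
    "name": "grafana-default-email",
    "type": "email",
    "settings": {"addresses": "<example@email.com>"},
}

updated = with_email_addresses(
    default_contact_point, ["oncall@example.com", "ml-team@example.com"]
)
print(updated["settings"]["addresses"])
```

The resulting dictionary is the body that would be sent with a `PUT` request to `/api/v1/provisioning/contact-points/<uid>`, authenticated with the Grafana login.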
