Monitoring with Grafana
Once you’ve got students using your hub regularly, the question stops being “is it up?” and starts being “why did it slow down at 14:30 in lesson 3?”. This page sets up a self-hosted Prometheus + Grafana stack on the same box as TLJH, plus a small custom collector so you can see per-student memory, CPU, disk and process counts.
By the end of this page you’ll have:
- Prometheus scraping the host (CPU/RAM/disk), JupyterHub itself, and a per-user collector
- Grafana on http://localhost:3000 with dashboards for the whole stack
- Everything bound to localhost, so you can SSH-tunnel in rather than expose another public port
Install the packages
- Install Prometheus and node-exporter from apt

      sudo apt update
      sudo apt install -y prometheus prometheus-node-exporter apt-transport-https software-properties-common wget

- Add Grafana’s apt repo and install

      sudo mkdir -p /etc/apt/keyrings
      wget -qO- https://apt.grafana.com/gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/grafana.gpg
      echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
      sudo apt update
      sudo apt install -y grafana
Configure node-exporter
We want node-exporter to expose systemd metrics (so we can see the per-user jupyter-<name>@<domain>.service units), a cgroup summary, and a textfile collector directory we’ll write our own per-user metrics into later.

- Edit the defaults file

      sudo nano /etc/default/prometheus-node-exporter

  Replace the ARGS= line with:

      ARGS="--collector.textfile.directory=/var/lib/prometheus/node-exporter/textfile_collector --web.listen-address=127.0.0.1:9100 --collector.systemd --collector.systemd.unit-include=(jupyter-.*|.*\.service) --collector.processes --collector.cgroups --collector.cpu.info --collector.interrupts --collector.logind"

  Key flags:

  - --web.listen-address=127.0.0.1:9100 binds the exporter to localhost only.
  - --collector.textfile.directory=... is the directory we’ll drop custom .prom files into.
  - --collector.systemd with --collector.systemd.unit-include matches the jupyter-* user services so we can graph them.
  - --collector.cgroups exposes a summary of active cgroups per controller. TLJH enforces its per-user limits through cgroups, but the per-user memory and CPU figures come from the custom collector set up later on this page.

- Make sure the textfile directory exists

      sudo mkdir -p /var/lib/prometheus/node-exporter/textfile_collector
      sudo chown prometheus:prometheus /var/lib/prometheus/node-exporter/textfile_collector

- Restart

      sudo systemctl restart prometheus-node-exporter
      sudo systemctl status prometheus-node-exporter --no-pager | head -15
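To see which units the --collector.systemd.unit-include pattern selects, the sketch below tests it against a few unit names in Python. It assumes node-exporter anchors the pattern to the whole unit name (modelled here with re.fullmatch) and that Python’s re module behaves like Go’s regexp for this simple expression. The helper unit_is_included and the unit names are made up for illustration.

```python
import re

# The pattern from the ARGS line on this page.
UNIT_INCLUDE = r"(jupyter-.*|.*\.service)"


def unit_is_included(unit: str, pattern: str = UNIT_INCLUDE) -> bool:
    """Return True if the whole unit name matches the include pattern.

    Assumption: node-exporter anchors the pattern at both ends, which
    re.fullmatch reproduces here.
    """
    return re.fullmatch(pattern, unit) is not None


# Hypothetical unit names, for illustration only.
for unit in ("jupyter-alice@example.org.service", "ssh.service", "dbus.socket"):
    print(unit, unit_is_included(unit))
```

Because .*\.service already matches every service unit, the jupyter-.* branch only adds non-service units whose names start with jupyter-. If the resulting list of series turns out to be noisy, a narrower pattern is worth considering.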
Expose JupyterHub’s own metrics
JupyterHub publishes Prometheus metrics at /hub/metrics, but only authenticated callers with the read:metrics scope can read them. The cleanest way to do that on TLJH is to register a JupyterHub service that owns a token Prometheus can use.

- Create the config snippet

      sudo nano /opt/tljh/config/jupyterhub_config.d/prometheus_service.py

  Paste:

      import pathlib
      import secrets

      # Generate the token once and persist it to disk so Prometheus can read it.
      token_path = pathlib.Path("/opt/tljh/state/prometheus.token")
      if not token_path.exists():
          token_path.parent.mkdir(parents=True, exist_ok=True)
          token_path.write_text(secrets.token_hex(32))
          token_path.chmod(0o600)
      PROM_TOKEN = token_path.read_text().strip()

      c.JupyterHub.services.append({
          "name": "prometheus",
          "api_token": PROM_TOKEN,
      })
      c.JupyterHub.load_roles.append({
          "name": "prometheus-reader",
          "services": ["prometheus"],
          "scopes": ["read:metrics"],
      })

- Reload the hub and grab the token

      sudo tljh-config reload hub
      sudo cat /opt/tljh/state/prometheus.token

  Copy that token; you’ll paste it into Prometheus next.
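Before wiring the token into Prometheus, it can help to confirm that the hub accepts it. The following is a minimal sketch using only the Python standard library. It assumes the hub answers on 127.0.0.1:15001 as described on this page and that it accepts a Bearer authorization header, which is what Prometheus sends by default. The function names are hypothetical helpers, not part of JupyterHub.

```python
import urllib.request

# Assumed endpoint, taken from this page; adjust if your hub listens elsewhere.
METRICS_URL = "http://127.0.0.1:15001/hub/metrics"


def build_metrics_request(token: str, url: str = METRICS_URL) -> urllib.request.Request:
    """Build an authenticated GET request for the hub's metrics endpoint."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(token: str, url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Return the raw metrics text.

    urllib raises urllib.error.HTTPError if the hub rejects the token.
    """
    request = build_metrics_request(token, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_metrics with the contents of /opt/tljh/state/prometheus.token (read as root, since the file is mode 0600) should return plain-text metrics. An HTTP error at this point suggests a problem with the token or the role rather than with Prometheus.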
Configure Prometheus
- Edit the scrape config

      sudo nano /etc/prometheus/prometheus.yml

  Replace the contents with:

      global:
        scrape_interval: 15s
        evaluation_interval: 15s

      scrape_configs:
        - job_name: prometheus
          static_configs:
            - targets: ['127.0.0.1:9090']

        - job_name: node
          static_configs:
            - targets: ['127.0.0.1:9100']

        - job_name: jupyterhub
          metrics_path: /hub/metrics
          scheme: http
          authorization:
            credentials: "<paste-prometheus-token-here>"
          static_configs:
            - targets: ['127.0.0.1:15001']

  The Hub’s REST API (including /hub/metrics) is bound to 127.0.0.1:15001 by TLJH. The authorization.credentials field is the token you just copied from /opt/tljh/state/prometheus.token.

- Bind Prometheus to localhost (optional but recommended)

  Edit /etc/default/prometheus and set:

      ARGS="--web.listen-address=127.0.0.1:9090"

- Restart and verify

      sudo systemctl restart prometheus
      curl -s http://127.0.0.1:9090/api/v1/targets | grep -oE '"health":"[a-z]+"'

  All three jobs should report "health":"up". If jupyterhub is down, double-check the token and that tljh-config reload hub actually restarted the hub.
Per-user metrics collector
Out of the box, node-exporter’s systemd collector tells you whether jupyter-alice@example.org.service is running, but not how much memory or disk that user is consuming. This small script fills the gap by reading the per-user cgroup properties via systemctl show and writing them to a .prom file that node-exporter’s textfile collector picks up.

- Create the collector script

      sudo nano /usr/local/bin/jupyter-metrics

  Paste:

      #!/bin/bash
      # Emit per-jupyter-user metrics for node-exporter textfile collector.
      set -u

      OUT="${1:-/var/lib/prometheus/node-exporter/textfile_collector/jupyter.prom}"
      TMP="$OUT.tmp.$$"

      {
        echo "# HELP jupyter_user_memory_bytes Current memory consumption (cgroup MemoryCurrent)"
        echo "# TYPE jupyter_user_memory_bytes gauge"
        echo "# HELP jupyter_user_cpu_seconds_total Total CPU time consumed (CPUUsageNSec / 1e9)"
        echo "# TYPE jupyter_user_cpu_seconds_total counter"
        echo "# HELP jupyter_user_tasks Current number of tasks (PIDs) in the cgroup"
        echo "# TYPE jupyter_user_tasks gauge"
        echo "# HELP jupyter_user_disk_bytes Disk usage of the user's home directory"
        echo "# TYPE jupyter_user_disk_bytes gauge"
        echo "# HELP jupyter_user_running Whether the per-user service is currently running"
        echo "# TYPE jupyter_user_running gauge"

        # Match only real per-user services: jupyter-<name>@<domain>.service
        systemctl list-units --no-pager --plain --no-legend --all --type=service 2>/dev/null \
          | awk '$1 ~ /^jupyter-[^@]+@.*\.service$/ {print $1, $4}' \
          | while read -r unit active; do
              user="${unit#jupyter-}"
              user="${user%%@*}"
              running=0
              [ "$active" = "running" ] && running=1
              printf 'jupyter_user_running{user="%s"} %d\n' "$user" "$running"

              if [ "$running" = "1" ]; then
                eval "$(systemctl show "$unit" -p MemoryCurrent -p CPUUsageNSec -p TasksCurrent 2>/dev/null | sed 's/=/="/' | sed 's/$/"/')"
                if [ -n "${MemoryCurrent:-}" ] && [ "$MemoryCurrent" != "[not set]" ]; then
                  printf 'jupyter_user_memory_bytes{user="%s"} %s\n' "$user" "$MemoryCurrent"
                fi
                if [ -n "${CPUUsageNSec:-}" ] && [ "$CPUUsageNSec" != "[not set]" ]; then
                  cpu_sec=$(awk "BEGIN { printf \"%.6f\", $CPUUsageNSec / 1000000000 }")
                  printf 'jupyter_user_cpu_seconds_total{user="%s"} %s\n' "$user" "$cpu_sec"
                fi
                if [ -n "${TasksCurrent:-}" ] && [ "$TasksCurrent" != "[not set]" ]; then
                  printf 'jupyter_user_tasks{user="%s"} %s\n' "$user" "$TasksCurrent"
                fi
              fi

              home="/home/jupyter-$user"
              if [ -d "$home" ]; then
                size=$(du -sb "$home" 2>/dev/null | awk '{print $1}')
                if [ -n "$size" ]; then
                  printf 'jupyter_user_disk_bytes{user="%s"} %s\n' "$user" "$size"
                fi
              fi
            done
      } > "$TMP" && mv -f "$TMP" "$OUT"

  Make it executable:

      sudo chmod +x /usr/local/bin/jupyter-metrics

- Run it as a systemd timer

      sudo nano /etc/systemd/system/jupyter-metrics.service

  Paste:

      [Unit]
      Description=Collect per-user JupyterHub metrics for prometheus

      [Service]
      Type=oneshot
      ExecStart=/usr/local/bin/jupyter-metrics
      Nice=10
      IOSchedulingClass=idle

  Then create the timer:

      sudo nano /etc/systemd/system/jupyter-metrics.timer

  Paste:

      [Unit]
      Description=Run jupyter-metrics every minute

      [Timer]
      OnBootSec=30s
      OnUnitActiveSec=60s
      AccuracySec=5s

      [Install]
      WantedBy=timers.target

  Enable both:

      sudo systemctl daemon-reload
      sudo systemctl enable --now jupyter-metrics.timer

- Verify the metrics are flowing

      cat /var/lib/prometheus/node-exporter/textfile_collector/jupyter.prom

  You should see one block per running user, e.g.:

      jupyter_user_running{user="alice"} 1
      jupyter_user_memory_bytes{user="alice"} 412938240
      jupyter_user_cpu_seconds_total{user="alice"} 8.124000
      jupyter_user_tasks{user="alice"} 14
      jupyter_user_disk_bytes{user="alice"} 35892014
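If you want to sanity-check the file without going through Prometheus, a few lines of Python can read it back. This is a minimal sketch that only understands the single-label lines the collector script writes, not the full Prometheus text format. The name parse_user_metrics is a hypothetical helper.

```python
import re

# Matches lines such as: jupyter_user_tasks{user="alice"} 14
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{user="(?P<user>[^"]*)"\}\s+(?P<value>\S+)$'
)


def parse_user_metrics(text: str) -> dict[str, dict[str, float]]:
    """Group sample values by user, skipping comments and unrecognised lines."""
    users: dict[str, dict[str, float]] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a single-label sample; ignored in this sketch
        users.setdefault(match["user"], {})[match["name"]] = float(match["value"])
    return users
```

Feeding it the contents of jupyter.prom gives one dictionary per user, which makes it easy to spot a user whose memory or disk figure is missing.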
Configure Grafana
- Bind Grafana to localhost

      sudo sed -i 's|^;http_addr =.*|http_addr = 127.0.0.1|' /etc/grafana/grafana.ini

  By default Grafana listens on 0.0.0.0:3000. Binding to 127.0.0.1 means you’ll only reach it via SSH tunnel, which is much safer than exposing yet another login screen on the public internet.

- Start Grafana

      sudo systemctl enable --now grafana-server

- Reach the UI

  From your laptop, SSH-tunnel port 3000:

      ssh -L 3000:127.0.0.1:3000 <user>@<your-tljh-host>

  Then open http://localhost:3000 in a browser. Default login is admin/admin. Grafana prompts you to change the password on first login.

- Add Prometheus as a data source

  UI → Connections → Data sources → Add data source → Prometheus.

  - URL: http://127.0.0.1:9090
  - Save & test.

- Import dashboards

  UI → Dashboards → New → Import. Useful starting points:

  - JupyterHub dashboards: clone the repo and import the JSON files in dashboards/
  - Node Exporter Full (#1860): paste 1860 into the import-by-ID box

  For your jupyter_user_* metrics, build a small custom dashboard with panels like:

  - topk(10, jupyter_user_memory_bytes) → top 10 users by RAM
  - rate(jupyter_user_cpu_seconds_total[5m]) → CPU per user over time
  - sum(jupyter_user_running) → live student count
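To read the rate(jupyter_user_cpu_seconds_total[5m]) panel, it helps to know roughly what the number means. The sketch below is a deliberately simplified model: it divides the counter increase by the elapsed time between the first and last sample. Prometheus’s real rate() also handles counter resets and extrapolates to the window edges, which this version does not. The function name and the sample values are made up.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: no counter-reset handling and no extrapolation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("samples must be in increasing time order")
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical counter readings taken one minute apart.
readings = [(0.0, 100.0), (60.0, 130.0), (120.0, 160.0)]
print(simple_rate(readings))  # CPU-seconds consumed per second of wall time
```

A result of 0.5 here means the user consumed 60 CPU-seconds over 120 seconds of wall time, in other words about half a core on average across the window.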
Verify the whole stack
Quick checklist after everything is up:

    systemctl is-active prometheus prometheus-node-exporter grafana-server jupyterhub jupyter-metrics.timer
    # expect: active for all five

    ss -tlnp | grep -E ':(3000|9090|9100|15001) '
    # expect: 127.0.0.1 on each; none should bind to 0.0.0.0

In the Grafana Explore view, query jupyter_user_running and you should see a series per active student.
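The listening-address check can also be scripted if you would rather not eyeball the output. This is a minimal sketch that parses text in the column layout ss -tln prints (state, queues, local address, peer address) and reports any monitored port bound to something other than loopback. The function name is hypothetical, and the parsing assumes the local address is always the fourth column.

```python
# Ports used by the stack on this page.
MONITORED_PORTS = {3000, 9090, 9100, 15001}


def non_loopback_listeners(ss_output: str, ports: set[int] = MONITORED_PORTS) -> list[str]:
    """Return local addresses on the given ports that are not bound to loopback."""
    offenders = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue  # header or non-listening socket
        local = fields[3]
        address, _, port = local.rpartition(":")
        if not port.isdigit() or int(port) not in ports:
            continue
        if address not in ("127.0.0.1", "[::1]"):
            offenders.append(local)
    return offenders
```

Pass it the captured output of ss -tln. An empty list means every monitored port is bound to loopback only.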
What to monitor next
This is the floor, not the ceiling. Once you’ve got the basics flowing:
- Wire up alerts (Grafana → Alerting) for low free RAM, sustained high CPU, or per-user memory creeping toward your limits.memory cap.
- Add the BetterStack integration to ship logs and a heartbeat off-host, so you find out the box died from somewhere other than the box itself.
- Add the auto-cleaner so that when a student does paste a 12 MB blob into a notebook cell, it gets neutralised before it takes the hub down.