SLURM slurmd service fails to start on Raspberry Pi 5 cluster due to cgroup.conf parsing errors

I have a Raspberry Pi 5 cluster set up with a master node and a worker node. I successfully installed SLURM on the master node, and I'm currently trying to configure the slurmd daemon to run on the worker node.

Issue

After configuring SLURM, I enabled and started the slurmd service on the master node with the following commands:

sudo systemctl enable slurmd
sudo systemctl start slurmd
sudo systemctl status slurmd

However, the slurmd service fails to start with the following error message:

× slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Sat 2024-10-26 23:03:46 CEST; 24min ago
   Duration: 5ms
       Docs: man:slurmd(8)
    Process: 2026 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 2026 (code=exited, status=1/FAILURE)
        CPU: 5ms

Oct 26 23:03:46 master systemd[1]: Started slurmd.service - Slurm node daemon.
Oct 26 23:03:46 master slurmd[2026]: slurmd: error: _parse_next_key: Parsing error at unrecognized key: TaskA>
Oct 26 23:03:46 master slurmd[2026]: slurmd: fatal: Could not open/read/parse cgroup.conf file /etc/slurm/cgr>
Oct 26 23:03:46 master slurmd[2026]: error: _parse_next_key: Parsing error at unrecognized key: TaskAffinity
Oct 26 23:03:46 master slurmd[2026]: fatal: Could not open/read/parse cgroup.conf file /etc/slurm/cgroup.conf
Oct 26 23:03:46 master systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Oct 26 23:03:46 master systemd[1]: slurmd.service: Failed with result 'exit-code'.

My current cgroup.conf is as follows:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

Questions

1. How do I correct the errors in cgroup.conf that lead to the parsing errors mentioned in the logs?
2. Are there specific configurations required for SLURM to work correctly with the Raspberry Pi 5 and its architecture?
3. What are the common causes for the high latency error reported in SLURM, and how can I address them?

Any guidance or suggestions would be greatly appreciated!

What I have tried

Verified that Munge is functioning correctly:

ssh pi@node01 munge -n

Checked the status of the slurmctld service on the master node, which is also reported as down.
Investigated the cgroup.conf for parsing errors.

It sounds like you’re facing issues with the configuration and compatibility of SLURM on your Raspberry Pi 5 cluster, particularly with the slurmd service. The parsing errors indicate that SLURM is having trouble with some keys in your cgroup.conf file, and you’re also concerned about performance tuning for your cluster setup on ARM architecture. Let’s address each of your questions:

1. Correcting Parsing Errors in cgroup.conf

The error logs suggest that SLURM is failing to parse certain keys in cgroup.conf, particularly around TaskAffinity. In newer versions of SLURM, some of the cgroup configuration options have changed. Here are steps to troubleshoot this:

  • Remove Deprecated Keys: The log flags TaskAffinity as an unrecognized key; this option is no longer recognized in newer SLURM releases, where CPU binding is handled via TaskPlugin=task/affinity in slurm.conf instead. Comment out or remove the TaskAffinity line and test again, and treat any further keys the parser rejects the same way; a short command sketch follows this list.
  • Minimal cgroup.conf Test: Start with a minimal cgroup.conf to see if the basic SLURM functionality is working, then incrementally add configurations. Here’s a basic configuration:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
  • Check SLURM Documentation: Verify that each configuration key is compatible with your installed SLURM version. You can consult the SLURM man pages (man cgroup.conf and man slurm.conf) for specific options or refer to the online documentation for the correct syntax.
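
As a concrete starting point, here is a sketch of the version check and the TaskAffinity edit (the /etc/slurm paths come from your log output; adjust if your install differs):

# Check the installed SLURM version so you can match it against the right man pages
slurmd -V

# Back up cgroup.conf, then comment out the key the parser rejects
sudo cp /etc/slurm/cgroup.conf /etc/slurm/cgroup.conf.bak
sudo sed -i 's/^TaskAffinity/#TaskAffinity/' /etc/slurm/cgroup.conf

# Run slurmd in the foreground with verbose output to confirm the file now parses
sudo slurmd -D -vv

If the parser then complains about a different key, repeat the comment-out-and-retest cycle for that key before moving on.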

2. Configurations for SLURM on Raspberry Pi 5

Raspberry Pi’s ARM-based architecture may require specific tweaks for SLURM’s performance and compatibility:

  • Architecture-Specific Options: SLURM runs the same core code on ARM as on x86_64, but CPU affinity and resource allocation still deserve attention on the Pi 5's four Cortex-A76 cores. Ensure that TaskPlugin=task/affinity is set in slurm.conf so that tasks are pinned to specific cores rather than migrating freely.
  • Cgroup Controller for ARM: If you're using cgroup v2, make sure your Raspberry Pi environment actually exposes it. Raspberry Pi OS releases may differ in how cgroups are set up, so confirm which hierarchy is mounted before tuning cgroup.conf (a quick check is sketched after this list); falling back to cgroup v1 is possible, but only worth trying if v2 issues persist.
  • Munge Permissions: Ensure that Munge is properly configured and that the same munge.key is shared by the master and worker nodes, as SLURM relies on it for authentication. Generating a credential on one node and decoding it on the other (also sketched below) is a stronger test than running munge -n on each node in isolation.
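
A rough sketch of both checks, assuming Raspberry Pi OS with GNU coreutils and that node01 is reachable as in your ssh test:

# Report the filesystem type mounted at /sys/fs/cgroup:
# "cgroup2fs" means cgroup v2 (unified hierarchy), "tmpfs" indicates cgroup v1
stat -fc %T /sys/fs/cgroup

# Create a credential on this node and decode it on the worker;
# "STATUS: Success" from unmunge confirms the shared munge.key works across nodes
munge -n | ssh pi@node01 unmunge

If unmunge fails, re-copy /etc/munge/munge.key from the master to the worker, fix its ownership and permissions (munge:munge, mode 0400), and restart the munge service on both nodes.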

3. Troubleshooting High Latency in SLURM

High latency in SLURM is often caused by network latency or resource bottlenecks on worker nodes. Here are some common factors and potential solutions:

  • Network Latency: Since you have a master-worker setup, network communication speed and reliability play a crucial role. Check that the network interfaces on your Pi nodes are stable and have low latency. Consider using a dedicated switch for the cluster to reduce interference.
  • Resource Constraints: Raspberry Pi 5’s resources may limit SLURM’s responsiveness. Ensure adequate RAM and CPU resources on each node, and try setting ConstrainRAMSpace=no in cgroup.conf to avoid restricting memory usage unnecessarily.
  • CPU Affinity: Task placement on ARM cores can benefit from optimized CPU affinity settings. Setting TaskPlugin=task/affinity in slurm.conf can reduce unnecessary task migrations between cores, improving latency.
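
For reference, a minimal sketch of the affinity-related slurm.conf lines might look like the following; this is illustrative rather than your full config, and the supported plugin combinations depend on your SLURM version:

# slurm.conf excerpt (illustrative): bind tasks to cores and enforce limits via cgroups
TaskPlugin=task/affinity,task/cgroup

# Track job processes through cgroups so stray processes are cleaned up reliably
ProctrackType=proctrack/cgroup

Combining task/affinity with task/cgroup is a common pairing: the affinity plugin handles CPU binding while the cgroup plugin enforces the memory constraints you set in cgroup.conf.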

Next Steps

  1. Reconfigure and Restart SLURM: After making any changes to cgroup.conf, restart SLURM with:

sudo systemctl restart slurmd

Then check the service status for further logs.

  2. Check SLURM and System Logs: The SLURM log (/var/log/slurm/slurmd.log, or wherever SlurmdLogFile in slurm.conf points) and the system journal (journalctl -u slurmd) can give more insight into configuration or compatibility issues; example commands follow this list.

  3. Run slurmd in Debug Mode: Run slurmd in the foreground with verbose debug output to get more granular information:

sudo slurmd -D -vvvv

This will provide additional context for parsing and configuration issues.
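
For the log checks in step 2, a quick sketch (assuming the Debian-style default log path; substitute whatever SlurmdLogFile is set to in your slurm.conf):

# Recent slurmd messages from the systemd journal for the current boot
journalctl -u slurmd -b --no-pager | tail -n 50

# Tail the slurmd log file directly, if SlurmdLogFile points there
sudo tail -n 50 /var/log/slurm/slurmd.log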

By addressing these potential causes and testing in steps, you should be able to identify the root issue and improve SLURM’s performance on your Raspberry Pi cluster.