Troubleshooting#

SLURM script#

The `--ntasks` value must be less than or equal to the `--nodes` value requested in the SBATCH script.

```bash
# This will work
#SBATCH --nodes=4
#SBATCH --ntasks=4
```

Here is an example that will cause an error:

```bash
# This will fail
#SBATCH --nodes=4
#SBATCH --ntasks=10
```

Important

When creating a Ray cluster, be aware of the physical resources available on each compute node. You can allocate at most the physical resources present on each compute type and node. Defaults can be configured by modifying the `src/resources.json` file.

Multiple Ray Users#

If multiple users try to create Ray clusters on the same system, each cluster must be configured with different ports to avoid conflicts. This issue will need to be resolved before rolling out to multiple users.
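Since the project does not prescribe a port-assignment scheme, one possible workaround is to derive a deterministic per-user port from the username. This is only a sketch: the base port (Ray's default head-node port, 6379), the offset range, and the `user_port` helper are illustrative assumptions, not part of the project.

```python
import hashlib
import os

# Illustrative values, not project defaults: Ray's default head-node
# port is 6379; spread users across the following 1000 ports.
BASE_PORT = 6379
OFFSET_RANGE = 1000

def user_port(user=None, base=BASE_PORT, spread=OFFSET_RANGE):
    """Derive a deterministic, per-user port from the username."""
    user = user or os.environ.get("USER", "ray-user")
    digest = hashlib.sha256(user.encode()).hexdigest()
    return base + int(digest, 16) % spread

# The resulting port can then be passed to `ray start --port=<PORT>`
# on the head node and used by workers when connecting.
print(user_port())
```

Because the port is derived from the username rather than chosen at random, the same user always gets the same port, which keeps restarts predictable while still separating users.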

Setting TMPDIR#

Ray needs a TMPDIR directory in which to write the session files it uses to connect to the Ray cluster. The TMPDIR must meet two criteria:

  1. TMPDIR must be a writable location.

  2. TMPDIR must be mounted on the host system.
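The first criterion can be sanity-checked programmatically; the second (host mounting) depends on the cluster configuration and cannot be verified portably. A minimal sketch of the writability check:

```python
import os
import tempfile

# Resolve the temp directory the same way tempfile does: TMPDIR if set,
# otherwise the platform default.
tmpdir = os.environ.get("TMPDIR", tempfile.gettempdir())

# Criterion 1: the directory must be writable. Criterion 2 (mounted on
# the host system) must be checked against the cluster documentation.
assert os.access(tmpdir, os.W_OK), f"{tmpdir} is not writable"
print(tmpdir)
```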

A known issue with the Ray backend is that it uses TMPDIR to set up its socket connections. The problem is described below.

Error

The error below is raised when using the `ray` backend:

`OSError: AF_UNIX path length cannot exceed 107 bytes`

Ray needs to create a temporary folder. The path of this folder is composed of two parts:

  • A part created by `tempfile.mkdtemp(prefix="outflow_ray_")`, for example:

    `/var/folders/qj/147cfvxs1xq6bp88xdcd4vt80000gq/T/outflow_ray_b2f2h7js`

  • A part created with `strftime` in `ray/node.py`, for example:

    `session_2021-05-05_16-16-18_112187_70483`

Both parts are long, and the system limits socket names to 107 bytes, so the combined path can exceed that limit.

A workaround is to set one of the environment variables TMPDIR, TEMP, or TMP to `/tmp/`; `tempfile.mkdtemp` will then return a much shorter temporary path.
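The workaround can be sketched as follows; the `outflow_ray_` prefix matches the example above, and the headroom check is an illustrative addition:

```python
import os
import tempfile

# Point TMPDIR at /tmp/ so tempfile.mkdtemp produces short paths that
# stay well under the 107-byte AF_UNIX socket-name limit.
os.environ["TMPDIR"] = "/tmp/"
tempfile.tempdir = None  # force tempfile to re-read the environment

path = tempfile.mkdtemp(prefix="outflow_ray_")
print(path)

# The directory part is now short, leaving headroom for the long
# session name that Ray appends.
assert len(path.encode()) < 107

os.rmdir(path)  # clean up the example directory
```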