Troubleshooting#
SLURM script#
The --ntasks value must be less than or equal to the --nodes value requested in the SBATCH directives.
# This will work
#SBATCH --nodes=4
#SBATCH --ntasks=4
Here is an example that will cause an error:
# This will fail
#SBATCH --nodes=4
#SBATCH --ntasks=10
Important
When creating a Ray Cluster, be aware of the physical resources available on each compute node. You can only allocate up to the physical resources available for each compute type on each node. You can configure the defaults by modifying the src/resources.json file.
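The exact schema of src/resources.json is project-specific; as a rough sketch (the key names below are hypothetical), the defaults can be read and checked against a request before a cluster is created:

```python
import json
from pathlib import Path

# Read the configured defaults. The key names used below are hypothetical;
# check src/resources.json in your checkout for the real schema.
defaults = json.loads(Path("src/resources.json").read_text())
max_cpus = defaults.get("cpu", {}).get("max_per_node", 0)

requested_cpus = 8
# Never request more than the physical resources available on a node.
if requested_cpus > max_cpus:
    raise ValueError(
        f"Requested {requested_cpus} CPUs but each node only provides {max_cpus}."
    )
```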
Multiple Ray Users#
If multiple users try to create Ray clusters on the same system, they will need to use different ports to avoid conflicts. This issue will need to be resolved before rolling out to multiple users.
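As a minimal sketch, each user can start their Ray head node on a distinct set of ports, for example by deriving an offset from their UID (the offset scheme below is only an illustration, not part of the project):

```python
import os
import subprocess

# Derive a per-user offset so two users on the same machine do not collide.
# (This offset scheme is only an illustration.)
offset = os.getuid() % 1000

# --port and --dashboard-port are standard `ray start` flags.
subprocess.run(
    [
        "ray", "start", "--head",
        f"--port={6379 + offset}",
        f"--dashboard-port={8265 + offset}",
    ],
    check=True,
)
print(f"Ray head listening on port {6379 + offset}")
```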
Setting TMPDIR#
Ray needs a TMPDIR directory to write its Ray session files to in order to connect to the Ray Cluster. The TMPDIR must meet two criteria:

- TMPDIR must be a writable location.
- TMPDIR must be mounted on the host system.

An issue with the Ray backend is that it uses TMPDIR to set up its socket connections. The problem is described below.
Error
The error below is raised when using the ray backend:
`OSError: AF_UNIX path length cannot exceed 107 bytes`
Ray needs to create a temporary folder. The path of this folder is composed of two parts:

- A part created by `tempfile.mkdtemp(prefix="outflow_ray_")`, for example:
  `/var/folders/qj/147cfvxs1xq6bp88xdcd4vt80000gq/T/outflow_ray_b2f2h7js`
- A part created with `strftime` in `ray/node.py`, for example:
  `session_2021-05-05_16-16-18_112187_70483`

Both parts are long, and there is a system limit of 107 bytes for socket names. See the issue where the problem is described.
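A quick length check illustrates the problem. The socket file name used below (sockets/plasma_store) is an assumed Ray session layout, included only to make the arithmetic concrete:

```python
# The two parts from the examples above, joined as Ray would nest them.
tmp_part = "/var/folders/qj/147cfvxs1xq6bp88xdcd4vt80000gq/T/outflow_ray_b2f2h7js"
session_part = "session_2021-05-05_16-16-18_112187_70483"
socket_file = "sockets/plasma_store"  # assumed session layout

socket_path = "/".join([tmp_part, session_part, socket_file])
print(len(socket_path))  # well over the 107-byte AF_UNIX limit
```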
A workaround is to set an environment variable `TMPDIR`, `TEMP`, or `TMP` to `/tmp/`, as stated here. `tempfile.mkdtemp` will then return a much shorter temporary file name.
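A minimal sketch of that workaround, assuming the variable is set before Python's tempfile module caches a temp directory:

```python
import os
import tempfile

# Point TMPDIR at /tmp/ before anything queries tempfile, so the temp
# directory used for Ray's session files stays short.
os.environ["TMPDIR"] = "/tmp/"
tempfile.tempdir = None  # drop any previously cached temp directory

print(tempfile.mkdtemp(prefix="outflow_ray_"))
# e.g. /tmp/outflow_ray_ab12cd34 -- short enough for the 107-byte socket limit
```

Equivalently, the variable can be exported in the shell or in the SBATCH script before the job starts.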