Build: #4 failed
Job: Pipeline PR Test 6.7.4 failed
Code commits
Pipeline
Rui Xue <rx.astro@gmail.com> b04760eb8e5f1f793c53c497837b2d699aa760b2
PIPE-3073: Improve Dask cluster robustness for CASA C++ workloads (borrowed from `pclean` experiments)
- Integrate Dask cluster robustness optimizations proven in `pclean`
to accommodate the ALMA pipeline's monolithic C++ bindings (casatools),
which frequently hold the GIL and allocate memory outside Python.
- Introduce `_patch_dask_tcp` monkey-patch to reject implausibly
large TCP frames (>1 GiB), gracefully recycling stale SLURM sockets
rather than crashing workers with `MemoryError`.
- Override Dask's default memory management inside `start_daskcluster`
(disabling the pause/spill/terminate heuristics and setting the
LocalCluster `memory_limit` to 0, i.e. no per-worker limit) so that
workers are not incorrectly starved during intensive, unmanaged
casatools executions.
- Raise TCP/comm/heartbeat timeouts well above their previous values
(`worker-ttl` to 20m) so the scheduler does not falsely terminate
workers during prolonged blocking C++ tasks.
- Add an explanatory block in `pipeline/config.yaml` clarifying that these
hardcoded stability overrides take precedence over the corresponding
user memory settings for Dask.
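
Two of the measures above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code in this commit: `read_frame_length`, `ImplausibleFrameError`, and `stability_overrides` are hypothetical names, the 8-byte little-endian length prefix is assumed for the example only, and the commit's actual `_patch_dask_tcp` may work differently. The 1 GiB threshold and the 20-minute `worker-ttl` are taken from the commit message; the config keys are standard Dask `distributed` settings.

```python
import struct

# Threshold from the commit message: frames above 1 GiB are treated as implausible.
MAX_FRAME_BYTES = 1 << 30


class ImplausibleFrameError(ConnectionError):
    """Hypothetical error type: the declared frame length is not believable."""


def read_frame_length(header: bytes, limit: int = MAX_FRAME_BYTES) -> int:
    """Decode an 8-byte unsigned length prefix and reject lengths over `limit`.

    The little-endian layout is an assumption of this sketch, not a
    statement about Dask's wire format.
    """
    if len(header) != 8:
        raise ImplausibleFrameError(f"expected 8 header bytes, got {len(header)}")
    (length,) = struct.unpack("<Q", header)
    if length > limit:
        # Failing here lets the caller drop and reopen the connection instead
        # of attempting a huge allocation that would end in MemoryError.
        raise ImplausibleFrameError(
            f"declared frame length {length} exceeds limit {limit}"
        )
    return length


def stability_overrides() -> dict:
    """Return Dask config entries that disable worker memory heuristics."""
    return {
        # "target" is included as an assumption; the commit message names
        # only the pause/spill/terminate heuristics.
        "distributed.worker.memory.target": False,
        "distributed.worker.memory.spill": False,
        "distributed.worker.memory.pause": False,
        "distributed.worker.memory.terminate": False,
        # Value from the commit message (`worker-ttl` to 20m).
        "distributed.scheduler.worker-ttl": "20 minutes",
    }
```

With Dask installed, the dictionary could be applied with `dask.config.set(stability_overrides())` before the cluster is created. Rejecting the frame at the length prefix keeps the failure cheap: the connection can be recycled without the worker ever trying to allocate the declared buffer.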