Build: #4 failed

Job: Pipeline PR Test 6.7.4 failed

Stages & jobs

  1. Default Stage

Code commits

Pipeline

  • Rui Xue <rx.astro@gmail.com>

    Rui Xue <rx.astro@gmail.com> b04760eb8e5f1f793c53c497837b2d699aa760b2

    PIPE-3073: Improve Dask cluster robustness for CASA C++ workloads (borrowed from `pclean` experiments)
    - Integrate Dask cluster robustness optimizations proven in `pclean`
      to accommodate ALMA pipeline's monolithic C++ bindings (casatools)
      which frequently hold the GIL and allocate memory outside Python.
    - Introduce `_patch_dask_tcp` monkey-patch to reject implausibly
      large TCP frames (>1 GiB), gracefully recycling stale SLURM sockets
      rather than crashing workers with `MemoryError`.
    - Override Dask's default memory management inside `start_daskcluster`
      (disabling the pause/spill/terminate heuristics and setting the
      LocalCluster `memory_limit` to 0, i.e. no limit) so that Dask does not
      starve or kill workers during intensive casatools executions whose
      memory it cannot track.
    - Substantially raise TCP/comm/heartbeat timeouts (`worker-ttl` to 20m)
      so that the scheduler does not falsely terminate workers during
      prolonged blocking C++ tasks.
    - Add an explanatory block in `pipeline/config.yaml` clarifying that these
      hardcoded stability overrides take precedence over, and therefore
      ignore, the corresponding user memory settings for Dask.
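
The overrides listed in this commit message can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names `robustness_overrides`, `frame_is_plausible`, and `apply_overrides` are hypothetical, and the two comm timeout values are assumed. Only the 20-minute `worker-ttl`, the 1 GiB frame ceiling, and the disabled pause/spill/terminate heuristics come from the commit message.

```python
GIB = 2**30
MAX_FRAME_BYTES = GIB  # per the commit: frames larger than 1 GiB are rejected


def robustness_overrides():
    """Return Dask config overrides resembling those the commit describes."""
    return {
        # False disables the corresponding worker memory heuristic.
        "distributed.worker.memory.spill": False,
        "distributed.worker.memory.pause": False,
        "distributed.worker.memory.terminate": False,
        # From the commit message: worker-ttl raised to 20 minutes.
        "distributed.scheduler.worker-ttl": "20 minutes",
        # Assumed values; the commit only says timeouts were increased.
        "distributed.comm.timeouts.connect": "10 minutes",
        "distributed.comm.timeouts.tcp": "20 minutes",
    }


def frame_is_plausible(nbytes):
    """Hypothetical guard: accept a declared frame size only up to 1 GiB."""
    return 0 <= nbytes <= MAX_FRAME_BYTES


def apply_overrides():
    """Apply the overrides if Dask is installed; return whether it was."""
    try:
        import dask
    except ImportError:
        return False
    dask.config.set(robustness_overrides())
    return True
```

A caller would run `apply_overrides()` before creating the cluster. A frame reader would check `frame_is_plausible(length)` before allocating a buffer, and close the socket on failure instead of attempting the allocation.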