Build #5 was successful. Manual run by CASA adm account.

Build result summary

Details

Completed
Queue duration: 1 second
Duration: 42 minutes
Labels: None
Revisions:
CASA6: 63a325ea22ef235dbdf4cc525f9b6cdf1e3203c3
OPEN-CASA-PKG: e5285e68467f01199807a1f1a978aec8a4781be7
Total tests: 4310
Successful since
#1

Code commits

CASA6
Author Commit Message Commit date
Rui Xue <rx.astro@gmail.com> 63a325ea22ef235dbdf4cc525f9b6cdf1e3203c3 CAS-14756: Fix AutoLocking and `FilebufIO::readBlock` race condition in MPI parallel I/O
This commit addresses potential I/O serialization and metadata corruption
issues encountered when multiple MPI workers concurrently write to shared
image cubes during the major cycle.

Two primary fixes are introduced:
1. Replaced `AutoNoReadLocking` with `AutoLocking`: When `inspectInterval=1.0`
   was used with `AutoNoReadLocking`, workers were permitted to read `table.f0`
   metadata without an explicitly synchronized read lock. This caused race
   conditions where one worker's metadata rewrite over `table.f0` (a non-atomic
   write >4KB) would crash another worker attempting to open the table,
   triggering a `FilebufIO::readBlock` exception. `AutoLocking` ensures the
   opening phase explicitly acquires the read lock, guaranteeing a coherent state.

2. Release file locks during exception sleeps: When a `FilebufIO::readBlock`
   exception *is* caught (triggering the 50ms retry sleep), `im` is now
   immediately reset to `nullptr`. This destroys the `PagedImage` object and
   releases the process's write lock *before* the sleep begins.
   Previously, the write lock was retained during the 50ms delay, which
   needlessly serialized all other parallel workers awaiting their flush cycle.
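
The second fix can be sketched independently of casacore as a retry loop that drops the lock before it sleeps. This is a minimal Python illustration, not the CASA implementation: `TransientReadError`, `write_with_retry`, and its parameters are hypothetical names, and a `threading.Lock` stands in for the table write lock. Only the 50ms delay is taken from the commit message.

```python
import threading
import time


class TransientReadError(Exception):
    """Hypothetical stand-in for the FilebufIO::readBlock exception."""


def write_with_retry(lock, attempt_write, max_attempts=5, delay_s=0.05,
                     sleep=time.sleep):
    """Call attempt_write(n) under the lock, retrying on transient errors.

    The lock is always released before sleeping, so other workers can
    proceed during the retry delay instead of queueing behind a sleeper.
    """
    for attempt in range(1, max_attempts + 1):
        lock.acquire()
        try:
            return attempt_write(attempt)
        except TransientReadError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
        finally:
            lock.release()  # runs on success, on retry, and on failure
        sleep(delay_s)  # the lock is NOT held here
```

The `sleep` parameter exists only so that a caller can substitute a fake sleep and observe that `lock.locked()` is false while the worker waits.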

This commit further reduces the risk of random filesystem collisions on
table re-open while enabling maximal tile streaming parallelism across workers.
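
The first fix above (taking a read lock before opening the table) can be illustrated with a toy model of a torn metadata read. This is a sketch under stated assumptions, not casacore's locking code: the `Metadata` class and its methods are hypothetical, the two-element list stands in for a multi-block `table.f0` rewrite, and a generator is used to pause the writer mid-rewrite deterministically.

```python
import threading


class Metadata:
    """Toy metadata file made of two blocks that must always agree."""

    def __init__(self):
        self.lock = threading.Lock()
        self.blocks = ["v1", "v1"]

    def rewrite_steps(self, version):
        """Non-atomic rewrite: pauses (yields) between the two blocks."""
        with self.lock:
            self.blocks[0] = version
            yield  # the writer is interrupted here, still holding the lock
            self.blocks[1] = version

    def read_unlocked(self):
        """Read without any lock: may observe a half-written state."""
        return tuple(self.blocks)

    def try_read_locked(self):
        """Read only if the lock is free; return None while a writer holds it."""
        if not self.lock.acquire(blocking=False):
            return None
        try:
            return tuple(self.blocks)
        finally:
            self.lock.release()
```

Advancing `rewrite_steps("v2")` by one step with `next()` leaves the writer paused mid-rewrite: `read_unlocked()` then sees mismatched blocks, while `try_read_locked()` declines to read until the writer has finished and released the lock.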
Rui Xue <rx.astro@gmail.com> 4b3bfca414918261df2401ecd4274572d83dbf8f CAS-14756: use AutoNoReadLocking(inspectInterval=1s) in writeBackToFullImage
NoLocking was allowing concurrent table.dat_tmp→table.dat renames to race
across MPI workers, causing "RegularFile::move error: No such file or
directory" in rank N's PagedImage destructor (job on cvpost124, build-5).

casacore's table.dat metadata file is always shared — even when tile data
blocks are strictly disjoint. NoLocking is therefore not safe for concurrent
writers.
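
The rename race described above can be reproduced in miniature with plain files. The sketch below is an illustration only and does not call casacore: it assumes two workers share one temporary file name, replays their steps in the unlucky order sequentially, and uses the hypothetical function name `demo_rename_race`.

```python
import os
import tempfile


def demo_rename_race():
    """Return True if the second rename fails because the temp file is gone."""
    with tempfile.TemporaryDirectory() as workdir:
        tmp = os.path.join(workdir, "table.dat_tmp")
        dst = os.path.join(workdir, "table.dat")

        # Both workers write the same temporary file before either renames it.
        for worker in ("rank0", "rank1"):
            with open(tmp, "w") as handle:
                handle.write(f"metadata from {worker}\n")

        os.replace(tmp, dst)  # rank0 moves the temp file into place

        try:
            os.replace(tmp, dst)  # rank1 finds the temp file already moved
        except FileNotFoundError:
            return True
        return False
```

With no lock around the write-then-rename sequence, whichever worker renames second hits a "No such file or directory" error, which matches the failure mode quoted in the commit message.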

AutoNoReadLocking with the default inspectInterval=0 is safe (the write lock
is held from first putSlice to flush), but autoRelease() becomes a no-op and
the entire writeback is serialized across workers — identical throughput to
UserLocking.

Setting inspectInterval=1.0 activates periodic lock inspection: every ~1 s
autoRelease() checks whether another worker wants the lock; if so it flushes
table.dat and releases, then the waiting worker acquires. Tile data I/O
interleaves across workers (disjoint blocks → no contention); only the brief
table.dat rename is serialized per transfer.
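
The effect of `inspectInterval` on lock handoff, as this commit message describes it, can be sketched with a small deterministic simulation. It is a toy model rather than casacore's algorithm: the function `simulate`, the two workers `A` and `B`, and the fixed per-tile time are all assumptions, and a single exclusive lock is modelled, so the overlap of disjoint tile I/O is not represented.

```python
def simulate(tiles_per_worker, tile_time, inspect_interval):
    """Return the order in which two workers write tiles under one lock.

    inspect_interval == 0 means the holder never checks for waiters and
    keeps the lock until it has written all of its tiles.
    """
    remaining = {"A": tiles_per_worker, "B": tiles_per_worker}
    holder, waiter = "A", "B"
    now = 0.0
    last_inspect = 0.0
    order = []
    while remaining[holder] > 0:
        order.append(holder)
        remaining[holder] -= 1
        now += tile_time
        # Periodic inspection: is it time to check whether someone is waiting?
        due = inspect_interval > 0 and now - last_inspect >= inspect_interval
        finished = remaining[holder] == 0
        if due:
            last_inspect = now
        if (due or finished) and remaining[waiter] > 0:
            holder, waiter = waiter, holder  # hand the lock over
            last_inspect = now
    return "".join(order)
```

With four tiles per worker and 0.5 s per tile, `simulate(4, 0.5, 0.0)` yields `AAAABBBB` (fully serialized), while `simulate(4, 0.5, 1.0)` yields `AABBAABB`, with the lock changing hands roughly once per second.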

Jira issues

Issue Description Status
Unknown Issue Type CAS-14756 Could not obtain issue details from Jira

Shared artifacts

Artifact File size
ML228 Python 3.12 Tar distribution 802 MB
MACOS15-DMG 1 GB