r.param.scale: parallelize with OpenMP #7440
This reworks my previous disk-based parallel version so it follows the same structure r.neighbors uses. Instead of giving each thread a strip of the whole map, there's now an outer loop over bands of rows, and the work inside each band is split across the threads. Each thread keeps a small ring buffer of wsize rows and rotates it as it goes, reading one new row from disk per output row. So a thread's input memory depends only on the window height, not the size of the map. Memory is now limited by the usual memory= option, which sets how many rows are in a band. A single shared output buffer for the band replaces the per-row allocations and row pointer arrays that covered the whole map.

I also reworked find_obs() to precompute the window coordinates once and reuse w*z across the six weighted sums. This does less floating-point work per cell and the output is bit-identical to before.

A few fixes that came out of the review:
- When the window is taller than the region, the borders used to write more rows than the map has. The row counts are now clamped so exactly nrows rows get written. (This was a latent bug in the serial module too.)
- Brought back the nprocs < 1 error guard after Rast_disable_omp_on_mask, to match r.neighbors.
- Each thread's rings, window, and obs buffers are now allocated once before the band loop and indexed by thread id, instead of being reallocated on every band.

Output is exactly identical to the serial module at one thread and with a raster mask (which forces a single thread). At more than one thread there's a pre-existing difference of around 1 ULP from the OpenMP floating-point environment, tracked separately.
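The ring rotation described above can be sketched roughly as follows. This is a minimal illustration and not the code in process.c: the `RowRing` struct and `ring_rotate()` are hypothetical names, and the sketch assumes the ring is stored as an array of row pointers with the oldest row in slot 0.

```c
/* Hypothetical per-thread ring holding `wsize` row pointers.
 * rows[0] is the oldest (top) row of the moving window. */
typedef struct {
    double **rows;
    int wsize; /* window height in rows */
} RowRing;

/* Rotate the ring by one row. The storage of the oldest row moves to
 * the bottom slot and is returned, so the caller can read the next
 * input row into it. Only pointers move; no cell values are copied. */
static double *ring_rotate(RowRing *ring)
{
    double *recycled = ring->rows[0];

    for (int i = 0; i < ring->wsize - 1; i++)
        ring->rows[i] = ring->rows[i + 1];
    ring->rows[ring->wsize - 1] = recycled;

    return recycled;
}
```

With this layout each output row costs one disk read plus wsize pointer moves, which is why the per-thread input memory depends only on the window height.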
The LU solve happens once, before the parallel part starts. Running it on a single thread keeps the pivot choice the same every time, so the output now matches the serial version exactly at any thread count. Everything else still runs in parallel, so the speed doesn't change.
I created a benchmark similar to the benchmarks in the other tools:
[Figure: results for different input sizes]
[Figure: results for different window sizes]
[Figure: results for different memory settings]
I need to rerun it with some other settings (pin one thread per physical core) to see if the dip disappears.
gunittest test on a generated deterministic DEM. For each method, window size and the -c flag it compares nprocs=1 against nprocs=4 with a null-aware bit-exact check and against single-thread reference values. Covers the multi-band disk-strip path and the single-band path, asserts output is invariant to band count, and includes a masked case.
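To make "null-aware bit-exact" concrete, here is a small C sketch of the comparison idea. It is not the test code from the PR (the testsuite uses gunittest). The function names are hypothetical, and the sketch assumes null cells in floating-point rasters can be modelled as NaN.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when two cells agree: both are null (modelled here as NaN),
 * or both are non-null with identical bit patterns. */
static int cells_bit_equal(double a, double b)
{
    uint64_t ua, ub;

    if (isnan(a) || isnan(b))
        return isnan(a) && isnan(b);

    /* memcpy is the portable way to inspect the raw bits of a double */
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}

/* Count the cells that differ between two rows of ncols cells. */
static size_t count_diff(const double *x, const double *y, size_t ncols)
{
    size_t n = 0;

    for (size_t i = 0; i < ncols; i++)
        if (!cells_bit_equal(x[i], y[i]))
            n++;
    return n;
}
```

A bitwise comparison is stricter than `==`: it flags 0.0 against -0.0 and any 1 ULP drift, which is what a check for thread-count invariance needs.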
Thanks for showing me the benchmark graphs, they give me a sense of how the results are holding up overall.
Not sure if my benchmark run is working correctly. It's been stuck for the past 30 minutes on an i9 with 128GB RAM. Maybe it's just slow. I can see 8 processes in htop. Overall, it agrees with Anna's result above. But not this one?
Simplified G_malloc calls to the sizeof(*ptr) form, reordered in_buf_size so the multiplication happens in size_t, documented the MiB conversion, and replaced the per-thread per-row ring allocations with a single contiguous block indexed by per-thread pointer tables.
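The allocation changes listed above can be illustrated with a short sketch. It is an assumption-laden example, not the code in the PR: plain `malloc` stands in for `G_malloc` so it compiles without GRASS headers, all function names are hypothetical, and `nrows` is assumed to be at least 1.

```c
#include <stdlib.h>

/* Bytes for nrows x ncols doubles. Casting the first operand to size_t
 * makes the whole product evaluate in size_t instead of int. */
static size_t buf_bytes(int nrows, int ncols)
{
    return (size_t)nrows * ncols * sizeof(double);
}

/* Convert a memory limit in MiB to bytes (1 MiB = 1024 * 1024 bytes). */
static size_t mib_to_bytes(int mib)
{
    return (size_t)mib * 1024 * 1024;
}

/* One contiguous block of cells plus a table of row pointers into it,
 * sized with the sizeof(*ptr) form. Returns NULL on failure. */
static double **alloc_rows(int nrows, int ncols)
{
    double **rows = malloc((size_t)nrows * sizeof(*rows));
    double *block = malloc(buf_bytes(nrows, ncols));

    if (!rows || !block) {
        free(rows);
        free(block);
        return NULL;
    }
    for (int r = 0; r < nrows; r++)
        rows[r] = block + (size_t)r * ncols;
    return rows;
}

/* rows[0] points at the start of the contiguous block. */
static void free_rows(double **rows)
{
    if (rows) {
        free(rows[0]);
        free(rows);
    }
}
```

The `sizeof(*rows)` form keeps the element size tied to the pointer's type, so changing the cell type later cannot leave a stale `sizeof` behind.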
Hi @petrasovaa, I've addressed Huidae's review comments and pushed in 5661e31. The testsuite still passes 7/7 and I checked that the output still matches the output from main at nprocs=1. So you can rerun your benchmarks whenever you have the time. Thanks
I'm not entirely sure yet. My guess is it might be the P-core/E-core split on that chip, where pinning to all the cores traps some threads on the slower E-cores. macOS doesn't expose core pinning, so I'll try to reproduce it on my Linux VM and let you know what I find.
Pinning the cores made my benchmark worse, so I think there is something else going on. I will rerun the benchmark without it and with your changes included.
Hi Anna, I couldn't reproduce the pinning of the cores on my Mac. macOS doesn't seem to allow a user to pin threads to specific cores, so OMP_PROC_BIND and OMP_PLACES have no effect there and I wasn't able to run a pinned vs unpinned comparison on my own hardware. Do you or Huidae know whether r.neighbors also shows worse speedups under pinning, or does it hold its speedup ratios? Knowing this could help figure out whether this is just how core pinning behaves or whether it's a code issue.

What this PR does
This draft adds OpenMP parallelization to r.param.scale, similar to how r.neighbors was parallelized.
The serial version fits a quadratic surface in a moving window. It uses one sliding window buffer and shuffles it down a row at a time, which forces the rows to be processed in order. That ordering creates a sequential dependency which makes it hard to parallelize.
This PR replaces that method with the two-level band layout that r.neighbors uses. The region is split into horizontal bands whose size is controlled by the memory= option. Inside a band, the output rows are divided across threads, and each thread holds its own ring buffer with the rows its window needs: the current row and the halo rows above and below it. Memory use tracks the band size, not the whole map, so it stays within the band size limit as you add threads. When a mask is set, the module falls back to serial through Rast_disable_omp_on_mask, like the other parallel raster modules.
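The two levels of the layout can be sketched as two small helpers: one that turns a memory limit into a band height, and one that splits a band's output rows across threads. Both are hypothetical illustrations. The PR does not spell out its exact memory accounting, so the sketch assumes the budget only has to cover the band's output rows, that `ncols` is positive, and that the rows are split into contiguous chunks.

```c
#include <stddef.h>

/* Number of rows per band that fit in mem_mib MiB, assuming the budget
 * covers only the band's output rows. Clamped to [1, nrows]. */
static int rows_per_band(int mem_mib, int ncols, size_t cell_bytes, int nrows)
{
    size_t budget = (size_t)mem_mib * 1024 * 1024;
    size_t row_bytes = (size_t)ncols * cell_bytes;
    size_t fit = budget / row_bytes;

    if (fit < 1)
        fit = 1;
    if (fit > (size_t)nrows)
        fit = (size_t)nrows;
    return (int)fit;
}

/* Half-open range [*start, *end) of band rows handled by thread tid.
 * Rows are split into contiguous chunks; the first band_rows % nthreads
 * threads take one extra row each. */
static void thread_rows(int band_rows, int nthreads, int tid,
                        int *start, int *end)
{
    int base = band_rows / nthreads;
    int extra = band_rows % nthreads;

    *start = tid * base + (tid < extra ? tid : extra);
    *end = *start + base + (tid < extra ? 1 : 0);
}
```

For example, a band of 10 rows on 4 threads is split into the ranges [0, 3), [3, 6), [6, 8), and [8, 10). Each thread would then fill its ring with the rows around its own range before producing output.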
Changes
- Makefile: added OpenMP wiring (EXTRA_CFLAGS, EXTRA_LIBS, EXTRA_INC)
- main.c, param.h, interface.c: added nprocs parameter using G_OPT_M_NPROCS and memory using G_OPT_MEMORYMB
- process.c: rewrote to use multiple bands and ring buffers in each thread to handle parallelization, also handles halo rows.
- find_normal.c: It now precomputes the coordinate vectors and the w*z product once per call instead of inside the window loop
- open_files.c, close_down.c: removed the shared input descriptor and its cleanup, now that each thread opens its own
Correctness
The output matches the serial version at every thread count. I tested it across all 10 methods, window sizes 3, 5, 15, 31, and 51, the -c and default cases, FCELL and DCELL input, and with and without a mask, at 1, 2, 4, and 8 threads. Every run matched the serial reference with zero differing cells.
Benchmarks (grass.benchmark, 100M cell region)
1 thread: ~1.00x
2 threads: ~1.92x
4 threads: ~3.38x
8 threads: ~4.62x
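One way to read these numbers is as parallel efficiency, meaning the speedup divided by the thread count. The helper below is a hypothetical illustration of that arithmetic and is not part of the benchmark script.

```c
/* Parallel efficiency: the fraction of ideal linear speedup achieved. */
static double efficiency(double speedup, int nthreads)
{
    return speedup / nthreads;
}
```

Applied to the figures above, this gives roughly 0.96 at 2 threads, 0.85 at 4 threads, and 0.58 at 8 threads.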
Status
Draft PR for GSoC 2026 coding period task.