r.param.scale: parallelize with OpenMP by krcoder123 · Pull Request #7440 · OSGeo/grass

krcoder123 · 2026-05-29T05:54:42Z

What this PR does

This draft adds OpenMP parallelization to r.param.scale, similar to how r.neighbors was parallelized.

The serial version fits a quadratic surface in a moving window. It uses one sliding window buffer and shuffles it down a row at a time, which forces the rows to be processed in order. That ordering creates a sequential dependency which makes it hard to parallelize.

This PR replaces that method with the two-level band layout that r.neighbors uses. The region is split into horizontal bands where their size is controlled by the memory= option. Inside a band, the output rows are divided across threads, and each thread holds its own ring buffer with the rows its window needs. The rows it takes are the current row and the halo rows above and below. Memory tracks the band size, not the whole map, so it stays within the band size limit as you add threads. When a mask is set, the module falls back to serial through Rast_disable_omp_on_mask, like the other parallel raster modules.

Changes

Makefile: added OpenMP wiring (EXTRA_CFLAGS, EXTRA_LIBS, EXTRA_INC)
main.c, param.h, interface.c: added nprocs parameter using G_OPT_M_NPROCS and memory using G_OPT_MEMORYMB
process.c: rewrote to use multiple bands and ring buffers in each thread to handle parallelization, also handles halo rows.
find_normal.c: It now precomputes the coordinate vectors and the w*z product once per call instead of inside the window loop
open_files.c, close_down.c: removed the shared input descriptor and its cleanup, now that each thread opens its own
benchmark/: added benchmark_rparamscale.py and make_synth_dem.py (grass.benchmark script and a synthetic DEM generator)

Correctness

The output matches the serial version at every thread count. I tested it across all 10 methods, window sizes 3, 5, 15, 31, and 51, the -c and default cases, FCELL and DCELL input, and with and without a mask, at 1, 2, 4, and 8 threads. Every run matched the serial version reference with zero differing cells.
Benchmarks (grass.benchmark, 100M cell region)
1 thread ~1.00x
2 threads ~1.92x
4 threads: ~3.38x
8 threads: ~4.62x

Status

Draft PR for GSoC 2026 coding period task.

This reworks my previous disk based parallel version so it follows the same structure r.neighbors uses. Instead of giving each thread a strip with the whole map, there's now an outer loop over a band of rows, and the work inside each band is split across the threads. Each thread keeps a small ring buffer of wsize rows and rotates it as it goes, reading a new row from disk per output row. So, a thread's input memory only depends on the window height, not the size of the map. Memory is now limited by the usual memory= option, which sets how many rows are in a band. A single shared output buffer for the band replaces each row having an allocation and row pointer arrays for the whole map. I also reworked find_obs() to precompute the window coordinates once and reuse w*z across the six weighted sums. This does less floating-point work per cell and the output is bit-identical to before. A few fixes that came out of the review: - When the window is taller than the region, the borders used to write more rows than the map has. The row counts are now clamped so exactly nrows rows get written. (This was a latent bug in the serial module too.) - Also brought back the nprocs < 1 error guard after Rast_disable_omp_on_mask, to match r.neighbors. - The each thread’s rings, window, and obs buffers are now allocated once before the band loop and indexed by thread id, instead of being reallocated on every band. Output is exactly identical to the serial module at one thread and with a raster mask (which forces a single thread). At more than one thread there's a pre-existing around 1 ULP difference from the OpenMP floating-point environment, tracked separately.

…odeQL)

The LU solve happens once, before the parallel part starts. Running it on a single thread keeps the pivot choice the same every time, so the output now matches the serial version exactly at any thread count. Everything else still runs in parallel, so the speed doesn't change.

petrasovaa · 2026-06-10T13:56:03Z

I created a benchmark similar to the benchmarks in the other tools:
benchmark_rparamscale.py

Results for different size of input

Results for different window sizes

Results for different memory

I need to rerun it with some other settings (pin one thread per physical core) to see if the dip disappears.

gunittest test on a generated deterministic DEM. For each method, window size and the -c flag it compares nprocs=1 against nprocs=4 with a null-aware bit-exact check and against single-thread reference values. Covers the multi-band disk-strip path and the single-band path, asserts output is invariant to band count, and includes a masked case.

krcoder123 · 2026-06-11T06:24:36Z

I need to rerun it with some other settings (pin one thread per physical core) to see if the dip disappears.

Thanks for showing me the benchmark graphs, they give me a sense on how the results are holding up overall.

HuidaeCho · 2026-06-12T12:20:11Z

Not sure if my benchmark run is working correctly. It's been stuck for the past 30 minutes on i9 with 128GB RAM.

> python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=unset OMP_PLACES=unset

Maybe, it's just slow. I can see 8 processes in htop.

  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.969   |   1.00x
    2   |       104.693   |   1.94x
    4   |        59.901   |   3.39x
    8   |        44.078   |   4.60x

Overall, it agrees with Anna's result above.

> OMP_PROC_BIND=true OMP_PLACES=cores python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=true OMP_PLACES=cores

r.param.scale size=31  (synth_dem_100m, memory=300 MB, baseline = 1 thread)
  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.964   |   1.00x
    2   |       130.950   |   1.55x
    4   |       129.556   |   1.57x
    8   |       130.830   |   1.55x

But not this one?

Simplified G_malloc calls to the sizeof(*ptr) form, reorder in_buf_size so the multiplication happens in size_t, document the MiB conversion, and replaced the per-thread per-row ring allocations with a single contiguous block indexed by per-thread pointer tables.

krcoder123 · 2026-06-16T01:39:39Z

Hi @petrasovaa, I've addressed Huidae's review comments and pushed in 5661e31. The testsuite still passes 7/7 and I checked the output is still matching the output from main at nprocs=1. So you can rerun your benchmarks whenever you have the time. Thanks

krcoder123 · 2026-06-16T02:01:46Z

Not sure if my benchmark run is working correctly. It's been stuck for the past 30 minutes on i9 with 128GB RAM.

> python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=unset OMP_PLACES=unset

Maybe, it's just slow. I can see 8 processes in htop.

  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.969   |   1.00x
    2   |       104.693   |   1.94x
    4   |        59.901   |   3.39x
    8   |        44.078   |   4.60x

Overall, it agrees with Anna's result above.

> OMP_PROC_BIND=true OMP_PLACES=cores python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=true OMP_PLACES=cores

r.param.scale size=31  (synth_dem_100m, memory=300 MB, baseline = 1 thread)
  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.964   |   1.00x
    2   |       130.950   |   1.55x
    4   |       129.556   |   1.57x
    8   |       130.830   |   1.55x

But not this one?

I'm not entirely sure yet. My guess is it might be the P-core/E-core split on that chip, where pinning to all the cores traps some threads on the slower E-cores. macOS doesn't expose core pinning, so I'll try to reproduce it on my Linux VM and let you know what I find.

…allelization

petrasovaa · 2026-06-18T13:25:04Z

The pinning of the cores made my benchmark worse, so I think there is something else going on. I will rerun the benchmark again without it and including your changes.

krcoder123 · 2026-06-19T07:02:51Z

The pinning of the cores made my benchmark worse, so I think there is something else going on. I will rerun the benchmark again without it and including your changes.

Hi Anna,

I couldn't reproduce the pinning of the cores on my Mac. MacOS doesn't seem to allow a user to pin threads to specific cores, so OMP_PROC_BIND and OMP_PLACES have no effect there and I wasn't able to run a pinned vs unpinned comparison on my own hardware.

Do you or Huidae know whether r.neighbors also shows worse speedups under pinning, or does it hold its speedup ratios? Knowing this could help figure out if this is just how pinning cores behave or if it's a code issue.
Thanks!

HuidaeCho · 2026-06-19T15:52:24Z

These plots show the thread-averaged time and speedup without pinning:

WIP r.param.scale per-thread file descriptor parallelization

f033870

github-actions Bot added raster Related to raster data processing C Related code is in C module labels May 29, 2026

krcoder123 changed the title ~~GSoC 2026: r.param.scale parallelization with thread file descriptors (disk based)~~ r.param.scale: parallelize using one file descriptor per thread (disk based) May 29, 2026

krcoder123 added 2 commits June 5, 2026 10:35

r.param.scale: add benchmark and synthetic DEM generator scripts

7b98052

github-actions Bot added the Python Related code is in Python label Jun 5, 2026

github-advanced-security AI found potential problems Jun 5, 2026

View reviewed changes

Comment thread raster/r.param.scale/benchmark/benchmark_rparamscale.py Fixed

Comment thread raster/r.param.scale/benchmark/benchmark_rparamscale.py Fixed

github-advanced-security AI found potential problems Jun 5, 2026

View reviewed changes

Comment thread raster/r.param.scale/process.c Fixed

Comment thread raster/r.param.scale/process.c Fixed

krcoder123 and others added 3 commits June 7, 2026 14:04

r.param.scale: use DSQR to avoid float overflow in allocation size (C…

add24cb

…odeQL)

Merge branch 'main' into gsoc-rparamscale-disk-strip

402f3eb

krcoder123 changed the title ~~r.param.scale: parallelize using one file descriptor per thread (disk based)~~ r.param.scale: parallelize with OpenMP and Two Level Band Layout Jun 11, 2026

krcoder123 changed the title ~~r.param.scale: parallelize with OpenMP and Two Level Band Layout~~ r.param.scale: parallelize with OpenMP Jun 11, 2026

github-actions Bot added the tests Related to Test Suite label Jun 11, 2026

krcoder123 mentioned this pull request Jun 11, 2026

r.param.scale: add reference-value pytest #7534

Merged

krcoder123 added 2 commits June 11, 2026 14:27

r.param.scale: enable OpenMP in the CMake build

6f6db7f

r.param.scale: use full executable paths in benchmark script

8f025aa

github-actions Bot added the CMake label Jun 11, 2026

echoix marked this pull request as ready for review June 12, 2026 01:19

echoix marked this pull request as draft June 12, 2026 01:20

HuidaeCho reviewed Jun 12, 2026

View reviewed changes

r.param.scale: rework benchmark to nprocs convention and document par…

f9a27d8

…allelization

github-actions Bot added docs markdown Related to markdown, markdown files labels Jun 17, 2026

petrasovaa mentioned this pull request Jun 18, 2026

Gsoc parallel rparamscale #7236

Closed

Uh oh!

Conversation

krcoder123 commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Changes

Correctness

Status

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrasovaa commented Jun 10, 2026

Results for different size of input

Results for different window sizes

Results for different memory

Uh oh!

krcoder123 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HuidaeCho commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

krcoder123 commented Jun 16, 2026

Uh oh!

krcoder123 commented Jun 16, 2026

Uh oh!

petrasovaa commented Jun 18, 2026

Uh oh!

krcoder123 commented Jun 19, 2026

Uh oh!

HuidaeCho commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

krcoder123 commented May 29, 2026 •

edited

Loading

krcoder123 commented Jun 11, 2026 •

edited

Loading

HuidaeCho commented Jun 12, 2026 •

edited

Loading