Skip to content

r.param.scale: parallelize with OpenMP#7440

Draft
krcoder123 wants to merge 11 commits into
OSGeo:mainfrom
krcoder123:gsoc-rparamscale-disk-strip
Draft

r.param.scale: parallelize with OpenMP#7440
krcoder123 wants to merge 11 commits into
OSGeo:mainfrom
krcoder123:gsoc-rparamscale-disk-strip

Conversation

@krcoder123

@krcoder123 krcoder123 commented May 29, 2026

Copy link
Copy Markdown
Contributor

What this PR does

This draft adds OpenMP parallelization to r.param.scale, similar to how r.neighbors was parallelized.

The serial version fits a quadratic surface in a moving window. It uses one sliding window buffer and shuffles it down a row at a time, which forces the rows to be processed in order. That ordering creates a sequential dependency which makes it hard to parallelize.

This PR replaces that method with the two-level band layout that r.neighbors uses. The region is split into horizontal bands where their size is controlled by the memory= option. Inside a band, the output rows are divided across threads, and each thread holds its own ring buffer with the rows its window needs. The rows it takes are the current row and the halo rows above and below. Memory tracks the band size, not the whole map, so it stays within the band size limit as you add threads. When a mask is set, the module falls back to serial through Rast_disable_omp_on_mask, like the other parallel raster modules.

Changes

  • Makefile: added OpenMP wiring (EXTRA_CFLAGS, EXTRA_LIBS, EXTRA_INC)
  • main.c, param.h, interface.c: added nprocs parameter using G_OPT_M_NPROCS and memory using G_OPT_MEMORYMB
  • process.c: rewrote to use multiple bands and ring buffers in each thread to handle parallelization, also handles halo rows.
  • find_normal.c: It now precomputes the coordinate vectors and the w*z product once per call instead of inside the window loop
  • open_files.c, close_down.c: removed the shared input descriptor and its cleanup, now that each thread opens its own
  • benchmark/: added benchmark_rparamscale.py and make_synth_dem.py (grass.benchmark script and a synthetic DEM generator)

Correctness

The output matches the serial version at every thread count. I tested it across all 10 methods, window sizes 3, 5, 15, 31, and 51, the -c and default cases, FCELL and DCELL input, and with and without a mask, at 1, 2, 4, and 8 threads. Every run matched the serial version reference with zero differing cells.
Benchmarks (grass.benchmark, 100M cell region)
1 thread ~1.00x
2 threads ~1.92x
4 threads: ~3.38x
8 threads: ~4.62x

Status

Draft PR for GSoC 2026 coding period task.

@github-actions github-actions Bot added raster Related to raster data processing C Related code is in C module labels May 29, 2026
@krcoder123 krcoder123 changed the title GSoC 2026: r.param.scale parallelization with thread file descriptors (disk based) r.param.scale: parallelize using one file descriptor per thread (disk based) May 29, 2026
This reworks my previous disk based parallel version so it follows the same structure
r.neighbors uses.

Instead of giving each thread a strip with the whole map, there's now an outer loop over a band of rows, and the work inside each band is split across the threads. Each thread keeps a small ring buffer of wsize rows and rotates it as it goes, reading a new row from disk per output row. So, a thread's input memory only depends on the window height, not the size of the map.

Memory is now limited by the usual memory= option, which sets how many rows are
in a band. A single shared output buffer for the band replaces each row having an allocation and row pointer arrays for the whole map. 

I also reworked find_obs() to precompute the window coordinates once and reuse
w*z across the six weighted sums. This does less floating-point work per cell and
the output is bit-identical to before.

A few fixes that came out of the review:
- When the window is taller than the region, the borders used to write more rows
  than the map has. The row counts are now clamped so exactly nrows rows get
  written. (This was a latent bug in the serial module too.)
- Also brought back the nprocs < 1 error guard after Rast_disable_omp_on_mask, to
  match r.neighbors.
- The each thread’s rings, window, and obs buffers are now allocated once before the
  band loop and indexed by thread id, instead of being reallocated on every band.

Output is exactly identical to the serial module at one thread and with a raster mask
(which forces a single thread). At more than one thread there's a pre-existing around 1
ULP difference from the OpenMP floating-point environment, tracked separately.
@github-actions github-actions Bot added the Python Related code is in Python label Jun 5, 2026
Comment thread raster/r.param.scale/benchmark/benchmark_rparamscale.py Fixed
Comment thread raster/r.param.scale/benchmark/benchmark_rparamscale.py Fixed
Comment thread raster/r.param.scale/process.c Fixed
Comment thread raster/r.param.scale/process.c Fixed
krcoder123 and others added 3 commits June 7, 2026 14:04
The LU solve happens once, before the parallel part starts. Running it
on a single thread keeps the pivot choice the same every time, so the
output now matches the serial version exactly at any thread count.
Everything else still runs in parallel, so the speed doesn't change.
@petrasovaa

Copy link
Copy Markdown
Contributor

I created a benchmark similar to the benchmarks in the other tools:
benchmark_rparamscale.py

Results for different size of input

rps_speedup rps_efficiency

Results for different window sizes

rps_window_speedup rps_window_efficiency

Results for different memory

rps_memory_speedup rps_memory_efficiency

I need to rerun it with some other settings (pin one thread per physical core) to see if the dip disappears.

@krcoder123 krcoder123 changed the title r.param.scale: parallelize using one file descriptor per thread (disk based) r.param.scale: parallelize with OpenMP and Two Level Band Layout Jun 11, 2026
@krcoder123 krcoder123 changed the title r.param.scale: parallelize with OpenMP and Two Level Band Layout r.param.scale: parallelize with OpenMP Jun 11, 2026
gunittest test on a generated deterministic DEM. For each method, window
size and the -c flag it compares nprocs=1 against nprocs=4 with a
null-aware bit-exact check and against single-thread reference values.
Covers the multi-band disk-strip path and the single-band path, asserts
output is invariant to band count, and includes a masked case.
@github-actions github-actions Bot added the tests Related to Test Suite label Jun 11, 2026
@krcoder123

krcoder123 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

I need to rerun it with some other settings (pin one thread per physical core) to see if the dip disappears.

Thanks for showing me the benchmark graphs, they give me a sense on how the results are holding up overall.

@github-actions github-actions Bot added the CMake label Jun 11, 2026
@echoix echoix marked this pull request as ready for review June 12, 2026 01:19
@echoix echoix marked this pull request as draft June 12, 2026 01:20
@HuidaeCho

HuidaeCho commented Jun 12, 2026

Copy link
Copy Markdown
Member

Not sure if my benchmark run is working correctly. It's been stuck for the past 30 minutes on i9 with 128GB RAM.

> python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=unset OMP_PLACES=unset

Maybe, it's just slow. I can see 8 processes in htop.

  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.969   |   1.00x
    2   |       104.693   |   1.94x
    4   |        59.901   |   3.39x
    8   |        44.078   |   4.60x

Overall, it agrees with Anna's result above.

> OMP_PROC_BIND=true OMP_PLACES=cores python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=true OMP_PLACES=cores

r.param.scale size=31  (synth_dem_100m, memory=300 MB, baseline = 1 thread)
  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.964   |   1.00x
    2   |       130.950   |   1.55x
    4   |       129.556   |   1.57x
    8   |       130.830   |   1.55x

But not this one?

Comment thread raster/r.param.scale/process.c Outdated
Comment thread raster/r.param.scale/process.c Outdated
Comment thread raster/r.param.scale/process.c
Comment thread raster/r.param.scale/process.c
Comment thread raster/r.param.scale/process.c
Comment thread raster/r.param.scale/process.c Outdated
Simplified G_malloc calls to the sizeof(*ptr) form, reorder in_buf_size so the multiplication happens in size_t, document the MiB conversion, and replaced the per-thread per-row ring allocations with a single contiguous block indexed by per-thread pointer tables.
@krcoder123

Copy link
Copy Markdown
Contributor Author

Hi @petrasovaa, I've addressed Huidae's review comments and pushed in 5661e31. The testsuite still passes 7/7 and I checked the output is still matching the output from main at nprocs=1. So you can rerun your benchmarks whenever you have the time. Thanks

@krcoder123

Copy link
Copy Markdown
Contributor Author

Not sure if my benchmark run is working correctly. It's been stuck for the past 30 minutes on i9 with 128GB RAM.

> python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=unset OMP_PLACES=unset

Maybe, it's just slow. I can see 8 processes in htop.

  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.969   |   1.00x
    2   |       104.693   |   1.94x
    4   |        59.901   |   3.39x
    8   |        44.078   |   4.60x

Overall, it agrees with Anna's result above.

> OMP_PROC_BIND=true OMP_PLACES=cores python benchmark_rparamscale.py 
GRASS 8.6.0dev
Machine: Linux-6.12.40-x86_64-12th_Gen_Intel-R-_Core-TM-_i9-12900-with-glibc2.41; 24 logical CPUs
Power: not checked (non-macOS); ensure AC power and no power saving
Workload: synth_dem_100m, method=elev, memory=300 MB, sizes=[31], nprocs=[1, 2, 4, 8], 1 warmup + 3 timed runs (median)
Pinning: OMP_PROC_BIND=true OMP_PLACES=cores

r.param.scale size=31  (synth_dem_100m, memory=300 MB, baseline = 1 thread)
  nprocs | median time (s) | speedup
  -------+-----------------+---------
    1   |       202.964   |   1.00x
    2   |       130.950   |   1.55x
    4   |       129.556   |   1.57x
    8   |       130.830   |   1.55x

But not this one?

I'm not entirely sure yet. My guess is it might be the P-core/E-core split on that chip, where pinning to all the cores traps some threads on the slower E-cores. macOS doesn't expose core pinning, so I'll try to reproduce it on my Linux VM and let you know what I find.

@github-actions github-actions Bot added docs markdown Related to markdown, markdown files labels Jun 17, 2026
@petrasovaa

Copy link
Copy Markdown
Contributor

The pinning of the cores made my benchmark worse, so I think there is something else going on. I will rerun the benchmark again without it and including your changes.

@krcoder123

Copy link
Copy Markdown
Contributor Author

The pinning of the cores made my benchmark worse, so I think there is something else going on. I will rerun the benchmark again without it and including your changes.

Hi Anna,

I couldn't reproduce the pinning of the cores on my Mac. MacOS doesn't seem to allow a user to pin threads to specific cores, so OMP_PROC_BIND and OMP_PLACES have no effect there and I wasn't able to run a pinned vs unpinned comparison on my own hardware.

Do you or Huidae know whether r.neighbors also shows worse speedups under pinning, or does it hold its speedup ratios? Knowing this could help figure out if this is just how pinning cores behave or if it's a code issue.
Thanks!

@HuidaeCho

Copy link
Copy Markdown
Member

These plots show the thread-averaged time and speedup without pinning:
image
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C Related code is in C CMake docs markdown Related to markdown, markdown files module Python Related code is in Python raster Related to raster data processing tests Related to Test Suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants