grass.tools: Add NumPy arrays IO to Tools #5878
Conversation
This adds a Tools class which allows GRASS tools (modules) to be accessed as methods. Once an instance is created, calling a tool is calling a function (method), similarly to grass.jupyter.Map. Unlike grass.script, this does not require a generic function name, and unlike the grass.pygrass module shortcuts, it does not require special objects to mimic the module families. Outputs are handled through a returned object which is the result of automatic output capture and can convert from known formats using properties. A usage example is in the _test() function in the file. The code is included under a new grass.experimental package, which allows merging the code even when further breaking changes are anticipated.
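To illustrate the attribute-based access described above, here is a toy sketch (not the actual grass.tools code) of the idea: underscores in the method name map to dots in the tool name, so `tools.g_region(...)` corresponds to the tool `g.region`.

```python
# Toy sketch (not the actual grass.tools implementation) of mapping
# attribute access to tool names: underscores become dots, so
# tools.g_region(...) corresponds to the tool "g.region".
class ToyTools:
    def __getattr__(self, name):
        tool_name = name.replace("_", ".")

        def run(**kwargs):
            # The real Tools class would execute the GRASS tool and capture
            # its outputs; here we only build the command that would run.
            return [tool_name] + [f"{key}={value}" for key, value in kwargs.items()]

        return run

tools = ToyTools()
print(tools.g_region(raster="elevation"))  # ['g.region', 'raster=elevation']
```

The real class additionally handles flags, output capture, and conversions, but the dispatch idea is the same.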
…ute with that stdin
…are different now)
…needs to be improved there). Allow usage through attributes, run with run_command syntax, and subprocess-like execution.
…est a tool when there is a close match for the name.
…tool in XY project, so useful only for things like g.extension or m.proj, but, with a significant workaround for argparse --help, it can do --help for a tool.
Performance

This aims at increased convenience, not performance, but it comes with a runtime cost which needs to be discussed. However, given the added convenience, at this point, at least for tests, I consider this a great improvement.

Comparison with GRASS native storage

This is expected to be slower than using GRASS native data storage in a GRASS project, because the data is transferred from NumPy arrays into the GRASS storage and then back to NumPy arrays. The penalty is higher for small data. The small-data issue applies to the first, internal use case I'm aiming for, which is the use of NumPy arrays in tests. Creating and testing NumPy arrays is very straightforward in Python, and pytest's asserts work well with NumPy arrays. However, because the data is small, the absolute runtime is short, and the gain is simpler testing with no specialized testing helper code (no need for an assert or compare function for rasters when NumPy arrays are used). In my benchmark, I used a non-trivial, but still relatively simple, raster algebra expression. Using the NumPy interface for a number of cells in the tens, hundreds, or thousands results in a runtime that is ten times longer compared to using native storage. With millions of cells, the runtime is about twice the baseline. With hundreds of millions, the runtime is less than twice the baseline. I did not test further because I was not able to fit the arrays (>30 GB) into my computer's memory. Additionally, the ratio to the baseline would improve as the computation became more complex, since the IO would take up a smaller portion of the overall runtime. The benchmark script is included.
# Plain
tools.r_mapcalc(expression="c = 2 * sqrt(a + b) * sqrt(a) * sqrt(b) + a / 2")
# with NumPy
c = tools.r_mapcalc_simple(expression="2 * sqrt(A + B) * sqrt(A) * sqrt(B) + A / 2", a=a, b=b, output=np.array)

The NumPy interface here is further disadvantaged by the need to use the r.mapcalc.simple wrapper around r.mapcalc for this specific benchmark. The implementation uses grass.script.array, which uses a file for out-of-memory storage of array objects. There is some potential that a different or modified NumPy array translation will achieve better performance.

Comparison with the previous version

The cost of having this feature but not using it is small. There is only a series of if-statements and loops with zero elements which need to be evaluated, so the overhead for having it (without using it) is minimal. I was not able to reliably measure any difference.

# grass.tools with NumPy
$ grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec python -m timeit -n 100 -c "from grass.tools import Tools; tools = Tools(); tools.g_region(flags='p').text"
100 loops, best of 5: 30.1 msec per loop
# Original grass.tools
$ grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec python -m timeit -n 100 -c "from grass.tools import Tools; tools = Tools(); tools.g_region(flags='p').text"
100 loops, best of 5: 30.2 msec per loop
# grass.script baseline
$ grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec python -m timeit -n 100 -c "import grass.script as gs; gs.read_command('g.region', flags='p')"
100 loops, best of 5: 29.9 msec per loop

Comparison with NumPy

For computations which can be done with NumPy, this is expected to be generally slower because, instead of doing everything in memory, this first stores all the data in the native storage on disk and only then starts the computation; afterwards, the data is read again into NumPy arrays. However, the point of this API is not to be faster than NumPy, but to provide a NumPy interface to tools, algorithms, and models available in GRASS. Hence, I'm not providing any specific numbers here.
…eturn None if no stdout or arrays
When there is no standard output (stdout), return None instead of returning the ToolResult object. This creates the expected result in an interactive console and similar cases (notebooks and doctest-tested documentation), i.e., no result for tools which produce no text output (all tools which create spatial data, such as r.slope.aspect or r.grow). Closes OSGeo#6272. This also adds a description of the returned value to the generated documentation of each tool. We don't have a 'returns' section which would apply to all interfaces, so I went with the imperfect solution of adding a 'Returns:' block at the end of the Parameters section. This mimics how the block looks in other Python docs, copying the 'Parameters:' block, but our block does not completely fit with our Parameters section and breaks the hierarchy, as it is nested under Parameters (but it is not a section heading, so it is not visible as a heading or in the TOC). It also leaves out the indent for the description because we can do this indent only for the first line, so a multiline paragraph would not create a consistent look. It also fixes a couple of spelling and syntax issues in the doc and adds missing tests related to no stdout. While this makes even more sense with OSGeo#5878 (returning NumPy arrays), it would make sense even without it.
This now depends on #6278.
When there is no standard output (stdout), return None instead of returning the ToolResult object. This creates the expected result in an interactive console and similar cases (notebooks and doctest-tested documentation), i.e., no result for tools which produce no text output (all tools which create spatial data, such as r.slope.aspect or r.grow). While this makes even more sense with #5878 (returning NumPy arrays), it would make sense even without it. Closes #6272. Provides a consistent_return_value parameter to signal that a result should not switch to None, but should always be an object. Using the long, but very explicit, consistent_return_value as a name. This also adds a description of the returned value to the generated documentation of each tool. We don't have a 'returns' section which would apply to all interfaces, so I went with the imperfect solution of adding a 'Returns:' block at the end of the Parameters section. This mimics how the block looks in other Python docs, copying the 'Parameters:' block, but our block does not completely fit with our Parameters section and breaks the hierarchy, as it is nested under Parameters (but it is not a section heading, so it is not visible as a heading or in the TOC). It also leaves out the indent for the description because we can do this indent only for the first line, so a multiline paragraph would not create a consistent look. It also fixes a couple of spelling and syntax issues in the doc and adds missing tests related to no stdout. Also improves the generated tool documentation: parameters have a space on both sides of the colon.
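The return-value policy above can be sketched as follows. This is a toy illustration, not the real implementation; the class and function names besides consistent_return_value are made up here.

```python
# Toy sketch (assumption, not the real code) of the return-value policy:
# with no stdout and consistent_return_value off, the call yields None;
# with consistent_return_value on, a result object is always returned.
class ToolResult:
    def __init__(self, stdout):
        self.stdout = stdout

def finish_call(stdout, consistent_return_value=False):
    result = ToolResult(stdout)
    if not consistent_return_value and not stdout:
        # Tools like r.slope.aspect produce no text output, so an
        # interactive call shows nothing instead of a result object.
        return None
    return result

print(finish_call(""))  # None
```

With consistent_return_value=True, callers can rely on always getting an object back, at the cost of an extra check for empty output.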
Status

I updated the PR description, and all PRs this depends on are merged, so it is ready for review again.

Requesting array outputs

The syntax to request NumPy arrays as output is:

tools.r_slope_aspect(elevation=np.ones((2, 3)), slope=np.array)

The return value is then the array if only one array is requested. When multiple outputs are returned, they are returned as a tuple. With consistent_return_value=True, named access is also possible:

# option 1: one array
slope = tools.r_slope_aspect(...)
# option 2: multiple arrays
(slope, aspect) = tools.r_slope_aspect(...)
# option 3: named arrays
tools = Tools(consistent_return_value=True)
result = tools.r_slope_aspect(...)
slope = result.arrays.slope

Performance

I don't detect any performance differences between the code which includes this change and the main branch (grass_tools_session_tools_test.py takes 5.90s-6.30s in both cases), so there is no performance cost of having this feature and not using it. The implementation itself performs significantly worse than native data-only computations (taking 2-10 times longer; see detailed benchmarks above). However, I see the main advantage in convenience (e.g., in tests) and when one wants to simply mix it with NumPy (and thus needs to perform some conversions anyway). So, this performance issue is not really a blocker for me.
petrasovaa
left a comment
Looks ready, but I would like to see updated documentation here:
https://grass.osgeo.org/grass-devel/manuals/python_intro.html#numpy-interface
(can be as a separate PR)
A draft of the doc update is now in #6313, so I'll merge this as is now.
This is adding r.pack files (aka native GRASS raster files) as input and output to tools when called through the Tools object. Tool calls such as r_grow can take r.pack files as input or output. The format is distinguished by the file extension. Notably, tool calls such as r_mapcalc don't pass input or output data as separate parameters (expressions or base names), so they can be used like that only when a wrapper exists (r_mapcalc_simple) or, in the future, when more information is included in the interface or passed between the tool and the Tools class Python code. Similarly, tools with multiple inputs or outputs in a single parameter are currently not supported. The code uses --json with the tool to get the information on what is input and what is output, because all are files which may or may not exist (this is different from NumPy arrays, where the user-provided parameters clearly say what is input (an object) and what is output (a class)). Consequently, the whole import-export machinery is only started when there are files in the parameters, as identified by the parameter converter class. Currently, the in-project raster names are driven by the file names. This will break for parallel usage and will not work for vectors as is. While it is good for guessing the right (and nice) name, e.g., for an r.mapcalc expression, ultimately, unique names retrieved with an API function are likely the way to go. When caching is enabled (either through use of a context manager or explicitly), import of inputs is skipped when they were already imported or when they are known outputs. Without the cache, data is deleted after every tool (function) call. Caching keeps the in-project data in the project (as opposed to a hidden cache or deleting it). The parameter to explicitly drive this is called use_cache (originally keep_data). The objects track what is imported and also track import and cleaning tasks at the function-call versus object level.
The data is cleaned even in case of exceptions. The interface was clarified by creating a private/protected version of run_cmd which has the internal-only parameters. This function uses a single try-finally block to trigger the cleaning in case of exceptions. While the code generally supports paths as both strings and Path objects, the actual decisions about import are made from the list-of-strings form of the command. From the caller's perspective, overwrite is supported in the same way as for in-project GRASS rasters. The tests use module scope to reduce fixture setup by a couple of seconds. Changes include a minor cleanup of comments in tests related to testing results without format=json and with, e.g., the --json option. The class documentation discusses overhead and parallelization because the calls are more costly and there is significant state in the object now, with the cache and the rasters created in the background. This includes discussion of the NumPy arrays, too, and slightly improves the wording in the part discussing arrays. This is building on top of #2923 (Tools API), and it is parallel with #5878 (NumPy array IO), although it runs at a different stage than NumPy array conversions and uses a cache for the imported data (it may be connected more with the arrays in the future). This can be used efficiently in Python with Tools (caching, assuming a project) and in a limited way also with the experimental run subcommand in the CLI (no caching, still needs an explicit project). There is more potential use of this with the standalone tools concept (#5843). The big picture is also discussed in #5830.
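The use_cache behavior described above can be sketched with a toy model. This is an assumption-laden illustration, not the actual code; only the use_cache name comes from the PR.

```python
# Toy sketch (assumptions, not the actual code) of the caching behavior:
# with use_cache, inputs that were already imported (or are known outputs)
# are skipped; without it, data would be re-imported on every call because
# it is deleted after each tool (function) call.
class ImportCache:
    def __init__(self, use_cache=True):
        self.use_cache = use_cache
        self.known = set()  # names already imported or produced as outputs

    def should_import(self, name):
        if self.use_cache and name in self.known:
            return False  # already in the project, skip the import
        self.known.add(name)
        return True

cache = ImportCache(use_cache=True)
print(cache.should_import("elevation.pack"))  # True: first use, import it
print(cache.should_import("elevation.pack"))  # False: cached, skip
```

Without use_cache, every call re-imports because the in-project data does not survive the call.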
This is adding NumPy array as input and output to tools when called through the Tools object. This is building on top of Tools class addition in #2923. The big picture is also discussed in #5830.
My focus with this PR is to create a good API which can be used in various contexts and is useful as is. However, the specifics of the implementation, especially performance are secondary issues for me in this PR as long as there is no performance hit for the cases when NumPy arrays are not used.
Main functionality
In an existing session (e.g., in a GRASS tool):
In a Python script:
For tests
Writing pytests and testing the outputs is greatly simplified without introducing additional functionality (not saying that we should not evaluate what more we need for pytest, but we can do it in light of this):
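A sketch of the testing pattern this enables: pytest assertions work directly on arrays, with no raster-comparison helper. The commented-out line is the API from this PR and needs a GRASS session; the stand-in below it keeps the sketch self-contained.

```python
# Sketch of a pytest-style test using NumPy arrays directly.
import numpy as np

def test_doubling_sketch():
    a = np.ones((2, 3))
    # In a real test (inside a GRASS session), the tool call would be:
    # result = tools.r_mapcalc_simple(expression="2 * A", a=a, output=np.array)
    result = 2 * a  # stand-in for the tool call so the sketch runs anywhere
    # Plain NumPy testing helpers replace custom raster-comparison code.
    np.testing.assert_allclose(result, np.full((2, 3), 2.0))

test_doubling_sketch()
```

The point is that no specialized assert or compare function for rasters is needed once the data is a NumPy array.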
Commit message
This is adding NumPy array as input and output to tools when called through the Tools object.
My focus with this PR was to create a good API which can be used in various contexts and is useful as is. However, the specifics of the implementation, especially the lower performance compared to native data, are secondary issues for me in this addition, as long as there is no performance hit for the cases when NumPy arrays are not used, which is the case here. Even with the performance hits, it works great as a replacement for explicit grass.script.array conversions (same code, just in the background) and in tests (replacing custom test asserts and data conversions).
While the interface for inputs is clear (the array with the data), the interface for outputs was a pick among many choices (a type used as a flag was chosen over strings, booleans, empty objects, and flags). Strict adherence to NumPy universal functions was left out, as was control over the actual output array type (a generic array is documented; grass.script.array.array is used now).
The NumPy import dependency is optional so that the imports and Tools objects work without NumPy installed. While the tests would fail, the GRASS build should work without NumPy as of now.
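One common way to make such a dependency optional is a guarded import; the sketch below is an assumption about the pattern, not the actual grass.tools code, and the helper name is made up. The `output=np.array` marker is the syntax from this PR.

```python
# Hedged sketch of an optional-dependency pattern like the one described;
# the exact placement and helper names in grass.tools are assumptions.
try:
    import numpy as np
except ImportError:
    np = None  # the rest of the module keeps working without NumPy

def array_output_requested(value):
    # output=np.array is the marker for a requested array output; without
    # NumPy installed nothing can match, so this safely returns False.
    return np is not None and value is np.array
```

With this shape, only code paths that actually touch arrays require NumPy, so importing the module and calling tools without array IO never fails.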
This combines well with the dynamic return value with control over consistency implemented in #6278, as the arrays are one of the possible return types, but they can also be made part of a consistent return type. This lends itself to a single array, a tuple of arrays, or an object with named arrays as possible return types.
Overall, this is building on top of Tools class addition in #2923. The big picture is also discussed in #5830.