grass.tools: Add NumPy arrays IO to Tools #5878
Conversation
This adds a Tools class which allows GRASS tools (modules) to be accessed as methods. Once an instance is created, calling a tool is calling a function (method), similarly to grass.jupyter.Map. Unlike grass.script, this does not require a generic function name, and unlike the grass.pygrass module shortcuts, it does not require special objects to mimic the module families. Outputs are handled through a returned object which is the result of automatic output capture and can convert from known formats using properties. A usage example is in the _test() function in the file. The code is included under a new grass.experimental package, which allows merging the code even when further breaking changes are anticipated.
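To illustrate the attribute-based access described above, here is a toy sketch (not the actual grass.tools code) of the idea: underscores in the method name map to dots in the tool name, so `tools.g_region(...)` corresponds to the tool `g.region`.

```python
# Toy sketch (not the actual grass.tools implementation) of mapping
# attribute access to tool names: underscores become dots, so
# tools.g_region(...) corresponds to the tool "g.region".
class ToyTools:
    def __getattr__(self, name):
        tool_name = name.replace("_", ".")

        def run(**kwargs):
            # The real Tools class would execute the GRASS tool and capture
            # its outputs; here we only build the command that would run.
            return [tool_name] + [f"{key}={value}" for key, value in kwargs.items()]

        return run

tools = ToyTools()
print(tools.g_region(raster="elevation"))  # ['g.region', 'raster=elevation']
```

The real class additionally handles flags, output capture, and conversions, but the dispatch idea is the same.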
…ute with that stdin
…are different now)
…needs to be improved there). Allow usage through attributes, run with run_command syntax, and subprocess-like execution.
…est a tool when there is a close match for the name.
…tool in XY project, so useful only for things like g.extension or m.proj, but, with a significant workaround for argparse --help, it can do --help for a tool.
Performance

This aims at increased convenience, not performance, but it comes with a runtime cost which needs to be discussed. However, given the added convenience, at this point, at least for tests, I consider this a great improvement.

Comparison with GRASS native storage

This is expected to be slower than using GRASS native data storage in a GRASS project, because the data is transferred from NumPy arrays into the GRASS storage and then back to NumPy arrays. The penalty is higher for small data. The small-data issue applies to the first, internal use case I'm aiming for, which is the use of NumPy arrays in tests. Creating and testing NumPy arrays is very straightforward in Python, and pytest's asserts work well with NumPy arrays. However, because the data is small, the absolute runtime is short, and the gain is simpler testing with no specialized testing helper code (no need for an assert or compare function for rasters when NumPy arrays are used). In my benchmark, I used a non-trivial, but still relatively simple, raster algebra expression. Using the NumPy interface for a number of cells in the tens, hundreds, or thousands results in a runtime that is ten times longer compared to using native storage. With millions of cells, the runtime is about twice the baseline. With hundreds of millions, the runtime is less than twice the baseline. I did not test further because I was not able to fit the arrays (>30 GB) into my computer's memory. Additionally, the ratio to the baseline would improve as the computation became more complex, since the IO would take up a smaller portion of the overall runtime. The benchmark script is included.
# Plain
tools.r_mapcalc(expression="c = 2 * sqrt(a + b) * sqrt(a) * sqrt(b) + a / 2")
# with NumPy
c = tools.r_mapcalc_simple(expression="2 * sqrt(A + B) * sqrt(A) * sqrt(B) + A / 2", a=a, b=b, output=np.array)

The NumPy interface here is further disadvantaged by the need to use the r.mapcalc.simple wrapper around r.mapcalc for this specific benchmark. The implementation uses grass.script.array, which uses a file for out-of-memory storage of array objects. There is some potential that a different or modified NumPy array translation will achieve better performance.

Comparison with the previous version

The cost of having this feature but not using it is small. There is only a series of if-statements and loops with zero elements which need to be evaluated, so the overhead for having it (without using it) is minimal. I was not able to reliably measure any difference.

# grass.tools with NumPy
$ grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec python -m timeit -n 100 -c "from grass.tools import Tools; tools = Tools(); tools.g_region(flags='p').text"
100 loops, best of 5: 30.1 msec per loop
# Original grass.tools
$ grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec python -m timeit -n 100 -c "from grass.tools import Tools; tools = Tools(); tools.g_region(flags='p').text"
100 loops, best of 5: 30.2 msec per loop
# grass.script baseline
$ grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec python -m timeit -n 100 -c "import grass.script as gs; gs.read_command('g.region', flags='p')"
100 loops, best of 5: 29.9 msec per loop

Comparison with NumPy

For computations which can be done with NumPy, this is expected to be generally slower because, instead of doing everything in memory, this first stores all the data in the native storage on disk and only then starts the computation; afterwards, the data is read again into NumPy arrays. However, the point of this API is not to be faster than NumPy, but to provide a NumPy interface to tools, algorithms, and models available in GRASS. Hence, I'm not providing any specific numbers here.
…eturn None if no stdout or arrays
When there is no standard output (stdout), return None instead of returning the ToolResult object. This creates the expected result in an interactive console and similar cases (notebooks and doctest-tested documentation), i.e., no result for tools which produce no text output (all tools which create spatial data, such as r.slope.aspect or r.grow). Closes OSGeo#6272. This also adds a description of the returned value to the generated documentation of each tool. We don't have a 'returns' section which would apply to all interfaces, so I went with the imperfect solution of adding a 'Returns:' block at the end of the Parameters section. This mimics how the block looks in other Python docs, copying the 'Parameters:' block, but our block does not completely fit with our Parameters section and breaks the hierarchy, as it is nested under Parameters (but it is not a section heading, so it is not visible as a heading or in the TOC). It also leaves out the indent for the description because we can do this indent only for the first line, so a multiline paragraph would not create a consistent look. It also fixes a couple of spelling and syntax issues in the doc and adds missing tests related to no stdout. While this makes even more sense with OSGeo#5878 (returning NumPy arrays), it would make sense even without it.
This now depends on #6278.
When there is no standard output (stdout), return None instead of returning the ToolResult object. This creates the expected result in an interactive console and similar cases (notebooks and doctest-tested documentation), i.e., no result for tools which produce no text output (all tools which create spatial data, such as r.slope.aspect or r.grow). While this makes even more sense with #5878 (returning NumPy arrays), it would make sense even without it. Closes #6272. Provides a consistent_return_value parameter to signal that a result should not switch to None, but should always be an object. Using the long, but very explicit, consistent_return_value as a name. This also adds a description of the returned value to the generated documentation of each tool. We don't have a 'returns' section which would apply to all interfaces, so I went with the imperfect solution of adding a 'Returns:' block at the end of the Parameters section. This mimics how the block looks in other Python docs, copying the 'Parameters:' block, but our block does not completely fit with our Parameters section and breaks the hierarchy, as it is nested under Parameters (but it is not a section heading, so it is not visible as a heading or in the TOC). It also leaves out the indent for the description because we can do this indent only for the first line, so a multiline paragraph would not create a consistent look. It also fixes a couple of spelling and syntax issues in the doc and adds missing tests related to no stdout. Also improves the generated tool documentation: parameters have a space on both sides of the colon.
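The return-value policy above can be sketched as follows. This is a toy illustration, not the real implementation; the class and function names besides consistent_return_value are made up here.

```python
# Toy sketch (assumption, not the real code) of the return-value policy:
# with no stdout and consistent_return_value off, the call yields None;
# with consistent_return_value on, a result object is always returned.
class ToolResult:
    def __init__(self, stdout):
        self.stdout = stdout

def finish_call(stdout, consistent_return_value=False):
    result = ToolResult(stdout)
    if not consistent_return_value and not stdout:
        # Tools like r.slope.aspect produce no text output, so an
        # interactive call shows nothing instead of a result object.
        return None
    return result

print(finish_call(""))  # None
```

With consistent_return_value=True, callers can rely on always getting an object back, at the cost of an extra check for empty output.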
Status

I updated the PR description, and all PRs this depends on are merged, so it is ready for review again.

Requesting array outputs

The syntax to request NumPy arrays as output is:

tools.r_slope_aspect(elevation=np.ones((2, 3)), slope=np.array)

The return value is then the array if only one array is requested. When multiple outputs are returned, they are returned as a tuple. With consistent_return_value=True, named access is also possible:

# option 1: one array
slope = tools.r_slope_aspect(...)
# option 2: multiple arrays
(slope, aspect) = tools.r_slope_aspect(...)
# option 3: named arrays
tools = Tools(consistent_return_value=True)
result = tools.r_slope_aspect(...)
slope = result.arrays.slope

Performance

I don't detect any performance differences between the code which includes this change and the main branch (grass_tools_session_tools_test.py takes 5.90s-6.30s in both cases), so there is no performance cost of having this feature and not using it. The implementation itself performs significantly worse than native data-only computations (taking 2-10 times longer; see detailed benchmarks above). However, I see the main advantage in convenience (e.g., in tests) and when one wants to simply mix it with NumPy (and thus needs to perform some conversions anyway). So, this performance issue is not really a blocker for me.
petrasovaa
left a comment
Looks ready, but I would like to see updated documentation here:
https://grass.osgeo.org/grass-devel/manuals/python_intro.html#numpy-interface
(can be as a separate PR)
A draft of the doc update is now in #6313, so I'll merge this as is now.
This is adding r.pack files (aka native GRASS raster files) as input and output to tools when called through the Tools object. Tool calls such as r_grow can take r.pack files as input or output. The format is distinguished by the file extension. Notably, tool calls such as r_mapcalc don't pass input or output data as separate parameters (expressions or base names), so they can be used like that only when a wrapper exists (r_mapcalc_simple) or, in the future, when more information is included in the interface or passed between the tool and the Tools class Python code. Similarly, tools with multiple inputs or outputs in a single parameter are currently not supported. The code uses --json with the tool to get the information on what is input and what is output, because all are files which may or may not exist (this is different from NumPy arrays, where the user-provided parameters clearly say what is input (an object) and what is output (a class)). Consequently, the whole import-export machinery is only started when there are files in the parameters, as identified by the parameter converter class. Currently, the in-project raster names are driven by the file names. This will break for parallel usage and will not work for vectors as is. While it is good for guessing the right (and nice) name, e.g., for an r.mapcalc expression, ultimately, unique names retrieved with an API function are likely the way to go. When caching is enabled (either through use of a context manager or explicitly), import of inputs is skipped when they were already imported or when they are known outputs. Without the cache, data is deleted after every tool (function) call. Caching keeps the in-project data in the project (as opposed to a hidden cache or deleting it). The parameter to explicitly drive this is called use_cache (originally keep_data). The objects track what is imported and also track import and cleaning tasks at the function-call versus object level.
The data is cleaned even in case of exceptions. The interface was clarified by creating a private/protected version of run_cmd which has the internal-only parameters. This function uses a single try-finally block to trigger the cleaning in case of exceptions. While the code generally supports paths as both strings and Path objects, the actual decisions about import are made from the list-of-strings form of the command. From the caller's perspective, overwrite is supported in the same way as for in-project GRASS rasters. The tests use module scope to reduce fixture setup by a couple of seconds. Changes include a minor cleanup of comments in tests related to testing results without format=json and with, e.g., the --json option. The class documentation discusses overhead and parallelization because the calls are more costly and there is significant state in the object now, with the cache and the rasters created in the background. This includes discussion of the NumPy arrays, too, and slightly improves the wording in the part discussing arrays. This is building on top of #2923 (Tools API), and it is parallel with #5878 (NumPy array IO), although it runs at a different stage than NumPy array conversions and uses a cache for the imported data (it may be connected more with the arrays in the future). This can be used efficiently in Python with Tools (caching, assuming a project) and in a limited way also with the experimental run subcommand in the CLI (no caching, still needs an explicit project). There is more potential use of this with the standalone tools concept (#5843). The big picture is also discussed in #5830.
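The use_cache behavior described above can be sketched with a toy model. This is an assumption-laden illustration, not the actual code; only the use_cache name comes from the PR.

```python
# Toy sketch (assumptions, not the actual code) of the caching behavior:
# with use_cache, inputs that were already imported (or are known outputs)
# are skipped; without it, data would be re-imported on every call because
# it is deleted after each tool (function) call.
class ImportCache:
    def __init__(self, use_cache=True):
        self.use_cache = use_cache
        self.known = set()  # names already imported or produced as outputs

    def should_import(self, name):
        if self.use_cache and name in self.known:
            return False  # already in the project, skip the import
        self.known.add(name)
        return True

cache = ImportCache(use_cache=True)
print(cache.should_import("elevation.pack"))  # True: first use, import it
print(cache.should_import("elevation.pack"))  # False: cached, skip
```

Without use_cache, every call re-imports because the in-project data does not survive the call.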
This is adding NumPy array as input and output to tools when called through the Tools object. This is building on top of Tools class addition in #2923. The big picture is also discussed in #5830.
My focus with this PR is to create a good API which can be used in various contexts and is useful as is. However, the specifics of the implementation, especially performance are secondary issues for me in this PR as long as there is no performance hit for the cases when NumPy arrays are not used.
Main functionality
In an existing session (e.g., in a GRASS tool):
In a Python script:
For tests
Writing pytests and testing the outputs is greatly simplified without introducing additional functionality (not saying that we should not evaluate what more we need for pytest, but we can do it in light of this):
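A sketch of the testing pattern this enables: pytest assertions work directly on arrays, with no raster-comparison helper. The commented-out line is the API from this PR and needs a GRASS session; the stand-in below it keeps the sketch self-contained.

```python
# Sketch of a pytest-style test using NumPy arrays directly.
import numpy as np

def test_doubling_sketch():
    a = np.ones((2, 3))
    # In a real test (inside a GRASS session), the tool call would be:
    # result = tools.r_mapcalc_simple(expression="2 * A", a=a, output=np.array)
    result = 2 * a  # stand-in for the tool call so the sketch runs anywhere
    # Plain NumPy testing helpers replace custom raster-comparison code.
    np.testing.assert_allclose(result, np.full((2, 3), 2.0))

test_doubling_sketch()
```

The point is that no specialized assert or compare function for rasters is needed once the data is a NumPy array.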
Commit message
This is adding NumPy array as input and output to tools when called through the Tools object.
My focus with this PR was to create a good API which can be used in various contexts and is useful as is. However, the specifics of the implementation, especially the lower performance compared to native data, are secondary issues for me in this addition, as long as there is no performance hit for the cases when NumPy arrays are not used, which is the case here. Even with the performance hits, it works great as a replacement for explicit grass.script.array conversions (same code, just in the background) and in tests (replacing custom test asserts and data conversions).
While the interface for inputs is clear (the array with the data), the interface for outputs was a pick among many choices (a type used as a flag was chosen over strings, booleans, empty objects, and flags). Strict adherence to NumPy universal functions was left out, as was control over the actual output array type (a generic array is documented; grass.script.array.array is used now).
The NumPy import dependency is optional so that the imports and Tools objects work without NumPy installed. While the tests would fail, the GRASS build should work without NumPy as of now.
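One common way to make such a dependency optional is a guarded import; the sketch below is an assumption about the pattern, not the actual grass.tools code, and the helper name is made up. The `output=np.array` marker is the syntax from this PR.

```python
# Hedged sketch of an optional-dependency pattern like the one described;
# the exact placement and helper names in grass.tools are assumptions.
try:
    import numpy as np
except ImportError:
    np = None  # the rest of the module keeps working without NumPy

def array_output_requested(value):
    # output=np.array is the marker for a requested array output; without
    # NumPy installed nothing can match, so this safely returns False.
    return np is not None and value is np.array
```

With this shape, only code paths that actually touch arrays require NumPy, so importing the module and calling tools without array IO never fails.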
This combines well with the dynamic return value with control over consistency implemented in #6278, as the arrays are one of the possible return types, but they can also be made part of a consistent return type. This lends itself to a single array, a tuple of arrays, or an object with named arrays as possible return types.
Overall, this is building on top of Tools class addition in #2923. The big picture is also discussed in #5830.