Skip to content

Comments

grass.script: More carefully setup env in create_project#5854

Merged
wenzeslaus merged 5 commits intoOSGeo:mainfrom
wenzeslaus:better-env-handling-in-create-project
Jun 20, 2025
Merged

grass.script: More carefully setup env in create_project#5854
wenzeslaus merged 5 commits intoOSGeo:mainfrom
wenzeslaus:better-env-handling-in-create-project

Conversation

@wenzeslaus
Copy link
Member

This changes the condition used to create the runtime environment in create_project so that the environment is not set up not only when GISBASE is not set, but also when GISBASE is not in PATH which (at least at this point) indicates that no tools can run, so a call to g.proj later (and technically the potential g.message call before that) will fail.

This helps when GISBASE is set, even correctly, for some reason, but PATH is not. This may happen when someone is trying to GISBASE in some special way before GRASS actually runs. I have suspicion this is happening on Windows, but hard to tell without reading the setup scripts in details (when running on Windows you could print the environment and see). However, judging from @echoix experiments in #5717, this may be unrelated to the issues with create_project. Anyway, before this PR, the following results in a traceback, but with this PR it works.

GISBASE="/.../dist.x86_64-pc-linux-gnu/" python -c "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project', epsg=3358); d.cleanup();"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
...
    raise OSError(_("Cannot find the executable {0}").format(args[0]))
OSError: Cannot find the executable g.proj

The env setup is done only when needed, so simple XY does not trigger creation of the environment, and so it is more lightweight in terms of things happening and simply faster.

code time command
original create_project 558 usec per loop python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project'); d.cleanup();"
new create_project 365 usec per loop python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project'); d.cleanup();"
low-level calls 309 usec per loop python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.core._create_location_xy(d.name, 'project'); gs.core._set_location_description(d.name, 'project', None); d.cleanup();"

With faster create_project, I replaced the old gs.core._create_location_xy calls in tests by simple gs.create_project calls.

The code to ensure runtime environment is moved to grass.script.setup, broken down and made reusable. The difference for usage in create_project is that os.environ is now always copied even if it already has GISBASE (we saved that copy operation before if GISBASE was already present). The handling of env, however, is consistent with other functions in grass.script.setup.

The env in create_project is not set up not only when GISBASE is not set, but also when GISBASE is not in PATH which at this point indicates that tools will not run.

The env setup is done only when needed, so simple XY does not trigger creation of the environment, and so it is more lightweight.

The code to ensure runtime environment is moved to grass.script.setup, broken down and made reusable. The difference for usage in create_project is that os.environ is now always copied even if it already has GISBASE (we saved that copy operation before if GISBASE was already present). The handling of env, however, is consistent with other functions in grass.script.setup.

With faster create_project, I replaced the old gs.core._create_location_xy calls by simple gs.project calls in tests.
@github-actions github-actions bot added vector Related to vector data processing temporal Related to temporal data processing Python Related code is in Python libraries module general tests Related to Test Suite notebook labels Jun 6, 2025
Copy link
Member

@echoix echoix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other higher level observations:

  • What's the real difference between gs.create_project() and gs.setup.init() (disregarding the specifics or implementation, looking at a distance above all that...). From the adjustments made in the tests, it seems one would always need to use both (aka boilerplate). If the pattern of using a "with" (context manager) for gs.setup.init() was supposed to be the end-game of the wrappers to make grass work, why would we need another layer above to patch the shortcomings of gs.setup.init()
  • The topic of performance was brought in into the PRs description. Even though currently, I think (as I'm not able to properly measure and profile due to how all grass+python works currently, or at least do it easily) the biggest slowdowns/bottlenecks are the uncountable process calls (launching a binary), usually in python it would be the number of function calls that would be addressed. These are some parts that no newer version of python* (famous last words...) can really improve and reduce. It's hard to optimize when crossing function boundaries, as there's a whole lot going on when "calling a function".
    • Otherwise the next place I would look into, for performance, would be the import times. What is the "cost" of the imports we do, and ending up to "import the world". It's a bit easier to measure, as there's a python cli flag to trace the import times. I suspect temporal import times would be kinda big too. I already know from older exploration that when an import of one of our modules is made to one of our ctypes wrappers, that the import times significantly increase (the proportions are something like 300 for our ctypes part for 80 for normal/the rest: it's been a while and it's by memory). I also noticed in that exploration session a while ago that when there was two imports of two different ctypes modules, the total import time was almost doubled (because of the importance of the ctypes import time). That's why the suspicion of temporal is founded.

@echoix
Copy link
Member

echoix commented Jun 6, 2025

Also, do you want me to redo some tests/exploration on test+pytest, especially on windows, and also randomizing the order to see if this helps the global situation discussed in #4480?

(Ie, if I can get further than just what was ready in #5584)

@wenzeslaus
Copy link
Member Author

About redoing the tests, yes, perhaps better done after merging this, but I would definitely appreciate some help with debugging create_project on Windows. For starters, I don't know if we have a test which would actually test the same thing you a @landam did manually when you both got to g.proj not found.

I need more time to respond to the rest.

@wenzeslaus
Copy link
Member Author

What's the real difference between gs.create_project() and gs.setup.init() ...From the adjustments made in the tests, it seems one would always need to use both (aka boilerplate)...why would we need another layer above to patch the shortcomings of gs.setup.init()

You need to have a project (data) and an active session using a project (or more precisely: env variables to make the runtime work and connection to a project). gs.setup.init() could create a project if we add additional parameters or it could work on some delayed start basis and having a user set up the project later.

# One-step init + create
gs.setup.init("~/myproject", crs="some PROJ-accepted string")  # crs implies create
# Delayed CRS setup
session = gs.setup.init("~/myproject")  # creates an empty XY project if dir does not exist
session.set_crs("some PROJ-accepted string")

The topic of performance was brought in into the PRs description. Even though currently, I think (as I'm not able to properly measure and profile due to how all grass+python works currently, or at least do it easily) the biggest slowdowns/bottlenecks are the uncountable process calls (launching a binary)...

I measured these times way back when in experiment with small Python processes. While I don't have the code or the results anymore, I'm sure it is still an issue. Tools written in Python, even when executed directly in command line, invoke g.parser to deal with their command line parameters. In #2923, I'm even adding another subprocess call to deal with with command line parameters before the tool is executed. This is not ideal, but usually more stable and straightforward than going through ctypes. Ultimately, having direct library function access to the tools (in C or whatever) would be the best, but it is worth seeing what the best options are now for the subprocesses. From API perspective, that's what #2923 is doing. From performance perspective, we are mostly limited by the subprocess call cost which is out of our hands, but from some experiment at some point, it looked like Linux is much faster than Windows for subprocess execution, so perhaps there is still room for improvement. Let's have this discussion some time later.

...What is the "cost" of the imports we do, and ending up to "import the world". It's a bit easier to measure, as there's a python cli flag to trace the import times...

We did some work to avoid ctypes for stability reasons, but the Python library was originally minimal and started in the time when from x import * was the favorite style, so a lot of that legacy is still there. I'm experimenting with lazy imports through module-level __getattr__ in #2923, so that's perhaps a way how to reduce some of the times. It may not make sense for grass.script imports from core.py, but it may make sense for other things.

I already know from older exploration that when an import of one of our modules is made to one of our ctypes wrappers, that the import times significantly increase...

The complexity of ctypes is one of the reasons why subprocesses, while far from perfect, still seem to be somewhat okay. :-)

@echoix
Copy link
Member

echoix commented Jun 19, 2025

Looks promising and eager to try it out

Copy link
Member

@echoix echoix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't try it out manually yet, but if you need to merge it before I get to it, it still looks right, and tests are right.

@wenzeslaus wenzeslaus merged commit 9dd92f9 into OSGeo:main Jun 20, 2025
27 checks passed
@wenzeslaus wenzeslaus deleted the better-env-handling-in-create-project branch June 20, 2025 19:13
@github-actions github-actions bot added this to the 8.5.0 milestone Jun 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

general libraries module notebook Python Related code is in Python temporal Related to temporal data processing tests Related to Test Suite vector Related to vector data processing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants