grass.script: More carefully setup env in create_project#5854
grass.script: More carefully setup env in create_project#5854wenzeslaus merged 5 commits intoOSGeo:mainfrom
Conversation
The env in create_project is not set up not only when GISBASE is not set, but also when GISBASE is not in PATH which at this point indicates that tools will not run. The env setup is done only when needed, so simple XY does not trigger creation of the environment, and so it is more lightweight. The code to ensure runtime environment is moved to grass.script.setup, broken down and made reusable. The difference for usage in create_project is that os.environ is now always copied even if it already has GISBASE (we saved that copy operation before if GISBASE was already present). The handling of env, however, is consistent with other functions in grass.script.setup. With faster create_project, I replaced the old gs.core._create_location_xy calls by simple gs.project calls in tests.
echoix
left a comment
There was a problem hiding this comment.
Other higher level observations:
- What's the real difference between
gs.create_project()andgs.setup.init()(disregarding the specifics or implementation, looking at a distance above all that...). From the adjustments made in the tests, it seems one would always need to use both (aka boilerplate). If the pattern of using a "with" (context manager) for gs.setup.init() was supposed to be the end-game of the wrappers to make grass work, why would we need another layer above to patch the shortcomings of gs.setup.init() - The topic of performance was brought in into the PRs description. Even though currently, I think (as I'm not able to properly measure and profile due to how all grass+python works currently, or at least do it easily) the biggest slowdowns/bottlenecks are the uncountable process calls (launching a binary), usually in python it would be the number of function calls that would be addressed. These are some parts that no newer version of python* (famous last words...) can really improve and reduce. It's hard to optimize when crossing function boundaries, as there's a whole lot going on when "calling a function".
- Otherwise the next place I would look into, for performance, would be the import times. What is the "cost" of the imports we do, and ending up to "import the world". It's a bit easier to measure, as there's a python cli flag to trace the import times. I suspect temporal import times would be kinda big too. I already know from older exploration that when an import of one of our modules is made to one of our ctypes wrappers, that the import times significantly increase (the proportions are something like 300 for our ctypes part for 80 for normal/the rest: it's been a while and it's by memory). I also noticed in that exploration session a while ago that when there was two imports of two different ctypes modules, the total import time was almost doubled (because of the importance of the ctypes import time). That's why the suspicion of temporal is founded.
|
About redoing the tests, yes, perhaps better done after merging this, but I would definitely appreciate some help with debugging create_project on Windows. For starters, I don't know if we have a test which would actually test the same thing you a @landam did manually when you both got to g.proj not found. I need more time to respond to the rest. |
You need to have a project (data) and an active session using a project (or more precisely: env variables to make the runtime work and connection to a project). # One-step init + create
gs.setup.init("~/myproject", crs="some PROJ-accepted string") # crs implies create# Delayed CRS setup
session = gs.setup.init("~/myproject") # creates an empty XY project if dir does not exist
session.set_crs("some PROJ-accepted string")
I measured these times way back when in experiment with small Python processes. While I don't have the code or the results anymore, I'm sure it is still an issue. Tools written in Python, even when executed directly in command line, invoke g.parser to deal with their command line parameters. In #2923, I'm even adding another subprocess call to deal with with command line parameters before the tool is executed. This is not ideal, but usually more stable and straightforward than going through ctypes. Ultimately, having direct library function access to the tools (in C or whatever) would be the best, but it is worth seeing what the best options are now for the subprocesses. From API perspective, that's what #2923 is doing. From performance perspective, we are mostly limited by the subprocess call cost which is out of our hands, but from some experiment at some point, it looked like Linux is much faster than Windows for subprocess execution, so perhaps there is still room for improvement. Let's have this discussion some time later.
We did some work to avoid ctypes for stability reasons, but the Python library was originally minimal and started in the time when
The complexity of ctypes is one of the reasons why subprocesses, while far from perfect, still seem to be somewhat okay. :-) |
…aning GISRC without processing g.proj run.
|
Looks promising and eager to try it out |
echoix
left a comment
There was a problem hiding this comment.
I didn't try it out manually yet, but if you need to merge it before I get to it, it still looks right, and tests are right.
This changes the condition used to create the runtime environment in create_project so that the environment is not set up not only when GISBASE is not set, but also when GISBASE is not in PATH which (at least at this point) indicates that no tools can run, so a call to g.proj later (and technically the potential g.message call before that) will fail.
This helps when GISBASE is set, even correctly, for some reason, but PATH is not. This may happen when someone is trying to GISBASE in some special way before GRASS actually runs. I have suspicion this is happening on Windows, but hard to tell without reading the setup scripts in details (when running on Windows you could print the environment and see). However, judging from @echoix experiments in #5717, this may be unrelated to the issues with create_project. Anyway, before this PR, the following results in a traceback, but with this PR it works.
The env setup is done only when needed, so simple XY does not trigger creation of the environment, and so it is more lightweight in terms of things happening and simply faster.
python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project'); d.cleanup();"python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project'); d.cleanup();"python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.core._create_location_xy(d.name, 'project'); gs.core._set_location_description(d.name, 'project', None); d.cleanup();"With faster create_project, I replaced the old gs.core._create_location_xy calls in tests by simple gs.create_project calls.
The code to ensure runtime environment is moved to grass.script.setup, broken down and made reusable. The difference for usage in create_project is that os.environ is now always copied even if it already has GISBASE (we saved that copy operation before if GISBASE was already present). The handling of env, however, is consistent with other functions in grass.script.setup.