Skip to content

Conversation

@trexfeathers
Copy link
Contributor

@trexfeathers trexfeathers commented Dec 6, 2022

🚀 Pull Request

Description

Taking the lead from #5061, I have minimised the footprint of this change, making it hopefully easier for future maintenance and better suited to a patch release (our goal).

I still get very occasional segfaults and invalid ID NetCDF errors. These were guaranteed before, but I would now estimate they fail a full test run ~ 1 in 20 times. This of course makes them very difficult to investigate - I don't know if there are any commits in this PR's sequence where there was genuinely no problem, or whether it was just luck. The only working theory I have is that they show up when certain combinations of tests are being run on separate processors, but I don't know how I would test such a theory.

I'm putting this up now because I need help with what's left:

  • Investigate remaining intermittent errors (see above) and decide if they are a problem.
  • Write test(s) specifically aimed at thread safety.
  • Arrange for some real world user testing.

Consult Iris pull request check list

@pp-mo
Copy link
Member

pp-mo commented Feb 14, 2023

Ok for your latest !
I'm totally 👍 on the change to a context managerdask.config.set -- I probably should have spotted that + raised it myself.
Still looking forward to some notes about when they fail...

@trexfeathers
Copy link
Contributor Author

trexfeathers commented Feb 15, 2023

Test results

(☟ code) (environment ☞) netcdf_segs v3.4.0
netcdf_segs ✔️ x5 ✔️ x5
v3.4.0 test_realise_data(), test_comparison() - seg-fault * ✔️ x5

Notes

  • Needed to use pytest -k to identify the specific seg-faulting test.
  • test_thread_safety.py obviously doesn't exist in v3.4.0 so I had to bring that in manually when running with the v3.4.0 code base.

@trexfeathers
Copy link
Contributor Author

trexfeathers commented Feb 15, 2023

Discoveries from investigating tests

  • More, smaller threads increase the chance of thread collision. So tests using tiny_chunks fail more reliably than _multisource tests.
  • Tests involving da.store() rather than da.compute() - so those for saving and streaming - NEVER fail, even when streaming from multiple sources. This suggests that Dask's [current] strategy is to serialise the WHOLE operation whenever file writing is involved.
    Good to have those tests in there should the situation ever change in future.
  • Inserting a sleep(random()) line into the NetCDFDataProxy does not produce more thread collisions. In fact it looks like it decreases them, possibly by desynchronising some moments during computation that are naturally synched up.

@trexfeathers
Copy link
Contributor Author

Closes #5016

Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very very many thanks for tackling this @trexfeathers and everyone else who reviewed and tested - I am approving this from the side of us doing testing with ESMValCore, tests that show this fixes the issue completely; of course, you guys decide when to merge, but it'd be very cool to have this in a new iris release soon 🍻 🥳

@trexfeathers trexfeathers linked an issue Feb 20, 2023 that may be closed by this pull request
@trexfeathers
Copy link
Contributor Author

Do you need anything else @pp-mo?

@pp-mo
Copy link
Member

pp-mo commented Feb 20, 2023

Do you need anything else @pp-mo?

No, all good now the tests have passed.
Huge applause ! 👏 👏 🥇

@pp-mo pp-mo merged commit 11da71b into SciTools:main Feb 20, 2023
trexfeathers added a commit to trexfeathers/iris that referenced this pull request Feb 21, 2023
* Unpin netcdf4.

* Temporarily enable GHA on this branch.

* Temporarily enable GHA on this branch.

* Temporarily enable GHA on this branch.

* Experiment to disable wheel CI on forks.

* Disable segfaulting routines.

* More temporary changes to get CI passing.

* More temporary changes to get CI passing.

* Finessed segfault skipping.

* Bring in changed from SciTools#5061.

* Re-instate test_load_laea_grid.

* Adaptations to get the tests passing.

* Use typing.Mapping instead.

* Get doctests passing.

* CF only resolve non-url filenames.

* Confirm thread safety fixes.

* Remove dummy assignment.

* Restored plot_nemo What's New entry.

* _add_aux_factories temporarily release global lock.

* Remove per-file locking.

* Remove remaining test workarounds.

* Remove remaining comments.

* Correct use of CFReader context manager.

* Refactor for easier future maintenance.

* Rename netcdf _thread_safe, add header.

* Full use of ThreadSafeAggregators.

* Full use of ThreadSafeAggregators.

* Remove remaining imports of NetCDF4.

* Test to ensure netCDF4 is via _thread_safe module.

* More refined netcdf._thread_safe classes.

* _thread_safe docstrings.

* Restore original NetCDF code where possible.

* Revert changes to 2.3.rst.

* Update lockfiles.

* Additions to _thread_safe.py

* Remove temporary CI shims.

* New locking stategy for NetCDFDataProxy.

* NetCDFDataProxy simpler use of netCDF4 lock.

* Update lock files.

* Go back to using a Threading Lock.

* Remove superfluous pass commands in test_cf.py.

* Rename _thread_safe to _thread_safe_nc.

* Rename thread safe classes to be 'Wrappers'.

* Better contained getattr and setattr pattern.

* Explicitly name netCDF4 module in _thread_safe_nc docstring.

* Better docstring for _ThreadSafeWrapper.

* Better comment about THREAD_SAFE_FLAG.

* list() wrapping within _GLOBAL_NETCDF4_LOCK, to account for generators.

* More accurate thread_safe docstrings in netcdf.saver.

* Split netcdf integration tests into multiple modules.

* Tests for non-thread-safe NetCDF behaviour.

* Docstring accuracy.

* Correct use of dask config set (context manager).

* Update dependencies.

* Review - don't need the first-class import of iris.tests.

* Better name for the loading test.

* Better selection of data to load.

* What's New entry.

* Improve tests.

* Update lock files.

* Increase chunking on test_save.

---------

Co-authored-by: Patrick Peglar <[email protected]>
@trexfeathers trexfeathers deleted the netcdf_segs branch February 24, 2023 13:54
This was referenced Mar 27, 2023
tkknight added a commit to tkknight/iris that referenced this pull request Apr 22, 2023
* upstream/main: (23 commits)
  Lockfiles and pydata-sphinx-theme fix (SciTools#5188)
  Allow smarter weights (cubes, coordinates, cell measures, or ancillary variables) for aggregation (SciTools#5084)
  removed cell measure mask check and error (SciTools#5181)
  Updated environment lockfiles (SciTools#5177)
  Lazy weighted RMS calculation (SciTools#5017)
  Add coverage badge to README.md (SciTools#5176)
  Add coverage testing (SciTools#4765)
  Whats new updates for v3.4.1 .
  NetCDF thread safety take two (SciTools#5095)
  Updated environment lockfiles (SciTools#5163)
  Plugin support (SciTools#5144)
  Expand scope of common contributor links (SciTools#5159)
  Replace apparently retired UDUNITS documentation link. (SciTools#5153)
  [pre-commit.ci] pre-commit autoupdate (SciTools#5150)
  Fixing typo's in Gitwash. (SciTools#5145)
  add readme #showyourstripes (SciTools#5141)
  [pre-commit.ci] pre-commit autoupdate (SciTools#5143)
  Iris ❤ Xarray docs page. (SciTools#5025)
  [pre-commit.ci] pre-commit autoupdate (SciTools#5136)
  Updated citation (SciTools#5116)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Investigate segfaults with NetCDF4 >1.6.0

3 participants