Reaction generation using multiprocessing#1459
Conversation
cd3d25c to
97ce8a1
Compare
Codecov Report
@@ Coverage Diff @@
## master #1459 +/- ##
=========================================
+ Coverage 41.53% 41.6% +0.07%
=========================================
Files 176 176
Lines 29111 29142 +31
Branches 5975 5990 +15
=========================================
+ Hits 12090 12125 +35
+ Misses 16189 16173 -16
- Partials 832 844 +12
Continue to review full report at Codecov.
|
mliu49
left a comment
There was a problem hiding this comment.
Great work! I made some comments and suggestions. It would be nice to see some profiling results before this is merged.
Regarding trailing whitespaces and other Codacy issues, you generally only need to address issues in lines that you changed. Removing all trailing whitespaces in a file is not necessary.
rmg.py
Outdated
|
|
||
| # Add option to select max number of processes for reaction generation | ||
| parser.add_argument('-n', '--maxProc', type=int, nargs=1, default=1, | ||
| help='used by multiprocessing in react.py') |
There was a problem hiding this comment.
This may not be the most helpful info, since users may not know what react.py is. Perhaps max number of processors to use during reaction generation or something similar.
rmgpy/rmg/main.py
Outdated
| pass | ||
|
|
||
| if maxProc > psutil.cpu_count(): | ||
| raise ValueError('Invalid format for user defined maximum number of procesors {0}; should be an integer and smaller or equal to your available number of cpu {1}'.format(maxProc, psutil.cpu_count())) |
There was a problem hiding this comment.
It doesn't seem like you're checking the format here?
There was a problem hiding this comment.
Changed to 'Invalid input ...'
rmgpy/rmg/model.py
Outdated
| spcs.extend(rxn.reactants) | ||
| spcs.extend(rxn.products) | ||
|
|
||
| ensure_independent_atom_ids(spcs, resonance=True) |
There was a problem hiding this comment.
Why do you need to ensure_independent_atom_ids here?
There was a problem hiding this comment.
Otherwise Species.getResonanceHybrid() fails due to not having valid atom ids. We call ensure_independent_atom_ids in generate_reactions_from_families, however, this is not working for multiprocessing. Using ensure_independent_atom_ids here fixes that issue.
rmgpy/rmg/react.py
Outdated
| tmp = divmod(user_mem, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) | ||
| tmp2 = min(maxProc, tmp[0]) | ||
| procNum = max(1, tmp2) | ||
| print 'For reaction generation {0} processes are used.'.format(procNum) |
There was a problem hiding this comment.
We like to avoid print statements in favor of using the logging module. So you would just replace print with logging.info() or maybe logging.debug depending on whether you want this to be printed normally.
Something you should consider is how many times this will be printed (seems like it will be a lot) and whether it's useful info for a typical user.
There was a problem hiding this comment.
Done. It will be printed for each time we are spawning a pool of worker. It might be of interest as the number of processes/worker might differ from the user input, depending on the available RAM and the memory consumption of the RMG base process before spawning the workers.
rmgpy/scoop_framework/util.py
Outdated
| when SCOOP is loaded, the future object. | ||
| """ | ||
| warnings.warn("The option scoop is no longer supported"\ | ||
| " and may be removed in Version: 2.3 ", DeprecationWarning) |
There was a problem hiding this comment.
For all of these warnings, could you indent the second line to line up with the " in the first line? Also, the slashes shouldn't be necessary.
There was a problem hiding this comment.
Done. Got the template from here:
https://github.com/ReactionMechanismGenerator/RMG-Py/wiki/RMG-Contributor-Guidelines
I guess it should be changed accordingly?
rmgpy/rmg/react.py
Outdated
| tmp = divmod(usermem, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) | ||
|
|
||
| # Get available RAM (GB)and procnum dependent on OS | ||
| if platform == "linux" or platform == "linux2": |
There was a problem hiding this comment.
You could use platform.startswith('linux') here to check for both.
rmgpy/rmg/react.py
Outdated
| # OS X | ||
| memoryavailable = psutil.virtual_memory().available/(1000.0 ** 3) | ||
| memoryuse = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/(1000.0 ** 3) | ||
|
|
There was a problem hiding this comment.
I think you should add handling for Windows as well. It doesn't have to be equivalent handling, but you could simply add an else for any other platform and perhaps only use one processor in that case.
There was a problem hiding this comment.
Done, but couldn't test it. Is someone with a Windows OS volunteering?
1604410 to
1ad85ac
Compare
|
Hmm, something seems to have gone wrong with your rebase. This branch now includes a ton of commits from master. |
ee97295 to
915910b
Compare
|
I added a commit refactoring the code for family splitting. Let me know what you think, and whether it matches your original design. |
rmgpy/rmg/model.py
Outdated
| # This method chops the iterable into a number of chunks which it | ||
| # submits to the process pool as separate tasks. | ||
| p = Pool(processes=procnum) | ||
| p.map(calculate_thermo_parallel,spcs) |
There was a problem hiding this comment.
I have a couple concerns about the overall implementation for parallel QM. Normally, thermo is calculated for one species at a time in self.processNewReactions() below. Here, you determine all species which would be calculated using QM and pre-calculate their thermo in parallel, then in processNewReactions, it sees the thermo and does not re-calculate it.
- The
calculate_thermo_parallelfunction seems to skip a lot of processing done inThermoDatabase.getThermoDataandthermoEngine.processThermoData, which can have a significant effect on the final values. - It appears that you're using SMILES to determine whether species are unique? This is not safe because different resonance structures will have different SMILES. Also, generating SMILES for every new molecule will be relatively time consuming.
Minor suggestions, which may or may not be relevant after resolving the above issues:
- Since the procnum determination based on RAM is done both here and in react.py, it might be useful to turn it into a function somewhere and import it here.
- I would also split this section off as a new method, since
enlargeis already a really big method.
There was a problem hiding this comment.
The thought behind the implemented procedure is that the QM files should be generated in parallel and then looked up for one species at a time in self.processNewReactions() below. Testing showed that calculating QM thermo for one species at a time is more time consuming than in a bulk up here.
I use SMILES for comparison and then generate resonance structures within calculate_thermo_parallel() to generate QM files for each resonance structure. What would be a better way to ensure that the same QM file is not generated twice/overlapping during the parallel generation?
There was a problem hiding this comment.
The SMILES comparison is now substituted with isIsomorphic comparison.
The procnum determination is now a function and imported for QMTP parallel and reaction generation in parallel.
And the parallel QM files writing section is split off outside of enlarge.
8e3e27b to
55e1fc6
Compare
e82fe61 to
d7a7f84
Compare
d7a7f84 to
63e2976
Compare
6ed0fc6 to
d27647b
Compare
Allows thermo pruning with parallel QMTP.
Add explicit `rename` argument to generateThermo which is only true when called from enlarge. Thus, only new species without thermo are renamed, which prevents initial species and bath gases from getting accidentally renamed.
e69745b to
7318b37
Compare
|
I changed the species re-labeling to only be done for new species created from reaction generation. Turns out that thermo generation for input species was happening earlier than we thought. I also removed the whitespace changes to clean up the PR a bit. |
ab4afeb to
f362c9d
Compare
Reduce number of families being tested Make separate tests for serial and parallel processing
f362c9d to
f749f49
Compare
|
I think this is ready code-wise. Are there any additional tests that you think should be run? |
Just re-tested, QMTP, pdep, and thermo filtering in serial and parallel and seems to work fine. Ready to go from my side. |
|
It is updated. Where do you think it needs more update? |
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation or Problem
Scoop has been found to increase the memory consumption until the OS's out of memory killer kills processes and the reaction generation fails. Consequently, python's multiprocessing module is implemented for MAC and Linux and a deprecation warning is added to scoop references.
Description of Changes
Multiprocessing is implemented for reaction generation in rmgpy/rmg/react.py. The processes are spawned and closed on a single node within the function 'react()'. The number of processes is determined based on the ratio of currently available RAM and currently used RAM. The user can input the maximum number of allowed processes from the command line. For each reaction generation the number of processes will be the minimum value either being the number of allowed processes due to user input or the value obtained by the RAM ratio. The RAM limitation is employed, because multiprocessing is forking the base process and the memory limit (SWAP + RAM) might be exceeded when using too many processes for a base process large in memory.
Multiprocessing is employed from the command line using the -n command and the maximum number of processes the user want to use, here 4. The default is 1 process, only an integer value smaller than the number of available cpu is allowed.
python rmg -n 4 input.py
Testing
The implementation has been tested for the superminimal, minimal and PDD cases on both MAC and RMG-server, ranging from 1 to 24 processes on a single node. An example submission script for the RMG-server is attached here:
submit.txt
Reviewer Tips
Try your own cases and document the changes in run time and memory consumption.