perf(pack): Improve Intra-Structure Parallel Efficiency #14
Merged
…y removing unnecessary sub parallel iterator
Pull request overview
Performance-focused refactor of packing phases to reduce redundant work and parallel overhead, including denser graph representations and zero-copy pair-energy writes into the global tables.
Changes:
- Reworked DEE to converge via a neighbor-driven worklist with cached per-slot alive candidates and precomputed neighbor-edge metadata.
- Switched DP adjacency to a packed `BitMatrix` and refactored tree-decomposition DP to use cached per-node/per-edge info with adaptive parallelism.
- Updated pair-energy computation and `PairEnergyTable` APIs to write directly into non-overlapping mutable matrix slices, removing intermediate allocations and nested parallelism.
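The neighbor-driven worklist in the first item can be illustrated with a minimal sketch. This is not the code from `dee.rs`: the `converge` function, the `try_prune` callback, and the toy pruning rule in `demo` are all hypothetical, and the rounds run sequentially here rather than in parallel. It shows only the scheduling idea, namely that after the first full sweep a slot is rechecked only if one of its neighbors lost a candidate in the previous round.

```rust
/// Runs rounds of pruning until nothing changes and returns how many slot
/// checks were performed. `try_prune(slot)` is a hypothetical callback that
/// returns true when it removed at least one candidate from `slot`.
fn converge(neighbors: &[Vec<usize>], mut try_prune: impl FnMut(usize) -> bool) -> usize {
    let n = neighbors.len();
    // The first round checks every slot once.
    let mut dirty: Vec<usize> = (0..n).collect();
    let mut checks = 0;
    while !dirty.is_empty() {
        let mut flagged = vec![false; n];
        for &slot in &dirty {
            checks += 1;
            if try_prune(slot) {
                // Only neighbors of a newly pruned slot need another look.
                for &nb in &neighbors[slot] {
                    flagged[nb] = true;
                }
            }
        }
        dirty = (0..n).filter(|&s| flagged[s]).collect();
    }
    checks
}

/// Toy stand-in for the DEE criterion (an assumption for illustration only):
/// a slot collapses to one candidate once any neighbor has exactly one left.
fn demo() -> (Vec<usize>, usize) {
    // Path graph 0 - 1 - 2 - 3.
    let neighbors = vec![vec![1], vec![0, 2], vec![1, 3], vec![2]];
    let mut alive = vec![3usize, 3, 3, 1];
    let checks = converge(&neighbors, |slot| {
        let fixed_neighbor = neighbors[slot].iter().any(|&nb| alive[nb] == 1);
        if alive[slot] > 1 && fixed_neighbor {
            alive[slot] = 1;
            true
        } else {
            false
        }
    });
    (alive, checks)
}

fn main() {
    let (alive, checks) = demo();
    println!("alive counts: {:?}, slot checks: {}", alive, checks);
}
```

On this four-slot path the worklist performs 9 slot checks, whereas sweeping every slot in each of the four rounds would perform 16.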
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/dreid-pack/src/pack/phase/prune.rs | Removes redundant nested parallel iterator in frame-energy computation. |
| crates/dreid-pack/src/pack/phase/pair.rs | Writes pair energies directly into PairEnergyTable edge slices in a per-edge parallel loop. |
| crates/dreid-pack/src/pack/phase/dp.rs | Introduces BitMatrix adjacency + refactors elimination/DP execution and caching. |
| crates/dreid-pack/src/pack/phase/dee.rs | Implements worklist-based DEE convergence with cached alive sets and prebuilt edge metadata. |
| crates/dreid-pack/src/pack/model/spatial.rs | Comment spelling correction (“initialized”). |
| crates/dreid-pack/src/pack/model/energy.rs | Replaces per-entry set() with matrices_mut() for bulk, zero-copy mutable access; updates tests accordingly. |
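The `matrices_mut()` approach listed for `energy.rs` and `pair.rs` can be sketched as follows. This is an illustration under stated assumptions, not the crate's actual API: `EdgeTable`, its `lens` field, and `fill_parallel` are hypothetical names, and scoped threads from the standard library stand in for whatever parallel dispatch the PR uses. The point is that carving one flat buffer into disjoint `&mut` slices with `split_at_mut` lets each edge's block be written concurrently, with no intermediate `Vec` and no locking.

```rust
use std::thread;

/// Hypothetical flat table: one contiguous buffer holding one block per edge.
struct EdgeTable {
    data: Vec<f32>,
    lens: Vec<usize>, // block length for each edge
}

impl EdgeTable {
    fn new(lens: Vec<usize>) -> Self {
        let total: usize = lens.iter().sum();
        EdgeTable { data: vec![0.0; total], lens }
    }

    /// Splits the buffer into one non-overlapping mutable slice per edge.
    fn matrices_mut(&mut self) -> Vec<&mut [f32]> {
        let mut rest: &mut [f32] = &mut self.data;
        let mut out = Vec::with_capacity(self.lens.len());
        for &len in &self.lens {
            // Move the remaining slice out so both halves keep the full lifetime.
            let taken = std::mem::take(&mut rest);
            let (head, tail) = taken.split_at_mut(len);
            out.push(head);
            rest = tail;
        }
        out
    }
}

/// Fills every edge block in parallel; each thread owns exactly one slice.
fn fill_parallel(table: &mut EdgeTable) {
    let slices = table.matrices_mut();
    thread::scope(|s| {
        for (edge, block) in slices.into_iter().enumerate() {
            s.spawn(move || {
                // The inner loop is sequential within the per-edge task.
                for (k, cell) in block.iter_mut().enumerate() {
                    *cell = (edge * 100 + k) as f32; // placeholder "energy"
                }
            });
        }
    });
}

fn main() {
    let mut table = EdgeTable::new(vec![2, 3, 1]);
    fill_parallel(&mut table);
    println!("{:?}", table.data);
}
```

Because the borrow checker sees the slices as disjoint, no `unsafe` code or synchronization is needed for the writes.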
Summary:
Eliminated redundant work and parallel dispatch overhead across all compute-bound packing phases. DEE: 1.59 s → 0.81 s (−49%) via worklist convergence and prebuilt per-slot neighbor caches. DP adjacency migrated to `BitMatrix` (64-bit packed), reducing memory footprint and simplifying graph-algorithm signatures throughout. Pair energy writes directly to table slices, removing intermediate allocation and a nested parallel loop. DB379 quality unchanged (χ₁–₄ (20°) 71.5%, RMSD 0.728 Å); all 401 tests pass.

Changes:
- `dee.rs`: Worklist convergence: each round rechecks only slots neighboring a newly pruned slot, skipping unchanged slots entirely. Alive candidates are cached per slot (sorted by ascending self-energy for earlier witness hits) and kept in sync with pruning; the graph is traversed once before the loop rather than once per round. Adaptive parallelism via `with_min_len`.
- `dp.rs`: `Vec<bool>` adjacency replaced with `BitMatrix` (u64-packed rows), which is denser, more cache-friendly, and has a cleaner API; all graph-algorithm functions (`mcs_order`, `is_peo`, `fill_in`, etc.) simplified accordingly. Standalone `build_alive_table` / `topo_order` helpers inlined into the solve path. Separator DP uses adaptive `with_min_len` parallelism.
- `energy.rs`: `PairEnergyTable::set()` replaced by `matrices_mut()`, which returns non-overlapping mutable slices via `split_at_mut` for zero-copy parallel bulk writes.
- `pair.rs`: `compute()` writes directly into table slices from `matrices_mut()`, eliminating an intermediate `Vec`. Inner rotamer loop made sequential within the per-edge parallel dispatch.
- `prune.rs`: Removed redundant nested parallel iterator from frame energy computation.
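To make the u64-packed adjacency concrete, here is a minimal sketch of a bit matrix. It is an assumption about the general technique, not the `BitMatrix` defined in `dp.rs`: the field names and the `set_edge`, `get`, `degree`, and `common_neighbors` methods are hypothetical. Each row stores 64 adjacency bits per word, so an n × n graph needs n · ⌈n/64⌉ words instead of n² bytes, and set operations on neighborhoods reduce to word-wise AND plus a popcount.

```rust
/// Hypothetical packed adjacency matrix: row `i` occupies `words_per_row`
/// consecutive u64 words, and bit `j` of that row marks the edge (i, j).
struct BitMatrix {
    n: usize,
    words_per_row: usize,
    bits: Vec<u64>,
}

impl BitMatrix {
    fn new(n: usize) -> Self {
        let words_per_row = (n + 63) / 64; // ceil(n / 64)
        BitMatrix { n, words_per_row, bits: vec![0; n * words_per_row] }
    }

    fn set(&mut self, i: usize, j: usize) {
        assert!(i < self.n && j < self.n);
        self.bits[i * self.words_per_row + j / 64] |= 1u64 << (j % 64);
    }

    /// Marks an undirected edge by setting both (i, j) and (j, i).
    fn set_edge(&mut self, i: usize, j: usize) {
        self.set(i, j);
        self.set(j, i);
    }

    fn get(&self, i: usize, j: usize) -> bool {
        assert!(i < self.n && j < self.n);
        (self.bits[i * self.words_per_row + j / 64] >> (j % 64)) & 1 == 1
    }

    fn row(&self, i: usize) -> &[u64] {
        &self.bits[i * self.words_per_row..(i + 1) * self.words_per_row]
    }

    /// Number of neighbors of `i`, computed as a popcount over the row.
    fn degree(&self, i: usize) -> u32 {
        self.row(i).iter().map(|w| w.count_ones()).sum()
    }

    /// Number of vertices adjacent to both `a` and `b` (word-wise AND).
    fn common_neighbors(&self, a: usize, b: usize) -> u32 {
        self.row(a)
            .iter()
            .zip(self.row(b))
            .map(|(x, y)| (x & y).count_ones())
            .sum()
    }
}

fn main() {
    let mut m = BitMatrix::new(70); // 70 vertices, so each row spans two words
    m.set_edge(0, 65);
    m.set_edge(0, 1);
    m.set_edge(1, 65);
    println!("deg(0) = {}, common(0, 1) = {}", m.degree(0), m.common_neighbors(0, 1));
}
```

Graph routines such as ordering or fill-in checks can then take a single `&BitMatrix` argument rather than a `Vec<bool>` together with its dimension, which is one plausible reading of the "simplified signatures" noted above.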