
Refactor extract_financial_structure to use NumPy arrays and CSR format#1864

Merged
sstruzik merged 3 commits into main from feature/fm-memory-optimization
Feb 26, 2026

Conversation

@sstruzik (Contributor) commented Feb 6, 2026

Summary

Replace all Numba typed Dict and List objects in extract_financial_structure() with memory-efficient NumPy arrays and CSR (Compressed Sparse Row) format.

Changes

  • Replace node_layers dict with node_layers_arr (1D array)
  • Replace node_cross_layers dict with node_cross_layers_arr (1D array)
  • Replace node_to_output_id nested dict with output_id_arr (2D array)
  • Replace programme_node_to_layers dict with layer_source array
  • Replace parent_to_children dict with children_indptr/children_data CSR
  • Replace child_to_parents dict with parents_indptr/parents_data CSR
  • Replace programme_node_to_profiles dict with profiles_indptr/profiles_data CSR
  • Replace has_tiv_policy dict with is_tiv_profile array

Implementation Approach

  • Use two-pass algorithms (count then fill) to build CSR structures directly without intermediate dicts
  • Use node_level_start[level] + agg_id for flat node indexing
  • Add CSR-based helper functions: get_all_children_csr, get_all_parent_csr, get_tiv_csr
  • Add documentation describing the 11-phase data transformation pipeline
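
The two-pass count-then-fill CSR construction described above can be sketched in plain NumPy (an editor's illustration with hypothetical names; the real implementation is Numba-compiled inside extract_financial_structure):

```python
import numpy as np

def build_csr(keys, values, n_rows):
    """Build a CSR (indptr, data) pair mapping each row index to its values.

    Pass 1 counts entries per row; the counts are prefix-summed into indptr;
    pass 2 fills data using a per-row cursor, avoiding intermediate dicts.
    """
    # Pass 1: count entries per row
    counts = np.zeros(n_rows, dtype=np.int64)
    for k in keys:
        counts[k] += 1
    # Prefix-sum the counts into indptr (indptr[i]..indptr[i+1] spans row i)
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    for i in range(n_rows):
        indptr[i + 1] = indptr[i] + counts[i]
    # Pass 2: fill data, advancing a cursor per row
    cursor = indptr[:-1].copy()
    data = np.empty(len(keys), dtype=np.int64)
    for k, v in zip(keys, values):
        data[cursor[k]] = v
        cursor[k] += 1
    return indptr, data

indptr, data = build_csr([0, 2, 0, 1], [10, 20, 30, 40], 3)
# row i's values are data[indptr[i]:indptr[i+1]]
```

The same shape is used for children_indptr/children_data, parents_indptr/parents_data, and profiles_indptr/profiles_data in the PR.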

Performance Results

Tested with: `fmpy --create-financial-structure-files -a2 -p ~/OasisLMF/runs/big_dataset`

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Peak RSS | 9,804 MB (9.6 GB) | 1,830 MB (1.8 GB) | -81% |
| Numba internal overhead | 8,086 MB (84%) | 113 MB (6.8%) | -99% |
| Wall time | ~107 s | ~10 s | ~10.7x faster |

Memory Breakdown

| Category | Before | After |
| --- | --- | --- |
| Input arrays (loaded from disk) | 485 MB | 485 MB |
| Output arrays (nodes, profiles, etc.) | 1,062 MB | 1,062 MB |
| Numba internal (dicts/lists/temps) | 8,086 MB (84%) | 113 MB (6.8%) |

@sstruzik sstruzik force-pushed the feature/fm-memory-optimization branch from 17b1bb0 to bbe9828 on February 9, 2026 14:09
@sstruzik sstruzik requested review from SkylordA and vinulw February 9, 2026 17:19
@sstruzik sstruzik moved this to Waiting for Review in Oasis Dev Team Tasks Feb 9, 2026
@sstruzik sstruzik linked an issue Feb 9, 2026 that may be closed by this pull request
@sstruzik sstruzik added the enhancement (New feature or request) and LTS - 2.5 labels Feb 10, 2026
Comment on lines +485 to +487
profiles_indptr = np.zeros(total_nodes + 1, dtype=oasis_int)
for i in range(total_nodes):
profiles_indptr[i + 1] = profiles_indptr[i] + profiles_count[i]
Contributor:

Again can use numpy vectorised code here:

np.cumulative_sum(profiles_count, dtype=oasis_int, include_initial=True)[:-1]

Contributor:

Similar point to the np.bincount before, but for this one claude says both loops and cumulative_sum are identical.

(benchmark image attached)

Comment on lines +478 to +482
profiles_count = np.zeros(total_nodes, dtype=oasis_int)
for i in range(fm_policytc.shape[0]):
policytc = fm_policytc[i]
node_idx = node_level_start[policytc['level_id']] + policytc['agg_id']
profiles_count[node_idx] += 1
Contributor:

This can be vectorised, something like:

np.bincount(node_level_start[fm_policytc['level_id']] + fm_policytc['agg_id'], minlength=total_nodes).astype(oasis_int)

Not sure if it will provide any performance speedup?
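
A quick sanity check of the suggested equivalence, with toy data (editor's sketch; node_level_start values and the structured dtype are made up for illustration):

```python
import numpy as np

# Toy structured array standing in for fm_policytc
fm_policytc = np.array(
    [(1, 0), (1, 2), (2, 1), (1, 0)],
    dtype=[('level_id', np.int64), ('agg_id', np.int64)],
)
node_level_start = np.array([0, 0, 3], dtype=np.int64)  # flat start index per level
total_nodes = 6

# Loop version from the diff
profiles_count = np.zeros(total_nodes, dtype=np.int64)
for i in range(fm_policytc.shape[0]):
    policytc = fm_policytc[i]
    node_idx = node_level_start[policytc['level_id']] + policytc['agg_id']
    profiles_count[node_idx] += 1

# Vectorised version from the review comment
vec = np.bincount(
    node_level_start[fm_policytc['level_id']] + fm_policytc['agg_id'],
    minlength=total_nodes,
)
assert np.array_equal(profiles_count, vec)
```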

Contributor:

I found that sometimes using simple for loops is quicker than numpy specific functions.

Contributor:

Nice graph by claude (image attached)

Comment on lines 553 to +559
output_len = 0
for i in range(fm_xref.shape[0]):
xref = fm_xref[i]
programme_node = (out_level, xref['agg_id'])
if output_len < xref['output']:
output_len = nb_oasis_int(xref['output'])

if programme_node in node_to_output_id:
node_to_output_id[programme_node][nb_oasis_int(xref['layer_id'])] = nb_oasis_int(xref['output'])
else:
_dict = Dict.empty(nb_oasis_int, nb_oasis_int)
_dict[nb_oasis_int(xref['layer_id'])] = nb_oasis_int(xref['output'])
node_to_output_id[programme_node] = _dict
output_id_arr[xref['agg_id'], xref['layer_id']] = xref['output']
Contributor:

This too can be vectorised:

output_id_arr[fm_xref['agg_id'], fm_xref['layer_id']] = fm_xref['output']
output_len = np.max(fm_xref['output'])

Contributor Author:

This fails in Numba:
output_id_arr[fm_xref['agg_id'], fm_xref['layer_id']] = fm_xref['output']
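
For reference, the fancy-indexing scatter does work in plain (non-jitted) NumPy; the failure is specific to the Numba-compiled function. A small standalone check with toy data (editor's sketch):

```python
import numpy as np

# Toy structured array standing in for fm_xref
fm_xref = np.array(
    [(1, 1, 10), (2, 1, 11), (1, 2, 12)],
    dtype=[('agg_id', np.int64), ('layer_id', np.int64), ('output', np.int64)],
)
output_id_arr = np.zeros((3, 3), dtype=np.int64)

# Vectorised scatter: works in plain NumPy, but (per the author) is
# rejected by Numba's nopython mode for structured-array field indexing.
output_id_arr[fm_xref['agg_id'], fm_xref['layer_id']] = fm_xref['output']
output_len = np.max(fm_xref['output'])
```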

Comment on lines 573 to +580
for programme in fm_programme:
parent = (nb_oasis_int(programme['level_id']), nb_oasis_int(programme['to_agg_id']))
if parent not in node_layers:
node_layers[parent] = nb_oasis_int(len(programme_node_to_profiles[parent]))
programme_node_to_layers[parent] = programme_node_to_profiles[parent]

# create 2 mappings to get the parents and the children of each node
# update the number of layers for nodes based on the number of layers of their parents
# go through each level from top to bottom
parent_to_children = Dict.empty(node_type, List.empty_list(node_type))
child_to_parents = Dict.empty(node_type, List.empty_list(node_type))

parent_level = nb_oasis_int(programme['level_id'])
parent_agg = nb_oasis_int(programme['to_agg_id'])
parent_idx = node_level_start[parent_level] + parent_agg
if node_layers_arr[parent_idx] == 0:
# Use CSR format: profiles_indptr[idx+1] - profiles_indptr[idx] = count
node_layers_arr[parent_idx] = profiles_indptr[parent_idx + 1] - profiles_indptr[parent_idx]
# layer_source[parent_idx] = parent_idx already (uses own profiles)
Contributor:

Can avoid for loop:

parent_idxs = np.unique(node_level_start[fm_programme['level_id']] + fm_programme['to_agg_id']) 
node_layers_arr[parent_idxs] = profiles_indptr[parent_idxs + 1] - profiles_indptr[parent_idxs]
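
The np.unique form should match the loop as long as node_layers_arr is still zero for these parents (the loop's `== 0` guard only matters on first touch). A toy check (editor's sketch, made-up data):

```python
import numpy as np

fm_programme = np.array(
    [(1, 1), (1, 1), (1, 2), (2, 1)],
    dtype=[('level_id', np.int64), ('to_agg_id', np.int64)],
)
node_level_start = np.array([0, 0, 3], dtype=np.int64)
# CSR indptr over 6 nodes: profile count of node i is indptr[i+1] - indptr[i]
profiles_indptr = np.array([0, 1, 3, 6, 6, 8, 9], dtype=np.int64)
node_layers_arr = np.zeros(6, dtype=np.int64)

# Vectorised form from the review comment
parent_idxs = np.unique(node_level_start[fm_programme['level_id']] + fm_programme['to_agg_id'])
node_layers_arr[parent_idxs] = profiles_indptr[parent_idxs + 1] - profiles_indptr[parent_idxs]
```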

Comment on lines +606 to +614
children_indptr = np.zeros(total_nodes + 1, dtype=oasis_int)
for i in range(total_nodes):
children_indptr[i + 1] = children_indptr[i] + children_count[i]
if children_count[i] > 0:
children_len += 1 + children_count[i]

parents_indptr = np.zeros(total_nodes + 1, dtype=oasis_int)
for i in range(total_nodes):
parents_indptr[i + 1] = parents_indptr[i] + parents_count[i]
Contributor:

can use np.cumulative_sum here again

children_inptr = np.cumulative_sum(children_count, include_initial=True, dtype=oasis_int) 
parents_inptr = np.cumulative_sum(parents_count, include_initial=True, dtype=oasis_int) 
children_len = children_len + np.sum(children_count > 0) + np.sum(children_count)

Contributor Author:

Not supported in Numba:
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Use of unsupported NumPy function 'numpy.cumulative_sum' or unsupported use of the function.
parents_indptr = np.cumulative_sum(parents_count, include_initial=True, dtype=oasis_int)
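
As a possible middle ground (an editor's suggestion, not verified inside the jitted function): np.cumsum is the older spelling of the same prefix sum and is on Numba's supported-NumPy list, so the indptr could be built with an explicit leading zero:

```python
import numpy as np

children_count = np.array([2, 0, 3, 1], dtype=np.int64)

# Equivalent to cumulative_sum(..., include_initial=True) without NumPy 2.x:
children_indptr = np.zeros(len(children_count) + 1, dtype=np.int64)
children_indptr[1:] = np.cumsum(children_count)
```

Whether slice assignment of a cumsum compiles under the project's @njit settings would need to be tested.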


# Pass 1: Count children per parent and parents per child
children_count = np.zeros(total_nodes, dtype=oasis_int)
parents_count = np.zeros(total_nodes, dtype=oasis_int)
children_len = 1
Contributor:

unused var


for j in range(n):
for k in range(start, end - 1 - j):
if profiles_data[k]['layer_id'] > profiles_data[k + 1]['layer_id']:
# Swap entries
Contributor:

I think you should swap entire entries at once rather than each individual key

Contributor Author:

I tested that locally,

                        # Swap entries
                        temp_profile = profiles_data[k]
                        profiles_data[k] = profiles_data[k + 1]
                        profiles_data[k + 1] = temp_profile

It doesn't seem to work: I get an error for step policies, not sure why. I agree it would be nicer, but the current version is good enough for me.

Contributor Author:

My bad, I know why: temp_profile is just a pointer (a view into the array).
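
The aliasing can be reproduced in plain NumPy: integer-indexing a structured array yields a record that views the array's buffer, so the swap needs an explicit copy (editor's sketch, outside Numba; whether .copy() compiles under @njit is untested):

```python
import numpy as np

profiles_data = np.array(
    [(2, 10), (1, 20)],
    dtype=[('layer_id', np.int64), ('value', np.int64)],
)
k = 0
# profiles_data[k] is a view-like record; without copy(), overwriting
# profiles_data[k] would also change temp_profile and break the swap.
temp_profile = profiles_data[k].copy()
profiles_data[k] = profiles_data[k + 1]
profiles_data[k + 1] = temp_profile
```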


@sambles sambles mentioned this pull request Feb 26, 2026
@sstruzik sstruzik merged commit f788d0e into main Feb 26, 2026
25 checks passed
@github-project-automation github-project-automation bot moved this from Waiting for Review to Done in Oasis Dev Team Tasks Feb 26, 2026
@awsbuild awsbuild added this to the 2.5.1 milestone Feb 27, 2026

Labels

enhancement (New feature or request), LTS - 2.5

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Improve time and memory performance for big Portfolio

4 participants