Refactor extract_financial_structure to use NumPy arrays and CSR format #1864
Conversation
Replace all Numba typed Dict and List objects with memory-efficient NumPy arrays and CSR (Compressed Sparse Row) format.

Changes:
- Replace node_layers dict with node_layers_arr (1D array)
- Replace node_cross_layers dict with node_cross_layers_arr (1D array)
- Replace node_to_output_id nested dict with output_id_arr (2D array)
- Replace programme_node_to_layers dict with layer_source array
- Replace parent_to_children dict with children_indptr/children_data CSR
- Replace child_to_parents dict with parents_indptr/parents_data CSR
- Replace programme_node_to_profiles dict with profiles_indptr/profiles_data CSR
- Replace has_tiv_policy dict with is_tiv_profile array

Implementation approach:
- Use two-pass algorithms (count then fill) to build CSR structures directly without intermediate dicts
- Use node_level_start[level] + agg_id for flat node indexing
- Add CSR-based helper functions: get_all_children_csr, get_all_parent_csr, get_tiv_csr

Add documentation describing the 11-phase data transformation pipeline.
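The two-pass (count then fill) CSR build described above can be sketched in plain NumPy. The names `children_count`, `children_indptr`, and `children_data` mirror the PR, but the edge data, node count, and the per-node `cursor` array are made up here for illustration; this is a minimal sketch, not the PR's actual code.

```python
import numpy as np

# Hypothetical miniature of the two-pass (count then fill) CSR build:
# map each parent node index to its list of child node indices.
parents = np.array([0, 0, 1, 1, 1, 2], dtype=np.int32)   # parent index per edge
children = np.array([3, 4, 5, 6, 7, 8], dtype=np.int32)  # child index per edge
total_nodes = 9

# Pass 1: count children per parent.
children_count = np.bincount(parents, minlength=total_nodes).astype(np.int32)

# Build indptr as a prefix sum (the CSR row pointer).
children_indptr = np.zeros(total_nodes + 1, dtype=np.int32)
np.cumsum(children_count, out=children_indptr[1:])

# Pass 2: fill the data array, advancing a write cursor per parent.
children_data = np.empty(len(children), dtype=np.int32)
cursor = children_indptr[:-1].copy()
for p, c in zip(parents, children):
    children_data[cursor[p]] = c
    cursor[p] += 1

# Children of node 1 live in children_data[indptr[1]:indptr[2]]
print(children_data[children_indptr[1]:children_indptr[2]].tolist())  # [5, 6, 7]
```

The point of the two passes is that the `data` array can be allocated at its exact final size, avoiding the growth and pointer chasing of typed dicts of lists.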
force-pushed from 17b1bb0 to bbe9828
```python
profiles_indptr = np.zeros(total_nodes + 1, dtype=oasis_int)
for i in range(total_nodes):
    profiles_indptr[i + 1] = profiles_indptr[i] + profiles_count[i]
```
Again can use numpy vectorised code here:

```python
np.cumulative_sum(profiles_count, dtype=oasis_int, include_initial=True)[:-1]
```

```python
profiles_count = np.zeros(total_nodes, dtype=oasis_int)
for i in range(fm_policytc.shape[0]):
    policytc = fm_policytc[i]
    node_idx = node_level_start[policytc['level_id']] + policytc['agg_id']
    profiles_count[node_idx] += 1
```
This can be vectorised, something like:

```python
np.bincount(node_level_start[fm_policytc['level_id']] + fm_policytc['agg_id'],
            minlength=total_nodes).astype(oasis_int)
```

Not sure if it will provide any performance speedup?
I found that sometimes using simple for loops is quicker than numpy specific functions.
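For reference, the counting loop and the reviewer's `np.bincount` suggestion are equivalent in plain NumPy. The sketch below uses made-up stand-ins for `fm_policytc` and `node_level_start` (the real ones come from the financial-structure files), so it only demonstrates the equivalence, not the PR's actual data:

```python
import numpy as np

# Toy stand-ins for fm_policytc's fields and the per-level flat offsets.
level_id = np.array([0, 0, 1, 1, 1], dtype=np.int64)
agg_id = np.array([1, 2, 1, 1, 2], dtype=np.int64)
node_level_start = np.array([0, 3], dtype=np.int64)  # flat offset per level
total_nodes = 6

# Loop version (as in the PR):
profiles_count = np.zeros(total_nodes, dtype=np.int64)
for i in range(level_id.shape[0]):
    profiles_count[node_level_start[level_id[i]] + agg_id[i]] += 1

# Vectorised version (the reviewer's np.bincount suggestion):
vectorised = np.bincount(node_level_start[level_id] + agg_id,
                         minlength=total_nodes)

print(np.array_equal(profiles_count, vectorised))  # True
```

Inside a Numba `@njit` function the simple loop is often just as fast, since the loop itself is compiled; `np.bincount` mainly pays off in interpreted NumPy code.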
```python
output_len = 0
for i in range(fm_xref.shape[0]):
    xref = fm_xref[i]
    programme_node = (out_level, xref['agg_id'])
    if output_len < xref['output']:
        output_len = nb_oasis_int(xref['output'])

    if programme_node in node_to_output_id:
        node_to_output_id[programme_node][nb_oasis_int(xref['layer_id'])] = nb_oasis_int(xref['output'])
    else:
        _dict = Dict.empty(nb_oasis_int, nb_oasis_int)
        _dict[nb_oasis_int(xref['layer_id'])] = nb_oasis_int(xref['output'])
        node_to_output_id[programme_node] = _dict
    output_id_arr[xref['agg_id'], xref['layer_id']] = xref['output']
```
This too can be vectorised:

```python
output_id_arr[fm_xref['agg_id'], fm_xref['layer_id']] = fm_xref['output']
output_len = np.max(fm_xref['output'])
```
this fails in numba:

```python
output_id_arr[fm_xref['agg_id'], fm_xref['layer_id']] = fm_xref['output']
```
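In plain NumPy (outside the `@njit` function) the suggested scatter assignment does work; the failure is specific to Numba's support for fancy indexing with multiple index arrays. A small demonstration with a made-up `fm_xref`-like structured array (field names mirror the PR, values are invented):

```python
import numpy as np

# Hypothetical fm_xref-like structured array.
fm_xref = np.array([(1, 1, 10), (1, 2, 11), (2, 1, 12)],
                   dtype=[('agg_id', 'i4'), ('layer_id', 'i4'), ('output', 'i4')])

output_id_arr = np.zeros((4, 4), dtype=np.int32)
# Scatter with fancy indexing; safe here because each
# (agg_id, layer_id) pair occurs at most once.
output_id_arr[fm_xref['agg_id'], fm_xref['layer_id']] = fm_xref['output']
output_len = np.max(fm_xref['output'])

print(output_id_arr[1, 2], output_len)  # 11 12
```

Under nopython mode the explicit per-row loop remains the portable fallback, which is what the PR keeps.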
```python
for programme in fm_programme:
    parent = (nb_oasis_int(programme['level_id']), nb_oasis_int(programme['to_agg_id']))
    if parent not in node_layers:
        node_layers[parent] = nb_oasis_int(len(programme_node_to_profiles[parent]))
        programme_node_to_layers[parent] = programme_node_to_profiles[parent]

# create 2 mappings to get the parents and the children of each node
# update the number of layers for nodes based on the number of layers of their parents
# go through each level from top to bottom
parent_to_children = Dict.empty(node_type, List.empty_list(node_type))
child_to_parents = Dict.empty(node_type, List.empty_list(node_type))
```

```python
    parent_level = nb_oasis_int(programme['level_id'])
    parent_agg = nb_oasis_int(programme['to_agg_id'])
    parent_idx = node_level_start[parent_level] + parent_agg
    if node_layers_arr[parent_idx] == 0:
        # Use CSR format: profiles_indptr[idx+1] - profiles_indptr[idx] = count
        node_layers_arr[parent_idx] = profiles_indptr[parent_idx + 1] - profiles_indptr[parent_idx]
        # layer_source[parent_idx] = parent_idx already (uses own profiles)
```
Can avoid for loop:

```python
parent_idxs = np.unique(node_level_start[fm_programme['level_id']] + fm_programme['to_agg_id'])
node_layers_arr[parent_idxs] = profiles_indptr[parent_idxs + 1] - profiles_indptr[parent_idxs]
```

```python
children_indptr = np.zeros(total_nodes + 1, dtype=oasis_int)
for i in range(total_nodes):
    children_indptr[i + 1] = children_indptr[i] + children_count[i]
    if children_count[i] > 0:
        children_len += 1 + children_count[i]

parents_indptr = np.zeros(total_nodes + 1, dtype=oasis_int)
for i in range(total_nodes):
    parents_indptr[i + 1] = parents_indptr[i] + parents_count[i]
```
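The `np.unique` suggestion can be checked in plain NumPy with made-up stand-ins for `fm_programme` and the CSR pointers (all values below are invented). One caveat worth noting: the loop only assigns when `node_layers_arr[parent_idx] == 0`, while the vectorised form assigns unconditionally, so the two only agree when every parent starts at zero:

```python
import numpy as np

# Made-up stand-ins for fm_programme fields and the CSR profile pointers.
level_id = np.array([1, 1, 1], dtype=np.int64)
to_agg_id = np.array([1, 2, 1], dtype=np.int64)
node_level_start = np.array([0, 3], dtype=np.int64)
profiles_indptr = np.array([0, 1, 1, 3, 3, 5, 6], dtype=np.int64)  # 6 nodes -> 7 entries
node_layers_arr = np.zeros(6, dtype=np.int64)

# Vectorised form of "layer count = indptr[idx+1] - indptr[idx]" over unique parents.
parent_idxs = np.unique(node_level_start[level_id] + to_agg_id)
node_layers_arr[parent_idxs] = profiles_indptr[parent_idxs + 1] - profiles_indptr[parent_idxs]

print(node_layers_arr.tolist())  # [0, 0, 0, 0, 2, 1]
```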
can use np.cumulative_sum here again:

```python
children_indptr = np.cumulative_sum(children_count, include_initial=True, dtype=oasis_int)
parents_indptr = np.cumulative_sum(parents_count, include_initial=True, dtype=oasis_int)
children_len = children_len + np.sum(children_count > 0) + np.sum(children_count)
```
not supported in numba:

```
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Use of unsupported NumPy function 'numpy.cumulative_sum' or unsupported use of the function.
```

```python
parents_indptr = np.cumulative_sum(parents_count, include_initial=True, dtype=oasis_int)
```
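`np.cumulative_sum` is the NumPy 2.x name and is not in Numba's supported NumPy subset, but `np.cumsum` is. A sketch of an equivalent `indptr` build, using `np.cumsum` plus an explicit leading zero to mimic `include_initial=True` (this is a suggested substitute, not the code the PR ended up with; `counts` is made up):

```python
import numpy as np

counts = np.array([2, 0, 3, 1], dtype=np.int64)

# Equivalent of np.cumulative_sum(counts, include_initial=True):
# a leading zero followed by the running sum. np.cumsum is supported
# in Numba nopython mode, unlike the NumPy 2.x np.cumulative_sum name.
indptr = np.zeros(counts.shape[0] + 1, dtype=np.int64)
indptr[1:] = np.cumsum(counts)

print(indptr.tolist())  # [0, 2, 2, 5, 6]
```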
```python
# Pass 1: Count children per parent and parents per child
children_count = np.zeros(total_nodes, dtype=oasis_int)
parents_count = np.zeros(total_nodes, dtype=oasis_int)
children_len = 1
```
```python
for j in range(n):
    for k in range(start, end - 1 - j):
        if profiles_data[k]['layer_id'] > profiles_data[k + 1]['layer_id']:
            # Swap entries
```
I think you should swap entire entries at once rather than each individual key
I tested that locally,

```python
# Swap entries
temp_profile = profiles_data[k]
profiles_data[k] = profiles_data[k + 1]
profiles_data[k + 1] = temp_profile
```

It doesn't seem to work. I get an error for step policy, not sure why. I agree it would be nicer, but current is good enough for me.
my bad, I know why: temp_profile is just a pointer
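The "just a pointer" remark matches NumPy's record semantics: indexing a structured array with an integer returns a record that still references the array's buffer, so the classic three-line swap clobbers itself. Copying the record first fixes it. A small demonstration with an invented two-record array:

```python
import numpy as np

data = np.array([(2, 'b'), (1, 'a')], dtype=[('layer_id', 'i4'), ('name', 'U1')])

# data[0] is a view into the buffer, so after data[0] = data[1]
# the "saved" temp reads the overwritten value: the swap is lost.
temp = data[0]
data[0] = data[1]
data[1] = temp
print(data['layer_id'].tolist())  # [1, 1]

# Copying the record first detaches it from the buffer: a real swap.
data = np.array([(2, 'b'), (1, 'a')], dtype=[('layer_id', 'i4'), ('name', 'U1')])
temp = data[0].copy()
data[0] = data[1]
data[1] = temp
print(data['layer_id'].tolist())  # [1, 2]
```

The PR's field-by-field swap sidesteps the issue entirely, which is why it works where the whole-record swap failed.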


Performance Results

Tested with:

```
fmpy --create-financial-structure-files -a2 -p ~/OasisLMF/runs/big_dataset
```

Memory Breakdown