
Add DQ scoring to FBS #441

Merged
catherinebirney merged 111 commits into develop from dqi_update
Jul 2, 2025

@catherinebirney
Contributor

@catherinebirney catherinebirney commented Feb 28, 2025

Major changes:

  • Data quality scoring implemented for FBS
    • New adjust_dqi_reliability_collection_scores() to modify data reliability and data collection based on source and target sector levels
    • assign_temporal_correlation() assigns temporal DQ based on difference between year of data and target year of FBS
    • assign_geographical_correlation() assigns DQ for geoscale based on data geoscale vs target FBS geoscale
    • assign_technological_correlation() assigns DQ scores based on difference between source and target sectors
  • Modified how data are merged on location so we can correctly merge state with county data
  • Modified how activities are mapped to sectors
    • Changed how activities are mapped to properly account for data quality scores
      - Technological scores
      - Modify data reliability and data collection scores after mapping
    • First map to the sector year identified in the data crosswalk, then later convert to the target sector year. Previously we immediately converted the crosswalk to the target sector year
    • Modified NAICS year conversion method
      - Pull all NAICS6 and determine mapping changes for child NAICS to parent NAICS in generate_naics_crosswalk_conversion_ratios()
      - For example, if we are converting NAICS4 across years, we identify all child NAICS6 and determine how those NAICS6 map between years. If there are 4 child NAICS6 and one child NAICS6 maps to a different parent NAICS4 in the target year, then ¼ of the original NAICS4 parent value is mapped to a different NAICS4 in the target year
      - Conversion is not based on numeric values within the FBS because we might only have NAICS4 values, not NAICS6, and therefore do not have the data to create proportional conversions
      - We previously mapped all activities to NAICS6+, then converted, then aggregated. This is not a good method for a multitude of reasons, and it is especially problematic when assigning DQ scores
    • New subset_sector_key()
      - Subsets the sector key to return the industry that most closely maps activity/source sectors to target sectors. Drops parent sectors within the crosswalk, assigns tech corr scoring, and modifies datareliability and datacollection scores based on the mapping
  • Modified how naics are converted to target naics years
    • Previously, a data check looked for a sector-like activity in any NAICS year outside of the target year and, if found, mapped it to the target year. This did not always map correctly because a sector can be found in multiple NAICS years, and those NAICS years map differently to the target year
      - Revised this function to check for the closest NAICS year to the target year and use that year to map to target NAICS
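
A minimal sketch of the child-count conversion described for generate_naics_crosswalk_conversion_ratios(), under stated assumptions: the function name conversion_ratios, the (source NAICS6, target NAICS6) pair format, and the sector codes below are all hypothetical and for illustration only. This is not the flowsa implementation and the codes are not taken from a real concordance.

```python
from collections import Counter
from fractions import Fraction


def conversion_ratios(parent, concordance):
    """Return {target_parent: share} for one source-year parent sector.

    `concordance` is an assumed list of (source NAICS6, target NAICS6)
    pairs. Shares are based on counts of child NAICS6, not on flow values.
    """
    level = len(parent)
    # Count the target-year parents reached by each source-year child NAICS6
    targets = Counter(
        target6[:level]
        for source6, target6 in concordance
        if source6.startswith(parent)
    )
    total = sum(targets.values())
    return {code: Fraction(n, total) for code, n in targets.items()}


# Illustrative codes only: four child NAICS6, one of which moves to a
# different NAICS4 parent in the target year
concordance = [
    ("111110", "111110"),
    ("111120", "111120"),
    ("111130", "111130"),
    ("111140", "112140"),
]
ratios = conversion_ratios("1111", concordance)
# ratios == {"1111": Fraction(3, 4), "1121": Fraction(1, 4)}
```

Because the shares come from counts of child sectors rather than from FlowAmount values, the same ratios can be applied even when the FBS only holds NAICS4 values.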

Minor changes:

  • Correct error in attribute_flows_to_sectors()
    • Original group_total assignment was based on original df FlowAmount values, but we reset the index, so needed to base group_total on new index of the df
  • Adds FIPS scale (1,3,5) to FIPS_Crosswalk
  • Add NAICS 2002, 2007, 2022 crosswalks
  • Expand NAICS_Crosswalk_TimeSeries to include NAICS 2022
  • New NAICS_Year_Concordance which maps published 6-digit sectors across years
  • New Sector_Levels csv which labels sector level and sector length for all sectors
  • In source_catalog.yaml
    • Correct BLS_QCEW NAICS years for 2011, 2022, and 2023
  • BLS QCEW estimate_suppressed_qcew()
    • Update the function to only estimate suppressed data up to the max sector level. We no longer estimate suppressed 6-digit sectors when our target is 3-digit
  • Data Quality scores
    • Update GHGI scores
  • Consistent fips scale assignments. National = 5, state = 2, county = 1
  • url updates to government FBA links
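
The change to estimate_suppressed_qcew() listed above can be pictured roughly as a filter applied before estimation. This is only a sketch under stated assumptions: rows_to_estimate, the row dictionaries, and the sector_length lookup are hypothetical stand-ins (the lookup plays the role of the sector lengths in the Sector_Levels csv), and none of them reflects the actual flowsa code.

```python
def rows_to_estimate(rows, sector_length, max_sector_length):
    """Keep suppressed rows that are no more detailed than the target level.

    `sector_length` is an assumed lookup of sector code -> sector length,
    used instead of len(code) so that non-numeric codes can be handled.
    """
    return [
        row
        for row in rows
        if row["suppressed"]
        and sector_length[row["sector"]] <= max_sector_length
    ]


# Illustrative rows and lookup only
sector_length = {"311": 3, "311111": 6, "312": 3}
rows = [
    {"sector": "311", "suppressed": True},
    {"sector": "311111", "suppressed": True},
    {"sector": "312", "suppressed": False},
]
to_estimate = rows_to_estimate(rows, sector_length, max_sector_length=3)
# Only the suppressed 3-digit row remains; the suppressed 6-digit row is skipped
```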

FBA changes

  • BLS_QCEW: expand to include 2000 – 2023, add county FBS, some changes to target_naics_year to match those of the FBA

bl-young and others added 30 commits February 16, 2024 14:12
…uppressed data up to the target sector level in FBS method
# Conflicts:
#	flowsa/data_source_scripts/EPA_GHGI.py
#	flowsa/methods/flowbyactivitymethods/EPA_GHGI.yaml
…e sectors that exist in the flowbyactivity and those that most closely map to the target sectors
…th rather than string length for household and gov codes
@bl-young
Contributor

bl-young commented May 9, 2025

I reviewed the FBS generation in the action at 59d24a9, for the CRHW national FBS, the facilities that come in as 5 digit NAICS instead of 6 are getting dropped. I think this is only when there is a single 6 digit child for that 5 digit.
image

Also seeing that the 5 digit NAICS with multiple children are not being handled correctly:

image

Old (correct): (21222 split evenly between 212221 and 212222)

image

New (incorrect): All of 21222 is assigned to 212222

image
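
For reference, the old (expected) behavior described in this comment, an even split of a parent sector across its children, can be sketched as below. The helper name split_evenly and the amount of 100.0 are hypothetical and for illustration only; this is not the flowsa attribution code.

```python
def split_evenly(parent_amount, children):
    """Allocate a parent sector's amount equally across its child sectors."""
    share = parent_amount / len(children)
    return {child: share for child in children}


# 21222 has two children; the amount is an illustrative value
allocation = split_evenly(100.0, ["212221", "212222"])
# Each child receives half of the parent amount
```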

@bl-young
Contributor

bl-young commented May 9, 2025

> for the CRHW national FBS, the facilities that come in as 5 digit NAICS instead of 6 are getting dropped. I think this is only when there is a single 6 digit child for that 5 digit.

This was resolved by c09f6f3

@bl-young
Contributor

bl-young commented May 9, 2025

In the revised map_to_sectors() under proportional attribution, the grouped df that enters indicates the group_id, which later is used during proportional attribution as the groupby_col. Somewhere in map_to_sectors() this value is getting reset.

…don't want to reset group_id here); no need to check_if_sectors_are_naics twice if 0 the first time
@bl-young
Contributor

bl-young commented May 9, 2025

> In the revised map_to_sectors() under proportional attribution, the grouped df that enters indicates the group_id, which later is used during proportional attribution as the groupby_col. Somewhere in map_to_sectors() this value is getting reset.

I believe that ebe8ae6 addresses this, though we need to confirm it doesn't impact other methods negatively. I was reviewing this in the context of GHG_national, which was showing major diffs (and duplicate values). It now looks correct and shows no change from remote.

@bl-young bl-young mentioned this pull request May 23, 2025
@bl-young
Contributor

bl-young commented Jun 4, 2025

We decided to drop the county employment FBS (or perhaps all but one example). As well as the interim national and state employment FBS files (like 2000-2012), right?

@bl-young
Contributor

bl-young commented Jun 6, 2025

using collapse_FlowBySector() is causing DQI info to be dropped

fix mapping subset when applied to parent-incompleteChild
@catherinebirney catherinebirney marked this pull request as ready for review July 2, 2025 17:31
@catherinebirney
Contributor Author

catherinebirney commented Jul 2, 2025

merging with develop to consolidate changes for v2.1 release @bl-young - moving documentation to new PR #455

@catherinebirney catherinebirney requested a review from bl-young July 2, 2025 17:35
@catherinebirney catherinebirney mentioned this pull request Jul 2, 2025
@catherinebirney catherinebirney merged commit b77af76 into develop Jul 2, 2025
11 of 12 checks passed
@catherinebirney catherinebirney deleted the dqi_update branch July 2, 2025 23:13
@catherinebirney catherinebirney mentioned this pull request Sep 3, 2025
