EPATADA (EPA Tools for Automated Data Analysis) is an R package developed by the U.S. Environmental Protection Agency to facilitate water quality data analysis using data from the Water Quality Portal (WQP). The package provides a comprehensive workflow for retrieving, cleaning, validating, harmonizing, and analyzing water quality monitoring data from multiple organizations and databases.
This page provides an architectural overview of the entire EPATADA system. For detailed information about specific capabilities, see the linked pages in the table of contents.
EPATADA assists data partners in performing water quality analyses by addressing common challenges in working with WQP data, such as cleaning, validating, and harmonizing records submitted by multiple organizations.
Primary user groups include state and tribal environmental agencies, EPA regional offices, research scientists, and water quality professionals.
Sources: DESCRIPTION40-41 vignettes/TADAModule1.Rmd45-51
EPATADA is both a standalone R package and the computational foundation for TADAShiny, an R Shiny web application hosted at the EPA. The two projects are maintained in separate GitHub repositories.
| Aspect | EPATADA | TADAShiny |
|---|---|---|
| Type | R package | R Shiny application |
| Repository | USEPA/EPATADA | USEPA/TADAShiny |
| Interface | Programmatic (R functions) | Interactive web UI |
| Audience | R users, analysts | Broader user base including non-coders |
| Deployment | Local R session | Web (EPA RConnect) or local |
TADAShiny calls EPATADA functions directly as its backend. Any update to EPATADA is reflected in TADAShiny when the app package dependencies are updated.
Installing and running TADAShiny locally:
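A minimal sketch of a local install, assuming the remotes package, the develop branch, and a run_app() launcher as described in the TADAShiny README. The wrapper name launch_tadashiny() is hypothetical; check the README for the currently recommended branch and options.

```r
# Hypothetical convenience wrapper; nothing is installed until it is called.
launch_tadashiny <- function(ref = "develop") {
  # remotes is one common way to install from GitHub (an assumption here).
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  # dependencies = TRUE should also install the packages TADAShiny relies on.
  remotes::install_github("USEPA/TADAShiny", ref = ref, dependencies = TRUE)
  # Start the Shiny app in the local R session.
  TADAShiny::run_app()
}
```

Calling launch_tadashiny() from an interactive R session would then install the app and open it locally.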
The TADAShiny.tab column (see last.cols in R/RequiredCols.R287) is reserved for state management within the Shiny application: the app uses it to track rows across its tabbed interface.
Sources: README.md9 README.md43-57 R/RequiredCols.R287
EPATADA is organized into three distinct modules that build upon each other to transform raw WQP data into assessment-ready information. Each module can be used independently or in combination.
Module 1 retrieves and cleans data from the Water Quality Portal, producing standardized datasets with quality flags.
Key Functions:
- TADA_DataRetrieval() R/DataDiscoveryRetrieval.R174-1254 - Primary data retrieval with automatic TADA_JoinWQPProfiles()
- TADA_BigDataHelper() - Handles queries exceeding the maxrecs parameter (default: 350,000 records or 300 sites)
- TADA_AutoClean() - Multi-stage cleaning: calls TADA_ConvertSpecialChars(), TADA_SubstituteDeprecatedChars(), TADA_ConvertResultUnits(), TADA_ConvertDepthUnits()
- TADA_FlagMethod(), TADA_FlagFraction(), TADA_FlagSpeciation(), TADA_FlagResultUnit() - Create TADA.*.Flag columns
- TADA_HarmonizeSynonyms() - Creates TADA.ComparableDataIdentifier using the HarmonizationTemplate.csv reference
- TADA_IDCensoredData(), TADA_SimpleCensoredMethods() - Handle non-detects/over-detects

Module 2 adds spatial context by linking monitoring locations to EPA ATTAINS assessment units and USGS NHD hydrography features.
Key Functions:
- TADA_MakeSpatial() R/GeospatialFunctions.R32-133 - Converts dataframe to sf object (CRS 4326)
- fetchATTAINS() R/GeospatialFunctions.R176-677 - Retrieves ATTAINS features from ArcGIS REST services
- fetchNHD() - Retrieves NHDPlus HR/V2 features via the nhdplusTools package
- TADA_CreateAUMLCrosswalk() - Three-priority matching: (1) ATTAINS-submitted, (2) Geospatial intersection, (3) User-supplied
- TADA_FindNearbySites() - Uses dbscan clustering to identify proximate sites (creates TADA.NearbySiteGroups column)
- TADA_OverviewMap(), TADA_ViewATTAINS() - leaflet interactive maps with tribal boundary overlays

Module 3 applies water quality criteria and prepares data for regulatory assessment workflows.
Key Functions:
- TADA_ParametersForAnalysis() - Maps WQP CharacteristicName to ATTAINS ParameterName using ATTAINSParamToWQPCharRef.csv
- TADA_UsesForAnalysis() - Assigns designated uses via rExpertQuery::EQ_AUsMLs()
- TADA_DefineCriteriaMethodology() R/CriteriaMethods.R167-850 - Generates criteria template with EPA 304(a) integration via CriteriaSearchToolRef.csv
- TADA_MLSummary() - Generates site-specific criteria summaries with UniqueSpatialCriteria groupings
- TADA_AssignUsesToAU() - Crosswalks uses to assessment units using ATTAINSParamUseOrgRef

Three-Module Architecture and R File Organization
Sources: R/DataDiscoveryRetrieval.R174-1254 R/GeospatialFunctions.R32-677 R/CriteriaMethods.R167-850 R/ATTAINSCrosswalks.R1-1200 R/Utilities.R541-857
EPATADA exports 106 functions organized into families that map to the three-module architecture. Function names follow consistent naming patterns: TADA_VerbNoun() format with purpose-driven prefixes.
Module 1 function families:

| Function Family | Key Functions | Purpose |
|---|---|---|
| Data Retrieval | TADA_DataRetrieval(), TADA_BigDataHelper(), TADA_JoinWQPProfiles(), TADA_ReadWQPWebServices() | Query WQP and combine profiles |
| Auto-Cleaning | TADA_AutoClean(), TADA_ConvertSpecialChars(), TADA_SubstituteDeprecatedChars() | Initial data standardization |
| Quality Flagging | TADA_FlagMethod(), TADA_FlagFraction(), TADA_FlagSpeciation(), TADA_FlagResultUnit(), TADA_FlagAboveThreshold(), TADA_FlagBelowThreshold(), TADA_FlagCoordinates(), TADA_FlagContinuousData(), TADA_FlagMeasureQualifierCode() | Identify suspect data |
| Unit Conversion | TADA_ConvertResultUnits(), TADA_ConvertDepthUnits() | Standardize measurement units |
| Harmonization | TADA_HarmonizeSynonyms(), TADA_CreateComparableID() | Resolve parameter synonyms |
| Censored Data | TADA_IDCensoredData(), TADA_SimpleCensoredMethods() | Handle non-detects and over-detects |
| Duplicate Detection | TADA_FindPotentialDuplicatesSingleOrg(), TADA_FindPotentialDuplicatesMultipleOrgs() | Identify potential duplicate records |

Module 2 function families:

| Function Family | Key Functions | Purpose |
|---|---|---|
| Spatial Conversion | TADA_MakeSpatial() | Convert to sf object (CRS 4326) |
| ATTAINS Integration | fetchATTAINS(), TADA_CreateAUMLCrosswalk(), TADA_GetATTAINSByAUID(), TADA_CreateATTAINSAUMLCrosswalk(), TADA_UpdateATTAINSAUMLCrosswalk(), TADA_GetATTAINSAUMLCrosswalk() | Link to assessment units |
| NHD Integration | fetchNHD() (uses the nhdplusTools package) | Retrieve catchments and flowlines |
| Spatial Analysis | TADA_FindNearbySites(), TADA_GetUniqueNearbySites() | Identify proximate monitoring locations |
| Interactive Mapping | TADA_OverviewMap(), TADA_FlaggedSitesMap(), TADA_ViewATTAINS(), TADA_NearbySitesMap() | Create leaflet visualizations |

Module 3 function families:

| Function Family | Key Functions | Purpose |
|---|---|---|
| Parameter Mapping | TADA_ParametersForAnalysis(), TADA_GetATTAINSParamToWQPCharRef() | Map WQP to ATTAINS parameters |
| Use Assignment | TADA_UsesForAnalysis(), TADA_AssignUsesToAU() | Assign designated uses |
| Criteria Definition | TADA_DefineCriteriaMethodology(), TADA_GetCriteriaSearchToolRef() | Create criteria tables |
| Spatial Summaries | TADA_MLSummary() | Site-specific criteria application |
| Alias Discovery | TADA_AdditionalCharAliasForReview(), TADA_UsesAliasForReview() | Identify new parameter/use aliases |

Supporting function families used across modules:

| Function Family | Key Functions | Purpose |
|---|---|---|
| Visualization | TADA_Boxplot(), TADA_Histogram(), TADA_Scatterplot(), TADA_TwoCharacteristicScatterplot(), TADA_DepthProfilePlot() | Create plotly figures |
| Statistical Analysis | TADA_Stats(), TADA_SummarizeColumn(), TADA_FieldCounts() | Generate summaries |
| Reference Retrieval | TADA_GetWQXCharValRef(), TADA_GetMeasureUnitRef(), TADA_GetCharacteristicRef(), TADA_GetCriteriaSearchToolRef() | Access reference tables |
| Data Export | TADA_CreateCSV(), TADA_TableExport() | Export results |
| Validation | TADA_CheckRequiredFields(), TADA_CheckType(), TADA_CheckColumns() | Validate inputs |
Sources: NAMESPACE1-109 R/DataDiscoveryRetrieval.R174-195 R/Utilities.R466-471
EPATADA implements a four-stage linear pipeline that transforms raw WQP data into analysis-ready datasets. The system follows a non-destructive transformation philosophy: original WQP columns are never modified, and new TADA.* prefixed columns are created alongside originals.
Data Processing Pipeline with Function Names and File Locations
| Stage | Purpose | Key Functions | Key Output Columns |
|---|---|---|---|
| 1. Retrieval | Download and combine WQP profiles | TADA_DataRetrieval(), TADA_JoinWQPProfiles() | ~150 original WQP columns: ResultMeasureValue, CharacteristicName, MonitoringLocationIdentifier |
| 2. Quality Control | Validate data and identify suspect values | TADA_AutoClean(), TADA_Flag*() | TADA.AnalyticalMethod.Flag, TADA.SampleFraction.Flag, TADA.ResultUnit.Flag (9+ flag types) |
| 3. Standardization | Convert units and harmonize synonyms | TADA_ConvertResultUnits(), TADA_HarmonizeSynonyms(), TADA_CreateComparableID() | TADA.ComparableDataIdentifier, TADA.CharacteristicName, TADA.ResultMeasure.MeasureUnitCode, TADA.ConsolidatedDepth |
| 4. Enrichment | Add spatial and regulatory context | TADA_MakeSpatial(), fetchATTAINS(), TADA_CreateAUMLCrosswalk(), TADA_DefineCriteriaMethodology() | geometry, ATTAINS.AssessmentUnitIdentifier, ATTAINS.WaterType, ATTAINS.OrganizationIdentifier |
The TADA_OrderCols() function R/Utilities.R2690-2862 organizes columns so that each TADA.* column appears adjacent to its original WQP counterpart for easy comparison. The require.cols list R/RequiredCols.R5-202 defines the complete set of required columns for TADA workflows.
Sources: R/DataDiscoveryRetrieval.R174-1254 R/Utilities.R541-1047 R/GeospatialFunctions.R32-677 R/CriteriaMethods.R167-850 R/RequiredCols.R5-202
The central data structure in EPATADA is the "TADA dataframe" - a data.frame or sf object containing water quality monitoring data with both WQP-native columns and TADA-created columns.
| Column Prefix | Description | Examples | Created By |
|---|---|---|---|
| (none) | Original WQP columns | CharacteristicName, ResultMeasureValue, OrganizationIdentifier | WQP/dataRetrieval |
| TADA. | TADA-processed versions of WQP columns | TADA.CharacteristicName, TADA.ResultMeasureValue | TADA functions |
| TADA.*.Flag | Quality control flags | TADA.AnalyticalMethod.Flag, TADA.CensoredData.Flag | TADA flagging functions |
| ATTAINS. | EPA ATTAINS assessment data | ATTAINS.AssessmentUnitIdentifier, ATTAINS.WaterType | ATTAINS crosswalk functions |
| NHD. | National Hydrography Dataset | NHD.comid, NHD.catchmentareasqkm | NHD integration functions |
EPATADA defines a standardized set of required columns in R/RequiredCols.R5-202. These include:

- ResultIdentifier, ActivityIdentifier, MonitoringLocationIdentifier, OrganizationIdentifier
- TADA.CharacteristicName, TADA.ResultMeasureValue, TADA.ResultMeasure.MeasureUnitCode
- ActivityStartDate, ActivityStartDateTime
- TADA.LatitudeMeasure, TADA.LongitudeMeasure
- TADA.ResultSampleFractionText, TADA.MethodSpeciationName, TADA.ActivityMediaName
- ResultDetectionConditionText, DetectionQuantitationLimitTypeName

The TADA_CheckRequiredFields() function validates that a dataframe contains all required columns for TADA workflows.
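The idea behind a required-column check can be sketched in base R. This is not the package's implementation: the helper missing_required() and the shortened column list are hypothetical and exist only to illustrate the check.

```r
# Hypothetical subset of the required columns named in this section.
required_cols <- c(
  "ResultIdentifier", "ActivityIdentifier",
  "MonitoringLocationIdentifier", "OrganizationIdentifier",
  "ActivityStartDate"
)

# Return the required columns that are absent from df (character(0) if none).
missing_required <- function(df, required = required_cols) {
  setdiff(required, names(df))
}

# Toy dataframe that lacks two of the required columns.
toy <- data.frame(
  ResultIdentifier = "R-1",
  ActivityIdentifier = "A-1",
  MonitoringLocationIdentifier = "ORG-SITE-1",
  stringsAsFactors = FALSE
)

missing_cols <- missing_required(toy)
print(missing_cols)
```

For the toy input, the two columns reported as missing are OrganizationIdentifier and ActivityStartDate.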
A critical concept in EPATADA is the "comparable data identifier" - a concatenated key that groups results representing the same observable property:
TADA.ComparableDataIdentifier = TADA.CharacteristicName_TADA.ResultSampleFractionText_TADA.MethodSpeciationName_TADA.ResultMeasure.MeasureUnitCode
Example: "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L"
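This identifier enables grouping and comparing results that share the same characteristic, sample fraction, speciation, and unit.
It is created by TADA_CreateComparableID() R/Utilities.R984-1018 and updated by TADA_HarmonizeSynonyms().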
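The concatenation can be illustrated with a few lines of base R. This is a sketch, not the code in TADA_CreateComparableID(): the helper name make_comparable_id() is hypothetical, and missing values are not handled specially (paste() would render them as the text "NA").

```r
# Hypothetical helper: join the four component columns with underscores.
make_comparable_id <- function(df) {
  paste(
    df$TADA.CharacteristicName,
    df$TADA.ResultSampleFractionText,
    df$TADA.MethodSpeciationName,
    df$TADA.ResultMeasure.MeasureUnitCode,
    sep = "_"
  )
}

# Toy row matching the example given in this section.
toy <- data.frame(
  TADA.CharacteristicName = "TOTAL PHOSPHORUS, MIXED FORMS",
  TADA.ResultSampleFractionText = "UNFILTERED",
  TADA.MethodSpeciationName = "AS P",
  TADA.ResultMeasure.MeasureUnitCode = "UG/L",
  stringsAsFactors = FALSE
)

id <- make_comparable_id(toy)
print(id)  # "TOTAL PHOSPHORUS, MIXED FORMS_UNFILTERED_AS P_UG/L"
```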
Sources: R/RequiredCols.R5-202 R/Utilities.R1024-1047 vignettes/TADAModule1.Rmd506-519
EPATADA implements a three-tier reference data access strategy to ensure reliability while maintaining access to current data: (1) attempt live download from authoritative sources, (2) fall back to internal reference files if download fails, (3) cache successfully retrieved data in memory for session performance.
Reference Data Architecture Diagram
| Reference Type | File Location | Update Method | Used By |
|---|---|---|---|
| WQX Domain Tables | inst/extdata/WQXcharValRef.rda | GitHub Actions (daily) | TADA_Flag* functions |
| EPA Criteria Search Tool | inst/extdata/CriteriaSearchToolRef.csv | GitHub Actions (daily) | TADA_DefineCriteriaMethodology() |
| ATTAINS Crosswalks | inst/extdata/ATTAINSParamToWQPCharRef.csv | GitHub Actions (daily) | TADA_ParametersForAnalysis() |
| Harmonization Templates | inst/extdata/HarmonizationTemplate.csv | Manual updates | TADA_HarmonizeSynonyms() |
| Unit Conversion Tables | inst/extdata/TADAPriorityCharConvertRef.csv | Manual updates | TADA_ConvertResultUnits() |
| Tribal Geographic Data | inst/extdata/*.dbf (6 files) | Manual updates | TADA_DataRetrieval() tribal queries |
The internal reference files in inst/extdata/ serve as fallback when network access is unavailable, ensuring offline operation capability.
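The download, fallback, and cache tiers can be sketched generically in base R. The function get_reference() and the toy loaders below are hypothetical stand-ins, not EPATADA code; the sketch caches only successful downloads, following the description above.

```r
# Session cache (tier 3): an environment keyed by reference name.
ref_cache <- new.env(parent = emptyenv())

get_reference <- function(name, download_fun, fallback_fun) {
  # Serve from the in-memory cache when a live copy was already retrieved.
  if (exists(name, envir = ref_cache, inherits = FALSE)) {
    return(get(name, envir = ref_cache, inherits = FALSE))
  }
  # Tier 1: try the live source; any error yields NULL.
  ref <- tryCatch(download_fun(), error = function(e) NULL)
  # Tier 2: fall back to the bundled file (not cached, so a later call retries).
  if (is.null(ref)) {
    return(fallback_fun())
  }
  assign(name, ref, envir = ref_cache)
  ref
}

# Toy loaders standing in for a web download and a file in inst/extdata/.
failing_download <- function() stop("no network")
working_download <- function() {
  data.frame(Unit = "UG/L", Source = "live", stringsAsFactors = FALSE)
}
bundled_copy <- function() {
  data.frame(Unit = "UG/L", Source = "bundled", stringsAsFactors = FALSE)
}

offline <- get_reference("units", failing_download, bundled_copy)
online <- get_reference("units", working_download, bundled_copy)
cached <- get_reference("units", failing_download, bundled_copy)
```

In this run, offline comes from the bundled copy, online from the live download, and cached is served from memory even though the download function passed to it would fail.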
Sources: R/Utilities.R466-471 R/RequiredCols.R249-284
EPATADA supports multiple workflow patterns depending on user expertise and analysis requirements. The system accommodates both beginner-friendly automated workflows and advanced custom processing.
Workflow Pattern Diagram
The default TADA_DataRetrieval() with applyautoclean = TRUE automatically applies all standard cleaning and quality control steps, producing analysis-ready data in a single function call.
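A single-call retrieval might look like the sketch below. Only applyautoclean is confirmed by this page; the other argument names (statecode, characteristicName, startDate, endDate) and the query values are assumptions based on the package's function documentation, and the wrapper name is hypothetical. The call needs the EPATADA package and network access, so it is wrapped in a function rather than run directly.

```r
# Hypothetical wrapper; the query values are illustrative only.
retrieve_autocleaned <- function() {
  EPATADA::TADA_DataRetrieval(
    statecode = "UT",                  # assumed argument name
    characteristicName = "Phosphorus", # assumed argument name
    startDate = "2020-01-01",          # assumed argument name
    endDate = "2020-12-31",            # assumed argument name
    applyautoclean = TRUE              # the default, shown explicitly
  )
}
```

Consult ?TADA_DataRetrieval for the authoritative argument list before adapting the query.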
Users requiring custom quality control or harmonization can disable automatic cleaning and manually control each processing step.
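One possible step-by-step sequence is sketched below, assuming raw was retrieved with applyautoclean = FALSE. The function names come from this page; the order of the calls, the clean = FALSE arguments, and the wrapper name clean_step_by_step() are assumptions, so check each function's help page before relying on them.

```r
# Hypothetical wrapper: run the cleaning steps one at a time.
# Requires the EPATADA package; nothing runs until the function is called.
clean_step_by_step <- function(raw) {
  df <- EPATADA::TADA_AutoClean(raw)
  # Keep flagged rows so they can be reviewed (clean = FALSE is assumed here).
  df <- EPATADA::TADA_FlagMethod(df, clean = FALSE)
  df <- EPATADA::TADA_FlagFraction(df, clean = FALSE)
  # Harmonize synonyms using the package's default reference table.
  df <- EPATADA::TADA_HarmonizeSynonyms(df)
  # Identify censored results, then apply the default simple methods.
  df <- EPATADA::TADA_IDCensoredData(df)
  EPATADA::TADA_SimpleCensoredMethods(df)
}
```

Splitting the steps this way lets an analyst inspect the dataframe between calls and swap in a custom harmonization table or censored-data method where needed.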
Queries exceeding 350,000 records or 300 sites automatically invoke TADA_BigDataHelper(), which splits queries, tracks progress, and combines results. Users can adjust the record threshold through the maxrecs parameter (default: 350000; see R/DataDiscoveryRetrieval.R193).
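This page does not describe how TADA_BigDataHelper() partitions a query, so the following is only an illustration of the general idea: a greedy grouping of sites into chunks that stay within a record limit and a site limit. The function chunk_sites() and the per-site counts are hypothetical.

```r
# Hypothetical greedy chunker: assign sites to chunks in order, starting a
# new chunk when adding a site would exceed either limit.
chunk_sites <- function(counts, maxrecs = 350000, maxsites = 300) {
  group <- integer(nrow(counts))
  current <- 1L
  recs <- 0
  sites <- 0L
  for (i in seq_len(nrow(counts))) {
    n <- counts$n_results[i]
    if (sites > 0L && (recs + n > maxrecs || sites + 1L > maxsites)) {
      current <- current + 1L
      recs <- 0
      sites <- 0L
    }
    group[i] <- current
    recs <- recs + n
    sites <- sites + 1L
  }
  # One character vector of site IDs per chunk.
  split(counts$site, group)
}

# Toy per-site result counts.
counts <- data.frame(
  site = c("S1", "S2", "S3", "S4"),
  n_results = c(200000, 100000, 100000, 300000),
  stringsAsFactors = FALSE
)

chunks <- chunk_sites(counts)
print(chunks)
```

With the toy counts, S1 and S2 share the first chunk (300,000 records), while S3 and S4 each start a new chunk because adding them would exceed 350,000.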
Sources: vignettes/TADAModule1.Rmd55-93 vignettes/TADAModule1.Rmd367-404 R/DataDiscoveryRetrieval.R174-195
EPATADA integrates with five major external data systems through established R packages and REST APIs. Each integration point uses specific function calls and data formats.
External System Integration with Package Dependencies and Function Calls
| Package | Version | Key Functions Used | TADA Functions That Call Them |
|---|---|---|---|
| dataRetrieval | >= 2.7.21 | readWQPdata(), whatWQPsites(), whatWQPdata() | TADA_DataRetrieval() R/DataDiscoveryRetrieval.R264-267 |
| rExpertQuery | develop branch | EQ_AUsMLs(), EQ_DomainValues(), EQ_NationalExtract() | TADA_GetATTAINSAUMLCrosswalk() R/ATTAINSCrosswalks.R96-102, TADA_UsesForAnalysis() |
| nhdplusTools | latest | NHD data retrieval functions | fetchNHD() R/GeospatialFunctions.R708-1190 |
| arcgislayers | latest | arc_open(), get_layer(), arc_select() | fetchNHD() R/GeospatialFunctions.R756-777, TADA_TribalOptions() |
| sf | latest | st_as_sf(), st_transform(), st_bbox(), st_intersects() | TADA_MakeSpatial() R/GeospatialFunctions.R122-128, fetchATTAINS() |
| leaflet | latest | addPolygons(), addMarkers(), addLayersControl() | TADA_OverviewMap(), TADA_ViewATTAINS() |
| plotly | latest | plot_ly(), add_histogram(), add_trace() | TADA_Boxplot(), TADA_Histogram(), TADA_Scatterplot() |
dplyr | >= 1.1.0 | filter(), mutate(), left_join(), group_by() | All data manipulation functions |
TADA_DataRetrieval() queries three WQP profiles and joins them using TADA_JoinWQPProfiles() R/DataDiscoveryRetrieval.R731-737:
| Profile | WQP Name | Columns | Key Fields |
|---|---|---|---|
| Result | resultPhysChem | ~87 | ResultMeasureValue, ResultMeasure.MeasureUnitCode, CharacteristicName, ResultDetectionConditionText |
| Station | Station | ~35 | MonitoringLocationIdentifier, LatitudeMeasure, LongitudeMeasure, HUCEightDigitCode |
| Project | Project | ~14 | ProjectIdentifier, ProjectName, QAPPApprovedIndicator |
The TADA_JoinWQPProfiles() function performs left_join() operations on OrganizationIdentifier, ActivityIdentifier, and MonitoringLocationIdentifier columns.
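The left-join behavior can be illustrated on toy data. The sketch uses base R merge() with all.x = TRUE in place of dplyr::left_join() to stay dependency-free, and joins results to stations on two of the key columns only; the data values are invented for illustration. Unlike left_join(), merge() sorts the output by the key columns.

```r
# Toy result profile: one row per result.
results <- data.frame(
  OrganizationIdentifier = c("ORG1", "ORG1", "ORG2"),
  MonitoringLocationIdentifier = c("ORG1-A", "ORG1-B", "ORG2-X"),
  ResultMeasureValue = c("0.12", "0.30", "<0.05"),
  stringsAsFactors = FALSE
)

# Toy station profile: ORG2-X is deliberately missing.
stations <- data.frame(
  OrganizationIdentifier = c("ORG1", "ORG1"),
  MonitoringLocationIdentifier = c("ORG1-A", "ORG1-B"),
  LatitudeMeasure = c(40.1, 40.2),
  stringsAsFactors = FALSE
)

# Left join: keep every result row, attach station fields where they match.
joined <- merge(
  results, stations,
  by = c("OrganizationIdentifier", "MonitoringLocationIdentifier"),
  all.x = TRUE
)
print(joined)
```

All three result rows survive the join; the row for ORG2-X has NA in LatitudeMeasure because no station record matched.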
fetchATTAINS() queries four ArcGIS MapServer layers R/GeospatialFunctions.R229-234:
| Layer ID | Feature Type | URL Endpoint | Returned Columns |
|---|---|---|---|
| 3 | Catchments | .../MapServer/3/query? | assessmentunitidentifier, catchmentareasqkm, organizationid |
| 0 | Points | .../MapServer/0/query? | Assessment unit point geometries |
| 1 | Lines | .../MapServer/1/query? | Assessment unit line geometries |
| 2 | Polygons | .../MapServer/2/query? | Assessment unit polygon geometries |
The function uses urltools::param_set() to build query parameters and geojsonsf::geojson_sf() to parse responses into sf objects.
Sources: DESCRIPTION48-86 R/DataDiscoveryRetrieval.R264-267 R/GeospatialFunctions.R176-677 R/ATTAINSCrosswalks.R66-200 tests/testthat/test-DataDiscoveryRetrieval.R1-134
A core design principle of EPATADA is non-destructive transformation: original WQP data columns are never modified or deleted. All transformations create new columns with the TADA. prefix.
Because the originals are retained, each transformed value can be compared directly against its source value.
Example: TADA_ConvertSpecialChars() R/Utilities.R541-854 creates:
- TADA.ResultMeasureValue: Numeric version of ResultMeasureValue
- TADA.ResultMeasureValueDataTypes.Flag: Documents what transformation occurred (e.g., "Less Than", "Numeric Range - Averaged")

Users control whether flagged data is removed via clean parameters (default: FALSE - retain all data with flags).
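The non-destructive pattern can be sketched in base R on a toy column. This is a simplified illustration, not the logic of TADA_ConvertSpecialChars(): the helper add_numeric_result() is hypothetical, and apart from "Less Than" the flag values ("Numeric", "Greater Than", "Text") are placeholders chosen for the example.

```r
# Hypothetical helper: add TADA.* columns without touching the original.
add_numeric_result <- function(df) {
  raw <- df$ResultMeasureValue
  # Strip a leading "<" or ">" before parsing; unparseable text becomes NA.
  stripped <- sub("^[<>]", "", raw)
  df$TADA.ResultMeasureValue <- suppressWarnings(as.numeric(stripped))
  # Record what happened to each value in a companion flag column.
  flag <- rep("Numeric", length(raw))
  flag[grepl("^<", raw)] <- "Less Than"
  flag[grepl("^>", raw)] <- "Greater Than"
  flag[is.na(df$TADA.ResultMeasureValue)] <- "Text"
  df$TADA.ResultMeasureValueDataTypes.Flag <- flag
  df
}

toy <- data.frame(
  ResultMeasureValue = c("1.5", "<0.05", "ND"),
  stringsAsFactors = FALSE
)

out <- add_numeric_result(toy)
print(out)
```

The original ResultMeasureValue column is returned unchanged, with the numeric version and its flag added alongside it.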
Sources: R/Utilities.R541-593 R/Utilities.R596-857 vignettes/TADAModule1.Rmd449-476
EPATADA provides multi-tiered documentation:
| Resource Type | Examples | Purpose |
|---|---|---|
| Training Modules | TADAModule1, TADAModule2, TADAModule3 | Step-by-step workflows |
| Function Documentation | ?TADA_DataRetrieval | Parameter descriptions, examples |
| Example Datasets | Data_6Tribes_5y, Data_Nutrients_UT | Reproducible examples |
| Vignettes | GeospatialDataIntegration, ExampleMod2Workflow | Use case demonstrations |
| GitHub Pages Site | https://usepa.github.io/EPATADA/ | Comprehensive online documentation |
The package includes 4 example datasets covering tribal monitoring, nutrient analysis, and watershed assessments. See Example Data and Use Cases for details.
Sources: vignettes/TADAModule1.Rmd1-22 DESCRIPTION120 Diagram 6 from system architecture
This overview provides the foundation for understanding EPATADA's architecture and capabilities. For detailed information about specific subsystems, proceed to the linked pages in the table of contents.