TensorFlow Data Validation Guide


9. TensorFlow Data Validation (TFDV)

TFDV is made of three components: StatisticsGen, SchemaGen, and ExampleValidator, which are
typically wired together as sketched below.
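A rough sketch of wiring these three components in a TFX pipeline (the CSV directory path is a
placeholder assumption):

```python
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator

# Ingest examples from CSV files (the directory path is a placeholder).
example_gen = CsvExampleGen(input_base='data/')

# Compute feature statistics over the ingested examples.
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

# Infer a schema (types, presence, valency, domains) from those statistics.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

# Validate the statistics against the schema and emit any anomalies found.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
```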

Skew occurs when training data is generated differently from the data used to request
predictions.
Skew should be checked across the training, validation, and testing data splits.
Distribution skew occurs when the distribution of feature values in training data is
significantly different from that of serving data.
A key cause of distribution skew is data being handled or transformed differently in
training than in production (see the skew-detection sketch below).
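A minimal sketch of surfacing distribution skew with the TFDV library, assuming train_stats,
serving_stats, and schema have already been computed (see the StatisticsGen and SchemaGen sketches
below); the feature name and threshold are illustrative assumptions:

```python
import tensorflow_data_validation as tfdv

# Flag the feature if the L-infinity distance between its training and serving
# value distributions exceeds the threshold (feature name and threshold are
# illustrative assumptions).
tfdv.get_feature(schema, 'payment_type').skew_comparator.infinity_norm.threshold = 0.01

# Compare training statistics against serving statistics and report skew anomalies.
skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    serving_statistics=serving_stats)
tfdv.display_anomalies(skew_anomalies)
```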

StatisticsGen
Generates feature statistics and random samples from training data for visualization and
validation.
Requires minimal configuration.
Inputs: Datasets (e.g., from ExampleGen, Pandas DataFrame, CSV, TFRecord).
Outputs: Visualizable statistics (numeric and categorical features).
Identifies data gaps (e.g., missing early morning trip data).
Compares statistics between datasets (e.g., day one vs day two) to analyze differences.
Categorical feature statistics include missing and unique value counts.
Detects unbalanced data distributions, listing the most unbalanced features.
Data validation checks include: min, max, mean, mode, median, correlation, class imbalance,
missing values, histograms (numerical and categorical).
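A minimal sketch of generating and comparing statistics with the TFDV library (the CSV paths are
placeholder assumptions):

```python
import tensorflow_data_validation as tfdv

# Compute feature statistics from CSV files (paths are placeholders).
train_stats = tfdv.generate_statistics_from_csv(data_location='data/train.csv')
eval_stats = tfdv.generate_statistics_from_csv(data_location='data/eval.csv')

# Visualize statistics for one dataset in a notebook.
tfdv.visualize_statistics(train_stats)

# Compare two datasets side by side (e.g., day one vs. day two).
tfdv.visualize_statistics(
    lhs_statistics=eval_stats, rhs_statistics=train_stats,
    lhs_name='EVAL', rhs_name='TRAIN')
```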

SchemaGen
Specifies data types, feature presence requirements, allowed value ranges, etc.
Automatically generates a schema by inferring properties from training data (types,
categories, ranges).
Visualization tool available to review and fix the inferred schema.
Schema visualization elements:
"Type": Feature datatype (int, float, categorical).
"Presence": Whether the feature is required (100% presence) or optimal.
"Valency": Number of values: Feature domain and its valid values.
For categorical features, "single" indicates exactly one category per example.
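A minimal sketch of inferring and reviewing a schema with the TFDV library, assuming train_stats
from the StatisticsGen sketch above; the feature name in the manual fix is illustrative:

```python
import tensorflow_data_validation as tfdv

# Infer a schema (types, presence, valency, domains) from training statistics.
schema = tfdv.infer_schema(statistics=train_stats)

# Display the inferred schema as a table for review and manual fixes.
tfdv.display_schema(schema)

# Example of a manual fix: allow the feature to be missing in up to 10% of
# examples (feature name is an illustrative assumption).
tfdv.get_feature(schema, 'company').presence.min_fraction = 0.9
```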

Example Validator
Identifies anomalies in training and serving data.
Detects different classes of anomalies and emits validation results.
Compares data statistics from StatisticsGen against the defined schema.
Reports anomalies (e.g., missing values).
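The same check can be sketched at the library level, assuming eval_stats and schema from the
sketches above:

```python
import tensorflow_data_validation as tfdv

# Check evaluation statistics against the schema inferred from training data.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

# Display any anomalies found (e.g., missing values, out-of-domain categories).
tfdv.display_anomalies(anomalies)
```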

When to use TFDV


It's easy to think of TFDV as only applying to the start of your training pipeline, but in fact
it has many uses. Some of them are:
Validating new data for inference to make sure that we haven't suddenly started
receiving bad features
Validating new data for inference to make sure that our model has trained on that part
of the decision surface
Validating our data after we've transformed it and done feature engineering (probably
using TensorFlow Transform) to make sure we haven't done something wrong.
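For the inference-time checks above, TFDV schema environments are a useful pattern. A sketch,
assuming a label feature named 'tips' that is present in training data but absent from serving
data, and serving_stats computed as in the earlier sketches:

```python
import tensorflow_data_validation as tfdv

# Declare two environments; by default every feature belongs to both.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# The label ('tips' is an illustrative assumption) is only expected in training
# data, so exclude it from the SERVING environment.
tfdv.get_feature(schema, 'tips').not_in_environment.append('SERVING')

# Validate serving statistics in the SERVING environment; the missing label is
# no longer reported as an anomaly.
serving_anomalies = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')
tfdv.display_anomalies(serving_anomalies)
```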
