Data Infrastructure for Machine Learning
Samridhi Jha¹, Anshuman Tyagi²
²HOD, Electrical and Electronics, Pranveer Singh Institute of Technology, Kanpur
Abstract: Data quality is critical for effective machine learning, and this makes data a first-class citizen in the context of
machine learning, on par with algorithms, software, and infrastructure. As a result, machine-learning platforms need to support
data analysis and validation in a principled manner, throughout the lifecycle of the machine learning process. This paper
reviews the data infrastructure we built at Google to address these challenges in the context of large-scale production machine
learning pipelines.
Index Terms: K-Nearest Neighbour, Logistic Regression, Naïve Bayes, Decision Tree, Support Vector Machine
I. INTRODUCTION
At a high level, a machine learning pipeline uses training data to generate a model, which can then be used to perform inference
with serving data (see Figure 1). Most work in academia and industry has focused on improving the efficiency and quality of
training and inference. However, we contend that it is equally important to take a data point-of-view and examine critically the
management of the (training and serving) data at the end points: simply put, the quality and speed of training and inference are
immaterial if the input data is wrong.
Figure 1: Machine learning pipeline from 10,000 feet (training data → Train → Model → Serve → serving data)
There are several challenges to address in managing training and serving data. Data is typically large, it may arrive continuously,
and in the latter case it may also arrive in (incomplete) chunks. Moreover, data can contain errors that need to be caught early,
before they propagate downstream and taint models. In addition, data typically comes with few semantics attached to it, which
makes error detection a hard problem.
In this paper we describe the infrastructure that we built at Google to analyze and validate the (training and serving) data that drive
production machine learning pipelines.
Our system allows machine learning users to continuously check for errors in each instance of training or serving data, detect drifts
between instances, and test how data errors can affect the correctness of models. At the core is a data schema that describes the
user’s expectations for correct data.
This concept is borrowed from database management systems, but its maintenance and application acquire new flavors in the
context of machine learning.
Our system has been deployed at Google as part of the TFX platform and currently analyzes and validates more than one petabyte
of data per day. Our system has caught several data errors in production pipelines with two tangible benefits for machine learning
users: savings in engineering hours to detect, debug, and fix the errors, and model-quality wins from using better data. While data
schema and validation were introduced in the TFX paper, we provide more details on skew detection and introduce model unit
testing.
A. Data-Driven Schema
A schema represents a logical model of data for machine learning that contains constraints and captures semantics that are necessary
for machine learning data validation and model testing.
Examples:
    0: {
      features: {
        feature: {
          name: 'event'
          string_list: { CLICK }
        }
        ...
      }
    }
    1: {
      features: {
        feature: {
          name: 'event'
          string_list: { CONVERSION }
        }
        ...
      }
    }
    ...

Schema:
    feature {
      name: 'event'
      presence: { min_fraction: 1 }
      value_count: { min: 1 max: 1 }
      type: BYTES
      string_domain {
        value: 'CLICK'
        value: 'CONVERSION'
      }
    }
    ...

Figure 2: A data-driven schema
Figure 2 shows an example schema (represented in protocol buffer format). The data in this case is expected to contain an event
feature which appears in 100% of the examples (the presence specification), takes exactly one value (the value count specification),
and whose value can be either 'CLICK' or 'CONVERSION' (the domain specification). Although not shown in the figure, the schema can
also encode domains for numeric features and contain additional metadata, including semantic information (e.g., string values that
represent boolean True or False), whether a feature's integer values are to be treated as identifiers, or constraints on the
distribution of the data, to name a few.
The schema is designed to be both human- and machine-readable/writable: human-readable/writable because the machine learning
user is ultimately the owner of the schema and is expected to curate it over time; machine-readable/writable because the schema is
used to perform data validation and model testing.
A schema can become large (many machine learning pipelines have 1000s of features) and so our system takes two steps to help the
machine learning user curate the schema.
First, our system can generate an initial version of the schema based on analysis of existing data. This initial version attempts to
capture salient properties of the data without overfitting to the particular data instance. The latter is important, since an overfitted
schema will lead to false-positives in data validation and hence noisy alerts for the machine learning user. Second, our system can
recommend changes to the schema as new data is examined.
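As an illustration, the sketch below uses the open-source TensorFlow Data Validation (TFDV) library, which exposes a workflow of this shape outside of Google; it is not necessarily the exact interface of the system described here, the file names are hypothetical, and the API may differ between versions.

    import tensorflow_data_validation as tfdv

    # Compute summary statistics over an existing instance of the training data
    # ('training_data.csv' is a hypothetical file of training examples).
    train_stats = tfdv.generate_statistics_from_csv(data_location='training_data.csv')

    # Infer an initial schema from the statistics; the machine learning user then
    # reviews and curates this schema rather than writing it from scratch.
    schema = tfdv.infer_schema(statistics=train_stats)

    # As new data arrives, compare it against the curated schema and surface any
    # anomalies, which may also suggest schema updates.
    new_stats = tfdv.generate_statistics_from_csv(data_location='training_data_day2.csv')
    anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)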
B. Data Validation And Skew Detection
Our system validates an instance of the data by comparing it with the schema (Figure 3). Any discrepancy results in an alert to the
machine learning user.
An important point is that these alerts must be interpretable and actionable. For example, saying that the label does not satisfy the
schema is less actionable than saying that the label is missing in 10% of the examples. Our system can also suggest changes to the
schema for errors that reflect a natural evolution of the data (e.g., the appearance of new domain values). The goal is to help the
machine learning user curate the schema as the data evolves over time.
Our system also detects skew between instances (e.g., drift over time, or between training and serving data) by comparing them to
each other. Of particular importance is training-serving skew, which can occur when different code paths are followed in the
generation of training and serving data.
This type of skew is common in production pipelines and has adverse effects on the quality of inferences. Our system detects skew
both at the level of individual examples (if they have identifiers) and in aggregate (based on statistical goodness-of-fit metrics).
Again, our system focuses on errors that are easy to localize and thus debug.
For instance, we consider “the most frequent value changed” to be a more informative indication of drift than “the KL-divergence
is too high”. This intuition has led us to choose more interpretable, but perhaps less general, statistical measures of drift. For
example, we use bounds on the L-∞ distance rather than generic measures such as the chi-squared test or KL divergence. (The
chi-squared test also has the problem of giving many false positives for large data where a small amount of drift is expected.)
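Continuing the hypothetical TFDV sketch above, drift and skew checks of this kind can be configured with per-feature comparators and L-∞ thresholds; the feature name, threshold values, and statistics variables below are illustrative assumptions, not the exact configuration of our production system.

    import tensorflow_data_validation as tfdv

    # day1_stats, day2_stats, train_stats, and serving_stats are statistics
    # computed as in the earlier sketch (tfdv.generate_statistics_from_csv).

    # Drift between two consecutive instances of the training data: flag the
    # 'event' feature if the L-infinity distance between its value distributions
    # exceeds the (illustrative) threshold.
    tfdv.get_feature(schema, 'event').drift_comparator.infinity_norm.threshold = 0.01
    drift_anomalies = tfdv.validate_statistics(
        statistics=day2_stats, previous_statistics=day1_stats, schema=schema)

    # Training-serving skew: compare training statistics against serving statistics.
    tfdv.get_feature(schema, 'event').skew_comparator.infinity_norm.threshold = 0.01
    skew_anomalies = tfdv.validate_statistics(
        statistics=train_stats, schema=schema, serving_statistics=serving_stats)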
C. Model Unit Testing
Example:
    feature {
      key: 'event'
      value { bytes_list { 'IMPRESSIONS' } }
    }

Schema:
    string_domain {
      name: 'event'
      value: 'CLICKS'
      value: 'CONVERSIONS'
    + value: 'IMPRESSIONS'
    }

Data Validation anomaly: Examples contain values missing from the domain. Fix: Add value to domain (the '+' line above marks the suggested addition).

Figure 3: Schema-driven validation

In addition to validating data, the schema can also be used to generate synthetic data for model unit testing. Data fuzz
testing is a common practice for evaluating services on a variety of inputs, and it has recently been applied to machine learning
model testing as well. However, purely randomly generated data might lack required features or otherwise cause training code to
crash in a way that would not occur with real data. In contrast, the schema enables us to generate random data in a principled
fashion, such that any crashes indicate a problem that needs to be addressed.
For instance, our unit test framework uses the schema in Figure 2 and ensures that each randomly generated data example has the
event feature, with exactly one value chosen uniformly at random to be either 'CLICK' or 'CONVERSION'. Similarly, integral features
would be random integers drawn from the range specified in the schema, to name another case. The framework also includes specific
generators for image examples and is easily extensible by users, e.g., to include additional generation constraints not expressible
in the schema. It also includes the option to use a specific saved snapshot of data, but this is usually less robust than generating
data from the schema, and few teams use it.
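A minimal, hypothetical sketch of this idea is shown below: it hand-rolls a generator for the simple schema of Figure 2 (our actual framework is driven by the schema protocol buffer and is far more general), so the dictionary layout and helper names are illustrative only.

    import random

    # Hypothetical, simplified schema for the 'event' feature of Figure 2:
    # required in every example, exactly one value, from a fixed string domain.
    SCHEMA = {
        'event': {'type': 'BYTES', 'min_fraction': 1.0, 'value_count': (1, 1),
                  'domain': ['CLICK', 'CONVERSION']},
    }

    def generate_example(schema):
        """Generate one synthetic example that satisfies the schema."""
        example = {}
        for name, spec in schema.items():
            # Respect presence: a required feature (min_fraction == 1) always appears.
            if spec['min_fraction'] < 1.0 and random.random() > spec['min_fraction']:
                continue
            lo, hi = spec['value_count']
            count = random.randint(lo, hi)
            if spec['type'] == 'BYTES':
                example[name] = [random.choice(spec['domain']) for _ in range(count)]
            elif spec['type'] == 'INT':
                dlo, dhi = spec['domain']
                example[name] = [random.randint(dlo, dhi) for _ in range(count)]
        return example

    # A batch of synthetic examples to feed into the model under test.
    synthetic_batch = [generate_example(SCHEMA) for _ in range(32)]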
The generated data is then used to train and evaluate a machine learning model for a small number of steps. The goal is not to test
the model’s ability to learn, but to test the code’s ability to run, process data, and call machine learning APIs. Model unit testing is
one part of testing an end-to-end machine learning system.
Using this type of testing, our system can uncover discrepancies between the schema (and hence, the machine learning user’s
expectation of the data) and the assumptions made in modeling code. For instance, suppose that the modeling code applies a
logarithm transformation on an integer feature, but the schema does not specify the constraint that the feature is positive. During
testing, the modeling code will be exercised with synthetic examples where the feature has non-positive values, thus leading to an
error. This error can direct the machine learning user to update either the modeling code or the schema, so as to align data
expectations between validation and training.
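As a hypothetical illustration of such a discrepancy, the sketch below shows modeling code that assumes a strictly positive feature; the feature name and the suggested schema constraint are assumptions made for the example, not part of the system described above.

    import math

    # Modeling code assumes the 'num_clicks' feature (hypothetical name) is strictly
    # positive, but the schema only constrains it to be an integer.
    def transform(example):
        return math.log(example['num_clicks'][0])

    # A schema-driven generator is free to produce zero (or negative) integers, so the
    # model unit test exercises exactly the case the modeling code did not expect:
    try:
        transform({'num_clicks': [0]})
    except ValueError as err:
        print('model unit test failed:', err)   # math domain error

    # Fix: guard the transform (e.g., use log1p) or tighten the schema, e.g. by adding
    # a positivity constraint such as int_domain { min: 1 } to 'num_clicks', so that
    # generated values satisfy the modeling code's assumption.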
REFERENCES
[1] Protocol Buffers. https://developers.google.com/protocol-buffers/, 2017.
[2] Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. In Proceedings of IEEE Big Data 2017.