

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque,
Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew,
Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy,
Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, Martin Zinkevich
Google Inc.∗
∗Corresponding authors: Heng-Tze Cheng, Clemens Mewald, Neoklis Polyzotis, and Steven Euijong Whang: {hengtze,clemensm,npolyzotis,swhang}@google.com

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). KDD'17, August 13–17, 2017, Halifax, NS, Canada. © 2017 Copyright held by the owner/author(s). 978-1-4503-4887-4/17/08. DOI: http://dx.doi.org/10.1145/3097983.3098021

ABSTRACT
Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components—a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt.

We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions.

We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.

KEYWORDS
large-scale machine learning; end-to-end platform; continuous training

1 INTRODUCTION
It is hard to overemphasize the importance of machine learning in modern computing. More and more organizations adopt machine learning as a tool to gain knowledge from data across a broad spectrum of use cases and products, ranging from recommender systems [6, 7], to clickthrough rate prediction for advertising [13, 15], and even the protection of endangered species [5].

The conceptual workflow of applying machine learning to a specific use case is simple: at the training phase, a learner takes a dataset as input and emits a learned model; at the inference phase, the model takes features as input and emits predictions. However, the actual workflow becomes more complex when machine learning needs to be deployed in production. In this case, additional components are required that, together with the learner and model, comprise a machine learning platform. The components provide automation to deal with a diverse range of failures that can happen in production and to ensure that model training and serving happen reliably. Building this type of automation is non-trivial, and it becomes even more challenging when we consider the following complications:
• Building one machine learning platform for many different learning tasks: Products can have substantially different needs in terms of data representation, storage infrastructure, and machine learning tasks. The machine learning platform must be generic enough to handle the most common set of learning tasks as well as be extensible to support one-off atypical use-cases.
• Continuous training and serving: The platform has to support the case of training a single model over fixed data, but also the case of generating and serving up-to-date models through continuous training over evolving data (e.g., a moving window over the latest n days of a log stream).
• Human-in-the-loop: The machine learning platform needs to expose simple user interfaces to make it easy for engineers to deploy and monitor the platform with minimal configuration. Furthermore, it also needs to help users with various levels of machine-learning expertise understand and analyze their data and models.
• Production-level reliability and scalability: The platform needs to be resilient to disruptions from inconsistent data, software, user configurations, and failures in the underlying execution environment. In addition, the platform must scale gracefully to the high data volume that is common in training, and also to increases in the production traffic to the serving system.


Having this type of platform enables teams to easily deploy machine learning in production for a wide range of products, ensures best practices for different components of the platform, and limits the technical debt arising from one-off implementations that cannot be reused in different contexts.

This paper presents the anatomy of end-to-end machine learning platforms and introduces TensorFlow Extended (TFX), one implementation of such a platform that we built at Google to address the aforementioned challenges. We describe the key platform components and the salient points behind their design and functionality. We also present a case study of deploying the platform in Google Play, a commercial mobile app store with over one billion active users and over one million apps, and discuss the lessons that we learned in this process. These lessons reflect best practices for machine-learning platforms in a diverse set of contexts and are thus of general interest to researchers and practitioners in the field.

2 PLATFORM OVERVIEW

2.1 Background and Related Work
Prior art has addressed a subset of the challenges in deploying machine learning in production. Related work has reported that the learning algorithm is only one component of a machine learning platform, and one that represents a small fraction of the code [19, 20]. Data and model parallelism require distributed systems and orchestration that exceed the capabilities of many single-machine solutions [12, 16]. Beyond simply stitching together components, a machine learning pipeline also needs to be simple to set up [16], and may even support automated pipeline construction [20]. Once a team can train multiple models, it needs to keep track of their experiment history in a centralized database [21]. Ideally, the platform automatically surveys different machine learning techniques and suggests the best solution, allowing even non-experts access to machine learning [10]. However, putting together several disjoint components to do the job can result in significant technical debt in the form of hard-to-maintain glue code, hidden dependencies, feedback loops, etc. [19].

2.2 Platform Design and Anatomy
In this paper we expand on existing literature and address the challenges outlined in the introduction by presenting a reusable machine learning platform developed at Google. Our design adopts the following principles:

One machine learning platform for many learning tasks. We chose to use TensorFlow [4] as the trainer, but the platform design is not limited to this specific library. One factor in choosing (or dismissing) a machine learning platform is its coverage of existing algorithms [12]. TensorFlow provides full flexibility for implementing any type of model architecture. To name just a few, we have seen implementations of linear, deep, linear and deep combined, tree-based, sequential, multi-tower, multi-head, etc. architectures. This allows users of TFX to switch out the learning algorithm without migrating the entire pipeline to a different stack. However, it also imposes requirements on all other components. Data analysis, validation, and visualization tools need to support sparse, dense, or sequence data. Model validation, evaluation, and serving tools need to support all kinds of inference types, including (among others) regression, classification, and sequences.

Continuous training. Most machine learning pipelines are set up as workflows or dependency graphs (e.g., [14, 20]) that execute specific operations or jobs in a defined sequence. If a team needs to train over new data, the same workflow or graph is executed again. However, many real-world use-cases require continuous training. TFX supports several continuation strategies that result from the interaction between data visitation and warm-starting options. Data visitation can be configured to be static or dynamic (over a rolling range of directories; see the sketch after these principles). Warm-starting initializes a subset of model parameters from a previous state.

Easy-to-use configuration and tools. Providing a unified configuration framework is only possible if components also share utilities that allow them to communicate and share assets. A TFX user is only exposed to one common configuration that is passed to all components and shared where necessary. Utilities that are used by all components enable enforcement of global garbage collection policies, unified debugging and status signals, etc.

Production-level reliability and scalability. Only a small fraction of a machine learning platform is the actual code implementing the training algorithm [19]. If the platform handles and encapsulates the complexity of machine learning deployment, engineers and scientists have more time to focus on the modeling tasks. Since it is difficult to predict whether a learning algorithm will behave reasonably on new data [8], model validation is critical. In turn, model validation must be coupled with data validation in order to detect corrupted training data and thus prevent bad (yet validated) models from reaching production. To give an example, training data that accidentally includes the label will lead to a good quality model that passes validation, but would not perform well in production where the label is not available. Validating the serving infrastructure before pushing to the production environment is vital to the reliability and robustness of any machine learning platform. Our platform provides implementations of these components that encode best practices observed in many production pipelines. Moreover, our experience shows that the distributed data processing model offered by Apache Beam [1] (and similar internal infrastructure) is a good fit for handling the large volume of data during training, model evaluation, and batch inference.
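To make the static versus dynamic data-visitation options concrete, the following is a minimal sketch; it is not TFX's configuration API, and the /logs/YYYY-MM-DD/ directory layout is a hypothetical example.

from datetime import date, timedelta

def visit_days(end_day, window_days):
    # Daily partitions for one training run: the window_days days ending at end_day.
    return [end_day - timedelta(days=i) for i in range(window_days)]

# Static visitation: every run reads the same frozen range of directories.
static_dirs = ["/logs/%s/" % d.isoformat() for d in visit_days(date(2017, 1, 31), 14)]

# Dynamic visitation: each run reads a rolling window over the latest 14 days
# of a (hypothetical) log stream laid out as /logs/YYYY-MM-DD/.
rolling_dirs = ["/logs/%s/" % d.isoformat() for d in visit_days(date.today(), 14)]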
Figure 1 shows a high-level component overview of a machine learning platform and highlights the components discussed in the following sections: data analysis (Section 3.1), data transformation (Section 3.2), data validation (Section 3.3), trainer (Section 4), model evaluation and validation (Section 5), and serving (Section 6).


[Figure 1 depicts the TFX stack: an integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization; a shared configuration framework and job orchestration layer; a tuner; the pipeline components Data Ingestion, Data Analysis, Data Transformation, Data Validation, Trainer, Model Evaluation and Validation, Serving, and Logging, with a box marking the focus of this paper; shared utilities for garbage collection and data access controls; and pipeline storage.]
Figure 1: High-level component overview of a machine learning platform.

In isolation, these components implement high-level functionality that is typical in machine-learning platforms, e.g., data sampling, feature generation, training, and evaluation [12, 14, 16]. However, it is worth pointing out two differentiations. First, we built these components to adhere to the aforementioned principles, which introduced several technical difficulties. Second, the integration of these components in a single platform, with shared configuration and utilities, enabled key improvements over existing alternatives. To give an example, transformations applied in the trainer and at serving time may need statistics generated by the data analysis component. Integrating these components ensures consistency across the pipeline and guarantees that the same transformations are applied at training and serving, which in turn prevents one form of training-serving skew (a common production headache in machine learning). Although almost all of the components can be considered optional, TFX users find it beneficial to adopt the full stack in order to achieve more robust and reliable production systems and take advantage of all management and visualization tools in the integrated frontend.

Throughout the paper, we refer to the engineers and ML practitioners using our platform as “users”. In the case study of Google Play (Section 7), we refer to people who visit the Google Play store as “Play users”.

3 DATA ANALYSIS, TRANSFORMATION, AND VALIDATION
Machine learning models are only as good as their training data, so understanding the data and finding any anomalies early is critical for preventing data errors downstream, which are more subtle and harder to debug. Often the data is generated by ad hoc pipelines involving multiple products, systems, and usage logs. Faults (e.g., code bugs, system failures, or human errors, to name a few) can occur at multiple points of this generation process, which makes anomalies in the data not an exception, but more the norm. As a machine learning platform scales to larger data and runs continuously, there is a strong need for a reusable component that enables rigorous checks for data quality and promotes best practices for data management in the context of machine learning platforms [11]. Small bugs in the data can significantly degrade model quality over a period of time in a way that is hard to detect and diagnose (unlike catastrophic bugs that cause spectacular failures and are thus easy to track down), so constant data vigilance should be a part of any long-running development of a machine learning platform.

Building such a component is challenging for several reasons. First, the component needs to support a wide range of data-analysis and validation cases that correspond to machine learning applications. The component must also be easy to deploy for a basic set of useful checks without requiring excessive customization, with additional checks being possible at the cost of some setup by the user. Moreover, the component should help the users monitor and react to data quality problems in a way that is non-intrusive (users do not receive “spammy” data-anomaly alerts) and actionable (users understand how to debug a particular data anomaly). Our experience has shown that users tend to switch off data-quality checks if they receive a large number of false-positive alerts or if the alerts are hard to understand.

The following subsections describe the implementation of this component in TFX and how it addresses the aforementioned challenges. The component treats data analysis, transformation, and validation as separate yet closely related processes, with complementary roles.

3.1 Data Analysis
For data analysis, the component processes each dataset fed to the system and generates a set of descriptive statistics on the included features. These statistics cover the presence of each feature in the data, e.g., the distribution of the number of values per example or the number of examples with and without the feature. The component also gathers statistics over feature values: for continuous features, the statistics include quantiles, equi-width histograms, the mean and standard deviation, to name a few, whereas for discrete features they include the top-K values by frequency.


The component also supports statistics on configurable slices of the data (e.g., on negative and positive examples in a binary classification problem) and cross-feature statistics (e.g., correlation and covariance between features). By looking at these feature statistics, users can gain insights into the shape of each dataset. Note that it is possible to extend the component with further statistics, but we found that this subset provides good coverage for the needs of our users.

In a continuous training and serving environment, the above statistics must be computed efficiently at scale. Unavailable feature statistics may result in missed opportunities to correct data anomalies, while outdated feature-to-integer mappings and feature value distributions may result in a drop in model quality. On large training data, some of these statistics become difficult to compute exactly, and the component resorts to distributed streaming algorithms that give approximate results [9, 17].
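As a simplified, single-machine illustration of the per-feature statistics described above (not the distributed implementation used in TFX), one might compute them as follows; the dictionary-of-lists example format is an assumption made for the sketch.

import numpy as np
from collections import Counter

def describe(examples, feature):
    # examples: a list of dicts mapping feature names to lists of values.
    present = [ex[feature] for ex in examples if feature in ex]
    stats = {
        "num_examples": len(examples),
        "num_missing": len(examples) - len(present),
        "values_per_example": Counter(len(v) for v in present),
    }
    flat = [v for values in present for v in values]
    if flat and isinstance(flat[0], (int, float)):
        stats["mean"] = float(np.mean(flat))
        stats["stddev"] = float(np.std(flat))
        stats["quantiles"] = np.percentile(flat, [25, 50, 75]).tolist()
    else:
        stats["top_k"] = Counter(flat).most_common(10)  # top-K values by frequency
    return stats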
3.2 Data Transformation
Our component implements a suite of data transformations to allow feature wrangling for model training and serving. For instance, this suite includes the generation of feature-to-integer mappings, also known as vocabularies. In most machine learning platforms that deal with sparse categorical features, both training and serving require mappings from string values of a sparse feature to integer IDs. The integer IDs allow operations like looking up model weights or embeddings given a specific value. Representing features in ID space often saves memory and computation time as well. Since there can be a large number (∼1–100B) of unique values per sparse feature, it is a common practice to assign unique IDs only to the most “relevant” values. The less relevant values are either dropped (i.e., no IDs assigned) or are assigned IDs from a fixed set of IDs. There are different ways to define relevance, including the common approach of using the frequency of appearance in the data.

A crucial issue is ensuring consistency of the transformation logic during training and serving. Any discrepancies imply that the model receives prediction requests with differently transformed input features, which typically hurts model quality. TFX exports any data transformations as part of the trained model, which in turn avoids problems with inconsistency.
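A minimal sketch of vocabulary generation in this spirit: the top-K most frequent values of a sparse feature get dense integer IDs and the remaining values share a fixed set of out-of-vocabulary buckets. The cut-off, bucket count, and hashing scheme here are illustrative only.

from collections import Counter

def build_vocab(values, top_k=100000):
    # Map the top_k most frequent string values to dense integer IDs.
    counts = Counter(values)
    return {v: i for i, (v, _) in enumerate(counts.most_common(top_k))}

def to_id(value, vocab, num_oov_buckets=10):
    # Look up a value; unseen or dropped values share a fixed set of OOV IDs.
    if value in vocab:
        return vocab[value]
    return len(vocab) + (hash(value) % num_oov_buckets)

vocab = build_vocab(["US", "CA", "US", "BR", "US", "CA"], top_k=2)
assert to_id("US", vocab) == 0           # most frequent value
assert to_id("BR", vocab) >= len(vocab)  # dropped from the vocabulary, bucketed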
3.3 Data Validation
After completing the analysis of the data, the component deals with the task of validation: is the data healthy or are there anomalies that need to be flagged to the user?

To perform validation, the component relies on a schema that provides a versioned, succinct description of the expected properties of the data. The following are examples of the properties that can be encoded in the schema:
• Features present in the data.
• The expected type of each feature.
• The expected presence of each feature, in terms of a minimum count and fraction of examples that must contain the feature.
• The expected valency of the feature in each example, i.e., minimum and maximum number of values.
• The expected domain of a feature, i.e., the small universe of values for a string feature, or range for an integer feature.

[Figure 2 shows an example record (category: “EDUCATION”, num_impressions: “NULL”), a schema fragment declaring the ‘category’ feature (type STRING, presence REQUIRED, domain {GAMES, BUSINESS}) and the ‘num_impressions’ feature (type INT), and the two suggested fixes: add the value “EDUCATION” to the domain, and deprecate the ‘num_impressions’ feature.]

Figure 2: Sample validation of an example against a simple schema for an app store application. The schema indicates that the expected type for the ‘category’ feature is STRING and that for the ‘num_impressions’ feature is INT. Furthermore, the category feature must be present in all examples and assume values from the specified domain. On validating the example against this schema, the module detects two anomalies with simple explanations as well as suggested schema modifications. The first suggestion reflects a schema change to account for an evolution of the data (the appearance of a new value). In contrast, the second suggestion reflects the fact that there is an underlying data problem that needs to be fixed, so the feature should be marked as problematic while the problem is being investigated.

Using the schema, the component can validate the properties of specific (training and serving) datasets, flag any deviations from the schema as potential anomalies, and in most cases, provide actionable suggestions to fix the anomaly. These actions may include recommending the user to block training on particular features by marking them as “deprecated”, or, for expected deviations in the data, updating the schema itself to match the data. We assume that teams are responsible for maintaining the schema and updating it to newer versions as needed. We also provide tooling to help generate the first version automatically by analyzing a sample of the data, as well as to suggest concrete fixes to the schema as data evolves. An example of a simple schema for an app store application, the anomalies detected using this schema, and the actionable suggestions are shown in Figure 2.
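The following sketch conveys the flavor of schema-driven validation on the Figure 2 example; the dictionary-based schema encoding is hypothetical and far simpler than the versioned schema TFX actually uses.

schema = {
    "category": {"type": str, "presence": "REQUIRED",
                 "domain": {"GAMES", "BUSINESS"}},
    "num_impressions": {"type": int},
}

def validate(example, schema):
    # Return (anomaly description, suggested fix) pairs for one example.
    anomalies = []
    for name, spec in schema.items():
        if spec.get("presence") == "REQUIRED" and name not in example:
            anomalies.append(("missing required feature %s" % name,
                              "add the feature or relax its presence"))
        elif name in example and not isinstance(example[name], spec["type"]):
            anomalies.append(("%s has the wrong type" % name,
                              "fix the data or deprecate the feature"))
        elif "domain" in spec and example.get(name) not in spec["domain"]:
            anomalies.append(("%s outside expected domain" % name,
                              "add the value to the domain if it is legitimate"))
    return anomalies

print(validate({"category": "EDUCATION", "num_impressions": "NULL"}, schema))
# Reports the new 'category' value and the non-INT 'num_impressions' value,
# mirroring the two suggested fixes in Figure 2.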


While the above list of properties captures a large class of data errors that occur in practice, the schema can also encode more elaborate properties, e.g., constrain the distribution of the values in a specific feature, or describe a specific interpretation of a feature’s values (say, a string feature may have the values “TRUE” and “FALSE”, which are interpreted as booleans according to the schema). However, such complex properties are often hard to specify since they involve thresholds that are hard to tune, especially since some churn in the characteristics of the data is expected in any real-life deployment.

Based on our engagements with users, both in deciding the properties that the schema should permit and in designing the end-to-end data validation component, we relied on the following key design principles:
• The user should understand at a glance which anomalies are detected and their coverage over the data.
• Each anomaly should have a simple description that helps the user understand how to debug and fix the data. One such example is an anomaly that says a feature’s value is out of a certain range. As an antithetical example, it is much harder to understand and debug an anomaly that says the KL divergence between the expected and actual distributions has exceeded some threshold.
• In some cases the anomalies correspond to a natural evolution of the data, and the appropriate action is to change the schema (rather than fix the data). To accommodate this option, our component generates for each anomaly a corresponding schema change that can bring the schema up-to-date (essentially, make the anomaly part of the normal state of the data).
• We want the user to treat data errors with the same rigor and care that they deal with bugs in code. To promote this practice, we allow anomalies to be filed just like any software bug where they are documented, tracked, and eventually resolved.
These principles have affected both the logic to detect anomalies and the presentation of anomalies in the UI component of TFX.

Beyond detecting anomalies in the data, users can also look at the schema (and its versions) in order to understand the evolution of the data fed to the machine learning platform. The schema also serves as a stable description of the data that can drive other platform components, e.g., automatic feature-engineering or data-analysis tools.

4 MODEL TRAINING
One of the core design philosophies of TFX is to streamline (and automate as much as possible) the process of training production quality models which can support all training use cases. We chose to design the model trainer such that it supports training any model configured using TensorFlow [4], including implementations necessary for continuous training. It takes minimal, one-time effort to integrate modeling code written in TensorFlow with the trainer. Once done, users can seamlessly switch from one learning algorithm to another without any efforts to re-integrate with the trainer.

While continuously training and exporting machine learning models is a common production use case, often such models need to train on huge datasets to generate good quality models. This practice is becoming infeasible for many teams because it is both time and resource intensive to retrain these models. At Google, we have, on several occasions, leveraged warm-starting to attain high quality models without spending too many resources.

4.1 Warm-Starting
For many production use cases, freshness of machine learning models is critical (e.g., Google Play store, where thousands of new apps are added to the store every day). A lot of such use cases also have huge training datasets (O(100B) data points) which may take hours (or days in some cases) of training to attain models meeting desired quality thresholds. This results in a trade-off between model quality and model freshness. Warm-starting is a practical technique to offset this trade-off and, when used correctly, can result in models of the same quality as one would obtain after training for many hours, in much less time and with fewer resources.

Warm-starting is inspired by transfer learning [18]. One of the approaches to transfer learning is to first train a base network on some base dataset, then use the ‘general’ parameters from the base network to initialize the target network, and finally train the target network on the target dataset [23]. The effectiveness of this technique depends on the generality of the features whose corresponding parameters are transferred. The same approach can be applied in the context of continuous training. In this approach, we identify a few general features of the network being trained (e.g., embeddings of sparse features). When training a new version of the network, we initialize (or warm-start) the parameters corresponding to these features from the previously trained version of the network and fine tune them with the rest of the network. Since we are transferring parameters between different versions of the same network, this technique results in much quicker convergence of the new version, thus resulting in the same quality models using fewer resources.

In the past, many teams wrote and maintained custom binaries to warm-start new models. This incurred a lot of duplicated effort that went into writing and maintaining similar code. While building TFX, the ability to selectively warm-start selected features of the network was identified as a crucial component and its implementation in TensorFlow was subsequently open sourced.
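The idea can be sketched in a framework-agnostic way; TFX's actual implementation operates on TensorFlow checkpoints, whereas the parameter dictionaries and names below are only illustrative.

import numpy as np

def warm_start(new_params, previous_params, general_features):
    # Initialize selected parameters of a new model version from the previous one.
    # Only the 'general' parameters (e.g. embeddings of sparse features) are copied;
    # everything else keeps its fresh initialization, and all parameters are
    # fine-tuned by subsequent training.
    for name in general_features:
        if name in previous_params and previous_params[name].shape == new_params[name].shape:
            new_params[name] = previous_params[name].copy()
    return new_params

previous = {"embedding/app_id": np.random.randn(1000, 64),
            "dense/output": np.random.randn(64, 1)}
fresh = {"embedding/app_id": np.zeros((1000, 64)),
         "dense/output": np.zeros((64, 1))}
params = warm_start(fresh, previous, general_features=["embedding/app_id"])
# 'embedding/app_id' is warm-started; 'dense/output' is trained from scratch.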
Together with features like warm-starting, TensorFlow provides a high-level unified API to configure model training using various learning techniques (e.g., deep learning, wide and deep learning, sequence learning).

4.2 High-Level Model Specification API
We decided to use an established high-level TensorFlow model specification API [22]. Our experience points to large productivity gains via a higher-level abstraction layer that hides implementation details and encodes best practices.


One of the useful abstractions we leveraged is FeatureColumns. FeatureColumns help users focus on which features to use in their machine learning model. They are a declarative way of defining the input layer of a model. Another component of the abstraction layer we relied on is the concept of an Estimator. For a given model, an Estimator handles training and evaluation. It can be used on a local machine as well as for distributed training. The following is a simplified example of what training a feed-forward dense neural network looks like in this framework:

# Declare a numeric feature:
num_rooms = numeric_column('number-of-rooms')
# Declare a categorical feature:
country = categorical_column_with_vocabulary_list(
    'country', ['US', 'CA'])
# Declare a categorical feature and use hashing:
zip_code = categorical_column_with_hash_bucket(
    'zip_code', hash_bucket_size=1000)
# Define the model and declare the inputs
estimator = DNNRegressor(
    hidden_units=[256, 128, 64],
    feature_columns=[
        num_rooms, country,
        embedding_column(zip_code, 8)],
    activation_fn=relu,
    dropout=0.1)
# Prepare the training data
def my_training_data():
    # Read, parse training data and convert it into
    # tensors. Returns a mini-batch of data every
    # time returned tensors are fetched.
    return features, labels
# Prepare the validation data
def my_eval_data():
    # Read, parse validation data and convert it into
    # tensors. Returns a mini-batch of data every
    # time returned tensors are fetched.
    return features, labels
estimator.train(input_fn=my_training_data)
estimator.evaluate(input_fn=my_eval_data)

From our experience, users find it easier to first train a simple model in the available setting (e.g., single machine or distributed system) before experimenting with various optimization settings [24, Rule #4]. Once a baseline is established, users can experiment with these settings. A tuner integrated with the trainer can also automatically optimize the hyperparameters based on users’ objectives and data.

5 MODEL EVALUATION AND VALIDATION
Machine-learned models are often parts of complex systems comprising a large number of data sources and interacting components, which are commonly entangled together [19]. This creates large surfaces on which bugs can grow and unexpected interactions can develop, potentially to the detriment of end-user experiences via the degradation of the machine-learned model. Some examples of such bugs include: different components expecting different serialized model formats, or bugs in training or serving code causing binary crashes.

These issues can be difficult for humans to detect, especially in a continuous training setting where new models are refreshed and pushed to production frequently. Having a reusable component that automatically evaluates and validates models to ensure that they are “good” before serving them to users can help prevent unexpected degradations in the user experience.

5.1 Defining a “good” model
The model evaluation and validation component of TFX is designed for this purpose. The key question that the component helps answer is: is this specific model a “good” model? We suggest two pieces: that a model is safe to serve, and that it has the desired prediction quality.

By safe to serve, we mean obvious requirements such as: the model should not crash or cause errors in the serving system when being loaded, or when sent bad or unexpected inputs, and the model shouldn’t use too many resources (such as CPU or RAM). One specific problem we have encountered is when the model is trained using a newer version of a machine learning library than is used at serving time, resulting in a model representation that cannot be used by the serving system. Product teams care about user satisfaction and product health, which are better captured by measures of prediction quality (such as app install rate) on live traffic than by the objective function on the training data.

5.2 Evaluation: human-facing metrics of model quality
Evaluation is used as part of the interactive process where teams try to iteratively improve their models. Since it is costly and time-consuming to run A/B experiments on live traffic, models are evaluated offline on held-out data to determine if they are promising enough to start an online A/B experiment. The evaluation component provides proxy metrics such as AUC or cost-weighted error that approximate business metrics more closely than training loss, but are computable offline. Once teams are satisfied with their models’ offline performance, they can conduct product-specific A/B experiments to determine how their models actually perform on live traffic on relevant business metrics.
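For example, an offline AUC and one simple form of cost-weighted error can be computed on held-out data as sketched below; scikit-learn is used purely for illustration, the labels and scores are placeholders, and the particular cost weighting is an assumption rather than TFX's definition.

import numpy as np
from sklearn.metrics import roc_auc_score

# Held-out labels and model scores (placeholders for real evaluation data).
labels = np.array([1, 0, 0, 1, 1, 0])
scores = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.3])

auc = roc_auc_score(labels, scores)

# A cost-weighted error weights mistakes by their business impact,
# here penalizing false negatives more heavily than false positives.
cost_fn, cost_fp = 5.0, 1.0
predictions = (scores >= 0.5).astype(int)
cost = np.mean(cost_fn * ((labels == 1) & (predictions == 0)) +
               cost_fp * ((labels == 0) & (predictions == 1)))
print(auc, cost)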
5.3 Validation: machine-facing judgment of model goodness
Once a model is launched to production and is continuously being updated, automated validation is used to ensure that the updated models are good. We validate that a model is safe to serve with a simple canary process. We evaluate prediction quality by comparing the model quality against a fixed threshold as well as against a baseline model (e.g., the current production model). Any new model failing any of these checks is not pushed to serving, and product teams are alerted.


One challenge with validating safety is that our canary process will not catch all potential errors. Another challenge with validation in a continuously training pipeline is that it is hard to distinguish expected and unexpected variations in a model’s behaviour. When the training data is continuously changing, some variation in the model’s behaviour and its performance on business metrics is to be expected.

Hence, there is a tension between being too conservative and alerting users to small changes in these metrics, which results in users tiring of and eventually ignoring the alerts; and being too loose and failing to catch unexpected changes. From our experience working with several product teams, most bugs cause dramatic changes to model quality metrics that can be caught by using loose thresholds. However, there is a strong selection bias here, since more subtle issues may not have drawn our attention.

5.4 Slicing
One of the features the evaluation component offers is the ability to compute metrics on slices of data. We define a slice as a subset of the data containing certain features. For instance, a product team might be concerned about the performance of their model in the US, so they might wish to compute metrics on the subset of data that contains the feature “Country = US”.

This is useful in both evaluating and validating models in the case where product teams have specific slices which they are concerned about, especially small slices, since metrics on the entire dataset can fail to reflect the performance on these small slices [15]. Slicing can help product teams understand and improve performance on these slices, and also avoid serving models that sacrifice quality on these slices for better overall performance.
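A small sketch of slicing in this spirit, computing a metric both overall and on the “Country = US” slice; accuracy and the toy examples are used purely for brevity.

def accuracy(examples):
    correct = sum(1 for ex in examples if (ex["score"] >= 0.5) == (ex["label"] == 1))
    return correct / float(len(examples))

def sliced_metrics(examples, slice_feature, slice_value):
    # Compute the metric overall and restricted to one slice of the data.
    in_slice = [ex for ex in examples if ex.get(slice_feature) == slice_value]
    return {"overall": accuracy(examples),
            "%s=%s" % (slice_feature, slice_value): accuracy(in_slice)}

examples = [
    {"country": "US", "label": 1, "score": 0.4},  # mistakes hidden in a small slice
    {"country": "US", "label": 0, "score": 0.6},
    {"country": "BR", "label": 1, "score": 0.9},
    {"country": "BR", "label": 0, "score": 0.1},
    {"country": "CA", "label": 1, "score": 0.8},
]
print(sliced_metrics(examples, "country", "US"))
# Overall accuracy looks fine (3/5), but the US slice scores 0/2.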
5.5 User Attitudes towards Validation
In the process of deploying the model and validation component, we made an interesting discovery regarding user attitudes towards validation in machine learning platforms. Our general sense is that the value of validation is not immediately apparent to users; however, the costs in terms of additional configuration and greater resource consumption immediately stand out to them. As an illustration, no product teams actively requested the validation function when the component was first built, and when the feature was explained to them, few activated it. The fact that the validation feature did not directly improve their machine-learned models’ performance, and on the contrary, could result in them serving old models if the checks did not pass, also added to their hesitation.

However, encountering a real issue in production which could have been prevented by validation made the value of the validation apparent to the teams, who were then eager to activate validation. As a result of this observation, we plan to provide a configuration-free validation setup that is enabled by default for all users.
6 MODEL SERVING
Reaping the benefits of sophisticated machine-learned models is only possible when they can be served effectively. TensorFlow Serving [2] provides a complete serving solution for machine-learned models to be deployed in production environments. TensorFlow Serving’s framework is also designed to be flexible, supporting new algorithms and experiments. Scaling to varied traffic patterns is an important goal of TensorFlow Serving. By providing a complete yet customizable framework for machine learning serving, TensorFlow Serving aims to reduce the boilerplate code needed to deploy a production-grade serving system for machine-learned models.

Serving systems for production environments require low latency, high efficiency, horizontal scalability, reliability and robustness. This section elaborates on two specific system challenges: low latency and high efficiency.

6.1 Multitenancy with Isolation
Multitenancy in the context of TensorFlow Serving means enabling a single instance of the server to serve multiple machine-learned models concurrently. Serving multiple models at production scale can lead to cross-model interference, which is a challenging problem to solve. TensorFlow Serving provides soft model-isolation, so that the performance characteristics of one model have minimal impact on other models. While deploying servers that handle a high number of queries per second, we encountered interference between the request processing and model-load processing flows of the system. Specifically, we observed latency peaks during the interval when the system was loading a new model or a new version of an existing model.

To enhance isolation between these operations, we implemented a feature that allows the configuration of a separate dedicated threadpool for model-loading operations. This is built upon a feature in TensorFlow that allows any operation to be executed with a caller-specified threadpool. As a result, we were able to ensure that threads performing request processing would not contend with the long operations involved with loading a model from disk.

Empirically, we found that setting the threadpool size for model-loading operations to 1 or 2 was ideal for system performance. This configuration supports faster request processing consistently, trading off slower model-loads. Prior to defining a separate threadpool for load operations, for a specific model, we observed that the 99.9-percentile inference request latency measured during loads was in the range of ∼500 to ∼1500 msec. However, with the specification of a separate threadpool, the 99.9-percentile inference request latency during loads reduced to a range of ∼75 to ∼150 msec.
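The isolation idea can be illustrated with two ordinary thread pools, one for request processing and a deliberately small one for model loading, so that slow loads never occupy the threads answering inference requests. This is only a conceptual Python sketch, not TensorFlow Serving's C++ implementation, and the handler functions are stand-ins.

import time
from concurrent.futures import ThreadPoolExecutor

request_pool = ThreadPoolExecutor(max_workers=16)  # serves inference traffic
load_pool = ThreadPoolExecutor(max_workers=1)      # dedicated to model loading

def handle_request(request_id):
    return "response for %d" % request_id          # stand-in for running inference

def load_model(version):
    time.sleep(2)                                   # stand-in for a slow load from disk
    return "model v%d loaded" % version

# A new model version starts loading in its own pool ...
load_future = load_pool.submit(load_model, 2)
# ... while request threads keep serving without contending with the load.
responses = list(request_pool.map(handle_request, range(100)))
print(len(responses), load_future.result())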
6.2 Fast Training Data Deserialization
Unlike previous machine learning libraries at Google, each of which used custom input formats and parsing code, TensorFlow uses a common data format. This approach enables the community to share their data, models, tools, visualizations, optimizations, and other techniques.


On the other hand, the common format was a suboptimal solution for some sources of data. Choosing a common format involves tradeoffs such as size of the data, cost of parsing, and the need to write format-conversion code. We decided on the tensorflow.Example [3] protocol buffer format (a cross-language, serializable data structure).

Non-neural-network (e.g., linear) models are often more data intensive than CPU intensive. For such models, data input, output, and preprocessing tend to be the bottleneck. Using a generic protocol buffer parser proved to be inefficient. To resolve this, a specialized protocol buffer parser was built based on profiles of various real data distributions in multiple parsing configurations. Lazy parsing was employed, including skipping complete parts of the input protocol buffer that the configuration specified as unnecessary. In addition, to ensure the protocol buffer parsing optimizations were consistently useful, a benchmarking suite was built. This suite was useful in ensuring that optimizing for one type of data distribution or configuration did not negatively impact performance for other types of data distributions or configurations. While implementing this system, extreme care was taken to minimize data copying. This was especially challenging for sparse data configurations.

The application of the specialized protocol buffer parser resulted in a speedup of 2–5 times on benchmarked datasets.
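For reference, constructing and serializing a tensorflow.Example with the TensorFlow Python API looks roughly as follows; the feature names and values are made up for illustration.

import tensorflow as tf

# Build one tensorflow.Example with a sparse string feature and dense numeric ones.
example = tf.train.Example(features=tf.train.Features(feature={
    "category": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"GAMES"])),
    "num_impressions": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[42])),
    "rating": tf.train.Feature(
        float_list=tf.train.FloatList(value=[4.5])),
}))

serialized = example.SerializeToString()  # the wire format shared across the pipeline
restored = tf.train.Example.FromString(serialized)
print(restored.features.feature["category"].bytes_list.value)  # [b'GAMES']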
7 CASE STUDY: GOOGLE PLAY
One of the first deployments of TFX is the recommender system for Google Play, a commercial mobile app store. The goal of the Google Play recommender system is to recommend relevant Android apps to the Play app users when they visit the homepage of the store, with an aim of driving discovery of apps that will be useful to the user. The input to the system is a “query” that includes the information about the app user and context. The recommender system returns a list of apps, which the user can either click on or install. Since the corpus contains over a million apps, it is intractable to score every app for every query. Hence, the first step in this system is retrieval, which returns a short list of apps based on various signals. Once we have this short list of apps, the ranking system uses the machine-learned model to compute a score per app and presents a ranked list to the user. In this case study, we focus on the ranking system.

The machine learning model that ranks the items is trained continuously as fresh training data arrives (usually in batches). The typical training dataset size is hundreds of billions of examples where each example has query features (e.g., the user’s context) as well as impression features (e.g., ratings and developer of the app being ranked). After rigorous validation (e.g., comparing quality metrics with models serving live traffic), the trained models are deployed through TensorFlow Serving in data centers around the globe and collectively serve thousands of queries per second with a strict latency requirement of tens of milliseconds. Due to fresh models being trained daily, the servers have to reload multiple models (both the production models, as well as other experimental models) per day. This is done without any degradation in latency.

As we moved the Google Play ranking system from its previous version to TFX, we saw an increased velocity of iterating on new experiments, reduced technical debt, and improved model quality. To list a few, the overall product has benefitted in the following ways:
• The data validation and analysis component helped in discovering a harmful training-serving feature skew. By comparing the statistics of serving logs and training data on the same day, Google Play discovered a few features that were always missing from the logs, but always present in training. The results of an online A/B experiment showed that removing this skew improved the app install rate on the main landing page of the app store by 2%. (A simplified version of this skew check is sketched after this list.)
• Warm-starting helped improve model quality and freshness while reducing the time and resources spent on training over hundreds of billions of examples. Training from scratch can take several days to converge to a high-quality model, making it hard for the model to recommend trending or recently added apps that were missing from the training data. One option to produce new models more frequently is to reduce the number of training iterations, at the expense of lower quality models. This is the same trade-off between model quality and model freshness described in Section 4. Hence, Google Play, for whom it is infeasible to train every new model from scratch, adopted the technique of warm-starting to selectively initialize a subset of model parameters (e.g., embeddings) from a previously trained model. This enabled Google Play to push a high-quality fresh model for serving frequently.
• Model validation helped in understanding and troubleshooting performance differences between the old and new models. The model validation component tests the new model against the production model, preventing issues like accidentally pushing partially-trained models to serving because of system failures.
• The model serving component enabled deploying the trained model to production, while guaranteeing high performance and flexibility. Specifically, Google Play benefitted from the optimizations in the serving system, including support for isolation in a multi-tenant environment and fast custom proto parsing, described in Section 6.
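A simplified version of the skew check mentioned in the first bullet compares per-feature presence between training data and serving logs from the same day; the feature names and tolerance are illustrative only.

def presence(examples):
    # Fraction of examples in which each feature appears.
    counts = {}
    for ex in examples:
        for name in ex:
            counts[name] = counts.get(name, 0) + 1
    return {name: c / float(len(examples)) for name, c in counts.items()}

def skewed_features(train_examples, serving_examples, tolerance=0.01):
    # Features present in training but (almost) always missing from serving logs.
    train_p, serve_p = presence(train_examples), presence(serving_examples)
    return [name for name, p in train_p.items()
            if p > 0.99 and serve_p.get(name, 0.0) < tolerance]

train = [{"category": "GAMES", "installs_last_week": 10}] * 100
serving = [{"category": "GAMES"}] * 100  # 'installs_last_week' never reaches serving
print(skewed_features(train, serving))   # ['installs_last_week']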
8 CONCLUSIONS
We discussed the anatomy of general-purpose machine learning platforms and introduced TFX, an implementation of such a platform with TensorFlow-based learners and support for continuous training and serving with production-level reliability. The key approach is to orchestrate reusable components (data analysis and transformation, data validation, model training, model evaluation and validation, and serving infrastructure) effectively and provide a simple unified configuration for users. TFX has been successfully deployed in the Google Play app store, reducing the time to production and increasing its app install rate by 2%.


Many interesting challenges remain. While TFX is general-purpose and already supports a variety of model and data types, it must be flexible and accommodate new innovations from the machine learning community. For example, sequence-to-sequence models have recently been used to produce state-of-the-art results in machine translation. Supporting these models required us to carefully think about capturing sequential information in the common data format and posed new challenges for the serving infrastructure and model validation, among others. In general, each addition may affect multiple components of TFX, so all components must be extensible. Furthermore, as machine learning becomes more prevalent, there is a strong need for understandability where a model can explain its decision and actions to users. We believe the lessons we learned from deploying TFX provide a basis for building an interactive platform that provides deeper insights to users.

REFERENCES
[1] Apache Beam: An Advanced Unified Programming Model. https://beam.apache.org/, accessed 2017-06-05.
[2] Running your models in production with TensorFlow Serving. https://research.googleblog.com/2016/02/running-your-models-in-production-with.html, accessed 2017-02-08.
[3] TensorFlow - Reading data. https://www.tensorflow.org/how_tos/reading_data/, accessed 2017-02-08.
[4] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI. 265–283.
[5] Rami Abousleiman, Guangzhi Qu, and Osamah A. Rawashdeh. 2013. North Atlantic Right Whale Contact Call Detection. CoRR abs/1304.7851 (2013).
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In DLRS. 7–10.
[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In RecSys. 191–198.
[8] Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR abs/1406.2572 (2014).
[9] Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA.
[10] Tim Kraska, Ameet Talwalkar, John C. Duchi, Rean Griffith, Michael J. Franklin, and Michael I. Jordan. 2013. MLbase: A Distributed Machine-learning System. In CIDR.
[11] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. PVLDB 9, 12 (2016), 948–959.
[12] Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, and Tawfiq Hasanin. 2015. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data 2, 1 (2015), 24.
[13] Cheng Li, Yue Lu, Qiaozhu Mei, Dong Wang, and Sandeep Pandey. 2015. Click-through Prediction for Advertising in Twitter Timeline. In KDD. 1959–1968.
[14] Jimmy J. Lin and Alek Kolcz. 2012. Large-scale machine learning at Twitter. In SIGMOD. 793–804.
[15] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad Click Prediction: A View from the Trenches. In KDD. 1222–1230.
[16] Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2015. MLlib: Machine Learning in Apache Spark. CoRR abs/1505.06807 (2015).
[17] J. I. Munro and M. S. Paterson. 1980. Selection and sorting with limited storage. Theoretical Computer Science 12, 3 (1980), 315–323.
[18] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Trans. on Knowl. and Data Eng. 22, 10 (Oct. 2010), 1345–1359.
[19] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Systems. In NIPS. 2503–2511.
[20] Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, and Benjamin Recht. 2016. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. CoRR abs/1610.09451 (2016).
[21] Manasi Vartak, Harihar Subramanyam, Wei-En Lee, Srinidhi Viswanathan, Saadiyah Husnoo, Samuel Madden, and Matei Zaharia. 2016. ModelDB: a system for machine learning model management. In HILDA@SIGMOD. 14.
[22] Cassandra Xia, Clemens Mewald, D. Sculley, David Soergel, George Roumpos, Heng-Tze Cheng, Illia Polosukhin, Jamie Alexander Smith, Jianwei Xie, Lichan Hong, Martin Wicke, Mustafa Ispir, Philip Daniel Tucker, Yuan Tang, and Zakaria Haque. 2017. Train and Distribute: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks. KDD (under review).
[23] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks?. In NIPS. 3320–3328.
[24] Martin Zinkevich. 2016. Rules of Machine Learning. In NIPS Workshop on Reliable Machine Learning. Invited Talk.
