TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
Having this type of platform enables teams to easily deploy machine learning in production for a wide range of products, ensures best practices for different components of the platform, and limits the technical debt arising from one-off implementations that cannot be reused in different contexts.

This paper presents the anatomy of end-to-end machine learning platforms and introduces TensorFlow Extended (TFX), one implementation of such a platform that we built at Google to address the aforementioned challenges. We describe the key platform components and the salient points behind their design and functionality. We also present a case study of deploying the platform in Google Play, a commercial mobile app store with over one billion active users and over one million apps, and discuss the lessons that we learned in this process. These lessons reflect best practices for machine-learning platforms in a diverse set of contexts and are thus of general interest to researchers and practitioners in the field.

2 PLATFORM OVERVIEW

2.1 Background and Related Work
Prior art has addressed a subset of the challenges in deploying machine learning in production. Related work has reported that the learning algorithm is only one component of a machine learning platform, and one that represents a small fraction of the code [19, 20]. Data and model parallelism require distributed systems and orchestration that exceed the capabilities of many single-machine solutions [12, 16]. Beyond simply stitching together components, a machine learning pipeline also needs to be simple to set up [16], and may even support automated pipeline construction [20]. Once a team can train multiple models, it needs to keep track of their experiment history in a centralized database [21]. Ideally, the platform automatically surveys different machine learning techniques and suggests the best solution, allowing even non-experts access to machine learning [10]. However, putting together several disjoint components to do the job can result in significant technical debt in the form of hard-to-maintain glue code, hidden dependencies, feedback loops, etc. [19].

2.2 Platform Design and Anatomy
In this paper we expand on the existing literature and address the challenges outlined in the introduction by presenting a reusable machine learning platform developed at Google. Our design adopts the following principles:

One machine learning platform for many learning tasks. We chose to use TensorFlow [4] as the trainer, but the platform design is not limited to this specific library. One factor in choosing (or dismissing) a machine learning platform is its coverage of existing algorithms [12]. TensorFlow provides full flexibility for implementing any type of model architecture. To name just a few, we have seen implementations of linear, deep, linear-and-deep combined, tree-based, sequential, multi-tower, multi-head, and other architectures. This allows users of TFX to switch out the learning algorithm without migrating the entire pipeline to a different stack. However, it also imposes requirements on all other components. Data analysis, validation, and visualization tools need to support sparse, dense, or sequence data. Model validation, evaluation, and serving tools need to support all kinds of inference types, including (among others) regression, classification, and sequences.

Continuous training. Most machine learning pipelines are set up as workflows or dependency graphs (e.g., [14, 20]) that execute specific operations or jobs in a defined sequence. If a team needs to train over new data, the same workflow or graph is executed again. However, many real-world use cases require continuous training. TFX supports several continuation strategies that result from the interaction between data visitation and warm-starting options. Data visitation can be configured to be static or dynamic (over a rolling range of directories). Warm-starting initializes a subset of model parameters from a previous state.

Easy-to-use configuration and tools. Providing a unified configuration framework is only possible if components also share utilities that allow them to communicate and share assets. A TFX user is only exposed to one common configuration that is passed to all components and shared where necessary. Utilities that are used by all components enable enforcement of global garbage-collection policies, unified debugging and status signals, etc.

Production-level reliability and scalability. Only a small fraction of a machine learning platform is the actual code implementing the training algorithm [19]. If the platform handles and encapsulates the complexity of machine learning deployment, engineers and scientists have more time to focus on the modeling tasks. Since it is difficult to predict whether a learning algorithm will behave reasonably on new data [8], model validation is critical. In turn, model validation must be coupled with data validation in order to detect corrupted training data and thus prevent bad (yet validated) models from reaching production. To give an example, training data that accidentally includes the label will lead to a good-quality model that passes validation, but would not perform well in production where the label is not available. Validating the serving infrastructure before pushing to the production environment is vital to the reliability and robustness of any machine learning platform. Our platform provides implementations of these components that encode best practices observed in many production pipelines. Moreover, our experience shows that the distributed data processing model offered by Apache Beam [1] (and similar internal infrastructure) is a good fit for handling the large volume of data during training, model evaluation, and batch inference.
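To make the data-visitation, warm-starting, and shared-configuration options above concrete, here is a minimal, purely hypothetical sketch of what a single pipeline configuration could look like; the PipelineConfig structure and every field name are illustrative assumptions and not the actual TFX configuration format.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataVisitation:
    # "static": train over a fixed set of data directories;
    # "dynamic": train over a rolling range of the most recent directories.
    mode: str = "dynamic"
    rolling_range_days: int = 7

@dataclass
class WarmStart:
    # Initialize a subset of model parameters from a previously trained model.
    previous_model_dir: Optional[str] = None
    variables_to_warm_start: List[str] = field(default_factory=lambda: ["embeddings"])

@dataclass
class PipelineConfig:
    # One configuration object shared by data analysis/validation, the trainer,
    # model evaluation/validation, and serving.
    data_dirs: List[str] = field(default_factory=list)
    data_visitation: DataVisitation = field(default_factory=DataVisitation)
    warm_start: WarmStart = field(default_factory=WarmStart)
    garbage_collect_after_days: int = 30  # a global garbage-collection policy

config = PipelineConfig(data_dirs=["/data/train/current"])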
[Figure 1: High-level overview of a machine learning platform. Labeled components include an Integrated Frontend for Job Management, Monitoring, Debugging, and Data/Model/Evaluation Visualization; a Shared Configuration Framework and Job Orchestration layer; a Tuner; Shared Utilities for Garbage Collection and Data Access Controls; and Pipeline Storage. The components that are the focus of this paper are highlighted.]

Figure 1 shows a high-level component overview of a machine learning platform and highlights the components discussed in the following sections: data analysis (Section 3.1), data transformation (Section 3.2), data validation (Section 3.3), trainer (Section 4), model evaluation and validation
(Section 5), and serving (Section 6). In isolation, these components implement high-level functionality that is typical in machine-learning platforms, e.g., data sampling, feature generation, training, and evaluation [12, 14, 16]. However, it is worth pointing out two differentiators. First, we built these components to adhere to the aforementioned principles, which introduced several technical difficulties. Second, the integration of these components in a single platform, with shared configuration and utilities, enabled key improvements over existing alternatives. To give an example, transformations applied in the trainer and at serving time may need statistics generated by the data analysis component. Integrating these components ensures consistency across the pipeline and guarantees that the same transformations are applied at training and serving, which in turn prevents one form of training-serving skew (a common production headache in machine learning). Although almost all of the components can be considered optional, TFX users find it beneficial to adopt the full stack in order to achieve more robust and reliable production systems and take advantage of all management and visualization tools in the integrated frontend.

Throughout the paper, we refer to the engineers and ML practitioners using our platform as "users". In the case study of Google Play (Section 7), we refer to people who visit the Google Play store as "Play users".

3 DATA ANALYSIS, TRANSFORMATION, AND VALIDATION
Machine learning models are only as good as their training data, so understanding the data and finding any anomalies early is critical for preventing data errors downstream, which are more subtle and harder to debug. Often the data is generated by ad-hoc pipelines involving multiple products, systems, and usage logs. Faults (e.g., code bugs, system failures, or human errors, to name a few) can occur at multiple points of this generation process, which makes anomalies in the data not an exception, but rather the norm. As a machine learning platform scales to larger data and runs continuously, there is a strong need for a reusable component that enables rigorous checks for data quality and promotes best practices for data management in the context of machine learning platforms [11]. Small bugs in the data can significantly degrade model quality over a period of time in a way that is hard to detect and diagnose (unlike catastrophic bugs that cause spectacular failures and are thus easy to track down), so constant data vigilance should be a part of any long-running development of a machine learning platform.

Building such a component is challenging for several reasons. First, the component needs to support a wide range of data-analysis and validation cases that correspond to machine learning applications. The component must also be easy to deploy for a basic set of useful checks without requiring excessive customization, with additional checks being possible at the cost of some setup by the user. Moreover, the component should help users monitor and react to data-quality problems in a way that is non-intrusive (users do not receive "spammy" data-anomaly alerts) and actionable (users understand how to debug a particular data anomaly). Our experience has shown that users tend to switch off data-quality checks if they receive a large number of false-positive alerts or if the alerts are hard to understand.

The following subsections describe the implementation of this component in TFX and how it addresses the aforementioned challenges. The component treats data analysis, transformation, and validation as separate yet closely related processes, with complementary roles.

3.1 Data Analysis
For data analysis, the component processes each dataset fed to the system and generates a set of descriptive statistics on the included features. These statistics cover the presence of each feature in the data, e.g., the distribution of the number of values per example or the number of examples with and without the feature. The component also gathers statistics over feature values: for continuous features, the statistics include quantiles, equi-width histograms, and the mean and standard deviation, to name a few, whereas for discrete features they include the top-K values by frequency. The component also supports statistics on configurable slices of the data (e.g., on negative and positive examples in a binary classification problem) and cross-feature statistics (e.g.,
correlation and covariance between features). By looking at these feature statistics, users can gain insights into the shape of each dataset. Note that it is possible to extend the component with further statistics, but we found that this subset provides good coverage for the needs of our users.

In a continuous training and serving environment, the above statistics must be computed efficiently at scale. Unavailable feature statistics may result in missed opportunities to correct data anomalies, while outdated feature-to-integer mappings and feature value distributions may result in a drop in model quality. On large training data, some of these statistics become difficult to compute exactly, and the component resorts to distributed streaming algorithms that give approximate results [9, 17].
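As an illustration of computing this kind of per-feature statistics with the distributed data processing model of Apache Beam [1], consider the following sketch using the Beam Python SDK. It is not the TFX implementation; the in-memory examples, feature names, and dict-based example format are assumptions made for the example.

import apache_beam as beam

# Toy examples; in practice these would be read from the training data.
examples = [
    {"category": "GAMES", "num_impressions": 3},
    {"category": "BUSINESS", "num_impressions": 7},
    {"category": "GAMES"},  # 'num_impressions' is missing in this example
]

def feature_values(example):
    # Emit (feature_name, value) pairs so statistics can be grouped per feature.
    for name, value in example.items():
        yield name, value

with beam.Pipeline() as p:
    pairs = (p
             | "CreateExamples" >> beam.Create(examples)
             | "ExtractFeatures" >> beam.FlatMap(feature_values))

    # Presence: how many examples contain each feature.
    presence = (pairs
                | "FeatureNames" >> beam.Map(lambda kv: kv[0])
                | "CountPresence" >> beam.combiners.Count.PerElement())

    # Mean of a numeric feature.
    mean_impressions = (pairs
                        | "OnlyNumeric" >> beam.Filter(lambda kv: kv[0] == "num_impressions")
                        | "NumericValues" >> beam.Map(lambda kv: kv[1])
                        | "Mean" >> beam.combiners.Mean.Globally())

    # Top-K values by frequency for a categorical feature.
    top_categories = (pairs
                      | "OnlyCategory" >> beam.Filter(lambda kv: kv[0] == "category")
                      | "CategoryValues" >> beam.Map(lambda kv: kv[1])
                      | "CountValues" >> beam.combiners.Count.PerElement()
                      | "CountFirst" >> beam.Map(lambda kv: (kv[1], kv[0]))
                      | "TopK" >> beam.combiners.Top.Largest(10))

    _ = presence | "PrintPresence" >> beam.Map(print)
    _ = mean_impressions | "PrintMean" >> beam.Map(print)
    _ = top_categories | "PrintTop" >> beam.Map(print)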
3.2 Data Transformation
Our component implements a suite of data transformations to allow feature wrangling for model training and serving. For instance, this suite includes the generation of feature-to-integer mappings, also known as vocabularies. In most machine learning platforms that deal with sparse categorical

• The expected domain of a feature, i.e., the small universe of values for a string feature, or the range for an integer feature.

[Figure: An example of data validation. A training example contains the feature 'category' with value 'EDUCATION' and the feature 'num_impressions' with value 'NULL'. The schema declares 'category' as a REQUIRED STRING with domain {'GAMES', 'BUSINESS'} and 'num_impressions' with type INT. The suggested fixes are 'Add value to domain' (add 'EDUCATION' to the domain of 'category') and 'Deprecate feature' (mark 'num_impressions' as deprecated).]
the values in a specific feature, or describe a specific interpretation of a feature's values (say, a string feature may have the values "TRUE" and "FALSE", which are interpreted as booleans according to the schema). However, such complex properties are often hard to specify, since they involve thresholds that are hard to tune, especially since some churn in the characteristics of the data is expected in any real-life deployment.

Based on our engagements with users, both in deciding the properties that the schema should permit and in designing the end-to-end data validation component, we relied on the following key design principles:
• The user should understand at a glance which anomalies are detected and their coverage over the data.
• Each anomaly should have a simple description that helps the user understand how to debug and fix the data. One such example is an anomaly that says a feature's value is out of a certain range. As an antithetical example, it is much harder to understand and debug an anomaly that says the KL divergence between the expected and actual distributions has exceeded some threshold.
• In some cases the anomalies correspond to a natural evolution of the data, and the appropriate action is to change the schema (rather than fix the data). To accommodate this option, our component generates for each anomaly a corresponding schema change that can bring the schema up-to-date (essentially, make the anomaly part of the normal state of the data).
• We want the user to treat data errors with the same rigor and care that they apply to bugs in code. To promote this practice, we allow anomalies to be filed just like any software bug, where they are documented, tracked, and eventually resolved.
These principles have affected both the logic to detect anomalies and the presentation of anomalies in the UI component of TFX.
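The sketch below illustrates the kind of check these principles lead to: a feature value outside its expected domain produces a plainly worded anomaly together with a suggested schema update. The schema and anomaly representations are simplified assumptions for illustration, not the actual TFX data-validation implementation.

# Simplified, illustrative schema: for each string feature, the set of expected values.
schema = {"category": {"GAMES", "BUSINESS"}}

def validate_domain(feature_name, observed_values, schema):
    """Return a human-readable anomaly and a suggested schema fix, or None."""
    expected = schema.get(feature_name)
    if expected is None:
        return None
    unexpected = sorted(set(observed_values) - expected)
    if not unexpected:
        return None
    anomaly = ("Feature '%s' contains values outside its expected domain: %s"
               % (feature_name, unexpected))
    # Suggested fix: extend the schema so the anomaly becomes part of the
    # normal state of the data (the user may instead choose to fix the data).
    suggested_fix = {feature_name: expected | set(unexpected)}
    return anomaly, suggested_fix

result = validate_domain("category", ["GAMES", "EDUCATION"], schema)
if result:
    anomaly, fix = result
    print(anomaly)                         # simple, actionable description
    print("Suggested schema update:", fix)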
Beyond detecting anomalies in the data, users can also look at the schema (and its versions) in order to understand the evolution of the data fed to the machine learning platform. The schema also serves as a stable description of the data that can drive other platform components, e.g., automatic feature-engineering or data-analysis tools.

4 MODEL TRAINING
One of the core design philosophies of TFX is to streamline (and automate as much as possible) the process of training production-quality models that can support all training use cases. We chose to design the model trainer such that it supports training any model configured using TensorFlow [4], including the implementations necessary for continuous training. It takes minimal, one-time effort to integrate modeling code written in TensorFlow with the trainer. Once done, users can seamlessly switch from one learning algorithm to another without any effort to re-integrate with the trainer.
While continuously training and exporting machine learning models is a common production use case, such models often need to train on huge datasets to generate good-quality models. This practice is becoming infeasible for many teams because it is both time- and resource-intensive to retrain these models. At Google, we have, on several occasions, leveraged warm-starting to attain high-quality models without spending too many resources.

4.1 Warm-Starting
For many production use cases, the freshness of machine learning models is critical (e.g., the Google Play store, where thousands of new apps are added every day). Many such use cases also have huge training datasets (O(100B) data points), which may take hours (or days in some cases) of training to attain models meeting the desired quality thresholds. This results in a trade-off between model quality and model freshness. Warm-starting is a practical technique to offset this trade-off and, when used correctly, can result in models of the same quality as one would obtain after training for many hours, in much less time and with fewer resources.
Warm-starting is inspired by transfer learning [18]. One of the approaches to transfer learning is to first train a base network on some base dataset, then use the 'general' parameters from the base network to initialize the target network, and finally train the target network on the target dataset [23]. The effectiveness of this technique depends on the generality of the features whose corresponding parameters are transferred. The same approach can be applied in the context of continuous training. In this approach, we identify a few general features of the network being trained (e.g., embeddings of sparse features). When training a new version of the network, we initialize (or warm-start) the parameters corresponding to these features from the previously trained version of the network and fine-tune them with the rest of the network. Since we are transferring parameters between different versions of the same network, this technique results in much quicker convergence of the new version, thus yielding models of the same quality using fewer resources.
In the past, many teams wrote and maintained custom binaries to warm-start new models. This incurred a lot of duplicated effort in writing and maintaining similar code. While building TFX, the ability to selectively warm-start selected features of the network was identified as a crucial component, and its implementation in TensorFlow was subsequently open sourced.
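As an illustration, selective warm-starting of embedding parameters can be expressed with the warm-starting support that later appeared in TensorFlow 1.x Estimators; the checkpoint path, the feature, and the variable-name pattern below are assumptions made for this sketch.

import tensorflow as tf  # TensorFlow 1.x style APIs

category = tf.feature_column.categorical_column_with_vocabulary_list(
    "category", ["GAMES", "BUSINESS", "EDUCATION"])
feature_columns = [tf.feature_column.embedding_column(category, dimension=8)]

# Warm-start only the embedding parameters (the 'general' features) from the
# previously trained version of the same network; the remaining parameters are
# initialized from scratch and fine-tuned together with the embeddings.
warm_start = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/models/ranker/previous",  # assumed checkpoint path
    vars_to_warm_start=".*input_layer.*")               # regex over variable names

estimator = tf.estimator.DNNClassifier(
    hidden_units=[256, 128, 64],
    feature_columns=feature_columns,
    warm_start_from=warm_start)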
Together with features like warm-starting, TensorFlow provides a high-level unified API to configure model training using various learning techniques (e.g., deep learning, wide-and-deep learning, sequence learning).

4.2 High-Level Model Specification API
We decided to use an established high-level TensorFlow model specification API [22]. Our experience points to large productivity gains via a higher-level abstraction layer that hides implementation details and encodes best practices.
One of the useful abstractions we leveraged is FeatureColumns. FeatureColumns help users focus on which features to use in their machine learning model. They are a declarative way of defining the input layer of a model. Another component of the abstraction layer we relied on is the concept of an Estimator. For a given model, an Estimator handles training and evaluation. It can be used on a local machine as well as for distributed training. The following is a simplified example of how training a feed-forward dense neural network looks in this framework:

import tensorflow as tf  # TensorFlow 1.x style APIs

# The helpers below live in tf.feature_column and tf.estimator; they are
# aliased here so the example reads as in the original listing.
numeric_column = tf.feature_column.numeric_column
categorical_column_with_vocabulary_list = tf.feature_column.categorical_column_with_vocabulary_list
categorical_column_with_hash_bucket = tf.feature_column.categorical_column_with_hash_bucket
embedding_column = tf.feature_column.embedding_column
DNNRegressor = tf.estimator.DNNRegressor

# Declare a numeric feature:
num_rooms = numeric_column('number-of-rooms')
# Declare a categorical feature:
country = categorical_column_with_vocabulary_list('country', ['US', 'CA'])
# Declare a categorical feature and use hashing:
zip_code = categorical_column_with_hash_bucket('zip_code', hash_bucket_size=1000)
# Define the model and declare the inputs
estimator = DNNRegressor(
    hidden_units=[256, 128, 64],
    feature_columns=[
        num_rooms, country,
        embedding_column(zip_code, 8)],
    activation_fn=tf.nn.relu,
    dropout=0.1)
# Prepare the training data
def my_training_data():
    # Read and parse training data and convert it into tensors. Returns a
    # mini-batch of data every time the returned tensors are fetched.
    return features, labels
# Prepare the validation data
def my_eval_data():
    # Read and parse validation data and convert it into tensors. Returns a
    # mini-batch of data every time the returned tensors are fetched.
    return features, labels
estimator.train(input_fn=my_training_data)
estimator.evaluate(input_fn=my_eval_data)
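Continuing the example above, the trained Estimator can be exported in the SavedModel format consumed by TensorFlow Serving (Section 6); the export path below is an assumption.

# Build a serving input function that parses serialized tensorflow.Example
# protos using the same feature columns declared above.
feature_spec = tf.feature_column.make_parse_example_spec(
    [num_rooms, country, embedding_column(zip_code, 8)])
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
    feature_spec)

# Write a SavedModel that TensorFlow Serving can load.
estimator.export_savedmodel("/models/housing", serving_input_fn)  # assumed path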
From our experience, users find it easier to first train a simple model in the available setting (e.g., single machine or distributed system) before experimenting with various optimization settings [24, Rule #4]. Once a baseline is established, users can experiment with these settings. A tuner integrated with the trainer can also automatically optimize the hyperparameters based on users' objectives and data.

5 MODEL EVALUATION AND VALIDATION
Machine-learned models are often parts of complex systems comprising a large number of data sources and interacting components, which are commonly entangled together [19]. This creates large surfaces on which bugs can grow and unexpected interactions can develop, potentially to the detriment of end-user experiences via the degradation of the machine-learned model. Some examples of such bugs include different components expecting different serialized model formats, or bugs in training or serving code causing binary crashes.
These issues can be difficult for humans to detect, especially in a continuous training setting where new models are refreshed and pushed to production frequently. Having a reusable component that automatically evaluates and validates models to ensure that they are "good" before serving them to users can help prevent unexpected degradations in the user experience.

5.1 Defining a "good" model
The model evaluation and validation component of TFX is designed for this purpose. The key question that the component helps answer is: is this specific model a "good" model? We suggest two criteria: that the model is safe to serve, and that it has the desired prediction quality.
By safe to serve, we mean obvious requirements: the model should not crash or cause errors in the serving system when being loaded or when sent bad or unexpected inputs, and it should not use too many resources (such as CPU or RAM). One specific problem we have encountered is when the model is trained using a newer version of a machine learning library than is used at serving time, resulting in a model representation that cannot be used by the serving system. Product teams care about user satisfaction and product health, which are better captured by measures of prediction quality (such as app install rate) on live traffic than by the objective function on the training data.

5.2 Evaluation: human-facing metrics of model quality
Evaluation is used as part of the interactive process where teams try to iteratively improve their models. Since it is costly and time-consuming to run A/B experiments on live traffic, models are evaluated offline on held-out data to determine whether they are promising enough to start an online A/B experiment. The evaluation component provides proxy metrics, such as AUC or cost-weighted error, that approximate business metrics more closely than the training loss but are computable offline. Once teams are satisfied with their models' offline performance, they can conduct product-specific A/B experiments to determine how their models actually perform on live traffic on relevant business metrics.

5.3 Validation: machine-facing judgment of model goodness
Once a model is launched to production and is continuously being updated, automated validation is used to ensure that the updated models are good. We validate that a model is safe to serve with a simple canary process. We evaluate prediction quality by comparing the model quality against a fixed threshold as well as against a baseline model (e.g., the current production model). Any new model failing any of these checks is not pushed to serving, and product teams are alerted.
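A minimal sketch of such a validation gate is shown below; the metric names, thresholds, and alerting behavior are illustrative assumptions rather than the actual TFX implementation.

def validate_model(candidate_metrics, baseline_metrics,
                   min_auc=0.75, max_allowed_auc_drop=0.01):
    """Decide whether a newly trained model may be pushed to serving.

    candidate_metrics / baseline_metrics are dicts of offline evaluation
    metrics, e.g. {"auc": 0.83}, computed by the evaluation component.
    """
    reasons = []
    # Check against a fixed threshold.
    if candidate_metrics["auc"] < min_auc:
        reasons.append("AUC below fixed threshold")
    # Check against the baseline (e.g., the current production model).
    if candidate_metrics["auc"] < baseline_metrics["auc"] - max_allowed_auc_drop:
        reasons.append("AUC regressed relative to the production model")
    return (len(reasons) == 0), reasons

ok, reasons = validate_model({"auc": 0.82}, {"auc": 0.84})
if not ok:
    # Do not push the model; alert the product team instead.
    print("Model rejected:", "; ".join(reasons))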
One challenge with validating safety is that our canary process will not catch all potential errors. Another challenge with validation in a continuously training pipeline is that it is hard to distinguish expected and unexpected variations in a model's behaviour. When the training data is continuously changing, some variation in the model's behaviour and its performance on business metrics is to be expected.
Hence, there is a tension between being too conservative, alerting users to small changes in these metrics and thereby causing users to tire of and eventually ignore the alerts, and being too loose and failing to catch unexpected changes. From our experience working with several product teams, most bugs cause dramatic changes to model quality metrics that can be caught by using loose thresholds. However, there is a strong selection bias here, since more subtle issues may not have drawn our attention.

6 MODEL SERVING
Reaping the benefits of sophisticated machine-learned models is only possible when they can be served effectively. TensorFlow Serving [2] provides a complete serving solution for machine-learned models to be deployed in production environments. TensorFlow Serving's framework is also designed to be flexible, supporting new algorithms and experiments. Scaling to varied traffic patterns is an important goal of TensorFlow Serving. By providing a complete yet customizable framework for machine learning serving, TensorFlow Serving aims to reduce the boilerplate code needed to deploy a production-grade serving system for machine-learned models.
Serving systems for production environments require low latency, high efficiency, horizontal scalability, reliability, and robustness. This section elaborates on two specific system challenges: low latency and high efficiency.
On the other hand, the common format was a suboptimal solution for some sources of data. Choosing a common format involves trade-offs such as the size of the data, the cost of parsing, and the need to write format-conversion code. We decided on the tensorflow.Example [3] protocol buffer format (a cross-language, serializable data structure).
Non-neural-network (e.g., linear) models are often more data-intensive than CPU-intensive. For such models, data input, output, and preprocessing tend to be the bottleneck. Using a generic protocol buffer parser proved to be inefficient. To resolve this, a specialized protocol buffer parser was built based on profiles of various real data distributions in multiple parsing configurations. Lazy parsing was employed, including skipping complete parts of the input protocol buffer that the configuration specified as unnecessary. In addition, to ensure the protocol buffer parsing optimizations were consistently useful, a benchmarking suite was built. This suite was useful in ensuring that optimizing for one type of data distribution or configuration did not negatively impact performance for other types of data distributions or configurations. While implementing this system, extreme care was taken to minimize data copying. This was especially challenging for sparse data configurations.
The application of the specialized protocol buffer parser resulted in a speedup of 2-5 times on benchmarked datasets.
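For illustration, the tensorflow.Example format and configuration-driven parsing can be sketched with standard TensorFlow 1.x APIs; this shows only the general mechanism of declaring which features to materialize, not the specialized parser described above, and the feature names are assumptions.

import tensorflow as tf  # TensorFlow 1.x style APIs

# Serialize one tensorflow.Example protocol buffer.
example = tf.train.Example(features=tf.train.Features(feature={
    "category": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"GAMES"])),
    "num_impressions": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[3])),
    "unused_feature": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"ignored"])),
}))
serialized = example.SerializeToString()

# Only the features named in the parsing configuration are materialized;
# features absent from the spec are ignored.
feature_spec = {
    "category": tf.FixedLenFeature([], tf.string),
    "num_impressions": tf.FixedLenFeature([], tf.int64),
}
parsed = tf.parse_example([serialized], feature_spec)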
7 CASE STUDY: GOOGLE PLAY
One of the first deployments of TFX is the recommender system for Google Play, a commercial mobile app store. The goal of the Google Play recommender system is to recommend relevant Android apps to the Play app users when they visit the homepage of the store, with an aim of driving discovery of apps that will be useful to the user. The input to the system is a "query" that includes information about the app user and context. The recommender system returns a list of apps, which the user can either click on or install. Since the corpus contains over a million apps, it is intractable to score every app for every query. Hence, the first step in this system is retrieval, which returns a short list of apps based on various signals. Once we have this short list of apps, the ranking system uses the machine-learned model to compute a score per app and presents a ranked list to the user. In this case study, we focus on the ranking system.
The machine learning model that ranks the items is trained continuously as fresh training data arrives (usually in batches). The typical training dataset size is hundreds of billions of examples, where each example has query features (e.g., the user's context) as well as impression features (e.g., ratings and developer of the app being ranked). After rigorous validation (e.g., comparing quality metrics with models serving live traffic), the trained models are deployed through TensorFlow Serving in data centers around the globe and collectively serve thousands of queries per second with a strict latency requirement of tens of milliseconds. Due to fresh models being trained daily, the servers have to reload multiple models (both the production models, as well as other experimental models) per day. This is done without any degradation in latency.
As we moved the Google Play ranking system from its previous version to TFX, we saw an increased velocity of iterating on new experiments, reduced technical debt, and improved model quality. To list a few, the overall product has benefitted in the following ways:
• The data validation and analysis component helped in discovering a harmful training-serving feature skew. By comparing the statistics of serving logs and training data on the same day, Google Play discovered a few features that were always missing from the logs, but always present in training. The results of an online A/B experiment showed that removing this skew improved the app install rate on the main landing page of the app store by 2%.
• Warm-starting helped improve model quality and freshness while reducing the time and resources spent on training over hundreds of billions of examples. Training from scratch can take several days to converge to a high-quality model, making it hard for the model to recommend trending or recently added apps that were missing from the training data. One option to produce new models more frequently is to reduce the number of training iterations, at the expense of lower-quality models. This is the same trade-off between model quality and model freshness described in Section 4. Hence, Google Play, for whom it is infeasible to train every new model from scratch, adopted the technique of warm-starting to selectively initialize a subset of model parameters (e.g., embeddings) from a previously trained model. This enabled Google Play to push a high-quality fresh model for serving frequently.
• Model validation helped in understanding and troubleshooting performance differences between the old and new models. The model validation component tests the new model against the production model, preventing issues like accidentally pushing partially-trained models to serving because of system failures.
• The model serving component enabled deploying the trained model to production, while guaranteeing high performance and flexibility. Specifically, Google Play benefitted from optimizations in the serving system, including support for isolation in a multi-tenant environment and fast custom proto parsing, described in Section 6.

8 CONCLUSIONS
We discussed the anatomy of general-purpose machine learning platforms and introduced TFX, an implementation of such a platform with TensorFlow-based learners and support for continuous training and serving with production-level reliability. The key approach is to orchestrate reusable components (data analysis and transformation, data validation, model training, model evaluation and validation, and serving infrastructure) effectively and to provide a simple unified configuration for users. TFX has been successfully deployed in the Google Play app store, reducing the time to production and increasing its app install rate by 2%.
Many interesting challenges remain. While TFX is general-purpose and already supports a variety of model and data types, it must be flexible and accommodate new innovations from the machine learning community. For example, sequence-to-sequence models have recently been used to produce state-of-the-art results in machine translation. Supporting these models required us to carefully think about capturing sequential information in the common data format, and posed new challenges for the serving infrastructure and model validation, among others. In general, each addition may affect multiple components of TFX, so all components must be extensible. Furthermore, as machine learning becomes more prevalent, there is a strong need for understandability, where a model can explain its decisions and actions to users. We believe the lessons we learned from deploying TFX provide a basis for building an interactive platform that provides deeper insights to users.

REFERENCES
[1] Apache Beam: An Advanced Unified Programming Model. https://beam.apache.org/, accessed 2017-06-05.
[2] Running your models in production with TensorFlow Serving. https://research.googleblog.com/2016/02/running-your-models-in-production-with.html, accessed 2017-02-08.
[3] TensorFlow - Reading data. https://www.tensorflow.org/how_tos/reading_data/, accessed 2017-02-08.
[4] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI. 265–283.
[5] Rami Abousleiman, Guangzhi Qu, and Osamah A. Rawashdeh. 2013. North Atlantic Right Whale Contact Call Detection. CoRR abs/1304.7851 (2013).
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In DLRS. 7–10.
[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In RecSys. 191–198.
[8] Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR abs/1406.2572 (2014).
[9] Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA.
[10] Tim Kraska, Ameet Talwalkar, John C. Duchi, Rean Griffith, Michael J. Franklin, and Michael I. Jordan. 2013. MLbase: A Distributed Machine-learning System. In CIDR.
[11] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. PVLDB 9, 12 (2016), 948–959.
[12] Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, and Tawfiq Hasanin. 2015. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data 2, 1 (2015), 24.
[13] Cheng Li, Yue Lu, Qiaozhu Mei, Dong Wang, and Sandeep Pandey. 2015. Click-through Prediction for Advertising in Twitter Timeline. In KDD. 1959–1968.
[14] Jimmy J. Lin and Alek Kolcz. 2012. Large-scale machine learning at twitter. In SIGMOD. 793–804.
[15] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad Click Prediction: A View from the Trenches. In KDD. 1222–1230.
[16] Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2015. MLlib: Machine Learning in Apache Spark. CoRR abs/1505.06807 (2015).
[17] J. I. Munro and M. S. Paterson. 1980. Selection and sorting with limited storage. Theoretical Computer Science 12, 3 (1980), 315–323.
[18] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Trans. on Knowl. and Data Eng. 22, 10 (Oct. 2010), 1345–1359.
[19] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Systems. In NIPS. 2503–2511.
[20] Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, and Benjamin Recht. 2016. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. CoRR abs/1610.09451 (2016).
[21] Manasi Vartak, Harihar Subramanyam, Wei-En Lee, Srinidhi Viswanathan, Saadiyah Husnoo, Samuel Madden, and Matei Zaharia. 2016. ModelDB: a system for machine learning model management. In HILDA@SIGMOD. 14.
[22] Cassandra Xia, Clemens Mewald, D. Sculley, David Soergel, George Roumpos, Heng-Tze Cheng, Illia Polosukhin, Jamie Alexander Smith, Jianwei Xie, Lichan Hong, Martin Wicke, Mustafa Ispir, Philip Daniel Tucker, Yuan Tang, and Zakaria Haque. 2017. Train and Distribute: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks. KDD (under review).
[23] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In NIPS. 3320–3328.
[24] Martin Zinkevich. 2016. Rules of Machine Learning. In NIPS Workshop on Reliable Machine Learning. Invited Talk.