[...] projects has created the need for software engineers to collaborate with data scientists and other specialists. While collaboration [...] continuous evolution and monitoring, and non-traditional quality [...] deploying ML systems into production. We report on common collaboration points in the development of production ML systems for requirements, data, and integration, as well as corresponding team patterns and challenges. We find that most of these challenges center around communication, documentation, engineering, and process, and collect recommendations to address these challenges.

[Figure 1: Team structure, responsibilities, and collaboration points in Organization 3 (collaboration points: 1 product requirements, 2 integration (API & QA), 3 public data) and Organization 7 (collaboration points: 1 model requirements, 2 training data, 3 integration (API)); the legend distinguishes software engineers, data scientists, teams, responsibilities, and collaboration points.]
we interviewed, there is little systematic or shared understanding of common collaboration challenges and best practices for developing ML-enabled systems and coordinating developers with very different backgrounds (e.g., data science vs. software engineering). We find that smaller and new-to-ML organizations struggle more, but have limited advice to draw from for improvement.

Three collaboration points surfaced as particularly challenging: (1) identifying and decomposing requirements, (2) negotiating training data quality and quantity, and (3) integrating data science and software engineering work. We found that organizational structure, team composition, power dynamics, and responsibilities differ substantially, but also found common organizational patterns at specific collaboration points and challenges associated with them. Overall, our observations suggest four themes that would benefit from more attention when building ML-enabled systems: (i) invest in supporting interdisciplinary teams to work together (including education and avoiding silos), (ii) pay more attention to collaboration points and clearly document responsibilities and interfaces, (iii) consider engineering work as a key contribution to the project, and (iv) invest more into process and planning.

In summary, we make the following contributions: (1) We identify three core collaboration points and associated collaboration challenges based on interviews with 45 practitioners, triangulated with a literature review, (2) we highlight the different ways in which teams organize, but also identify organizational patterns that associate with certain collaboration challenges, and (3) we identify recommendations to improve collaboration practices.

2 STATE OF THE ART

Researchers and practitioners have discussed whether and how machine learning changes software engineering with the introduction of learned models as components in software systems [e.g., 1, 5, 42, 68, 80, 82, 89, 102, 110]. To lay the foundation for our interview study and inform the questions we ask, we first provide an overview of the related work and existing theories on collaboration in traditional software engineering and discuss how machine learning may change this.

Collaboration in Software Engineering. Most software projects exceed the capacity of a single developer, requiring multiple developers and teams to collaborate ("work together") and coordinate ("align goals"). Collaboration happens across teams, often in a more formal and structured form, and within teams, where familiarity with other team members and frequent co-location fosters informal communication [63]. At a technical level, to allow multiple developers to work together, abstraction and a divide-and-conquer strategy are essential. Dividing software into components (modules, functions, subsystems) and hiding internals behind interfaces is a key principle of modular software development that allows teams to divide work and work mostly independently until the final system is integrated [62, 71].

Teams within an organization tend to align with the technical structure of the system, with individuals or teams assigned to components [30]; hence the technical structure (interfaces and dependencies between components) influences the points where teams collaborate and coordinate. Coordination challenges are especially observed when teams cannot easily and informally communicate, often studied in the context of distributed teams of global corporations [38, 67] and open-source ecosystems [16, 94].

More broadly, interdisciplinary collaboration often poses challenges. It has been shown that when team members differ in their academic and professional backgrounds and possess different expectations on the same system, communication, cultural, and methodical challenges often emerge when working together [21, 72]. Key insights are that successful interdisciplinary collaboration depends on professional role, structural characteristics, personal characteristics, and a history of collaboration; specifically, structural factors such as unclear mission, insufficient time, excessive workload, and lack of administrative support are barriers to collaboration [24].

The component interface plays a key role in collaboration as a negotiation and collaboration point. It is where teams (re-)negotiate how to divide work and assign responsibilities [19]. Team members often seek information that may not be captured in interface descriptions, as interfaces are rarely fully specified [32]. In an idealized development process, interfaces are defined early based on what is assumed to remain stable [71], because changes to interfaces later are expensive and require the involvement of multiple teams. In addition, interfaces reflect key architectural decisions for the system, aimed to achieve desired overall qualities [11].

In practice though, the idealized divide-and-conquer approach following top-down planning does not always work without friction. Not all changes can be anticipated, leading to later modifications and renegotiation of interfaces [16, 31]. It may not be possible to identify how to decompose work and design stable interfaces until substantial experimentation has been performed [12]. To manage, negotiate, and communicate changes of interfaces, developers have adopted a wide range of strategies for communication [16, 33, 96], often relying on informal broadcast mechanisms to share planned or performed changes with other teams.

Software lifecycle models [22] also address this tension of when and how to design stable interfaces: traditional top-down models (e.g., waterfall) plan software design after careful requirements analysis; the spiral model pursues a risk-first approach in which developers iterate to prototype risky parts, which then informs future system design iterations; agile approaches de-emphasize upfront architectural design for fast iteration on incremental prototypes. The software architecture community has also grappled with the question of how much upfront architectural design is feasible, practical, or desirable [11, 106], showing a tension between the desire for upfront planning on one side and technical risks and unstable requirements on the other. In this context, our research explores how introducing machine learning into software projects challenges collaboration.

Software Engineering with ML Components. In an ML-enabled system, machine learning contributes one or multiple components to a larger system with traditional non-ML components. We refer to the whole system that an end user would use as the product. In some systems, the learned model may be a relatively small and isolated addition to a large traditional software system (e.g., audit prediction in tax software); in others it may provide the system's essential core with only minimal non-ML code around it (e.g., a sales prediction system sending daily predictions by email). In addition to models, an ML-enabled system typically also has components for training
and monitoring the model(s) [42, 51]. Much attention in practice recently focuses on building robust ML pipelines for training and deploying models in a scalable fashion, often under names such as "AI engineering," "SysML," and "MLOps" [51, 59, 66, 89]. In this work, we focus more broadly on the development of the entire ML-enabled system, including both ML and non-ML components.

Compared to traditional software systems, ML-enabled systems require additional expertise in data science to build the models and may place additional emphasis on expertise such as data management, safety, and ethics [5, 49]. In this paper, we primarily focus on the roles of software engineers and data scientists, who typically have different skills and educational backgrounds [48, 49, 83, 110]: Data science education tends to focus more on statistics, ML algorithms, and practical training of models from data (typically given a fixed dataset, not deploying the model, not building a system), whereas software engineering education focuses on engineering tradeoffs with competing qualities, limited information, limited budget, and the construction and deployment of systems. Research shows that software engineers who engage in data science without further education are often naive when building models [110] and that data scientists prefer to focus narrowly on modeling tasks [83] but are frequently faced with engineering work [105]. While there is plenty of work on supporting collaboration among software engineers [26, 33, 84, 114] and more recently on supporting collaboration among data scientists [104, 113], we are not aware of work exploring collaboration challenges between these roles, which we do in this work.

The software engineering community has recently started to explore software engineering for ML-enabled systems as a research field, with many contributions on bringing software-engineering techniques to ML tasks, such as testing models and ML algorithms [10, 20, 28, 109], deploying models [4, 13, 29, 34, 51], robustness and fairness of models [80, 93, 100], life cycles for ML models [1, 5, 34, 61, 73], and engineering challenges or best practices for developing ML components [3, 5, 18, 27, 40, 44, 60, 89]. A smaller body of work focuses on the ML-enabled system beyond the model, such as exploring system-level quality attributes [72, 92], requirements engineering [102], architectural design [112], safety mechanisms [17, 82], and user interaction design [7, 25, 111]. In this paper, we adopt this system-wide scope and explore how data scientists and software engineers work together to build the system with ML and non-ML components.

3 RESEARCH DESIGN

Because there is limited research on collaboration in building ML-enabled systems, we adopt a qualitative research strategy to explore collaboration points and corresponding challenges, primarily with stakeholder interviews. We proceeded in four steps: (1) We prepared interviews based on an initial literature review, (2) we conducted interviews, (3) we triangulated results with literature findings, and (4) we validated our findings with the interview participants. We base our research design on Straussian Grounded Theory [97, 98], which derives research questions from literature, analyzes interviews with open and axial coding, and consults literature throughout the process. In particular, we conduct interviews and literature analysis in parallel, with immediate and continuous data analysis, performing constant comparisons, and refining our codebook and interview questions throughout the study.

Table 1: Participant and Company Demographics
Participant Role (45): ML-focused (23), SE-focused (9), Management (5), Operations (2), Domain Expert (2), Other (4)
Participant Seniority (45): 5 years of experience or more (28), 2-5 years (9), under 2 years (8)
Company Type (28): Big tech (6), Non IT (4), Mid-size tech (11), Startup (5), Consulting (2)
Company Location (28): North America (11), South America (1), Europe (5), Asia (10), Africa (1)

Step 1: Scoping and interview guide. To scope our research and prepare for interviews, we looked for collaboration problems mentioned in existing literature on software engineering for ML-enabled systems (Sec. 2). In this phase, we selected 15 papers opportunistically through keyword search and our own knowledge of the field. We marked all sections in those papers that potentially relate to collaboration challenges between team members with different skills or educational backgrounds, following a standard open coding process [98]. Even though most papers did not talk about problems in terms of collaboration, we marked discussions that may plausibly relate to collaboration, such as data quality issues between teams. We then analyzed and condensed these codes into nine initial collaboration areas and developed an initial codebook and interview guide (provided in Supplement B and C at the end).

Step 2: Interviews. We conducted semi-structured interviews with 45 participants from 28 organizations, each 30 to 60 minutes long. All participants are involved in professional software projects using machine learning that are either already or planned to be deployed in production. In Table 1, we show the demographics of the interview participants and their organizations. Details can be found in Supplement A at the end.

We tried to sample participants purposefully (maximum variation sampling [36]) to cover participants in different roles, types of companies, and countries. We intentionally recruited most participants from organizations outside of big tech companies, as they represent the vast majority of projects that have recently adopted machine learning and often face substantially different challenges [40]. Where possible, we tried to separately interview multiple participants in different roles within the same organization to get different perspectives. We identified potential participants through personal networks, ML-related networking events, LinkedIn, and recommendations from previous interviewees and local tech leaders. We adapted our recruitment strategy throughout the research based on our findings, at later stages focusing primarily on specific roles and organizations to fill gaps in our understanding, until reaching saturation. For confidentiality, we refer to organizations by number and to participants by PXy, where X refers to the organization number and y distinguishes participants from the same organization.

We transcribed and analyzed all interviews. Then, to map challenges to collaboration points, we created visualizations of organizational structure and responsibilities in each organization (we
show two examples in Figure 1) and mapped collaboration problems mentioned in the interviews to collaboration points within these visualizations. We used these visualizations to further organize our data; in particular, we explored whether collaboration problems associate with certain types of organizational structures.

Step 3: Triangulation with literature. As we gained insights from interviews, we returned to the literature to identify related discussions and possible solutions (even if not originally framed in terms of collaboration) to triangulate our interview results. Relevant literature spans multiple research communities and publication venues, including machine learning, human-computer interaction, software engineering, systems, and various application domains (e.g., healthcare, finance), and does not always include obvious keywords; simply searching for machine-learning research yields a far too wide net. Hence, we decided against a systematic literature review and pursued a best-effort approach that relied on keyword search for topics surfaced in the interviews, as well as backward and forward snowballing. Out of over 300 papers read, we identified 61 as possibly relevant and coded them with the same evolving codebook. The complete list can be found in Supplement D.

Step 4: Validity check with interviewees. For checking fit and applicability as defined by Corbin and Strauss [98] and validating our findings, we went back to the interviewees after creating a full draft of this paper. We presented the interviewees both a summary and the full draft, including the supplementary material, along with questions prompting them to look for correctness and areas of agreement or disagreement (i.e., fit), and any insights gained from reading about experiences of the other companies, roles, or findings as a whole (i.e., applicability). Ten interviewees responded with comments and all indicated general agreement; some explicitly reaffirmed some findings. We incorporated two minor suggested changes about details of two organizations.

Threats to validity and credibility. Our work exhibits the typical threats common and expected for this kind of qualitative research. Generalizations beyond the sampled participant distribution should be made with care; for example, we interviewed few managers, no dedicated data experts, and no clients. In several organizations, we were only able to interview a single person, giving us a one-sided perspective. Observations may be different in organizations in specific domains or geographic regions not well represented in our data. Self-selection of participants may influence results; for example, developers in government-related projects more frequently declined interview requests. As described earlier, we followed standard practices for coding and memoing, but, as usual in qualitative research, we cannot entirely exclude biases introduced by us researchers.

4 DIVERSITY OF ORG. STRUCTURES

Throughout our interviews, we found that the number and type of teams that participate in ML-enabled system development differs widely, as do their composition and responsibilities, power dynamics, and the formality of their collaborations, in line with findings by Aho et al. [1]. To illustrate these differences, we provide simplified descriptions of teams found in two organizations in Figure 1. We show teams and their members, as well as the artifacts for which they are responsible, such as who develops the model, who builds a repeatable pipeline, who operates the model (inference), who is responsible for or owns the data, and who is responsible for the final product. A team often has multiple responsibilities and interfaces with other teams at multiple collaboration points. Where unambiguous, we refer to teams by their primary responsibility as product team or model team.

Organization 3 (Figure 1, top) develops an ML-enabled system for a government client. The product (health domain), including an ML model and multiple non-ML components, is developed by a single 8-person team. The team focuses on training a model first, before building a product around it. Software engineering and data science tasks are distributed within the team, where members cluster into groups with different responsibilities and roughly equal negotiation power. A single data scientist is part of this team, though they feel somewhat isolated. Data is sourced from public sources. The relationship between the client and development team is somewhat distant and formal. The product is delivered as a service, but the team only receives feedback when things go wrong.

Organization 7 (Figure 1, bottom) develops a product for in-house use (quality control for a production process). A small team is developing and using the product, but model development is delegated to an external team (different company) composed of four data scientists, of which two have some software engineering background. The product team interacts with the model team to define and revise model requirements based on product requirements. The product team provides confidential proprietary data for training. The model team deploys the model and provides a ready-to-use inference API to the product team. The relationship between the teams crosses company boundaries and is rather distant and formal. The product team clearly has the power in negotiations between the teams.

These two organizations differed along many dimensions, and we found no clear global patterns when looking across organizations. Nonetheless, patterns did emerge when focusing on three specific collaboration aspects, as we will discuss in the next sections.

5 COLLABORATION POINT: REQUIREMENTS AND PLANNING

In an idealized top-down process, one would first solicit product requirements and then plan and design the product by dividing work into components (ML and non-ML), deriving each component's requirements/specifications from the product requirements. In this process, collaboration is needed for: (1) the product team needs to negotiate product requirements with clients and other stakeholders; (2) the product team needs to plan and design the product decomposition, negotiating with component teams the requirements for individual components; and (3) the product project manager needs to plan and manage the work across teams in terms of budgeting, effort estimation, milestones, and work assignments.

5.1 Common Development Trajectories

Few organizations, if any, follow an idealized top-down process, and it may not even be desirable, as we will discuss later. While we did not find any global patterns for organizational structures (Sec. 4), there are indeed distinct patterns relating to how organizations elicit requirements and decompose their systems. Most importantly,
we see differences in terms of the order in which teams identify product and model requirements:

Model-first trajectory: 13 of the 28 organizations (3, 10, 14–17, 19, 20, 22, 23, 25–27) focus on building the model first, and build a product around the model later. In these organizations, product requirements are usually shaped by model capabilities after the (initial) model has been created, rather than being defined upfront. In organizations with separate model and product teams, the model team typically starts the project and the product team joins later with low negotiating power to build a product around the model.

Product-first trajectory: In 13 organizations (1, 4, 5, 7–9, 11–13, 18, 21, 24, 28), models are built later to support an existing product. In these cases, a product often already exists and product requirements are collected for how to extend the product with new ML-supported functionality. Here, the model requirements are derived from the product requirements and often include constraints on model qualities, such as latency, memory, and explainability.

Parallel trajectory: Two organizations (2, 6) follow no clear temporal order; model and product teams work in parallel.

5.2 Product and Model Requirements

We found a constant tension between product and model requirements in our interviews. Functional and nonfunctional product requirements set expectations for the entire product. Model requirements set goals and constraints for the model team, such as expected accuracy and latency, target domain, and available data.

Product requirements require input from the model team (�, �). A common theme in the interviews is that it is difficult to elicit product requirements without a good understanding of ML capabilities, which almost always requires involving the model team and performing some initial modeling when eliciting product requirements. Regardless of whether product requirements or model requirements are elicited first, data scientists often mentioned being faced with unrealistic expectations about model capabilities.

Participants that interact with clients to negotiate product requirements (which may involve members of the model team) indicate that they need to educate clients about capabilities of ML techniques to set correct expectations (P3a, P6a, P6b, P7b, P9a, P10a, P15c, P19b, P22b, P24a). This need to educate customers about ML capabilities has also been raised in the literature [1, 17, 44, 49, 99, 102, 105].

For many organizations, especially in product-first trajectories, the model team indicates similar challenges when interacting with the product team. If the product team does not involve the model team in negotiating product requirements, the product team may not identify what data is needed for building the model, and may commit to unrealistic requirements. For example, P26a shared "For this project, [the project manager] wanted to claim that we have no false positives and I was like, that's not gonna work." Members of the model team often report lack of ML literacy in members of the product team and project managers (P1b, P4a, P7a, P12a, P26a, P27a) and a lack of involvement (e.g., P7b: "The [product team] decided what type of data would make sense. I had no say on that."). Usually the product team cannot identify product requirements alone; instead, product and model teams need to interact to explore what is achievable.

In organizations with a model-first trajectory, members of the model team sometimes engage directly with clients (and also report having to educate them about ML capabilities). However, when requirements elicitation is left to the model team, members tend to focus on requirements relevant for the model, but neglect requirements for the product, such as expectations for usability, e.g., P3c's customers "were kind of happy with the results, but weren't happy with the overall look and feel or how the system worked." Several research papers similarly identified how the goals of data scientists diverge from product goals if product requirements are not obvious at modeling time, leading to inefficient development, worse products, or constant renegotiation of requirements, especially [66, 72, 111].

Model development with unclear model requirements is common (�). Participants from model teams frequently explain how they are expected to work independently, but are given sparse model requirements. They try to infer intentions behind them, but are constrained by having limited understanding of the product that the model will eventually support (P3a, P3b, P16b, P17b, P19a). Model teams often start with vague goals and model requirements evolve over time as product teams or clients refine their expectations in response to provided models (P3b, P7a, P9a, P5b, P19b, P21a). Especially in organizations following the model-first trajectory, model teams may receive some data and a goal to predict something with high accuracy, but no further context, e.g., P3a shared "there isn't always an actual spec of exactly what data they have, what data they think they're going to have and what they want the model to do." Several papers similarly report projects starting with vague model goals [50, 76, 82, 110].

Even in organizations following a product-first trajectory, product requirements are often not translated into clear model requirements. For example, participant P17b reports how the model team was not clear about the model's intended target domain, thus could not decide what data was considered in scope. As a consequence, model teams usually cannot focus just on their component, but have to understand the entire product to identify model requirements in the context of the product (P3a, P10a, P13a, P17a, P17b, P19b, P20b, P23a), requiring interactions with the product team or even bypassing the product team to talk directly to clients. The difficulty of providing clear requirements for an ML model has also been raised in the literature [49, 55, 79, 91, 103, 110], partially arguing that uncertainty makes it difficult to specify model requirements upfront [1, 44, 50, 68, 105]. Ashmore et al. report mapping product requirements to model requirements as an open challenge [10].

Provided model requirements rarely go beyond accuracy and data security (�, �). Requirements given to model teams primarily relate to some notion of accuracy. Beyond accuracy, requirements for data security and privacy are common, typically imposed by the data owner or by legal requirements (P5a, P7a, P9a, P13a, P14a, P18a, P20a-b, P21a-b, P22a, P23a, P24a, P25a, P26a). Literature also frequently discusses how privacy requirements impact and restrict ML work [15, 41, 43, 55, 56, 77].

We rarely heard of any qualities other than accuracy. Some participants report that ignoring qualities such as latency or scalability has resulted in integration and operation problems (P3c, P11a). In a few cases requirements for inference latency were provided (P1a, P6a, P14a) and in one case hardware resources provided constraints on memory usage (P14a), but no other qualities such as training
latency, model size, fairness, or explainability were required that could be important for product integration and deployment.

When prompted, very few of our interviewees report considerations for fairness either at the product or the model level. Only two participants from model teams (P14a, P22a) reported receiving fairness requirements, whereas many others explicitly mentioned that fairness is not a concern for them yet (P4a, P5b, P6b, P11a, P15c, P20a, P21b, P25a, P26a). The lack of fairness and explainability requirements is in stark contrast to the emphasis that these qualities receive in the literature [e.g., 7, 15, 25, 39, 40, 57, 88, 91, 108, 113].

Recommendations. Our observations suggest that involving data scientists early when soliciting product requirements is important (�) and that pursuing a model-first trajectory entirely without considering product requirements is problematic (�). Conversely, model requirements are rarely specific enough to allow data scientists to work in isolation without knowing the broader context of the system, and interaction with the product team should ideally be planned as part of the process. Requirements form a key collaboration point between product and model teams, which should be emphasized even in more distant collaboration styles (e.g., outsourced model development). The few organizations that use the parallel trajectory report fewer problems by involving data scientists in negotiating product requirements to discard unrealistic ones early on (P6b). Vogelsang and Borg also provide similar recommendations to consult data scientists from the beginning to help elicit requirements [102]. While many papers place emphasis on clearly defining ML use cases and scope [49, 92, 99], several others mention how collaboration of technical and non-technical stakeholders such as domain experts helps [72, 88, 103, 105].

ML literacy for customers and product teams appears to be important (�). P22a and P19a suggested conducting technical ML training sessions to educate clients; similar training is also useful for members of product teams. Several papers argue for similar training for non-technical users of ML products [44, 88, 102].

Most organizations elicit requirements only rather informally and rarely have good documentation, especially when it comes to model requirements. It seems beneficial to adopt more formal requirements documentation for product and model (�), as several participants reported that it fosters shared understanding at this collaboration point (P11a, P13a, P19b, P22a, P22c, P24a, P25a, P26a). Checklists could help to cover a broader range of model quality requirements, such as training latency, fairness, and explainability. Formalisms such as model cards [64] and FactSheets [8] could be used as a starting point for documenting model requirements.
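To make the documentation recommendation more concrete, the sketch below shows one possible machine-readable starting point for a model requirements record in the spirit of model cards and FactSheets. It is our illustration only: the field names, example values, and the choice of a Python dataclass are assumptions, not a format used by any interviewed organization.

```python
# Illustrative only: a lightweight, machine-readable "model requirements" record
# in the spirit of model cards / FactSheets. All field names are hypothetical.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ModelRequirements:
    model_name: str
    intended_use: str                 # product context the model must support
    target_domain: str                # which inputs are in scope
    accuracy_metric: str              # how "good enough" will be judged
    accuracy_target: float
    latency_budget_ms: int            # inference latency expected by the product team
    memory_budget_mb: int
    training_latency_budget_h: float  # how quickly retraining must complete
    fairness_checks: list = field(default_factory=list)
    explainability_needs: str = "none specified"
    data_expectations: str = ""       # quantity/quality/refresh expectations
    responsible_team: str = ""        # who owns monitoring and retraining

    def to_json(self) -> str:
        """Serialize so the record can be versioned alongside code and data."""
        return json.dumps(asdict(self), indent=2)

if __name__ == "__main__":
    req = ModelRequirements(
        model_name="sales-forecast-v1",          # hypothetical example values
        intended_use="daily sales predictions emailed to regional managers",
        target_domain="retail transactions from EU stores, 2019 onward",
        accuracy_metric="MAPE on most recent 4 weeks of production data",
        accuracy_target=0.12,
        latency_budget_ms=200,
        memory_budget_mb=512,
        training_latency_budget_h=6.0,
        fairness_checks=["error rate parity across store regions"],
        explainability_needs="feature importances reviewable by product team",
        data_expectations="weekly refresh, >= 50k labeled rows, agreed schema",
        responsible_team="model team (retraining), product team (monitoring)",
    )
    print(req.to_json())
```

A record of this kind would make the qualities beyond accuracy (latency, memory, fairness, explainability) explicit at the requirements collaboration point, rather than leaving them to be discovered at integration time.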
5.3 Project Planning

ML uncertainty makes effort estimation difficult (�). Irrespective of trajectory, 19 participants (P3a, P4a, P7a-b, P8a, P14b, P15b-c, P16a, P17a, P18a, P19a-b, P20a, P22a-c, P23a, P25a) mentioned that the uncertainty associated with ML components makes it difficult to estimate the timeline for developing an ML component and by extension the product. Model development is typically seen as a science-like activity, where iterative experimentation and exploration is needed to identify whether and how a problem can be solved, rather than as an engineering activity that follows a somewhat predictable process. This science-like nature makes it difficult for the model team to set expectations or contracts with clients or the product team regarding effort, cost, or accuracy. While data scientists find effort estimation difficult, lack of ML literacy in managers makes it worse (P15b, P16a, P19b, P20a, P22b). Teams report deploying subpar models when running out of time (P3a, P15b, P19a), or postponing or even canceling deployments (P25a). These findings align with literature mentioning difficulties associated with effort estimation for ML tasks [1, 9, 61, 105] and planning projects in a structured manner with diverse methodologies, with diverse trajectories, and without practical guidance [1, 17, 61, 105].

Generally, participants frequently report that synchronization between teams is challenging because of different team pace, different development processes, and tangled responsibilities (P2a, P11a, P12a, P14a-b, P15b-c, P19a; see also Sec. 7.2).

Recommendations. Participants suggested several mitigation strategies: keeping extra buffer times and adding additional timeboxes for R&D in initial phases (P8a, P19a, P22b-c, P23a; �), and continuously involving clients in every phase so that they can understand the progression of the project and be aware of potential missed deadlines (P6b, P7a, P22a, P23a; �). From the interviews, we also observe the benefits of managers who understand both software engineering and machine learning and can align product and model teams toward common goals (P2a, P6a, P8a, P28a; �).

6 COLLABORATION POINT: TRAINING DATA

Data is essential for machine learning, but disagreements and frustrations around training data were the most common collaboration challenges mentioned in our interviews. In most organizations, the team that is responsible for building the model is not the team that collects, owns, and understands the data, making data a key collaboration point between teams in ML-enabled systems development.

6.1 Common Organizational Structures

We observed three patterns around data that influence collaboration challenges from the perspective of the model team:

Provided data: The product team has the responsibility of providing data to the model team (org. 6–8, 13, 18, 21, 23). The product team is the initial point of contact for all data-related questions from the model team. The product team may own the data or acquire it from a separate data team (internal or external). Coordination regarding data tends to be distant and formal, and the product team tends to hold more negotiation power.

External data: The product team does not have direct responsibility for providing data, but instead, the model team relies on external data providers. Commonly, the model team (i) uses publicly available resources (e.g., academic datasets, org. 2–4, 6, 19) or (ii) hires a third party for collecting or labeling data (org. 9, 15–17, 22, 23). In the former case, the model team has little to no negotiation power over data; in the latter, it can set expectations.

In-house data: Product, model, and data teams are all part of the same organization and the model team relies
on internal data from that organization (org. 1, 5, 9–12, 14, 20, 24–28). In these cases, both product and model teams often find it challenging to negotiate access to internal data due to differing priorities, internal politics, permissions, and security constraints.

6.2 Negotiating Data Quality and Quantity

Disagreements and frustrations around training data were the most common collaboration challenges in our interviews. In almost every project, data scientists were unsatisfied with the quality and quantity of data they received at this collaboration point, in line with a recent survey showing data availability and management to be the top-ranked challenge in building ML-enabled systems [5].

Provided and public data is often inadequate (�, �). In organizations where data is provided by the product team, the model team commonly states that it is difficult to get sufficient data (P7a, P8a, P13a, P22a, P22c). The data that they receive is often of low quality, requiring significant investment in data cleaning. Similar to the requirements challenges discussed earlier, they often state that the product team has little knowledge or intuition for the amount and quality of data needed. For example, participant P13a stated that they were given a spreadsheet with only 50 rows to build a model and P7a reported having to spend a lot of time convincing the product team of the importance of data quality. This aligns with past observations that software engineers often have little appreciation for data quality concerns [49, 54, 65, 76, 83] and that training data is often insufficient and incomplete [6, 43, 56, 76, 82, 92, 105].

When the model team uses public data sources, its members also have little influence over data quality and quantity and report significant effort for cleaning low quality and noisy data (P2a, P3a, P4a, P3c, P6b, P19b, P23a). Papers have similarly questioned the representativeness and trustworthiness of public training data [34, 102, 108] as "nobody gets paid to maintain such data" [104].

Training-serving skew is a common challenge when training data is provided to the model team: models show promising results, but do not generalize to production data because it differs from provided training data (P4a, P8a, P13a, P15a, P15c, P21a, P22c, P23a) [9, 23, 55, 56, 76–78, 83, 99, 108, 115]. Our interviews show that this skew often originates from inadequate training data combined with unclear information about production data, and therefore no chance to evaluate whether the training data is representative of production data.

Data understanding and access to domain experts is a bottleneck (�, �). Existing data documentation (e.g., data item definitions, semantics, schema) is almost never sufficient for model teams to understand the data (also mentioned in a prior study [46]). In the absence of clear documentation, team members often collect information and keep track of unwritten details in their heads (P5a), known as institutional or tribal knowledge [5, 40]. Data understanding and debugging often involve members from different teams and thus cause challenges at this collaboration point.

Model teams receiving data from the product team report struggling with data understanding and having a difficult time getting help from the product team (or the data team that the product team works with) (P8a, P7b, P13a). As the model team does not have direct communication with the data team, data understanding issues often cannot be resolved effectively. For example, P13a reports "Ideally, for us it would be so good to spend maybe a week or two with one person continuously trying to understand the data. It's one of the biggest problems actually, because even if you have the person, if you're not in contact all the time, then you misinterpreted some things and you build on it." The low negotiation power of the model team in these organizations hinders access to domain experts.

Model teams using public data similarly struggle with data understanding and getting help (P3a, P4a, P19a), relying on sparse data documentation or trying to reach any experts on the data.

For in-house projects, in several organizations the model team relies on data in shared databases (org. 5, 11, 26, 27, 28), collected by instrumenting a production system, but shared by multiple teams. Several teams shared problems with evolving and often poorly documented data sources, as participant P5a illustrates "[data rows] can have 4,000 features, 10,000 features. And no one really cares. They just dump features there. [...] I just cannot track 10,000 features." Model teams face challenges in understanding data and identifying a team that can help (P5a, P25a, P20b, P27a), a problem also reported in a prior study about data scientists at Microsoft [49].

Challenges in understanding data and needing domain experts are also frequently mentioned in the literature [13, 40, 41, 46, 49, 65, 76, 83], as is the danger of building models with insufficient understanding of the data [34, 102]. Although we are not aware of literature discussing the challenges of accessing domain experts, papers have shown that even when data scientists have access, effective knowledge transfer is challenging [70, 90].

Ambiguity when hiring a data team (�). When the model team hires an external data team for collecting or labelling data (org. 9, 15, 16, 17, 22, 23), the model team has much more negotiation power over setting data quality and quantity expectations (though Kim et al. report that model teams may have difficulty getting buy-in from the product team for hiring a data team in the first place [49]). Our interviews did not surface the same frustrations as with provided data and public data, but instead participants from these organizations reported communication vagueness and hidden assumptions as key challenges at this collaboration point (P9a, P15a, P15c, P16a, P17b, P22a, P22c, P23a). For example, P9a related how different labelling companies given the same specification widely disagreed on labels, when the specification was not clear enough. We found that expectations between model and data teams are often communicated verbally without clear documentation. As a result, the data team often does not have sufficient context to understand what data is needed. For example, participant P17b states "Data collectors can't understand the data requirements all the time. Because, when a questionnaire [for data collection] is designed, the overview of the project is not always described to them. Even if we describe it, they can't always catch it." Reports about low quality data from hired data teams have also been discussed in the literature [10, 43, 55, 83, 102, 105].

Need to handle evolving data (�, �). In most projects, models need to be regularly retrained with more data or adapted to changes in the environment (e.g., data drift) [42, 55, 83], which is a challenge for many model teams (P3a, P3c, P5a, P7a-b, P11a, P15c, P18a, P19b, P22a). When product teams provide the data, they often
have a static view and provide only a single snapshot of data rather than preparing for updates, and model teams with their limited negotiation power have a difficult time fostering a more dynamic mindset (P7a-b, P15c, P18a, P22a), as expressed by participant P15c: "People don't understand that for a machine learning project, data has to be provided constantly." It can be challenging for a model team to convince the product team to invest in continuous model maintenance and evolution (P7a, P15c) [46].

Conversely, if data is provided continuously (most commonly with public data sources, in-house sources, and own data teams), model teams struggle with ensuring consistency over time. Data sources can suddenly change without announcement (e.g., changes to schema, distributions, semantics), surprising model teams that make but do not check assumptions about the data (P3a, P3c, P19b). For example, participants P5a and P11a report similar challenges with in-house data, where their low negotiation power does not allow them to set quality expectations, but they face undesired and unannounced changes in data sources made by other teams. Most organizations do not have a monitoring infrastructure to detect changes in data quality or quantity, as we will discuss in Sec. 7.3.

In-house priorities and security concerns often obstruct data access (�). In in-house projects, we frequently heard about the product or model team struggling to work with another team within the same organization that owns the data. Often, these in-house projects are local initiatives (e.g., logistics optimization) with more or less buy-in from management and without buy-in from other teams that have their own priorities; sometimes other teams explicitly question the business value of the product. The interviewed model teams usually have little negotiation power to request data (especially if it involves collecting additional data) and almost never get an agreement to continuously receive data in a certain format, quality, or quantity (P5a, P10a, P11a, P20a-b, P27a) (also observed in studies at Microsoft, ING and other organizations [34, 49, 65]). For example, P10a shared "we wanted to ask the data warehouse team to [provide data], and it was really hard to get resources. They wouldn't do that because it was hard to measure the impact [our in-house project] had on the bottom line of the business." Model teams in these settings tend to work with whatever data they can get eventually. Security and privacy concerns can also limit access to data (P7a, P7b, P21a-b, P22a, P24a) [46, 55, 56, 65, 76], especially when data is owned by a team in a different organization, causing frustration, lengthy negotiations, and sometimes expensive data-handling restrictions (e.g., no use of cloud resources) for model teams.

Recommendations. Data quality and quantity is important to model teams, yet they often find themselves in a position of low negotiation power, leading to frustration and collaboration inefficiencies. Model teams that have the freedom to set expectations and hire their own data teams are noticeably more satisfied. When planning the entire product, it seems important to pay special attention to this collaboration point, and budget for data collection, access to domain experts, or even a dedicated data team (�). Explicitly planning to provide substantial access to domain experts early in the project was suggested as important (P25a).

We found it surprising that despite the importance of this collaboration point there is little written agreement on expectations and often limited documentation (�), even when hiring a dedicated data team—in stark contrast to more established contracts for traditional software components. Not all organizations allow the more agile, constant close collaboration between model and data teams that some suggest [76, 78]. With a more formal or distant relationship (e.g., across organizations, teams without buy-in), it seems beneficial to adopt a more formal contract, specifying data quantity and quality expectations, which are well researched in the database literature [58] and have been repeatedly discussed in the context of ML-enabled systems [43, 46, 49, 56, 90]. This has also been framed as data requirements in the software engineering literature [82, 99, 102]. When working with a dedicated data team, participants suggested to invest in making expectations very clear, for example, by providing precise specifications and guidelines (P9a, P6b, P28a), running training sessions for the data collectors and annotators (P17b, P22c), and measuring inter-rater agreement (P6b).

Automated checks are also important as data evolves (�). For example, participant P13a mentioned proactively setting up data monitoring to detect problems (e.g., schema violations, distribution shifts) at this collaboration point; a practice suggested also in the literature [53, 56, 76, 78, 83, 88, 99] and supported by recent tooling [e.g., 47, 78, 85]. The risks regarding possible unnoticed changes to data make it important to consider data validation and monitoring infrastructure as a key feature of the product early on (�, �), as also emphasized by several participants (P5a, P25a, P26a, P28a).
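As an illustration of the kind of automated check described above—schema validation plus a crude distribution-shift warning on each data delivery—the following sketch assumes a pandas-based workflow; the column names, expected dtypes, and drift threshold are hypothetical, and production teams would more likely rely on dedicated data-validation tooling.

```python
# Illustrative only: a minimal data "contract" check a model team could run on every
# data delivery -- schema validation plus a crude distribution-shift warning.
# Column names, dtypes, and thresholds are hypothetical assumptions.
import pandas as pd

EXPECTED_SCHEMA = {           # agreed with the data-providing team
    "customer_id": "int64",
    "purchase_amount": "float64",
    "region": "object",
}

def check_schema(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema violations."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"column {col}: expected {dtype}, got {df[col].dtype}")
    for col in df.columns:
        if col not in EXPECTED_SCHEMA:
            problems.append(f"unexpected column: {col}")
    return problems

def check_drift(reference: pd.Series, delivered: pd.Series, max_shift: float = 3.0) -> list:
    """Warn if the delivered column's mean moved far from the reference distribution."""
    std = reference.std() or 1e-9
    shift = abs(delivered.mean() - reference.mean()) / std
    if shift > max_shift:
        return [f"{delivered.name}: mean shifted {shift:.1f} std devs from reference"]
    return []

if __name__ == "__main__":
    reference = pd.DataFrame({"customer_id": [1, 2, 3],
                              "purchase_amount": [10.0, 12.0, 11.0],
                              "region": ["eu", "eu", "us"]})
    delivery = pd.DataFrame({"customer_id": [4, 5, 6],
                             "purchase_amount": [95.0, 110.0, 105.0],  # suspicious jump
                             "region": ["eu", "us", "us"]})
    report = check_schema(delivery) + check_drift(reference["purchase_amount"],
                                                  delivery["purchase_amount"])
    print("\n".join(report) or "delivery passed all checks")
```

Even a simple check of this kind, run automatically when data is handed over, would surface the unannounced schema and distribution changes that several model teams reported discovering only after models degraded.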
7 COLLABORATION POINT: PRODUCT-MODEL INTEGRATION

As discussed earlier, to build an ML-enabled system both ML components and traditional non-ML components need to be integrated and deployed, requiring data scientists and software engineers to work together, typically across multiple teams. We found many conflicts at this collaboration point, stemming from unclear processes and responsibilities, as well as differing practices and expectations.

7.1 Common Organizational Structures

We saw large differences among organizations in how engineering responsibilities were assigned, most visible in how responsibility for model deployment and operation is assigned, which typically involves significant engineering effort for building reproducible pipelines, API design, or cloud deployment, often with MLOps technologies. We found the following patterns:

Shared model code: In some organizations (2, 6, 23, 25), the model team is responsible only for model development and delivers training code (e.g., in a notebook) or model files to the product team; the product team takes responsibility for deployment and operation of the model, possibly rewriting the training code as a pipeline. Here, the model team has little or no engineering responsibilities.

Model as API: In most organizations (18 out of 28), the model team is responsible for developing and deploying the model. Hence, the model team requires substantial engineering skills in addition
to data science expertise. Here, some model teams are mostly composed of data scientists with little engineering capabilities (org. 7, 13, 17, 22, 26), some consist mostly of software engineers who have picked up some data science knowledge (org. 4, 15, 16, 18, 19, 21, 24), and others have mixed team members (org. 1, 9, 11, 12, 14, 28). These model teams typically provide an API to the product team, or release individual model predictions (e.g., shared files, email; org. 17, 19, 22) or install models directly on servers (org. 4, 9, 12).
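The following minimal sketch illustrates the "model as API" pattern: the model team wraps a trained model behind an HTTP endpoint that the product team consumes. The use of Flask, the endpoint name, the payload fields, and the placeholder model are our assumptions for illustration and are not drawn from the interviews.

```python
# Illustrative only: a minimal "model as API" wrapper in the spirit of the pattern
# described above. The framework choice (Flask), endpoint name, payload fields, and
# the dummy model are assumptions, not details from the study.
from flask import Flask, jsonify, request

app = Flask(__name__)

class DummyModel:
    """Stand-in for a trained model loaded from the model team's storage."""
    def predict(self, features: list) -> float:
        return sum(features) / max(len(features), 1)

model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        # Making input expectations explicit at the API boundary helps avoid the
        # mismatched-assumption problems participants reported at integration time.
        return jsonify({"error": "expected non-empty 'features' list"}), 400
    return jsonify({"prediction": model.predict(features),
                    "model_version": "sales-forecast-v1"})  # hypothetical version tag

if __name__ == "__main__":
    # The product team consumes this endpoint, e.g.:
    #   curl -X POST localhost:8080/predict -H 'Content-Type: application/json' \
    #        -d '{"features": [1.0, 2.0, 3.0]}'
    app.run(host="0.0.0.0", port=8080)
```

In this pattern the API (input format, error behavior, version identifier) is the contract between model and product teams, which is why documenting it carefully matters, as discussed in Section 7.2.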
All-in-one: If only few people work on model and product, sometimes a single team (or even a single person) shares all responsibilities (org. 3, 5, 10, 20, 27). It can be a small team with only data scientists (org. 10, 20, 27) or mixed teams with data scientists and software engineers (org. 3, 5).

We also observed two outliers: One startup (org. 8) had a distinct model deployment team, allowing the model team to focus on data science without much engineering responsibility. In one large organization (org. 28), an engineering-focused model team (model as API) was supported by a dedicated research team focused on data-science research with fewer engineering responsibilities.

7.2 Responsibility and Culture Clashes

Interdisciplinary collaboration is challenging (cf. Sec. 2). We observed many conflicts between data science and software engineering culture, made worse by unclear responsibilities and boundaries.

Team responsibilities often do not match capabilities and preferences (�). When the model team has responsibilities requiring substantial engineering work, we observed some dissatisfaction when its members were assigned undesired responsibilities. Data scientists preferred engineering support rather than needing to do everything themselves (P7a-b, P13a), but can find it hard to convince management to hire engineers (P10a, P20a, P20b). For example, P10a describes "I was struggling to change the mindset of the team lead, convincing him to hire an engineer...I just didn't want this to be my main responsibility." Especially in small teams, data scientists report struggling with the complexity of the typical ML infrastructure (P7b, P9a, P14a, P26a, P28a).

In contrast, when deployment is the responsibility of software engineers in the product team or of dedicated engineers in all-in-one teams, some of those engineers report problems integrating the models due to insufficient knowledge on model context or domain, and the model code not being packaged well for deployment (P20b, P23a, P27a). In several organizations, we heard about software engineers performing ML tasks without having enough ML understanding (P5a, P15b-c, P16b, P18b, P19b, P20b). Mirroring observations from past research [110], P5a reports "there are people who are ML engineers at [company], but they don't really understand ML. They were actually software engineers... they don't understand [overfitting, underfitting, ...]. They just copy-paste code."

Siloing data scientists fosters integration problems (�, �). We observed data scientists often working in isolation—known as siloing—in all types of organizational structures, even within single small teams (see Sec. 4) and within engineering-focused teams. In such settings, data scientists often work in isolation with weak requirements (cf. Sec. 5.2) without understanding the larger context, seriously engaging with others only during integration (P3a, P3c, P6a, P7b, P11a, P13a, P15b, P25a) [41], where problems may surface. For example, participant P11a reported a problem where product and model teams had different assumptions about the expected inputs and the issue could only be identified after a lot of back and forth between teams at a late stage in the project.

Technical jargon challenges communication (�). Participants frequently described communication issues arising from differing terminology used by members from different backgrounds (P1a-b, P2a, P3a, P5b, P8a, P12a, P14a-b, P16a, P17a-b, P18a-b, P20a, P22b, P23a), leading to ambiguity, misunderstandings, and inconsistent assumptions (on top of communication challenges with domain experts) [1, 46, 75, 103]. P1b reports, "There are a lot of conversations in which disambiguation becomes necessary. We often use different kinds of words that might be ambiguous." For example, data scientists may refer to prediction accuracy as performance, a term many software engineers associate with response time. These challenges can be observed more frequently between teams, but they even occur within a team with members from different backgrounds (P3a-c, P20a).

Code quality, documentation, and versioning expectations differ widely and cause conflicts (�, �). Many participants reported conflicts around development practices between data scientists and software engineers during integration and deployment. Participants report poor practices that may also be observed in traditional software projects, but software engineers in particular expressed frustration in interviews that data scientists do not follow the same development practices or have the same quality standards when it comes to writing code. Reported problems relate to poor code quality (P1b, P2a, P3b, P5a, P6a-b, P10a, P11a, P14a, P15b-c, P17a, P18a, P19a, P20a-b, P26a) [9, 27, 34, 37, 74, 86, 105], insufficient documentation (P5a-b, P6a-b, P10a, P15c, P26a) [8, 46, 64, 113], and not extending version control to data and models (P3c, P7a, P10a, P14a, P20b). In two shared-model-code organizations, participants report having to rewrite code from the data scientists (P2a, P6a-b). Missing documentation for ML code and models is considered the cause for different assumptions that lead to incompatibility between ML and non-ML components (P10a) and for losing knowledge and even the model when faced with turnover (P6a-b). Recent papers similarly hold poor documentation responsible for team decisions becoming invisible and inadvertently causing hidden assumptions [34, 40, 43, 46, 75, 113]. Hopkins and Booth described model and data versioning in small companies as desired but "elusive" [40].

Recommendations. Many conflicts relate to boundaries of responsibility (especially for engineering responsibilities) and to different expectations by team members with different backgrounds. Better teams tend to define processes, responsibilities, and boundaries more carefully (�), document APIs at collaboration points between teams (�), and recruit dedicated engineering support for model deployment (�), but also establish a team culture with mutual understanding and exchange (�). Big tech companies usually have more established processes and clearer responsibility assignments than smaller organizations and startups that often follow ad-hoc processes or figure out responsibilities as they go.

The need for engineering skills for ML projects has frequently been discussed [5, 66, 86, 89, 95, 111, 115], but our interviewees
The need for engineering skills for ML projects has frequently been discussed [5, 66, 86, 89, 95, 111, 115], but our interviewees differ widely in whether all data scientists should have substantial engineering responsibilities or whether engineers should support data scientists so that they can focus on their core expertise (�). Especially interviewees from big tech emphasized that they expect engineering skills from all data science hires (P28a). Others emphasized that recruiting software engineers and operations staff with basic data-science knowledge can help with many communication and integration tasks, such as converting experimental ML code for deployment (P2a, P3b), fostering communication (P3c, P25a), and monitoring models in production (P5b). Generally, siloing data scientists is widely recognized as problematic, and many interviewees suggest practices for improving communication (�), such as training sessions for establishing common terminology (P11a, P17a, P22a, P22c, P23a), weekly all-hands meetings to present all tasks and synchronize (P2a, P3c, P6b, P11a), and proactive communication to broadcast upcoming changes in data or infrastructure (P11a, P14a, P14b). This mirrors suggestions to invest in interdisciplinary training [5, 48, 49, 68, 75, 111] and proactive communication [54].

7.3 Quality Assurance for Model and Product

During development and integration, questions of responsibility for quality assurance frequently arise, often requiring coordination and collaboration between multiple teams. This includes evaluating components individually (including the model) as well as their integration and the whole system, often including evaluating and monitoring the system online (in production).

Model adequacy goals are difficult to establish (�, �). Offline accuracy evaluation of models is almost always performed by the model team responsible for building the model, though often they have difficulty deciding locally when the model is good enough (P1a, P3a, P5a, P6a, P7a, P15b, P16b, P23a) [34, 44]. As discussed in Sec. 5 and Sec. 6, model team members often receive little guidance on model adequacy criteria and are unsure about the actual distribution of production data. They also voice concerns about establishing ground truth, for example, needing to support data for different clients, and hence not being able to establish (offline) measures for model quality (P1b, P16b, P18a, P28a). As quality requirements beyond accuracy are rarely provided for models, model teams usually do not feel responsible for testing latency, memory consumption, or fairness (P2a, P3c, P4a, P5a, P6b, P7a, P14a, P15b, P20b). Whereas the literature discusses challenges in measuring the business impact of a model [10, 14, 43, 49] and balancing business goals with model goals [72], interviewed data scientists were concerned about this only with regard to convincing clients, managers, or product teams to provide resources (P7a-b, P10a, P26a, P27a).

Limited confidence without transparent model evaluation (�). Participants in several organizations report that model teams do not prioritize model evaluation and have no systematic evaluation strategy (especially if they do not have established adequacy criteria they try to meet), performing occasional "ad-hoc inspections" instead (P2a, P15b, P16b, P18b, P19b, P20b, P21b, P22a, P22b). Without transparency about their test processes and test results, other teams voiced reduced confidence in the model, leading to skepticism about adopting the model (P7a, P10a, P21b, P22a).

Unclear responsibilities for system testing (�). Teams often struggle with testing the entire product after integrating ML and non-ML components. Model teams frequently explicitly mentioned that they assume no responsibility for product quality (including integration testing and testing in production) and have not been involved in planning for system testing, but that their responsibilities end with delivering a model evaluated for accuracy (P3a, P14a, P15b, P25a, P26a). However, in several organizations, product teams also did not plan for testing the entire system with the model(s) and, at most, conducted system testing in an ad-hoc way (P2a, P6a, P16a, P18a, P22a). Recent literature has reported a similar lack of focus on system testing in product teams [13, 113], mirroring also a focus in academic research on testing models rather than testing the entire system [10, 20]. Interestingly, some established software development organizations delegated testing to an existing separate quality assurance team with no process or experience for testing ML products (P2a, P8a, P16a, P18b, P19a).

Planning for online testing and monitoring is rare (�, �, �). Due to possible training-serving skew and data drift, the literature emphasizes the need for online evaluation [4, 10, 13, 14, 23, 42, 44, 47, 51, 65, 86, 87, 89, 102]. With collected telemetry, one can usually approximate both product and model quality, monitor updates, and experiment in production [14]. Online testing usually requires coordination among multiple teams responsible for product, model, and operation. We observed that most organizations do not perform monitoring or online testing, as it is considered difficult, in addition to a lack of standard process, automation, or even test awareness (P2a, P3a, P3b, P4a, P6b, P7a, P10a, P15b, P16b, P18b, P19b, P25a, P27a). Only 11 out of 28 organizations collected any telemetry; it is most established in big tech organizations. When to retrain models is often decided based on intuition or manual inspection, though many aspire to more automation (P1a, P3a, P3c, P5a, P10a, P22a, P25a, P27a). Responsibilities around online evaluation are often neither planned nor assigned upfront as part of the project.

Most model teams are aware of possible data drift, but many do not have any monitoring infrastructure for detecting and managing drift in production. If telemetry is collected, it is the responsibility of the product or operations team and it is not always accessible to the model team. Four participants report that they rely on manual feedback about problems from the product team (P1a, P3a, P4a, P10a). At the same time, others report that product and operations teams do not necessarily have sufficient data science knowledge to provide meaningful feedback (P3a, P3b, P5b, P18b, P22a) [81].

Recommendations. Quality assurance involves multiple teams and benefits from explicit planning and making it a high priority (�). While the product team should likely take responsibility for product quality and system testing, such testing often involves building monitoring and experimentation infrastructure (�), which requires planning and coordination with teams responsible for model development, deployment, and operation (if separate) to identify the right measures. Model teams benefit from receiving feedback on their model from production systems, but such support needs to be planned explicitly, with corresponding engineering effort assigned and budgeted, even in organizations following a model-first trajectory. We suspect that education about the benefits of testing in production and common infrastructure (often under the label DevOps/MLOps [59]) can increase buy-in from all involved teams (�). Organizations that have established monitoring and experimentation infrastructure strongly endorse it (P5a, P25a, P26a, P28a).
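As a concrete illustration of the kind of lightweight telemetry-based monitoring discussed above, the sketch below compares an accuracy estimate from production telemetry against the offline baseline agreed with the model team and flags a possible drift. It is a simplified, hypothetical example; the window size, tolerance, and the source of delayed ground-truth labels are assumptions, not a description of any participant's system.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Tuple

@dataclass
class OnlineAccuracyMonitor:
    """Tracks a rolling accuracy estimate from labeled telemetry and flags
    when it drops noticeably below the offline baseline (possible drift)."""
    offline_baseline: float   # accuracy agreed on during offline evaluation
    tolerance: float = 0.05   # allowed drop before raising an alert
    window_size: int = 1000   # number of recent predictions to consider

    def __post_init__(self) -> None:
        self._window: Deque[bool] = deque(maxlen=self.window_size)

    def record(self, prediction: str, actual: str) -> None:
        self._window.append(prediction == actual)

    @property
    def online_accuracy(self) -> float:
        return sum(self._window) / len(self._window) if self._window else float("nan")

    def drifted(self) -> bool:
        # Only judge once the window has enough samples to be meaningful.
        return (len(self._window) == self.window_size
                and self.online_accuracy < self.offline_baseline - self.tolerance)

def check_telemetry(monitor: OnlineAccuracyMonitor,
                    labeled_telemetry: Iterable[Tuple[str, str]]) -> None:
    """labeled_telemetry yields (prediction, delayed ground-truth label) pairs,
    e.g., collected by the product or operations team."""
    for prediction, actual in labeled_telemetry:
        monitor.record(prediction, actual)
        if monitor.drifted():
            print(f"ALERT: online accuracy {monitor.online_accuracy:.3f} fell below "
                  f"baseline {monitor.offline_baseline:.3f}; notify the model team.")
```

Who owns such a monitor (product, operations, or model team) and who acts on its alerts is exactly the kind of responsibility that, per the recommendations above, benefits from being assigned and budgeted explicitly.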
Defining clear quality requirements for model and product can help all teams to focus their quality assurance activities (cf. Sec. 5; �). Even when it is challenging to define adequacy criteria upfront, teams can together develop a quality assurance plan for model and product. Participants and literature emphasized the importance of human feedback to evaluate model predictions (P11a, P14a) [87], which requires planning to collect such feedback (�). System and usability testing may similarly require planning for user studies with prototypes and shadow deployment [88, 99, 108].

8 DISCUSSION AND CONCLUSIONS

Through our interviews we identified three central collaboration points where organizations building ML-enabled systems face substantial challenges: (1) requirements and project planning, (2) training data, and (3) product-model integration. Other collaboration points surfaced, but were mentioned far less frequently (e.g., interaction with legal experts and operators), did not relate to problems between multiple disciplines (e.g., data scientists documenting their work for other data scientists), or mirrored conventional collaboration in software projects (e.g., many interviewees wanted to talk about unstable ML libraries and challenges interacting with teams building and maintaining such libraries, though the challenges largely mirrored those of library evolution generally [16, 31]).

Data scientists and software engineers are certainly not the first to realize that interdisciplinary collaborations are challenging and fraught with communication and cultural problems [21], yet it seems that many organizations building ML-enabled systems pay little attention to fostering better interdisciplinary collaboration. Organizations differ widely in their structures and practices, and some organizations have found strategies that work for them (see recommendation sections). Yet, we find that most organizations do not deliberately plan their structures and practices and have little insight into available choices and their tradeoffs. We hope that this work can (1) encourage more deliberation about organization and process at key collaboration points, and (2) serve as a starting point for cataloging and promoting best practices.

Beyond the specific challenges discussed throughout this paper, we see four broad themes that benefit from more attention both in engineering practice and in research:

� Communication: Many issues are rooted in miscommunication between participants with different backgrounds. To facilitate interdisciplinary collaboration, education is key, including ML literacy for software engineers and managers (and even customers) but also training data scientists to understand software engineering concerns. The idea of T-shaped professionals [101] (deep expertise in one area, broad knowledge of others) can provide guidance for hiring and training.

� Documentation: Clearly documenting expectations between teams is important. Traditional interface documentation familiar to software engineers may be a starting point, but practices for documenting model requirements (Sec. 5.2), data expectations (Sec. 6.2), and assured model qualities (Sec. 7.3) are not well established. Recent suggestions like model cards [64] and FactSheets [8] are a good starting point for encouraging better, more standardized documentation of ML components. Given the interdisciplinary nature of these collaboration points, such documentation must be understood by all involved – theories of boundary objects [2] may help to develop better interface description mechanisms.

� Engineering: With attention focused on ML innovations, many organizations seem to underestimate the engineering effort required to turn a model into a product that can be operated and maintained reliably. Arguably, adopting machine learning increases software complexity [48, 68, 86] and makes engineering practices such as data quality checks, deployment automation, and testing in production even more important. Project managers should ensure that the ML and the non-ML parts of the project have sufficient engineering capabilities and foster product and operations thinking from the start.

� Process: Finally, machine learning, with its more science-like process, challenges traditional software process life cycles. It seems clear that product requirements cannot be established without involving data scientists for model prototyping, and often it may be advisable to adopt a model-first trajectory to reduce risk. But while a focus on the product and overall process may cause delays, neglecting it entirely invites the kind of problems reported by our participants. Whether it looks more like the spiral model or agile [22], more research into integrated process life cycles for ML-enabled systems (covering software engineering and data science) is needed.

Acknowledgements. Kästner's and Nahar's work was supported in part by NSF awards 1813598 and 2131477. Zhou's work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), RGPIN2021-03538. Lewis' work was funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University (CMU) for the operation of the Software Engineering Institute (SEI), a federally funded research and development center. We would like to thank all our interview participants (K M Jawadur Rahman, Miguel Jette, and anonymous others) and the people who helped us connect with them.
[59] Mäkinen, S., Skogström, H., Laaksonen, E. and Mikkonen, T. 2021. Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help? In Proc. Workshop on AI Engineering–Software Engineering for AI (WAIN), 109–112.
[60] Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M. and Wagner, S. 2021. Software Engineering for AI-Based Systems: A Survey. arXiv 2105.01984.
[61] Martinez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernandez Orallo, J., Kull, M., Lachiche, N., Ramirez Quintana, M.J. and Flach, P.A. 2021. CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE Transactions on Knowledge and Data Engineering. 33, 8, 3048–3061.
[62] Meyer, B. 1997. Object-Oriented Software Construction. Prentice-Hall.
[63] Mistrík, I., Grundy, J., van der Hoek, A. and Whitehead, J. 2010. Collaborative Software Engineering. Springer.
[64] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D. and Gebru, T. 2019. Model Cards for Model Reporting. In Proc. Conf. Fairness, Accountability, and Transparency, 220–229.
[65] Muiruri, D., Lwakatare, L.E., K Nurminen, J. and Mikkonen, T. 2021. Practices and Infrastructures for ML Systems–An Interview Study. TechRxiv 16939192.v1.
[66] O'Leary, K. and Uchida, M. 2020. Common problems with creating machine learning pipelines from existing code. In Proc. Conf. Machine Learning and Systems (MLSys).
[67] Ovaska, P., Rossi, M. and Marttiin, P. 2003. Architecture as a coordination tool in multi-site software development. Software Process Improvement and Practice. 8, 4, 233–247.
[68] Ozkaya, I. 2020. What Is Really Different in Engineering AI-Enabled Systems? IEEE Software. 37, 4, 3–6.
[69] Panetta, K. 2020. Gartner Identifies the Top Strategic Technology Trends for 2021. URL: https://www.gartner.com/smarterwithgartner/gartner-top-strategic-technology-trends-for-2021.
[70] Park, S., Wang, A., Kawas, B., Vera Liao, Q., Piorkowski, D. and Danilevsky, M. 2021. Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models. In Proc. 26th Int'l Conf. on Intelligent User Interfaces, 585–596.
[71] Parnas, D.L. 1972. On the Criteria to be used in Decomposing Systems into Modules. Communications of the ACM. 15, 12, 1053–1058.
[72] Passi, S. and Phoebe S. 2020. Making Data Science Systems Work. Big Data & Society 7 (2): 1–13.
[73] Patel, K., Fogarty, J., Landay, J.A. and Harrison, B. 2008. Investigating statistical machine learning as a tool for software development. In Proc. Conf. Human Factors in Computing Systems (CHI), 667–676.
[74] Pimentel, J.F., Murta, L., Braganholo, V. and Freire, J. 2019. A large-scale study about quality and reproducibility of Jupyter notebooks. In Proc. 16th Int'l Conf. on Mining Software Repositories (MSR), 507–517.
[75] Piorkowski, D. et al. 2021. How AI Developers Overcome Communication Challenges in a Multidisciplinary Team: A Case Study. In Proc. ACM on Human-Computer Interaction, 5, (CSCW1), 1–25.
[76] Polyzotis, N., Roy, S., Whang, S.E. and Zinkevich, M. 2018. Data Lifecycle Challenges in Production Machine Learning: A Survey. SIGMOD Rec. 47, 2, 17–28.
[77] Polyzotis, N., Roy, S., Whang, S.E. and Zinkevich, M. 2017. Data Management Challenges in Production Machine Learning. In Proc. Int'l Conf. on Management of Data, 1723–1726.
[78] Polyzotis, N., Zinkevich, M., Roy, S., Breck, E. and Whang, S. 2019. Data validation for machine learning. In Proc. Machine Learning and Systems, 334–347.
[79] Rahimi, M., Guo, J.L.C., Kokaly, S. and Chechik, M. 2019. Toward Requirements Specification for Machine-Learned Components. In Proc. Int'l Requirements Engineering Conf. Workshops (REW), 241–244.
[80] Rakova, B., Yang, J., Cramer, H. and Chowdhury, R. 2021. Where Responsible AI meets Reality: Practitioner Perspectives on Enablers for Shifting Organizational Practices. Proc. ACM Hum.-Comput. Interact. 5, CSCW1, 1–23.
[81] Ré, C., Niu, F., Gudipati, P. and Srisuwananukorn, C. 2019. Overton: A data system for monitoring and improving machine-learned products. arXiv 1909.05372.
[82] Salay, R., Queiroz, R. and Czarnecki, K. 2017. An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software. arXiv 1709.02435.
[83] Sambasivan, N. et al. 2021. "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. In Proc. Conf. on Human Factors in Computing Systems (CHI), 1–15.
[84] Sarma, A., Redmiles, D.F. and van der Hoek, A. 2012. Palantir: Early Detection of Development Conflicts Arising from Parallel Code Changes. IEEE Transactions on Software Engineering. 38, 4, 889–908.
[85] Schelter, S. et al. 2018. Automating Large-scale Data Quality Verification. Proc. VLDB Endowment Int'l Conf. Very Large Data Bases. 11, 12, 1781–1794.
[86] Sculley, D. et al. 2015. Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28, 2503–2511.
[87] Sculley, D., Otey, M.E., Pohl, M., Spitznagel, B., Hainsworth, J. and Zhou, Y. 2011. Detecting adversarial advertisements in the wild. In Proc. Int'l Conf. Knowledge Discovery and Data Mining, 274–282.
[88] Sendak, M.P. et al. 2020. Real-World Integration of a Sepsis Deep Learning Technology Into Routine Clinical Care: Implementation Study. JMIR Medical Informatics. 8, 7, e15182.
[89] Serban, A., van der Blom, K., Hoos, H. and Visser, J. 2020. Adoption and Effects of Software Engineering Best Practices in Machine Learning. In Proc. Int'l Symposium on Empirical Software Engineering and Measurement, 1–12.
[90] Seymoens, T., Ongenae, F. and Jacobs, A. 2018. A methodology to involve domain experts and machine learning techniques in the design of human-centered algorithms. In Proc. IFIP Working Conf. Human Work Interaction Design, 200–214.
[91] Shneiderman, B. 2020. Bridging the gap between ethics and practice. ACM Transactions on Interactive Intelligent Systems. 10, 4, 1–31.
[92] Siebert, J., Joeckel, L., Heidrich, J., Nakamichi, K., Ohashi, K., Namba, I., Yamamoto, R. and Aoyama, M. 2020. Towards Guidelines for Assessing Qualities of Machine Learning Systems. In Proc. Int'l Conf. on the Quality of Information and Communications Technology, 17–31.
[93] Singh, G., Gehr, T., Püschel, M. and Vechev, M. 2019. An abstract domain for certifying neural networks. Proc. ACM Program. Lang. 3, POPL, 1–30.
[94] Smith, D., Alshaikh, A., Bojan, R., Kak, A. and Manesh, M.M.G. 2014. Overcoming barriers to collaboration in an open source ecosystem. Technology Innovation Management Review. 4, 1.
[95] d. S. Nascimento, E. et al. 2019. Understanding Development Process of Machine Learning Systems: Challenges and Solutions. In Proc. Int'l Symposium on Empirical Software Engineering and Measurement (ESEM), 1–6.
[96] de Souza, C.R.B. and Redmiles, D.F. 2008. An Empirical Study of Software Developers' Management of Dependencies and Changes. In Proc. Int'l Conf. Software Engineering (ICSE), 241–250.
[97] Strauss, A. and Corbin, J. 1994. Grounded theory methodology: An overview. Handbook of Qualitative Research. N.K. Denzin, ed. 273–285.
[98] Strauss, A. and Corbin, J.M. Basics of Qualitative Research: Grounded Theory Procedures and Techniques. SAGE Publications.
[99] Studer, S. et al. 2021. Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Machine Learning and Knowledge Extraction, 3(2), 392–413.
[100] Tramèr, F. et al. 2017. FairTest: Discovering Unwarranted Associations in Data-Driven Applications. In Proc. European Symposium on Security and Privacy (EuroS&P), 401–416.
[101] Tranquillo, J. 2017. The T-Shaped Engineer. Journal of Engineering Education Transformations. 30, 4, 12–24.
[102] Vogelsang, A. and Borg, M. 2019. Requirements Engineering for Machine Learning: Perspectives from Data Scientists. In Proc. Int'l Requirements Engineering Conf. Workshops (REW), 245–251.
[103] Wagstaff, K. 2012. Machine Learning that Matters. arXiv 1206.4656.
[104] Wang, A.Y., Mittal, A., Brooks, C. and Oney, S. 2019. How Data Scientists Use Computational Notebooks for Real-Time Collaboration. Proc. Human-Computer Interaction. 3, CSCW, 39.
[105] Wan, Z., Xia, X., Lo, D. and Murphy, G.C. 2019. How does Machine Learning Change Software Development Practices? IEEE Transactions on Software Engineering, 47(9), 1857–1871.
[106] Waterman, M., Noble, J. and Allan, G. 2015. How Much Up-Front? A Grounded Theory of Agile Architecture. In Proc. Int'l Conf. Software Engineering, 347–357.
[107] Staff, V.B. 2019. Why do 87% of data science projects never make it into production? URL: https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/.
[108] Wiens, J. et al. 2019. Do no harm: A roadmap for responsible machine learning for health care. Nature Medicine. 25, 9, 1337–1340.
[109] Xie, X., Ho, J.W.K., Murphy, C., Kaiser, G., Xu, B. and Chen, T.Y. 2011. Testing and Validating Machine Learning Classifiers by Metamorphic Testing. Journal of Systems and Software. 84, 4, 544–558.
[110] Yang, Q., Suh, J., Chen, N.-C. and Ramos, G. 2018. Grounding Interactive Machine Learning Tool Design in How Non-Experts Actually Build Models. In Proc. Conf. Designing Interactive Systems, 573–584.
[111] Yang, Q. The role of design in creating machine-learning-enhanced user experience. In Proc. AAAI Spring Symposium Series, 406–411.
[112] Yokoyama, H. 2019. Machine Learning System Architectural Pattern for Improving Operational Stability. In Proc. Int'l Conf. on Software Architecture Companion (ICSA-C), 267–274.
[113] Zhang, A.X., Muller, M. and Wang, D. 2020. How do data science workers collaborate? Roles, workflows, and tools. Proc. Human-Computer Interaction. 4, CSCW1, 1–23.
[114] Zhou, S., Vasilescu, B. and Kästner, C. 2020. How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub. In Proc. Int'l Conf. Software Engineering (ICSE), 445–456.
[115] Zinkevich, M. 2017. Rules of machine learning: Best practices for ML engineering. URL: https://developers.google.com/machine-learning/guides/rules-of-ml.
Supplementary Material
SUPPLEMENT A: INTERVIEW PARTICIPANTS
Table 4: Distribution of Company Type
SUPPLEMENT B: INTERVIEW GUIDE

Introductory Comments
– This study intended to understand the challenges of collaboration between different stakeholders in a machine learning production system, and the current industry best practices to deal with those challenges. In support of this study, a series of targeted interviews are being conducted to gather information from the stakeholders involved in different stages of building, deploying, and operating machine learning production systems.
– Please do not share any confidential information with us. If you prefer information not to be included in the recording, just let us know.
– The information gathered from the interviews will be aggregated in the form of mismatches and consequences, and will have no connection to any organization.
– Before we begin, we would like to notify you that we intend to record the interview for transcription purposes. Only the research team will be able to access the recordings and transcripts. After the interview is transcribed and identifying information is removed from the transcript, the audio recording will be destroyed.
– The audio recording will be done in a private space for both physical and over-the-phone conversations.

Interview Questions

[Intro].
Project and Role Can you tell me more about you: your role at your company, and your academic and professional background? Tell me about the last ML-based project you worked on. To what extent were you involved in the system, and what was/were your role(s) in it?
ML Component Can you tell me about the ML or non-ML parts of the project? How is the ML component used as part of the larger system?
Team Can you describe a bit about your team? What are the roles involved? What are their tasks? And how do they communicate with each other, and for what purpose? Who is involved with which components?

[Topic-Specific Questions].
Understanding system requirements and ML capabilities Who was involved in the decision to use ML in this project (or whether to turn an ML prototype into a product)? Who do you talk to for the requirements of the system as well as the ML components in the project? – (probe) ask about feedback loops
Project planning and Process
Project planning How do you plan or estimate for the project, specifically the ML components and dependencies between components? Who is involved in the planning?
Process In what order are things developed in the ML project? Do you follow a process model?
Dealing with change Do you plan for change management? How does this planning / replanning happen? Who gets involved?
Components Can you tell me a bit about your ML pipeline? What are the interaction points to the non-ML components? Can you describe/draw the architecture of the system with ML and non-ML components?
System decomposition How did you decide to decompose the whole system into dependent/independent components this way? Was it obvious or did it change? Do you think of ML as one component? Who was involved in the decomposition process and in deciding the module boundaries?
System Correctness How do you plan to evaluate your system's correctness? How do you set quality goals?
ML component testing How do you plan to test the model? Do you conduct offline/online testing? What data is collected later (e.g., for telemetry)? Who made these decisions and do they evolve?
System testing In testing, how much do you focus on testing the model versus testing the entire system? How do you test the system? Who is involved in testing? Who makes decisions?
Breaking change Did you face any application break recently? What is considered a breaking change? How do you deal with breaking changes? How do you detect them? – (probe) Do you consider that models might evolve during model development and are likely to influence other parts of the system?
Data quality How much do you or others in the project worry about data quality?
Data need Think about data that you receive from other parts of the system (or from outside the system) or data you produce for other parts of the system: Are data quality and quantity requirements documented? Are there schema definitions or data quality checks or monitoring? Who "owns" the data? Who cleans the data? Who is responsible for checks or documentation? What happens when data format or quality change? How much influence do you have on data quality and quantity?
Data understanding Is the data schema and its semantics documented somewhere? Who does the documentation? If you don't understand the data, whom do you communicate with to understand it?
Planning and monitoring for drift What happens if the data changes? Who is responsible for notifying changes? How do you get to know about the change? How do you deal with schema evolution?
Special qualities Do you consider requirements like explainability or fairness, and legal requirements like privacy? Are the requirements documented at the system or model level? Who is responsible? Do you plan for fairness/robustness testing?
Versioning Do you maintain model versioning? What about data versioning? Do you think about provenance tracking? Are these documented? Who made decisions? Who is affected?
Reusability, Reproducibility, Maintainability To what extent do you develop documentation for the component? Do you follow any coding standards or conventions? Do you consider designing the module for reuse? Do you care whether components and results are reproducible?
Operations How do you deploy updates to models and non-ML components? How often? Who? Do you consider continuous experimentation? Who is responsible for the operations?
Composition/Integration How do you transfer the model from prototype to production? Who is responsible for integrating the ML model into the system? Do multiple roles collaborate for this or is it done by one specific role? What is the difficulty level of that, according to you?

[Last Thoughts].
ML vs non-ML Can you share your experience with the ML project in comparison with other types of projects that do not include a machine learning component (if you have worked on any)?
Challenges/Benefits Can you think about any challenges you faced during the project development or afterward? Why do you think the challenge arose, and in which step? – (probe) [Interdisciplinary Collaboration] Is working with team members of different backgrounds a challenge in this project?

SUPPLEMENT C: CODEBOOK

1. Understanding System Requirements
Description: This is the collaboration point where the broad system requirements are collected/defined. The overall system goal and system requirements may differ from those of individual ML components and focus on the overall system behavior and its interactions with the environment. While one system can grow around keeping an ML module as a central point, another system can consist of less important ML module(s) working along with other traditional features. Therefore, system requirements may be collected upfront or late and incrementally after the initial model development. For the first category of system, the requirements are generated after the ML component is defined. Thus, the ML design decisions influence how the entire system is shaped and take control of defining the requirements of the overall system. For the second category of system, the requirements gathering is executed similarly to traditional software. One or a few ML features are incorporated into the overall system, and they are influenced and shaped by the broad system-level requirements, constraining the data scientists during modeling. Based on these categories, we expect that stakeholders think differently about the business needs of the system. At this collaboration point, the stakeholders need to define the system scope, system-level requirements, and metrics to set quality expectations for the overall system. This includes considering how the system interacts with the environment and how this influences safety, security, fairness, and feedback-loop issues at the system level. Also, this is the collaboration point where special requirements like explainability, fairness, privacy, safety, security, human-AI interaction planning, etc. need to be defined properly for the overall system, as well as realizing their effect on the ML part. In short, the different stakeholders of the system need to be involved in the requirements collection and realization process.
Agree on: System-level requirements, reasoning about feedback loops, non-functional requirements of the system.
Output: Documentation on requirements including functional and non-functional requirements, environment interaction analysis, reasoning about feedback loops.
Involved: Primarily - Requirements Engineer, Domain Expert, Customer and Manager. Needs consultation with - Software Engineer, Data Scientist, Tester, Operator.
Lifecycle: Traditionally, this would happen in the early requirements stages of a waterfall-like process or iteratively in other process models. In ML projects this may happen when shifting from initial modeling to building a production system.
Why challenging: While developing ML components, the data scientists often forget about the overall system and focus on the model part only. Thus, inconsistencies may arise if the model does not correspond to the system goals. Also, it is often reported that stakeholders find it difficult to define the scope of a project that includes ML components. ML components often lead to false expectations that make it difficult to set reasonable targets. Additionally, it is difficult to quantify the quality targets or the qualitative non-functional requirements. Identifying the feedback loops is also not a trivial task. Moreover, as both the traditional and ML parts are involved and the team consists of people with heterogeneous languages and priorities, these people need to interact with each other to come to a common position.
Example: The Google Photos app has ML modules for photo tagging. However, the app has a lot of other features, and the ML module needs to interact and be consistent with those. For example, should the app show the photo tags to the users, or will the users be able to change a tag if they think it is incorrect? Thus, the different stakeholders of the Google Photos app need to be involved in the requirements collection and realization process.

2. Project Planning, Process and Interdisciplinary Collaboration
Description: This is the collaboration point that deals with the overall project planning. It defines the process model to be followed. While one project might start with collecting requirements first, another can start with letting some data scientists play with data for a year. So the process needs to be determined at this point. Additionally, general planning like time estimation, risk mitigation plans, etc. also needs to be conducted and synchronized among the stakeholders at this collaboration point. As this collaboration point is where the overall system planning and process is discussed, the whole team involved in the project should be involved here, at least implicitly. That is why the communication challenges between the interdisciplinary team members are also combined with this point.
Agree on: Project plan and process models to be followed. Communication agendas between different teams/roles.
Output: Informal project planning to more formal planning documents, possible process documentation, adoption of process practices, forming of interdisciplinary teams, adoption of teamwork practices.
Involved: Primarily - Project Manager. Needs consultation with - Software Engineer, Data Scientist, Domain Experts, Tester, Operator, etc.
Lifecycle: Generally, in a waterfall-like process, the tasks can be actively handled in both the System Requirements and Planning/High-level Design (outside ML pipeline) stages. However, in practice, it generally spreads out and continues as ongoing tasks.
Why challenging: In general, time estimation is hard for ML applications due to their exploratory nature. This also creates a lot of collaboration challenges in planning due to the differences between SE and ML components.
Example: The Google Photos app has ML modules and other traditional functionalities. The ML modules have structures and time requirements different from the other traditional parts. Thus, before building the app, the project planning needs to incorporate planning for the ML parts. For example, the photo tagging component might need a different timeline than the traditional components. A risk analysis might be required for the component, and, similar to the iterative model, the risk-associated component can be built first.

3. System Decomposition, Local Checking and System Evaluation
Description: This collaboration point deals with decomposing the overall system into ML and non-ML SE components. This defines the module boundaries and negotiates the requirements or interface contracts for each of the components. While each component will be evaluated internally against its interfaces, later composition and integration-/system-level quality assurance is also planned, ensuring that the composed system meets the system specifications. During decomposition and local-system evaluation many qualities need to be considered, each possibly resulting in corresponding obligations at the module interfaces. Generally, we consider four kinds of components:
• Non-ML component: These are the traditional SE components.
• Pipeline component: This component represents the process of producing the model, roughly how data is transferred to the model and the deployment of it.
• Inference component: This is the component that is concerned with the model prediction, and thus relates to using the model.
• Monitoring component: This represents the component for monitoring the system after the deployment of the system.
Agree on: Module boundaries and quality requirements for the components after the decomposition. Also, evaluation mechanisms for the components and the overall system after the composition.

3.1. Functional Correctness / Target Domain / Fit (Accuracy)
Description: Functional "correctness" is used as a broad term here that measures whether the system produces outputs that match a given problem. The system has behavioral expectations that need to be broken down into individual components, tested locally, and then composed and evaluated again at the system level.
Agree on: Accuracy expectations for ML components, functional behavior expectations of all individual components, test strategy for components, test strategies for integration.
Component-level Evaluation: Model Accuracy. Both Offline/Online testing can be needed. Telemetry collection needs collaboration. Functional correctness of non-ML components.
Involved: Primarily - Data Scientist. For online testing/telemetry collection, needs consultation with - Domain Expert, Software Engineer, Operator, Tester.
System-level Evaluation: Integration, system and acceptance testing.
Involved: Primarily - Tester. Needs consultation with - Software Engineer, Data Scientist, Operator, Manager.

3.2. Fairness, Privacy and Accountability
Description: Inclusion of ML in a system often creates expectations of special system-level quality requirements that include fairness, privacy, explainability, provenance tracking, etc. Again, to achieve the quality at the system level, the role of individual components needs to be negotiated, defined, and assured; the integration of components needs to be evaluated.
Agree on: Protected data points, level of fairness expected, level of explainability/provenance tracking expected.
Component-level Evaluation: Model Fairness Testing (increasing data quantity might be needed), Constraints on Protected Values in Data, Use of Explainable Algorithms, Versioning of Code, Model and Data.
Involved: Primarily - Data Scientist. Needs consultation with - Domain Expert, Legal Department, Manager, Tester.
System-level Evaluation: System-level Fairness Testing, Testing whether provenance tracking can be performed, Testing whether a system output can be explained or whether the results are interpretable.
Involved: Primarily - Tester. Needs consultation with - Software Engineer, Data Scientist, Domain Expert, Legal Department, Operator, Manager.

3.3. Data Quality
Description: Data is one of the most important parts of an ML application. This includes tracking data needs both from the aspect of quality and quantity, defining an appropriate data schema, monitoring data drift and taking necessary steps to incorporate change or evolution, documentation of data, etc.
Agree on: Quality and quantity of data to be collected, data schema definition, data understanding and reporting, data monitoring and managing drift.
Component-level Evaluation: Improvement/degradation of accuracy can be one indicator of data needs. Privacy is a concern here as data can contain protected attributes. Data cleaning and storing data in a schema are steps to check on data integrity at the model level.
Involved: Primarily - Data Scientist. Needs consultation with - Legal Department [for privacy], Domain Expert [for schema definition and drift], Manager [for more data need].
System-level Evaluation: Degradation of model accuracy in the integrated system can be one indicator of data drift.
Involved: Primarily - Data Scientist. Needs consultation with - Software Engineer, Tester, Operator [for collecting telemetry], Domain Expert [for drift], Manager [for more data need].

3.4. Reusability, Reproducibility, Maintainability, Infrastructure Quality
Description: ML components have different coding languages and different conventions. However, for a complete software system, the parts might follow similar coding conventions so that the other team members can understand the code easily and integrate the parts together without much effort. Also, the documentation of the implementation needs to be maintained to a certain extent so that the system can be easily updated later on. This will help to increase the system's maintainability. This also leads to easy reusability and reproducibility of the system.
Agree on: Coding Convention and Level of Documentation, Expectations of Reusability, Reproducibility and Maintainability.
Component-level Evaluation: Consistent Coding and Documentation of the Component. (Involved: Primarily - Software Engineer and Data Scientist. Needs consultation with - Manager)
System-level Evaluation: Consistent Coding and Documentation throughout the System.
Involved: Primarily - Software Engineer and Data Scientist. Needs consultation with - Manager.

3.5. Updatability, Operations
Description: As ML systems need to be monitored for data drift and tested online with telemetry collection, deployment is not the last stage of development. Operations is an important part of such projects, leading to the popularity of MLOps. Along with data monitoring, change updates are also required for both ML and traditional parts of the software. Continuous experimentation is another aspect that requires updating the application continuously. This relates to system versioning and provenance tracking as well.
Agree on: Deployment and Monitoring Mechanisms, Frequency of Updates, Continuous Experimentation Requirements.
Component-level Evaluation: Stable Model Deployment and Monitoring Mechanisms for Data Drift. Experimentation with Different Versions of the Model. (Involved: Primarily - Software Engineer and Data Scientist. Needs consultation with - Operator, Manager and Tester)
System-level Evaluation: Stable System Deployment and Monitoring Mechanisms. Monitoring for change requests and continuous bug fixes generates continuous patch or version updates.
Involved: Primarily - Operator. Needs consultation with - Software Engineer, Tester and Manager.

3.6. Usability
Description: Like other traditional systems, ML systems also need user interaction in different forms. Whatever the form is, user satisfaction/usability is an important aspect of any system. In particular, ML systems need to collect telemetry, which demands extra attention to the UI design.
Agree on: User Experience Expectations.
Involved: Primarily - Software Engineer and UX Expert. Needs consultation with - Data Scientist, Operator, Manager, and Tester.

SUPPLEMENT D: LIST OF PAPERS

Initial Set of Papers (15)
• Chattopadhyay, S., Prasad, I., Henley, A. Z., Sarma, A., & Barik, T. (2020). What's Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1–12). Association for Computing Machinery.
• O'Leary, K., & Uchida, M. (2020). Common problems with creating machine learning pipelines from existing code.
• Li, P. L., Ko, A. J., & Begel, A. (2017). Cross-Disciplinary Perspectives on Collaborations with Software Engineers. In 2017 IEEE/ACM 10th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE).
• Kim, M., Zimmermann, T., DeLine, R., & Begel, A. (2018). Data Scientists in Software Teams: State of the Art and Challenges. In IEEE Transactions on Software Engineering (Vol. 44, Issue 11, pp. 1024–1038).
• Arpteg, A., Brinne, B., Crnkovic-Friis, L., & Bosch, J. (2018). Software Engineering Challenges of Deep Learning. 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 50–59.
• Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. 2017 IEEE International Conference on Big Data (Big Data), 1123–1132.
• Kery, M. B., Radensky, M., Arya, M., John, B. E., & Myers, B. A. (2018). The story in the notebook. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18, Montreal, QC, Canada.
• Zhang, A. X., Muller, M., & Wang, D. (2020). How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1–23.
• Pimentel, J. F., Murta, L., Braganholo, V., & Freire, J. (2019, May). A large-scale study about quality and reproducibility of Jupyter notebooks. 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada.
• Head, A., Hohman, F., Barik, T., Drucker, S. M., & DeLine, R. (2019). Managing messes in computational notebooks. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19, Glasgow, Scotland, UK.
• Studer, S., Bui, T. B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., & Mueller, K.-R. (2020). Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. In arXiv [cs.LG]. arXiv.
• Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2503–2511). Curran Associates, Inc.
• Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software Engineering for Machine Learning: A Case Study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
• Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2017). Data Management Challenges in Production Machine Learning. Proceedings of the 2017 ACM International Conference on Management of Data, 1723–1726.
• Vogelsang, A., & Borg, M. (2019). Requirements Engineering for Machine Learning: Perspectives from Data Scientists. 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), 245–251.

Complete Set of Papers (61)
• Chattopadhyay, S., Prasad, I., Henley, A. Z., Sarma, A., & Barik, T. (2020). What's Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1–12). Association for Computing Machinery.
• O'Leary, K., & Uchida, M. (2020). Common problems with creating machine learning pipelines from existing code.
• Li, P. L., Ko, A. J., & Begel, A. (2017). Cross-Disciplinary Perspectives on Collaborations with Software Engineers. In 2017 IEEE/ACM 10th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE).
• Kim, M., Zimmermann, T., DeLine, R., & Begel, A. (2018). Data Scientists in Software Teams: State of the Art and Challenges. In IEEE Transactions on Software Engineering (Vol. 44, Issue 11, pp. 1024–1038).
• Arpteg, A., Brinne, B., Crnkovic-Friis, L., & Bosch, J. (2018). Software Engineering Challenges of Deep Learning. 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 50–59.
• Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. 2017 IEEE International Conference on Big Data (Big Data), 1123–1132.
• Kery, M. B., Radensky, M., Arya, M., John, B. E., & Myers, B. A. (2018). The story in the notebook. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18, Montreal, QC, Canada.
• Zhang, A. X., Muller, M., & Wang, D. (2020). How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1–23.
• Pimentel, J. F., Murta, L., Braganholo, V., & Freire, J. (2019, May). A large-scale study about quality and reproducibility of Jupyter notebooks. 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada.
• Head, A., Hohman, F., Barik, T., Drucker, S. M., & DeLine, R. (2019). Managing messes in computational notebooks. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19, Glasgow, Scotland, UK.
• Studer, S., Bui, T. B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., & Mueller, K.-R. (2020). Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. In arXiv [cs.LG]. arXiv.
• Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2503–2511). Curran Associates, Inc.
• Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software Engineering for Machine Learning: A Case Study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
• Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2017). Data Management Challenges in Production Machine Learning. Proceedings of the 2017 ACM International Conference on Management of Data, 1723–1726.
• Vogelsang, A., & Borg, M. (2019). Requirements Engineering for Machine Learning: Perspectives from Data Scientists. 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), 245–251.
• Holstein, K., Wortman Vaughan, J., Daumé, H., Dudik, M., & Wallach, H. (2019). Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–16.
• Rahimi, M., Guo, J. L. C., Kokaly, S., & Chechik, M. (2019). Toward Requirements Specification for Machine-Learned Components. 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), 241–244.
• Nushi, B., Kamar, E., Horvitz, E., & Kossmann, D. (2017). On human intellect and machine failures: troubleshooting integrative machine learning systems. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 1017–1025.
• Borg, M., Englund, C., Wnuk, K., Duran, B., Levandowski, C., Gao, S., Tan, Y., Kaijser, H., Lönn, H., & Törnqvist, J. (2019). Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry. In Journal of Automotive Software Engineering (Vol. 1, Issue 1, p. 1).
• Kandel, S., Paepcke, A., Hellerstein, J. M., & Heer, J. (2012). Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12), 2917–2926.
• Madaio, M. A., Stark, L., Wortman Vaughan, J., & Wallach, H. (2020). Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–14.
• Salay, R., Queiroz, R., & Czarnecki, K. (2017). An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software. In arXiv [cs.AI]. arXiv.
• Ashmore, R., Calinescu, R., & Paterson, C. (2019). Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. In arXiv [cs.LG]. arXiv.
• Sendak, M. P., Ratliff, W., Sarro, D., Alderton, E., Futoma, J., Gao, M., Nichols, M., Revoir, M., Yashar, F., Miller, C., Kester, K., Sandhu, S., Corey, K., Brajer, N., Tan, C., Lin, A., Brown, T., Engelbosch, S., Anstrom, K., . . . O'Brien, C. (2020). Real-World Integration of a Sepsis Deep Learning Technology Into Routine Clinical Care: Implementation Study. JMIR Medical Informatics, 8(7), e15182.
• Sculley, D., Otey, M. E., Pohl, M., Spitznagel, B., Hainsworth, J., & Zhou, Y. (2011). Detecting adversarial advertisements in the wild. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 274–282.
• Bernardi, L., Mavridis, T., & Estevez, P. (2019). 150 successful machine learning models. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD '19, Anchorage, AK, USA.
• Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H., & Crnkovic, I. (2019). A taxonomy of software engineering challenges for machine learning systems: An empirical investigation. International Conference on Agile Software Development, 227–243.
• Yang, Q., Suh, J., Chen, N.-C., & Ramos, G. (2018). Grounding Interactive Machine Learning Tool Design in How Non-Experts Actually Build Models. Proceedings of the 2018 Designing Interactive Systems Conference, 573–584.
• Martinez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernandez Orallo, J., Kull, M., Lachiche, N., Ramirez Quintana, M. J., & Flach, P. A. (2020). CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE Transactions on Knowledge and Data Engineering, 1–1.
• Ishikawa, F., & Yoshioka, N. (2019, May). How do engineers perceive difficulties in engineering of machine-learning systems? Questionnaire survey. In 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP) (pp. 2–9). IEEE.
• Ozkaya, I. (2020). What Is Really Different in Engineering AI-Enabled Systems? IEEE Software, 37(4), 3–6.
• Wiens, J., Saria, S., Sendak, M., Ghassemi, M., Liu, V. X., Doshi-Velez, F., Jung, K., Heller, K., Kale, D., Saeed, M., Ossorio, P. N., Thadaney-Israni, S., & Goldenberg, A. (2019). Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine, 25(9), 1337–1340.
• Wagstaff, K. (2012). Machine Learning that Matters. In arXiv [cs.LG]. arXiv.
• Hynes, N., Sculley, D., & Terry, M. (2017). The data linter: Lightweight, automated sanity checking for ML data sets. In NIPS MLSys Workshop.
• Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
• Hukkelberg, I., & Rolland, K. (2020). Exploring Machine Learning in a Large Governmental Organization: An Information Infrastructure Perspective.
• Hill, C., Bellamy, R., Erickson, T., & Burnett, M. (2016, September). Trials and tribulations of developers of intelligent systems: A field study. In 2016 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) (pp. 162–170).
• Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., Koo, C. Y., Lew, L., Mewald, C., Modi, A. N., Polyzotis, N., Ramesh, S., Roy, S., Whang, S. E., Wicke, M., . . . Zinkevich, M. (n.d.). TFX: A TensorFlow-Based Production-Scale Machine Learning Platform.
• Ré, C., Niu, F., Gudipati, P., & Srisuwananukorn, C. (2019). Overton: A data system for monitoring and improving machine-learned products. arXiv preprint arXiv:1909.05372.
• Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J. M. F., & Eckersley, P. (2020). Explainable machine learning in deployment. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 648–657.
• Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., & Law, J. (2018, February). Applied machine learning at Facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 620–629). IEEE.
• Amershi, S., Chickering, M., Drucker, S. M., Lee, B., Simard, P., & Suh, J. (2015). ModelTracker: Redesigning Performance Analysis Tools for Machine Learning. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 337–346.
• Peng, Z., Yang, J., Chen, T.-H. (Peter), & Ma, L. (2020, November 8). A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), Virtual Event, USA.
• Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data Lifecycle Challenges in Production Machine Learning: A Survey. SIGMOD Rec., 47(2), 17–28.
• Humbatova, N., Jahangirova, G., Bavota, G., Riccio, V., Stocco, A., & Tonella, P. (2020, June 27). Taxonomy of real faults in deep learning systems. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE '20), Seoul, South Korea.
• Polyzotis, N., Zinkevich, M., Roy, S., Breck, E., & Whang, S. (2019). Data validation for machine learning. Proceedings of Machine Learning and Systems, 1, 334–347.
• Wan, Z., Xia, X., Lo, D., & Murphy, G. C. (2019). How does Machine Learning Change Software Development Practices? IEEE Transactions on Software Engineering, 1–1.
• Lwakatare, L. E., Raj, A., Crnkovic, I., Bosch, J., & Olsson, H. H. (2020). Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions. Information and Software Technology, 127(106368), 106368.
• Siebert, J., Joeckel, L., Heidrich, J., Nakamichi, K., Ohashi, K., Namba, I., Yamamoto, R., & Aoyama, M. (2020). Towards Guidelines for Assessing Qualities of Machine Learning Systems. In Communications in Computer and Information Science (pp. 17–31).
• Shneiderman, B. (2020). Bridging the gap between ethics and practice. ACM Transactions on Interactive Intelligent Systems, 10(4), 1–31.
• Seymoens, T., Ongenae, F., & Jacobs, A. (2018). A methodology to involve domain experts and machine learning techniques in the design of human-centered algorithms. Working Conference on . . . .
• Zinkevich, M. (2017). Rules of machine learning: Best practices for ML engineering. URL: https://developers.google.com/machine-learning/guides/rules-of-ml.
• Park, S., Wang, A., Kawas, B., Vera Liao, Q., Piorkowski, D., & Danilevsky, M. (2021). Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models.