[...] projects has created the need for software engineers to collaborate with data scientists and other specialists. While collaboration [...] continuous evolution and monitoring, and non-traditional quality [...] deploying ML systems into production. We report on common collaboration points in the development of production ML systems for requirements, data, and integration, as well as corresponding team patterns and challenges. We find that most of these challenges center around communication, documentation, engineering, and process, and collect recommendations to address these challenges.

[Figure 1: Team structure, responsibilities, and collaboration points in Organization 3 (collaboration points: 1 product requirements, 2 integration (API & QA), 3 public data) and Organization 7 (collaboration points: 1 model requirements, 2 training data, 3 integration (API)); the legend distinguishes software engineers, data scientists, teams, responsibilities, and collaboration points.]
we interviewed, there is little systematic or shared understanding of common collaboration challenges and best practices for developing ML-enabled systems and coordinating developers with very different backgrounds (e.g., data science vs. software engineering). We find that smaller and new-to-ML organizations struggle more, but have limited advice to draw from for improvement.

Three collaboration points surfaced as particularly challenging: (1) identifying and decomposing requirements, (2) negotiating training data quality and quantity, and (3) integrating data science and software engineering work. We found that organizational structure, team composition, power dynamics, and responsibilities differ substantially, but also found common organizational patterns at specific collaboration points and challenges associated with them. Overall, our observations suggest four themes that would benefit from more attention when building ML-enabled systems: (i) invest in supporting interdisciplinary teams to work together (including education and avoiding silos), (ii) pay more attention to collaboration points and clearly document responsibilities and interfaces, (iii) consider engineering work as a key contribution to the project, and (iv) invest more into process and planning.

In summary, we make the following contributions: (1) We identify three core collaboration points and associated collaboration challenges based on interviews with 45 practitioners, triangulated with a literature review, (2) we highlight the different ways in which teams organize, but also identify organizational patterns that associate with certain collaboration challenges, and (3) we identify recommendations to improve collaboration practices.

2 STATE OF THE ART

Researchers and practitioners have discussed whether and how machine learning changes software engineering with the introduction of learned models as components in software systems [e.g., 1, 5, 42, 68, 80, 82, 89, 102, 110]. To lay the foundation for our interview study and inform the questions we ask, we first provide an overview of the related work and existing theories on collaboration in traditional software engineering and discuss how machine learning may change this.

Collaboration in Software Engineering. Most software projects exceed the capacity of a single developer, requiring multiple developers and teams to collaborate ("work together") and coordinate ("align goals"). Collaboration happens across teams, often in a more formal and structured form, and within teams, where familiarity with other team members and frequent co-location fosters informal communication [63]. At a technical level, to allow multiple developers to work together, abstraction and a divide-and-conquer strategy are essential. Dividing software into components (modules, functions, subsystems) and hiding internals behind interfaces is a key principle of modular software development that allows teams to divide work and work mostly independently until the final system is integrated [62, 71].

Teams within an organization tend to align with the technical structure of the system, with individuals or teams assigned to components [30]; hence the technical structure (interfaces and dependencies between components) influences the points where teams collaborate and coordinate. Coordination challenges are especially observed when teams cannot easily and informally communicate, often studied in the context of distributed teams of global corporations [38, 67] and open-source ecosystems [16, 94].

More broadly, interdisciplinary collaboration often poses challenges. It has been shown that when team members differ in their academic and professional backgrounds and possess different expectations on the same system, communication, cultural, and methodical challenges often emerge when working together [21, 72]. Key insights are that successful interdisciplinary collaboration depends on professional role, structural characteristics, personal characteristics, and a history of collaboration; specifically, structural factors such as unclear mission, insufficient time, excessive workload, and lack of administrative support are barriers to collaboration [24].

The component interface plays a key role in collaboration as a negotiation and collaboration point. It is where teams (re-)negotiate how to divide work and assign responsibilities [19]. Team members often seek information that may not be captured in interface descriptions, as interfaces are rarely fully specified [32]. In an idealized development process, interfaces are defined early based on what is assumed to remain stable [71], because changes to interfaces later are expensive and require the involvement of multiple teams. In addition, interfaces reflect key architectural decisions for the system, aimed to achieve desired overall qualities [11].

In practice though, the idealized divide-and-conquer approach following top-down planning does not always work without friction. Not all changes can be anticipated, leading to later modifications and renegotiation of interfaces [16, 31]. It may not be possible to identify how to decompose work and design stable interfaces until substantial experimentation has been performed [12]. To manage, negotiate, and communicate changes of interfaces, developers have adopted a wide range of strategies for communication [16, 33, 96], often relying on informal broadcast mechanisms to share planned or performed changes with other teams.

Software lifecycle models [22] also address this tension of when and how to design stable interfaces: traditional top-down models (e.g., waterfall) plan software design after careful requirements analysis; the spiral model pursues a risk-first approach in which developers iterate to prototype risky parts, which then informs future system design iterations; agile approaches de-emphasize upfront architectural design for fast iteration on incremental prototypes. The software architecture community has also grappled with the question of how much upfront architectural design is feasible, practical, or desirable [11, 106], showing a tension between the desire for upfront planning on one side and technical risks and unstable requirements on the other. In this context, our research explores how introducing machine learning into software projects challenges collaboration.

Software Engineering with ML Components. In an ML-enabled system, machine learning contributes one or multiple components to a larger system with traditional non-ML components. We refer to the whole system that an end user would use as the product. In some systems, the learned model may be a relatively small and isolated addition to a large traditional software system (e.g., audit prediction in tax software); in others it may provide the system's essential core with only minimal non-ML code around it (e.g., a sales prediction system sending daily predictions by email). In addition to models, an ML-enabled system typically also has components for training
and monitoring the model(s) [42, 51]. Much attention in practice recently focuses on building robust ML pipelines for training and deploying models in a scalable fashion, often under names such as "AI engineering," "SysML," and "MLOps" [51, 59, 66, 89]. In this work, we focus more broadly on the development of the entire ML-enabled system, including both ML and non-ML components.

Compared to traditional software systems, ML-enabled systems require additional expertise in data science to build the models and may place additional emphasis on expertise such as data management, safety, and ethics [5, 49]. In this paper, we primarily focus on the roles of software engineers and data scientists, who typically have different skills and educational backgrounds [48, 49, 83, 110]: Data science education tends to focus more on statistics, ML algorithms, and practical training of models from data (typically given a fixed dataset, not deploying the model, not building a system), whereas software engineering education focuses on engineering tradeoffs with competing qualities, limited information, limited budget, and the construction and deployment of systems. Research shows that software engineers who engage in data science without further education are often naive when building models [110] and that data scientists prefer to focus narrowly on modeling tasks [83] but are frequently faced with engineering work [105]. While there is plenty of work on supporting collaboration among software engineers [26, 33, 84, 114] and more recently on supporting collaboration among data scientists [104, 113], we are not aware of work exploring collaboration challenges between these roles, which we do in this work.

The software engineering community has recently started to explore software engineering for ML-enabled systems as a research field, with many contributions on bringing software-engineering techniques to ML tasks, such as testing models and ML algorithms [10, 20, 28, 109], deploying models [4, 13, 29, 34, 51], robustness and fairness of models [80, 93, 100], life cycles for ML models [1, 5, 34, 61, 73], and engineering challenges or best practices for developing ML components [3, 5, 18, 27, 40, 44, 60, 89]. A smaller body of work focuses on the ML-enabled system beyond the model, such as exploring system-level quality attributes [72, 92], requirements engineering [102], architectural design [112], safety mechanisms [17, 82], and user interaction design [7, 25, 111]. In this paper, we adopt this system-wide scope and explore how data scientists and software engineers work together to build the system with ML and non-ML components.

3 RESEARCH DESIGN

Because there is limited research on collaboration in building ML-enabled systems, we adopt a qualitative research strategy to explore collaboration points and corresponding challenges, primarily with stakeholder interviews. We proceeded in four steps: (1) We prepared interviews based on an initial literature review, (2) we conducted interviews, (3) we triangulated results with literature findings, and (4) we validated our findings with the interview participants. We base our research design on Straussian Grounded Theory [97, 98], which derives research questions from literature, analyzes interviews with open and axial coding, and consults literature throughout the process. In particular, we conduct interviews and literature analysis in parallel, with immediate and continuous data analysis, performing constant comparisons, and refining our codebook and interview questions throughout the study.

Table 1: Participant and Company Demographics
Participant Role (45): ML-focused (23), SE-focused (9), Management (5), Operations (2), Domain Expert (2), Other (4)
Participant Seniority (45): 5 years of experience or more (28), 2-5 years (9), under 2 years (8)
Company Type (28): Big tech (6), Non IT (4), Mid-size tech (11), Startup (5), Consulting (2)
Company Location (28): North America (11), South America (1), Europe (5), Asia (10), Africa (1)

Step 1: Scoping and interview guide. To scope our research and prepare for interviews, we looked for collaboration problems mentioned in existing literature on software engineering for ML-enabled systems (Sec. 2). In this phase, we selected 15 papers opportunistically through keyword search and our own knowledge of the field. We marked all sections in those papers that potentially relate to collaboration challenges between team members with different skills or educational backgrounds, following a standard open coding process [98]. Even though most papers did not talk about problems in terms of collaboration, we marked discussions that may plausibly relate to collaboration, such as data quality issues between teams. We then analyzed and condensed these codes into nine initial collaboration areas and developed an initial codebook and interview guide (provided in Supplement B and C at the end).

Step 2: Interviews. We conducted semi-structured interviews with 45 participants from 28 organizations, each 30 to 60 minutes long. All participants are involved in professional software projects using machine learning that are either already or planned to be deployed in production. In Table 1, we show the demographics of the interview participants and their organizations. Details can be found in Supplement A at the end.

We tried to sample participants purposefully (maximum variation sampling [36]) to cover participants in different roles, types of companies, and countries. We intentionally recruited most participants from organizations outside of big tech companies, as they represent the vast majority of projects that have recently adopted machine learning and often face substantially different challenges [40]. Where possible, we tried to separately interview multiple participants in different roles within the same organization to get different perspectives. We identified potential participants through personal networks, ML-related networking events, LinkedIn, and recommendations from previous interviewees and local tech leaders. We adapted our recruitment strategy throughout the research based on our findings, at later stages focusing primarily on specific roles and organizations to fill gaps in our understanding, until reaching saturation. For confidentiality, we refer to organizations by number and to participants by PXy, where X refers to the organization number and y distinguishes participants from the same organization.

We transcribed and analyzed all interviews. Then, to map challenges to collaboration points, we created visualizations of organizational structure and responsibilities in each organization (we
show two examples in Figure 1) and mapped collaboration problems mentioned in the interviews to collaboration points within these visualizations. We used these visualizations to further organize our data; in particular, we explored whether collaboration problems associate with certain types of organizational structures.

Step 3: Triangulation with literature. As we gained insights from interviews, we returned to the literature to identify related discussions and possible solutions (even if not originally framed in terms of collaboration) to triangulate our interview results. Relevant literature spans multiple research communities and publication venues, including machine learning, human-computer interaction, software engineering, systems, and various application domains (e.g., healthcare, finance), and does not always include obvious keywords; simply searching for machine-learning research yields a far too wide net. Hence, we decided against a systematic literature review and pursued a best-effort approach that relied on keyword search for topics surfaced in the interviews, as well as backward and forward snowballing. Out of over 300 papers read, we identified 61 as possibly relevant and coded them with the same evolving codebook. The complete list can be found in Supplement D.

Step 4: Validity check with interviewees. For checking fit and applicability as defined by Corbin and Strauss [98] and validating our findings, we went back to the interviewees after creating a full draft of this paper. We presented the interviewees both a summary and the full draft, including the supplementary material, along with questions prompting them to look for correctness and areas of agreement or disagreement (i.e., fit), and any insights gained from reading about experiences of the other companies, roles, or findings as a whole (i.e., applicability). Ten interviewees responded with comments and all indicated general agreement; some explicitly reaffirmed some findings. We incorporated two minor suggested changes about details of two organizations.

Threats to validity and credibility. Our work exhibits the typical threats common and expected for this kind of qualitative research. Generalizations beyond the sampled participant distribution should be made with care; for example, we interviewed few managers, no dedicated data experts, and no clients. In several organizations, we were only able to interview a single person, giving us a one-sided perspective. Observations may be different in organizations in specific domains or geographic regions not well represented in our data. Self-selection of participants may influence results; for example, developers in government-related projects more frequently declined interview requests. As described earlier, we followed standard practices for coding and memoing, but, as usual in qualitative research, we cannot entirely exclude biases introduced by us researchers.

4 DIVERSITY OF ORG. STRUCTURES

Throughout our interviews, we found that the number and type of teams that participate in ML-enabled system development differs widely, as do their composition and responsibilities, power dynamics, and the formality of their collaborations, in line with findings by Aho et al. [1]. To illustrate these differences, we provide simplified descriptions of teams found in two organizations in Figure 1. We show teams and their members, as well as the artifacts for which they are responsible, such as who develops the model, who builds a repeatable pipeline, who operates the model (inference), who is responsible for or owns the data, and who is responsible for the final product. A team often has multiple responsibilities and interfaces with other teams at multiple collaboration points. Where unambiguous, we refer to teams by their primary responsibility as product team or model team.

Organization 3 (Figure 1, top) develops an ML-enabled system for a government client. The product (health domain), including an ML model and multiple non-ML components, is developed by a single 8-person team. The team focuses on training a model first, before building a product around it. Software engineering and data science tasks are distributed within the team, where members cluster into groups with different responsibilities and roughly equal negotiation power. A single data scientist is part of this team, though they feel somewhat isolated. Data is sourced from public sources. The relationship between the client and development team is somewhat distant and formal. The product is delivered as a service, but the team only receives feedback when things go wrong.

Organization 7 (Figure 1, bottom) develops a product for in-house use (quality control for a production process). A small team is developing and using the product, but model development is delegated to an external team (different company) composed of four data scientists, of which two have some software engineering background. The product team interacts with the model team to define and revise model requirements based on product requirements. The product team provides confidential proprietary data for training. The model team deploys the model and provides a ready-to-use inference API to the product team. The relationship between the teams crosses company boundaries and is rather distant and formal. The product team clearly has the power in negotiations between the teams.

These two organizations differed along many dimensions, and we found no clear global patterns when looking across organizations. Nonetheless, patterns did emerge when focusing on three specific collaboration aspects, as we will discuss in the next sections.

5 COLLABORATION POINT: REQUIREMENTS AND PLANNING

In an idealized top-down process, one would first solicit product requirements and then plan and design the product by dividing work into components (ML and non-ML), deriving each component's requirements/specifications from the product requirements. In this process, collaboration is needed for: (1) the product team needs to negotiate product requirements with clients and other stakeholders; (2) the product team needs to plan and design the product decomposition, negotiating with component teams the requirements for individual components; and (3) the product project manager needs to plan and manage the work across teams in terms of budgeting, effort estimation, milestones, and work assignments.

5.1 Common Development Trajectories

Few organizations, if any, follow an idealized top-down process, and it may not even be desirable, as we will discuss later. While we did not find any global patterns for organizational structures (Sec. 4), there are indeed distinct patterns relating to how organizations elicit requirements and decompose their systems. Most importantly,
we see differences in terms of the order in which teams identify product and model requirements:

Model-first trajectory: 13 of the 28 organizations (3, 10, 14–17, 19, 20, 22, 23, 25–27) focus on building the model first, and build a product around the model later. In these organizations, product requirements are usually shaped by model capabilities after the (initial) model has been created, rather than being defined upfront. In organizations with separate model and product teams, the model team typically starts the project and the product team joins later with low negotiating power to build a product around the model.

Product-first trajectory: In 13 organizations (1, 4, 5, 7–9, 11–13, 18, 21, 24, 28), models are built later to support an existing product. In these cases, a product often already exists and product requirements are collected for how to extend the product with new ML-supported functionality. Here, the model requirements are derived from the product requirements and often include constraints on model qualities, such as latency, memory, and explainability.

Parallel trajectory: Two organizations (2, 6) follow no clear temporal order; model and product teams work in parallel.

5.2 Product and Model Requirements

We found a constant tension between product and model requirements in our interviews. Functional and nonfunctional product requirements set expectations for the entire product. Model requirements set goals and constraints for the model team, such as expected accuracy and latency, target domain, and available data.

Product requirements require input from the model team (�, �). A common theme in the interviews is that it is difficult to elicit product requirements without a good understanding of ML capabilities, which almost always requires involving the model team and performing some initial modeling when eliciting product requirements. Regardless of whether product requirements or model requirements are elicited first, data scientists often mentioned being faced with unrealistic expectations about model capabilities.

Participants that interact with clients to negotiate product requirements (which may involve members of the model team) indicate that they need to educate clients about capabilities of ML techniques to set correct expectations (P3a, P6a, P6b, P7b, P9a, P10a, P15c, P19b, P22b, P24a). This need to educate customers about ML capabilities has also been raised in the literature [1, 17, 44, 49, 99, 102, 105].

For many organizations, especially in product-first trajectories, the model team indicates similar challenges when interacting with the product team. If the product team does not involve the model team in negotiating product requirements, the product team may not identify what data is needed for building the model, and may commit to unrealistic requirements. For example, P26a shared "For this project, [the project manager] wanted to claim that we have no false positives and I was like, that's not gonna work." Members of the model team often report lack of ML literacy in members of the product team and project managers (P1b, P4a, P7a, P12a, P26a, P27a) and a lack of involvement (e.g., P7b: "The [product team] decided what type of data would make sense. I had no say on that."). Usually the product team cannot identify product requirements alone; instead, product and model teams need to interact to explore what is achievable.

In organizations with a model-first trajectory, members of the model team sometimes engage directly with clients (and also report having to educate them about ML capabilities). However, when requirements elicitation is left to the model team, members tend to focus on requirements relevant for the model, but neglect requirements for the product, such as expectations for usability, e.g., P3c's customers "were kind of happy with the results, but weren't happy with the overall look and feel or how the system worked." Several research papers similarly identified how the goals of data scientists diverge from product goals if product requirements are not obvious at modeling time, leading to inefficient development, worse products, or constant renegotiation of requirements, especially [66, 72, 111].

Model development with unclear model requirements is common (�). Participants from model teams frequently explain how they are expected to work independently, but are given sparse model requirements. They try to infer intentions behind them, but are constrained by having limited understanding of the product that the model will eventually support (P3a, P3b, P16b, P17b, P19a). Model teams often start with vague goals and model requirements evolve over time as product teams or clients refine their expectations in response to provided models (P3b, P7a, P9a, P5b, P19b, P21a). Especially in organizations following the model-first trajectory, model teams may receive some data and a goal to predict something with high accuracy, but no further context, e.g., P3a shared "there isn't always an actual spec of exactly what data they have, what data they think they're going to have and what they want the model to do." Several papers similarly report projects starting with vague model goals [50, 76, 82, 110].

Even in organizations following a product-first trajectory, product requirements are often not translated into clear model requirements. For example, participant P17b reports how the model team was not clear about the model's intended target domain, thus could not decide what data was considered in scope. As a consequence, model teams usually cannot focus just on their component, but have to understand the entire product to identify model requirements in the context of the product (P3a, P10a, P13a, P17a, P17b, P19b, P20b, P23a), requiring interactions with the product team or even bypassing the product team to talk directly to clients. The difficulty of providing clear requirements for an ML model has also been raised in the literature [49, 55, 79, 91, 103, 110], partially arguing that uncertainty makes it difficult to specify model requirements upfront [1, 44, 50, 68, 105]. Ashmore et al. report mapping product requirements to model requirements as an open challenge [10].

Provided model requirements rarely go beyond accuracy and data security (�, �). Requirements given to model teams primarily relate to some notion of accuracy. Beyond accuracy, requirements for data security and privacy are common, typically imposed by the data owner or by legal requirements (P5a, P7a, P9a, P13a, P14a, P18a, P20a-b, P21a-b, P22a, P23a, P24a, P25a, P26a). Literature also frequently discusses how privacy requirements impact and restrict ML work [15, 41, 43, 55, 56, 77].

We rarely heard of any qualities other than accuracy. Some participants report that ignoring qualities such as latency or scalability has resulted in integration and operation problems (P3c, P11a). In a few cases requirements for inference latency were provided (P1a, P6a, P14a) and in one case hardware resources provided constraints on memory usage (P14a), but no other qualities such as training
latency, model size, fairness, or explainability were required that could be important for product integration and deployment.

When prompted, very few of our interviewees report considerations for fairness either at the product or the model level. Only two participants from model teams (P14a, P22a) reported receiving fairness requirements, whereas many others explicitly mentioned that fairness is not a concern for them yet (P4a, P5b, P6b, P11a, P15c, P20a, P21b, P25a, P26a). The lack of fairness and explainability requirements is in stark contrast to the emphasis that these qualities receive in the literature [e.g., 7, 15, 25, 39, 40, 57, 88, 91, 108, 113].

Recommendations. Our observations suggest that involving data scientists early when soliciting product requirements is important (�) and that pursuing a model-first trajectory entirely without considering product requirements is problematic (�). Conversely, model requirements are rarely specific enough to allow data scientists to work in isolation without knowing the broader context of the system, and interaction with the product team should ideally be planned as part of the process. Requirements form a key collaboration point between product and model teams, which should be emphasized even in more distant collaboration styles (e.g., outsourced model development). The few organizations that use the parallel trajectory report fewer problems by involving data scientists in negotiating product requirements to discard unrealistic ones early on (P6b). Vogelsang and Borg also provide similar recommendations to consult data scientists from the beginning to help elicit requirements [102]. While many papers place emphasis on clearly defining ML use cases and scope [49, 92, 99], several others mention how collaboration of technical and non-technical stakeholders such as domain experts helps [72, 88, 103, 105].

ML literacy for customers and product teams appears to be important (�). P22a and P19a suggested conducting technical ML training sessions to educate clients; similar training is also useful for members of product teams. Several papers argue for similar training for non-technical users of ML products [44, 88, 102].

Most organizations elicit requirements only rather informally and rarely have good documentation, especially when it comes to model requirements. It seems beneficial to adopt more formal requirements documentation for product and model (�), as several participants reported that it fosters shared understanding at this collaboration point (P11a, P13a, P19b, P22a, P22c, P24a, P25a, P26a). Checklists could help to cover a broader range of model quality requirements, such as training latency, fairness, and explainability. Formalisms such as model cards [64] and FactSheets [8] could be used as a starting point for documenting model requirements.
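To make the documentation recommendation more concrete, the sketch below shows one possible machine-readable starting point for a model requirements record in the spirit of model cards and FactSheets. It is our illustration only: the field names, example values, and the choice of a Python dataclass are assumptions, not a format used by any interviewed organization.

```python
# Illustrative only: a lightweight, machine-readable "model requirements" record
# in the spirit of model cards / FactSheets. All field names are hypothetical.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ModelRequirements:
    model_name: str
    intended_use: str                 # product context the model must support
    target_domain: str                # which inputs are in scope
    accuracy_metric: str              # how "good enough" will be judged
    accuracy_target: float
    latency_budget_ms: int            # inference latency expected by the product team
    memory_budget_mb: int
    training_latency_budget_h: float  # how quickly retraining must complete
    fairness_checks: list = field(default_factory=list)
    explainability_needs: str = "none specified"
    data_expectations: str = ""       # quantity/quality/refresh expectations
    responsible_team: str = ""        # who owns monitoring and retraining

    def to_json(self) -> str:
        """Serialize so the record can be versioned alongside code and data."""
        return json.dumps(asdict(self), indent=2)

if __name__ == "__main__":
    req = ModelRequirements(
        model_name="sales-forecast-v1",          # hypothetical example values
        intended_use="daily sales predictions emailed to regional managers",
        target_domain="retail transactions from EU stores, 2019 onward",
        accuracy_metric="MAPE on most recent 4 weeks of production data",
        accuracy_target=0.12,
        latency_budget_ms=200,
        memory_budget_mb=512,
        training_latency_budget_h=6.0,
        fairness_checks=["error rate parity across store regions"],
        explainability_needs="feature importances reviewable by product team",
        data_expectations="weekly refresh, >= 50k labeled rows, agreed schema",
        responsible_team="model team (retraining), product team (monitoring)",
    )
    print(req.to_json())
```

A record of this kind would make the qualities beyond accuracy (latency, memory, fairness, explainability) explicit at the requirements collaboration point, rather than leaving them to be discovered at integration time.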
5.3 Project Planning

ML uncertainty makes effort estimation difficult (�). Irrespective of trajectory, 19 participants (P3a, P4a, P7a-b, P8a, P14b, P15b-c, P16a, P17a, P18a, P19a-b, P20a, P22a-c, P23a, P25a) mentioned that the uncertainty associated with ML components makes it difficult to estimate the timeline for developing an ML component and by extension the product. Model development is typically seen as a science-like activity, where iterative experimentation and exploration is needed to identify whether and how a problem can be solved, rather than as an engineering activity that follows a somewhat predictable process. This science-like nature makes it difficult for the model team to set expectations or contracts with clients or the product team regarding effort, cost, or accuracy. While data scientists find effort estimation difficult, lack of ML literacy in managers makes it worse (P15b, P16a, P19b, P20a, P22b). Teams report deploying subpar models when running out of time (P3a, P15b, P19a), or postponing or even canceling deployments (P25a). These findings align with literature mentioning difficulties associated with effort estimation for ML tasks [1, 9, 61, 105] and planning projects in a structured manner with diverse methodologies, with diverse trajectories, and without practical guidance [1, 17, 61, 105].

Generally, participants frequently report that synchronization between teams is challenging because of different team pace, different development processes, and tangled responsibilities (P2a, P11a, P12a, P14a-b, P15b-c, P19a; see also Sec. 7.2).

Recommendations. Participants suggested several mitigation strategies: keeping extra buffer times and adding additional timeboxes for R&D in initial phases (P8a, P19a, P22b-c, P23a; �), and continuously involving clients in every phase so that they can understand the progression of the project and be aware of potential missed deadlines (P6b, P7a, P22a, P23a; �). From the interviews, we also observe the benefits of managers who understand both software engineering and machine learning and can align product and model teams toward common goals (P2a, P6a, P8a, P28a; �).

6 COLLABORATION POINT: TRAINING DATA

Data is essential for machine learning, but disagreements and frustrations around training data were the most common collaboration challenges mentioned in our interviews. In most organizations, the team that is responsible for building the model is not the team that collects, owns, and understands the data, making data a key collaboration point between teams in ML-enabled systems development.

6.1 Common Organizational Structures

We observed three patterns around data that influence collaboration challenges from the perspective of the model team:

Provided data: The product team has the responsibility of providing data to the model team (org. 6–8, 13, 18, 21, 23). The product team is the initial point of contact for all data-related questions from the model team. The product team may own the data or acquire it from a separate data team (internal or external). Coordination regarding data tends to be distant and formal, and the product team tends to hold more negotiation power.

External data: The product team does not have direct responsibility for providing data, but instead, the model team relies on external data providers. Commonly, the model team (i) uses publicly available resources (e.g., academic datasets, org. 2–4, 6, 19) or (ii) hires a third party for collecting or labeling data (org. 9, 15–17, 22, 23). In the former case, the model team has little to no negotiation power over data; in the latter, it can set expectations.

In-house data: Product, model, and data teams are all part of the same organization and the model team relies
on internal data from that organization (org. 1, 5, 9–12, 14, 20, 24–28). In these cases, both product and model teams often find it challenging to negotiate access to internal data due to differing priorities, internal politics, permissions, and security constraints.

6.2 Negotiating Data Quality and Quantity

Disagreements and frustrations around training data were the most common collaboration challenges in our interviews. In almost every project, data scientists were unsatisfied with the quality and quantity of data they received at this collaboration point, in line with a recent survey showing data availability and management to be the top-ranked challenge in building ML-enabled systems [5].

Provided and public data is often inadequate (�, �). In organizations where data is provided by the product team, the model team commonly states that it is difficult to get sufficient data (P7a, P8a, P13a, P22a, P22c). The data that they receive is often of low quality, requiring significant investment in data cleaning. Similar to the requirements challenges discussed earlier, they often state that the product team has little knowledge or intuition for the amount and quality of data needed. For example, participant P13a stated that they were given a spreadsheet with only 50 rows to build a model and P7a reported having to spend a lot of time convincing the product team of the importance of data quality. This aligns with past observations that software engineers often have little appreciation for data quality concerns [49, 54, 65, 76, 83] and that training data is often insufficient and incomplete [6, 43, 56, 76, 82, 92, 105].

When the model team uses public data sources, its members also have little influence over data quality and quantity and report significant effort for cleaning low quality and noisy data (P2a, P3a, P4a, P3c, P6b, P19b, P23a). Papers have similarly questioned the representativeness and trustworthiness of public training data [34, 102, 108] as "nobody gets paid to maintain such data" [104].

Training-serving skew is a common challenge when training data is provided to the model team: models show promising results, but do not generalize to production data because it differs from provided training data (P4a, P8a, P13a, P15a, P15c, P21a, P22c, P23a) [9, 23, 55, 56, 76–78, 83, 99, 108, 115]. Our interviews show that this skew often originates from inadequate training data combined with unclear information about production data, and therefore no chance to evaluate whether the training data is representative of production data.

Data understanding and access to domain experts is a bottleneck (�, �). Existing data documentation (e.g., data item definitions, semantics, schema) is almost never sufficient for model teams to understand the data (also mentioned in a prior study [46]). In the absence of clear documentation, team members often collect information and keep track of unwritten details in their heads (P5a), known as institutional or tribal knowledge [5, 40]. Data understanding and debugging often involve members from different teams and thus cause challenges at this collaboration point.

Model teams receiving data from the product team report struggling with data understanding and having a difficult time getting help from the product team (or the data team that the product team works with) (P8a, P7b, P13a). As the model team does not have direct communication with the data team, data understanding issues often cannot be resolved effectively. For example, P13a reports "Ideally, for us it would be so good to spend maybe a week or two with one person continuously trying to understand the data. It's one of the biggest problems actually, because even if you have the person, if you're not in contact all the time, then you misinterpreted some things and you build on it." The low negotiation power of the model team in these organizations hinders access to domain experts.

Model teams using public data similarly struggle with data understanding and getting help (P3a, P4a, P19a), relying on sparse data documentation or trying to reach any experts on the data.

For in-house projects, in several organizations the model team relies on data in shared databases (org. 5, 11, 26, 27, 28), collected by instrumenting a production system, but shared by multiple teams. Several teams shared problems with evolving and often poorly documented data sources, as participant P5a illustrates "[data rows] can have 4,000 features, 10,000 features. And no one really cares. They just dump features there. [...] I just cannot track 10,000 features." Model teams face challenges in understanding data and identifying a team that can help (P5a, P25a, P20b, P27a), a problem also reported in a prior study about data scientists at Microsoft [49].

Challenges in understanding data and needing domain experts are also frequently mentioned in the literature [13, 40, 41, 46, 49, 65, 76, 83], as is the danger of building models with insufficient understanding of the data [34, 102]. Although we are not aware of literature discussing the challenges of accessing domain experts, papers have shown that even when data scientists have access, effective knowledge transfer is challenging [70, 90].

Ambiguity when hiring a data team (�). When the model team hires an external data team for collecting or labelling data (org. 9, 15, 16, 17, 22, 23), the model team has much more negotiation power over setting data quality and quantity expectations (though Kim et al. report that model teams may have difficulty getting buy-in from the product team for hiring a data team in the first place [49]). Our interviews did not surface the same frustrations as with provided data and public data, but instead participants from these organizations reported communication vagueness and hidden assumptions as key challenges at this collaboration point (P9a, P15a, P15c, P16a, P17b, P22a, P22c, P23a). For example, P9a related how different labelling companies given the same specification widely disagreed on labels, when the specification was not clear enough. We found that expectations between model and data teams are often communicated verbally without clear documentation. As a result, the data team often does not have sufficient context to understand what data is needed. For example, participant P17b states "Data collectors can't understand the data requirements all the time. Because, when a questionnaire [for data collection] is designed, the overview of the project is not always described to them. Even if we describe it, they can't always catch it." Reports about low quality data from hired data teams have also been discussed in the literature [10, 43, 55, 83, 102, 105].

Need to handle evolving data (�, �). In most projects, models need to be regularly retrained with more data or adapted to changes in the environment (e.g., data drift) [42, 55, 83], which is a challenge for many model teams (P3a, P3c, P5a, P7a-b, P11a, P15c, P18a, P19b, P22a). When product teams provide the data, they often
have a static view and provide only a single snapshot of data rather than preparing for updates, and model teams with their limited negotiation power have a difficult time fostering a more dynamic mindset (P7a-b, P15c, P18a, P22a), as expressed by participant P15c: "People don't understand that for a machine learning project, data has to be provided constantly." It can be challenging for a model team to convince the product team to invest in continuous model maintenance and evolution (P7a, P15c) [46].

Conversely, if data is provided continuously (most commonly with public data sources, in-house sources, and own data teams), model teams struggle with ensuring consistency over time. Data sources can suddenly change without announcement (e.g., changes to schema, distributions, semantics), surprising model teams that make but do not check assumptions about the data (P3a, P3c, P19b). For example, participants P5a and P11a report similar challenges with in-house data, where their low negotiation power does not allow them to set quality expectations, but they face undesired and unannounced changes in data sources made by other teams. Most organizations do not have a monitoring infrastructure to detect changes in data quality or quantity, as we will discuss in Sec. 7.3.

In-house priorities and security concerns often obstruct data access (�). In in-house projects, we frequently heard about the product or model team struggling to work with another team within the same organization that owns the data. Often, these in-house projects are local initiatives (e.g., logistics optimization) with more or less buy-in from management and without buy-in from other teams that have their own priorities; sometimes other teams explicitly question the business value of the product. The interviewed model teams usually have little negotiation power to request data (especially if it involves collecting additional data) and almost never get an agreement to continuously receive data in a certain format, quality, or quantity (P5a, P10a, P11a, P20a-b, P27a) (also observed in studies at Microsoft, ING and other organizations [34, 49, 65]). For example, P10a shared "we wanted to ask the data warehouse team to [provide data], and it was really hard to get resources. They wouldn't do that because it was hard to measure the impact [our in-house project] had on the bottom line of the business." Model teams in these settings tend to work with whatever data they can get eventually. Security and privacy concerns can also limit access to data (P7a, P7b, P21a-b, P22a, P24a) [46, 55, 56, 65, 76], especially when data is owned by a team in a different organization, causing frustration, lengthy negotiations, and sometimes expensive data-handling restrictions (e.g., no use of cloud resources) for model teams.

Recommendations. Data quality and quantity is important to model teams, yet they often find themselves in a position of low negotiation power, leading to frustration and collaboration inefficiencies. Model teams that have the freedom to set expectations and hire their own data teams are noticeably more satisfied. When planning the entire product, it seems important to pay special attention to this collaboration point, and budget for data collection, access to domain experts, or even a dedicated data team (�). Explicitly planning to provide substantial access to domain experts early in the project was suggested as important (P25a).

We found it surprising that despite the importance of this collaboration point there is little written agreement on expectations and often limited documentation (�), even when hiring a dedicated data team—in stark contrast to more established contracts for traditional software components. Not all organizations allow the more agile, constant close collaboration between model and data teams that some suggest [76, 78]. With a more formal or distant relationship (e.g., across organizations, teams without buy-in), it seems beneficial to adopt a more formal contract, specifying data quantity and quality expectations, which are well researched in the database literature [58] and have been repeatedly discussed in the context of ML-enabled systems [43, 46, 49, 56, 90]. This has also been framed as data requirements in the software engineering literature [82, 99, 102]. When working with a dedicated data team, participants suggested to invest in making expectations very clear, for example, by providing precise specifications and guidelines (P9a, P6b, P28a), running training sessions for the data collectors and annotators (P17b, P22c), and measuring inter-rater agreement (P6b).

Automated checks are also important as data evolves (�). For example, participant P13a mentioned proactively setting up data monitoring to detect problems (e.g., schema violations, distribution shifts) at this collaboration point; a practice suggested also in the literature [53, 56, 76, 78, 83, 88, 99] and supported by recent tooling [e.g., 47, 78, 85]. The risks regarding possible unnoticed changes to data make it important to consider data validation and monitoring infrastructure as a key feature of the product early on (�, �), as also emphasized by several participants (P5a, P25a, P26a, P28a).
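As an illustration of the kind of automated check described above—schema validation plus a crude distribution-shift warning on each data delivery—the following sketch assumes a pandas-based workflow; the column names, expected dtypes, and drift threshold are hypothetical, and production teams would more likely rely on dedicated data-validation tooling.

```python
# Illustrative only: a minimal data "contract" check a model team could run on every
# data delivery -- schema validation plus a crude distribution-shift warning.
# Column names, dtypes, and thresholds are hypothetical assumptions.
import pandas as pd

EXPECTED_SCHEMA = {           # agreed with the data-providing team
    "customer_id": "int64",
    "purchase_amount": "float64",
    "region": "object",
}

def check_schema(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema violations."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"column {col}: expected {dtype}, got {df[col].dtype}")
    for col in df.columns:
        if col not in EXPECTED_SCHEMA:
            problems.append(f"unexpected column: {col}")
    return problems

def check_drift(reference: pd.Series, delivered: pd.Series, max_shift: float = 3.0) -> list:
    """Warn if the delivered column's mean moved far from the reference distribution."""
    std = reference.std() or 1e-9
    shift = abs(delivered.mean() - reference.mean()) / std
    if shift > max_shift:
        return [f"{delivered.name}: mean shifted {shift:.1f} std devs from reference"]
    return []

if __name__ == "__main__":
    reference = pd.DataFrame({"customer_id": [1, 2, 3],
                              "purchase_amount": [10.0, 12.0, 11.0],
                              "region": ["eu", "eu", "us"]})
    delivery = pd.DataFrame({"customer_id": [4, 5, 6],
                             "purchase_amount": [95.0, 110.0, 105.0],  # suspicious jump
                             "region": ["eu", "us", "us"]})
    report = check_schema(delivery) + check_drift(reference["purchase_amount"],
                                                  delivery["purchase_amount"])
    print("\n".join(report) or "delivery passed all checks")
```

Even a simple check of this kind, run automatically when data is handed over, would surface the unannounced schema and distribution changes that several model teams reported discovering only after models degraded.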
7 COLLABORATION POINT: PRODUCT-MODEL INTEGRATION

As discussed earlier, to build an ML-enabled system both ML components and traditional non-ML components need to be integrated and deployed, requiring data scientists and software engineers to work together, typically across multiple teams. We found many conflicts at this collaboration point, stemming from unclear processes and responsibilities, as well as differing practices and expectations.

7.1 Common Organizational Structures

We saw large differences among organizations in how engineering responsibilities were assigned, most visible in how responsibility for model deployment and operation is assigned, which typically involves significant engineering effort for building reproducible pipelines, API design, or cloud deployment, often with MLOps technologies. We found the following patterns:

Shared model code: In some organizations (2, 6, 23, 25), the model team is responsible only for model development and delivers training code (e.g., in a notebook) or model files to the product team; the product team takes responsibility for deployment and operation of the model, possibly rewriting the training code as a pipeline. Here, the model team has little or no engineering responsibilities.

Model as API: In most organizations (18 out of 28), the model team is responsible for developing and deploying the model. Hence, the model team requires substantial engineering skills in addition
to data science expertise. Here, some model teams are mostly composed of data scientists with little engineering capabilities (org. 7, 13, 17, 22, 26), some consist mostly of software engineers who have picked up some data science knowledge (org. 4, 15, 16, 18, 19, 21, 24), and others have mixed team members (org. 1, 9, 11, 12, 14, 28). These model teams typically provide an API to the product team, or release individual model predictions (e.g., shared files, email; org. 17, 19, 22) or install models directly on servers (org. 4, 9, 12).
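The following minimal sketch illustrates the "model as API" pattern: the model team wraps a trained model behind an HTTP endpoint that the product team consumes. The use of Flask, the endpoint name, the payload fields, and the placeholder model are our assumptions for illustration and are not drawn from the interviews.

```python
# Illustrative only: a minimal "model as API" wrapper in the spirit of the pattern
# described above. The framework choice (Flask), endpoint name, payload fields, and
# the dummy model are assumptions, not details from the study.
from flask import Flask, jsonify, request

app = Flask(__name__)

class DummyModel:
    """Stand-in for a trained model loaded from the model team's storage."""
    def predict(self, features: list) -> float:
        return sum(features) / max(len(features), 1)

model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        # Making input expectations explicit at the API boundary helps avoid the
        # mismatched-assumption problems participants reported at integration time.
        return jsonify({"error": "expected non-empty 'features' list"}), 400
    return jsonify({"prediction": model.predict(features),
                    "model_version": "sales-forecast-v1"})  # hypothetical version tag

if __name__ == "__main__":
    # The product team consumes this endpoint, e.g.:
    #   curl -X POST localhost:8080/predict -H 'Content-Type: application/json' \
    #        -d '{"features": [1.0, 2.0, 3.0]}'
    app.run(host="0.0.0.0", port=8080)
```

In this pattern the API (input format, error behavior, version identifier) is the contract between model and product teams, which is why documenting it carefully matters, as discussed in Section 7.2.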
All-in-one: If only few people work on model and product, sometimes a single team (or even a single person) shares all responsibilities (org. 3, 5, 10, 20, 27). It can be a small team with only data scientists (org. 10, 20, 27) or mixed teams with data scientists and software engineers (org. 3, 5).

We also observed two outliers: One startup (org. 8) had a distinct model deployment team, allowing the model team to focus on data science without much engineering responsibility. In one large organization (org. 28), an engineering-focused model team (model as API) was supported by a dedicated research team focused on data-science research with fewer engineering responsibilities.

7.2 Responsibility and Culture Clashes

Interdisciplinary collaboration is challenging (cf. Sec. 2). We observed many conflicts between data science and software engineering culture, made worse by unclear responsibilities and boundaries.

Team responsibilities often do not match capabilities and preferences (�). When the model team has responsibilities requiring substantial engineering work, we observed some dissatisfaction when its members were assigned undesired responsibilities. Data scientists preferred engineering support rather than needing to do everything themselves (P7a-b, P13a), but can find it hard to convince management to hire engineers (P10a, P20a, P20b). For example, P10a describes "I was struggling to change the mindset of the team lead, convincing him to hire an engineer...I just didn't want this to be my main responsibility." Especially in small teams, data scientists report struggling with the complexity of the typical ML infrastructure (P7b, P9a, P14a, P26a, P28a).

In contrast, when deployment is the responsibility of software engineers in the product team or of dedicated engineers in all-in-one teams, some of those engineers report problems integrating the models due to insufficient knowledge on model context or domain, and the model code not being packaged well for deployment (P20b, P23a, P27a). In several organizations, we heard about software engineers performing ML tasks without having enough ML understanding (P5a, P15b-c, P16b, P18b, P19b, P20b). Mirroring observations from past research [110], P5a reports "there are people who are ML engineers at [company], but they don't really understand ML. They were actually software engineers... they don't understand [overfitting, underfitting, ...]. They just copy-paste code."

Siloing data scientists fosters integration problems (�, �). We observed data scientists often working in isolation—known as siloing—in all types of organizational structures, even within single small teams (see Sec. 4) and within engineering-focused teams. In such settings, data scientists often work in isolation with weak requirements (cf. Sec. 5.2) without understanding the larger context, seriously engaging with others only during integration (P3a, P3c, P6a, P7b, P11a, P13a, P15b, P25a) [41], where problems may surface. For example, participant P11a reported a problem where product and model teams had different assumptions about the expected inputs and the issue could only be identified after a lot of back and forth between teams at a late stage in the project.

Technical jargon challenges communication (�). Participants frequently described communication issues arising from differing terminology used by members from different backgrounds (P1a-b, P2a, P3a, P5b, P8a, P12a, P14a-b, P16a, P17a-b, P18a-b, P20a, P22b, P23a), leading to ambiguity, misunderstandings, and inconsistent assumptions (on top of communication challenges with domain experts) [1, 46, 75, 103]. P1b reports, "There are a lot of conversations in which disambiguation becomes necessary. We often use different kinds of words that might be ambiguous." For example, data scientists may refer to prediction accuracy as performance, a term many software engineers associate with response time. These challenges can be observed more frequently between teams, but they even occur within a team with members from different backgrounds (P3a-c, P20a).

Code quality, documentation, and versioning expectations differ widely and cause conflicts (�, �). Many participants reported conflicts around development practices between data scientists and software engineers during integration and deployment. Participants report poor practices that may also be observed in traditional software projects, but software engineers in particular expressed frustration in interviews that data scientists do not follow the same development practices or have the same quality standards when it comes to writing code. Reported problems relate to poor code quality (P1b, P2a, P3b, P5a, P6a-b, P10a, P11a, P14a, P15b-c, P17a, P18a, P19a, P20a-b, P26a) [9, 27, 34, 37, 74, 86, 105], insufficient documentation (P5a-b, P6a-b, P10a, P15c, P26a) [8, 46, 64, 113], and not extending version control to data and models (P3c, P7a, P10a, P14a, P20b). In two shared-model-code organizations, participants report having to rewrite code from the data scientists (P2a, P6a-b). Missing documentation for ML code and models is considered the cause for different assumptions that lead to incompatibility between ML and non-ML components (P10a) and for losing knowledge and even the model when faced with turnover (P6a-b). Recent papers similarly hold poor documentation responsible for team decisions becoming invisible and inadvertently causing hidden assumptions [34, 40, 43, 46, 75, 113]. Hopkins and Booth described model and data versioning in small companies as desired but "elusive" [40].

Recommendations. Many conflicts relate to boundaries of responsibility (especially for engineering responsibilities) and to different expectations by team members with different backgrounds. Better teams tend to define processes, responsibilities, and boundaries more carefully (�), document APIs at collaboration points between teams (�), and recruit dedicated engineering support for model deployment (�), but also establish a team culture with mutual understanding and exchange (�). Big tech companies usually have more established processes and clearer responsibility assignments than smaller organizations and startups that often follow ad-hoc processes or figure out responsibilities as they go.

The need for engineering skills for ML projects has frequently been discussed [5, 66, 86, 89, 95, 111, 115], but our interviewees
The need for engineering skills for ML projects has frequently been discussed [5, 66, 86, 89, 95, 111, 115], but our interviewees differ widely in whether all data scientists should have substantial engineering responsibilities or whether engineers should support data scientists so that they can focus on their core expertise (�). Especially interviewees from big tech emphasized that they expect engineering skills from all data science hires (P28a). Others emphasized that recruiting software engineers and operations staff with basic data-science knowledge can help with many communication and integration tasks, such as converting experimental ML code for deployment (P2a, P3b), fostering communication (P3c, P25a), and monitoring models in production (P5b). Generally, siloing data scientists is widely recognized as problematic, and many interviewees suggest practices for improving communication (�), such as training sessions for establishing common terminology (P11a, P17a, P22a, P22c, P23a), weekly all-hands meetings to present all tasks and synchronize (P2a, P3c, P6b, P11a), and proactive communication to broadcast upcoming changes in data or infrastructure (P11a, P14a, P14b). This mirrors suggestions to invest in interdisciplinary training [5, 48, 49, 68, 75, 111] and proactive communication [54].

7.3 Quality Assurance for Model and Product

During development and integration, questions of responsibility for quality assurance frequently arise, often requiring coordination and collaboration between multiple teams. This includes evaluating components individually (including the model) as well as their integration and the whole system, often including evaluating and monitoring the system online (in production).

Model adequacy goals are difficult to establish (�, �). Offline accuracy evaluation of models is almost always performed by the model team responsible for building the model, though often they have difficulty deciding locally when the model is good enough (P1a, P3a, P5a, P6a, P7a, P15b, P16b, P23a) [34, 44]. As discussed in Sec. 5 and Sec. 6, model team members often receive little guidance on model adequacy criteria and are unsure about the actual distribution of production data. They also voice concerns about establishing ground truth, for example, needing to support data for different clients, and hence not being able to establish (offline) measures for model quality (P1b, P16b, P18a, P28a). As quality requirements beyond accuracy are rarely provided for models, model teams usually do not feel responsible for testing latency, memory consumption, or fairness (P2a, P3c, P4a, P5a, P6b, P7a, P14a, P15b, P20b). Whereas the literature discusses challenges in measuring the business impact of a model [10, 14, 43, 49] and balancing business goals with model goals [72], interviewed data scientists were concerned about this only with regard to convincing clients, managers, or product teams to provide resources (P7a-b, P10a, P26a, P27a).

Limited confidence without transparent model evaluation (�). Participants in several organizations report that model teams do not prioritize model evaluation and have no systematic evaluation strategy (especially if they do not have established adequacy criteria they try to meet), performing occasional "ad-hoc inspections" instead (P2a, P15b, P16b, P18b, P19b, P20b, P21b, P22a, P22b). Without transparency about their test processes and test results, other teams voiced reduced confidence in the model, leading to skepticism about adopting the model (P7a, P10a, P21b, P22a).

Unclear responsibilities for system testing (�). Teams often struggle with testing the entire product after integrating ML and non-ML components. Model teams frequently explicitly mentioned that they assume no responsibility for product quality (including integration testing and testing in production) and have not been involved in planning for system testing, but that their responsibilities end with delivering a model evaluated for accuracy (P3a, P14a, P15b, P25a, P26a). However, in several organizations, product teams also did not plan for testing the entire system with the model(s) and, at most, conducted system testing in an ad-hoc way (P2a, P6a, P16a, P18a, P22a). Recent literature has reported a similar lack of focus on system testing in product teams [13, 113], mirroring also a focus in academic research on testing models rather than testing the entire system [10, 20]. Interestingly, some established software development organizations delegated testing to an existing separate quality assurance team with no process or experience for testing ML products (P2a, P8a, P16a, P18b, P19a).

Planning for online testing and monitoring is rare (�, �, �). Due to possible training-serving skew and data drift, the literature emphasizes the need for online evaluation [4, 10, 13, 14, 23, 42, 44, 47, 51, 65, 86, 87, 89, 102]. With collected telemetry, one can usually approximate both product and model quality, monitor updates, and experiment in production [14]. Online testing usually requires coordination among multiple teams responsible for product, model, and operation. We observed that most organizations do not perform monitoring or online testing, as it is considered difficult, in addition to a lack of standard process, automation, or even test awareness (P2a, P3a, P3b, P4a, P6b, P7a, P10a, P15b, P16b, P18b, P19b, P25a, P27a). Only 11 out of 28 organizations collected any telemetry; it is most established in big tech organizations. When to retrain models is often decided based on intuition or manual inspection, though many aspire to more automation (P1a, P3a, P3c, P5a, P10a, P22a, P25a, P27a). Responsibilities around online evaluation are often neither planned nor assigned upfront as part of the project.

Most model teams are aware of possible data drift, but many do not have any monitoring infrastructure for detecting and managing drift in production. If telemetry is collected, it is the responsibility of the product or operations team and it is not always accessible to the model team. Four participants report that they rely on manual feedback about problems from the product team (P1a, P3a, P4a, P10a). At the same time, others report that product and operations teams do not necessarily have sufficient data science knowledge to provide meaningful feedback (P3a, P3b, P5b, P18b, P22a) [81].

Recommendations. Quality assurance involves multiple teams and benefits from explicit planning and making it a high priority (�). While the product team should likely take responsibility for product quality and system testing, such testing often involves building monitoring and experimentation infrastructure (�), which requires planning and coordination with teams responsible for model development, deployment, and operation (if separate) to identify the right measures. Model teams benefit from receiving feedback on their model from production systems, but such support needs to be planned explicitly, with corresponding engineering effort assigned and budgeted, even in organizations following a model-first trajectory. We suspect that education about the benefits of testing in production and common infrastructure (often under the label DevOps/MLOps [59]) can increase buy-in from all involved teams (�). Organizations that have established monitoring and experimentation infrastructure strongly endorse it (P5a, P25a, P26a, P28a).
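As a concrete illustration of the kind of lightweight telemetry-based monitoring discussed above, the sketch below compares an accuracy estimate from production telemetry against the offline baseline agreed with the model team and flags a possible drift. It is a simplified, hypothetical example; the window size, tolerance, and the source of delayed ground-truth labels are assumptions, not a description of any participant's system.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Tuple

@dataclass
class OnlineAccuracyMonitor:
    """Tracks a rolling accuracy estimate from labeled telemetry and flags
    when it drops noticeably below the offline baseline (possible drift)."""
    offline_baseline: float   # accuracy agreed on during offline evaluation
    tolerance: float = 0.05   # allowed drop before raising an alert
    window_size: int = 1000   # number of recent predictions to consider

    def __post_init__(self) -> None:
        self._window: Deque[bool] = deque(maxlen=self.window_size)

    def record(self, prediction: str, actual: str) -> None:
        self._window.append(prediction == actual)

    @property
    def online_accuracy(self) -> float:
        return sum(self._window) / len(self._window) if self._window else float("nan")

    def drifted(self) -> bool:
        # Only judge once the window has enough samples to be meaningful.
        return (len(self._window) == self.window_size
                and self.online_accuracy < self.offline_baseline - self.tolerance)

def check_telemetry(monitor: OnlineAccuracyMonitor,
                    labeled_telemetry: Iterable[Tuple[str, str]]) -> None:
    """labeled_telemetry yields (prediction, delayed ground-truth label) pairs,
    e.g., collected by the product or operations team."""
    for prediction, actual in labeled_telemetry:
        monitor.record(prediction, actual)
        if monitor.drifted():
            print(f"ALERT: online accuracy {monitor.online_accuracy:.3f} fell below "
                  f"baseline {monitor.offline_baseline:.3f}; notify the model team.")
```

Who owns such a monitor (product, operations, or model team) and who acts on its alerts is exactly the kind of responsibility that, per the recommendations above, benefits from being assigned and budgeted explicitly.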
Defining clear quality requirements for model and product can help all teams to focus their quality assurance activities (cf. Sec. 5; �). Even when it is challenging to define adequacy criteria upfront, teams can together develop a quality assurance plan for model and product. Participants and literature emphasized the importance of human feedback to evaluate model predictions (P11a, P14a) [87], which requires planning to collect such feedback (�). System and usability testing may similarly require planning for user studies with prototypes and shadow deployment [88, 99, 108].

8 DISCUSSION AND CONCLUSIONS

Through our interviews we identified three central collaboration points where organizations building ML-enabled systems face substantial challenges: (1) requirements and project planning, (2) training data, and (3) product-model integration. Other collaboration points surfaced, but were mentioned far less frequently (e.g., interaction with legal experts and operators), did not relate to problems between multiple disciplines (e.g., data scientists documenting their work for other data scientists), or mirrored conventional collaboration in software projects (e.g., many interviewees wanted to talk about unstable ML libraries and challenges interacting with teams building and maintaining such libraries, though the challenges largely mirrored those of library evolution generally [16, 31]).

Data scientists and software engineers are certainly not the first to realize that interdisciplinary collaborations are challenging and fraught with communication and cultural problems [21], yet it seems that many organizations building ML-enabled systems pay little attention to fostering better interdisciplinary collaboration. Organizations differ widely in their structures and practices, and some organizations have found strategies that work for them (see recommendation sections). Yet, we find that most organizations do not deliberately plan their structures and practices and have little insight into available choices and their tradeoffs. We hope that this work can (1) encourage more deliberation about organization and process at key collaboration points, and (2) serve as a starting point for cataloging and promoting best practices.

Beyond the specific challenges discussed throughout this paper, we see four broad themes that benefit from more attention both in engineering practice and in research:

� Communication: Many issues are rooted in miscommunication between participants with different backgrounds. To facilitate interdisciplinary collaboration, education is key, including ML literacy for software engineers and managers (and even customers) but also training data scientists to understand software engineering concerns. The idea of T-shaped professionals [101] (deep expertise in one area, broad knowledge of others) can provide guidance for hiring and training.

� Documentation: Clearly documenting expectations between teams is important. Traditional interface documentation familiar to software engineers may be a starting point, but practices for documenting model requirements (Sec. 5.2), data expectations (Sec. 6.2), and assured model qualities (Sec. 7.3) are not well established. Recent suggestions like model cards [64] and FactSheets [8] are a good starting point for encouraging better, more standardized documentation of ML components. Given the interdisciplinary nature of these collaboration points, such documentation must be understood by all involved – theories of boundary objects [2] may help to develop better interface description mechanisms.

� Engineering: With attention focused on ML innovations, many organizations seem to underestimate the engineering effort required to turn a model into a product that can be operated and maintained reliably. Arguably, adopting machine learning increases software complexity [48, 68, 86] and makes engineering practices such as data quality checks, deployment automation, and testing in production even more important. Project managers should ensure that the ML and the non-ML parts of the project have sufficient engineering capabilities and foster product and operations thinking from the start.

� Process: Finally, machine learning, with its more science-like process, challenges traditional software process life cycles. It seems clear that product requirements cannot be established without involving data scientists for model prototyping, and often it may be advisable to adopt a model-first trajectory to reduce risk. But while a focus on the product and overall process may cause delays, neglecting it entirely invites the kind of problems reported by our participants. Whether it looks more like the spiral model or agile [22], more research into integrated process life cycles for ML-enabled systems (covering software engineering and data science) is needed.

Acknowledgements. Kästner's and Nahar's work was supported in part by NSF awards 1813598 and 2131477. Zhou's work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), RGPIN2021-03538. Lewis' work was funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University (CMU) for the operation of the Software Engineering Institute (SEI), a federally funded research and development center. We would like to thank all our interview participants (K M Jawadur Rahman, Miguel Jette, and anonymous others) and the people who helped us connect with them.
[59] Mäkinen, S., Skogström, H., Laaksonen, E. and Mikkonen, T. 2021. Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help? In Proc. Workshop on AI Engineering–Software Engineering for AI (WAIN), 109–112.
[60] Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M. and Wagner, S. 2021. Software Engineering for AI-Based Systems: A Survey. arXiv 2105.01984.
[61] Martinez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernandez Orallo, J., Kull, M., Lachiche, N., Ramirez Quintana, M.J. and Flach, P.A. 2021. CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE Transactions on Knowledge and Data Engineering. 33, 8, 3048–3061.
[62] Meyer, B. 1997. Object-Oriented Software Construction. Prentice-Hall.
[63] Mistrík, I., Grundy, J., van der Hoek, A. and Whitehead, J. 2010. Collaborative Software Engineering. Springer.
[64] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D. and Gebru, T. 2019. Model Cards for Model Reporting. In Proc. Conf. Fairness, Accountability, and Transparency, 220–229.
[65] Muiruri, D., Lwakatare, L.E., K Nurminen, J. and Mikkonen, T. 2021. Practices and Infrastructures for ML Systems–An Interview Study. TechRxiv 16939192.v1.
[66] O'Leary, K. and Uchida, M. 2020. Common problems with creating machine learning pipelines from existing code. In Proc. Conf. Machine Learning and Systems (MLSys).
[67] Ovaska, P., Rossi, M. and Marttiin, P. 2003. Architecture as a coordination tool in multi-site software development. Software Process Improvement and Practice. 8, 4, 233–247.
[68] Ozkaya, I. 2020. What Is Really Different in Engineering AI-Enabled Systems? IEEE Software. 37, 4, 3–6.
[69] Panetta, K. 2020. Gartner Identifies the Top Strategic Technology Trends for 2021. URL: https://www.gartner.com/smarterwithgartner/gartner-top-strategic-technology-trends-for-2021.
[70] Park, S., Wang, A., Kawas, B., Vera Liao, Q., Piorkowski, D. and Danilevsky, M. 2021. Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models. In Proc. 26th Int'l Conf. on Intelligent User Interfaces, 585–596.
[71] Parnas, D.L. 1972. On the Criteria to be used in Decomposing Systems into Modules. Communications of the ACM. 15, 12, 1053–1058.
[72] Passi, S. and Phoebe S. 2020. Making Data Science Systems Work. Big Data & Society 7 (2): 1–13.
[73] Patel, K., Fogarty, J., Landay, J.A. and Harrison, B. 2008. Investigating statistical machine learning as a tool for software development. In Proc. Conf. Human Factors in Computing Systems (CHI), 667–676.
[74] Pimentel, J.F., Murta, L., Braganholo, V. and Freire, J. 2019. A large-scale study about quality and reproducibility of Jupyter notebooks. In Proc. 16th Int'l Conf. on Mining Software Repositories (MSR), 507–517.
[75] Piorkowski, D. et al. 2021. How AI Developers Overcome Communication Challenges in a Multidisciplinary Team: A Case Study. In Proc. ACM on Human-Computer Interaction, 5, (CSCW1), 1–25.
[76] Polyzotis, N., Roy, S., Whang, S.E. and Zinkevich, M. 2018. Data Lifecycle Challenges in Production Machine Learning: A Survey. SIGMOD Rec. 47, 2, 17–28.
[77] Polyzotis, N., Roy, S., Whang, S.E. and Zinkevich, M. 2017. Data Management Challenges in Production Machine Learning. In Proc. Int'l Conf. on Management of Data, 1723–1726.
[78] Polyzotis, N., Zinkevich, M., Roy, S., Breck, E. and Whang, S. 2019. Data validation for machine learning. In Proc. Machine Learning and Systems, 334–347.
[79] Rahimi, M., Guo, J.L.C., Kokaly, S. and Chechik, M. 2019. Toward Requirements Specification for Machine-Learned Components. In Proc. Int'l Requirements Engineering Conf. Workshops (REW), 241–244.
[80] Rakova, B., Yang, J., Cramer, H. and Chowdhury, R. 2021. Where Responsible AI meets Reality: Practitioner Perspectives on Enablers for Shifting Organizational Practices. Proc. ACM Hum.-Comput. Interact. 5, CSCW1, 1–23.
[81] Ré, C., Niu, F., Gudipati, P. and Srisuwananukorn, C. 2019. Overton: A data system for monitoring and improving machine-learned products. arXiv 1909.05372.
[82] Salay, R., Queiroz, R. and Czarnecki, K. 2017. An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software. arXiv 1709.02435.
[83] Sambasivan, N. et al. 2021. "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. In Proc. Conf. on Human Factors in Computing Systems (CHI), 1–15.
[84] Sarma, A., Redmiles, D.F. and van der Hoek, A. 2012. Palantir: Early Detection of Development Conflicts Arising from Parallel Code Changes. IEEE Transactions on Software Engineering. 38, 4, 889–908.
[85] Schelter, S. et al. 2018. Automating Large-scale Data Quality Verification. Proc. VLDB Endowment Int'l Conf. Very Large Data Bases. 11, 12, 1781–1794.
[86] Sculley, D. et al. 2015. Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28, 2503–2511.
[87] Sculley, D., Otey, M.E., Pohl, M., Spitznagel, B., Hainsworth, J. and Zhou, Y. 2011. Detecting adversarial advertisements in the wild. In Proc. Int'l Conf. Knowledge Discovery and Data Mining, 274–282.
[88] Sendak, M.P. et al. 2020. Real-World Integration of a Sepsis Deep Learning Technology Into Routine Clinical Care: Implementation Study. JMIR Medical Informatics. 8, 7, e15182.
[89] Serban, A., van der Blom, K., Hoos, H. and Visser, J. 2020. Adoption and Effects of Software Engineering Best Practices in Machine Learning. In Proc. Int'l Symposium on Empirical Software Engineering and Measurement, 1–12.
[90] Seymoens, T., Ongenae, F. and Jacobs, A. 2018. A methodology to involve domain experts and machine learning techniques in the design of human-centered algorithms. In Proc. IFIP Working Conf. Human Work Interaction Design, 200–214.
[91] Shneiderman, B. 2020. Bridging the gap between ethics and practice. ACM Transactions on Interactive Intelligent Systems. 10, 4, 1–31.
[92] Siebert, J., Joeckel, L., Heidrich, J., Nakamichi, K., Ohashi, K., Namba, I., Yamamoto, R. and Aoyama, M. 2020. Towards Guidelines for Assessing Qualities of Machine Learning Systems. In Proc. Int'l Conf. on the Quality of Information and Communications Technology, 17–31.
[93] Singh, G., Gehr, T., Püschel, M. and Vechev, M. 2019. An abstract domain for certifying neural networks. Proc. ACM Program. Lang. 3, POPL, 1–30.
[94] Smith, D., Alshaikh, A., Bojan, R., Kak, A. and Manesh, M.M.G. 2014. Overcoming barriers to collaboration in an open source ecosystem. Technology Innovation Management Review. 4, 1.
[95] d. S. Nascimento, E. et al. 2019. Understanding Development Process of Machine Learning Systems: Challenges and Solutions. In Proc. Int'l Symposium on Empirical Software Engineering and Measurement (ESEM), 1–6.
[96] de Souza, C.R.B. and Redmiles, D.F. 2008. An Empirical Study of Software Developers' Management of Dependencies and Changes. In Proc. Int'l Conf. Software Engineering (ICSE), 241–250.
[97] Strauss, A. and Corbin, J. 1994. Grounded theory methodology: An overview. Handbook of Qualitative Research. N.K. Denzin, ed. 273–285.
[98] Strauss, A. and Corbin, J.M. Basics of Qualitative Research: Grounded Theory Procedures and Techniques. SAGE Publications.
[99] Studer, S. et al. 2021. Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Machine Learning and Knowledge Extraction, 3(2), 392–413.
[100] Tramèr, F. et al. 2017. FairTest: Discovering Unwarranted Associations in Data-Driven Applications. In Proc. European Symposium on Security and Privacy (EuroS&P), 401–416.
[101] Tranquillo, J. 2017. The T-Shaped Engineer. Journal of Engineering Education Transformations. 30, 4, 12–24.
[102] Vogelsang, A. and Borg, M. 2019. Requirements Engineering for Machine Learning: Perspectives from Data Scientists. In Proc. Int'l Requirements Engineering Conf. Workshops (REW), 245–251.
[103] Wagstaff, K. 2012. Machine Learning that Matters. arXiv 1206.4656.
[104] Wang, A.Y., Mittal, A., Brooks, C. and Oney, S. 2019. How Data Scientists Use Computational Notebooks for Real-Time Collaboration. Proc. Human-Computer Interaction. 3, CSCW, 39.
[105] Wan, Z., Xia, X., Lo, D. and Murphy, G.C. 2019. How does Machine Learning Change Software Development Practices? IEEE Transactions on Software Engineering, 47(9), 1857–1871.
[106] Waterman, M., Noble, J. and Allan, G. 2015. How Much Up-Front? A Grounded Theory of Agile Architecture. In Proc. Int'l Conf. Software Engineering, 347–357.
[107] Staff, V.B. 2019. Why do 87% of data science projects never make it into production? URL: https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/.
[108] Wiens, J. et al. 2019. Do no harm: A roadmap for responsible machine learning for health care. Nature Medicine. 25, 9, 1337–1340.
[109] Xie, X., Ho, J.W.K., Murphy, C., Kaiser, G., Xu, B. and Chen, T.Y. 2011. Testing and Validating Machine Learning Classifiers by Metamorphic Testing. Journal of Systems and Software. 84, 4, 544–558.
[110] Yang, Q., Suh, J., Chen, N.-C. and Ramos, G. 2018. Grounding Interactive Machine Learning Tool Design in How Non-Experts Actually Build Models. In Proc. Conf. Designing Interactive Systems, 573–584.
[111] Yang, Q. The role of design in creating machine-learning-enhanced user experience. In Proc. AAAI Spring Symposium Series, 406–411.
[112] Yokoyama, H. 2019. Machine Learning System Architectural Pattern for Improving Operational Stability. In Proc. Int'l Conf. on Software Architecture Companion (ICSA-C), 267–274.
[113] Zhang, A.X., Muller, M. and Wang, D. 2020. How do data science workers collaborate? Roles, workflows, and tools. Proc. Human-Computer Interaction. 4, CSCW1, 1–23.
[114] Zhou, S., Vasilescu, B. and Kästner, C. 2020. How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub. In Proc. Int'l Conf. Software Engineering (ICSE), 445–456.
[115] Zinkevich, M. 2017. Rules of machine learning: Best practices for ML engineering. URL: https://developers.google.com/machine-learning/guides/rules-of-ml.
Supplementary Material
SUPPLEMENT A: INTERVIEW PARTICIPANTS
Table 4: Distribution of Company Type
SUPPLEMENT B: INTERVIEW GUIDE

Introductory Comments
– This study intended to understand the challenges of collaboration between different stakeholders in a machine learning production system, and the current industry best practices to deal with those challenges. In support of this study, a series of targeted interviews are being conducted to gather information from the stakeholders involved in different stages of building, deploying, and operating machine learning production systems.
– Please do not share any confidential information with us. If you prefer information not to be included in the recording, just let us know.
– The information gathered from the interviews will be aggregated in the form of mismatches and consequences, and will have no connection to any organization.
– Before we begin, we would like to notify you that we intend to record the interview for transcription purposes. Only the research team will be able to access the recordings and transcripts. After the interview is transcribed and identifying information is removed from the transcript, the audio recording will be destroyed.
– The audio recording will be done in a private space for both physical and over-the-phone conversations.

Interview Questions

[Intro].
Project and Role Can you tell me more about you: your role at your company, and your academic and professional background? Tell me about the last ML-based project you worked on. To what extent were you involved in the system, and what was/were your role(s) in it?
ML Component Can you tell me about the ML or non-ML parts of the project? How is the ML component used as part of the larger system?
Team Can you describe a bit about your team? What are the roles involved? What are their tasks? And how do they communicate with each other, and for what purpose? Who is involved with which components?

[Topic-Specific Questions].
Understanding system requirements and ML capabilities Who was involved in the decision to use ML in this project (or whether to turn an ML prototype into a product)? Who do you talk to for the requirements of the system as well as the ML components in the project? – (probe) ask about feedback loops
Project planning and Process
Project planning How do you plan or estimate for the project, specifically the ML components and dependencies between components? Who is involved in the planning?
Process In what order are things developed in the ML project? Do you follow a process model?
Dealing with change Do you plan for change management? How does this planning / replanning happen? Who gets involved?
Components Can you tell me a bit about your ML pipeline? What are the interaction points to the non-ML components? Can you describe/draw the architecture of the system with ML and non-ML components?
System decomposition How did you decide to decompose the whole system into dependent/independent components this way? Was it obvious or did it change? Do you think of ML as one component? Who was involved in the decomposition process and in deciding the module boundaries?
System Correctness How do you plan to evaluate your system's correctness? How do you set quality goals?
ML component testing How do you plan to test the model? Do you conduct offline/online testing? What data is collected later (e.g., for telemetry)? Who made these decisions and do they evolve?
System testing In testing, how much do you focus on testing the model versus testing the entire system? How do you test the system? Who is involved in testing? Who makes decisions?
Breaking change Did you face any application break recently? What is considered a breaking change? How do you deal with breaking changes? How do you detect them? – (probe) Do you consider that models might evolve during model development and are likely to influence other parts of the system?
Data quality How much do you or others in the project worry about data quality?
Data need Think about data that you receive from other parts of the system (or from outside the system) or data you produce for other parts of the system: Are data quality and quantity requirements documented? Are there schema definitions or data quality checks or monitoring? Who "owns" the data? Who cleans the data? Who is responsible for checks or documentation? What happens when data format or quality change? How much influence do you have on data quality and quantity?
Data understanding Is the data schema and its semantics documented somewhere? Who does the documentation? If you don't understand the data, whom do you communicate with to understand it?
Planning and monitoring for drift What happens if the data changes? Who is responsible for notifying changes? How do you get to know about the change? How do you deal with schema evolution?
Special qualities Do you consider requirements like explainability or fairness, and legal requirements like privacy? Are the requirements documented at the system or model level? Who is responsible? Do you plan for fairness/robustness testing?
Versioning Do you maintain model versioning? What about data versioning? Do you think about provenance tracking? Are these documented? Who made decisions? Who is affected?
Reusability, Reproducibility, Maintainability To what extent do you develop documentation for the component? Do you follow any coding standards or conventions? Do you consider designing the module for reuse? Do you care whether components and results are reproducible?
Operations How do you deploy updates to models and non-ML components? How often? Who? Do you consider continuous experimentation? Who is responsible for the operations?
Composition/Integration How do you transfer the model from prototype to production? Who is responsible for integrating the ML model into the system? Do multiple roles collaborate for this or is it done by one specific role? What is the difficulty level of that, according to you?

[Last Thoughts].
ML vs non-ML Can you share your experience with the ML project in comparison with other types of projects that do not include a machine learning component (if you have worked on any)?
Challenges/Benefits Can you think about any challenges you faced during the project development or afterward? Why do you think the challenge arose, and in which step? – (probe) [Interdisciplinary Collaboration] Is working with team members of different backgrounds a challenge in this project?

SUPPLEMENT C: CODEBOOK

1. Understanding System Requirements
Description: This is the collaboration point where the broad system requirements are collected/defined. The overall system goal and system requirements may differ from those of individual ML components and focus on the overall system behavior and its interactions with the environment. While one system can grow around keeping an ML module as a central point, another system can consist of less important ML module(s) working along with other traditional features. Therefore, system requirements may be collected upfront or late and incrementally after the initial model development. For the first category of system, the requirements are generated after the ML component is defined. Thus, the ML design decisions influence how the entire system is shaped and take control of defining the requirements of the overall system. For the second category of system, the requirements gathering is executed similarly to traditional software. One or a few ML features are incorporated into the overall system, and they are influenced and shaped by the broad system-level requirements, constraining the data scientists during modeling. Based on these categories, we expect that stakeholders think differently about the business needs of the system. At this collaboration point, the stakeholders need to define the system scope, system-level requirements, and metrics to set quality expectations for the overall system. This includes considering how the system interacts with the environment and how this influences safety, security, fairness, and feedback-loop issues at the system level. Also, this is the collaboration point where special requirements like explainability, fairness, privacy, safety, security, human-AI interaction planning, etc. need to be defined properly for the overall system, as well as realizing their effect on the ML part. In short, the different stakeholders of the system need to be involved in the requirements collection and realization process.
Agree on: System-level requirements, reasoning about feedback loops, non-functional requirements of the system.
Output: Documentation on requirements including functional and non-functional requirements, environment interaction analysis, reasoning about feedback loops.
Involved: Primarily - Requirements Engineer, Domain Expert, Customer and Manager. Needs consultation with - Software Engineer, Data Scientist, Tester, Operator.
Lifecycle: Traditionally, this would happen in the early requirements stages of a waterfall-like process or iteratively in other process models. In ML projects this may happen when shifting from initial modeling to building a production system.
Why challenging: While developing ML components, the data scientists often forget about the overall system and focus on the model part only. Thus, inconsistencies may arise if the model does not correspond to the system goals. Also, it is often reported that stakeholders find it difficult to define the scope of a project that includes ML components. ML components often lead to false expectations that make it difficult to set reasonable targets. Additionally, it is difficult to quantify the quality targets or the qualitative non-functional requirements. Identifying the feedback loops is also not a trivial task. Moreover, as both the traditional and ML parts are involved and the team consists of people with heterogeneous languages and priorities, these people need to interact with each other to come to a common position.
Example: The Google Photos app has ML modules for photo tagging. However, the app has a lot of other features, and the ML module needs to interact and be consistent with those. For example, should the app show the photo tags to the users, or will the users be able to change a tag if they think it is incorrect? Thus, the different stakeholders of the Google Photos app need to be involved in the requirements collection and realization process.

2. Project Planning, Process and Interdisciplinary Collaboration
Description: This is the collaboration point that deals with the overall project planning. It defines the process model to be followed. While one project might start with collecting requirements first, another can start with letting some data scientists play with data for a year. So the process needs to be determined at this point. Additionally, general planning like time estimation, risk mitigation plans, etc. also needs to be conducted and synchronized among the stakeholders at this collaboration point. As this collaboration point is where the overall system planning and process is discussed, the whole team involved in the project should be involved here, at least implicitly. That is why the communication challenges between the interdisciplinary team members are also combined with this point.
Agree on: Project plan and process models to be followed. Communication agendas between different teams/roles.
Output: Informal project planning to more formal planning documents, possible process documentation, adoption of process practices, forming of interdisciplinary teams, adoption of teamwork practices.
Involved: Primarily - Project Manager. Needs consultation with - Software Engineer, Data Scientist, Domain Experts, Tester, Operator, etc.
Lifecycle: Generally, in a waterfall-like process, the tasks can be actively handled in both the System Requirements and Planning/High-level Design (outside ML pipeline) stages. However, in practice, it generally spreads out and continues as ongoing tasks.
Why challenging: In general, time estimation is hard for ML applications due to their exploratory nature. This also creates a lot of collaboration challenges in planning due to the differences between SE and ML components.
Example: The Google Photos app has ML modules and other traditional functionalities. The ML modules have structures and time requirements different from the other traditional parts. Thus, before building the app, the project planning needs to incorporate planning for the ML parts. For example, the photo tagging component might need a different timeline than the traditional components. A risk analysis might be required for the component, and, similar to the iterative model, the risk-associated component can be built first.

3. System Decomposition, Local Checking and System Evaluation
Description: This collaboration point deals with decomposing the overall system into ML and non-ML SE components. This defines the module boundaries and negotiates the requirements or interface contracts for each of the components. While each component will be evaluated internally against its interfaces, later composition and integration-/system-level quality assurance is also planned, ensuring that the composed system meets the system specifications. During decomposition and local-system evaluation many qualities need to be considered, each possibly resulting in corresponding obligations at the module interfaces. Generally, we consider four kinds of components:
• Non-ML component: These are the traditional SE components.
• Pipeline component: This component represents the process of producing the model, roughly how data is transferred to the model and the deployment of it.
• Inference component: This is the component that is concerned with the model prediction, and thus relates to using the model.
• Monitoring component: This represents the component for monitoring the system after the deployment of the system.
Agree on: Module boundaries and quality requirements for the components after the decomposition. Also, evaluation mechanisms for the components and the overall system after the composition.

3.1. Functional Correctness / Target Domain / Fit (Accuracy)
Description: Functional "correctness" is used as a broad term here that measures whether the system produces outputs that match a given problem. The system has behavioral expectations that need to be broken down into individual components, tested locally, and then composed and evaluated again at the system level.
Agree on: Accuracy expectations for ML components, functional behavior expectations of all individual components, test strategy for components, test strategies for integration.
Component-level Evaluation: Model Accuracy. Both Offline/Online testing can be needed. Telemetry collection needs collaboration. Functional correctness of non-ML components.
Involved: Primarily - Data Scientist. For online testing/telemetry collection, needs consultation with - Domain Expert, Software Engineer, Operator, Tester.
System-level Evaluation: Integration, system and acceptance testing.
Involved: Primarily - Tester. Needs consultation with - Software Engineer, Data Scientist, Operator, Manager.

3.2. Fairness, Privacy and Accountability
Description: Inclusion of ML in a system often creates expectations of special system-level quality requirements that include fairness, privacy, explainability, provenance tracking, etc. Again, to achieve the quality at the system level, the role of individual components needs to be negotiated, defined, and assured; the integration of components needs to be evaluated.
Agree on: Protected data points, level of fairness expected, level of explainability/provenance tracking expected.
Component-level Evaluation: Model Fairness Testing (increasing data quantity might be needed), Constraints on Protected Values in Data, Use of Explainable Algorithms, Versioning of Code, Model and Data.
Involved: Primarily - Data Scientist. Needs consultation with - Domain Expert, Legal Department, Manager, Tester.
System-level Evaluation: System-level Fairness Testing, Testing whether provenance tracking can be performed, Testing whether a system output can be explained or whether the results are interpretable.
Involved: Primarily - Tester. Needs consultation with - Software Engineer, Data Scientist, Domain Expert, Legal Department, Operator, Manager.

3.3. Data Quality
Description: Data is one of the most important parts of an ML application. This includes tracking data needs both from the aspect of quality and quantity, defining an appropriate data schema, monitoring data drift and taking necessary steps to incorporate change or evolution, documentation of data, etc.
Agree on: Quality and quantity of data to be collected, data schema definition, data understanding and reporting, data monitoring and managing drift.
Component-level Evaluation: Improvement/degradation of accuracy can be one indicator of data needs. Privacy is a concern here as data can contain protected attributes. Data cleaning and storing data in a schema are steps to check on data integrity at the model level.
Involved: Primarily - Data Scientist. Needs consultation with - Legal Department [for privacy], Domain Expert [for schema definition and drift], Manager [for more data need].
System-level Evaluation: Degradation of model accuracy in the integrated system can be one indicator of data drift.
Involved: Primarily - Data Scientist. Needs consultation with - Software Engineer, Tester, Operator [for collecting telemetry], Domain Expert [for drift], Manager [for more data need].

3.4. Reusability, Reproducibility, Maintainability, Infrastructure Quality
Description: ML components have different coding languages and different conventions. However, for a complete software system, the parts might follow similar coding conventions so that the other team members can understand the code easily and integrate the parts together without much effort. Also, the documentation of the implementation needs to be maintained to a certain extent so that the system can be easily updated later on. This will help to increase the system's maintainability. This also leads to easy reusability and reproducibility of the system.
Agree on: Coding Convention and Level of Documentation, Expectations of Reusability, Reproducibility and Maintainability.
Component-level Evaluation: Consistent Coding and Documentation of the Component. (Involved: Primarily - Software Engineer and Data Scientist. Needs consultation with - Manager)
System-level Evaluation: Consistent Coding and Documentation throughout the System.
Involved: Primarily - Software Engineer and Data Scientist. Needs consultation with - Manager.

3.5. Updatability, Operations
Description: As ML systems need to be monitored for data drift and tested online with telemetry collection, deployment is not the last stage of development. Operations is an important part of such projects, leading to the popularity of MLOps. Along with data monitoring, change updates are also required for both ML and traditional parts of the software. Continuous experimentation is another aspect that requires updating the application continuously. This relates to system versioning and provenance tracking as well.
Agree on: Deployment and Monitoring Mechanisms, Frequency of Updates, Continuous Experimentation Requirements.
Component-level Evaluation: Stable Model Deployment and Monitoring Mechanisms for Data Drift. Experimentation with Different Versions of the Model. (Involved: Primarily - Software Engineer and Data Scientist. Needs consultation with - Operator, Manager and Tester)
System-level Evaluation: Stable System Deployment and Monitoring Mechanisms. Monitoring for change requests and continuous bug fixes generates continuous patch or version updates.
Involved: Primarily - Operator. Needs consultation with - Software Engineer, Tester and Manager.

3.6. Usability
Description: Like other traditional systems, ML systems also need user interaction in different forms. Whatever the form is, user satisfaction/usability is an important aspect of any system. In particular, ML systems need to collect telemetry, which demands extra attention to the UI design.
Agree on: User Experience Expectations.
Involved: Primarily - Software Engineer and UX Expert. Needs consultation with - Data Scientist, Operator, Manager, and Tester.

SUPPLEMENT D: LIST OF PAPERS

Initial Set of Papers (15)
• Chattopadhyay, S., Prasad, I., Henley, A. Z., Sarma, A., & Barik, T. (2020). What's Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1–12). Association for Computing Machinery.
• O'Leary, K., & Uchida, M. (2020). Common problems with creating machine learning pipelines from existing code.
• Li, P. L., Ko, A. J., & Begel, A. (2017). Cross-Disciplinary Perspectives on Collaborations with Software Engineers. In 2017 IEEE/ACM 10th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE).
• Kim, M., Zimmermann, T., DeLine, R., & Begel, A. (2018). Data Scientists in Software Teams: State of the Art and Challenges. In IEEE Transactions on Software Engineering (Vol. 44, Issue 11, pp. 1024–1038).
• Arpteg, A., Brinne, B., Crnkovic-Friis, L., & Bosch, J. (2018). Software Engineering Challenges of Deep Learning. 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 50–59.
• Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. 2017 IEEE International Conference on Big Data (Big Data), 1123–1132.
• Kery, M. B., Radensky, M., Arya, M., John, B. E., & Myers, B. A. (2018). The story in the notebook. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18, Montreal, QC, Canada.
• Zhang, A. X., Muller, M., & Wang, D. (2020). How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1–23.
• Pimentel, J. F., Murta, L., Braganholo, V., & Freire, J. (2019, May). A large-scale study about quality and reproducibility of Jupyter notebooks. 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada.
• Head, A., Hohman, F., Barik, T., Drucker, S. M., & DeLine, R. (2019). Managing messes in computational notebooks. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19, Glasgow, Scotland, UK.
• Studer, S., Bui, T. B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., & Mueller, K.-R. (2020). Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. In arXiv [cs.LG]. arXiv.
• Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2503–2511). Curran Associates, Inc.
• Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software Engineering for Machine Learning: A Case Study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
• Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2017). Data Management Challenges in Production Machine Learning. Proceedings of the 2017 ACM International Conference on Management of Data, 1723–1726.
• Vogelsang, A., & Borg, M. (2019). Requirements Engineering for Machine Learning: Perspectives from Data Scientists. 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), 245–251.

Complete Set of Papers (61)
• Chattopadhyay, S., Prasad, I., Henley, A. Z., Sarma, A., & Barik, T. (2020). What's Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1–12). Association for Computing Machinery.
• O'Leary, K., & Uchida, M. (2020). Common problems with creating machine learning pipelines from existing code.
• Li, P. L., Ko, A. J., & Begel, A. (2017). Cross-Disciplinary Perspectives on Collaborations with Software Engineers. In 2017 IEEE/ACM 10th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE).
• Kim, M., Zimmermann, T., DeLine, R., & Begel, A. (2018). Data Scientists in Software Teams: State of the Art and Challenges. In IEEE Transactions on Software Engineering (Vol. 44, Issue 11, pp. 1024–1038).
• Arpteg, A., Brinne, B., Crnkovic-Friis, L., & Bosch, J. (2018). Software Engineering Challenges of Deep Learning. 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 50–59.
• Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. 2017 IEEE International Conference on Big Data (Big Data), 1123–1132.
• Kery, M. B., Radensky, M., Arya, M., John, B. E., & Myers, B. A. (2018). The story in the notebook. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18, Montreal, QC, Canada.
• Zhang, A. X., Muller, M., & Wang, D. (2020). How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1–23.
• Pimentel, J. F., Murta, L., Braganholo, V., & Freire, J. (2019, May). A large-scale study about quality and reproducibility of Jupyter notebooks. 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada.
• Head, A., Hohman, F., Barik, T., Drucker, S. M., & DeLine, R. (2019). Managing messes in computational notebooks. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19, Glasgow, Scotland, UK.
• Studer, S., Bui, T. B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., & Mueller, K.-R. (2020). Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. In arXiv [cs.LG]. arXiv.
• Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2503–2511). Curran Associates, Inc.
• Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software Engineering for Machine Learning: A Case Study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
• Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2017). Data Management Challenges in Production Machine Learning. Proceedings of the 2017 ACM International Conference on Management of Data, 1723–1726.
• Vogelsang, A., & Borg, M. (2019). Requirements Engineering for Machine Learning: Perspectives from Data Scientists. 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), 245–251.
• Holstein, K., Wortman Vaughan, J., Daumé, H., Dudik, M., & Wallach, H. (2019). Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–16.
• Rahimi, M., Guo, J. L. C., Kokaly, S., & Chechik, M. (2019). Toward Requirements Specification for Machine-Learned Components. 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), 241–244.
• Nushi, B., Kamar, E., Horvitz, E., & Kossmann, D. (2017). On human intellect and machine failures: troubleshooting integrative machine learning systems. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 1017–1025.
• Borg, M., Englund, C., Wnuk, K., Duran, B., Levandowski, C., Gao, S., Tan, Y., Kaijser, H., Lönn, H., & Törnqvist, J. (2019). Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry. In Journal of Automotive Software Engineering (Vol. 1, Issue 1, p. 1).
• Kandel, S., Paepcke, A., Hellerstein, J. M., & Heer, J. (2012). Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12), 2917–2926.
• Madaio, M. A., Stark, L., Wortman Vaughan, J., & Wallach, H. (2020). Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–14.
• Salay, R., Queiroz, R., & Czarnecki, K. (2017). An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software. In arXiv [cs.AI]. arXiv.
• Ashmore, R., Calinescu, R., & Paterson, C. (2019). Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. In arXiv [cs.LG]. arXiv.
• Sendak, M. P., Ratliff, W., Sarro, D., Alderton, E., Futoma, J., Gao, M., Nichols, M., Revoir, M., Yashar, F., Miller, C., Kester, K., Sandhu, S., Corey, K., Brajer, N., Tan, C., Lin, A., Brown, T., Engelbosch, S., Anstrom, K., . . . O'Brien, C. (2020). Real-World Integration of a Sepsis Deep Learning Technology Into Routine Clinical Care: Implementation Study. JMIR Medical Informatics, 8(7), e15182.
• Sculley, D., Otey, M. E., Pohl, M., Spitznagel, B., Hainsworth, J., & Zhou, Y. (2011). Detecting adversarial advertisements in the wild. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 274–282.
• Bernardi, L., Mavridis, T., & Estevez, P. (2019). 150 successful machine learning models. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD '19, Anchorage, AK, USA.
• Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H., & Crnkovic, I. (2019). A taxonomy of software engineering challenges for machine learning systems: An empirical investigation. International Conference on Agile Software Development, 227–243.
• Yang, Q., Suh, J., Chen, N.-C., & Ramos, G. (2018). Grounding Interactive Machine Learning Tool Design in How Non-Experts Actually Build Models. Proceedings of the 2018 Designing Interactive Systems Conference, 573–584.
• Martinez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernandez Orallo, J., Kull, M., Lachiche, N., Ramirez Quintana, M. J., & Flach, P. A. (2020). CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE Transactions on Knowledge and Data Engineering, 1–1.
• Ishikawa, F., & Yoshioka, N. (2019, May). How do engineers perceive difficulties in engineering of machine-learning systems? Questionnaire survey. In 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP) (pp. 2–9). IEEE.
• Ozkaya, I. (2020). What Is Really Different in Engineering AI-Enabled Systems? IEEE Software, 37(4), 3–6.
• Wiens, J., Saria, S., Sendak, M., Ghassemi, M., Liu, V. X., Doshi-Velez, F., Jung, K., Heller, K., Kale, D., Saeed, M., Ossorio, P. N., Thadaney-Israni, S., & Goldenberg, A. (2019). Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine, 25(9), 1337–1340.
• Wagstaff, K. (2012). Machine Learning that Matters. In arXiv [cs.LG]. arXiv.
• Hynes, N., Sculley, D., & Terry, M. (2017). The data linter: Lightweight, automated sanity checking for ML data sets. In NIPS MLSys Workshop.
• Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
• Hukkelberg, I., & Rolland, K. (2020). Exploring Machine Learning in a Large Governmental Organization: An Information Infrastructure Perspective.
• Hill, C., Bellamy, R., Erickson, T., & Burnett, M. (2016, September). Trials and tribulations of developers of intelligent systems: A field study. In 2016 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) (pp. 162–170).
• Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., Koo, C. Y., Lew, L., Mewald, C., Modi, A. N., Polyzotis, N., Ramesh, S., Roy, S., Whang, S. E., Wicke, M., . . . Zinkevich, M. (n.d.). TFX: A TensorFlow-Based Production-Scale Machine Learning Platform.
• Ré, C., Niu, F., Gudipati, P., & Srisuwananukorn, C. (2019). Overton: A data system for monitoring and improving machine-learned products. arXiv preprint arXiv:1909.05372.
• Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J. M. F., & Eckersley, P. (2020). Explainable machine learning in deployment. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 648–657.
• Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., & Law, J. (2018, February). Applied machine learning at Facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 620–629). IEEE.
• Amershi, S., Chickering, M., Drucker, S. M., Lee, B., Simard, P., & Suh, J. (2015). ModelTracker: Redesigning Performance Analysis Tools for Machine Learning. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 337–346.
• Peng, Z., Yang, J., Chen, T.-H. (Peter), & Ma, L. (2020, November 8). A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), Virtual Event, USA.
• Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data Lifecycle Challenges in Production Machine Learning: A Survey. SIGMOD Rec., 47(2), 17–28.
• Humbatova, N., Jahangirova, G., Bavota, G., Riccio, V., Stocco, A., & Tonella, P. (2020, June 27). Taxonomy of real faults in deep learning systems. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE '20), Seoul, South Korea.
• Polyzotis, N., Zinkevich, M., Roy, S., Breck, E., & Whang, S. (2019). Data validation for machine learning. Proceedings of Machine Learning and Systems, 1, 334–347.
• Wan, Z., Xia, X., Lo, D., & Murphy, G. C. (2019). How does Machine Learning Change Software Development Practices? IEEE Transactions on Software Engineering, 1–1.
• Lwakatare, L. E., Raj, A., Crnkovic, I., Bosch, J., & Olsson, H. H. (2020). Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions. Information and Software Technology, 127(106368), 106368.
• Siebert, J., Joeckel, L., Heidrich, J., Nakamichi, K., Ohashi, K., Namba, I., Yamamoto, R., & Aoyama, M. (2020). Towards Guidelines for Assessing Qualities of Machine Learning Systems. In Communications in Computer and Information Science (pp. 17–31).
• Shneiderman, B. (2020). Bridging the gap between ethics and practice. ACM Transactions on Interactive Intelligent Systems, 10(4), 1–31.
• Seymoens, T., Ongenae, F., & Jacobs, A. (2018). A methodology to involve domain experts and machine learning techniques in the design of human-centered algorithms. Working Conference on . . . .
• Zinkevich, M. (2017). Rules of machine learning: Best practices for ML engineering. URL: https://developers.google.com/machine-learning/guides/rules-of-ml.
• Park, S., Wang, A., Kawas, B., Vera Liao, Q., Piorkowski, D., & Danilevsky, M. (2021). Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models.