
Data Science Internship at Paychex

Uploaded by

sam17monster
Copyright
© © All Rights Reserved

Internship Report at Paychex

Aradhya Mathur
*Note: Academic Report for DSC-494 Internship

Goergen Institute for Data Science

University of Rochester, NY

Abstract— During my Data Science internship at Paychex, Inc. from June 2023 to August 2023, I undertook a range of impactful projects that underscored my analytical prowess and problem-solving abilities. One significant accomplishment was the creation of an advanced framework to compute Return on Investment (ROI) for established models. This framework enabled a comprehensive assessment of the effectiveness of various models, enhancing decision-making processes.

In collaboration with the Distribution Cost Optimization initiative, I played a pivotal role in finding a way to curb shipping expenses. Through meticulous analysis, I identified optimal shipping methods that minimized revenue impact, thus contributing significantly to cost savings. Furthermore, my involvement in the Pricing project included a comprehensive exploratory data analysis (EDA) that focused on evaluating upgrades and downgrades of existing bundles, leveraging client price sensitivity data.

Overall, my internship at Paychex equipped me with invaluable experience in designing intricate frameworks, optimizing costs, and making data-driven decisions, further solidifying my expertise in data science and analytics.

Keywords— Return on investment, Exploratory data analysis, Cost optimization, Smart Pricing and Price Sensitivity

I. INTRODUCTION
During the period spanning from June 2023 to August 2023, I had the privilege of embarking on a transformative Data Science internship journey at Paychex, Inc. This three-month immersion allowed me to actively engage in a diverse array of projects, each serving as a testament to my analytical acumen and adeptness in resolving complex challenges. As a conscientious participant, I embarked on ventures that not only amplified my comprehension of data-driven strategies but also unveiled a new realm of problem-solving capabilities.

One of the foremost accomplishments of my tenure was the conceptualization and development of an advanced framework tailored to compute Return on Investment (ROI) for established models. By pioneering this innovative framework, I facilitated a comprehensive evaluation of the effectiveness of a spectrum of models, thereby enriching the very fabric of decision-making processes. This achievement resonates deeply with the quintessential principles of data science: to derive actionable insights from raw data and channel them into meaningful strategies.

The collaborative ethos at Paychex was exemplified through my integral role in the Distribution Cost Optimization initiative. This venture, a linchpin of financial prudence, targeted a reduction in shipping costs of approximately $1 million. My contribution extended beyond mere cost reduction, as I meticulously dissected the intricacies of shipping data to identify optimal shipping methods. The goal was to ensure minimal disruption to revenue generation, thus ushering in a new era of fiscal efficiency.

Within the realm of pricing dynamics, I delved into the Pricing project, embarking on a comprehensive Exploratory Data Analysis (EDA) journey. My focus on evaluating upgrades and downgrades of existing bundles was informed by an unwavering commitment to precision. Leveraging invaluable client price sensitivity data, I was part of the team that meticulously constructed a data-driven model that would serve as a beacon of illumination for future pricing decisions. By bridging the gap between empirical insights and strategic pricing, this model stands as a testament to the commitment to precision in the realm of data analysis.

The culmination of my internship at Paychex has not only fortified my expertise but also equipped me with an arsenal of skills poised to shape the future landscape of data science and analytics. The intricacies of designing sophisticated frameworks have become second nature, and the art of optimizing costs through data-driven methodologies is now etched indelibly in my skillset. These invaluable experiences have paved the way for my trajectory in the realms of data science and analytics, instilling in me a sense of purpose and invigorating my pursuit of innovative solutions.

In the ensuing sections of this report, I will elaborate comprehensively on the projects undertaken, delving into the methodologies employed, the insights derived, and the implications they bear. The following segments will encapsulate the depth and breadth of my journey during the internship at Paychex, offering a granular exposition of the strides taken and the knowledge acquired.

II. INTERNSHIP OBJECTIVES
The objectives of the internship encompass a diverse spectrum, aimed at achieving the following:

1) Application of acquired data mining methodologies from academic coursework into practical real-world scenarios. This entails evaluating performance and gaining a profound appreciation for the seamless transition from theoretical concepts to their practical implementation.
2) Utilization of acquired feature-engineering techniques to enhance machine learning algorithms, thereby examining the resultant improvements in their efficacy through the integration of these augmented features.
3) Utilization of acquired pricing analytics knowledge to effectively address complex pricing strategies and their impact on business outcomes.
4) Application of proficiency in time-series forecasting methodologies.

These multifaceted goals collectively contribute to an immersive and enlightening internship experience, fostering an enhanced understanding of the intricacies involved in the practical application of data science principles.

III. THE BUSINESS PROBLEM
A. Need for calculation of Return on Investment
Calculating Return on Investment (ROI) for existing models in a company like Paychex is vital to quantify their performance, optimize resource allocation, and determine their contribution to overall profitability. This analysis guides strategic decisions and ensures effective utilization of resources.

B. Need for Distribution Cost Optimization project
Cost efficiency stands as a paramount concern, notably in shipping expenses. Paychex, a collaborator with FedEx, UPS, US Mail, and numerous local couriers, orchestrates millions of shipments across 8 distribution centers, spanning states and even countries. Analyzing optimal courier services per distribution center holds the potential to yield substantial cost savings.

C. Need for Pricing project
In the context of Paychex's operations, the need for a Pricing project becomes evident as it addresses pivotal aspects of client retention and bundle management. By delving into the impact of premium processing fees on client retention, the project strives to uncover potential links between pricing strategies and customer satisfaction. Furthermore, the project's focus on attributes influencing bundle upgrade/downgrade decisions offers valuable insights into optimizing bundle offerings. With pre-processed datasets and insightful analyses, the Pricing project will equip Paychex's Data Science team with a comprehensive understanding of pricing dynamics, guiding informed decisions and contributing to the ongoing enhancement of client services and revenue strategies.

IV. LANDSCAPING THE PROBLEM
The initial phase of problem scoping encompasses the exploration of digital resources and a comprehensive analysis of research articles and industrial documents addressing issues confronted by global practitioners. The objective is to identify challenges exhibiting either resemblance or, optimally, congruence with the immediate task. This process can be likened to a literature review, which serves as a pivotal and indispensable stride within the realm of research endeavors.

A. Return on Investment project
The problem landscape concerning the calculation of Return on Investment (ROI) for existing models within a company encompasses several dimensions. It entails a thorough understanding of the company's specific models, data sources, and business objectives.
Additionally, it involves exploring various ROI calculation methods, considering factors like initial investment, operational costs, and revenue generated. Furthermore, the landscape involves addressing potential challenges, such as data accuracy, model complexity, and aligning ROI metrics with organizational goals. A comprehensive landscape assessment guides the selection of appropriate ROI calculation strategies and ensures the effective utilization of models to drive business success.

a. Abbreviations and Acronyms
In the context of this paper, we will use the term "Return on Investment" abbreviated as "ROI".

b. Method
Fig. 1 ROI Steps
In order to calculate the Return on Investment (ROI) for existing models, it is essential to assess multiple factors as depicted in Fig. 1. However, analyzing these factors proves to be challenging due to the presence of several variables. These variables include the effectiveness of model implementation, regularity of model usage, and market growth.

Important steps are:
• Identify Key Performance Indicators (KPIs)
• Additional Revenue Calculation (ARC)
  ARC = Total Revenue with model implementation - Total Revenue without model implementation
• Identify cost savings.
• Identify implementation cost.
• Calculate ROI
  ROI = ARC + Saving Cost - Implementation Cost

One approach for determining ROI involves the construction of a time series model. The underlying concept of this model is as follows:
• Identify the year of model implementation.
• Collect data preceding this year and employ it as training data.
• Proceed to develop a time series model using this dataset as input, facilitating predictions for future periods.
• Consequently, we possess the actual revenue data for all years, juxtaposed with forecasted revenue data generated by the time series model.
• The disparity between these values theoretically represents the return on investment.

However, this model harbors limitations, notably its inability to incorporate aspects such as market growth and factors like implementation costs (comprising elements such as development time and testing expenses), which are intricate to precisely quantify.

ARIMA (Autoregressive Integrated Moving Average) models are employed to forecast time series data, like revenue, by capturing its underlying patterns. The model is expressed as ARIMA(p,d,q), where p represents autoregressive terms, d signifies differencing to achieve stationarity, and q denotes moving average terms. In this context, the year of model implementation serves as a reference point. Historical data before this year is utilized for training. ARIMA is then applied to predict future revenue, that is, the revenue expected had the model not been implemented. The discrepancy between this predicted revenue and the actual revenue quantifies ROI.

c. Equations
$\hat{y}_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q}$
$\mu$ is a constant term
$p$ is the number of autoregressive terms
$q$ is the number of moving average terms
$\phi_1, \phi_2, \dots$ are the autoregressive coefficients.
$\theta_1, \theta_2, \dots$ are the moving average coefficients.
$y_t$ is the value of the time series at time $t$.
$e_{t-1}, e_{t-2}, \dots$ are the lagged white noise (error) terms.

B. Distribution Cost Optimization Project
The problem landscape centers on optimizing cost within Paychex's intricate shipping framework. Engaging with major couriers, including FedEx, UPS, and US Mail, alongside local couriers, the company manages extensive shipments across 8 distribution centers spanning states and nations. The challenge lies in determining the most effective courier service and distribution-center pairing, with the goal of realizing notable cost savings. This multifaceted analysis demands precise evaluation to strike a balance between efficient revenue generation and streamlined fiscal practices, ensuring minimal disruption to revenue while achieving enhanced cost efficiency.

a. Abbreviations and Acronyms
In the context of this paper, we will use the term "United Parcel Service" abbreviated as "UPS".

b. Method
Shipping costs are influenced by diverse factors, including distance between distribution centers and clients, courier services, surcharges, and concealed fees. Air transportation predominates for shipments, so the straight-line ("drone") distance is a suitable measure for optimizing center-client distances. The cost optimization procedure entails:

1. Verifying strategic distribution center locations aligned with client density.
2. Computing drone distances from distribution centers to clients.
3. Calculating drone distances across all distribution centers and clients.
4. Identifying the minimum distances among distribution centers.
5. Comparing drone distances with minimum distances.
6. If the minimum distance is shorter, the cost reduction potential is evident.
7. Savings are approximated by multiplying miles saved by cost per mile, providing an estimate of reduced shipment expenses.

c. Equation
$D_c$ represents the set of distribution centers.
$C_i$ represents the set of clients.
$d_{ij}$ denotes the drone distance between current distribution center $i$ and client $j$.
$m_{ij}$ denotes the minimum distance between any distribution center and client $j$, whose current distribution center is $i$.
$S$ denotes the estimation of cost savings per client.
$R$ denotes the cost per mile.
• Compute the distance between the current distribution center and clients based on latitudes and longitudes:
  $d_{ij} = \mathrm{ComputeDroneDistance}(i, j)$
• Similarly, find the drone distance between all the centers and a particular client:
  $d_{kj} = \mathrm{ComputeDroneDistance}(D_c[k], C_i[j])$ for every center $k \in D_c$
• Find the minimum of the above distances:
  $m_{ij} = \min_{k \in D_c} d_{kj}$
• Compare that minimum distance with the actual distance.
• The difference is the cost saved and can be calculated by:
  $S = (d_{ij} - m_{ij}) \cdot R$
• Total cost saved for all clients will be:
  $\text{Cost Saved} = \sum_{j \in C_i} S$

Drone distance can be found using:

    import math

    def compute_drone_distance(homeLatitude, homeLongitude,
                               destinationLatitude, destinationLongitude):
        # Mean Earth radius in meters (named R in the original listing; renamed
        # here so it does not clash with R, the cost per mile, defined above)
        earth_radius = 6371e3
        # Convert latitudes and coordinate differences from degrees to radians
        rlat1 = math.radians(homeLatitude)
        rlat2 = math.radians(destinationLatitude)
        dlat = math.radians(destinationLatitude - homeLatitude)
        dlon = math.radians(destinationLongitude - homeLongitude)
        # Haversine formula to find distance
        a = (math.sin(dlat / 2) ** 2
             + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        # Result is in meters; divide by 1609.344 to obtain miles
        return earth_radius * c

C. Pricing Project
Landscaping the problem within Paychex, the Pricing project delves into crucial dimensions. It investigates the influence of premium processing fees on client retention, unearthing correlations for strategic action. Concurrently, the project scrutinizes client
attributes shaping bundle upgrade/downgrade choices, fostering bundle optimization. Leveraging predictive models and in-depth analyses, this endeavour illuminates pricing intricacies, fostering informed decision-making. By doing so, it bolsters service enhancement and revenue strategies, encapsulating Paychex's commitment to data-driven excellence.

a. Abbreviations and Acronyms
In the context of this paper, we will use the term "Premium Processing" abbreviated as "PP".

b. Method
This project embarks on a comprehensive exploration of Paychex's intricate client retention landscape and bundle optimization. A dual-pronged approach is undertaken. Firstly, an exhaustive descriptive analysis dissects bundle-switching events, unravelling the intricacies that propel clients to upgrade or downgrade bundles. Leveraging pre-processed bundle data with historical transitions, this phase unveils patterns and drivers behind clients' strategic decisions. Simultaneously, an in-depth investigation delves into the nexus between premium processing fees and client retention. By harnessing a meticulously curated dataset encompassing premium processing costs, client attrition, and forecasted retention scores, this segment scrutinizes whether premium processing bears a correlation with retention. Statistical scrutiny and analytical rigor illuminate factors shaping client contentment and the potential influence of premium processing on retention. The insights derived contribute not only to refining client relations strategies but also inform broader business decisions within Paychex's dynamic landscape.

c. Equation
$\text{Profit} = \sum (\text{Cost}_{\text{upgraded\_bundle}} - \text{Cost}_{\text{previous\_bundle}}) - \sum \text{Revenue}_{\text{lost\_client\_bundle}}$

To maximize profit, strategic pricing manipulation is essential: it should prompt bundle upgrades while curbing client losses. This dual focus optimizes revenue and ensures business stability.

V. DATA PRE-PROCESSING AND FEATURE ENGINEERING
A. Return on Investment project
This project places revenue as the central feature, with the sole pre-processing task involving the accumulation of historical revenue data. For established models, annual revenue aggregates are stored, encompassing multiple years. Utilizing data from the years preceding model implementation, these aggregates form training and input data for the time series model, facilitating robust analysis and forecasting.

B. Distribution Cost Optimization project
Using shipment data, the total client count was computed for each state. Subsequently, a geographical map was generated, visually representing the distribution of clients across states based on their density.

Fig. 2 Client Density over different states
In the above Fig. 2, higher density is showcased by dark blue colour and lighter shades show the comparatively lower density of clients.

Utilizing the same dataset, null values were first removed, facilitating division into four distinct components: FedEx, UPS, US Mail, and local couriers. Drone distance calculation, a pivotal task, was executed through the geodesic library, yielding distances in miles. A critical analysis involved determining the minimum distribution center distance for each client, juxtaposed against the real distance. Subsequently, an "outliers" dataframe was established, housing clients with an actual distance exceeding the potential minimum by 500 miles. Additionally, distribution center-specific graphs were plotted, illustrating shipment frequencies and density, thereby offering insightful visualizations.
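
The nearest-center comparison and the 500-mile outlier rule used in the Distribution Cost Optimization project can be illustrated with a minimal sketch. It is not the code used at Paychex: the report states that a geodesic library was used, whereas this sketch substitutes a plain haversine formula so that it runs with the standard library only. The center and client coordinates, and the names `drone_distance_miles` and `find_outliers`, are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, in miles


def drone_distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two lat/lon points."""
    rlat1, rlat2 = math.radians(lat1), math.radians(lat2)
    dlat = rlat2 - rlat1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# Hypothetical coordinates, for illustration only (not Paychex data).
CENTERS = {
    "Rochester": (43.16, -77.61),
    "Miami": (25.76, -80.19),
}
CLIENTS = [
    # (client id, current center, latitude, longitude)
    ("client_a", "Rochester", 42.89, -78.88),  # close to its current center
    ("client_b", "Rochester", 28.54, -81.38),  # much closer to Miami
]


def find_outliers(clients, centers, threshold_miles=500.0):
    """Return clients whose current center exceeds the nearest one by more than the threshold."""
    outliers = []
    for client_id, current, lat, lon in clients:
        # Distance from every center to this client
        distances = {
            name: drone_distance_miles(c_lat, c_lon, lat, lon)
            for name, (c_lat, c_lon) in centers.items()
        }
        nearest = min(distances, key=distances.get)
        excess = distances[current] - distances[nearest]
        if excess > threshold_miles:
            outliers.append((client_id, current, nearest, round(excess)))
    return outliers


outliers = find_outliers(CLIENTS, CENTERS)
print(outliers)
```

Each returned tuple holds the client, its current center, the nearest center, and the excess miles, which is the quantity that would be multiplied by a cost per mile to estimate savings.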
Fig. 3 Density-based aerial routes: distribution center and clients.
In the above Fig. 3, orange colour shows the highest density of clients, followed by yellow, green, light blue, and purple.

C. Pricing Project
For the bundle upgrade and downgrade section, I performed an analysis to find:
• Counts of upgraded bundles for each month
• Counts of existing bundles for each month

Fig. 4 Bundle upgrade trend over a period of time.

I then divided the dataset into multiple sub-datasets, one for each bundle that gets either upgraded or downgraded. For each bundle, I calculated:
• Changed bundle counts over each month.
• Percentage of upgrades vs. downgrades

Another important factor was the difference between bundles lost and new bundles upgraded. I also found the percentage of upgrades to bundle a from bundle b and of downgrades to bundle b from bundle a. It is a good sign if the percentage of upgrades is higher than the percentage of downgrades.

For the premium processing fees, I found the net amount paid by each client over the tenure, segregated clients into currently active clients and lost clients, and calculated the number of lost clients for each month. I used features like client lost reason to find a correlation between lost clients and premium processing fees.

VI. FORMULATION OF MODELS
In the context of the Return-on-Investment project, a straightforward yet effective approach was taken. A time series ARIMA model was meticulously designed and implemented, taking into consideration the inherent seasonality within the data. This model played a crucial role in quantifying the potential return on investment, thereby facilitating informed decision-making.

Contrastingly, the distribution cost optimization initiative followed a distinct trajectory. Here, the emphasis was directed toward enhancing efficiency through insightful feature engineering and deriving valuable insights. While no machine learning model was employed, the project's focus on unravelling hidden patterns and optimizing distribution costs laid the foundation for more informed operational strategies.

Looking ahead, the pricing project unfolds as an ongoing endeavour, marked by its dynamic and multifaceted nature. The project's trajectory entails the construction of two to three distinct models, each specifically tailored to address selected bundle-switching events. This modelling framework is poised to provide a comprehensive understanding of how pricing dynamics influence client behaviours, ultimately equipping the team with actionable insights to refine pricing strategies and bolster customer engagement.

VII. RESULTS AND FUTURE WORK
All the projects are ongoing, and there are no evaluation metrics used for calculating errors. However, we have cemented a structure for calculating the return on investment. The ARIMA model for a particular retention model showed an ROI of about 56 million USD.

In the distribution cost optimization project, I found the optimal distribution center for each client. Calculating the value saved by this method is still incomplete as the data involving fees charged by shipping companies is not complete. Out of 8 distribution centers, one is situated in Florida; the total distance that could be saved by optimizing only the FedEx shipments from Florida was around 250K miles.
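
Once a cost per mile is available, the miles saved per distribution center and courier can be turned into a cost estimate. The sketch below shows one possible way to aggregate this; the shipment records, the function names, and the value of `ASSUMED_COST_PER_MILE` are all hypothetical placeholders, since the fee data was incomplete at the time of writing.

```python
from collections import defaultdict

# Hypothetical per-client records: (center, courier, actual miles, minimum miles).
# The values are illustrative, not Paychex data.
SHIPMENTS = [
    ("Florida", "FedEx", 1200.0, 300.0),
    ("Florida", "FedEx", 450.0, 450.0),
    ("Florida", "UPS", 900.0, 200.0),
    ("New York", "FedEx", 700.0, 650.0),
]


def miles_saved_by_group(shipments):
    """Sum (actual - minimum) miles for each (center, courier) pair."""
    totals = defaultdict(float)
    for center, courier, actual, minimum in shipments:
        # A client already served by its nearest center contributes zero
        totals[(center, courier)] += max(actual - minimum, 0.0)
    return dict(totals)


def estimated_savings(miles_saved, cost_per_mile):
    """Apply savings = miles saved * cost per mile to an aggregate."""
    return miles_saved * cost_per_mile


totals = miles_saved_by_group(SHIPMENTS)
florida_fedex_miles = totals[("Florida", "FedEx")]

# Placeholder rate: an assumption for illustration, not a real courier fee.
ASSUMED_COST_PER_MILE = 0.5
print(florida_fedex_miles,
      estimated_savings(florida_fedex_miles, ASSUMED_COST_PER_MILE))
```

Grouping by center and courier mirrors how the Florida and FedEx figure was reported, and keeps the rate as a separate input that can be replaced when real fee data is available.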

The Pricing project is also ongoing. In it, I identified the trends in upgrades and downgrades of different bundles over the years. The impact of premium processing fees on client losses was studied. Models were built, and the most important factor for client loss was high premium processing fees, with a correlation of 0.34. Causality was also studied, and the analysis confirmed that premium processing fees have a causal effect on client loss. Providing discounts to clients with a high number of employees is one way to encourage retention.
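
As an illustration of how a correlation between premium processing fees and client loss might be computed, the sketch below calculates a Pearson correlation between a fee amount and a 0/1 "lost" indicator. The report does not state which correlation measure was used, so Pearson is an assumption here, and the data values are invented for the example.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical data: net premium processing fee per client and whether
# the client was lost (1) or is still active (0).
fees = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
lost = [0, 0, 1, 0, 1, 1]

r = pearson_correlation(fees, lost)
print(round(r, 2))
```

A positive value means that higher fees tend to go together with lost clients in the sample; by itself it does not establish the causal effect, which the project studied separately.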
I also contributed to writing a value story template that will be used for past and future projects.
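
The ROI structure described in this report (forecast the revenue expected without the model, compare it with actual revenue, then add cost savings and subtract implementation cost) can be sketched as follows. This is only an illustration under stated assumptions: a simple least-squares linear trend stands in for the ARIMA model so that the example needs no external library, and all revenue and cost figures, as well as the function names, are hypothetical.

```python
def fit_linear_trend(years, revenue):
    """Ordinary least squares fit of revenue = intercept + slope * year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(revenue) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, revenue))
             / sum((x - mean_x) ** 2 for x in years))
    return mean_y - slope * mean_x, slope


def counterfactual_roi(history, actual_after, cost_savings, implementation_cost):
    """ROI = ARC + Saving Cost - Implementation Cost.

    ARC is the sum over post-implementation years of actual revenue minus the
    revenue forecast from pre-implementation data.
    """
    intercept, slope = fit_linear_trend(list(history), list(history.values()))
    forecast = {year: intercept + slope * year for year in actual_after}
    arc = sum(actual_after[year] - forecast[year] for year in actual_after)
    return arc + cost_savings - implementation_cost


# Hypothetical annual revenue (arbitrary units), model implemented in 2021.
history = {2017: 100.0, 2018: 110.0, 2019: 120.0, 2020: 130.0}
actual_after = {2021: 150.0, 2022: 165.0}

roi = counterfactual_roi(history, actual_after,
                         cost_savings=5.0, implementation_cost=10.0)
print(roi)
```

Swapping `fit_linear_trend` for an ARIMA forecast would leave `counterfactual_roi` unchanged in structure, because only the source of the forecast values differs.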

ACKNOWLEDGMENT

I extend my sincere gratitude for the invaluable guidance and profound mentorship extended by Dr. Erika McBride, Mr. Michael Lyons, and Mr. Matt Agone. Their benevolent assistance has provided me with the esteemed opportunity to contribute to various projects and be a part of the Data Science team at Paychex. This experience has been enriched by the team's exceptional intellectual acumen and cohesive collaborative ethos.

