Data Science Project

This document provides a comprehensive guide on setting up and using Google Colab for data science, including how to upload data, explore dataframes with Pandas, and perform data cleaning and analysis. It covers key methods for data exploration, such as checking for missing values, accessing specific columns, and calculating statistics like maximum and minimum salaries. Additionally, it introduces grouping and pivoting data to analyze average salaries by category, along with challenges and solutions for practical application.

Uploaded by

Junaid


Getting Set Up for Data Science

Introducing the Google Colab Notebook

VS Code is a fantastic IDE, but when we're exploring and visualising a dataset, you'll find
the Python notebook format better suited.

Open your first Google Colab Notebook through your Google Drive. You can find the
Python Notebook under New → More → Google Colaboratory.

If you cannot access the Google Colab Notebooks or would like to run everything locally
on your computer, then I recommend installing Anaconda and using the bundled Jupyter
Notebook instead. Either way works. Google Colab is essentially just an online version
of Jupyter.
How to use a Python Notebook

The notebook is divided into cells. Each cell can be executed individually and the result
is automatically printed out below. To execute a cell use the shortcut Shift + Enter.

Note: The Google Colab Notebook will need to connect to a Runtime in order to execute any
code.

That's pretty much it. Let's get started!

Upload the Data and Read the .csv File

Download the salaries_by_college_major.csv file from the course resources and add this
file to the notebook by dropping it into the sidebar with the little folder icon.

Then import pandas into your notebook and read the .csv file.

import pandas as pd
df = pd.read_csv('salaries_by_college_major.csv')

You can save yourself some typing by bringing up the autocompletion with the
keyboard shortcuts Ctrl + Space (Windows) or ⌘ + Space (Mac).
Now take a look at the Pandas dataframe we've just created with .head(). This will show
us the first 5 rows of our dataframe.

df.head()

Once you hit shift + enter on your keyboard the cell will be evaluated and you should see
the output automatically printed below the cell. This feature of automatically printing
the output below in a pretty format is what makes the notebook format so lovely to work
with.
Preliminary Data Exploration and Data Cleaning with Pandas

Now that we've got our data loaded into our dataframe, we need to take a closer look at it to
help us understand what it is we are working with. This is always the first step with any data
science project. Let's see if we can answer the following questions:

● How many rows does our dataframe have?

● How many columns does it have?
● What are the labels for the columns? Do the columns have names?
● Are there any missing values in our dataframe? Does our dataframe contain any bad
data?
We've already used the .head() method to peek at the top 5 rows of our dataframe. To see
the number of rows and columns we can use the shape attribute:

df.shape
Do you see 51 rows and 6 columns printed out below the cell?

We saw that each column had a name. We can access the column names directly with the
columns attribute.
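
As a minimal sketch of the columns attribute, the snippet below builds a tiny stand-in dataframe and lists its column names. The two rows of values are invented for illustration; only the column names follow the ones used in this lesson.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's dataframe: the values are invented,
# only the column names mirror the ones used in this lesson.
df = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B'],
    'Starting Median Salary': [40000.0, 50000.0],
    'Mid-Career Median Salary': [70000.0, 90000.0],
    'Mid-Career 10th Percentile Salary': [40000.0, 50000.0],
    'Mid-Career 90th Percentile Salary': [120000.0, 150000.0],
    'Group': ['HASS', 'STEM'],
})

# .columns is an attribute (no parentheses) that holds the column labels
print(df.columns)

# Convert to a plain Python list if that is easier to read
column_names = list(df.columns)
print(column_names)
```

In the notebook itself you would simply run df.columns on the dataframe loaded from the .csv file.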
Missing Values and Junk Data
Before we can proceed with our analysis we should try and figure out if there is any missing or
junk data in our dataframe. That way we can avoid problems later on. In this case, we're going
to look for NaN (Not a Number) values in our dataframe. NaN values mark missing data, for
example cells that were left blank in the .csv file. Use the .isna() method and see if you can
spot if there's a problem somewhere.

df.isna()
Did you find anything? Check the last couple of rows in the dataframe:

df.tail()
Aha! We have a row that contains some information regarding the source of the data with blank
values for all the other columns.
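
Scanning the full output of .isna() by eye gets tedious on larger dataframes. One possible shortcut, sketched here on a made-up three-row dataframe (the values and the "Source" text are hypothetical), is to count the NaN values per column and to filter for the rows that contain any:

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataframe; the last row imitates a "source" note row
df = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B', 'Source: example'],
    'Starting Median Salary': [40000.0, 50000.0, np.nan],
    'Group': ['HASS', 'STEM', np.nan],
})

# Count the NaN values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep only the rows that contain at least one NaN
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)
```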

Delete the Last Row

We don't want this row in our dataframe. There are two ways you can go about removing this row.
The first way is to manually remove the row at index 50. The second way is to simply use the
.dropna() method from pandas. Let's create a new dataframe without the last row and
examine the last 5 rows to make sure we removed the last row:

clean_df = df.dropna()
clean_df.tail()
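
The first, manual approach is not shown above. A minimal sketch of it, assuming the unwanted row is the last one, could use the .drop() method. The small dataframe here is a hypothetical stand-in, so its last index is 2 rather than 50:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; in the lesson's dataframe the junk row sits at index 50
df = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B', 'Source: example'],
    'Starting Median Salary': [40000.0, 50000.0, np.nan],
})

# Index label of the last row (50 in the lesson's dataframe, 2 here)
last_index = df.index[-1]

# .drop() returns a new dataframe and leaves df untouched
clean_df = df.drop(index=last_index)
print(clean_df.tail())
```

Dropping by index removes exactly one known row, whereas .dropna() removes every row that contains a missing value, so the two only agree when the junk row is the only row with NaN values.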
Accessing Columns and Individual Cells in a Dataframe
Find College Major with Highest Starting Salaries

To access a particular column from a data frame we can use the square bracket
notation, like so:

clean_df['Starting Median Salary']

You should see all the values printed out below the cell for just this column:

To find the highest starting salary we can simply chain the .max() method.

clean_df['Starting Median Salary'].max()

The highest starting salary is $74,300. But which college major earns this much on
average? For this, we need to know the row number or index so that we can look up the
name of the major. Lucky for us, the .idxmax() method will give us the index for the row
with the largest value.

clean_df['Starting Median Salary'].idxmax()

which is 43. To see the name of the major that corresponds to that particular row, we
can use the .loc (location) property.

clean_df['Undergraduate Major'].loc[43]
Here we are selecting both a column ('Undergraduate Major') and a row at index 43, so
we are retrieving the value of a particular cell. You might see people chaining a second
pair of square brackets to achieve exactly the same thing:

clean_df['Undergraduate Major'][43]

If you don't specify a particular column you can use the .loc property to retrieve an entire
row:

clean_df.loc[43]
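
The .loc property also accepts a row label and a column name together, which retrieves a single cell in one step. The sketch below uses a hypothetical three-row dataframe with invented values, so the index it finds is 1 rather than 43:

```python
import pandas as pd

# Hypothetical stand-in with invented salaries
clean_df = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B', 'Major C'],
    'Starting Median Salary': [40000.0, 60000.0, 50000.0],
})

# Index label of the row with the largest starting salary
top_index = clean_df['Starting Median Salary'].idxmax()

# .loc[row_label, column_name] picks out one cell
top_major = clean_df.loc[top_index, 'Undergraduate Major']
print(top_index, top_major)
```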
Challenge

Now that we've found the major with the highest starting salary, can you write the code
to find the following:

● What college major has the highest mid-career salary? How much do graduates
with this major earn? (Mid-career is defined as having 10+ years of experience).
● Which college major has the lowest starting salary and how much do graduates
earn after university?
● Which college major has the lowest mid-career salary and how much can people
expect to earn with this degree?
Solution: Highest and Lowest Earning Degrees
I hope you gave the last challenge a good go before checking the solution below.

The Highest Mid-Career Salary

print(clean_df['Mid-Career Median Salary'].max())

print(f"Index for the max mid career salary: {clean_df['Mid-Career Median Salary'].idxmax()}")
clean_df['Undergraduate Major'][8]
If you have multiple lines in the same cell, only the last one will get printed as an output
automatically. If you'd like to see more than one thing printed out, then you still have to use a
print statement on the lines above.

The Lowest Starting and Mid-Career Salary

print(clean_df['Starting Median Salary'].min())

clean_df['Undergraduate Major'].loc[clean_df['Starting Median Salary'].idxmin()]
Here I've nested the code that we've seen in the previous lesson in the same line. We can also
use the .loc property to access an entire row. Below I've accessed the row at the index of the
smallest mid-career salary:

clean_df.loc[clean_df['Mid-Career Median Salary'].idxmin()]

Sadly, education is actually the degree with the lowest mid-career salary and Spanish is the
major with the lowest starting salary.
Sorting Values & Adding Columns: Majors with the Most Potential vs
Lowest Risk

Lowest Risk Majors

A low-risk major is a degree where there is a small difference between the lowest and
highest salaries. In other words, if the difference between the 10th percentile and the
90th percentile earnings of your major is small, then you can be more certain about
your salary after you graduate.

How would we calculate the difference between the earnings of the 10th and 90th
percentile? Well, Pandas allows us to do simple arithmetic with entire columns, so all
we need to do is take the difference between the two columns:

clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']

Alternatively, you can also use the .subtract() method.

clean_df['Mid-Career 90th Percentile Salary'].subtract(clean_df['Mid-Career 10th Percentile Salary'])

The output of this computation will be another Pandas dataframe column (a Series). We can
add this to our existing dataframe with the .insert() method:

spread_col = clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']
clean_df.insert(1, 'Spread', spread_col)
clean_df.head()

The first argument is the position of where the column should be inserted. In our case,
it's at position 1, so the second column.
Sorting by the Lowest Spread

To see which degrees have the smallest spread, we can use the .sort_values()
method. And since we are interested in only seeing the name of the major and the
spread, we can pass a list of these two column names to look at the .head() of these two
columns exclusively.

low_risk = clean_df.sort_values('Spread')
low_risk[['Undergraduate Major', 'Spread']].head()

Does .sort_values() sort in ascending or descending order? To find out, check out
the Pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

👇
You can also bring up the quick documentation with shift + tab on your keyboard
directly in the Python notebook.
Challenge

Using the .sort_values() method, can you find the degrees with the highest potential?
Find the top 5 degrees with the highest values in the 90th percentile.

Also, find the degrees with the greatest spread in salaries. Which majors have the
largest difference between high and low earners after graduation?

I've got the solution for you in the next lesson.

Solution: Degrees with the Highest Potential
Here's the solution to the challenge from the previous lesson:

Majors with the Highest Potential

highest_potential = clean_df.sort_values('Mid-Career 90th Percentile Salary', ascending=False)
highest_potential[['Undergraduate Major', 'Mid-Career 90th Percentile Salary']].head()
Majors with the Greatest Spread in Salaries

highest_spread = clean_df.sort_values('Spread', ascending=False)
highest_spread[['Undergraduate Major', 'Spread']].head()
Notice how 3 of the top 5 are present in both. This means that there are some very high earning
Economics degree holders out there, but also some who are not earning as much. It's actually
quite interesting to compare these two rankings versus the degrees where the median salary is
very high.
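
To make that comparison, the same sorting pattern can be pointed at the median salary column. This is only a sketch on a hypothetical three-row dataframe with invented values; in the notebook you would run the last two statements on the real clean_df:

```python
import pandas as pd

# Hypothetical stand-in with invented salaries
clean_df = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B', 'Major C'],
    'Mid-Career Median Salary': [70000.0, 95000.0, 80000.0],
})

# Sort from the highest to the lowest mid-career median salary
highest_median = clean_df.sort_values('Mid-Career Median Salary', ascending=False)
print(highest_median[['Undergraduate Major', 'Mid-Career Median Salary']].head())

# Names of the top two majors in this toy example
top_majors = highest_median['Undergraduate Major'].head(2).tolist()
print(top_majors)
```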
Grouping and Pivoting Data with Pandas

Oftentimes you will want to aggregate (for example, sum or average) rows that belong
to a particular category. For example, which category of degrees has the highest average
salary? Is it STEM, Business or HASS (Humanities, Arts, and Social Science)?

To answer this question we need to learn to use the .groupby() method. This allows
us to manipulate data similar to a Microsoft Excel Pivot Table.

We have three categories in the 'Group' column: STEM, HASS and Business. Let's count
how many majors we have in each category:

clean_df.groupby('Group').count()
Mini Challenge
Now can you use the .mean() method to find the average salary by group?

Here's the solution:
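
A minimal sketch of one way to do this, shown on a hypothetical four-row dataframe with invented values. The numeric_only=True argument is an addition here: it tells Pandas to skip text columns such as 'Undergraduate Major', which recent Pandas versions refuse to average.

```python
import pandas as pd

# Hypothetical stand-in with invented salaries
clean_df = pd.DataFrame({
    'Undergraduate Major': ['Major A', 'Major B', 'Major C', 'Major D'],
    'Starting Median Salary': [40000.0, 60000.0, 50000.0, 44000.0],
    'Group': ['HASS', 'STEM', 'STEM', 'Business'],
})

# Average every numeric column within each group;
# numeric_only=True skips the text column ('Undergraduate Major')
mean_by_group = clean_df.groupby('Group').mean(numeric_only=True)
print(mean_by_group)
```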

Number formats in the Output

The above is a little hard to read, isn't it? We can tell Pandas to print the numbers in our
notebook to look like 1,012.45 with the following line:

pd.options.display.float_format = '{:,.2f}'.format
Ah, that's better, isn't it?
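
To see what the format string does on its own, here is a small sketch. The name formatter and the 'Average Salary' column with its values are made up for illustration:

```python
import pandas as pd

# '{:,.2f}' means: thousands separator, two digits after the decimal point
formatter = '{:,.2f}'.format
print(formatter(1012.45))  # prints 1,012.45

# Pandas calls this function on every float it displays
pd.options.display.float_format = formatter
print(pd.DataFrame({'Average Salary': [44633.333333, 1012.45]}))
```

The setting only changes how numbers are displayed; the values stored in the dataframe are not rounded.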
Extra Credit:

The PayScale dataset used in this lesson was from 2008 and looked at the prior 10
years. Notice how Finance ranked very high on post-degree earnings at the time.
However, we all know there was a massive financial crash in that year. Perhaps things
have changed. Can you use what you've learnt about web scraping in the prior lessons
(e.g., Day 45) and share some updated information from PayScale's website in the
comments below?

Learning Points & Summary

Today's Learning Points

● Use .head(), .tail(), .shape and .columns to explore your DataFrame and
find out the number of rows and columns as well as the column names.

● Look for NaN (not a number) values with .isna() and consider using
.dropna() to clean up your DataFrame.

● You can access entire columns of a DataFrame using the square bracket
notation: df['column name'] or df[['column name 1', 'column name
2', 'column name 3']]
● You can access individual cells in a DataFrame by chaining square brackets
df['column name'][index] or using df['column name'].loc[index]

● The largest and smallest values, as well as their positions, can be found with
methods like .max(), .min(), .idxmax() and .idxmin()

● You can sort the DataFrame with .sort_values() and add new columns with
.insert()
● To create an Excel Style Pivot Table by grouping entries that belong to a
particular category use the .groupby() method

I've attached the completed notebook to this lesson as a .zip file. If you have any issues,
unzip the file, upload it to Google Drive and open it as a Google Colab Notebook.
