0% found this document useful (0 votes)
130 views68 pages

Data Analysis

This document outlines the fourth week of a Data Analysis and Visualization course, focusing on key statistical concepts and data preparation techniques using Excel. It covers hypothesis testing, regression analysis, and the importance of data cleaning and transformation before analysis. The module includes practical labs on sorting, filtering, and manipulating data in Excel to facilitate effective data analysis.

Uploaded by

Ikenna
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
130 views68 pages

Data Analysis

This document outlines the fourth week of a Data Analysis and Visualization course, focusing on key statistical concepts and data preparation techniques using Excel. It covers hypothesis testing, regression analysis, and the importance of data cleaning and transformation before analysis. The module includes practical labs on sorting, filtering, and manipulating data in Excel to facilitate effective data analysis.

Uploaded by

Ikenna
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 68

Data Analysis & Visualisation_Week 4

Data Analysis and


Visualisation

Data analysis and visualization are two


integral components of the data science
and analytics process, aimed at deriving
insights and communicating findings from
data in a meaningful and
understandable way.
Learning Objectives for the week

★ Explain key statistical analysis concepts which include: hypothesis testing,


significance levels and role of probability in inferential statistics.
★ Apply hypothesis testing techniques to real world scenarios, formulate null
and alternative hypotheses, conduct statistical tests and interpret results
to form meaningful conclusions.
★ Utilize analysis of variance to compare means across multiple groups and
assess the significance of difference.
★ Perform regression analysis in a bit to uncover the relationship between
variables, identify predictors and make inferences in regards to future
outcomes.
★ Evaluate the reliability and validity of statistical analysis.

4.0.1 Introduction
The fourth step in the Data Analytics Process is preparing data for analysis.
Datasets often contain blank rows or columns, data in the wrong formats, or
corrupt field entries. All of these things can make the results of any analysis
incorrect, so these datasets need to be cleaned and transformed prior to
analysis. In this module, you will dive deep into analyzing data with Excel.
You will use Excel to import, clean and prepare the data for the
visualizations.

4.0.2 What Will I Learn in This Module?


In this module, you will learn how to perform important data transformation
tasks in Excel. You will start sorting and filtering data with Excel by
converting data, applying conditional formatting, and combining datasets.
You will use formulas and functions to perform calculations and to transform
data. Finally, you will use pivot tables to summarize, analyze, explore, and
present data.
4.1 Sorting and Filtering Data with Excel
4.1.1 Organizing Data with Sorting and
Filtering in Excel: A video is played
Lab - Using Excel to Sort and Filter Data
Objectives
In this lab, you will learn the basics of sorting and filtering data in Microsoft Excel.

Part 1: Download and Prepare a Sample Data Set

Part 2: Sort the Data

Part 3: Filter the Data

Background / Scenario
Sorting and filtering the data facilitates data analysis through grouping and including or
excluding specific data. This ensures consistency, accuracy, and better decision-making
processes.

Instructions
Part 1: Download and Prepare a Sample Data Set

Step 1: Adjust column widths.

a. Open the downloaded sample CSV file Bike Sales_Sort_Lab 4.1.2.csv in Excel.
b. To make all the data visible and more readable, enlarge the column widths to fit:
1) Select the entire worksheet by clicking the top left corner. (The triangle left of column
A.)
2) Move the cursor to the line between two columns, for example, columns A and B,
click and drag the mouse to the right until all data is visible.
c. All columns are now wider and the same width. Adjust each column width so the column
is only wide enough to contain the data.
d. Note that text data is left justified in each cell, whereas numeric data is right justified.

Step 2: Freeze top row. Because there are many rows of data, column headings will disappear
when you scroll down the worksheet. You can freeze the top row so that this information is
always visible.

a. Click View > Freeze Panes > Select Freeze Top Row.
b. Now if you scroll down, the heading for each column is still displayed.

Note: The data has been changed for readability only, and it has not been sorted or manipulated.
Because this is a text (CSV) file, any changes to presentation of the data (such as column width or
freezing the top row) will not be retained unless the file is saved in Excel spreadsheet format.

Part 2: Sort the Data

In this part of the lab, you will sort the data by country and sub-category.

Step 1: Select the Sort tool.

a. Press Ctrl+A to select all the data.


b. Click Home > click Sort & Filter > select Custom Sort.

Step 2: Select the Sort criteria.

a. Verify that the My data has headers checkbox is selected.


b. Select Country as the Column for the Sort by dropdown list. Select Sort Ascending as
the order.
c. To sort by Sub-category, click + Add Level and select Sub_Category as the column.
For Order, select Sort Ascending from the drop down.
d. Click OK to continue. Note that the data is now sorted in ascending order by Country and
then Sub-category.

To clear sorting and filtering, select the column being sorted (Sub_Category, in this case). Then
click on the Sort & Filter button in the toolbar and choose Clear.

If you need to remove all filters in a worksheet, you can select the Sort & Filter tools and then
click on the Filter option in the drop-down menu.

Part 3: Filter the Data

In this part, you will convert the data into a table and filter the data.

Step 1: Covert data to a table.

a. Before filtering the data information, it is best to convert it to a table. Use Ctrl+A to
select all the data.
b. To convert the data, click Home > select Format as Table.
c. Select a table style of your choosing. Ensure that the checkbox My table has headers is
selected. Click OK to accept the range of cells to be in the table.
d. Under Table Name, replace Table1 with BikeSales as the table name.

Step 2: Set and display filtered data.


a. Note the drop-down arrows at the top of each column of the table. By clicking the drop down on
each column, you can filter the data to view only rows that match those specific values.
For example, click the drop-down arrow on the Age_Group column. Unselect the (Select
all) checkbox and select Young Adults (25-34).

Note the row numbers on the left of the table. The data has not been moved; filtering is
displaying only those rows that you chose.

Note the filtered column, Age_Group, shows a Filter icon instead of the dropdown arrow to
indicate the data is being filtered by that column.

b. To show all data in a column, click the Age Group drop down button again and check
the (Select all) checkbox.

4.0.1 Introduction

The fourth step in the Data Analytics Process is preparing data for analysis. Datasets
often contain blank rows or columns, data in the wrong formats, or corrupt field entries.
All of these things can make the results of any analysis incorrect, so these datasets
need to be cleaned and transformed prior to analysis. In this module, you will dive deep
into analyzing data with Excel. You will use Excel to import, clean and prepare the data
for the visualizations.

4.0.2 What Will I Learn in This Module?

4.1 Sorting and Filtering Data with Excel

Question: You have selected a range of cells in Microsoft Excel and would
like to find duplicate values in the spreadsheet. Which option on the Home >
Conditional Formatting menu should you use to accomplish this?
Answer: Conditional Formatting options:
 Highlight Cell Rules can be used to find and highlight duplicate data
 Data Bars are horizontal bars added to each cell, much like a bar
graph.
 Top/Bottom Rules can be used to highlight cells that are above
average.
 Color Scales change the color of each cell based on its value. Each
color scale uses a two- or three-color gradient. Color Scales can be
used to create a heat map. A heat map is a visual representation of
data where the color represents the value in a cell. For example, you
can create a heat map where a cell with the highest value is colored
green and there is a shift towards red color as the value decreases.
 Icon Sets are built-in icons that can be used to add icons to cells in a
range to indicate how large the cell values are compared to the other
values in the range

4.1.2 Lab - Using Excel to Sort and Filter Data


In this lab, you will learn the basics of sorting and filtering data in Microsoft
Excel.
Part 1: Download and Prepare a Sample Data Set
Part 2: Sort the Data
Part 3: Filter the Data
Sorting and filtering the data facilitates data analysis through grouping
and including or excluding specific data. This ensures consistency,
accuracy, and better decision-making processes.
Objectives Sorting and filtering the data facilitates data analysis through
grouping and including or excluding specific data. This ensures consistency,
accuracy, and better decision-making processes.
Objectives
In this lab, you will learn the basics of sorting and filtering data in Microsoft
Excel.
Part 1: Download and Prepare a Sample Data Set
Part 2: Sort the Data
Part 3: Filter the Data
Background / Scenario
Sorting and filtering the data facilitates data analysis through grouping and
including or excluding specific data. This ensures consistency, accuracy, and
better decision-making processes.
Instructions
Part 1: Download and Prepare a Sample Data Set
Step 1: Adjust column widths.
a. Open the downloaded sample CSV file Bike Sales_Sort_Lab
4.1.2.csv in Excel.
b. To make all the data visible and more readable, enlarge the column
widths to fit:
1) Select the entire worksheet by clicking the top left corner. (The
triangle left of column A.)
2) Move the cursor to the line between two columns, for example,
columns A and B, click and drag the mouse to the right until all data is
visible.
c. All columns are now wider and the same width. Adjust each column
width so the column is only wide enough to contain the data.
d. Note that text data is left justified in each cell, whereas numeric data is
right justified.
Step 2: Freeze top row.
Because there are many rows of data, column headings will disappear when
you scroll down the worksheet. You can freeze the top row so that this
information is always visible.
a. Click View > Freeze Panes > Select Freeze Top Row.
b. Now if you scroll down, the heading for each column is still displayed.

Note: The data has been changed for readability only, and it has not
been sorted or manipulated. Because this is a text (CSV) file, any
changes to presentation of the data (such as column width or freezing
the top row) will not be retained unless the file is saved in Excel
spreadsheet format.
Part 2: Sort the Data
In this part of the lab, you will sort the data by country and sub-category.
Step 1: Select the Sort tool.
a. Press Ctrl+A to select all the data.
b. Click Home > click Sort & Filter > select Custom Sort.
Step 2: Select the Sort criteria.
a. Verify that the My data has headers checkbox is selected.
b. Select Country as the Column for the Sort by dropdown list.
Select Sort Ascending as the order.
c. To sort by Sub-category, click + Add Level and
select Sub_Category as the column. For Order, select Sort
Ascending from the drop down.
d. Click OK to continue. Note that the data is now sorted in ascending
order by Country and then Sub-category.
To clear sorting and filtering, select the column being sorted
(Sub_Category, in this case). Then click on the Sort & Filter button in the
toolbar and choose Clear.
If you need to remove all filters in a worksheet, you can select the Sort &
Filter tools and then click on the Filter option in the drop-down menu.
Part 3: Filter the Data
In this part, you will convert the data into a table and filter the data.
Step 1: Covert data to a table.
a. Before filtering the data information, it is best to convert it to a table.
Use Ctrl+A to select all the data.
b. To convert the data, click Home > select Format as Table.
c. Select a table style of your choosing. Ensure that the checkbox My
table has headers is selected. Click OK to accept the range of cells
to be in the table.
d. Under Table Name, replace Table1 with BikeSales as the table name.
Step 2: Set and display filtered data.
a. Note the drop-down arrows at the top of each column of the table. By
clicking the drop down on each column, you can filter the data to view
only rows that match those specific values.
For example, click the drop-down arrow on the Age_Group column.
Unselect the (Select all) checkbox and select Young Adults (25-34).

Note the row numbers on the left of the table. The data has not been
moved; filtering is displaying only those rows that you chose.
Note the filtered column, Age_Group, shows a Filter icon instead of
the dropdown arrow to indicate the data is being filtered by that
column.
b. To show all data in a column, click the Age Group drop down button
again and check the (Select all) checkbox.
Reflection Questions
Use data sorting and filtering on relevant columns to determine the
following:
1. How many Senior customers were there in December 2021?

4.1.3 Practice Item


Question 1
A data analyst wishes to remove all the filters applied in a Microsoft
Excel spreadsheet. What is the quickest way to accomplish this?
Ans: By clicking on the filter button in the sort and filter group.

Question: Refer to the exhibit. You have selected a range of cells in


Microsoft Excel and would like to find duplicate values in the
spreadsheet. Which option on the Home > Conditional
Formatting menu should you use to accomplish this?

Ans: Highlight cell Rules.

4.2.7 Lab - Manipulate Data


In this lab, you will perform some basic data formatting and
adjustment in Microsoft Excel.
 Part 1: Combine Data in an Excel Spreadsheet
 Part 2: Conditional Data Formatting

Background / Scenario
To enable effective analysis, data points sometimes need to be split (separated), or
combined, or formatted into an appropriate type. The presentation of data can also be
conditionally formatted to highlight certain values. Manipulating the data in these ways
facilitates more meaningful and accurate analysis.

Instructions
Part 1: Combine Data in an Excel Spreadsheet
Step 1: Examine the data.
a. Open the downloaded sample CSV file Bike
Sales_Manipulate_Lab 4.2.7.csv in Excel.
b. Review the columns of data and the values they contain.
Sometimes data in a worksheet must be manipulated to
perform the desired analysis or to produce a specific result or
output.
For a specific analysis, the size information in Column M needs
to be combined with the product description in Column L so
that the model, color, and size are all in one column.

Step 2: Combine the data.


a. Insert a new column to the right of the current column M. This
new column becomes column N.
b. Click in cell N2 and enter the formula: =L2&”, “&M2
Note: type the formula directly into the cell; do not copy and paste
from the lab.
Here’s what the formula is telling Excel to do: = at the beginning of
the string means this cell contains a formula L2 and M2 are the cells
to be combined & is the combining operator
", " inserts a literal comma and a space between each value (in
other cases, any other characters can be inserted here)

The result should be “mountain-200 silver, 42”


c. To apply this formula to each cell in Column N, select N2 and
move the mouse to bottom right corner of cell N2, where the
pointer will change to a plus sign (+), and drag down the
column to copy the formula into each cell. You can also copy
the cell and paste it into other cells, which will paste the
formula with its correct cell references for that row.
Note that what the cells in column N actually contain are formulas.
What is needed are text values not formulas. Now we will convert
these formulas to text values.
d. Select cells N2 through N89 (or all of Column N) and
click Copy in the toolbar.
e. Select cells L2 through L89 and click the down arrow under
the Paste button in the tool bar. In the dropdown menu
select Paste Values.
Column L should now have the model, color, and size information in
each cell. If you click on a cell in column L it should contain text and
not a formula.
f. Since Column N with the formulas and Column M with just the
sizes are no longer needed, they can be deleted.

Part 2: Conditional Data Formatting


In a data file, the values in the cells can be conditionally formatted
depending on the cell values.
Step 1: Conditional formatting numeric values
In this step, values in the Revenue column will be colored based on
the following criteria:
 Greater than $10,000: Green
 Greater than $5,000: Red
 Less than $5,000: Yellow
a. Select all the values in the Revenue column, except the
heading. Click Home > Conditional Formatting > Highlight Cells
Rules > Between.
b. In the Between dialog box, enter 5000 and 10000. Accept the
default Light Red Fill with Dark Red Text. Click Done.
c. Repeat for values greater than
10,000.Click Home > Conditional Formatting > Highlight Cells
Rules > Greater Than. Then enter 10000 in the in
the Rules field and select Green Fill with Dark Green Text.
Click Done.
d. For values less than 5000, click Home > Conditional
Formatting > Highlight Cells Rules > Less Than. Enter 5000 in
the in the field Format cells that are LESS THAN: and select
the Yellow Fill with Dark Yellow Text.
Values in the selected column will now be highlighted conditionally
according to each value.
e. Sorting all the data by Revenue, highest to smallest, will group
the highlighted values. Highlight the whole sheet and
click Data > Custom Sort. In the Custom Sort dialog box check
the My data has headers box. In the Sort by Column drop box
select Revenue.
Step 2: Conditional formatting average values
This step will conditionally format values in the Profit column that
are above the average profit. It is best demonstrated if the previous
Revenue formatting is undone.
a. To clear the conditional formatting, click Home > Conditional
Formatting > Clear rules > Clear Rules from Entire Sheet.
b. Select the Profit column, click Conditional
Formatting > Top/Bottom Rules > Above Average.
c. Accept the default (Light Red Fill with Dark Red Text) in the
Above Average dialog box. Click OK to continue.
Profit values above the average are now highlighted in red.
Step 3: Conditional formatting text values
This step will conditionally format values in the Country column that
match specified text; in this case, a country name. It is best
demonstrated if any previous Revenue and Profit data sorting is
undone.
a. Clear all the conditional formatting again.
b. Select the Country column. Click Conditional
Formatting > Highlight Cell Rules > Text That Contains.
c. In the empty text dialog box, enter Australia in the field
Format cells that contain the text: and select the Green Fill
with Dark Green Text. Click OK.
d. All occurrences of Australia in the Country column are now
highlighted in green with dark green text.
Note: if this did not highlight any cells, it may be because you have
a space after ‘Australia’ from copy-and-pasting. Make sure there are
no empty spaces and try again.

Reflection Question:

Data formatting applied to cells is dynamic. For example, if a data


set is formatted to highlight those values above the average value,
and a value in one cell changes to be less than the average, then
the formatting (and the average value) will also change accordingly.

How could this feature be useful in a presentation?

Answer Area : There are multiple answers to this question, but one
example is “to show an audience how outcomes will change for
different scenarios.”

4.2.8 Practice Item


Question 1
What feature of Microsoft Excel allows for highlighting cells that
contain values which meet a certain condition?

Answer: Conditional Formatting.

4.3 DATA CALCULATIONS

4.3.1 Formulas and Functions. This is a video.

4.3.2 Formulas and Functions


Excel supports a wide range of formulas and function capabilities—
the only difference between these is that functions use keywords
and are predefined by Excel, while users can build custom formulas
themselves using operators. There are strict rules to follow when
creating formulas, so it is a good idea to plan out how you want the
calculations to work before you start entering them into your
spreadsheet.
First, the basics: all Excel formulas start with an equal sign (=). The
equals sign indicates that the data in the cell is not text or a
number, but a calculation. This ensures that your formula will not be
displayed in the cell instead of the result of the calculation. For
example, =1+2 entered into a cell (with the equals sign as the first
character) will display the value 3, not the text =1+2.

Relative and Absolute References


Calculations can be performed on either a constant, like the number
5, or the data contained at a specific location (usually a cell or range
of cells). For example, the formula =L1/L7 divides the value in
cell L1 by the value in cell L7.
Within a formula or function, when there are cell or range
references, they can be either relative or absolute. A reference is
relative when its value depends on the location of the reference
itself, while an absolute reference refers to the same cell or range
no matter where the reference appears. For example, when you did
the Manipulate Data lab, you copied and pasted formulas from one
row into the rows below it. When you did that, because the formulas
had relative references, they used the values from the lower rows to
make the calculations for those rows. This is the default behavior
for formulas and functions.
If you want a formula to refer to the same cell or range of cells no
matter where the formula is pasted, you can make the reference
absolute by adding a dollar sign ($) before the column or row
indicator for that reference. Let’s use the example referenced
above, =L1/L7. If you wanted to use this formula in multiple places
in your spreadsheet and always have it use the values in L1 and L7,
rather than the values that correspond to the formula’s new
location, you would write it as =$L$1/$L$7.
Both the column indication and the row indication can have this
attribute, independently of each other, so you can also add the
dollar sign to only the column reference or only the row reference.

Functions
Built-in functions are formulas that start with a keyword that
identifies a specific function to be performed. Most function
keywords describe the calculation that the function will perform.
Correct syntax varies by function, but usually has a similar pattern.
For example, the syntax for the AVERAGE function accepts a range
or list of cells to calculate the average of the values within the
range, such as =AVERAGE(L2:L37) or =AVERAGE(L2, L5, L6).

Note: You must make sure to close all parenthesis and brackets, or
you will receive an error message when you press enter to complete
the entry. For long functions this can be difficult.
When you start a cell entry with an equal sign, as you type the
function name, a menu of functions matching your entry will usually
appear and show the short description of each function listed. If this
doesn’t happen automatically, you can also right-click on a cell and
enter the equals sign in the text field that says “Search the menus”
to show this menu. Selecting the function that you want places the
function name and an open parenthesis on the formula bar and in
the cell.

Common Errors That Results From Incorrect


Formula Entries Are Shown In The Table.

Error What caused the error


Indicator
# NULL The cell range indicated in the
formular is entered incorrectly.
# DIV/0! A formula entered in a cell causes
a divide by 0 error. Example = L1/L2
where L2 is an empty cell or has a
value of zero.
#N/A The formula refers to an empty
cell, so no data is available to
compute.
#REF The formula refers to a cell that is
not accessible, such as a named
sheet that doesn’t exist.
#NUM The formula entered in the cell is
invalid.
#NAME A non-existent, range name is
used in a formula.
#VALUE A function is used incorrectly, or
the arguments are invalid.

4.3.3 Lab - Formulas and Functions


In this lab you will learn the basics of using formulas and functions
in Microsoft Excel.
 Part 1: Using Excel Text Functions: CONCAT, LEN, LEFT, RIGHT,
and MID
 Part 2: Using Excel Statistics Functions: COUNT, COUNTBLANK,
COUNTA, COUNTIF, AVERAGEIF, MINIFS, and MAXIFS
 Part 3: Experiment with Other Excel Functions

Background / Scenario
Excel has eleven different categories of built-in functions. Many of the functions used by
data analysts belong to the Text and Statistical function categories. The Text function
category includes CONCAT, LEN, LEFT, RIGHT, and MID. The Statistical function
category includes COUNT, COUNTBLANK, COUNTA, COUNTIF, AVERAGEIF,
MINIFS, and MAXIFS. This lab will work through the purpose and usage of each of
these functions.

Instructions

Part 1: Using Excel Text Functions


Excel has several text (also known as string) functions that help data analysts work with
textual data. These functions allow analysts to change text, change case, find a string,
count the length of a string, and more.

Step 1: Open Lab Workbook

Open the Bike Sales_Functions_Lab workbook.

Step 2: Use the CONCAT Function to Combine Data

The CONCAT function, short for “concatenate,” joins strings of text together. The
function can include actual text strings enclosed in quotation marks as well as cell and
range references. It also allows you to combine different types of data like numbers,
dates, and text strings into a single cell.One use case for the CONCAT function is to
combine different pieces of important information for a human to be able to read at a
glance, instead of going to multiple columns to find this information. In this step we will
use the CONCAT function with our bike sales data to simulate this use case, and
combine the sales order number, order quantity, product subcategory, and date into a
new “Sales_Summary” column.

a. Select cell U1 and enter the header text “Sales Summary”

This will be the column for the combined sales data

b. Select cell U2 and type (don’t copy and paste) =CONCAT

Excel may display a description of the function, “Concatenates a list or range of text
strings,” below the formula bar.

c. Add a parenthesis after the keyword so it reads =CONCAT(

Excel may display the syntax of the function, or how it must be written, below the
formula bar.

d. If it doesn’t do this automatically, you can access the same


information by clicking the Function button to the left of the
formula bar, choosing “Text” in the Category drop-down, and
selecting CONCAT from the list of functions.
e. Notice that the text strings and cell references inside the
parenthesis are separated by commas.
f. Note: if an actual text string (instead of a cell or range
reference) is used, it must be enclosed by quotation marks.
g. Finish the writing the function in cell U2 so that it will combine
the order number (in cell A2) with the product subcategory (in
cell L2):
1. In cell U2, finish typing = CONCAT(A2, L2)
Cell U2 should now display the text “000261695Mountain Bikes”
(it’s also OK if the leading zeroes are missing).
2. To make the results more readable, let’s edit the formula
to add a colon and a blank space between the order
number and “Mountain Bikes”.
h. =CONCAT(A2, “: “, L2).
Note: Type the characters directly into the sheet, rather than
copying and pasting from the lab, or you may get an error message.
1. For a challenge, edit the function to add additional
information that includes the order quantity and the
date. Be sure to enter a blank space between the order
quantity and “Mountain Bikes,” and add a dash and
spaces between “Mountain Bikes” and the sales date.
Note: Excel stores dates as simple five-digit numbers;
they have to be formatted using a TEXT function to be
human-readable. =CONCAT(A2,": ",N2, " ",L2," - ",
TEXT(B2, "mm/dd/yyyy"))
The results in cell U2 should be “000261695: 4 Mountain Bikes –
12/01/2021”
Note: When Excel thinks a function is text, it adds an apostrophe
before the equals sign. If your function is showing up as text
instead of returning data from other cells, check that there is no
apostrophe before the equals sign in the formula bar.
1) Copy or drag the formula from cell U2 to the remaining
cells in Column U, from U3 through U89.
i. Remove formulas from column U
1. The cells in Column U contain the CONCAT formulas. To
remove the formulas, highlight and copy Column U. Then
use Paste > Paste Values to paste just the values
into Column V.
2. Delete Column U as the formulas are no longer needed.

Question: What is the result if the formula in step 3 is written using


just a cell reference to B2 to include the date, as =CONCAT(A2,":
",N2, " ",L2," - ", B2), rather than using the TEXT function?

Answer: The function returns the date as a five-digit number.

Step 3: Use the LEFT, RIGHT and MID functions to Extract


Information
Column M in the data set contains the product description of the
bikes sold. It gives the bike model, the color, and the size. What if
you needed this information separated into three separate columns
for analysis? You can use the LEFT, RIGHT and MID functions to do
this:
 The LEFT function will return a specified number of characters
from the start (left) of a text string.
 The RIGHT function will return a specified number of
characters from the end (right) of a text string.
 The MID function will return a specified number of characters
from the middle of a text string.
a. First, create three new columns to the right of Column M,
Product Description, that will contain the separated
information. These will become columns N, O, and P.
b. Name the columns Model, Color, and Size respectively.
c. In cell N2 enter the function =LEFT(M2, 12) to separate out the
bike model. This function tells Excel to return the first 12 text
characters from the left in cell M2.
The result should be “Mountain-200” in cell N2.
d. In cell O2 enter the function =MID(M2, 14,5) to separate out
the bike color. This function tells Excel to return the 5
characters after the 14th character (counting from the left) in
cell M2.
The result should be “Black” in cell O2.
e. In cell P2 enter the function =RIGHT(M2,2) to separate out the
bike size. This function tells Excel to return the last 2
characters from the right (or end of the text string) in cell M2.
The result should be “46” in cell P2.
f. Copy and paste the functions in cells N2, O2, and P2 down to
the remaining cells in the columns.
g. While the functions returned the expected results for cells in
row 2, notice that in other rows, some of the text is cut off as
shown below. In row 3, the color is missing the “r” at the end
of Silver. In rows 4 and 5, the –W is cut off the model name,
and the color is shown as “W Sil.” This is because the model
names and colors of the product descriptions in Column M are
not all the same number of characters. This is a common issue
when splitting text between columns.

There are multiple ways to correct this issue; one is to modify the
formulas in these cells based on the length of the model names and
colors. First, we will apply a filter to Column M to display only
specific bike models, and then adjust the functions for those cells to
the matching number of characters.
h. Highlight Column M, and then click Sort & Filter in the Editing
menu group and select Filter in the dropdown menu.
i. Click the filter down arrow in the Column M header and click
“Select All” to uncheck all the boxes.
j. Then check the boxes next to entries that are silver and that
don’t have the “–W” in the model name. This will be a total of
6 boxes. Then, click Apply.
Notice these are all missing the “r” in the color silver. To fix this,
the MID function for these rows needs to return 6 characters after
character 14, rather than 5.
h. to return 6 characters after character 14, rather than 5.

k. Select the first cell, which should be O3, in the color


column, Column O and adjust the formula
To read =MID(M3, 14,6)
l. Copy or drag this formula to the remaining cells in the column.
m.Next, click the filter down arrow in the Column M header and
click “Select All” to uncheck all the boxes.
n. Then check the boxes next to entries that do have a –W in the
model name. This will be a total of 3 boxes. Then, click Apply.

These rows will need an adjustment to the LEFT function in Column


N to capture the whole model name and to the MID function
in Column O to capture the whole color name.
o. In Column N, change the LEFT formula to return the first 14
characters.
=LEFT(M4, 14)
p. In Column O we need to change the MID formula to return the
6 characters (rather than 5) after the 16th character (rather
than the 14th).
=MID(M4, 16,6)
q. Copy these formulas to the remaining cells in the columns.
r. Once the functions are correct, remove the filter from Column
M by selecting Filter under Sort & Filter.
The functions in columns N, O, and P should now display the correct
number of characters.
s. Remove formulas from columns N, O, and P.
1. The cells in columns N, O, and P contain the LEFT, MID,
and RIGHT formulas. To remove the formulas, add three
new columns to the right of column P. Highlight and copy
columns N, O, and P, then use Paste > Paste Values to
paste the values into the three new columns,
2. Delete columns N, O, and P that contain the formulas.
t. Delete all columns added in Part 1 (columns N, O, P and U) to
restore the worksheet to its original state before proceeding to
Part 2 of the lab.
Part 2: Using Excel Statistical Functions
Excel has many built-in statistical functions that can perform all
kinds of mathematical and statistical calculations. Some of the more
common statistical functions used by data analysts include COUNT,
COUNTBLANK, COUNTA, COUNTIF, AVERAGEIF, MINIF, and MAXIF.
Let’s learn how to use these functions to work with data.
Step 1: Use the COUNT Function to Find Cells with Non-numeric
Values
a. The COUNT function in Excel returns the count, or number, of
numeric values in a set or range of cells or values. This means
it counts cells with numbers and dates, but not blank or text
cells.
For this example, we will check Column N, Order_Quantity, for any
blank or non-numeric (and therefore invalid) entries.
b. Add a new blank column to the right of Column N,
Order_Quantity, this becomes Column O.
c. In cell O2, type the function =COUNT(N2:N89)
The result should be 88. Scroll to the end of the data to see that
there are 89 rows filled. The top row is the headers, meaning there
are 88 rows of data in the worksheet. The results of our COUNT
function match the number of data rows we have, which means all
the cells in column N contain numeric data and can be presumed
valid.
Now, let’s introduce some non-numeric data and see how that
changes the result of the COUNT function.
d. Delete the contents of cell N5 to make it a blank cell.
e. Note that this changes the count of numeric cells in cell O2 to
87 because cell N5 now contains no numeric value.
f. Right-click on the cell and select Number Format from the
drop-down menu. A pop-up will appear, and in the field labeled
Category, choose Text. This will change the data type of
cell N5 to Text. Now type 1 into cell N5. Note that Excel left-
justifies the number, because it is reading it as text.
g. Note that the count remains 87, because the value of “1” in
cell N5 is now considered text and not numeric. The COUNT
function only counts cells containing numeric values.
h. Delete the contents of cell N5, change the data type back to
General, and enter 1 to return the cell to its original value. The
function in cell O2 should again return a result of 88.
i. Delete the COUNT function from cell O2.
Step 2: Use the COUNTA Function to Find Non-Blank Cells
The COUNTA function returns the count of non-blank cells in a
range, whether the cells contain numeric values or text values.
Again, we will check Column N, this time to make sure there are no
blank cells.
a. Select cell O2 and type the function =COUNTA(N2:N89)
The result should be 88, indicating that all 88 cells from N2 to N89
contain data.
b. As before in Step 2, delete the contents of cell N5 to make it a
blank cell.
Note, as before, that the count of numeric cells changes to 87
because cell N5 contains no value.
c. Change the data type of cell N5 to Text again, and then
type 1 in the cell again.
d. This time the function returns 88, because the COUNT function
counts cells containing both numeric and text values.
e. Return cell M5 back to the General data type and enter 1 to
return the cell to its original value.
f. Delete the COUNTA function from cell O2.
Step 3: Use the COUNTBLANK Function to Find Blank Cells
Before analyzing data, it is a good idea to check for missing data.
When importing data or with manual data entry, some cells that are
supposed to have values in them may end up blank. The
COUNTBLANK function is useful for identifying the presence of blank
cells in a range.
The COUNTBLANK function returns the count of blank cells in a
range of cells.
For this example, we will check Column N, Order_Quantity, for any
blank cell entries.
a. Select cell O2 to enter the COUNTBLANK function
Type the function =COUNTBLANK(N2:N89)
The result should be 0 which is expected since there are no blank
cells in cells N2 through N89.
b. Delete the values in a few cells in column M and observe the
output of the function.
For each cell in which the values were deleted, the function should
increase the count of blank cells.
c. Use the undo button at the left of the toolbar (or Command+Z)
to return the cells to their original values. Delete column O.
Step 4: Use the COUNTIF Function to Find Non-Blank Cells
Another important task for data analysts to perform prior to
analyzing data is to check for duplicate entries in a column. This,
and many other tasks, can be accomplished with the COUNTIF
function.
The COUNTIF function returns the count of cells that meet a specific
criterion.
For this example, we will check for duplicated order numbers in
column A, Sales_Order #.
a. Add a new column to the right of Column A. This creates a new
column B.
b. Select cell B2 and type the
function =COUNTIF(A$2:A2,A2); then copy the function to cells
B3 to B89. This is telling Excel to count the total number of
values in column A that match the value to the left of the
formula. If you double click on the cells below B2, you can see
the last cell reference is referring to the cell for that row in
column A.
The result should be 1 for all the cells in column B, which means
there is only one of each order number. In other words, there are no
duplicate sales order numbers in column A.
c. Create some duplicate order numbers in Column A:
1. Copy order number 000261695 from cell A2 into cell A3 to
make a duplicate entry.
2. Copy order number 000261700 from cell A7 to cell A8 to
make another duplicate entry.
The function in cells B3 and B8 should now give a result of 2,
indicating a second instance of the sales order numbers 000261695
and 000261700.
d. Change the sales order number in cell A4 also to 000261695.
The function in cell B4 should now give a result of 3, indicating a
third instance of the order number 000261695.
e. Return the sale order numbers in cells A3, A4, and A8 to their
original values using the Undo button or keyboard shortcut.
f. Delete column B.
Step 5: Use the AVERAGEIF Function
The AVERAGE IF function finds the average of all the cells in a range
that meet a given criterion. For this you will use the AVERAGEIF
function to find the bike shop’s average revenue by age group.
a. In cell U2, enter the text “Youth”
b. Select cell V2 to enter the AVERAGEIF function
The result should be 3533, which is the average revenue from
mountain bikes purchased by youths
c. In cell U3, enter the text “Young Adults” and in cell U4, enter
the text “Adults”
Type the function =AVERAGEIF(G2:G89, “Youth (<25)”, S2:S89)
d. The result should be 3533, which is the average revenue from
mountain bikes purchased by youths.
e. In cell U3, enter the text “Young Adults” and in cell U4, enter
the text “Adults”
f. Write (do not copy and paste) the appropriate formula for
calculating the average revenue from mountain bikes sold to
the Young Adults and Adults age groups in
cells V3 and V4 respectively:
1. For cell V3, use function =AVERAGEIF(G2:G89, "Young
Adult (25-34)",S2:S89)
2. For cell V4 use function =AVERAGEIF(G2:G89, "Adult (35-
64)",S2:S89)
3. Change the number type in cells V2, V3, and V4 to USD
currency.
g. Add the header “Average Revenue by Age Group” in cell V1.
You should see these results:

Average Revenue by Age Group


Youth $ 3, 533.00
Young Adults $ 3,859.55
Adults $ 4,388,43
Step 6: Use the MINIFS and MAXIFS Functions
The MINIFS function in Excel returns the minimum value from a set
of values specified by one or more criteria. The MAXIFS function
returns the maximum value from a set of values specified by one or
more criteria.
For this example, you will use the MINIFS function to find the order
from Australia that generated the smallest amount of revenue.
a. In cell W1, type the header “Minimum Revenue from Australia”
in bold and expand the column as needed to see the text.
b. Select cell W2 to enter the MINIFS function to find the smallest
amount revenue from Australia.
c. Type the function =MINIFS(S2:S89, I2:I89, “Australia”)
The result should be 565. Next, we’ll use the MAXIFS function to
find the order from Australia that generated the largest amount of
revenue.
d. In cell X1, type the header “Maximum Revenue from Australia”
in bold and expand the column as needed to see the text.
e. Select cell X2 to enter the MAXIFS function to find the largest
amount revenue from Australia.
f. Type the function =MAXIFS(S2:S89, I2:I89, “Australia”)
The result should be 13500.
g. Change the data type in cells W2 and X2 to USD currency.
You should see these results:
Maximum
Minimum Revenue from Australia
from Australia
$565.00 $13,500.00

Question: What function could you use if you wanted to add up the
revenue from bikes sales limited to just Australia? (Hint, this
function adds cells values in a range based on a given criteria.) Use
the “Insert Function” button under the Formulas tab on the toolbar
to look through available functions and see if you can find it.
Answer: The SUMIF function lets you add values that meet a specific
criterion.

a. Save the workbook as Bike Sales_Functions_Lab 4.3.3_Part2


Part 3: Experiment with Other Excel Functions

There are many more functions that are useful to data analysts for preparing, cleaning,
and searching datasets. Explore the various function categories in Excel and
experiment by applying them to a dataset. Some specific functions to try are those that
convert text case, such as LOWER, PROPER, and UPPER, and those that return
information, such as LEN, FIND, and SEARCH.
PRACTICE ITEM

Question: In E2, calculate the average mass for apples. Note: Use the
AVERAGEIF( range, criteria, avg_range ) function.

4.3.5 Pivot Tables: A Video

4.3.6 Pivot Tables

Pivot tables are a function in Excel that is invaluable to data analysts. Pivot tables
provide a way to automatically summarize, analyze, explore, and present data. Pivot
charts enable you to add visualizations to the data in a pivot table. Using these built-in
tools you can identify trends, make comparisons between data items, and create charts
in different styles to visualize your data. You’ll be creating a Pivot Table in the upcoming
lab, but for reference, here are the steps to create a pivot table in Excel:

1. Start with a worksheet that is organized in columns and rows, with column
headers.

2. Select the table or range of cells that you want to be included in your Pivot table.
Be sure to include the column headers in the range.

3. Choose Insert from the menu bar. In the Tables section, you can view various
formats that your data can take by selecting the Recommended Pivot tables
choice. If you don’t see a recommended table that meets your requirements,
select Pivot table => From Table or Range.

4. The Create Pivot table dialog box opens. Your selected range should show in the
Table/Range box, and New Worksheet should be selected. When creating your
first Pivot tables, it is best to have them placed on a new worksheet tab.

5. A blank Pivot table worksheet is created and the Pivot table Fields list containing
your column headings appears. Click the fields that you want to add.

6. The Pivot table wizard will place your fields in one of the boxes at the bottom,
Filters, Columns, Rows, and Values. You can drag and drop your fields into
different categories to change the way that your Pivot table summarizes the data.

7. The Pivot table will appear with your selections. Pivot tables support most of the
functions that you can do on a normal spreadsheet, such as sorting, filtering, and
cell formatting.

Updating data in the original spreadsheet does not update the pivot table automatically;
you must refresh your Pivot table to have it reflect new data.
4.3.7 Practice Item
Question 1
What should be done first when creating a Microsoft Excel Pivot Table?
Answer: Create or select data that needs to be analyzed

4.3.8 Lab - Pivot Tables

In this lab, you will learn the basics of creating a pivot table in Microsoft Excel. Pivot tables
provide a way to automatically summarize, analyze, explore, and present data. Charts add
visualizations to the data in the pivot table that analyze trends and comparisons.
Part 1: Creating an Excel Pivot Table
Part 2: Visualizing Pivot Table Data

The bicycle sales company wants to determine the purchasing patterns of different demographic
groups to identify areas where it needs to concentrate its marketing efforts. The company also
wants to see if there are purchasing differences between the countries where it operates.
Pivot tables can reveal useful information in records or data that is not obvious at first sight by
summarizing and re-presenting the data so trends can be explored and reported. Pivot tables
extract meaning from the data by grouping it in different ways, enabling useful conclusions to be
made.
The "pivot" part of a pivot table stems from the fact that that the data can be rotated (pivoted) to
view it from a different perspective.
It is important to note that pivot tables do not add to, subtract from, or otherwise change, the
data; the pivot table just reorganizes the data to reveal useful information.
Instructions
Part 1: Creating an Excel Pivot Table
To facilitate readability of the data, resize the column widths and center the data in the numeric
columns. This exercise will focus on Year, Age Group, Customer Gender, Country, and Order
Quantity data.
Accordingly, it may also be useful to hide the Customer Age, State, Product Category, Sub-
Category, Product columns.
Step 1: Create the pivot table.
a. Download the Bike Sales_Pivot_Lab.xlsx and save it to your OneDrive. Open the file in
MS 365 Excel online. Click the Insert menu tab and select Pivot Table. In the Create
Pivot Table dialog box, make sure New Worksheet is selected and click OK.
b. In the PivotTable Fields dialog box, select the following fields: Year, Age
Group, Country, and Order Quality.
The pivot table created displays each country grouped under each age group, with the sum of the
order quantity for each age group in total, and the total for each country under that age group.
Step 2: Review the pivot table.
Note that pivot tables will automatically sum numeric data under each heading. However, in this
case, this produces the meaningless sum of all the year values. So, year values are not useful to
include at this stage.
a. Uncheck Year in the PivotTable Fields dialog and update the pivot table. (Selecting any
cell in the pivot table will bring the PivotTables Fields dialog box back up if it is no
longer visible.)
The information now makes more sense with the total of orders for each age group shown, with
the country breakdown for that country.
b. In the pivot table, click the - (minus) beside each age group label to collapse (hide) the
countries listed under that age group.
To display the countries again, click the now displayed + (plus) displays the country sales
numbers.

Step 3: Rearrange the pivot table: To create a different view of the data, drag country in the
PivotTable Fields dialog box to the columns pane. The updated pivot table now displays the
countries as columns with sales totals for each age group and each country.
Step 4: Refine the pivot table.
The pivot table now contains blank cells, which detract from the readability of the table. Excel
can be instructed to fill each blank cell with zero.
a. Click the Pivot Table tab on the menu bar. Under Pivot Table, click Settings. In the
field For empty cells show, select the checkbox and enter 0 (zero). Press Enter to
update the table.
b. The column values can be centered for better readability. Center all the columns with
numbers.
c. The first column can be filtered to re-order the age groups from youngest to oldest. Click
the Filter and Sort down arrow next to the Age_Group column heading. Click Sort
Descending to sort age groups from Youth to Adults.
Step 5: Revise the pivot table.
a. To enhance the data analysis, select Customer_Gender to add the field into the pivot
table.
Note that any blank cells are automatically filled with zero as set from a previous step.
Clicking the – next to each age group will hide the gender for that age group and + will expand
that data category.
Part 2: Visualizing Pivot Table Data
Presenting the pivot table as a graphical chart will highlight features of the data and assist in
analysis and decision making.
Step 1: Create a pivot table chart.
a. Select all the cells in the pivot table. Click Insert from the menu bar. The chart icon
group appears on the ribbon.
b. Find and select the stacked column icon in the drop down.
c. Re-size and move the chart for optimum clarity.
d. Right click the chart to bring up a pop-up menu and select Format from the menu list.
This will bring up the Chart Format dialog. Click Chart Title. Change the Chart
Title to “Sales Summary.”
The resulting chart should look similar to the example below, but colors may vary.

Step 2: Analyze the chart data.


Remember the original aim was to determine where marketing effort needs to be applied in
specific markets to reach under-represented demographic groups and to see if there are
differences between the countries where it operates.
The chart graphically shows that the youth age group is globally the poorest area of sales.
Female adults are buying the most product.
Questions that the company can pose, and then develop business decisions in relation to, may
include:
Why does the youth age group have the lowest sales globally?
Why are there sales in Australia in all categories except male youth?
Why are there no sales to adult males in France?
Why is there only one successful market category in the United Kingdom?
Step 3: Revise the chart format.
You can revise the chart, so it only shows the values for each age group.
a. Click the – next to each age group in the pivot table to hide the gender information.
This chart now shows the aggregated sales data across each age group for each country.
Reflection Questions
Examine the data presented in the pivot table charts and charts and answer these questions.

Challenge Activity
Examine the data presented in the pivot table charts and charts and answer these questions.
1. Using the original sample spreadsheet file, Bike Sales_Pivot_Lab.xlsx, construct a new
pivot table that will enable analysis of the most profitable markets by country, age group,
and gender.

4.3.9 Practice Item

Question 1

Refer to the diagram. A data analyst would like to insert a Pivot table in a Microsoft Excel
spreadsheet. Under which tab and function group is the option to insert a Pivot Table?

ANSWER: Under the “Insert Tab” in the table functions. That’s right. Microsoft Excel can be
used to create Pivot tables providing a way to automatically summarize, analyze, explore, and
present data. To insert a Pivot Table, select a data range in an existing worksheet, click on the
Insert tab, then on PivotTable.

4.4 Transforming Data with Excel Summary


4.4.1 What Did I Learn in This Module?
In this module, you learned to use Excel to import, clean and prepare the data for the
visualizations.
Sorting and Filtering Functions:
Topic Objective: Use data analysis tools and techniques to sort and filter data with Excel.
The Excel sort and filter functions enable you to determine what data is included in a particular
view, as well as in what order the data appears. Sort and filter can also be used to identify blank
or corrupt fields in the data.
Formatting and Adjusting Data:
Topic Objective: Use data analysis tools and techniques to format and adjust data with Excel.
Data is not always in a format that can be used in analysis. Excel supports functions that can
convert text to columns, change the formatting of text, numbers, and dates. Data can be
conditionally formatted when the content of the cell or range matches specific user-defined
values, such as the highest, lowest, or the result of a formula.
Data Calculations
Topic Objective: Use Excel and techniques to perform data calculations.
Excel supports a wide range of formulas and function capabilities. Built-in functions are
formulas that start with a keyword that identifies the function to be performed. Pivot tables are a
built-in feature of Excel that supports summarization and visualization of data.
WEEK 5 OF 3MT

Data Visualization
Principle

1
Data visualization is a crucial aspect of data science, focusing on transforming data into
meaningful visual representations. Through practical exercises, you will learn to use popular
tools to craft visuals that support decision-making and effectively communicate your data
insights.

Data Analysis Week 5: Learning Objectives for the week


At the end of the week, you should be able to:
★ Understand and Apply Visualization Principles: Identify key principles of effective data
visualization, including simplicity, accuracy, and clarity, and apply these principles using
statistical methods.
★ Choose and Create Visualizations: Select the appropriate visualization types for
different data scenarios and create visualizations using tools like Excel, while applying
color theory and design principles.
★ Address and Resolve Data Issues: Identify and address anomalies and issues in data
to ensure accurate and clear visual representation.
★ Analyze Data and Present Findings: Analyze data using statistical summaries and
present insights through effective visualizations in written reports.

Weekly Applied Learning Assignment

For this week, complete the following task and submit in the assignment tab on your
dashboard. Ensure you grant view access.
Data Cleaning and Visualization.
Download a dataset with issues from platforms like Kaggle or Google Dataset
Search, clean the data, and document the steps you took to resolve the anomalies.

What are the most important data visualization principles for analytical skills

1:Know your audience


2Choose the right type of visual
3Use colors and shapes wisely
4Follow the data-ink ratio
5Tell a story with your data
6Test and refine your visuals
7Here’s what else to consider
1) Know your audience

The first principle of data visualization is to know your audience and their needs. Different
audiences may have different levels of familiarity with the data, different expectations and
preferences, and different goals and questions. Therefore, you should tailor your visuals to suit
your audience, by choosing the appropriate level of detail, complexity, and interactivity, and by
using clear and consistent labels, legends, and annotations. You should also consider the
context and format of your presentation, such as whether it is online or offline, static or
dynamic, or standalone or part of a report. In my experience, understanding your audience is
pivotal in data visualization for analytical skills. Tailoring visuals to the audience's expertise level
ensures clarity and engagement. For instance, technical stakeholders may prefer detailed
charts, while executives might need concise, high-level summaries. By anticipating questions
and tailoring the presentation accordingly, you foster effective communication, enabling
informed decision-making. Remember, the goal is not just to present data, but to tell a
compelling story that resonates with your audience's needs and expectations.

The data plays very important role in presentation. It helps building the perspectives from
different angles and level. Having data has no logic unless it is analytically presented in simple
but attractive and interactive form. Case stories add values to the data and helps audience be in
the presentation and understand it thoroughly i.e. trends, comparison, corelation Coordination,
data dynamics. Colourful and visual data presentation with analysis and pictorial glimpses add
values to it. Good statics, Chart graphics animation voice over make it more meaningful and
useful. Knowing you audience is very important. Giving clarity to them at a glance is the key.
Defining the metrics and giving them key KPIs helps the end user analyse the visuals deeper.
Aligning your visuals with the business needs help your audience to make business decisions
wisely.

2) Choose the right type of visual

The second principle of data visualization is to choose the right type of visual for your data and
message. Different types of visuals have different strengths and weaknesses, and can convey
different kinds of information, such as trends, comparisons, distributions, proportions, or
relationships. For example, line charts are good for showing changes over time, bar charts are
good for comparing categories or groups, pie charts are good for showing parts of a whole, and
scatter plots are good for showing correlations or outliers. You should avoid using visuals that
are misleading, confusing, or irrelevant to your data and message. With so many different types
of charts, graphs, tables and schemes available to represent data, deciding what type is best for
visualizing the data being presented is itself a skill. Using the right technique will make the data
represented easy to understand, in addition to presenting it in the most accurate and relevant
manner. Make sure you select the right type of visualization to represent certain data, as this
plays an important role in conveying the right data in the most effective way to users. Think
about how users will interact with the information you are presenting.

3) Use colors and shapes wisely


The third principle of data visualization is to use colors and shapes wisely to
enhance your visual appeal and clarity. Colors and shapes can help you highlight
important elements, create contrast and harmony, group and separate
categories, and convey emotions and meanings. However, you should also be
careful not to use too many colors and shapes, or use them inconsistently or
arbitrarily, as this can create visual noise and distraction. You should also consider
the cultural and psychological associations of colors and shapes, and the
accessibility and readability of your visuals for different audiences. Remember
that intuitively, while we interpret visualized data in the form of graphs, charts,
heat-maps, tables, maps, etc. which use colors, we associate lighter colors to
represent lower or median values and darker colors to represent the higher or
extreme values. You can also use color for easy-understandable color coding, for
example, show on-target performance in green and below target in red. Mixing
this up or changing this can make it extremely difficult and counterintuitive to
understand the data presented. Colors and shapes are not just aesthetic choices
in data visualization, they serve as critical tools for cognitive processing. The
appropriate use of color can direct viewers' attention to key data points and
patterns, while shapes can differentiate data sets. However, it's essential to use a
colorblind-friendly palette to ensure accessibility and to avoid overwhelming the
viewer with excessive variety, which can lead to misinterpretation or confusion.
Understanding the psychology behind color perception can greatly enhance the
effectiveness of data presentation.

4) Follow the data-ink ratio: The fourth principle of data visualization is to


follow the data-ink ratio, which is the proportion of ink used to display the data
versus the total ink used in the visual. The data-ink ratio suggests that you should
maximize the amount of ink that represents the data, and minimize the amount of
ink that does not, such as unnecessary borders, backgrounds, grids, or decorations.
This can help you reduce clutter and emphasize the data. However, you should also
balance the data-ink ratio with the aesthetics and functionality of your visuals, and
not sacrifice important information or features for the sake of simplicity.

5Tell a story with your data: The fifth principle of data visualization is to tell a
story with your data , which is the process of creating a narrative around your data
that engages your audience and conveys your message. A story can help you
provide context and meaning to your data, highlight key findings and insights, and
persuade or inspire your audience to take action or learn more. To tell a story with
your data, you should use a clear and logical structure, such as a beginning,
middle, and end, and use techniques such as headlines, captions, annotations, or
transitions to guide your audience through your visuals. Telling a story with data
visualization enhances understanding by providing context and relevance. It
transforms raw numbers into compelling narratives, engaging stakeholders and
driving decision-making. Through clear visualizations, complex patterns and
trends become accessible and memorable.

6Test and refine your visuals: The sixth and final principle of data visualization
is to test and refine your visuals, which is the process of evaluating and improving
your visuals based on feedback and data. Testing and refining your visuals can
help you ensure that your visuals are accurate, relevant, clear, and effective, and
that they meet the needs and expectations of your audience. You can test and refine
your visuals by using various methods, such as checking for errors, biases, or
inconsistencies, using data visualization tools or software, or asking for feedback
from your colleagues, clients, or users. Testing and refining data visuals is a
critical step that leverages iterative design principles. By incorporating user
feedback, one can identify and correct misinterpretations, ensuring that the visual
communication is both effective and intuitive. Additionally, this process helps to
align the visualization with the cognitive load of the audience, optimizing for
clarity and comprehension. It's important to remember that data visualization is not
just an art but also a science that benefits from repeated evaluation and
adjustments.

7) Here’s what else to consider: This is a space to share examples, stories, or


insights that don’t fit into any of the previous sections. What else would you like to
add?

Several key principles are crucial like: - Clarity: Ensure that the visualization is
easy to understand.
- Simplicity: Keep it simple and focused on the key insights.
- Accuracy: Ensure that the data presented is accurate and reliable.
- Relevance: Focus on presenting data that is relevant to the analysis and the
audience.
- Interactivity: Provide interactive elements that allow the exploration and gain
deeper insights.
- Storytelling: Use it to tell a story that helps the audience understand the data and
its implications.
- Feedback: Seek feedback to improve the effectiveness.
The key principles for data visualization in analytical skills are: clarity, relevance,
accuracy, contextualization, interactivity, and aesthetics. These principles
ensure that visualizations are understandable, useful, and accurate for data
interpretation.
Always talk to the Stackholder to understand what he expects, how to analyze the
data and try to analyze what you think is important that he didn't present to you and
show it as a bonus. Always use the best practices of Storytelling, always review
what has been built and practice PDCA.
Keep visualizations simple and easy to understand to convey the message clearly.
Ensure that the visualizations are relevant to the audience and the message you
want to convey. Represent data accurately without distorting or misinterpreting it.
Use clear labels, titles, and legends to make it easy for viewers to understand the
information. Maintain consistency in the design elements such as color, style, and
scale across the visualizations. Provide interactive elements like tooltips or filters
to allow users to explore the data in more detail. Use visualizations to tell a story
and guide viewers through the data to highlight key insights. Be aware of the
ethical considerations when visualizing data.
Data visualization is a crucial aspect when analyzing company data, especially in
areas of sales, market share, product usage, and demographics, among others. In
order to effectively and clearly transform necessary data into meaningful insights
and aid in decision-making, visualizations need to be void of vagueness and
displayed properly in order to avoid inaccurate conclusions, which can be
disastrous.
Data visualization can answer important strategic questions and provide really
valuable information that can help put together real solutions. It can be used to
track performance, monitor behavior, as well as to measure the effectiveness of
various processes implemented. Taking the time in the beginning to define the
purpose, priorities, and goals of the data visualization will make the result more
useful and relevant to the specific target audience, preventing you from creating
visuals that are unnecessary. Determine what decisions you want to drive from the
target audience and how frequently the data needs to be reported.
To begin with, we must set ourselves an objective: What do we want to have with
the data we are analyzing? Second, we have to evaluate who it is aimed at, who is
going to consume that data. The ideal is to keep it simple, follow a line and tell a
story, always thinking about the objective (for what and for whom).
Data wrangling is the process of cleaning, transforming, and organizing data for
analysis. It is an important skill to learn when it comes to data visualization
training Tools: Familiarizing yourself with different data Design principles:
Understanding design principles such as color theory, typography, and layout is
important Communication skills: The ability to communicate data in a clear and
concise way is essential, The ability to tell a story with data, this skill allows you to
take complex data and present it in a way that is easy to understand and actionable.
By mastering these skills, you will be able to create effective data visualizations
that communicate insights and make data accessible to a wide audience

5.0.1 Introduction

To do their jobs efficiently and effectively, data analysts must have a basic
understanding of statistics. This is because data analytics relies heavily on statistics
in the process of analyzing and interpreting data.

5.0.2 What Will I Learn in This Module?

In this module, we will create some visualizations in Excel. But first, we need to
understand some statistical concepts in order to make the most of visual
interpretations.

Upon completion of this module, you should be able to:

Topic Title Topic Objective


Using Statistics to Interpret Data Describe different types of statistics.
Choosing the Right Visualization for the Job Select data visualizations to best explain analysis results.
Creating Visualizations with Excel Create visualizations with Excel.
Addressing Anomalies in Data Interpret visualizations to identify anomalies in data.
Using Excel to Address Issues with Data Use VLOOKUP or XLOOKUP in Excel to identify and fix issues.

5.1. Using Statistics to Interpret Data

5.1.1: What are statistics, populations and samples.

5.1.2 Practice Item

Question 1

What is one important criteria when selecting sample data for analysis? The sample is a
representative of the population being studied. When choosing a sample for analysis, the sample
should be a randomized group that is representative of the total population.

5.1.3 Descriptive Statistics


After the problem statement (also known as the question to be asked) and population is
determined, some form of statistical analysis is needed. There are two key branches of statistics
that we will discuss in this course:
 Descriptive Statistics
 Inferential Statistics
Descriptive statistics are used to describe or summarize the values and observations of a data set.
For example, a fitness tracker logged a person’s daily steps and heart rate for a 10-day period. If
the person met their fitness goals in 6 out of the 10 days, then they were successful 60% of the
time. Over that 10-day period, you could observe that the person’s heart rate was a maximum of
140 beats per minute (bpm), but an average of 72 bpm. These observations would be descriptive
statistics that could be used to describe and simplify the data set.
 Basic descriptive statistics might include the total number of data points in a data set, the
range of values that exist for those numeric data points, or the number of times a given
value appears in a data set. Descriptive statistics may also answer questions about the
occurrence of trends.
The answers to these questions can be provided in numerical or graphical formats. Results of
descriptive statistics are often represented in pie charts, bar charts or histograms. One important
point to note is that while descriptive statistics describe the current or historical state of the
observed population, it does not allow for:
 comparison of groups
 conclusions to be drawn
 predictions to be made about data sets that are not in the population
In the fitness tracker example, we cannot infer that the person has poor health because they were
only successful in meeting their goal 60% of the time. We also cannot use the data set for this
one person to predict the fitness performance for others with similar characteristics. This is
where inferential statistics becomes important.

5.1.4 Practice Item

Question 1

TrueorFalse

Descriptive statistics only allow you to make summations about the people or objects that
you have actually measured.
ANSWER: true. That’s right.
Descriptive statistics are limited so much that they only allow you to make summations
about the people or objects that you have actually measured. Even the data you collected
for generalizations regarding other people or objects is inadmissible. While descriptive
statistics describe the current or historical state of the observed population, it does not
allow for comparison of groups, conclusions to be drawn, or predictions to be made about
other data sets that are not in the population.
5.1.5 Inferential Statistics

Descriptive statistics allows you to summarize findings based on data that you already have
recorded or observed about a population. However, there are situations in which gathering data
for a very large population may not always be practical or even possible. It is possible, however,
to study a smaller representative sample of a population and use inferential statistics to test
hypotheses and draw conclusions about the larger population.
Inferential statistics is the process of collecting, analyzing and interpreting the data gathered
from a sample to generalize or predict something about a population. When a representative
sample is used, methodological concerns may arise and must be addressed, such as whether the
groups chosen for the study or the environment in which a study is carried out accurately reflects
characteristics of the larger group. Typically, these types of analyses will include different
sampling techniques to reduce error and increase confidence in the generalized findings. The
type of sampling technique used will depend on the type of data.

5.1.6 Practice Item

Question 1 What is a characteristic of inferential statistics in data analytics?

Answer: It is the process of collecting, analyzing and interpreting data gathered from
representative samples. That’s right.

Inferential statistics is the process of collecting, analyzing and interpreting data gathered from a
sample to make generalizations or predictions about a population. While descriptive statistics can
only summarize a sample’s characteristics, inferential statistics use your sample to make
reasonable guesses about the larger population. Because a representative sample is used instead
of actual data from the entire population, concerns that the particular groups chosen for the study
or the environment in which a study is carried out need to be addressed. As they may not
accurately reflect characteristics of the larger group, include different sampling techniques to
reduce error and increase confidence in the generalizations about the findings. The type of
sampling technique used will depend on the type of data.

5.1.7 Statistics and Big Data: Different statistical approaches are used in big data analytics.
As we know, descriptive statistics describe a sample. This is useful for understanding the
sample data and for determining the quality of the data. Problems can occur when dealing
with large amounts of data that come from multiple sources. Data points can be corrupted,
incomplete, or missing entirely. Descriptive statistics can help determine how much of the
data in the sample is good for the analysis and identify criteria for removing data that is
inappropriate or problematic. Graphs of descriptive statistics are a helpful way to make
quick judgements about the quality of a sample.

For example, in a sample of tweets selected for analysis, some contain only text characters,
while others contain both characters and images. The type of analysis or question to be
answered with analysis will determine whether tweets that contain images or tweets with
no images should be analyzed. This will identify tweets that are invalid based on a very
simple criterion, because images contain information that must be considered in the
analysis if the tweets using images are included in the sample.

A number of inferential analyses are very commonly used in big data analytics:

 Cluster analysis - Used to find groups of observations that are similar to each other
 Association analysis - Used to find co-occurrences of values for different variables
 Regression analysis - Used to quantify the relationship, if any, between the
variations of one or more variables

5.1.8 Practice Item


Question 1
That’s right.
Place the options in the following order:

5.2 Choosing the Right Visualization for the Job

5.2.1 Importance of Visualizations

Press the Play button to watch the video.

5.2.2 Practice Item

Question 1

True or False
Visualizations cannot display outliers making the data displayed unreliable.

ANSWER: FALSE. That’s right.

Good visualizations make faster decision making possible because humans can process
visual images faster than tables of data and they help users make better sense of
complicated data including outliers. Outliers can be displayed using scatter plots.

5.2.3 Common Types of Data Visualizations

There are many types of data visualizations. Determining the best option usually depends on the
answers to the following questions, among others:

 How many variables are you going to show?


 How many data points are in each variable?
 Is your data over time or are you comparing data points at a single point in time?

Select each chart type to review some uses and best practices.
1 LINE CHART

There are many types of data visualizations. Determining the best option usually depends on the
answers to the following questions, among others:
How many variables are you going to show?
How many data points are in each variable?
Is your data over time or are you comparing data points at a single point in time?
Line charts are one of the most commonly used types of comparison charts. Use line charts when
you have a continuous set of data, the number of data points is high, and/or you would like to
show a trend in the data over time. Some examples include:
 Quarterly sales for the past five years
 Number of customers per week in the first year of a new retail shop
 Change in a stock’s price on one day, from opening to closing bell

Some best practices for line charts include:

 Label the axes.


 Plot time on the x-axis (horizontal) and the data values on the y-axis (vertical).
Use a solid line (rather than a broken line) to emphasize continuity of the data.
 Keep the number of data sets to a minimum. There should be a very good reason for
plotting more than four lines. If needed, add a legend to help the audience understand
what they are viewing.
 Remove or minimize gridlines to reduce distraction. Consider using no gridlines except
to emphasize certain values or time periods.
 Modify the y-axis starting point to obtain something close to a 45-degree slope in one or
more of the lines. This ensures you emphasize the change in the data without introducing
distortions that dramatize the visualization.

2) Column Chart
Column charts are positioned vertically. They are probably the most common chart type used to
display the values of a specific variable across similar categories. Some examples include:
 Populations of the BRICS nations (Brazil, Russia, India, China, and South Africa)
 Last year’s sales for the top four car companies
 Average student test scores for six math classes
Some best practices for column charts include:
 Label the axes.
 If changes over time are being shown, time should be plotted on the x-axis.
 If time is not part of the data, consider ordering the data so that column heights ascend or
descend.
 Fill the columns with a solid color. To highlight one column, consider using an accent
color and make all the other columns the same color.
 Column charts are best when there are no more than seven categories on the horizontal
axis. This will help the viewer clearly see the value for each column.
 Start the value of the y-axis at zero to accurately reflect the full value of each column.
 The spacing between columns should ideally be roughly half the width of a column.

Title: Column Chart

3) Bar Chart

Bar charts are similar to column charts except they are positioned horizontally and
hence used slightly differently (for example, they do not usually show changes over
time). Longer bars indicate larger values. They are best used when the names for each
data point is long, because there is space to write the information. Some examples
include:

 Gross domestic product (GDP) of the 25 highest-producing nations in a given


year
 Number of cars sold by each sales representative in a group
 Exam scores for each student in a math class

Some best practices for bar charts include:

 Label the axes.


 Consider ordering the bars so that the lengths go from longest to shortest. The
meaning of the data shown will most likely determine whether the longest bar
should be on the bottom or the top for greatest impact or easiest understanding.
 Fill the bars with a solid color. To highlight one bar, consider using an accent
color and make all the other bars the same color.
 Start the value of the x-axis at zero to accurately reflect the full value of each bar.
 The spacing between bars should ideally be roughly half the width of a bar.
Fig: Bar Chart.

4) Pie Chart:

Fig: Pie-Chart.
Pie charts are used to show the composition of a total. Segments of different sizes visually
represent percentages of that total. The sum of the segments must equal 100%.
Some examples include:
 Annual expenses for a corporation (e.g., rent, administrative, utilities, production)
 A country’s energy sources (e.g., oil, coal, gas, solar, wind)
 Survey results for a group’s favorite type of movie (e.g., action, romance, comedy,
drama, science fiction)
Some best practices for pie charts include:
 Limit the number of categories so that the viewer can easily differentiate between
segments and their meaning in relation to each other. After ten or more segments,
the slices begin to lose meaning and impact.
 If necessary, consolidate smaller segments into one segment with a label such as
“Other” or “Miscellaneous”.
 Use a different color or gray scale for each segment.
 Order the segments clockwise according to size.
 Make sure the value of all segments equals 100%.
5) Scatter Plot

Scatter plots

Scatter plots are very popular for visualizing correlations, or to show the distribution of
many data points. Scatter plots are also useful for demonstrating clustering or identifying
outliers in the data.
Some examples include:
 Comparing life expectancy to GDP for each country in a group
 Comparing the daily sales of ice cream at a given location to the average outside
temperature
 Comparing the weight to the height of each person in a group
Some best practices for scatter plots include:
 Label the axes.
 Make sure the data set is large enough to provide visualization for clustering or
outliers.
 Start the value of the y-axis at zero to accurately reflect the full values of the data.
The value of the x-axis will depend on the data. For example, age ranges of ice
cream customers might be labeled on the x-axis, and there would be no need to start
at zero years of age.
 If scatter plot shows a correlation between values on the x- and y-axes, consider
adding a trend line.
 Do not include more than two trend lines.

5.2.4 Practice Item


Question 1
Match the chart type best suited to visually display the described data.

Question 1
That’s right.
Place the options in the following order:

line chart change in a stock’s price from opening to closing bell


column chart average student test scores for six chemistry classes
bar chart number of cars sold by each sales representative with long names
pie chart annual expenses for a corporation (e.g., rent, administrative, utilities, production)
scatter plot comparing the daily sales of ice cream to the average outside temperature
5.3 Creating Visualizations with Excel

5.3.1 Steps to Create Visualizations in Excel

Press the Play button to watch the video.

5.3.2 Practice Item

Question 1

Refer to the exhibit. What options are accessed by selecting the plus sign (+) icon shown in
the exhibit?
ANS: Editing of the chart elements such as legends ,titles, and data labels
That’s right.
The + icon enables the editing of chart elements. The paintbrush icon chooses a chart style
and applies a color scheme. The funnel icon filters the data points on selected data values.

5.3.3 Lab - Create Visualizations in Excel

In this lab, you will create charts to visualize data in Microsoft Excel.
 Part 1: Creating a Line Chart
 Part 2: Creating a Column Chart
 Part 3: Creating a Pie Chart
Data visualization assists with the analysis and interpretation of data by graphically
presenting relationships, trends, and inferences that cannot always be clearly derived
by examining the raw text and numeric values in a dataset. This lab uses sample
datasets to demonstrate three visualizations of that data.
Data visualization assists with the analysis and interpretation of data by graphically
presenting relationships, trends, and inferences that cannot always be clearly derived
by examining the raw text and numeric values in a dataset.
This lab uses sample datasets to demonstrate three visualizations of that data.
Part 1: Creating a Line Chart
This chart will display the Profit and Revenue for the years 2017, 2018, 2019, 2020 and 2021.
Step 1: Download the data file
a. Download the sample file Bike Sales_Visualizations_Lab.xlsx to your OneDrive. Open
the downloaded file in MS 365 Excel.
This workbook contains four worksheets that each will be used separately throughout this lab.
Step 2: Insert the line chart.
a. Select the Revenue and Profit by Year worksheet. The worksheet contains the profit
and the revenue totals for each of the years 2017, 2018, 2019, 2020, and 2021.
b. Select the data in the cells A3 through to C8.
c. From the Insert menu, expand the ribbon using the down arrow on the right side of the
ribbon, click the Line chart tool, then select Line with Markers (bottom left option)
This creates a line chart with an x-axis showing the years, and a y-axis dollar amounts.
Step 3: Format the chart.
a. To improve the clarity of the chart, change the vertical axis to display USD currency.
1. Right click on the chart and select Format. The Chart Format window pane
opens on the right of the worksheet.
2. Expand the options for the Vertical Axis.
3. In the Number Format section change Category to Currency and
change Decimal places to 0.
b. Add a chart title
1. In the Chart Format window pane, change the Chart Title option switch to the on
position if it is not already and expand the Chart Title options.
2. Change the Chart Title to “Revenue vs. Profits”.
3. Keep the Title Position at the default which is Above.
c. Change the Legend names to “Annual Profit” and “Annual Revenue”.
1. Select cell B3 and change the column name to Annual Profit.
2. Select cell C3 and change the column name to Annual Revenue.
The legend names at the bottom of the chart should change to match the column names.
d. Reposition the Legend to the right of the chart.
1. Right click on the chart to bring up the Chart Format window pane.
2. Expand the options for Legend.
3. Change the Position option to Right.
e. Add axis titles for both the vertical and horizontal axis..
1. If necessary, right click on the chart to bring up the Chart Format window pane.
2. Expand the Horizontal Axis options.
3. Scroll down to the Axis Title and move the switch to the on position.
4. Add an axis title of “Year”.
5. Expand the options for the Vertical Axis.
6. Scroll down to the Axis Title and move the switch to the on position.
7. Add an axis title of “US Dollars”.
The finished chart should appear as shown below.

Part 2: Creating a Column Chart


Step 1: Insert the Column Chart
a. Select the Product Revenue by Country worksheet. The worksheet contains the revenue
totals for each product category by country.
b. Select the data in the cells A3 through to E10.
c. From the Insert menu, click the Column chart tool, then select the Stacked
Column (middle option)
This creates a column chart with an x-axis showing the country, and a y-axis showing dollar
amounts.
Step 2: Format the chart.
a. Using the same methods used for the line chart in Part 1 perform the following formatting
changes to the chart.
1. Give the chart a title of “Product Revenue by Country”.
2. Change the vertical axis Number Format to Currency and the Decimal
Places to zero
3. Change the Position of the Legend to the Right.
4. Add a horizontal Axis Title of “Country”.
5. Add a vertical Axis Title of “US dollars”
Once completed the chart should appear as shown below.

Part 3: Creating a Pie Chart


Step 1: Insert the Pie Chart
a. Select the Revenue by Age Group worksheet. The worksheet contains the revenue totals
for each product category.
b. Select the data in the cells A3 through to B7.
c. From the Insert menu, click the Pie chart tool, then select the 2D- Pie (top option)
This creates a pie chart with each age group represented by an area on the cart representative of
the revenue for that group.
Step 2: Format the chart.
a. Using the same methods used previously for the line and column carts make the
following format changes:
1. Give the chart a title of “Revenue Comparison by Age Group”.
2. Change the Position of the Legend to the Right.
b. Add data labels to the chart area.
1. In the Chart Format window expand the options for Series “Total:
2. Expand the options for Data Labels.
3. Check the boxes for Category Name and Percentage.
Once completed the chart should appear as shown below.

Reflection Questions

Review each of the datasets given in this lab and select different options for each chart. Consider
if the visualization of the data is enhanced, or not, with these different options.

Part 1: Creating a Line Chart

This chart will display the Profit and Revenue for the years 2017, 2018, 2019, 2020 and 2021.

Step 1: Download the data file

a. Download the sample file Bike Sales_Visualizations_Lab.xlsx to your OneDrive. Open


the downloaded file in MS 365 Excel.

This workbook contains four worksheets that each will be used separately throughout this lab.

Step 2: Insert the line chart.

a. Select the Revenue and Profit by Year worksheet. The worksheet contains the profit
and the revenue totals for each of the years 2017, 2018, 2019, 2020, and 2021.
b. Select the data in the cells A3 through to C8.
c. From the Insert menu, expand the ribbon using the down arrow on the right side of the
ribbon, click the Line chart tool, then select Line with Markers (bottom left option)

This creates a line chart with an x-axis showing the years, and a y-axis dollar amounts.

Step 3: Format the chart.


a. To improve the clarity of the chart, change the vertical axis to display USD currency.
1. Right click on the chart and select Format. The Chart Format window pane
opens on the right of the worksheet.
2. Expand the options for the Vertical Axis.
3. In the Number Format section change Category to Currency and
change Decimal places to 0.
b. Add a chart title
1. In the Chart Format window pane, change the Chart Title option switch to the on
position if it is not already and expand the Chart Title options.
2. Change the Chart Title to “Revenue vs. Profits”.
3. Keep the Title Position at the default which is Above.
c. Change the Legend names to “Annual Profit” and “Annual Revenue”.
1. Select cell B3 and change the column name to Annual Profit.
2. Select cell C3 and change the column name to Annual Revenue.

The legend names at the bottom of the chart should change to match the column names.

d. Reposition the Legend to the right of the chart.


1. Right click on the chart to bring up the Chart Format window pane.
2. Expand the options for Legend.
3. Change the Position option to Right.
e. Add axis titles for both the vertical and horizontal axis..
1. If necessary, right click on the chart to bring up the Chart Format window pane.
2. Expand the Horizontal Axis options.
3. Scroll down to the Axis Title and move the switch to the on position.
4. Add an axis title of “Year”.
5. Expand the options for the Vertical Axis.
6. Scroll down to the Axis Title and move the switch to the on position.
7. Add an axis title of “US Dollars”.

The finished chart should appear as shown below.


Part 2: Creating a Column Chart

Step 1: Insert the Column Chart

a. Select the Product Revenue by Country worksheet. The worksheet contains the revenue
totals for each product category by country.
b. Select the data in the cells A3 through to E10.
c. From the Insert menu, click the Column chart tool, then select the Stacked
Column (middle option)

This creates a column chart with an x-axis showing the country, and a y-axis showing dollar
amounts.

Step 2: Format the chart.

a. Using the same methods used for the line chart in Part 1 perform the following formatting
changes to the chart.
1. Give the chart a title of “Product Revenue by Country”.
2. Change the vertical axis Number Format to Currency and the Decimal
Places to zero
3. Change the Position of the Legend to the Right.
4. Add a horizontal Axis Title of “Country”.
5. Add a vertical Axis Title of “US dollars”

Once completed the chart should appear as shown below.


Part 3: Creating a Pie Chart

Step 1: Insert the Pie Chart

a. Select the Revenue by Age Group worksheet. The worksheet contains the revenue totals
for each product category.
b. Select the data in the cells A3 through to B7.
c. From the Insert menu, click the Pie chart tool, then select the 2D- Pie (top option)

This creates a pie chart with each age group represented by an area on the cart representative of
the revenue for that group.

Step 2: Format the chart.

a. Using the same methods used previously for the line and column carts make the
following format changes:
1. Give the chart a title of “Revenue Comparison by Age Group”.
2. Change the Position of the Legend to the Right.
b. Add data labels to the chart area.
1. In the Chart Format window expand the options for Series “Total:
2. Expand the options for Data Labels.
3. Check the boxes for Category Name and Percentage.

Once completed the chart should appear as shown below.


Reflection Questions

Review each of the datasets given in this lab and select different options for each chart. Consider
if the visualization of the data is enhanced, or not, with these different options.

5.3.4 Practice Item


Question 1

Match the type of visualization to best use for the described scenario.

Question 1

That’s right.

Place the options in the following order:

used when you have a continuous set of data, the number of data points is high, and you would like to show a
line chart
trend in the data over time
column used when you want to display the value of a specific data point and compare that value across similar categories
chart (chart is positioned vertically)
bar chart best used when the names for each data point is long (chart is positioned horizontally)
used to show the composition of a static number (segments represent a percentage of that number amounting to a
pie chart
total sum of 100%)
scatter
used to show the distribution of a large number of data points and identifying outliers in the data
plot
5.4 Addressing Anomalies in Data
5.4.1 Discovering Anomalies Through Visualization
Press the Play button to watch the video.
5.4.2 Practice Item
Question 1
This Practice Item requires you to download the following file.
Download the file entitled "games purchased" before starting this question. Use the Microsoft
Excel file downloaded to create a scatter plot. What is the value of the outlier identified in the
scatter plot chart?

5.4.3 Outliers and Anomalies

Before data analysis can begin, considerable time must be spent cleaning the data. During the
data cleaning phase, you may find outliers, or anomalies, in the data. If so, they need to be
investigated so that the data can be corrected or the meaning of the outlying data point can be
understood An outlier is defined as a value or data point that varies significantly from others,
either much smaller or much greater. Sometimes outliers are mistakes and sometimes they
represent an important piece of information. In the figure, the data point at the extreme bottom
right is an outlier. All the other data points cluster along the trend line.
In the data analysis process, outliers that are the result of mistakes can lead to anomalies in the
results obtained, while outliers that are not errors can be very important to an analysis. This is
why investigating anomalies is a very important part of the data cleaning process—it ensures that
data can be analyzed effectively and generate accurate and valid results.
With small data sets it may be relatively easy to spot outliers by sorting or filtering the data. But
when it comes to large datasets and big data, other tools are required. Two common types of data
visualization used to find outliers are scatter plots and box plots.

5.4.4 Practice Item

Question 1

What values are considered as outliers in a given data set? values that deviate far from the
expected valuesThat’s right. Outliers are values that deviate so far from expected values
that they could distort the results of the analysis. These observations are frequently
removed from the data set after careful consideration.

5.4.5 Lab - Interpret Visualizations with Respect to Outliers

In this lab, charts and functions will be used to detect data outliers.

Part 1: Examine a Dataset for Outliers

Lab – Interpret Visualizations with Respect to Outliers

Part 1: Examine a Dataset for Outliers

An outlier is a value or data point that varies significantly from others in the same dataset.
An outlier can result from variability in the measurements, experimental errors, or human
error in entering the data.
To make sure that any data analysis is correct, outliers need to be identified and then it
needs to be determined how best to treat them.
Instructions
Part 1: Examine a Dataset for Outliers
Step 1: Open the data set.
a. Download the file Bike Sales_Outlier_Lab.xlsx
b. Upload the file to your OneDrive and open it in MS 365 Excel online.
Step 2: Use a Pivot Table to Select Data for Analysis
a. Click any cell in the Bike Sales worksheet.
b. Insert a pivot table by clicking Insert > PivotTable. Check that New Worksheet is
selected in the Create PivotTable dialog box and click OK.
This adds a new worksheet for the pivot table.
c. In the PivotTable Fields Dialog box check the Date and Order_Quantity fields.
The pivot table is created with two columns Date and Sum of Order_Quantity.
Step 3: Sorting Data to Find Outliers
One way to identify outliers is by just sorting the data. This method works with small data
sets where the data is easily scanned.
a. Sort the Sum of Qrder_Quantity column from high to low
1. Select the data points in the Sum of Order_Quantity column. (Do not select
the Grant Total or the column header).
2. Click Sort & Filter > Sort Descending.
This sorts the Order_Quantity data points from highest to lowest.

Which December date had the largest sales quantity? What was the sales quantity?

Review the data in the Bike Sales worksheet for December 19th. Which entry contributes
most to the Sum of Order_Quantity in the pivot table? In other words, which order
number is most responsible for the outlier?

Step 4: Use a Scatter Chart to Find Outliers

A scatter chart can help to identify outliers, especially in larger datasets.

a. Return to the worksheet containing the pivot table (Sheet1).


b. Copy and paste the data from the pivot table into two blank columns (D and E).
Copy the header row with the data, but do not copy the Grand Total row.

Excel will not allow creation of a scatter plot from data in a pivot table. So, the data must
be moved to other columns.

c. Insert scatter plot.


1. Select the all cells in the copied data and use Sort & Filter to sort it
ascending.
2. Highlight the Sum of Order_Quantity column in the copied data.
3. Click on Insert > Scatter and then select the top left scatter plot in the
dropdown list.

Note that the visual of the scatter chart makes the sales for December 19 th easily stand out
as an outlier from the other order quantity datapoints as shown below.

4. Delete the scatter plot.

Step 5: Using the LARGE and SMALL Functions to Find Outliers.

If there is a lot of data the LARGE and SMALL functions can be used to extract the largest
and smallest values which can help to see if there are any outliers.

For this example, the Date column is column D and the Sum of Order_Quantity column is
column E. The columns in your worksheet may be different so adjust your function cells
references accordingly.

a. In an empty cell enter the function =LARGE($E$4:$E27,1).


This function looks at the entries from cell E4 through E27 and returns the highest value.

What value was returned?

Answer: 43

b. To get the highest 5 values, modify the functions to =LARGE($E$4:$E27,


ROW($1:5)).

This returns the highest five values. To return more values change the “5” at the end of the
function to number of values you would like returned.

What function would return the lowest 6 values?

Answer: =SMALL($E$4:$E27, ROW($1:6))

Once outliers are identified, the next challenge is what to do with them. Outliers may
indicate errors in the data, or may be valid data that needs to be investigated as to why it
appears to be an anomaly. There are a couple of ways in which a data analyst can deal with
outliers.

1. Delete them. In a large dataset deleting a few outliers will likely not impact the
overall analysis. However, it is important to create a copy of the data so you can
research what was causing the outliers in the first place. In this example, row 72 in
the Bike Sales dataset could be deleted.

2. Normalize them (Adjust their value). The value of the outliers is changed to be
slightly above the maximum value in the dataset. This is a good method if it will not
skew the data. There are a number of statistical methods to normalize data.
Research the various methods before randomly adjusting data values. In the
example Bike Sales dataset, the December 19 th Order_Quantity could be changed
from 43 to 20 so it is just above the maximum value of 19.

Reflection Questions

List the factors that could determine whether data outliers should or should not be
considered in the final analysis of a dataset.

Answers vary. It may include dataset size, sample size of the outlier, purpose of the
analysis, significance of the outlier value with respect to the majority of the data.

5.4.6 Practice Item

Question 1

Refer to the exhibit. What inference can be drawn from the results displayed in the graph?
Ans: there are outliers in the data that could skew the results

5.5 Using Excel to Address Issues with Data

5.5.1 Introducing VLOOKUP

Press the Play button to watch the video.

5.5.2 Practice Item

Question 1

Look up the polygon name given a number of sides.


Use the number entered in E2.

Note: Use either VLOOKUP( ) or XLOOKUP( ).

ABCDEFGHIJK
1
2
3
4
5
6
7
8
9
1
0
1
1
1
2
Sides Name
3 triangle Enter number (3-20): 3
4 quadrilateral Polygon name: triangle
5 pentagon
6 hexagon
7 heptagon
8 octagon
9 nonagon
10 Decagon
11 hendecagon
12 dodecagon
triskaidekago
13
n

Total Countries

Calculation Mode: Automatic

5.5.3 What You Can Do with VLOOKUP

VLOOKUP is a very powerful data analysis tool within Excel and is great when you need
to find information in a large spreadsheet or if you are consistently looking for the same
type of information.

VLOOKUP is an abbreviation of “vertical lookup,” and it’s a function that searches a


(vertical) column in a table for a specified value. This means that the data must be
organized in a table where each row has different but related forms of data in each column.
If an approximate match is specified in the formula, the first column (the lookup column)
must be sorted in numeric or alphabetic order.

A VLOOKUP function consists of 4 key pieces of information:

1. The value to search for


2. The range to search in
3. The column in the range that contains the value you want the function to return
4. An indication of whether the function should return an approximate match (TRUE,
in the function) or only an exact match (FALSE) of the return value. The default for
VLOOKUP is an approximate match if FALSE is not specified in the function.

VLOOKUP searches for a value in the leftmost column of a table and, when the value is
found, returns information from the same row but in another column.

XLOOKUP, an alternative to VLOOKUP

XLOOKUP is a newer lookup function, similar to VLOOKUP, that is not available in all
versions of Excel currently in use. With XLOOKUP, you can look in any column (not only
the leftmost in a table) for a search term and return a result from the same row. One
difference is that XLOOKUP defaults to returning an exact match, whereas VLOOKUP
defaults to closest match unless the FALSE keyword is used. In this course, you may use
either VLOOKUP or XLOOKUP to obtain the desired results if they are both available in
the spreadsheet tool you are using.

Note: XLOOKUP is not backward compatible, so worksheets using XLOOKUP may not be
usable in earlier versions of Excel.

VLOOKUP Example: Look Up a Typed Value

In the example below, a VLOOKUP formula (shown in the formula bar at the top) has
been entered into cell G3 that will return the budget of a movie when a movie title is typed
into cell G2. The formula =VLOOKUP(G2,A1:D11,3,FALSE) tells Excel to search the
range of cells A1 through D11 for the row with the information typed into G2 and, if it
finds a match, to return the value in column 3 for that row. The FALSE tells Excel to look
for an exact match between the title and the budget. In this screenshot, the Budget
cell G3 has an #N/A error message, because nothing has been entered into G2 yet.

After entering a movie title, as shown in the figure below, the formula returns the budget
for that movie.
VLOOKUP Example: Data Cleaning

VLOOKUP can also be used for data cleaning. For example, you can use it to compare two
columns (or lists) and find values that appear in both.

In the example spreadsheet below, “Good” superheroes are listed in column A and “Evil”
villains are listed in Column B. There should be no duplicates between these two columns,
because you can’t be both at the same time, and the VLOOKUP function can be used to
find any names listed in both columns. There are multiple ways to do this using
VLOOKUP. In this example, for each villain, the cell to the right of the villain’s name will
hold a VLOOKUP function to indicate whether he also appears in the “Good” column.

The formula to search for Abomination in Column A is written in


cell C2 as =VLOOKUP(B2,$A$2:$A$10,1,FALSE). This compares the value in B2 with the
range A2:A10. The $ symbols, as you may recall, tell Excel to reference
cells A2 through A10, even when we copy the formula to other cells. If there is no duplicate
(meaning Abomination is not found in the list of superheroes), then an error (#N/A) is
displayed, as shown.
But if we replace Abomination’s name in the Evil list with Agent 13, as below, VLOOKUP
will find Agent 13 also listed as Good in cell A7 and the duplicate name is shown in cell C2.

The formula can be copied to the other cells in Column C to check all the entries in each
column. Because of the $ symbols, it will still look in cells A2 through A10 for the names.
Customizing Results

With a small addition, VLOOKUP can be written to display custom text to reflect whether
an entry appears in both columns or not, like so:

=IF(ISNA(VLOOKUP(B2,$A$2:$A$10,1,FALSE)),"Unique","Duplicate")

Above, the formula is modified by adding the IF and ISNA functions, and the
values Unique and Duplicate. The IF function makes logical comparisons, and the ISNA
function looks for cells containing the #N/A error message. Together, they tell Excel to
return “Unique” if the VLOOKUP function finds no duplicate (and so returns an error)
and to return “Duplicate” otherwise. The formula is copied to the other cells to give the
results below.
In the next lab, you will have an opportunity to practice using VLOOKUP.

5.5.4 Practice Item

Question 1

How is the Microsoft Excel VLOOKUP tool used in data analysis?

Ans: to find specific information in a large spreadsheet

That’s right.

VLOOKUP is a very powerful data analysis tool in Microsoft Excel that is used to find
information in a large spreadsheet. VLOOKUP is a vertical lookup function, so the data
needs to be organized in a table where each row has different but related forms of data in
each column.

5.5.5 Lab - Using VLOOKUP in Data Analysis

In this lab, you will complete the following objectives:

 Part 1: Initial Examination of a Large Dataset


 Part 2: Apply the VLOOKUP Function

Lab - Using VLOOKUP in Data Analysis

Objectives

In this lab, you will use the VLOOKUP function in Microsoft Excel to:

Part 1: Initial Examination of a Large Dataset


Part 2: Apply the VLOOKUP Function

Background / Scenario

Searching for a single data record from a very large dataset for analysis usually requires
the application of specific search functions to facilitate locating and extracting the
information required. In this lab, the Excel VLOOKUP function will be used to locate
information from a large dataset.

Required Resources

 Mobile device or PC/laptop with a browser, MS 365 Excel online, and an Internet
connection

Note: The precise steps to format and combine or split data in Excel can vary between
platforms and versions. The instructions in this lab are based on the free version of Excel
available from Office.com and may have to be modified to match the platform or version
used to achieve the results shown in this lab.

Instructions

Part 1: Initial Examination of a Large Dataset

Step 1: Download the data file.

a. Download the sample workbook file Bike Sales_VLOOKUP.xlsx and save it to your
OneDrive.
b. Open the downloaded file in MS 365 Excel online. Adjust the column widths such
that the data is visible.
c. Freeze the top (Heading) row so it remains visible as you scroll through the many
thousands of entries.
1. Click View > Freeze Panes and select Freeze Top Row.

Step 2: Examine the data.

Scroll through and examine the data in the worksheet. The worksheet contains bikes sales
data for the 4th quarter of 2021 which is 752 rows of data.

As can be seen, even if the data is sorted by Sales_Order # or Date, the size of the dataset
means that manually searching for information about a specific sale would be time
consuming and perhaps prone to error.

To facilitate searching, this lab will use VLOOKUP to display the Sales_Order
Number and Product of a particular sale.
An important requirement of VLOOKUP is that the reference data must be in the leftmost
(first) column of the dataset in Excel. In this lab, Sales_Order# is being used as the search
reference and must be the first column (Column A in this spreadsheet).

Part 2: Apply the VLOOKUP Function

Step 1: Select result display area

For usability and consistency, especially with large datasets, it is best practice to select a
blank area of the worksheet, or a new blank worksheet, in which to enter the search
criteria and display the result.

This entry and display area may be on another worksheet within the spreadsheet; or may
be enabled by using macros and forms. For this lab, the entry and display area will be on
the same worksheet adjacent to the data.

The values to be used are Sales_Order#, Product, and Order_Quantity.

a. In cell U3, enter Sales_Order # =


b. In cell U4, enter Product =
c. In cell U5, enter Order_Quantity =

Step 2: Create the VLOOKUP functions

a. In cell V4 enter =VLOOKUP(V3, A2:S753 ,13, FALSE)


b. In cell V4 enter =VLOOKUP(V3, A2:S753 ,14, FALSE)

Where:

V3 is the cell where the reference (lookup) value, the sales order number, is entered.

A2:S753 is full range of data across all columns and rows that will be searched.

13 is the index (column number) with respect to the reference column; Product is the
13th column to the right of the Sales_Order# column, column A.

14 is the index (column number) with respect to the reference column; Order_Quantity is
the 14th column to the right of the Sales_Order# column, column A.

FALSE denotes an exact match for the movie title must be found.

Step 3: Test the VLOOKUP functions

a. Enter a Sales_Order# in cell V3 and observe the


displayed Product and Order_Quantiy in cells V4 and V5.
1. If you enter Sales_Order# 000261274 the returned results should be “Road-
650 Red 44” for the Product and “2” for the Order_Quantity.

If Excel is displaying 261274 in cell V3, enter '000261274 into cell V3. The single quote (')
forces the leading zeros in the sales order number to be included in the lookup.

Step 4:

a. Test the VLOOKUP functions with other sales order numbers.

Step 5: Improve the VLOOKUP function

If a sales order number is entered that is not found in the dataset, then #N/A is displayed.

a. In cell V3 enter an invalid sales order number such as 12345

The result in cells V4 and V5 is #N/A.

This result is not very helpful to the user. A solution is to modify each VLOOKUP function.

b. Change cell M4 to = IFNA(VLOOKUP(V3, A2:S753 ,13, FALSE), "Not Found")


c. Change cell M5 to = IFNA(VLOOKUP(V3, A2:S753 ,14, FALSE), "Not Found")

The VLOOKUP function is enclosed within a IFNA function, which if TRUE displays the
given text.

An invalid sales order number should now return “Not Found”.

d. Try other valid and invalid order numbers to test the VLOOKUP function.
Reflection Questions

Consider searching the dataset for other information relating to a particular movie.

1. Write a VLOOKUP function to retrieve the Cost of a particular order.

ANS: = IFNA(VLOOKUP(V3, A2:S753 ,18, FALSE), "Not Found")

2 Write a VLOOKUP function to retrieve the Revenue of a particular order.

ANS: = IFNA(VLOOKUP(V3, A2:S753 ,19, FALSE), "Not Found")

Challenge Activity

Explore the possibilities of using VLOOKUP on other datasets and how useful this would
be in analyzing the data. Consider what would be the reference data and the structure of
the VLOOKUP function to achieve the required analysis.

5.5.6 Practice Item

Question 1

Look up the number of days for the month entered.


Use LEFT(E2,3) to get the first 3 letters of the month.

Note: Use either VLOOKUP( ) or XLOOKUP( ).

ABCDEFGHIJKL
1
2
3
4
5
6
7
8
9
1
0
1
1
1
2
Mont Days
h
JAN 31 Enter month: September
Number of
FEB 28 30 =VLOOKUP(LEFT(E2,3),A:B,2,FALSE)
days:
MAR 31
APR 30
MAY 31
JUN 30
JUL 31
AUG 31
SEP 30
OCT 31
NOV 30

Month Days

Calculation Mode: Automatic

5.6 Analyze the Data Using Statistics Summary


5.6.1 What Did I Learn in This Module?
In this module, you learned that statistics can help the data analyst to interpret data
correctly, identify patterns and trends in the data, and convert them into meaningful
information.
Use the statistics to Interpret Data:

Topic Objective: Describe the different types of statistics.


Statistics is a tool used by data analysts to help analyze large quantities of data and to
identify patterns and trends in that data.
There are two key branches of statistics: descriptive statistics, used to describe or
summarize the values and observations of a data set, and inferential statistics, used to make
generalizations or predictions about a population.

Choosing the Right Visualizations for the Job:


Topic Objective: Select data visualizations to best explain analysis results.
Three considerations when choosing the best visualization are how many variables that
need to be shown, how many datapoints are in each variable, and if data over time or
comparisons needs to be shown.
Line charts are good for continuous data and showing trends over time, while column and
bar charts display comparisons of specific data points across similar categories. Pie charts
are good for showing the percentages of a total, and scatter plots can show the distribution
of many data points and are useful for identifying outliers in data.
Topic Objective: Create visualizations with Excel.
Creating visualization with Excel is easily accomplished by selecting the data to be visualized
and then selecting the desired chart type from the list of available charts. Excel has many options
for customizing charts such as adding a title, axis labels, legends, gridlines, and data labels.
Addressing Animalities’ in Dataset
Topic Objective: Interpret visualizations.
Data must be cleaned before data analysis can begin. Outliers are data points that vary
significantly from the other data points. If found in the data, they need to be investigated and
verified or removed so they do not negatively impact the accuracy of the analysis. In small
datasets, outliers can be identified by visually scanning sorted and filtered data. In larger
datasets, visualization tools such as scatter plots and box plots are often used.

Using Excel to Address issues with Data:


Topic Objective: Use VLOOKUP in Excel to identify and fix issues.
VLOOKUP is a built-in function in Excel that performs a vertical search for one piece of
information in a table and then extracts a specified corresponding piece of information.
VLOOKUP is useful for finding information in large spreadsheets. It is also useful for cleaning
data. With VLOOKUP, a data analyst can compare the data values in two columns to identify
duplicate values.

5.6.2 Analyze the Data Using Statistics Quiz

Question 1
What are inferential data sets? They are data sets gathered from a representative sample to make
generalizations or predictions about a population.
Question 2
How is the Microsoft Excel VLOOKUP tool used in data analysis?
Question 3
What can a data analyst do if they wanted to remove duplicate values in a Microsoft Excel
spreadsheet?
Question 4
Which three key pieces of information are required to perform a VLOOKUP function in
Microsoft Excel? (Choose three.)
1. the column number in the range that contains the return value
2. the range where the value is located
3. the lookup value
Question 5
A data analyst wants to find data point values that are significantly different from others in a data
set. What are these values called?
Ans: outliers
Question 6
Why are different sampling techniques used to gather inferential statistical data?
ANS: to reduce error and increase confidence in the generalizations about the findings
Question 7
Which type of inferential and machine learning analysis is used to find groups of observations
that are similar to each other?
ANS: cluster
Question 8
Why is regression analytics used in the inferential and machine learning analyses of big data?
Ans: It is used to quantify the relationship, if any, between the variations of one or more
variables.
Question 9
A data analyst wants to display the various segments of a country’s energy sources (e.g., oil,
coal, gas, solar, wind) contributing to 100% of its energy sources in a visual format. What type
of chart would be best used to accomplish this?
Ans: pie chart
Question 10
A data analyst wants to display outliers in the data set. Which type of visual representation
would best suit this task?
ANS: scatter plot

END OF MODULES 4 AND 5

You might also like