Data Analysis
4.0.1 Introduction
The fourth step in the Data Analytics Process is preparing data for analysis.
Datasets often contain blank rows or columns, data in the wrong formats, or
corrupt field entries. All of these things can make the results of any analysis
incorrect, so these datasets need to be cleaned and transformed prior to
analysis. In this module, you will dive deep into analyzing data with Excel.
You will use Excel to import, clean and prepare the data for the
visualizations.
Background / Scenario
Sorting and filtering the data facilitates data analysis through grouping and including or
excluding specific data. This ensures consistency, accuracy, and better decision-making
processes.
Instructions
Part 1: Download and Prepare a Sample Data Set
a. Open the downloaded sample CSV file Bike Sales_Sort_Lab 4.1.2.csv in Excel.
b. To make all the data visible and more readable, enlarge the column widths to fit:
1) Select the entire worksheet by clicking the top left corner. (The triangle left of column
A.)
2) Move the cursor to the line between two columns, for example, columns A and B,
click and drag the mouse to the right until all data is visible.
c. All columns are now wider and the same width. Adjust each column width so the column
is only wide enough to contain the data.
d. Note that text data is left justified in each cell, whereas numeric data is right justified.
Step 2: Freeze top row.
Because there are many rows of data, column headings will disappear when you scroll
down the worksheet. You can freeze the top row so that this information is always visible.
a. Click View > Freeze Panes > Select Freeze Top Row.
b. Now if you scroll down, the heading for each column is still displayed.
Note: The data has been changed for readability only, and it has not been sorted or manipulated.
Because this is a text (CSV) file, any changes to presentation of the data (such as column width or
freezing the top row) will not be retained unless the file is saved in Excel spreadsheet format.
Question: You have selected a range of cells in Microsoft Excel and would
like to find duplicate values in the spreadsheet. Which option on the Home >
Conditional Formatting menu should you use to accomplish this?
Answer: Use Highlight Cell Rules. The Conditional Formatting options are:
- Highlight Cell Rules can be used to find and highlight duplicate data.
- Data Bars are horizontal bars added to each cell, much like a bar
graph.
- Top/Bottom Rules can be used to highlight cells that are above
average.
- Color Scales change the color of each cell based on its value. Each
color scale uses a two- or three-color gradient. Color Scales can be
used to create a heat map. A heat map is a visual representation of
data where the color represents the value in a cell. For example, you
can create a heat map where the cell with the highest value is colored
green and the color shifts towards red as the value decreases.
- Icon Sets are built-in icons that can be added to cells in a range to
indicate how large the cell values are compared to the other values
in the range.
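The idea behind highlighting duplicates can be sketched outside Excel as well. The following is a minimal Python sketch, not Excel's own implementation; the function name find_duplicates and the sample order numbers are assumptions made for illustration.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    seen = set()
    duplicates = []
    for value in values:
        if counts[value] > 1 and value not in seen:
            seen.add(value)
            duplicates.append(value)
    return duplicates


# Hypothetical order numbers standing in for a selected range of cells.
order_numbers = ["SO-1001", "SO-1002", "SO-1001", "SO-1003", "SO-1002"]
print(find_duplicates(order_numbers))
```

Excel highlights every cell that holds a duplicated value, whereas this sketch only reports which values are duplicated.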
Part 2: Sort the Data
In this part of the lab, you will sort the data by country and sub-category.
Step 1: Select the Sort tool.
a. Press Ctrl+A to select all the data.
b. Click Home > click Sort & Filter > select Custom Sort.
Step 2: Select the Sort criteria.
a. Verify that the My data has headers checkbox is selected.
b. Select Country as the Column for the Sort by dropdown list.
Select Sort Ascending as the order.
c. To sort by Sub-category, click + Add Level and
select Sub_Category as the column. For Order, select Sort
Ascending from the drop down.
d. Click OK to continue. Note that the data is now sorted in ascending
order by Country and then Sub-category.
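The two-level sort above can be sketched in Python for comparison. This is an illustration only; the rows are hypothetical and the real lab file has many more columns. Only the Country and Sub_Category column names come from the lab.

```python
from operator import itemgetter

# Hypothetical rows standing in for the bike sales data.
rows = [
    {"Country": "Canada", "Sub_Category": "Road Bikes"},
    {"Country": "Australia", "Sub_Category": "Touring Bikes"},
    {"Country": "Canada", "Sub_Category": "Mountain Bikes"},
    {"Country": "Australia", "Sub_Category": "Mountain Bikes"},
]

# Sort by Country first, then by Sub_Category within each country,
# both ascending, like the two levels in the Custom Sort dialog.
sorted_rows = sorted(rows, key=itemgetter("Country", "Sub_Category"))

for row in sorted_rows:
    print(row["Country"], "|", row["Sub_Category"])
```

The order of the keys matters: the first key plays the role of the "Sort by" level and the second plays the role of the added level. Python compares text case-sensitively, while Excel's default sort ignores case, so results can differ on mixed-case data.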
To clear sorting and filtering, select the column being sorted
(Sub_Category, in this case). Then click on the Sort & Filter button in the
toolbar and choose Clear.
If you need to remove all filters in a worksheet, you can select the Sort &
Filter tools and then click on the Filter option in the drop-down menu.
Part 3: Filter the Data
In this part, you will convert the data into a table and filter the data.
Step 1: Convert data to a table.
a. Before filtering the data information, it is best to convert it to a table.
Use Ctrl+A to select all the data.
b. To convert the data, click Home > select Format as Table.
c. Select a table style of your choosing. Ensure that the checkbox My
table has headers is selected. Click OK to accept the range of cells
to be in the table.
d. Under Table Name, replace Table1 with BikeSales as the table name.
Step 2: Set and display filtered data.
a. Note the drop-down arrows at the top of each column of the table. By
clicking the drop down on each column, you can filter the data to view
only rows that match those specific values.
For example, click the drop-down arrow on the Age_Group column.
Unselect the (Select all) checkbox and select Young Adults (25-34).
Note the row numbers on the left of the table. The data has not been
moved; filtering is displaying only those rows that you chose.
Note the filtered column, Age_Group, shows a Filter icon instead of
the dropdown arrow to indicate the data is being filtered by that
column.
b. To show all data in a column, click the Age_Group drop-down button
again and check the (Select all) checkbox.
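Filtering hides rows without moving them, which is why the original row numbers stay visible. A minimal Python sketch of that behavior follows; the function name filter_rows and the sample rows are assumptions for illustration.

```python
def filter_rows(rows, column, allowed_values):
    """Return (row_number, row) pairs whose column value is allowed.

    Row numbers start at 2 because worksheet row 1 holds the headers.
    """
    return [
        (number, row)
        for number, row in enumerate(rows, start=2)
        if row[column] in allowed_values
    ]


# Hypothetical rows; only the Age_Group column matters here.
rows = [
    {"Age_Group": "Youth (<25)"},
    {"Age_Group": "Young Adults (25-34)"},
    {"Age_Group": "Adults (35-64)"},
    {"Age_Group": "Young Adults (25-34)"},
]

visible = filter_rows(rows, "Age_Group", {"Young Adults (25-34)"})
print([number for number, _ in visible])
```

The source list is left unchanged, so passing every value in allowed_values corresponds to checking (Select all) again.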
Reflection Questions
Use data sorting and filtering on relevant columns to determine the
following:
1. How many Senior customers were there in December 2021?
Background / Scenario
To enable effective analysis, data points sometimes need to be split (separated), or
combined, or formatted into an appropriate type. The presentation of data can also be
conditionally formatted to highlight certain values. Manipulating the data in these ways
facilitates more meaningful and accurate analysis.
Instructions
Part 1: Combine Data in an Excel Spreadsheet
Step 1: Examine the data.
a. Open the downloaded sample CSV file Bike
Sales_Manipulate_Lab 4.2.7.csv in Excel.
b. Review the columns of data and the values they contain.
Sometimes data in a worksheet must be manipulated to
perform the desired analysis or to produce a specific result or
output.
For a specific analysis, the size information in Column M needs
to be combined with the product description in Column L so
that the model, color, and size are all in one column.
Reflection Question:
Answer Area : There are multiple answers to this question, but one
example is “to show an audience how outcomes will change for
different scenarios.”
Functions
Built-in functions are formulas that start with a keyword that
identifies a specific function to be performed. Most function
keywords describe the calculation that the function will perform.
Correct syntax varies by function, but usually has a similar pattern.
For example, the syntax for the AVERAGE function accepts a range
or list of cells to calculate the average of the values within the
range, such as =AVERAGE(L2:L37) or =AVERAGE(L2, L5, L6).
Note: You must make sure to close all parentheses and brackets, or
you will receive an error message when you press Enter to complete
the entry. For long functions this can be difficult.
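The two AVERAGE forms described above, a range and a list of individual cells, can be compared in a short Python sketch. The cell values and the helper name cell are assumptions, and the range is shortened to L2:L7 to keep the example small.

```python
from statistics import mean

# Hypothetical values for cells L2 through L7 (index 0 holds L2).
column_l = [10, 20, 30, 40, 50, 60]


def cell(row_number):
    """Return the value in column L at a worksheet row (row 2 is first)."""
    return column_l[row_number - 2]


range_average = mean(column_l)                    # like =AVERAGE(L2:L7)
list_average = mean([cell(2), cell(5), cell(6)])  # like =AVERAGE(L2, L5, L6)

print(range_average, list_average)
```

The range form averages every cell between the two endpoints, while the list form averages only the cells that are named.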
When you start a cell entry with an equal sign, as you type the
function name, a menu of functions matching your entry will usually
appear and show the short description of each function listed. If this
doesn’t happen automatically, you can also right-click on a cell and
enter the equals sign in the text field that says “Search the menus”
to show this menu. Selecting the function that you want places the
function name and an open parenthesis on the formula bar and in
the cell.
Background / Scenario
Excel has eleven different categories of built-in functions. Many of the functions used by
data analysts belong to the Text and Statistical function categories. The Text function
category includes CONCAT, LEN, LEFT, RIGHT, and MID. The Statistical function
category includes COUNT, COUNTBLANK, COUNTA, COUNTIF, AVERAGEIF,
MINIFS, and MAXIFS. This lab will work through the purpose and usage of each of
these functions.
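As a rough orientation before working through the lab, the counting functions named above can be approximated in Python. This is a simplified sketch, not Excel's exact behavior: None stands for an empty cell, the sample column is hypothetical, and the COUNTIF condition is passed as a Python function rather than a criteria string.

```python
# Hypothetical column of cells; None marks an empty cell.
cells = [12, "n/a", None, 7, 30, None, "sold", 7]


def count(cells):
    """COUNT: cells that contain numbers."""
    return sum(isinstance(c, (int, float)) for c in cells)


def counta(cells):
    """COUNTA: cells that are not empty."""
    return sum(c is not None for c in cells)


def countblank(cells):
    """COUNTBLANK: cells that are empty."""
    return sum(c is None for c in cells)


def countif(cells, predicate):
    """COUNTIF: non-empty cells that satisfy a condition."""
    return sum(1 for c in cells if c is not None and predicate(c))


print(count(cells), counta(cells), countblank(cells))
print(countif(cells, lambda c: c == 7))
```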
Instructions
The CONCAT function, short for “concatenate,” joins strings of text together. The
function can include actual text strings enclosed in quotation marks as well as cell and
range references. It also allows you to combine different types of data like numbers,
dates, and text strings into a single cell. One use case for the CONCAT function is to
combine different pieces of important information for a human to be able to read at a
glance, instead of going to multiple columns to find this information. In this step we will
use the CONCAT function with our bike sales data to simulate this use case, and
combine the sales order number, order quantity, product subcategory, and date into a
new “Sales_Summary” column.
Excel may display a description of the function, “Concatenates a list or range of text
strings,” below the formula bar.
Excel may display the syntax of the function, or how it must be written, below the
formula bar.
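The Sales_Summary use case can be sketched in Python to show what concatenation does with mixed data types. The helper name concat, the column names, the separators, and the sample values are all assumptions for illustration and are not taken from the lab file.

```python
def concat(*parts):
    """Join values of any type into one string, similar to CONCAT."""
    return "".join(str(part) for part in parts)


# Hypothetical row from the bike sales data.
row = {
    "Sales_Order": "000261695",
    "Order_Quantity": 4,
    "Sub_Category": "Mountain Bikes",
    "Date": "2021-12-01",
}

sales_summary = concat(
    row["Sales_Order"], " | ",
    row["Order_Quantity"], " x ",
    row["Sub_Category"], " | ",
    row["Date"],
)
print(sales_summary)
```

The literal separators play the same role as quoted text strings in a CONCAT formula. In Excel, a date cell passed to CONCAT typically appears as its serial number unless it is wrapped in TEXT with a format code.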
There are multiple ways to correct this issue; one is to modify the
formulas in these cells based on the length of the model names and
colors. First, we will apply a filter to Column M to display only
specific bike models, and then adjust the functions for those cells to
the matching number of characters.
h. Highlight Column M, and then click Sort & Filter in the Editing
menu group and select Filter in the dropdown menu.
i. Click the filter down arrow in the Column M header and click
“Select All” to uncheck all the boxes.
j. Then check the boxes next to entries that are silver and that
don’t have the “–W” in the model name. This will be a total of
6 boxes. Then, click Apply.
Notice these are all missing the “r” in the color silver. To fix this,
the MID function for these rows needs to return 6 characters after
character 14, rather than 5.
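The effect of changing the character count can be checked with a small Python stand-in for MID. This sketch assumes the color starts at character 14 of the product text, and the sample product string is an assumed value, not one copied from the lab file.

```python
def mid(text, start_num, num_chars):
    """Return num_chars characters of text from 1-based position start_num."""
    start = start_num - 1  # Excel positions start at 1, Python at 0
    return text[start:start + num_chars]


# Assumed sample value: the color begins at character 14.
product = "Mountain-200 Silver, 38"

print(mid(product, 14, 5))  # Silve
print(mid(product, 14, 6))  # Silver
```

Five characters are enough for a five-letter color such as Black, which is why only the rows with the six-letter color Silver lose their last letter.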
Question: What function could you use if you wanted to add up the
revenue from bike sales limited to just Australia? (Hint: this
function adds cell values in a range based on a given criterion.) Use
the “Insert Function” button under the Formulas tab on the toolbar
to look through available functions and see if you can find it.
Answer: The SUMIF function lets you add values that meet a specific
criterion.
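The logic of SUMIF can be sketched in Python as follows. The function name sumif mirrors the Excel function but is a simplified stand-in that only supports exact matches, and the country and revenue values are hypothetical.

```python
def sumif(criteria_range, criterion, sum_range):
    """Add the values in sum_range whose paired criteria cell matches."""
    return sum(
        value
        for key, value in zip(criteria_range, sum_range)
        if key == criterion
    )


# Hypothetical columns: one country and one revenue figure per row.
countries = ["Australia", "Canada", "Australia", "France"]
revenue = [950, 1200, 400, 700]

australia_revenue = sumif(countries, "Australia", revenue)
print(australia_revenue)  # 1350
```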
There are many more functions that are useful to data analysts for preparing, cleaning,
and searching datasets. Explore the various function categories in Excel and
experiment by applying them to a dataset. Some specific functions to try are those that
convert text case, such as LOWER, PROPER, and UPPER, and those that return
information, such as LEN, FIND, and SEARCH.
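For a quick feel of what these functions return, here is a rough Python approximation. The helpers find and search are simplified stand-ins: Excel's SEARCH also supports wildcards, and both Excel functions return an error rather than None when the text is not found. The sample string is hypothetical.

```python
def find(find_text, within_text):
    """Case-sensitive 1-based position, or None when absent (like FIND)."""
    index = within_text.find(find_text)
    return index + 1 if index >= 0 else None


def search(find_text, within_text):
    """Case-insensitive 1-based position, or None when absent (like SEARCH)."""
    index = within_text.lower().find(find_text.lower())
    return index + 1 if index >= 0 else None


product = "mountain-200 SILVER"  # hypothetical cell value

print(product.lower())            # LOWER
print(product.upper())            # UPPER
print(product.title())            # close to PROPER
print(len(product))               # LEN
print(find("silver", product))    # None: FIND is case-sensitive
print(search("silver", product))  # 14
```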
PRACTICE ITEM
Question: In E2, calculate the average mass for apples. Note: Use the
AVERAGEIF( range, criteria, avg_range ) function.
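The three arguments of AVERAGEIF map onto a short Python sketch. The worksheet for this practice item is not shown here, so the fruit names and masses below are assumed values, and the function only supports exact matches.

```python
def averageif(criteria_range, criterion, average_range):
    """Average the values in average_range whose paired cell matches."""
    matches = [
        value
        for key, value in zip(criteria_range, average_range)
        if key == criterion
    ]
    if not matches:
        # Excel reports a #DIV/0! error when nothing matches.
        raise ValueError("no cells match the criterion")
    return sum(matches) / len(matches)


# Hypothetical columns: fruit name and mass in grams.
fruits = ["apple", "pear", "apple", "banana", "apple"]
mass_g = [150, 180, 170, 120, 160]

print(averageif(fruits, "apple", mass_g))  # 160.0
```

Here range is the column tested against the criterion, criteria is the value to match, and avg_range is the column whose values are averaged.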
Pivot tables are a feature in Excel that is invaluable to data analysts. Pivot tables
provide a way to automatically summarize, analyze, explore, and present data. Pivot
charts enable you to add visualizations to the data in a pivot table. Using these built-in
tools you can identify trends, make comparisons between data items, and create charts
in different styles to visualize your data. You’ll be creating a Pivot Table in the upcoming
lab, but for reference, here are the steps to create a pivot table in Excel:
1. Start with a worksheet that is organized in columns and rows, with column
headers.
2. Select the table or range of cells that you want to be included in your Pivot table.
Be sure to include the column headers in the range.
3. Choose Insert from the menu bar. In the Tables section, you can view various
formats that your data can take by selecting the Recommended Pivot tables
choice. If you don’t see a recommended table that meets your requirements,
select Pivot table > From Table or Range.
4. The Create Pivot table dialog box opens. Your selected range should show in the
Table/Range box, and New Worksheet should be selected. When creating your
first Pivot tables, it is best to have them placed on a new worksheet tab.
5. A blank Pivot table worksheet is created and the Pivot table Fields list containing
your column headings appears. Click the fields that you want to add.
6. The Pivot table wizard will place your fields in one of the boxes at the bottom,
Filters, Columns, Rows, and Values. You can drag and drop your fields into
different categories to change the way that your Pivot table summarizes the data.
7. The Pivot table will appear with your selections. Pivot tables support most of the
functions that you can do on a normal spreadsheet, such as sorting, filtering, and
cell formatting.
Updating data in the original spreadsheet does not update the pivot table automatically;
you must refresh your Pivot table to have it reflect new data.
4.3.7 Practice Item
Question 1
What should be done first when creating a Microsoft Excel Pivot Table?
Answer: Create or select data that needs to be analyzed
In this lab, you will learn the basics of creating a pivot table in Microsoft Excel. Pivot tables
provide a way to automatically summarize, analyze, explore, and present data. Charts add
visualizations to the data in the pivot table that analyze trends and comparisons.
Part 1: Creating an Excel Pivot Table
Part 2: Visualizing Pivot Table Data
The bicycle sales company wants to determine the purchasing patterns of different demographic
groups to identify areas where it needs to concentrate its marketing efforts. The company also
wants to see if there are purchasing differences between the countries where it operates.
Pivot tables can reveal useful information in records or data that is not obvious at first sight by
summarizing and re-presenting the data so trends can be explored and reported. Pivot tables
extract meaning from the data by grouping it in different ways, enabling useful conclusions to be
made.
The "pivot" part of a pivot table stems from the fact that the data can be rotated (pivoted) to
view it from a different perspective.
It is important to note that pivot tables do not add to, subtract from, or otherwise change, the
data; the pivot table just reorganizes the data to reveal useful information.
Instructions
Part 1: Creating an Excel Pivot Table
To facilitate readability of the data, resize the column widths and center the data in the numeric
columns. This exercise will focus on Year, Age Group, Customer Gender, Country, and Order
Quantity data.
Accordingly, it may also be useful to hide the Customer Age, State, Product Category, Sub-
Category, Product columns.
Step 1: Create the pivot table.
a. Download the Bike Sales_Pivot_Lab.xlsx and save it to your OneDrive. Open the file in
MS 365 Excel online. Click the Insert menu tab and select Pivot Table. In the Create
Pivot Table dialog box, make sure New Worksheet is selected and click OK.
b. In the PivotTable Fields dialog box, select the following fields: Year, Age
Group, Country, and Order Quantity.
The pivot table created displays each country grouped under each age group, with the sum of the
order quantity for each age group in total, and the total for each country under that age group.
Step 2: Review the pivot table.
Note that pivot tables will automatically sum numeric data under each heading. However, in this
case, this produces the meaningless sum of all the year values. So, year values are not useful to
include at this stage.
a. Uncheck Year in the PivotTable Fields dialog and update the pivot table. (Selecting any
cell in the pivot table will bring the PivotTables Fields dialog box back up if it is no
longer visible.)
The information now makes more sense with the total of orders for each age group shown, with
the country breakdown for that age group.
b. In the pivot table, click the - (minus) beside each age group label to collapse (hide) the
countries listed under that age group.
To display the countries again, click the + (plus) that is now displayed beside the age group
label. This expands the group and shows the country sales numbers.
Step 3: Rearrange the pivot table.
To create a different view of the data, drag Country in the PivotTable Fields dialog box to the
Columns pane. The updated pivot table now displays the countries as columns with sales totals
for each age group and each country.
Step 4: Refine the pivot table.
The pivot table now contains blank cells, which detract from the readability of the table. Excel
can be instructed to fill each blank cell with zero.
a. Click the Pivot Table tab on the menu bar. Under Pivot Table, click Settings. In the
field For empty cells show, select the checkbox and enter 0 (zero). Press Enter to
update the table.
b. The column values can be centered for better readability. Center all the columns with
numbers.
c. The first column can be filtered to re-order the age groups from youngest to oldest. Click
the Filter and Sort down arrow next to the Age_Group column heading. Click Sort
Descending to sort age groups from Youth to Adults.
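What the pivot table computes in Steps 1 through 4 can be approximated in plain Python: group by age group and country, sum the order quantity, show 0 for empty combinations, and list the age groups in descending order. This is a sketch only; the function name pivot_sum and the sample rows are assumptions.

```python
from collections import defaultdict

# Hypothetical (Age_Group, Country, Order_Quantity) rows.
rows = [
    ("Youth (<25)", "Canada", 3),
    ("Adults (35-64)", "Canada", 5),
    ("Youth (<25)", "France", 2),
    ("Adults (35-64)", "Canada", 1),
]


def pivot_sum(rows):
    """Sum quantity per age group (rows) and country (columns)."""
    totals = defaultdict(int)
    for age_group, country, quantity in rows:
        totals[(age_group, country)] += quantity
    # Descending label order, as in the Sort Descending step.
    age_groups = sorted({row[0] for row in rows}, reverse=True)
    countries = sorted({row[1] for row in rows})
    # Combinations with no data show 0 instead of a blank.
    return {
        age_group: {
            country: totals.get((age_group, country), 0)
            for country in countries
        }
        for age_group in age_groups
    }


table = pivot_sum(rows)
for age_group, by_country in table.items():
    print(age_group, by_country)
```

As with the Excel pivot table, the source rows are not modified; the table is a summary built from them, so it has to be rebuilt when the rows change.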
Step 5: Revise the pivot table.
a. To enhance the data analysis, select Customer_Gender to add the field into the pivot
table.
Note that any blank cells are automatically filled with zero as set from a previous step.
Clicking the – next to each age group will hide the gender for that age group and + will expand
that data category.
Part 2: Visualizing Pivot Table Data
Presenting the pivot table as a graphical chart will highlight features of the data and assist in
analysis and decision making.
Step 1: Create a pivot table chart.
a. Select all the cells in the pivot table. Click Insert from the menu bar. The chart icon
group appears on the ribbon.
b. Find and select the stacked column icon in the drop down.
c. Re-size and move the chart for optimum clarity.
d. Right click the chart to bring up a pop-up menu and select Format from the menu list.
This will bring up the Chart Format dialog. Click Chart Title. Change the Chart
Title to “Sales Summary.”
The resulting chart should look similar to the example below, but colors may vary.
Challenge Activity
Examine the data presented in the pivot tables and charts and answer these questions.
1. Using the original sample spreadsheet file, Bike Sales_Pivot_Lab.xlsx, construct a new
pivot table that will enable analysis of the most profitable markets by country, age group,
and gender.
Question 1
Refer to the diagram. A data analyst would like to insert a Pivot table in a Microsoft Excel
spreadsheet. Under which tab and function group is the option to insert a Pivot Table?
ANSWER: Under the Insert tab, in the Tables group. Microsoft Excel can be used to create
Pivot tables, providing a way to automatically summarize, analyze, explore, and present data.
To insert a Pivot Table, select a data range in an existing worksheet, click the Insert tab,
and then click PivotTable.
Data Visualization Principles
Data visualization is a crucial aspect of data science, focusing on transforming data into
meaningful visual representations. Through practical exercises, you will learn to use popular
tools to craft visuals that support decision-making and effectively communicate your data
insights.
For this week, complete the following task and submit in the assignment tab on your
dashboard. Ensure you grant view access.
Data Cleaning and Visualization.
Download a dataset with issues from platforms like Kaggle or Google Dataset
Search, clean the data, and document the steps you took to resolve the anomalies.
What are the most important data visualization principles for analytical skills?
The first principle of data visualization is to know your audience and their needs. Different
audiences may have different levels of familiarity with the data, different expectations and
preferences, and different goals and questions. Therefore, you should tailor your visuals to suit
your audience, by choosing the appropriate level of detail, complexity, and interactivity, and by
using clear and consistent labels, legends, and annotations. You should also consider the
context and format of your presentation, such as whether it is online or offline, static or
dynamic, or standalone or part of a report. In my experience, understanding your audience is
pivotal in data visualization for analytical skills. Tailoring visuals to the audience's expertise level
ensures clarity and engagement. For instance, technical stakeholders may prefer detailed
charts, while executives might need concise, high-level summaries. By anticipating questions
and tailoring the presentation accordingly, you foster effective communication, enabling
informed decision-making. Remember, the goal is not just to present data, but to tell a
compelling story that resonates with your audience's needs and expectations.
Data plays a very important role in a presentation. It helps build perspectives from
different angles and levels. Data has little value unless it is presented analytically in a
simple but attractive and interactive form. Case stories add value to the data and help the
audience stay engaged with the presentation and understand it thoroughly, including trends,
comparisons, correlations, coordination, and data dynamics. Colourful, visual data presentation
with analysis and pictorial glimpses adds value to it. Good statistics, charts, graphics,
animation, and voice-over make it more meaningful and useful. Knowing your audience is very
important. Giving them clarity at a glance is the key. Defining the metrics and giving them
KPIs helps the end user analyse the visuals more deeply. Aligning your visuals with the
business needs helps your audience make business decisions wisely.
The second principle of data visualization is to choose the right type of visual for your data and
message. Different types of visuals have different strengths and weaknesses, and can convey
different kinds of information, such as trends, comparisons, distributions, proportions, or
relationships. For example, line charts are good for showing changes over time, bar charts are
good for comparing categories or groups, pie charts are good for showing parts of a whole, and
scatter plots are good for showing correlations or outliers. You should avoid using visuals that
are misleading, confusing, or irrelevant to your data and message. With so many different types
of charts, graphs, tables and schemes available to represent data, deciding what type is best for
visualizing the data being presented is itself a skill. Using the right technique will make the data
represented easy to understand, in addition to presenting it in the most accurate and relevant
manner. Make sure you select the right type of visualization to represent certain data, as this
plays an important role in conveying the right data in the most effective way to users. Think
about how users will interact with the information you are presenting.
5. Tell a story with your data: The fifth principle of data visualization is to tell a
story with your data, which is the process of creating a narrative around your data
that engages your audience and conveys your message. A story can help you
provide context and meaning to your data, highlight key findings and insights, and
persuade or inspire your audience to take action or learn more. To tell a story with
your data, you should use a clear and logical structure, such as a beginning,
middle, and end, and use techniques such as headlines, captions, annotations, or
transitions to guide your audience through your visuals. Telling a story with data
visualization enhances understanding by providing context and relevance. It
transforms raw numbers into compelling narratives, engaging stakeholders and
driving decision-making. Through clear visualizations, complex patterns and
trends become accessible and memorable.
6. Test and refine your visuals: The sixth and final principle of data visualization
is to test and refine your visuals, which is the process of evaluating and improving
your visuals based on feedback and data. Testing and refining your visuals can
help you ensure that your visuals are accurate, relevant, clear, and effective, and
that they meet the needs and expectations of your audience. You can test and refine
your visuals by using various methods, such as checking for errors, biases, or
inconsistencies, using data visualization tools or software, or asking for feedback
from your colleagues, clients, or users. Testing and refining data visuals is a
critical step that leverages iterative design principles. By incorporating user
feedback, one can identify and correct misinterpretations, ensuring that the visual
communication is both effective and intuitive. Additionally, this process helps to
align the visualization with the cognitive load of the audience, optimizing for
clarity and comprehension. It's important to remember that data visualization is not
just an art but also a science that benefits from repeated evaluation and
adjustments.
Several key principles are crucial:
- Clarity: Ensure that the visualization is easy to understand.
- Simplicity: Keep it simple and focused on the key insights.
- Accuracy: Ensure that the data presented is accurate and reliable.
- Relevance: Focus on presenting data that is relevant to the analysis and the
audience.
- Interactivity: Provide interactive elements that allow the exploration and gain
deeper insights.
- Storytelling: Use it to tell a story that helps the audience understand the data and
its implications.
- Feedback: Seek feedback to improve the effectiveness.
The key principles for data visualization in analytical skills are: clarity, relevance,
accuracy, contextualization, interactivity, and aesthetics. These principles
ensure that visualizations are understandable, useful, and accurate for data
interpretation.
Always talk to the stakeholder to understand what they expect and how the data should be
analyzed. Try to analyze what you think is important that they did not ask for, and
show it as a bonus. Always use storytelling best practices, always review
what has been built, and practice PDCA (Plan-Do-Check-Act).
Keep visualizations simple and easy to understand to convey the message clearly.
Ensure that the visualizations are relevant to the audience and the message you
want to convey. Represent data accurately without distorting or misinterpreting it.
Use clear labels, titles, and legends to make it easy for viewers to understand the
information. Maintain consistency in the design elements such as color, style, and
scale across the visualizations. Provide interactive elements like tooltips or filters
to allow users to explore the data in more detail. Use visualizations to tell a story
and guide viewers through the data to highlight key insights. Be aware of the
ethical considerations when visualizing data.
Data visualization is a crucial aspect when analyzing company data, especially in
areas of sales, market share, product usage, and demographics, among others. In
order to effectively and clearly transform necessary data into meaningful insights
and aid in decision-making, visualizations need to be void of vagueness and
displayed properly in order to avoid inaccurate conclusions, which can be
disastrous.
Data visualization can answer important strategic questions and provide really
valuable information that can help put together real solutions. It can be used to
track performance, monitor behavior, as well as to measure the effectiveness of
various processes implemented. Taking the time in the beginning to define the
purpose, priorities, and goals of the data visualization will make the result more
useful and relevant to the specific target audience, preventing you from creating
visuals that are unnecessary. Determine what decisions you want to drive from the
target audience and how frequently the data needs to be reported.
To begin with, we must set ourselves an objective: What do we want to have with
the data we are analyzing? Second, we have to evaluate who it is aimed at, who is
going to consume that data. The ideal is to keep it simple, follow a line and tell a
story, always thinking about the objective (for what and for whom).
Data wrangling: Data wrangling is the process of cleaning, transforming, and organizing
data for analysis. It is an important skill to learn when it comes to data visualization
training.
Tools: Familiarizing yourself with different data visualization tools.
Design principles: Understanding design principles such as color theory, typography, and
layout is important.
Communication skills: The ability to communicate data in a clear and concise way is
essential.
Storytelling: The ability to tell a story with data allows you to take complex data and
present it in a way that is easy to understand and actionable.
By mastering these skills, you will be able to create effective data visualizations
that communicate insights and make data accessible to a wide audience.
5.0.1 Introduction
To do their jobs efficiently and effectively, data analysts must have a basic
understanding of statistics. This is because data analytics relies heavily on statistics
in the process of analyzing and interpreting data.
In this module, we will create some visualizations in Excel. But first, we need to
understand some statistical concepts in order to make the most of visual
interpretations.
Question 1
What is one important criterion when selecting sample data for analysis?
Answer: The sample is representative of the population being studied. When choosing a sample
for analysis, the sample should be a randomized group that is representative of the total
population.
Question 1
True or False: Descriptive statistics only allow you to make summations about the people
or objects that you have actually measured.
ANSWER: True.
Descriptive statistics are limited in that they only allow you to make summations
about the people or objects that you have actually measured. The data you collected
cannot be used to generalize about other people or objects. While descriptive
statistics describe the current or historical state of the observed population, they do not
allow for comparison of groups, conclusions to be drawn, or predictions to be made about
other data sets that are not in the population.
5.1.5 Inferential Statistics
Descriptive statistics allow you to summarize findings based on data that you already have
recorded or observed about a population. However, there are situations in which gathering data
for a very large population may not always be practical or even possible. It is possible, however,
to study a smaller representative sample of a population and use inferential statistics to test
hypotheses and draw conclusions about the larger population.
Inferential statistics is the process of collecting, analyzing and interpreting the data gathered
from a sample to generalize or predict something about a population. When a representative
sample is used, methodological concerns may arise and must be addressed, such as whether the
groups chosen for the study or the environment in which a study is carried out accurately reflects
characteristics of the larger group. Typically, these types of analyses will include different
sampling techniques to reduce error and increase confidence in the generalized findings. The
type of sampling technique used will depend on the type of data.
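As a rough illustration of generalizing from a sample, the following Python sketch draws a simple random sample from a made-up population and estimates the population mean with an approximate 95% interval. The population values, the sample size of 200, and the normal-approximation multiplier of 1.96 are all assumptions chosen for illustration; they do not come from the course material.

```python
import random
import statistics


def estimate_mean(sample, z=1.96):
    """Return (mean, low, high) for an approximate 95% interval.

    Uses the normal approximation, which is an assumption that is
    only reasonable for fairly large random samples.
    """
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, mean - z * sem, mean + z * sem


random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical population of 10,000 measurements
population = [random.gauss(50, 10) for _ in range(10_000)]
# A simple random sample stands in for a "representative sample"
sample = random.sample(population, 200)

mean, low, high = estimate_mean(sample)
print(f"sample mean = {mean:.2f}, interval = ({low:.2f}, {high:.2f})")
```

The interval narrows as the sample size grows, which is one way to see why sampling technique and sample size affect confidence in the generalized findings.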
Answer: It is the process of collecting, analyzing and interpreting data gathered from
representative samples. That’s right.
Inferential statistics is the process of collecting, analyzing and interpreting data gathered from a
sample to make generalizations or predictions about a population. While descriptive statistics can
only summarize a sample’s characteristics, inferential statistics use your sample to make
reasonable guesses about the larger population. Because a representative sample is used instead
of actual data from the entire population, there is a concern that the particular groups chosen for
the study, or the environment in which the study is carried out, may not accurately reflect
characteristics of the larger group. These analyses therefore include different sampling techniques to
reduce error and increase confidence in the generalizations about the findings. The type of
sampling technique used will depend on the type of data.
5.1.7 Statistics and Big Data
Different statistical approaches are used in big data analytics.
As we know, descriptive statistics describe a sample. This is useful for understanding the
sample data and for determining the quality of the data. Problems can occur when dealing
with large amounts of data that come from multiple sources. Data points can be corrupted,
incomplete, or missing entirely. Descriptive statistics can help determine how much of the
data in the sample is good for the analysis and identify criteria for removing data that is
inappropriate or problematic. Graphs of descriptive statistics are a helpful way to make
quick judgements about the quality of a sample.
For example, in a sample of tweets selected for analysis, some contain only text characters,
while others contain both characters and images. The type of analysis or question to be
answered with analysis will determine whether tweets that contain images or tweets with
no images should be analyzed. This will identify tweets that are invalid based on a very
simple criterion, because images contain information that must be considered in the
analysis if the tweets using images are included in the sample.
A number of inferential analyses are very commonly used in big data analytics:
Cluster analysis - Used to find groups of observations that are similar to each other
Association analysis - Used to find co-occurrences of values for different variables
Regression analysis - Used to quantify the relationship, if any, between the
variations of one or more variables
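To make the idea of regression analysis more concrete, here is a minimal Python sketch that fits a straight line to paired values using ordinary least squares with a single predictor. The choice of simple linear regression and the sample numbers are assumptions for illustration; the course text does not prescribe a specific method.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired observations (for example, an input and a response)
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.1f} * x + {intercept:.1f}")
```

The slope quantifies how much the second variable changes, on average, for each unit change in the first.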
Question 1
True or False
Visualizations cannot display outliers, making the data displayed unreliable.
Answer: False.
Good visualizations make faster decision making possible because humans can process
visual images faster than tables of data, and they help users make better sense of
complicated data, including outliers. Outliers can be displayed using scatter plots.
There are many types of data visualizations. Determining the best option usually depends on the
answers to the following questions, among others:
How many variables are you going to show?
How many data points are in each variable?
Is your data over time or are you comparing data points at a single point in time?
1) Line Chart
Line charts are one of the most commonly used types of comparison charts. Use line charts when
you have a continuous set of data, the number of data points is high, and/or you would like to
show a trend in the data over time. Some examples include:
Quarterly sales for the past five years
Number of customers per week in the first year of a new retail shop
Change in a stock’s price on one day, from opening to closing bell
2) Column Chart
Column charts are positioned vertically. They are probably the most common chart type used to
display the values of a specific variable across similar categories. Some examples include:
Populations of the BRICS nations (Brazil, Russia, India, China, and South Africa)
Last year’s sales for the top four car companies
Average student test scores for six math classes
Some best practices for column charts include:
Label the axes.
If changes over time are being shown, time should be plotted on the x-axis.
If time is not part of the data, consider ordering the data so that column heights ascend or
descend.
Fill the columns with a solid color. To highlight one column, consider using an accent
color and make all the other columns the same color.
Column charts are best when there are no more than seven categories on the horizontal
axis. This will help the viewer clearly see the value for each column.
Start the value of the y-axis at zero to accurately reflect the full value of each column.
The spacing between columns should ideally be roughly half the width of a column.
3) Bar Chart
Bar charts are similar to column charts except they are positioned horizontally and
hence used slightly differently (for example, they do not usually show changes over
time). Longer bars indicate larger values. They are best used when the names for the
data points are long, because there is space to write the information. Some examples
include:
4) Pie Chart
Fig: Pie-Chart.
Pie charts are used to show the composition of a total. Segments of different sizes visually
represent percentages of that total. The sum of the segments must equal 100%.
Some examples include:
Annual expenses for a corporation (e.g., rent, administrative, utilities, production)
A country’s energy sources (e.g., oil, coal, gas, solar, wind)
Survey results for a group’s favorite type of movie (e.g., action, romance, comedy,
drama, science fiction)
Some best practices for pie charts include:
Limit the number of categories so that the viewer can easily differentiate between
segments and their meaning in relation to each other. After ten or more segments,
the slices begin to lose meaning and impact.
If necessary, consolidate smaller segments into one segment with a label such as
“Other” or “Miscellaneous”.
Use a different color or gray scale for each segment.
Order the segments clockwise according to size.
Make sure the value of all segments equals 100%.
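The best practices above (percentages that sum to 100%, small segments consolidated into "Other", and segments ordered by size) can be sketched in Python as follows. The expense figures and the 5% cutoff for "small" segments are hypothetical values picked for illustration.

```python
def pie_segments(values, min_share=5.0):
    """Convert raw values to percentage shares for a pie chart.

    Segments below min_share percent are merged into "Other", and the
    result is ordered from largest to smallest share.
    """
    total = sum(values.values())
    shares = {name: 100 * value / total for name, value in values.items()}
    segments = {name: s for name, s in shares.items() if s >= min_share}
    other = sum(s for s in shares.values() if s < min_share)
    if other > 0:
        segments["Other"] = other
    return dict(sorted(segments.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical annual expenses
expenses = {"rent": 50, "production": 30, "utilities": 17, "travel": 2, "misc": 1}

for name, share in pie_segments(expenses).items():
    print(f"{name}: {share:.1f}%")
```

Because every value is either kept or folded into "Other", the shares still add up to 100%.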
5) Scatter Plot
Scatter plots are very popular for visualizing correlations, or to show the distribution of
many data points. Scatter plots are also useful for demonstrating clustering or identifying
outliers in the data.
Some examples include:
Comparing life expectancy to GDP for each country in a group
Comparing the daily sales of ice cream at a given location to the average outside
temperature
Comparing the weight to the height of each person in a group
Some best practices for scatter plots include:
Label the axes.
Make sure the data set is large enough to provide visualization for clustering or
outliers.
Start the value of the y-axis at zero to accurately reflect the full values of the data.
The value of the x-axis will depend on the data. For example, age ranges of ice
cream customers might be labeled on the x-axis, and there would be no need to start
at zero years of age.
If the scatter plot shows a correlation between values on the x- and y-axes, consider
adding a trend line.
Do not include more than two trend lines.
Question 1
Refer to the exhibit. What options are accessed by selecting the plus sign (+) icon shown in
the exhibit?
ANS: Editing of the chart elements such as legends, titles, and data labels
That’s right.
The + icon enables the editing of chart elements. The paintbrush icon chooses a chart style
and applies a color scheme. The funnel icon filters the data points on selected data values.
In this lab, you will create charts to visualize data in Microsoft Excel.
Part 1: Creating a Line Chart
Part 2: Creating a Column Chart
Part 3: Creating a Pie Chart
Data visualization assists with the analysis and interpretation of data by graphically
presenting relationships, trends, and inferences that cannot always be clearly derived
by examining the raw text and numeric values in a dataset. This lab uses sample
datasets to demonstrate three visualizations of that data.
Part 1: Creating a Line Chart
This chart will display the Profit and Revenue for the years 2017, 2018, 2019, 2020 and 2021.
Step 1: Download the data file
a. Download the sample file Bike Sales_Visualizations_Lab.xlsx to your OneDrive. Open
the downloaded file in MS 365 Excel.
This workbook contains four worksheets that each will be used separately throughout this lab.
Step 2: Insert the line chart.
a. Select the Revenue and Profit by Year worksheet. The worksheet contains the profit
and the revenue totals for each of the years 2017, 2018, 2019, 2020, and 2021.
b. Select the data in the cells A3 through to C8.
c. From the Insert menu, expand the ribbon using the down arrow on the right side of the
ribbon, click the Line chart tool, then select Line with Markers (bottom left option)
This creates a line chart with an x-axis showing the years, and a y-axis showing dollar amounts.
Step 3: Format the chart.
a. To improve the clarity of the chart, change the vertical axis to display USD currency.
1. Right click on the chart and select Format. The Chart Format window pane
opens on the right of the worksheet.
2. Expand the options for the Vertical Axis.
3. In the Number Format section change Category to Currency and
change Decimal places to 0.
b. Add a chart title
1. In the Chart Format window pane, change the Chart Title option switch to the on
position if it is not already and expand the Chart Title options.
2. Change the Chart Title to “Revenue vs. Profits”.
3. Keep the Title Position at the default which is Above.
c. Change the Legend names to “Annual Profit” and “Annual Revenue”.
1. Select cell B3 and change the column name to Annual Profit.
2. Select cell C3 and change the column name to Annual Revenue.
The legend names at the bottom of the chart should change to match the column names.
d. Reposition the Legend to the right of the chart.
1. Right click on the chart to bring up the Chart Format window pane.
2. Expand the options for Legend.
3. Change the Position option to Right.
e. Add axis titles for both the vertical and horizontal axes.
1. If necessary, right click on the chart to bring up the Chart Format window pane.
2. Expand the Horizontal Axis options.
3. Scroll down to the Axis Title and move the switch to the on position.
4. Add an axis title of “Year”.
5. Expand the options for the Vertical Axis.
6. Scroll down to the Axis Title and move the switch to the on position.
7. Add an axis title of “US Dollars”.
The finished chart should appear as shown below.
Part 2: Creating a Column Chart
a. Select the Product Revenue by Country worksheet. The worksheet contains the revenue
totals for each product category by country.
b. Select the data in the cells A3 through to E10.
c. From the Insert menu, click the Column chart tool, then select the Stacked
Column (middle option)
This creates a column chart with an x-axis showing the country, and a y-axis showing dollar
amounts.
a. Using the same methods used for the line chart in Part 1 perform the following formatting
changes to the chart.
1. Give the chart a title of “Product Revenue by Country”.
2. Change the vertical axis Number Format to Currency and the Decimal
Places to zero.
3. Change the Position of the Legend to the Right.
4. Add a horizontal Axis Title of “Country”.
5. Add a vertical Axis Title of “US dollars”.
Part 3: Creating a Pie Chart
a. Select the Revenue by Age Group worksheet. The worksheet contains the revenue totals
for each age group.
b. Select the data in the cells A3 through to B7.
c. From the Insert menu, click the Pie chart tool, then select the 2-D Pie (top option)
This creates a pie chart with each age group represented by an area on the chart representative of
the revenue for that group.
a. Using the same methods used previously for the line and column charts make the
following format changes:
1. Give the chart a title of “Revenue Comparison by Age Group”.
2. Change the Position of the Legend to the Right.
b. Add data labels to the chart area.
1. In the Chart Format window expand the options for Series “Total”.
2. Expand the options for Data Labels.
3. Check the boxes for Category Name and Percentage.
Reflection Questions
Review each of the datasets given in this lab and select different options for each chart. Consider
if the visualization of the data is enhanced, or not, with these different options.
Match the type of visualization to its best use for the described scenario.
line chart - used when you have a continuous set of data, the number of data points is high, and you would like to show a trend in the data over time
column chart - used when you want to display the value of a specific data point and compare that value across similar categories (chart is positioned vertically)
bar chart - best used when the names for the data points are long (chart is positioned horizontally)
pie chart - used to show the composition of a static number (segments represent a percentage of that number amounting to a total sum of 100%)
scatter plot - used to show the distribution of a large number of data points and to identify outliers in the data
5.4 Addressing Anomalies in Data
5.4.1 Discovering Anomalies Through Visualization
5.4.2 Practice Item
Question 1
This Practice Item requires you to download the following file.
Download the file entitled "games purchased" before starting this question. Use the Microsoft
Excel file downloaded to create a scatter plot. What is the value of the outlier identified in the
scatter plot chart?
Before data analysis can begin, considerable time must be spent cleaning the data. During the
data cleaning phase, you may find outliers, or anomalies, in the data. If so, they need to be
investigated so that the data can be corrected or the meaning of the outlying data point can be
understood. An outlier is defined as a value or data point that varies significantly from others,
either much smaller or much greater. Sometimes outliers are mistakes and sometimes they
represent an important piece of information. In the figure, the data point at the extreme bottom
right is an outlier. All the other data points cluster along the trend line.
In the data analysis process, outliers that are the result of mistakes can lead to anomalies in the
results obtained, while outliers that are not errors can be very important to an analysis. This is
why investigating anomalies is a very important part of the data cleaning process—it ensures that
data can be analyzed effectively and generate accurate and valid results.
With small data sets it may be relatively easy to spot outliers by sorting or filtering the data. But
when it comes to large datasets and big data, other tools are required. Two common types of data
visualization used to find outliers are scatter plots and box plots.
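Box plots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The Python sketch below applies that convention; the 1.5 multiplier and the order quantities are assumptions for illustration and are not taken from the lab dataset.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values outside the box-plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical daily order quantities with one suspicious value
order_quantities = [12, 15, 14, 19, 13, 16, 43, 11, 17, 18]

print(find_outliers(order_quantities))
```

A flagged value still needs investigation, because the rule cannot tell an entry error from a genuinely unusual sale.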
Question 1
What values are considered as outliers in a given data set?
Answer: Values that deviate far from the expected values.
Outliers are values that deviate so far from expected values
that they could distort the results of the analysis. These observations are frequently
removed from the data set after careful consideration.
In this lab, charts and functions will be used to detect data outliers.
An outlier is a value or data point that varies significantly from others in the same dataset.
An outlier can result from variability in the measurements, experimental errors, or human
error in entering the data.
To make sure that any data analysis is correct, outliers need to be identified and then it
needs to be determined how best to treat them.
Instructions
Part 1: Examine a Dataset for Outliers
Step 1: Open the data set.
a. Download the file Bike Sales_Outlier_Lab.xlsx
b. Upload the file to your OneDrive and open it in MS 365 Excel online.
Step 2: Use a Pivot Table to Select Data for Analysis
a. Click any cell in the Bike Sales worksheet.
b. Insert a pivot table by clicking Insert > PivotTable. Check that New Worksheet is
selected in the Create PivotTable dialog box and click OK.
This adds a new worksheet for the pivot table.
c. In the PivotTable Fields Dialog box check the Date and Order_Quantity fields.
The pivot table is created with two columns Date and Sum of Order_Quantity.
Step 3: Sorting Data to Find Outliers
One way to identify outliers is by just sorting the data. This method works with small data
sets where the data is easily scanned.
a. Sort the Sum of Order_Quantity column from high to low
1. Select the data points in the Sum of Order_Quantity column. (Do not select
the Grand Total or the column header).
2. Click Sort & Filter > Sort Descending.
This sorts the Order_Quantity data points from highest to lowest.
Which December date had the largest sales quantity? What was the sales quantity?
Review the data in the Bike Sales worksheet for December 19th. Which entry contributes
most to the Sum of Order_Quantity in the pivot table? In other words, which order
number is most responsible for the outlier?
Excel will not allow creation of a scatter plot from data in a pivot table. So, the data must
be moved to other columns.
Note that the visual of the scatter chart makes the sales for December 19th easily stand out
as an outlier from the other order quantity datapoints as shown below.
If there is a lot of data the LARGE and SMALL functions can be used to extract the largest
and smallest values which can help to see if there are any outliers.
For this example, the Date column is column D and the Sum of Order_Quantity column is
column E. The columns in your worksheet may be different so adjust your function cell
references accordingly.
Answer: 43
This returns the highest five values. To return more values change the “5” at the end of the
function to number of values you would like returned.
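The idea behind using LARGE and SMALL to pull out the extreme values can be mirrored outside Excel. The Python sketch below is an analogy only, not the lab's formula: it uses the standard-library heapq module, and the quantities are hypothetical.

```python
import heapq


def top_values(values, count=5):
    """Return the `count` largest values, highest first."""
    return heapq.nlargest(count, values)


def bottom_values(values, count=5):
    """Return the `count` smallest values, lowest first."""
    return heapq.nsmallest(count, values)


# Hypothetical summed order quantities per date
daily_quantities = [12, 15, 14, 19, 13, 16, 43, 11, 17, 18]

print(top_values(daily_quantities))      # the five highest values
print(bottom_values(daily_quantities))   # the five lowest values
```

Scanning the few highest and lowest values is often enough to see whether one of them sits far away from the rest.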
Once outliers are identified, the next challenge is what to do with them. Outliers may
indicate errors in the data, or may be valid data that needs to be investigated as to why it
appears to be an anomaly. There are a couple of ways in which a data analyst can deal with
outliers.
1. Delete them. In a large dataset deleting a few outliers will likely not impact the
overall analysis. However, it is important to create a copy of the data so you can
research what was causing the outliers in the first place. In this example, row 72 in
the Bike Sales dataset could be deleted.
2. Normalize them (Adjust their value). The value of the outliers is changed to be
slightly above the maximum value in the dataset. This is a good method if it will not
skew the data. There are a number of statistical methods to normalize data.
Research the various methods before randomly adjusting data values. In the
example Bike Sales dataset, the December 19th Order_Quantity could be changed
from 43 to 20 so it is just above the maximum value of 19.
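The adjustment described in option 2 can be sketched in Python as a simple capping step. The function name, the threshold argument, and the sample values are hypothetical; the only detail taken from the text is that the outlier is set just above the largest non-outlier value (43 becomes 20 when the maximum is 19).

```python
def cap_outliers(values, upper_fence):
    """Replace values above upper_fence with (largest normal value + 1).

    Assumes at least one value is at or below upper_fence. The fence is
    a threshold chosen by the analyst; it is not computed here.
    """
    normal = [v for v in values if v <= upper_fence]
    cap = max(normal) + 1
    return [v if v <= upper_fence else cap for v in values]


# Hypothetical order quantities; 43 is treated as the outlier
quantities = [12, 19, 43, 15]

print(cap_outliers(quantities, upper_fence=26.5))
```

Working on a copy of the data, as the text recommends, keeps the original values available for later investigation.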
Reflection Questions
List the factors that could determine whether data outliers should or should not be
considered in the final analysis of a dataset.
Answers vary. It may include dataset size, sample size of the outlier, purpose of the
analysis, significance of the outlier value with respect to the majority of the data.
Question 1
Refer to the exhibit. What inference can be drawn from the results displayed in the graph?
Ans: there are outliers in the data that could skew the results
Question 1
Exhibit: An Excel worksheet with a lookup table of polygon names, an input cell labeled
“Enter number (3-20):” containing 3, and a result cell labeled “Polygon name:” showing triangle.
Sides  Name
3      triangle
4      quadrilateral
5      pentagon
6      hexagon
7      heptagon
8      octagon
9      nonagon
10     decagon
11     hendecagon
12     dodecagon
13     triskaidekagon
VLOOKUP is a very powerful data analysis tool within Excel and is great when you need
to find information in a large spreadsheet or if you are consistently looking for the same
type of information.
VLOOKUP searches for a value in the leftmost column of a table and, when the value is
found, returns information from the same row but in another column.
XLOOKUP is a newer lookup function, similar to VLOOKUP, that is not available in all
versions of Excel currently in use. With XLOOKUP, you can look in any column (not only
the leftmost in a table) for a search term and return a result from the same row. One
difference is that XLOOKUP defaults to returning an exact match, whereas VLOOKUP
defaults to closest match unless the FALSE keyword is used. In this course, you may use
either VLOOKUP or XLOOKUP to obtain the desired results if they are both available in
the spreadsheet tool you are using.
Note: XLOOKUP is not backward compatible, so worksheets using XLOOKUP may not be
usable in earlier versions of Excel.
In the example below, a VLOOKUP formula (shown in the formula bar at the top) has
been entered into cell G3 that will return the budget of a movie when a movie title is typed
into cell G2. The formula =VLOOKUP(G2,A1:D11,3,FALSE) tells Excel to search the
range of cells A1 through D11 for the row with the information typed into G2 and, if it
finds a match, to return the value in column 3 for that row. The FALSE tells Excel to look
for an exact match for the title typed into G2. In this screenshot, the Budget
cell G3 has an #N/A error message, because nothing has been entered into G2 yet.
After entering a movie title, as shown in the figure below, the formula returns the budget
for that movie.
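For readers who think more easily in code, the exact-match behavior of =VLOOKUP(G2,A1:D11,3,FALSE) can be approximated with the short Python sketch below. The function and the movie rows are hypothetical placeholders, not the data in the screenshot, and unlike Excel this sketch compares text case-sensitively.

```python
def vlookup(lookup_value, table, col_index):
    """Exact-match lookup in the first column of `table`.

    col_index is 1-based, as in Excel. Returns None where Excel
    would display #N/A.
    """
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None


# Hypothetical table: title, year, budget, genre
movies = [
    ("Movie A", 2001, 1_000_000, "Drama"),
    ("Movie B", 2005, 2_500_000, "Comedy"),
    ("Movie C", 2010, 4_000_000, "Action"),
]

print(vlookup("Movie B", movies, 3))   # budget of the matching row
print(vlookup("", movies, 3))          # nothing entered yet, so no match
```

The search always runs down the first column, which is why the lookup value must live in the leftmost column of the range.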
VLOOKUP Example: Data Cleaning
VLOOKUP can also be used for data cleaning. For example, you can use it to compare two
columns (or lists) and find values that appear in both.
In the example spreadsheet below, “Good” superheroes are listed in column A and “Evil”
villains are listed in Column B. There should be no duplicates between these two columns,
because you can’t be both at the same time, and the VLOOKUP function can be used to
find any names listed in both columns. There are multiple ways to do this using
VLOOKUP. In this example, for each villain, the cell to the right of the villain’s name will
hold a VLOOKUP function to indicate whether he also appears in the “Good” column.
The formula can be copied to the other cells in Column C to check all the entries in each
column. Because of the $ symbols, it will still look in cells A2 through A10 for the names.
Customizing Results
With a small addition, VLOOKUP can be written to display custom text to reflect whether
an entry appears in both columns or not, like so:
=IF(ISNA(VLOOKUP(B2,$A$2:$A$10,1,FALSE)),"Unique","Duplicate")
Above, the formula is modified by adding the IF and ISNA functions, and the
values Unique and Duplicate. The IF function makes logical comparisons, and the ISNA
function looks for cells containing the #N/A error message. Together, they tell Excel to
return “Unique” if the VLOOKUP function finds no duplicate (and so returns an error)
and to return “Duplicate” otherwise. The formula is copied to the other cells to give the
results below.
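The logic of the IF(ISNA(VLOOKUP(...))) formula is a membership test: a name is a duplicate when it is found in the other column. A minimal Python sketch of the same check follows; the names are hypothetical placeholders rather than the entries in the example spreadsheet.

```python
def label_duplicates(candidates, reference):
    """Label each candidate "Duplicate" if it also appears in reference, else "Unique"."""
    reference_set = set(reference)  # fast membership checks
    return [
        "Duplicate" if name in reference_set else "Unique"
        for name in candidates
    ]


# Hypothetical lists standing in for columns A ("Good") and B ("Evil")
good = ["Hero A", "Hero B", "Hero C"]
evil = ["Villain X", "Hero B", "Villain Y"]

for name, label in zip(evil, label_duplicates(evil, good)):
    print(name, label)
```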
In the next lab, you will have an opportunity to practice using VLOOKUP.
Question 1
That’s right.
VLOOKUP is a very powerful data analysis tool in Microsoft Excel that is used to find
information in a large spreadsheet. VLOOKUP is a vertical lookup function, so the data
needs to be organized in a table where each row has different but related forms of data in
each column.
Objectives
In this lab, you will use the VLOOKUP function in Microsoft Excel to:
Background / Scenario
Searching for a single data record from a very large dataset for analysis usually requires
the application of specific search functions to facilitate locating and extracting the
information required. In this lab, the Excel VLOOKUP function will be used to locate
information from a large dataset.
Required Resources
Mobile device or PC/laptop with a browser, MS 365 Excel online, and an Internet
connection
Note: The precise steps to format and combine or split data in Excel can vary between
platforms and versions. The instructions in this lab are based on the free version of Excel
available from Office.com and may have to be modified to match the platform or version
used to achieve the results shown in this lab.
Instructions
a. Download the sample workbook file Bike Sales_VLOOKUP.xlsx and save it to your
OneDrive.
b. Open the downloaded file in MS 365 Excel online. Adjust the column widths such
that the data is visible.
c. Freeze the top (Heading) row so it remains visible as you scroll through the many
thousands of entries.
1. Click View > Freeze Panes and select Freeze Top Row.
Scroll through and examine the data in the worksheet. The worksheet contains bike sales
data for the 4th quarter of 2021, which is 752 rows of data.
As can be seen, even if the data is sorted by Sales_Order # or Date, the size of the dataset
means that manually searching for information about a specific sale would be time
consuming and perhaps prone to error.
To facilitate searching, this lab will use VLOOKUP to display the Sales_Order
Number and Product of a particular sale.
An important requirement of VLOOKUP is that the reference data must be in the leftmost
(first) column of the dataset in Excel. In this lab, Sales_Order# is being used as the search
reference and must be the first column (Column A in this spreadsheet).
For usability and consistency, especially with large datasets, it is best practice to select a
blank area of the worksheet, or a new blank worksheet, in which to enter the search
criteria and display the result.
This entry and display area may be on another worksheet within the spreadsheet; or may
be enabled by using macros and forms. For this lab, the entry and display area will be on
the same worksheet adjacent to the data.
Where:
V3 is the cell where the reference (lookup) value, the sales order number, is entered.
A2:S753 is full range of data across all columns and rows that will be searched.
13 is the column index within the range; Product is the 13th column of the range,
counting the Sales_Order# column (column A) as column 1.
14 is the column index within the range; Order_Quantity is the 14th column of the range,
counting the Sales_Order# column (column A) as column 1.
FALSE denotes that an exact match for the sales order number must be found.
If Excel is displaying 261274 in cell V3, enter '000261274 into cell V3. The single quote (')
forces the leading zeros in the sales order number to be included in the lookup.
Step 4:
If a sales order number is entered that is not found in the dataset, then #N/A is displayed.
This result is not very helpful to the user. A solution is to modify each VLOOKUP function.
The VLOOKUP function is enclosed within an IFNA function, which displays the
given text whenever VLOOKUP returns #N/A.
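The two ideas in this step, keeping the leading zeros of the order number and showing a friendly message instead of #N/A, can be sketched together in Python. Everything here is hypothetical: the nine-character width is inferred only from the example value 000261274, and the product, quantity, and message text are placeholders.

```python
def lookup_order(order_number, orders, not_found="Order not found"):
    """Look up an order by number, treating the number as zero-padded text.

    Plays the role of IFNA(VLOOKUP(...), "text"): a missing key returns
    the not_found message instead of an error.
    """
    key = str(order_number).zfill(9)  # assumed width of 9 characters
    return orders.get(key, not_found)


# Hypothetical dataset keyed by order number stored as text
orders = {
    "000261274": ("Hypothetical Product", 2),
}

print(lookup_order(261274, orders))        # number without leading zeros
print(lookup_order("000261274", orders))   # text with leading zeros
print(lookup_order("999", orders))         # not in the dataset
```

Storing the key as text is what makes 261274 and 000261274 resolve to the same record.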
d. Try other valid and invalid order numbers to test the VLOOKUP function.
Reflection Questions
Consider searching the dataset for other information relating to a particular sale.
Challenge Activity
Explore the possibilities of using VLOOKUP on other datasets and how useful this would
be in analyzing the data. Consider what would be the reference data and the structure of
the VLOOKUP function to achieve the required analysis.
Question 1
Exhibit: An Excel worksheet with a Month/Days lookup table, an input cell labeled
“Enter month:” containing September, and a result cell labeled “Number of days:” showing 30.
The result cell uses the formula =VLOOKUP(LEFT(E2,3),A:B,2,FALSE).
Month  Days
JAN    31
FEB    28
MAR    31
APR    30
MAY    31
JUN    30
JUL    31
AUG    31
SEP    30
OCT    31
NOV    30
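The formula in the exhibit first trims the typed month to its first three letters with LEFT and then looks that abbreviation up in the table. A Python sketch of the same two steps is shown below. It includes only the rows visible in the exhibit, and the upper-casing stands in for VLOOKUP's case-insensitive text matching.

```python
# Rows visible in the exhibit (the table is cut off after NOV)
DAYS_BY_MONTH = {
    "JAN": 31, "FEB": 28, "MAR": 31, "APR": 30, "MAY": 31, "JUN": 30,
    "JUL": 31, "AUG": 31, "SEP": 30, "OCT": 31, "NOV": 30,
}


def days_in_month(month_name):
    """Mimic VLOOKUP(LEFT(name, 3), table, 2, FALSE).

    Returns None where Excel would display #N/A.
    """
    key = month_name[:3].upper()  # LEFT(E2, 3), compared ignoring case
    return DAYS_BY_MONTH.get(key)


print(days_in_month("September"))
```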
Question 1
What are inferential data sets?
Answer: They are data sets gathered from a representative sample to make
generalizations or predictions about a population.
Question 2
How is the Microsoft Excel VLOOKUP tool used in data analysis?
Question 3
What can a data analyst do if they wanted to remove duplicate values in a Microsoft Excel
spreadsheet?
Question 4
Which three key pieces of information are required to perform a VLOOKUP function in
Microsoft Excel? (Choose three.)
1. the column number in the range that contains the return value
2. the range where the value is located
3. the lookup value
Question 5
A data analyst wants to find data point values that are significantly different from others in a data
set. What are these values called?
Ans: outliers
Question 6
Why are different sampling techniques used to gather inferential statistical data?
ANS: to reduce error and increase confidence in the generalizations about the findings
Question 7
Which type of inferential and machine learning analysis is used to find groups of observations
that are similar to each other?
ANS: cluster
Question 8
Why is regression analytics used in the inferential and machine learning analyses of big data?
Ans: It is used to quantify the relationship, if any, between the variations of one or more
variables.
Question 9
A data analyst wants to display the various segments of a country’s energy sources (e.g., oil,
coal, gas, solar, wind) contributing to 100% of its energy sources in a visual format. What type
of chart would be best used to accomplish this?
Ans: pie chart
Question 10
A data analyst wants to display outliers in the data set. Which type of visual representation
would best suit this task?
ANS: scatter plot