0% found this document useful (0 votes)
14 views188 pages

Module 1 - BI Introduction

The document provides an overview of Business Intelligence (BI) and its applications in organizations, focusing on the role of analytics in decision-making. It introduces key concepts such as BI terminology, data types, and the structure of analytical organizations, highlighting the importance of self-service BI and effective data management. The content is structured into modules covering various aspects of BI, including descriptive and diagnostic analytics, data visualization, and a capstone project.

Uploaded by

Lauren Mai
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views188 pages

Module 1 - BI Introduction

The document provides an overview of Business Intelligence (BI) and its applications in organizations, focusing on the role of analytics in decision-making. It introduces key concepts such as BI terminology, data types, and the structure of analytical organizations, highlighting the importance of self-service BI and effective data management. The content is structured into modules covering various aspects of BI, including descriptive and diagnostic analytics, data visualization, and a capstone project.

Uploaded by

Lauren Mai
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 188

BUSINESS INTELLIGENCE

Module 1: Business Intelligence Introduction


Profile
Thao is currently Business Intelligence/ Business Analytics Trainer. She was Business Analytics
Manager at SIFT Analytics Group. She specializes in Analytics Project Management and regularly
using Business Intelligence and Business Analytics tools to review, understand data of customers,
acquire subject matter expertise, communicate model insights, present business performance
analyzing results for C-level managers in finance & retail industry. Thao is also a Trainer, Speaker
for Analytics events, training programs of SIFT and partners. With more than half-decade of
successful experience, Thao used a broad range of technologies, analytical techniques and
methodologies in order to analyze a broad range of data from various sources to provide insights
and support business decision making.
1 Business Intelligence Introduction

2 Business Statistics

3 Descriptive Analytics

4 Diagnostics Analytics

5 Data Visualization

5 Business Intelligence Capstone


1. BUSSINESS INTELLIGENCE INTRODUCTION

01 02 03 04
BI Terminology Business Intelligence in Corporates Power Query Data Preparation
1. Data issues
1. Self-service BI & Analytics 1. Get Data
1. BI vs BA 2. Analytics structure and Coordination Model 2. Data formatting
2. Profile Data
2. Technology in BI 3. BI Success (Organization) 3. Data blending
4. BI Success (Individual) 3. Transform Data
3. Data, Analysis, Analytics 4. Data Preparation Checklist
5. The Evolution of Business Intelligence
4. Understanding Data 6. How can data analytics help organizations?
7. BI process
8. Decision making with BI
9. Decision Bias
Lesson 1: Business Intelligence Terminology
Business Intelligence & Business Analytics
Lesson 1: Business Intelligence Terminology
Business Intelligence & Business Analytics
Lesson 1: Business Intelligence Terminology
Technology in Business Intelligence
Lesson 1: Business Intelligence Terminology
Technology in Business Intelligence
Lesson 1: Business Intelligence Terminology
Data, Analysis, Analytics

Analysis provides you with Analytics The science that


Data, in the information
information & raises analyze crude data to
age, are a large set of
questions extract useful knowledge
bits encoding numbers,
(patterns) from them.
texts, images, sounds,
Analytics give you insights
videos, and so on.
& attempts to answer
questions

DATA ANALYSIS ANALYTICS

Source: A General Introduction to Data Analytics, Wiley & ChartMogul


Lesson 1: Business Intelligence Terminology
Data, Analysis, Analytics

5 months ago, Bank ABC decreased totally Top 4 reasons due to Attrition in Bank:
10.200 bio. VND of Loan portfolio in Attrition (1) Dissatisfaction about services (50%)
(2) Lower rate in another banks (30%)
(Ending Loan portfolio = Beginning Loan + (3) Change another loan package in the bank(10%)
(4) Death (10%)
New loan – Attrition - Maturity)
Lesson 1: Business Intelligence Terminology
Understanding Data – Categories of Data
Lesson 1: Business Intelligence Terminology
Understanding Data – Structures of Data

Cross-sectional Data Time Series Data


• Data collected from several • Data collected over several time
elements/entities at the periods (Year, Month, Day, Hour…).
same, or approximately the • Charts of time series data are
same, point in time. common in business and economics.
• Help analysts understand what
Sep 22, 2015 happened in the past, identify trends
GOOG YHOO FB Industry over time, and project future levels for
Market Cap:
Employees:
426.88B
57148
28.62B
12500
261.91B 277.63M
10955 355
the time series.
Qtrly Rev Growth (yoy): 0.11 0.15 0.39 0.15
Revenue (ttm): 69.61B 4.87B 14.64B 132.20M
Gross Margin (ttm): 0.62 0.67 0.83 0.58
EBITDA (ttm): 22.62B 541.75M 6.38B 3.47M
Operating Margin (ttm): 0.26 0.02 0.32 0.01
Net Income (ttm): 14.39B 6.94B 2.72B N/A
EPS (ttm): 21.22 7.2 0.98 0
P/E (ttm): 29.34 4.22 94.47 33.33
PEG (5 yr expected): 1.22 -2.38 1.59 1.07
P/S (ttm): 6.26 6.02 18.39 3.74
Lesson 1: Business Intelligence Terminology
Understanding Data – Structures of Data
Lesson 1: Business Intelligence Terminology
Understanding Data – Structures of Data
Lesson 1: Business Intelligence Terminology
Understanding Data – Structures of Data
Lesson 1: Business Intelligence Terminology
Understanding Data – Structures of Data
Lesson 1: Business Intelligence Terminology
Understanding Data – Data face
Lesson 1: Business Intelligence Terminology
Understanding Data – Data Sources

Computer files Database Web-based


Lesson 1: Business Intelligence Terminology
Understanding Data – Importance of Data Types
Lesson 1: Business Intelligence Terminology
Understanding Data – Data Types

String data can be Numeric data are Date/time contains a The Boolean type is Images
declared in a number numbers which can specific date, or a sometimes also called Maps
of different ways be whole numbers, combination of both a logical type and is a Report objects
depending on the such as Integers or date and time conditional flag Sound
character set required numbers with decimal representing either
and the anticipated places true or false
length of the string: Byte
any kind of Integer
characters, Fixed Decimal
alphanumeric, Float
including symbols. Double
Lesson 1: Business Intelligence Terminology
Understanding Data – Data Types Exercise
Lesson 1: Business Intelligence Terminology
Understanding Data – Data Types (Excel vs DAX)

Data type in Excel Data type in DAX Description

Whole Number A 64 bit (eight-bytes) Numbers that have no decimal places. Integers can be positive or negative numbers, but must be
integer value 1, 2 whole numbers between -9,223,372,036,854,775,808 (-2^63) and 9,223,372,036,854,775,807 (2^63-1).
Decimal Number A 64 bit (eight-bytes) Real numbers are numbers that can have decimal places. Real numbers cover a wide range of
real number 1, 2 values:
Negative values from -1.79E +308 through -2.23E -308
Zero
Positive values from 2.23E -308 through 1.79E + 308
However, the number of significant digits is limited to 15 decimal digits.
TRUE/FALSE Boolean Either a True or False value.
Text String A Unicode character data string. Can be strings, numbers or dates represented in a text format.
Maximum string length is 268,435,456 Unicode characters (256 mega characters) or 536,870,912
bytes.

Date Date/time Dates and times in an accepted date-time representation.


Valid dates are all dates after January 1, 1900.
Currency Currency Currency data type allows values between -922,337,203,685,477.5808 to 922,337,203,685,477.5807
with four decimal digits of fixed precision.

N/A Blank A blank is a data type in DAX that represents and replaces SQL nulls. You can create a blank by
using the BLANK function, and test for blanks by using the logical function, ISBLANK.
1. BUSSINESS INTELLIGENCE INTRODUCTION

01 02 03 04
BI Terminology Business Intelligence in Corporates Power Query Data Preparation
1. Data issues
1. Self-service BI & Analytics 1. Get Data
1. BI vs BA 2. Analytics structure and Coordination Model 2. Data formatting
2. Profile Data
2. Technology in BI 3. BI Success (Organization) 3. Data blending
4. BI Success (Individual) 3. Transform Data
3. Data, Analysis, Analytics 4. Data Preparation Checklist
5. The Evolution of Business Intelligence
4. Understanding Data 6. How can data analytics help organizations?
7. BI process
8. Decision making with BI
9. Decision Bias
Lesson 2: Business Intelligence in Corporates
Self-service BI & Analytics

The Hardest Thing In Data Science


Math isn’t the hardest thing in Data
Science. Actually, since it’s so mature,
and documented, and well-known, it’s
quite possibly the easiest thing to conquer
in the skillset. No, the hardest thing about
Data Science is asking the right question.
Lesson 2: Business Intelligence in Corporates
Self-service BI & Analytics

ORGANIZATIONS EXIST TO
CREATE VALUE
ORGANIZATIONS HAVE TO BE
Understanding

Creating value is taking what you know QUICK AND NIMBLE

Agile BI
and turning it into action in order to
achieve a desired business outcome. Agile BI (speed-to-value)
Now, more than ever, business leaders
 Data must be relevant need access to the right information at
 Information must be meaningful the right time in order to act before
 Insight must be actionable decision windows close.

ANALYTICS IS A JOURNEY TO EFFECTIVE SELF-SERVICE IS A


VALUE ATTAINMENT BALANCING ACT BETWEEN
FREEDOM AND CONTROL

Describing
Defining

Broaden BI usage while reducing the


burden on IT. These companies have
learned that the goal of self-service is not
unfettered liberation from IT, but rather a
partnership that balances freedom and
control, flexibility and standards,
governance and self-service 25
Lesson 2: Business Intelligence in Corporates
Self-service BI & Analytics
Lesson 2: Business Intelligence in Corporates
Self-service BI & Analytics
Lesson 2: Business Intelligence in Corporates
Analytics structure and Coordination Model

Business Intelligence Discovery

Combine Sources Instant Association

Operational Data Historical Data External Data


E-
POS SCM EDW 1 EDW 2
COMM

MMS ERP WMS ODS 1 ODS 2 28


Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations

Functions: Roles:
– Reporting – Database Analysts
– Ad-hoc Analytics – Data Analysts
– Modeling – Modelers
– Data Scientists
– Etc.
Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations
How centralized or decentralized should these
organizations be?

Shading shows where


analytics are
executed

Centralized
Functions

Collaboration
Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations –
In a Centralized model, a set of analytical
Centralized activities are accomplished through a
central clearinghouse
Example: An enterprise analytics team
serves the needs of marketing, finance,
operations, customer care, etc. with
respect to reporting, ad-hoc analysis, and
statistical modeling
Key Advantages: Key Disadvantages
– Consistency – Responsiveness
– Optimal management of – Lack of context / expertise
bandwidth & focus on can limit effectiveness in more
enterprise priorities complex tasks
– Maximum efficiency in low- – Requires large group and
level tasks consistent overall loading
Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

In an Allocated model, analytical activities


Analytical are accomplished by a common team, but
Organizations – specific capacity is reserved for each
Allocated functional area served
Example: In the same enterprise team, at
least one resource is “assigned” to each
functional area, and takes priority from that
group
Key Advantages: Key Disadvantages
– More responsive – Difficult to match allocation
– More context and with enterprise priority
accelerated development of – Requires large central group
domain expertise and consistent overall loading
– Consistency
– Some load balancing possible
across needs
Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

In a Coordinated model, analytical


Analytical activities are accomplished by teams in
Organizations – each functional group, but those groups use
centrally defined processes & methods
Coordinated
Example: Analytics teams located in
Finance, Marketing & Operations regularly
convene in a users group and participate in
an enterprise-level Data Governance
program
Key Advantages: Key Disadvantages
– Highly responsive – Coordination difficult
– High degree of context and – Effort & data duplication more
expertise attained likely
– Some degree of consistency – Requires a larger overall
maintained number of resources, and
harder to adapt resource
levels to enterprise needs
Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical In a Distributed model, analytical activities


are accomplished by separate teams in
Organizations – each functional group, with little or no
Distributed coordination
Example: The Business and Consumer
divisions of a large bank each have their
own independently managed analytics
function(s)

Key Disadvantages
Key Advantages:
– Lack of consistency in methods and
– Extremely responsive
sources
– High degree of context and
– Effort & data duplication very likely
expertise attained
– Requires largest overall resources, and
– Efficient localized used of
can be expensive, esp. when
contracting resources
contractors used
Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Centralized Allocated Coordinated Distributed

Smaller SIZE OF COMPANY Larger

Less Different DIFFERENCES IN METHODS More Different

Near PHYSICAL LOCATION Far

Low CONTEXT REQUIRED Very High


Lesson 2: Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Board of director

Departments Manufactory Marketing Sale Analytics team Planning R&D IT Finance

Planning &
Marketing Analytics Strategy Financial
Analytics Quality Control Sale analytics R&D analytics IT analytics
analytics Manager analytics analytics
Goal analytics

Expanded Marketing
QC Analyst Business Planning Financial
Team Sales Analyst R&D Specialist IT Specialist
Analyst Analyst specialist Analyst

Advanced
Analytics Result Analytis Analytics Result
Specialist

Data engineer

Core Team

Model of structure and coordination of personnel of analysis team members

Center of Excellence (CoE) model fits into the current context

- Main member (full time)


- Extensive members: Business, Information and Communication
Under the Board of Directors
Department, IT, R&D, Finance and Accounting, Human Resource
Department (concurrently)
Lesson 2: Business Intelligence in Corporates
3. BI Success (Organization)

01. The People Domain 03. The Technology Domain

• Organizational Alignment • The role of technology


• The organizational model • The self-service architecture
• The importance of collaboration • The place for big data
• The right team • The confusing analytics landscape
• The right support

02. The Process Domain 04. The Data Domain

• Setting priorities • Understanding data


• Balancing risk and reward • Preparing data
• Building requirements that work • Data on the edge
• Evaluate and Avolve • Understanding data science
• Sell the vision • Data privacy
• Control the Chaos
Lesson 2: Business Intelligence in Corporates
4. BI Success (Individual)
Lesson 2: Business Intelligence in Corporates
5. The Evolution of Business Intelligence

1st Generation 2nd Generation 3rd Generation

Centralized Decentralized Democratized


Lesson 2: Business Intelligence in Corporates
6. How can data analytics help organizations?
Lesson 2: Business Intelligence in Corporates
7. BI process

Context/ Big Data


Data Data Cleansing
Picture Dictionary

Data Analytics Dashboard/Ad-


Data Model Visual
Taxonomy hoc/ML Results

Insights Data
Focus
(Declutter) Storytelling
Lesson 2: Business Intelligence in Corporates
7. BI process
Lesson 2: Business Intelligence in Corporates
8. Decision Making with BI
Lesson 2: Business Intelligence in Corporates
9. Decision Bias

 Impressively large data sets


 The best analytics tools
 Careful statistical methods

Decision-making traps: people don’t carefully process every piece


of information in every decision. Instead, we often rely on heuristics—simplified
procedures that allow us to make decisions in the face of uncertainty or when
extensive analysis is too costly or time-consuming.

There are three main cognitive traps that regularly skew decision making:
• The Confirmation Trap
• The Overconfidence Trap
• The Overfitting Trap
Lesson 2: Business Intelligence in Corporates
9. Decision Bias

Common The Confirmation Trap The Overconfidence Trap The Overfitting Trap
traps
Mistakes This is a sobering prospect for decision makers: • More information=> increase overconfidence without Overfit models generally do a
• Pay more attention to findings that align with improving accuracy (More data in and of itself is not a suspiciously good job of explaining many
our prior beliefs guaranteed solution) nuances of what happened in the past,
• Ignore other facts/ patterns in the data • It prevents us from questioning our methods, but they have great difficulty predicting
motivation, and the way we communicate our findings the future
to others
• It makes it easy to underinvest in data and analysis;
when we feel too confident in our understanding, we
don’t spend enough time or money acquiring more
information or running further analyses.

Tips • Specify in advance the data and analytical • Compare your perfect experiment ideal with your In order to overcome this bias, you need
approaches on which you’ll base your actual data to understand where it might fall short. to discern between the data that matters
decision Identify places where you might be able to close the and the noise around it.
• Actively look for findings that disprove your gap with more data collection or analytical • Split data into train/validation/test
beliefs Techniques • Look for relationships that measure
• Don’t automatically dismiss findings that fall • Asking yourself why your analysis might be wrong, and important effects related to clear and
below your threshold for statistical or practical recommends that you do this for every analysis you logical hypotheses before digging into
significance. perform nuances.
• Assign multiple independent teams to analyze • Ask others with knowledge about the project to • Construct alternative narratives
the data separately, to determine whether imagine its failure a year into the future and to write
the differences are due to error, inconsistent stories of that failure. In doing so, you’ll benefit from the
methods, or bias. wisdom of multiple perspectives, while also providing
• Treat your findings like predictions, and test an opportunity to bring to the surface potential flaws in
them the analysis that you may otherwise overlook.
• Keep track of your predictions and systematically
compare them with what actually happens.
1. BUSSINESS INTELLIGENCE INTRODUCTION

01 02 03 04
BI Terminology Business Intelligence in Corporates Power Query Data Preparation
1. Data issues
1. Self-service BI & Analytics 1. Get Data
1. BI vs BA 2. Analytics structure and Coordination Model 2. Data formatting
2. Profile Data
2. Technology in BI 3. BI Success (Organization) 3. Data blending
4. BI Success (Individual) 3. Transform Data
3. Data, Analysis, Analytics 4. Data Preparation Checklist
5. The Evolution of Business Intelligence
4. Understanding Data 6. How can data analytics help organizations?
7. BI process
8. Decision making with BI
9. Decision Bias
Get and
Transform (M)
(20-25%)
DATA PREPARATION
DATA PREPARATION
DATA PREPARATION
DATA PREPARATION
DATA PREPARATION
DATA PREPARATION
DATA PREPARATION

What does Power Query do?

Power Query is an ETL tool (“a what?” I hear you say). ETL stands for Extract, Transform and Load. Let us look
at each of those words individually to get a better understanding.
•Extract – Data can be extracted from a variety of sources; databases, CSV files, text files, Excel workbooks,
specific cells on the same worksheet, websites and even some PDFs. Basically, if there is data stored
somewhere in a structured or semi-structured format, Power Query can get to it and pull it out.
•Transform – Once the data has been extracted in the previous step, it can be cleaned up (i.e., remove
spaces, split columns, change date formats, fill blanks, find and replace etc) and reshaped (i.e., unpivot,
remove columns etc). When data is extracted from different sources it is unlikely to be consistent, the
transform process is used to make it ready for use.
•Load – Once the data has been extracted and transformed, it needs to be put somewhere so that you
can use it. From an Excel perspective, it can be pushed into a worksheet, a data model, or another query.

To summarize, Power Query takes data from different sources and turns it into something which can be
used. As a tool, this is pretty useful already. But here is the best part. Once the ETL process has been
created, it can be run over and over again with a single click. Which can save hours of work every week.
DATA PREPARATION
DATA PREPARATION
DATA PREPARATION – Get Data
Overview of Supported Data Sources
DATA PREPARATION – Get Data

As of last count, Power Query


has over 140 distinct data
connectors, and that list is
growing all the time. Everything
from local databases; flat files;
cloud storage; Azure; Google;
Amazon; Hadoop; and more
can all be connected to. The
list goes on and on.
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data

If your data is local in any form,


there's also a small but necessary
extra step involved in setting up
report refreshes, and that is installing
a Power BI data gateway. This is the
tool included in the Power BI
subscription, and it's simply a bit of
software that lets Power BI reports
retrieve data from the database or
file location that is on a local server or
machine somewhere.

Now, if your data is not local, then


that means it resides in some type of
cloud service. This means that
typically, it's much easier and simpler
to connect to and refresh as well,
once the report is in the Power BI
service. Since all of the data is
already online, no additional
gateways or setup would be
required. Once you've built the report
and published it to powerbi.com.
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data

Tables are much safer and


more secure to connect
to because within an
Excel file, a table has a
defined range, meaning
that it has specific
headers information and
a certain number of rows
attached to that table.
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
Load Data - Append Tab
DATA PREPARATION – Get Data
Applied Steps
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
DATA PREPARATION – Get Data
Loading Data versus Using DirectQuery

Power BI import mode leverages


xVelocity compression engine, so
which compresses the data which is
available in your data model. For
example, sometimes you have a 1 GB
Excel file and when you import it into
Power BI, your Power BI file ends up
with only 10 MB, so this is mainly
because of the compression engine
of Power BI.
DATA PREPARATION – Get Data
Loading Data versus Using DirectQuery
DATA PREPARATION – Get Data
Loading Data versus Using DirectQuery
DATA PREPARATION – Get Data
Working with DAX Functions and Formulas
DATA PREPARATION – Get Data
Loading Data versus Using DirectQuery
DATA PREPARATION – Get Data
Working with Composite Models
DATA PREPARATION – Get Data
Working with Composite Models
DATA PREPARATION – Get Data
Working with Composite Models
DATA PREPARATION – Get Data
Change Data Source Settings
DATA PREPARATION – Get Data
Parameters

Power BI allows you to use parameters to make your reports dynamic.

Adding Connection-Specific Parameters

Using Parameters to Connect to a Data Source

Adding Parameters to Filter Data

Adding Parameters to Control Statement Logic

Using Parameters to Name Dataset Object

Working with Parameters in Data View


DATA PREPARATION – Get Data
Parameters
DATA PREPARATION – Get Data
Choose Columns/Tables
DATA PREPARATION – Transform Data
Get and
Transform (M)
(20-25%)
DATA PREPARATION – Profile Data
Identify data anomalies
DATA PREPARATION – Profile Data
Data profiling
DATA PREPARATION – Profile Data
Data profiling
DATA PREPARATION – Profile Data
Data profiling
DATA PREPARATION – Profile Data
Query Dependencies
Get and
Transform (M)
(20-25%)
DATA PREPARATION – Transform Data
DATA PREPARATION – Transform Data
DATA PREPARATION – Transform Data

Two parts:
1.Using data in the same workbook
2.Using data from external files and external workbooks.
Using data in the same workbook
DATA PREPARATION – Transform Data
DATA PREPARATION – Transform Data
DATA PREPARATION – Transform Data
Add/ Modify Columns

In reality, most
datasets will give us
the majority of the
data we want, but
there still might be a
bit of data that's
missing or that
needs to beslightly
modified.
DATA PREPARATION – Transform Data
Insert in Applied Steps
DATA PREPARATION – Transform Data
Move in Applied Steps
DATA PREPARATION – Transform Data
Insert in Applied Steps
DATA PREPARATION – Transform Data
Add/ Modify Columns
DATA PREPARATION – Transform Data
Merge
DATA PREPARATION – Transform Data
Column Data Type and Formatting

When you load data into Power BI


Desktop, it will attempt to convert the
data type of the source column into a
data type that better supports more
efficient storage, calculations, and
data visualization. For example, if a
column of values you import from Excel
has no fractional values then Power BI
Desktop will convert the entire column
of data to a whole number data type,
which is better suited for storing
integers. This concept is important
because some DAX functions have
special data type requirements. We
have already studied in detail
regarding various data types available
in DAX. While in many cases DAX will
implicitly convert order data for you,
there are some cases where it will not.
For instance, if a DAX function requires
a data type date and the data type for
a column is txt, the DAX function will
not work correctly.
DATA PREPARATION – Transform Data
Column Data Type and Formatting
DATA PREPARATION – Transform Data
Data Type
DATA PREPARATION – Transform Data
Data Type vs Data Format
DATA PREPARATION – Transform Data
Data Type

3 Basic steps

1. Change Column Name


2. Change Property Name
3. Detect Data Type
DATA PREPARATION – Transform Data
Detect Data Type
DATA PREPARATION – Transform Data
Detect Data Type
DATA PREPARATION – Transform Data
DATA PREPARATION - Transform Data
Filter
DATA PREPARATION - Transform Data
Filter
DATA PREPARATION - Transform Data
Filter
DATA PREPARATION - Transform Data
Filter
DATA PREPARATION - Transform Data
Filter
DATA PREPARATION – Transform Data
Joining or Combining Queries
DATA PREPARATION – Transform Data
DATA PREPARATION – Transform Data
DATA PREPARATION – Transform Data

Power Query (or Get & Transform as it’s known in Excel 2016 and
later)

Ken Puls, Mynda Treacy, Oz du Soliel were all talking about how
amazing Power Query is.

Excel 2016 was released and Power Query was integrated


directly into main Excel application (though it was now known by
the name of Get and Transform).

Power Query has an easy to use interface which is designed for


everyday users.

If you want to go deep into Power Query, there is a


programming language called “M” which you could learn. But
you can harness 99.9% of the power without needing it at all.
DATA PREPARATION – Transform Data

https://docs.microsoft.com/en-us/powerquery-m/
DATA PREPARATION – Transform Data
DATA PREPARATION – Transform Data
Writing Power Query Functions
DATA PREPARATION – Transform Data
Writing Power Query Functions
DATA PREPARATION – Transform Data
Writing Power Query Functions
DATA PREPARATION – Transform Data
Writing Power Query Functions
DATA PREPARATION
Replace
DATA PREPARATION – Transform Data
Sorting Data in Columns and Data Categorization
DATA PREPARATION – Transform Data
Sorting Data in Columns and Data Categorization
DATA PREPARATION – Transform Data
New Tables

Excel: Add as New Query => To Table


DATA PREPARATION – Transform Data
Data Categorization
1. BUSSINESS INTELLIGENCE INTRODUCTION

01 02 03 04
BI Terminology Business Intelligence in Corporates Power Query Data Preparation
1. Get Data 1. Data issues
1. Self-service BI & Analytics
1. BI vs BA 2. Analytics structure and Coordination Model 2. Profile Data 2. Data formatting
2. Technology in BI 3. BI Success (Organization) 3. Transform Data 3. Data blending
4. BI Success (Individual)
3. Data, Analysis, Analytics 4. Data Preparation Checklist
5. The Evolution of Business Intelligence
4. Understanding Data 6. How can data analytics help organizations?
7. BI process
8. Decision making with BI
9. Decision Bias
Lesson 4: Data Preparation
Why Data Cleansing?
Lesson 4: Data Preparation
Why Data Cleansing?
Lesson 4: Data Preparation
Why Data Cleansing?
Lesson 4: Data Preparation
Data Issues – Types of Data Issues

1. Dirty
Data

Data
Issues
2.
3.Outliers Missing
data
Lesson 4: Data Preparation
Data Issues – Dirty Data

Dirty Data contains some kind of errors in them, or in a format that’s unfriendly or unusable
Lesson 4: Data Preparation
Data Issues – Dirty Data: Extra Characters

Extra characters can be currency symbols, number signs… We’d need to remove these before
changing between field types
Lesson 4: Data Preparation
Data Issues – Dirty Data: Extra Characters (Example)
Lesson 4: Data Preparation
Data Issues – Dirty Data

No: Yes:

Addresses Address City State Zip


313 173rd Blvd, Kent, WA 981215 313 173rd Blvd Kent WA 981215 Raw Data: Data
316 66th Blvd, Kent, WA 981244 316 66th Blvd Kent WA 981244
stored in its smallest
4358 23rd St, Kent, WA 981225 4358 23rd St Kent WA 981225
965 151st St, Kent, WA 981162 965 151st St Kent WA 981162
size
7900 173rd Lane, Kent, WA 981266 7900 173rd Lane Kent WA 981266
4047 15th Ave, Kent, WA 981228 4047 15th Ave Kent WA 981228
4907 13th Ave, Kent, WA 981232 4907 13th Ave Kent WA 981232
3789 4th Blvd, Seattle, WA 981152 3789 4th Blvd Seattle WA 981152
2977 66th Lane, Seattle, WA 981171 2977 66th Lane Seattle WA 981171
3392 23rd St, Seattle, WA 981131 3392 23rd St Seattle WA 981131

Why?
Because it is easier to analyze data when it is stored in its smallest parts
Lesson 4: Data Preparation
Data Issues – Missing data

Missing data: gaps in data

Blank/ Empty cells (CSV) Null value (Database) N/A (program)

BIAS in statistics refers


to the tendency of an
analysis to either over
or under estimate the
values of that specific
field or parameter
Lesson 4: Data Preparation
Data Issues – Missing data (Example)

Real Data

Downward BIAS
Lesson 4: Data Preparation
Data Issues – Solutions for Missing data

SOLUTIONS
1. Deleting Missing Data
2. Imputation
3. Advanced methods
Lesson 4: Data Preparation
Data Issues – Missing data: Deleting Missing Data

Deleting Missing Data


Deleting missing data is often the default method
because it's simplicity. No decisions that need to be
made that might confuse the data. You just get rid
of records where there are missing values.

However, you should make sure that deleting


missing data doesn't have adverse effects on your
analysis. For example, if a particular demographic
tended to leave a response blank in a survey, then
removing records with blank entries will mean that a
part of the population is underrepresented.
One of the downsides is that eliminating missing
data reduces the size of the dataset (Ex: cost).
Lesson 4: Data Preparation
Data Issues – Missing data: Deleting Missing Data (Example)
Lesson 4: Data Preparation
Data Issues – Missing data: Deleting Missing Data (Example)

Red colors: Age & Income are


Strings => Check in Meta Data
Lesson 4: Data Preparation
Data Issues – Missing data: Deleting Missing Data
Effect of Deletion on Model

Raw data Deleted missing data


Lesson 4: Data Preparation
Data Issues – Missing data: Imputation

Imputation
In statistics, Imputation is the process of
substituting values in the data where the
value are missing (we impute values, we
are making them up). We are creating
fake data in order to develop a model
that makes sense and is as close to
reality as we can get it
Lesson 4: Data Preparation
Data Issues – Missing data: Imputation
Lesson 4: Data Preparation
Data Issues – Missing data: Imputation

Data types
Nominal -- categorical data , gender, type of school.(categories)
Ordinal data-- Ordinal Number is a number that tells the position of something in a list, such as 1st, 2nd, 3rd,
4th, 5th etc.
Interval data -- 6-7,8-9,100-200
Ratio -- Ex: Number of Applications/Number of Loans
Lesson 4: Data Preparation
Data Issues – Missing data: Imputation
Lesson 4: Data Preparation
Data Issues – Missing data: Imputation (Example)
Lesson 4: Data Preparation
Data Issues – Missing data: Advanced methods

If your business and results could be significantly off by using a simpler method,
you might want to explore these options

Missing values aren’t


actually replaced, but
they’re handle within the
modeling process itself
Blend models together
Lesson 4: Data Preparation
Data Issues – Missing data: Advanced methods (Excel)
Lesson 4: Data Preparation
Data Issues – Missing data: Selecting the method

What methodology might be the best approach


1. How much data is really missing? (>=20%)
2. How the missing data is distributed across the dataset? (2/10 predictor variables missed)
3. Whether those specific variables are actually significant to our analysis and model making
process
4. The missing data is numeric or categorical
Lesson 4: Data Preparation
Data Issues – Outliers

Identifying outliers in the data helps us understand how vulnerable our model would be to a small
set of observations.
Lesson 4: Data Preparation
Data Issues – Outliers: Identify

Identifying outliers more methodically rather than simply eyeballing them


Violin Plot: shows the volume of the distribution
Others: z-scores or standard deviations
Lesson 4: Data Preparation
Data Issues – Outliers: Identify
Lesson 1: What should we do with Data?
Data Issues – Outliers: Identify
Lesson 4: Data Preparation
Data Issues – Outliers: Identify

If a value is 1.5 times the INTERQUARTILE RANGE of a data set, then it


can be considered an OUTLIER
Lesson 4: Data Preparation
Data Issues – Outliers: Identify
Lesson 4: Data Preparation
Data Issues – Outliers: Identify
Lesson 4: Data Preparation
Data Issues – Outliers: Dealing with outliers

3. Don’t have obvious errors,


1 & 2/ ERRORS but we aren’t certain whether
the data is accurate or not

1. Try to go back to the original 2. Delete the record from the


source to determine the dataset
correct data

Ex: Age: 299


Lesson 4: Data Preparation
Data Issues – Outliers:
Effect of outliers & Dealing with outliers

Could be correct, but it’s just abnormal then the analysis and Outliers didn’t change the
modeling process SHOULD INCLUDE that data. That said, it is results, the regression line
legitimate to create models without the data as well to compare retained its original lope, then
results, but it should be noted which models do and don’t contain it can be legitimate to
the outliers (2 options: include outliers vs exclude outliers) remove that observation
Lesson 4: Data Preparation
Data Issues – Outliers: Dealing with outliers

4. Truncation NOTE: We see here where age and income are fairly
random with no association between how old a
person is and how much income they have. But the
outlier creates the slope of the line by just being
present... so without outlier 1 (row 10), we have a
steep positive slope, but without outliers 2 (row 14)
and 3 (row 15), we have a negative slope.
In other words without the outlier we wouldn’t really
be able to draw a legitimate line at all, but the
presence of the outlier is what creates the model
effect. In cases such as this, we should definitely
Where we know that a certain value can remove the outlier and investigate other predictor
only be below a given maximum and yet variables.
a value is reported above that
Lesson 4: Data Preparation
Data Formatting

• How to identify when your data needs to be formatted.


• How to massage data into the correct format
• How to aggregate it to the form required

1. Transposing
2. Aggregating Data
3. Cross Tabulation
Lesson 4: Data Preparation
Data Formatting - Transposing
Lesson 4: Data Preparation
Data Formatting – Transposing - Example
Lesson 4: Data Preparation
Data Formatting – Transposing - Example

Power Query
(1) Select columns => Right Click => Unpivot
(2) Select other columns => Right Click => Unpivot other columns
Lesson 4: Data Preparation
Data Formatting - Aggregating Data
Lesson 4: Data Preparation
Data Formatting - Aggregating Data - Example
Lesson 4: Data Preparation
Data Formatting - Cross Tabulation
Lesson 4: Data Preparation
Data Formatting - Cross Tabulation - Example
Lesson 4: Data Preparation
Data Blending

Data may come from different places,


and as a results, it’ll all need to be
stitched together into one data file
Lesson 4: Data Preparation
Data Blending – Unions

Unioning allows you to take multiple datasets and deal with them as one
Lesson 4: Data Preparation
Data Blending – Joining Datasets
Lesson 4: Data Preparation
Data Blending – Fuzzy Matching

Fuzzy Matching will enable you to join 2 data sets


together where a regular join may fail. The Fuzzy
Match identifies records with similar string values
in specified fields.

Fuzzy Matching uses algorithms to score how


similar 2 words or phrases are.

Fuzzy Matching Algorithms


Jaro: The Jaro algorithm is a measure of MATCHES
characters in common, being no more than half
the length of the longer string in distance, with
consideration for transpositions.
Levenshtein: The Levenshtein algorithm counts the
number of edits (insertions, deletions, or
substitutions) needed to convert one string to the
other.
Lesson 4: Data Preparation
Data Blending – Fuzzy Matching - Example

It looks at these words and calculate a closeness of match score


based on the similarity of these words.

The match threshold is the minimum score achieved by the fuzzy matching for
it to be considered to be a match
Lesson 4: Data Preparation
Data Blending – Fuzzy Matching - Example
Lesson 4: Data Preparation
Data Blending – Fuzzy Matching - Example
Lesson 4: Data Preparation
Data Blending – Spatial Matching

Types of Spatial Data


All of these location data examples are represented by points, lines, or polygons

Points Lines Polygons


A point, also referred to as a A line is a string of latitudes Polygons are made up of a series of
centroid, is in the form of a latitude and longitude locations. longitude and latitude coordinates
and longitude which we use to defining all of the vertices of a region.
pinpoint its exact location.
Lesson 4: Data Preparation
Data Blending – Spatial Blending

There aren’t fields that can be Gray area: How many customers fall
used to join them together within a store trade area is to match
them and assign a store number to them
Lesson 4: Data Preparation
Data Blending – Spatial Blending - Example

Customer Information

Spatial Data
Lesson 4: Data Preparation
Data Problems Checklist

CREATING AN ANALYTICAL DATASET Issues 1st Fix-date 2nd Fix-date


Data Source Enough Data
Up to Date
Data Types Data Types correctly
Data Issues Dirty Data Not Parsed Correctly
Extra characers
Unexpected Pattern
Incorrect Data
Duplicate Data Records
M isspelled Entries
Missing Data Deleting M issing Data
Imputation
Advanced methods

Outliers Errors: Cross-check & fix


Errors: Delect
No certainty: Remove if Insignificant
Certainty: Truncation

Data Formatting Transposing


Aggregating Data
Cross Tabulation

Data Blending Unions


Joins
Fuzzy M atching
Spatial M atching
Lesson 4: Data Preparation
Data Problems Checklist

You might also like