UNIT IV
Data Cleaning Techniques
Data Munging - Getting, cleaning, and shaping data - Cleanse data - Merge, shape, and filter data -
Group and aggregate data - Insert calculated columns
Power BI Data Model
Creating data Model - Explain what a data model is - Create relationships between tables in the
model - Create and use a star schema - Understand when and how to de-normalize the data –
Create and use linked tables
Data Cleaning Techniques
Data Munging
Data munging, also known as data wrangling, is the process of cleaning and transforming data in
Power BI to make it suitable for analysis and visualization.
This involves tasks like
• Handling missing values
• Removing duplicates
• Fixing inconsistencies
• Formatting data into the correct types.
Data Cleaning:
This involves identifying and addressing issues like missing data, duplicate rows, inconsistent
formatting, and incorrect data types.
Data Transformation:
This involves reshaping data (e.g., pivoting or unpivoting columns), combining data from
multiple sources, and creating new columns based on existing ones.
Data Formatting:
This ensures that data is in a consistent and usable format for analysis, including data type
conversion and standardization of text and numbers.
Power Query Editor:
Power BI's primary tool for data munging, allowing you to connect to various data sources,
perform transformations, and clean data before loading it into the data model.
Common data munging tasks in Power BI:
• Handling Missing Values:
You can remove rows with missing data, replace missing values with a specific value, or
use more advanced techniques like interpolation.
• Removing Duplicates:
You can easily remove duplicate rows from your dataset using the Power Query Editor.
• Fixing Inconsistencies:
This might involve standardizing text formats, correcting spelling errors, or converting
values to the correct data type.
• Transforming Data:
You can use Power Query's features to pivot, unpivot, aggregate, and split columns.
• Creating Calculated Columns:
You can create new columns that are derived from existing data, using DAX (Data
Analysis Expressions) formulas.
• Aggregating Data:
You can group data by specific columns and perform calculations like sum, average, or
count.
• Data Blending:
You can combine data from multiple sources into a single dataset for analysis.
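The tasks listed above can be sketched in plain Python to show the logic involved. Power Query performs each of them through ribbon actions rather than code, and the product names and prices below are made-up sample data:

```python
# Plain-Python sketches of common munging tasks: handling missing values,
# removing duplicates, and aggregating. All sample values are invented.
rows = [
    {"Product": "iPhone 8", "Price": 49900},
    {"Product": "iPhone 8", "Price": 49900},   # duplicate row
    {"Product": "iPhone X", "Price": None},    # missing value
]

# Handling missing values: replace None with a default value
filled = [dict(r, Price=r["Price"] if r["Price"] is not None else 0) for r in rows]

# Removing duplicates: keep the first occurrence of each identical row
seen, deduped = set(), []
for r in filled:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Aggregating: group by Product and sum Price
totals = {}
for r in deduped:
    totals[r["Product"]] = totals.get(r["Product"], 0) + r["Price"]
```

In Power Query the same results come from the Replace Values, Remove Duplicates, and Group By commands, with each action recorded as an applied step.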
Benefits of Data Munging:
• Improved Data Quality:
Data munging ensures that your data is accurate, consistent, and ready for analysis, leading
to more reliable insights.
• Efficient Analysis:
Cleaned and transformed data is easier to analyze and visualize, saving time and effort.
• Accurate Reports and Dashboards:
Data munging ensures that your reports and dashboards are based on accurate and reliable
data.
Clean, transform, and load data in Power BI
Power Query has an incredible number of features that are dedicated to helping you clean and
prepare your data for analysis. With the Power Query Editor, you will learn:
• How to simplify a complicated model
• How to change data types
• How to rename objects and pivot data
• How to profile columns so that you know which columns have the valuable data for deeper
analytics
Learning objectives: By the end of this module, you'll be able to:
• Resolve inconsistencies, unexpected or null values, and data quality issues.
• Apply user-friendly value replacements.
• Profile data so you can learn more about a specific column before using it.
• Evaluate and transform column data types.
• Apply data shape transformations to table structures.
• Combine queries.
• Apply user-friendly naming conventions to columns and queries.
• Edit M code in the Advanced Editor.
Introduction
Consider the scenario where you have imported data into Power BI from several different sources
and, when you examine the data, it is not prepared for analysis. What could make the data unprepared
for analysis?
When examining the data, you discover several issues, including:
• A column called Employment status only contains numerals.
• Several columns contain errors.
• Some columns contain null values.
• The customer ID in some columns appears as if it was duplicated repeatedly.
• A single address column has combined street address, city, state, and zip code.
You start working with the data, but every time you create visuals on reports, you get bad data, incorrect
results, and simple reports about sales totals are wrong.
Dirty data can be overwhelming and, though you might feel frustrated, you decide to get to work and
figure out how to make this semantic model as pristine as possible.
Fortunately, Power BI and Power Query offer you a powerful environment to clean and prepare the
data. Clean data has the following advantages:
• Measures and columns produce more accurate results when they perform aggregations and
calculations.
• Tables are organized, where users can find the data in an intuitive manner.
• Duplicates are removed, making data navigation simpler. It will also produce columns that can be
used in slicers and filters.
• A complicated column can be split into two simpler columns, and multiple columns can be
combined into one column for readability.
• Codes and integers can be replaced with human-readable values.
Shape the initial data
Power Query Editor in Power BI Desktop allows you to shape (transform) your imported data. Power
Query can perform actions such as:
• renaming columns or tables
• changing text to numbers
• removing rows
• setting the first row as headers, and much more
Note: It is important to shape data to ensure that it is suitable for use in reports.
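As a rough sketch of what these shaping actions do, here is the same logic in plain Python. Power Query performs them through menu commands, and the sample rows below are invented for illustration:

```python
# Shaping raw rows as they might arrive from a CSV: set the first row as
# headers, remove an unwanted row, rename a column, and change text to numbers.
raw = [
    ["product", "units"],    # header row mixed in with the data
    ["iPhone 8", "12"],
    ["iPhone X", "n/a"],     # row we want to remove
]

headers = raw[0]                                       # set first row as headers
records = [dict(zip(headers, row)) for row in raw[1:]]
records = [r for r in records if r["units"] != "n/a"]  # remove rows
records = [{"Product": r["product"],                   # rename columns
            "Units": int(r["units"])}                  # change text to numbers
           for r in records]
```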
You have loaded raw sales data from two sources into a Power BI model. Some of the data came from
a .csv file that was created manually in Microsoft Excel by the Sales team. The other data was loaded
through a connection to your organization's Enterprise Resource Planning (ERP) system. Now, when you
look at the data in Power BI Desktop, you notice that it's in disarray; some data that you don't need and
some data that you do need are in the wrong format.
You need to use Power Query Editor to clean up and shape this data before you can start building reports.
Get started with Power Query Editor
To start shaping your data, open Power Query Editor by selecting the Transform data option on
the Home tab of Power BI Desktop.
In Power Query Editor, the data in your selected query displays in the middle of the screen and, on
the left side, the Queries pane lists the available queries (tables).
An untitled Power Query Editor window appears.
Select New Source
Select New Source > Excel Workbook.
Browse to the Excel workbook and select the file apple_products.xlsx.
The Excel data appears in the middle of the screen; then click OK.
Now we can observe the dataset as above.
Identify column headers and names
Step 1:
Identify the column headers and names within the data.
Then check whether they are located in the right place; if not, make them appear in the right place.
Case #1
In some cases, where the column headers and names are jumbled, reorganize them so the data is
organized correctly.
In our case we use apple_products.xlsx.
Observe the data.
Here we have multiple columns; we need to observe them carefully to see which ones we require for
analysis and which we do not.
In our case, you can observe that the second column (Product_URL) is not essential for us, so we can
remove the column Product_URL.
Click on the column Product_URL.
Now right-click and select the Remove option.
The column Product_URL is removed.
Now it is time to split the first column (Product Name) as per our requirement.
We split values such as:
APPLE iPhone 8 Plus (Gold, 64 GB)
Now select the Product Name column, right-click, go to Split Column, and select By Delimiter.
A new screen appears.
Now we select the space as the delimiter, choose the option Split at the left-most delimiter, and click OK.
Now select the rest of the product name, go to Split Column, enter ( as a custom delimiter, and select
Split at the left-most delimiter.
Similarly, select the next column, set the delimiter to ) and split at the right-most delimiter.
Now we have Gold, 64 GB. Similarly, select the column, set the delimiter to comma, and split at the
left-most (or right-most) delimiter.
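The sequence of delimiter splits above can be sketched in plain Python. The splits mirror Power Query's Split Column > By Delimiter options, and the sample value is the one from the walkthrough:

```python
# Split "APPLE iPhone 8 Plus (Gold, 64 GB)" using the same delimiter choices
# as the walkthrough: space (left-most), "(" (left-most), ")" (right-most),
# then "," to separate color from storage.
def split_product_name(name):
    brand, rest = name.split(" ", 1)          # split at left-most space
    model, rest = rest.split("(", 1)          # split at left-most "("
    specs = rest.rsplit(")", 1)[0]            # drop text after right-most ")"
    color, storage = [s.strip() for s in specs.split(",", 1)]
    return {"Brand": brand, "Model": model.strip(),
            "Color": color, "Storage": storage}

row = split_product_name("APPLE iPhone 8 Plus (Gold, 64 GB)")
```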
Now you can observe this column: we have accessories along with the phone.
It is time to separate the accessories, EarPods and Power Adapter, into separate columns.
Now observe this column, which contains the word Includes, which we need to remove.
Select space as the delimiter and split at the right-most delimiter instead of the left-most delimiter.
Remove the column that contains (Includes.
Now we move the Brand column to be the first column: click at the top of the column and drag it to
the left until Brand becomes the first column.
Now the first and second columns contain the same data, so we can remove the second column.
Now it's time to rename the columns. First go to the second column, Product Name 2.1, which we need
to rename.
Product Name 2.1 is renamed as Model.
Similarly, select the next column, Product Name.2.2.1.1, and rename it as Color.
Change all the columns that were split earlier as per our requirement.
The above is an empty column, which we need to remove.
Now we can observe that one cell has (PRODUCT)RED, which is to be replaced with Red.
Go to that cell and right-click.
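A minimal sketch of what this replacement does, in plain Python. Power Query offers this as the Replace Values command; the surrounding color values here are invented:

```python
# Replace every occurrence of "(PRODUCT)RED" in the Color column with "Red".
colors = ["Gold", "(PRODUCT)RED", "Silver"]
colors = ["Red" if c == "(PRODUCT)RED" else c for c in colors]
```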
This is all about splitting and renaming columns. Now go to File and select Save As.
Give the file an appropriate name; it is saved as a Power BI file with the extension .pbix.
Now it is time to find the missing values
When you work in Power Query Editor, all steps that you take to shape your data are recorded.
Then, each time the query connects to the data source, it automatically applies your steps, so your
data is always shaped the way that you specified. Power Query Editor only makes changes to a
particular view of your data, so you can feel confident about changes that are being made to your
original data source. You can see a list of your steps on the right side of the screen, in the Query
Settings pane, along with the query's properties.
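A rough sketch of this applied-steps model in plain Python, with hypothetical step and column names, showing how recorded steps re-shape a fresh copy of the source on each refresh while leaving the original data untouched:

```python
# Each shaping action is recorded as a step; a refresh re-applies all steps,
# in order, to a copy of the source rows. Step and column names are invented.
def drop_null_sales(rows):
    return [r for r in rows if r["Sales"] is not None]

def tidy_region(rows):
    return [dict(r, Region=r["Region"].strip().title()) for r in rows]

steps = [drop_null_sales, tidy_region]      # the recorded "Applied Steps" list

def refresh(source_rows):
    rows = [dict(r) for r in source_rows]   # work on a copy of the source
    for step in steps:
        rows = step(rows)
    return rows

source = [{"Region": " north ", "Sales": 100},
          {"Region": "south", "Sales": None}]
shaped = refresh(source)
```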
The Power Query Editor ribbon contains many buttons you can use to select, view, and shape
your data.
To learn more about the available features and functions, see The query ribbon.
With Power BI Desktop, you can connect to many different types of data sources, then shape the
data to meet your needs, enabling you to create visual reports to share with others. Shaping data means
transforming the data: renaming columns or tables, changing text to numbers, removing rows, setting the
first row as headers, and so on.
Combining data means connecting to two or more data sources, shaping them as needed,
then consolidating them into a single query.
What is data quality?
First, understand some key basics of what will affect your data quality.
Data quality is essential for accurate analysis, informed decision-making, and successful business
outcomes.
Understanding factors that affect data quality and recognizing the importance of data cleaning are crucial
steps in the data preparation process.
In general, several factors describe and make up the quality of a dataset for analysis:
Data accuracy:
Data accuracy means the extent to which data represents the true values and attributes it is
intended to capture, indicating the degree to which it aligns with the true, real-world information it
seeks to represent.
Factors such as human errors during data entry, system glitches, or outdated information can
compromise data accuracy.
Data completeness:
This describes the degree to which all required data elements are present in a dataset.
Missing or incomplete data can occur due to data collection errors, system limitations, or data
integration challenges.
Data consistency:
The uniformity and coherence of data across different sources or datasets.
Inconsistencies may arise from variations in data formats, naming conventions, or conflicting data
definitions.
Data validity:
Refers to the extent to which data conforms to defined rules, constraints, or standards. Invalid data
can result from data entry errors, data integration issues, or changes in business rules.
Data timeliness:
The relevance and currency of data in relation to the analysis or reporting time frame.
Outdated or stale data can lead to inaccurate insights and hinder decision-making.
Where do data quality issues come from?
Data quality issues can arise from various sources throughout the data life cycle.
Some common origins of data quality issues:
Data entry errors:
Mistakes made during manual data entry processes can introduce errors such as typos,
misspellings, or incorrect values.
Human error, lack of training, or inadequate validation mechanisms can contribute to data entry
issues.
Incomplete or missing data:
Data may be incomplete or have missing values due to various reasons, such as data collection
processes that fail to capture all required information, data entry omissions, or system
limitations that prevent data collection.
Data integration challenges:
When combining data from multiple sources or systems, inconsistencies can arise due to
differences in data formats, naming conventions, or data structures. Mismatched or
incompatible data elements can lead to data quality issues.
Data transformation and manipulation:
Data transformations, such as aggregations, calculations, or data conversions, can introduce errors
if not implemented correctly.
Issues can arise from improper formulas, incorrect assumptions, or errors in the data manipulation
process.
Data storage and transfer:
Unreliable storage systems may lead to data corruption, loss, or unauthorized access, impacting
data quality. Events such as hardware failures or system crashes can result in data loss, affecting
completeness and accuracy. Delays in data transfer may lead to latency issues, with outdated or
stale data impacting the accuracy of analyses. Lastly, incompatibility between systems during data
transfer can cause format mismatches and structural issues, introducing inaccuracies.
Data governance and documentation:
Inadequate data governance practices, including a lack of data standards, data definitions, or
metadata documentation, can lead to misunderstandings or misinterpretations of data, resulting in
data quality problems.
Poor documentation of data lineage makes it challenging to trace the origin of quality issues and
prevent future problems.
Data changes and updates:
As data evolves over time, changes in business rules, system updates, or modifications to data
sources can impact data quality. Data may become outdated, inconsistent, or no longer aligned
with the intended use.
External data sources:
When incorporating data from external sources, such as third-party providers or open datasets,
data quality issues may arise.
Inaccurate or unreliable data from external sources can affect the overall data quality.
The role of data cleaning in improving data quality
While data cleaning is often handled by data professionals or analysts, the responsibility for ensuring
clean data extends beyond a specific team or department. Data cleaning should be considered a shared
responsibility within a company, involving stakeholders from all levels and functions.
Data integrity and accuracy
Data cleaning plays a vital role in maintaining data integrity and accuracy.
Inaccurate or inconsistent data can lead to flawed analysis, flawed decision-making, and potential
business risks.
By recognizing data cleaning as a shared responsibility, all individuals working with data can
contribute to maintaining the integrity of the data they generate, use, or interact with.
Decision-making and business outcomes
Data serves as the foundation for informed decision-making.
When data quality is compromised, the decisions made based on that data are also compromised.
By acknowledging the impact of data quality on business outcomes, individuals across the
organization can understand the importance of data cleaning in facilitating accurate insights and
driving successful outcomes.
Data ownership and accountability
Every individual who interacts with data, regardless of their role, possesses a level of data
ownership and accountability.
By considering data cleaning as part of this ownership, employees become active participants in
maintaining data quality.
When employees take responsibility for the accuracy and cleanliness of the data they work with, a
culture of data stewardship is fostered within the organization.
A holistic view of the data ecosystem
Data flows across departments, systems, and processes within an organization.
Each touchpoint introduces the potential for data quality issues.
Recognizing data cleaning as everyone’s responsibility encourages individuals to consider the
broader data ecosystem and how their actions impact the quality of data used by others.
This holistic view promotes collaboration and communication to address data quality concerns.
Early detection of issues
Data cleaning is not just about rectifying existing issues;
it also involves proactively identifying and addressing data quality issues.
Employees on the front lines of data collection, entry, and analysis are often the first to notice
anomalies or inconsistencies.
By encouraging a culture where data issues are shared and addressed promptly, organizations can
mitigate the impact of poor data quality before it cascades into larger problems.
Continuous improvement and learning
Data cleaning presents opportunities for continuous improvement and learning.
When individuals actively participate in data cleaning, they gain insights into data patterns,
common errors, and areas for improvement.
This knowledge can be shared, leading to enhanced data collection processes, better data entry
practices, and improved data quality over time.
Empowerment and collaboration
Recognizing the shared responsibility of data cleaning empowers employees to take ownership of
the data they work with.
It fosters a sense of collaboration and accountability, as individuals understand that their actions
impact the overall data quality and, consequently, the success of the organization.
By leveraging the collective expertise and commitment of employees, organizations can
effectively address data quality challenges.
Best practices for data quality overall
Now consider how you can actually clean your data with Power BI.
Dirty data can have a significant impact on business operations, decision-making, and overall success.
To combat the challenges posed by dirty data, organizations must establish robust data cleaning practices.
The following are best practices that businesses can implement to effectively tackle dirty data and
ensure data quality throughout their operations.
Establishing data quality standards
Define clear data quality standards that align with your organization’s goals and objectives.
These standards should include criteria for accuracy, completeness, consistency, validity, and
timeliness.
Developing a data governance framework:
Develop a comprehensive data governance framework that outlines roles, responsibilities,
and processes for data management.
Establish data stewardship roles to oversee data quality initiatives and enforce data
governance policies.
This framework will help create a structured approach to data cleaning and ensure
accountability across the organization.
Implementing data validation and verification techniques:
Apply data validation and verification techniques to identify and resolve inconsistencies,
errors, and outliers.
Use automated validation rules, data profiling tools, and statistical analysis to check data
against predefined rules and validate its accuracy.
Implement regular data audits to identify data quality issues and take corrective actions.
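As an illustration, automated validation rules that check data against predefined constraints can be sketched in plain Python; the rule definitions and sample rows below are hypothetical:

```python
# Hypothetical validation rules: each column maps to a predicate that returns
# True when a value satisfies the rule.
rules = {
    "Age":   lambda v: isinstance(v, int) and 0 <= v <= 120,
    "Email": lambda v: isinstance(v, str) and "@" in v,
}

def audit(rows):
    """Return (row_index, column) pairs that violate a rule."""
    issues = []
    for i, row in enumerate(rows):
        for col, is_valid in rules.items():
            if not is_valid(row.get(col)):
                issues.append((i, col))
    return issues

problems = audit([
    {"Age": 34,  "Email": "a@b.com"},
    {"Age": 999, "Email": "not-an-email"},   # fails both rules
])
```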
Standardizing data entry processes:
Standardize data entry processes to minimize human errors and ensure consistent data
formats.
Implement data entry controls, such as drop-down menus, data validation lists, and
input masks, to guide data entry and prevent incorrect or inconsistent data.
Provide training and guidelines to employees on proper data entry practices.
Leveraging data cleaning tools and technologies:
Utilize data cleaning tools and technologies, such as Power Query in Power BI, to
automate the data cleaning process.
These tools can help identify and correct data errors, remove duplicates, handle missing
values, and perform various data transformations.
Invest in appropriate data cleaning solutions based on your specific requirements.
Collaborating across departments:
Data cleaning is a collaborative effort that involves various departments within an
organization.
Foster collaboration and communication between departments to address data quality
issues holistically.
Encourage cross-functional teams to share insights, exchange knowledge, and work
collectively toward data cleaning goals.
Continuous data monitoring:
Implement mechanisms for ongoing data monitoring to proactively identify and resolve
data quality issues.
Establish data quality metrics, set up alerts and notifications for data anomalies, and
regularly monitor data quality dashboards.
Continuously monitor data sources, data pipelines, and data integration processes to
maintain high data quality standards.
Data education and training:
Provide data education and training programs to equip employees with the necessary skills
and knowledge to understand and address data quality issues.
Offer training on data cleaning techniques, data entry best practices, data validation
methods, and the importance of data quality.
This education will empower employees to take ownership of data quality and contribute
to the battle against dirty data.
Ensure that all stakeholders understand and adhere to these standards, fostering a shared commitment to
data quality.
Combine data in Power BI Desktop
Open Get Data > From Web: https://www.fool.com/research/best-states-to-retire
Click on Web.
Paste the URL https://www.fool.com/research/best-states-to-retire and select OK to fetch the data
from the web.
Select Anonymous and then Connect.
What Does it Mean to Merge Tables in Power BI?
Merging tables in Power BI means combining two or more tables into a single table.
One of Power BI's unique abilities is to combine data from multiple sources into one unifying platform.
Whether your data comes from a warehouse, Excel files, third-party applications, or even a webpage,
Power BI can access it and merge it.
The beauty of Power BI is the simplicity of this process. You don't need to learn any complex syntax or
take a course in programming just to merge your data. The experience on Power BI is very intuitive and
easy to follow.
For example, you might have one table with customer details and another table with their purchase
history. By merging these tables, you can create a unified view that shows both customer information and
their purchases in one place.
In Power BI, you do this by using the "Merge Queries" feature. It allows you to join tables based on
common columns, like customer ID or product ID, so you can see all the related data together.
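A plain-Python sketch of what Merge Queries produces when joining on a common column; the customer and purchase rows below are made-up data matching the example in the paragraph above:

```python
# Join customer details to purchase history on the common CustomerID column,
# producing one row per purchase with the customer's details attached.
customers = [{"CustomerID": 1, "Name": "Asha"},
             {"CustomerID": 2, "Name": "Ben"}]
purchases = [{"CustomerID": 1, "Item": "Phone"},
             {"CustomerID": 1, "Item": "Case"},
             {"CustomerID": 2, "Item": "Laptop"}]

by_id = {c["CustomerID"]: c for c in customers}   # index customers by key
merged = [{**by_id[p["CustomerID"]], **p} for p in purchases]
```

In Power BI this corresponds to an inner join in the Merge Queries dialog, with the matched customer columns expanded afterward.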