DEV - Unit I Notes

MC25105 DATA EXPLORATION AND VISUALIZATION

Exploratory Data Analysis: Identifying Data Quality, Missing Values, Irregular Cardinality,
Outliers, Handling Data Quality, Describing Data, Preparing Data Tables, Understanding
Relationships, Identifying and Understanding Groups, Building Models from Data,
Significance, Classical and Bayesian Analysis.
Practical:
• Identification of missing values and detection of outliers
• Creation of summary table and visualization of data distribution
EXPLORATORY DATA ANALYSIS (EDA)
Exploratory Data Analysis (EDA) is an important step in data science and data analytics: it summarizes and visualizes data to understand its main features, find patterns and discover how different parts of the data are connected.
Why is Exploratory Data Analysis Important?
1. Helps to understand the dataset by showing how many features it has, what type of data each feature contains and how the data is distributed.
2. Helps to identify hidden patterns and relationships between different data points, which helps us in model building.
3. Allows us to identify errors or unusual data points (outliers) that could affect our results.
4. The insights gained from EDA help us to identify the most important features for building models and guide us on how to prepare them for better performance.
5. By understanding the data, EDA helps us choose the best modeling techniques and adjust them for better results.
Types of Exploratory Data Analysis
There are various types of EDA based on the nature of the data. Depending on the number of variables analyzed at a time, EDA can be divided into three types:
1. Univariate Analysis
Univariate analysis focuses on studying one variable to understand its characteristics. It helps to describe the data and find patterns within a single feature. Common methods include histograms to show the data distribution, box plots to detect outliers and understand the spread, and bar charts for categorical data. Summary statistics such as the mean, median, mode, variance and standard deviation help describe the central tendency and spread of the data.
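A minimal sketch of univariate analysis in Python with pandas and matplotlib; the DataFrame and the column name "age" are made-up placeholders for illustration only.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical single feature; replace with a column from your own dataset.
df = pd.DataFrame({"age": [23, 25, 25, 27, 29, 31, 35, 38, 41, 75]})

print(df["age"].describe())           # count, mean, std, min, quartiles, max
print("mode:", df["age"].mode()[0])   # most frequent value

# Histogram shows the distribution; box plot highlights spread and outliers.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
df["age"].plot(kind="hist", ax=axes[0], title="Histogram")
df["age"].plot(kind="box", ax=axes[1], title="Box plot")
plt.tight_layout()
plt.show()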
2. Bivariate Analysis



Bivariate analysis focuses on identifying the relationship between two variables to find connections, correlations and dependencies. It helps to understand how two variables interact with each other. Some key techniques include:
• Scatter plots, which visualize the relationship between two continuous variables (see the sketch after this list).
• The correlation coefficient, which measures how strongly two variables are related; Pearson's correlation is commonly used for linear relationships.
• Cross-tabulation or contingency tables, which show the frequency distribution of two categorical variables and help to understand their relationship.
• Line graphs, which are useful for comparing two variables over time in time-series data to identify trends or patterns.
• Covariance, which measures how two variables change together; it is usually paired with the correlation coefficient for a clearer, standardized view of the relationship.
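A minimal sketch of a bivariate check with pandas and matplotlib; the column names (hours_studied, exam_score) and their values are hypothetical, used only to illustrate a scatter plot and Pearson's r.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical bivariate data for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 55, 61, 64, 70, 74, 79, 85],
})

# Pearson's correlation coefficient quantifies the linear relationship.
r = df["hours_studied"].corr(df["exam_score"], method="pearson")
print(f"Pearson r = {r:.2f}")

# Scatter plot visualizes the same relationship.
df.plot(kind="scatter", x="hours_studied", y="exam_score")
plt.show()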
3. Multivariate Analysis
Multivariate analysis identifies relationships among more than two variables in the dataset and aims to understand how the variables interact with one another, which is important for statistical modeling. It includes techniques such as:
• Pair plots, which show the relationships between multiple variables at once and help in understanding how they interact.
• Principal Component Analysis (PCA), which reduces the complexity of large datasets by simplifying them while keeping the most important information (a short sketch of both follows this list).
• Spatial analysis, which is used for geographical data, relying on maps and spatial plotting to understand the geographical distribution of variables.
• Time series analysis, which is used for time-based data and involves understanding and modeling patterns and trends over time. Common techniques include line plots, autocorrelation analysis, moving averages and ARIMA models.
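A brief sketch of a pair plot and PCA, assuming seaborn and scikit-learn are installed; the three columns and their values are invented for illustration.

import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical multivariate data.
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180, 185, 190],
    "weight": [50, 55, 60, 68, 72, 80, 85, 95],
    "age":    [20, 22, 25, 28, 30, 35, 40, 45],
})

# Pair plot: pairwise scatter plots and distributions for all variables.
sns.pairplot(df)

# PCA: project the three standardized features onto two components.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)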
Steps for Performing Exploratory Data Analysis
It involves a series of steps to help us understand the data, uncover patterns, identify anomalies, test
hypotheses and ensure the data is clean and ready for further analysis.
Step 1: Understanding the Problem and the Data
The first step in any data analysis project is to fully understand the problem we're solving and the
data we have. This includes asking key questions like:
1. What is the business goal or research question?



2. What are the variables in the data and what do they represent?
3. What types of data (numerical, categorical, text, etc.) do you have?
4. Are there any known data quality issues or limitations?
5. Are there any domain-specific concerns or restrictions?
Step 2: Importing and Inspecting the Data
After understanding the problem and the data, the next step is to import the data into our analysis environment, such as Python, R or a spreadsheet tool. It is important to inspect the data to gain a basic understanding of its structure, variable types and any potential issues.
1. Load the data into our environment carefully to avoid errors or truncations.
2. Check the size of the data like number of rows and columns to understand its complexity.
3. Check for missing values and see how they are distributed across variables since missing data
can impact the quality of your analysis.
4. Identify data types for each variable like numerical, categorical, etc which will help in the
next steps of data manipulation and analysis.
5. Look for errors or inconsistencies such as invalid values, mismatched units or outliers which
could show major issues with the data.
By completing these tasks we'll be prepared to clean and analyze the data more effectively.
Step 3: Handling Missing Data
Missing data is common in many datasets and can affect the quality of our analysis. During EDA
it's important to identify and handle missing data properly to avoid biased or misleading results.
1. Understand the patterns and possible causes of missing data. Is it Missing Completely At Random (MCAR), Missing At Random (MAR) or Missing Not At Random (MNAR)? Identifying this helps us find the best way to handle the missing data.
2. Decide whether to remove missing data or impute (fill in) the missing values. Removing data can lead to biased outcomes if the missing data is not MCAR. Filling in values helps to preserve data but should be done carefully (a short sketch of both options follows this list).
3. Use appropriate imputation methods, such as the mean or median, based on the data's characteristics.
4. Consider the impact of missing data. Even after imputing, missing data can cause uncertainty and bias, so interpret the results with caution.
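A small illustrative sketch with pandas; the DataFrame and column names are hypothetical, and median/mode imputation is shown only as one reasonable option, not the only correct choice.

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, np.nan],
    "income": [40000, 52000, 61000, np.nan, 45000, 58000],
    "city":   ["Chennai", None, "Delhi", "Chennai", "Mumbai", "Delhi"],
})

print(df.isnull().sum())                 # missing count per column

# Option 1: drop rows with any missing value (only safe if data is MCAR).
dropped = df.dropna()

# Option 2: impute numerical columns with the median, categorical with the mode.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])
print(imputed)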
Step 4: Exploring Data Characteristics
After addressing missing data, the characteristics of the data are explored by checking the distribution, central tendency and variability of the variables and identifying outliers or anomalies. This helps in selecting appropriate analysis methods and finding major data issues. Summary statistics such as the mean, median, mode, standard deviation, skewness and kurtosis should be calculated for numerical variables. This provides an overview of the data's distribution and helps us to identify any irregular patterns or issues.
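A short pandas sketch of these summary statistics on a made-up column; the values are illustrative only.

import pandas as pd

# Hypothetical numerical data for illustration.
df = pd.DataFrame({"sales": [12, 15, 14, 18, 21, 19, 22, 25, 24, 120]})

print(df["sales"].describe())            # mean, std, min, quartiles, max
print("median:  ", df["sales"].median())
print("mode:    ", df["sales"].mode()[0])
print("skewness:", df["sales"].skew())   # asymmetry of the distribution
print("kurtosis:", df["sales"].kurt())   # heaviness of the tails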

Step 5: Performing Data Transformation


Data transformation is an important step in EDA as it prepares our data for accurate analysis and
modeling. Common transformation techniques include:
1. Scaling or normalizing numerical variables, e.g., min-max scaling or standardization (see the sketch after this list).
2. Encoding categorical variables for machine learning.
3. Applying mathematical transformations to correct skewness or non-linearity.
4. Creating new variables from existing ones like calculating ratios or combining variables.
5. Aggregating or grouping data based on specific variables or conditions.
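A compact sketch of two common transformations, assuming scikit-learn is available; the column names and values are hypothetical.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data with one numerical and one categorical feature.
df = pd.DataFrame({
    "income": [30000, 45000, 60000, 120000],
    "segment": ["retail", "corporate", "retail", "sme"],
})

# Min-max scaling maps values to [0, 1]; standardization gives mean 0, std 1.
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# One-hot encoding turns the categorical column into indicator columns.
encoded = pd.get_dummies(df, columns=["segment"])
print(encoded)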
Step 6: Visualizing Relationships in the Data
Visualization helps to find relationships between variables and identify patterns or trends that may not be apparent from summary statistics alone.
1. For categorical variables, create frequency tables, bar plots and pie charts to understand the distribution of categories and identify imbalances or unusual patterns.
2. For numerical variables, generate histograms, box plots, violin plots and density plots to visualize the distribution, shape, spread and potential outliers.
3. To find relationships between variables, use scatter plots, correlation matrices or statistical tests such as Pearson's correlation coefficient or Spearman's rank correlation (see the correlation-heatmap sketch after this list).
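A small sketch of a correlation matrix and heatmap using pandas and seaborn; the dataset is invented for illustration.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset with several numerical features.
df = pd.DataFrame({
    "temperature": [20, 22, 25, 27, 30, 33],
    "ice_cream_sales": [110, 130, 170, 190, 240, 290],
    "umbrella_sales": [80, 75, 60, 55, 40, 30],
})

# Correlation matrix summarizes pairwise linear relationships.
corr = df.corr(method="pearson")
print(corr)

# Heatmap makes strong positive/negative correlations easy to spot.
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()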
Step 7: Handling Outliers
Outliers are data points that differ markedly from the rest of the data and may be caused by errors in measurement or data entry. Detecting and handling outliers is important because they can skew our analysis and affect model performance. Properly managing outliers keeps the analysis accurate and reliable.
Step 8: Communicate Findings and Insights
The final step in EDA is to communicate the findings clearly. This involves summarizing the
analysis, pointing out key discoveries and presenting our results in a clear way.
1. Clearly state the goals and scope of your analysis.
2. Provide context and background to help others understand your approach.
3. Use visualizations to support our findings and make them easier to understand.



4. Highlight key insights, patterns or anomalies discovered.
5. Mention any limitations or challenges faced during the analysis.
6. Suggest next steps or areas that need further investigation.

IDENTIFYING DATA QUALITY


What is Data Quality?
Data quality refers to the reliability, accuracy, completeness, and consistency of data. High-
quality data is free from errors, inconsistencies, and inaccuracies, making it suitable for reliable
decision-making and analysis. Data quality encompasses various aspects, including correctness,
timeliness, relevance, and adherence to predefined standards. Organizations prioritize data quality to
ensure that their information assets meet the required standards and contribute effectively to business
processes and decision-making. Effective data quality management involves processes such as data
profiling, cleansing, validation, and monitoring to maintain and improve data integrity.
Data Quality Process:
• Discover: Use data profiling to understand the data and the causes of abnormalities.
• Define: Specify the standards for standardization and cleaning.
• Integrate: Apply specified guidelines to procedures for data quality.
• Monitor: Continuously monitor and report on data quality.
Data Quality Dimensions
The Data Quality Assessment Framework (DQAF) defines six core dimensions of data quality: completeness, timeliness, validity, integrity, uniqueness, and consistency. These dimensions are helpful when assessing the quality of a given dataset at any point in time. Data quality can be identified by measuring the following:
• Completeness: The percentage of missing data in a dataset is used to determine completeness.
The accuracy of data on goods and services is essential for assisting prospective buyers in
evaluating, contrasting, and selecting various sales items.
• Timeliness: This refers to how current or outdated the data is at any one time. For instance,
there would be a problem with timeliness if you had client data from 2008 and it is now 2021.
• Validity: Data that doesn't adhere to certain firm policies, procedures, or formats is considered
invalid. For instance, a customer's birthday may be requested by several programs. However,
the quality of the data is immediately affected if the consumer enters their birthday incorrectly
or in an incorrect format.



• Integrity: The degree to which information is dependable and trustworthy is referred to as
data integrity. Are the facts and statistics accurate?
• Uniqueness: The attribute of data quality that is most frequently connected to customer
profiles is uniqueness. Long-term profitability and success are frequently based on more
accurate compilation of unique customer data, including performance metrics linked to each
consumer for specific firm goods and marketing activities.
• Consistency: Data consistency is most frequently linked to analytics. It ensures that the data-collection source is acquiring data accurately and in accordance with the department's or company's specific goals.
Why is Data Quality Important?
Over the past 10 years, the Internet of Things (IoT), artificial intelligence (AI), edge computing, and hybrid clouds have all contributed to the exponential growth of big data. As a result, master data management (MDM) has become a more common task, requiring more data stewards and more controls to ensure data quality.
To support data analytics projects, including business intelligence dashboards, businesses depend on
data quality management. Without it, depending on the business (e.g. healthcare), there may be
disastrous repercussions, even moral ones.
• Now, organizations that possess high-quality data are able to create key performance
indicators (KPIs) that assess the effectiveness of different projects, enabling teams to expand
or enhance them more efficiently. Businesses that put a high priority on data quality will
surely have an advantage over rivals.
• Teams who have access to high-quality data are better able to pinpoint the locations of
operational workflow failures.
What is Good Data Quality?
Data that satisfies requirements for accuracy, consistency, dependability, completeness, and relevance is said to be of good quality. Identifying data quality involves evaluating data based
on characteristics like accuracy, completeness, consistency, validity, and timeliness to determine its
fitness for a specific purpose. For enterprises to get valuable insights, make wise decisions, and run
smoothly, they need to maintain high data quality. The following are essential aspects of good data
quality:
• Accuracy: Data correctly and precisely represents the facts it claims to represent.
• Completeness: All necessary data is present, with no missing values.



• Consistency: Data is uniform across different systems or records. For example, a customer's
address should be the same in the sales and support databases.
• Validity: Data conforms to defined rules or formats. For instance, a date field should contain
a valid date, not text.
• Timeliness: Data is up-to-date and available when needed. Old or outdated information is a
sign of poor timeliness.
• Uniqueness: Data does not contain duplicates. Identifying and removing duplicates ensures
accuracy and efficiency.
• Integrity: Data is protected from accidental or intentional corruption, maintaining its
structure and relationships.
• Relevance: The data is useful for its intended purpose.
• Conformity: Data adheres to the standards and formats it's expected to.
• Consolidation: Data is free from duplicates, guaranteeing that each piece of information is retained only once.
How to Improve Data Quality?
Improving data quality involves implementing strategies and processes to enhance the reliability,
accuracy, and overall integrity of your data. Here are key steps to improve data quality:
• Define Data Quality Standards: Clearly define the standards and criteria for high-quality data relevant to the organization's goals. This includes specifying data accuracy, completeness, consistency, and other essential attributes.
• Data Profiling: Conduct data profiling to analyze and understand the structure, patterns, and
quality of your existing data. Identify anomalies, errors, or inconsistencies that need attention.
• Validation Rules: Establish validation rules to enforce data quality standards during data
entry. This helps prevent the introduction of errors and ensures that new data adheres to
predefined criteria.
• Data Governance: Implement a robust data governance framework with clear policies,
responsibilities, and processes for managing and ensuring data quality. Establish data
stewardship roles to oversee and enforce data quality standards.
• Regular Audits: Conduct regular audits and reviews of your data to identify and address
issues promptly. Establish a schedule for ongoing data quality checks and assessments.



• Automated Monitoring: Implement automated monitoring tools to continuously track data
quality metrics. Set up alerts for anomalies or deviations from established standards, enabling
proactive intervention.
Challenges in Data Quality
Some of the challenges in data quality are listed below.
• Incomplete Data: It can be difficult to collect thorough and accurate data, resulting in gaps
and flaws in the data that can be analyzed and used to make decisions.
• Issues with Data Accuracy: Inconsistencies in data sources, faults during data input, and
system malfunctions can all cause accuracy issues.
• Data Integration Complexity
• Data Governance
Other than these, there can be many other challenges, such as lack of standardization, data security and privacy concerns, data quality monitoring, technology limitations, and human error.
Benefits of Data Quality
• Informed Decision: Reliable and accurate data lowers the chance of mistakes and facilitates
improved decision-making.
• Operational Efficiency: High-quality data reduces mistakes and boosts productivity, which
simplifies processes.
• Enhanced Customer Satisfaction: Precise client information results in more individualized
experiences and better service.
• Compliance and Risk Mitigation: Lowering legal and compliance risks is achieved by
adhering to data quality standards.
• Cost saving: Good data reduces the need for revisions, which saves money.
• Credibility and Trust: Credibility and confidence are gained by organizations with high-
quality data among stakeholders.
IDENTIFY MISSING VALUES
During data exploration, identifying missing values is a crucial step for ensuring the
reliability and accuracy of your analysis. Missing values can bias results and cause errors in machine
learning models if not properly addressed. Techniques for both programmatically finding missing
values and visualizing their patterns are essential tools for any data scientist.
Identifying missing values programmatically



In Python (using pandas): The pandas library offers multiple functions for detecting and counting
missing values, which are typically represented as NaN (Not a Number) or None.
• Check for any missing values: Use .isnull() (or .isna()) combined with .any() to get a quick
summary of which columns contain missing data.
df.isnull().any()
• Get a count of missing values per column: Add .sum() to find the total number of missing
entries in each column.
df.isnull().sum()
• Get a percentage of missing values per column: Combine .isnull(), .sum(), and .mean() to see
the proportion of missing data in each column, which can help in deciding how to handle
them.
(df.isnull().sum() / len(df)) * 100
• Get a data frame summary: The .info() method provides a concise summary, including the
number of non-null entries for each column, which helps quickly spot columns with missing
data.
df.info()

Visualizing missing values


Visualizing missing data can reveal patterns that are not obvious from simple counts alone, such as
whether data is missing at random or systematically. The missingno Python library is an excellent
tool for this purpose.
Matrix plot
The matrix plot provides a graphical representation of the dataset's nullity, with each white line or
space representing a missing value. A sparkline on the right side summarizes the number of non-null
values per row, providing a quick overview of data completeness.
• Use case: Ideal for quickly seeing the overall distribution and location of missing values in
your data frame.



• Tool: The missingno library in Python (msno.matrix(df)).
Bar chart
A bar chart shows the proportion of data that is present in each column, with the length of the bar
indicating completeness. Columns with missing data will have bars that do not reach 100%.
• Use case: Useful for quickly comparing the amount of missing data across different features.
• Tool: The missingno library in Python (msno.bar(df)).

Heatmap
A missingness correlation heatmap shows how the presence of a missing value in one column is
correlated with the presence of a missing value in another column.
• Use case: Helps identify relationships and dependencies between missing values. A value
near 1 means the missingness of two variables is highly correlated.
• Tool: The missingno library in Python (msno.heatmap(df)).
Dendrogram
The dendrogram uses hierarchical clustering to group columns by their nullity correlation. Columns
that are grouped together are more likely to have missing values in the same rows.
• Use case: Effective for exploring larger relationships in missing data patterns, revealing
trends that may not be apparent in the heatmap.
• Tool: The missingno library in Python (msno.dendrogram(df)). A combined sketch of all four plots follows.
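A combined sketch of the four missingno plots, assuming the missingno package is installed (import missingno as msno); the DataFrame here is a made-up example.

import numpy as np
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

# Hypothetical DataFrame with missing values scattered across columns.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan, 52],
    "income": [40000, 52000, np.nan, np.nan, 45000, 58000],
    "city":   ["Chennai", "Delhi", None, "Mumbai", "Delhi", None],
})

msno.matrix(df)      # nullity matrix: white gaps mark missing values
msno.bar(df)         # completeness of each column
msno.heatmap(df)     # correlation between columns' missingness
msno.dendrogram(df)  # hierarchical clustering of nullity patterns
plt.show()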
Why identify missing values?
Identifying and visualizing missing data is the first step toward deciding on an appropriate handling
strategy. This process is critical for:
• Improved model accuracy: Most machine learning algorithms cannot handle missing values
and will produce an error or biased results.
• Enhanced data quality: Understanding the scope and patterns of missing data is vital for
assessing overall data quality.
• Bias prevention: The cause of missing data (the "missingness mechanism") can introduce
significant bias if not accounted for. Visualization helps determine if the data is Missing
Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random
(MNAR).
• Better decision-making: A clean, well-understood dataset leads to more reliable and
trustworthy insights.



IRREGULAR CARDINALITY
Cardinality refers to the number of unique values in a dataset column. A column with high
cardinality contains a vast number of unique values, while columns with low cardinality contain
fewer unique entries.
High cardinality is especially common in datasets involving unique identifiers, such as:
• Timestamps: Nearly every value is unique when data is logged at the millisecond or
nanosecond level.
• Customer IDs: Each customer has a distinct identifier to track their activity.
• Transaction IDs: Financial datasets often include unique codes for every transaction.
Irregular cardinality refers to anomalies in the number of unique values within a dataset's features,
and it presents significant challenges during data exploration and visualization. The two main types
of irregular cardinality are:
• High cardinality: A feature with a large number of unique values (e.g., product IDs, user
emails, or street addresses).
• Low or unique cardinality: Features with very few unique values, including cases where the
cardinality is one (constant) or where each value is distinct and unique to a single record (e.g.,
a perfect identifier).
Challenges in exploration and visualization
Irregular cardinality can distort the insights gained from exploratory data analysis (EDA) and hinder
effective data visualization.
High cardinality
• Visualization clutter: Creating standard visualizations like bar charts for a high-cardinality
feature can result in hundreds or thousands of bars, making the chart difficult to read and
interpret.
• Loss of insights: With so many unique values, meaningful patterns or trends can become
obscured. It is nearly impossible to spot individual anomalies or small groups of similar
values.
• Performance issues: For databases and dashboards, working with high-cardinality data can
lead to slow queries and increased storage and memory usage.
Low or unique cardinality



• Unique identifiers: A feature with unique cardinality, such as a serial number or Social
Security number, provides little to no analytical value on its own. Visualizing it offers no
insights, and it is usually not suitable for aggregation.
• Constant features: A feature with a cardinality of one has no variance and offers no
explanatory power. It can be identified during exploration and safely ignored.
• Data skew: In database and distributed computing contexts, low cardinality can lead to
uneven data distribution, which can create bottlenecks.
Incorrect cardinality
• Inconsistent labels: Errors like typos, misspellings, or inconsistent capitalization can
artificially inflate a feature's cardinality. For instance, a "Gender" column could have "Male,"
"male," and "M" all representing the same value.
• Loss of semantic meaning: Multiple values may mean the same thing, but inconsistent entry
makes them appear as separate categories.
Strategies for exploration and visualization
Detecting irregular cardinality
• Quantitative check: Calculate the number of unique values in each categorical column using a function like nunique() in Python's pandas library (a short sketch follows this list).
• Check unique values: For categorical columns, display the list of unique values. For small
datasets, you can inspect them manually, but for larger ones, sampling may be necessary.
• Consult subject matter experts (SMEs): Discuss your findings with business domain experts
to ensure the unique values and categories align with their knowledge of the data.
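A minimal pandas sketch of the quantitative check; the columns are hypothetical and chosen to show unique, inconsistent and constant cardinality.

import pandas as pd

# Hypothetical dataset mixing low- and high-cardinality columns.
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005"],
    "gender":      ["Male", "male", "M", "Female", "female"],
    "country":     ["IN", "IN", "IN", "IN", "IN"],
})

# Number of unique values per column flags irregular cardinality.
print(df.nunique())

# Inspect the unique labels of a suspicious categorical column.
print(df["gender"].unique())   # reveals inconsistent labels inflating cardinality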
Addressing high cardinality
• Aggregate categories: Instead of visualizing every unique value, group them into higher-level
categories. For example, aggregate Product ID into Product Category.
• Visualize top categories: Create a bar chart showing only the top N most frequent categories, with all other unique values aggregated into an "Other" category (a short sketch follows this list).
• Use advanced visualizations:
o Treemaps can display hierarchical relationships and proportions for nested categories.
o Heatmaps can show correlations between high-cardinality features, revealing patterns
that are difficult to spot otherwise.
• Dimensionality reduction: Use techniques like Principal Component Analysis (PCA) to
reduce the number of features while retaining important variance.
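A short sketch of the top-N with "Other" approach in pandas; the product_id values and their counts are invented.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical high-cardinality categorical column.
products = pd.Series(
    ["P1"] * 50 + ["P2"] * 30 + ["P3"] * 15 + ["P4"] * 3 + ["P5"] * 2,
    name="product_id",
)

# Keep the top N most frequent categories and lump the rest into "Other".
top_n = 3
top = products.value_counts().nlargest(top_n).index
collapsed = products.where(products.isin(top), other="Other")

print(collapsed.value_counts())            # now suitable for a readable bar chart
collapsed.value_counts().plot(kind="bar")
plt.show()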



Resolving low or incorrect cardinality
• Drop useless features: Remove columns with a cardinality of one, as they provide no useful
information.
• Clean inconsistent labels: Consolidate misspellings, varying case, and other typos into a
single standard label.
• Create summary statistics: For unique identifier fields, calculate summary statistics or
aggregate data at a higher level instead of attempting to visualize each individual ID.
• Validate data types: Ensure that numbers, dates, and other data types are consistently
represented to avoid incorrect cardinality.
Example workflow in EDA
1. Initial overview: Use a data profiling tool or library to get a summary of your dataset,
including the number of unique values for each feature.
2. Identify candidates: Flag any features with high or very low cardinality. High-cardinality
categorical features and low-cardinality numerical features are often candidates for irregular
cardinality.
3. Inspect flagged features:
1. For high cardinality, check for inconsistent labels. If found, use data cleaning to
standardize them. Consider if the feature is a unique identifier.
2. For low cardinality, check if the feature is constant or nearly constant. If so, consider
dropping it or confirming with an SME if it's important.
4. Create tailored visualizations:
1. For high-cardinality categorical data, use aggregated bar charts (e.g., top 10
categories), treemaps, or heatmaps instead of raw bar charts.
2. For numerical data with unique values, a scatter plot can help identify outliers that
need further investigation.
5. Refine and repeat: Based on the visualizations and initial cleanup, you may need to go back
and perform more targeted data cleaning or feature engineering to gain deeper insights.
Identifying Outliers
Outliers
An outlier is essentially a statistical anomaly: a data point that significantly deviates from other observations in a dataset. Outliers can arise due to measurement errors, natural variation, or rare events, and they can have a disproportionate impact on statistical analyses and machine learning models if not appropriately handled.
Example: If you have the following dataset of student test scores:



[85, 87, 90, 88, 92, 89, 45]
The score 45 is an outlier—it’s much lower than the others.
Note: Outliers can be valid observations or errors in data entry, measurement, or processing.
Types of Outliers
Outliers can be classified into various types based on their characteristics:

1. Global Outliers: Also known as point anomalies, these data points significantly differ from the rest of the dataset.

2. Contextual Outliers: These are data points that are considered outliers in a specific
context. For example, a high temperature may be normal in summer but an outlier in
winter.
3. Collective Outliers: A collection of data points that deviate significantly from the rest
of the dataset, even if individual points within the collection are not outliers.
Outlier Detection
Outlier detection is a process of identifying observations or data points that significantly deviate
from the majority of the data. Outliers can distort statistical analyses, leading to erroneous
conclusions and misleading interpretations. When calculating means, medians, or standard
deviations, outliers can exert disproportionate influence, skewing the results and undermining the
validity of the analysis. By detecting and appropriately addressing outliers, analysts can mitigate
the impact of these anomalies on statistical measures, ensuring that the insights drawn from the
data are representative and accurate.
Detecting outliers is critical for numerous reasons:
• Improving Accuracy: Removing or accurately handling outliers enhances the
performance and predictability of data models.
• Fraud Detection: Outliers can be symptomatic of fraudulent activity, especially in
financial or transaction data.
• Data Quality: Regular outlier detection is crucial to maintain the integrity and quality
of data, which in turn affects the decision-making processes based on this data.
• Model Performance: Outliers can significantly impact the performance of statistical
models, machine learning algorithms, and other analytical techniques. By identifying
and handling outliers appropriately, we can improve the robustness and accuracy of
these models.
• Insight Generation: Outliers may represent unique or interesting phenomena in the
data. Identifying and analyzing outliers can lead to valuable insights, such as detecting
emerging trends, understanding rare events, or uncovering potential opportunities or
threats.
Methods for Outlier Detection
Outlier detection is a critical task in data analysis, crucial for ensuring the quality and reliability of
conclusions drawn from data. Different techniques are tailored for varying data types and
scenarios, ranging from statistical methods for general data sets to specialized algorithms for
spatial and temporal data. Some techniques are:
Standard Deviation Method
Standard Deviation Method is based on the assumption that the data follows a normal distribution.
Data points outside of three standard deviations from the mean are considered outliers.
It is commonly used for univariate data analysis where the distribution can be assumed to be
approximately normal.
• Step 1: Calculate the mean and standard deviation of the data set.



• Step 2: Define the lower and upper bounds for outliers.
• Step 3: Identify outliers as data points that fall outside these bounds:
Example: Dataset [1, 2, 2, 3, 1, 3, 10]. Find the outlier using the Standard Deviation Method (here with a threshold of two standard deviations).
Mean: μ = (1 + 2 + 2 + 3 + 1 + 3 + 10) / 7 = 22/7 ≈ 3.14
Standard deviation: s = √( Σ(xᵢ − μ)² / n ) ≈ 2.92
Lower bound = μ − 2s ≈ 3.14 − 5.83 = −2.69
Upper bound = μ + 2s ≈ 3.14 + 5.83 = 8.98
Any value outside [−2.69, 8.98] is an outlier; thus 10 is an outlier.
So, the data point 10 is identified as an outlier using the Standard Deviation Method.
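A small NumPy sketch of the same method; it uses the population standard deviation and a two-standard-deviation threshold, so the bounds may differ slightly from the hand-rounded values above, but 10 is still flagged.

import numpy as np

data = np.array([1, 2, 2, 3, 1, 3, 10])
k = 2                      # threshold in standard deviations (2 here, often 3)

mean = data.mean()
std = data.std()           # population standard deviation

lower, upper = mean - k * std, mean + k * std
outliers = data[(data < lower) | (data > upper)]
print(f"bounds: [{lower:.2f}, {upper:.2f}], outliers: {outliers}")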
IQR Method
The Interquartile Range (IQR) method focuses on the spread of the middle 50% of data. It
calculates the IQR as the difference between the 75th and 25th percentiles of the data and identifies
outliers as those points that fall below 1.5 times the IQR below the 25th percentile or above 1.5
times the IQR above the 75th percentile. This method is robust to outliers and does not assume a
normal distribution.
• Step 1: Find Q1(25th percentage) and Q3(75th percentage)
• Step 2: IQR = Q3 - Q1.
• Step 3: Find the Lower Bound: Q1 − 1.5 × IQR and the Upper Bound: Q3 + 1.5 × IQR.
It is suitable for datasets with skewed or non-normal distributions. Useful for identifying outliers in
datasets where the spread of the middle 50% of the data is more relevant than the mean and
standard deviation.
Example: Dataset X = {3,5,7,9,11,13,30}, find outlier using the IQR method.
• Q1 (25th percentile): Median of first half = Median of [3, 5, 7] = 5
• Q3 (75th percentile): Median of second half = Median of [11, 13, 30] = 13
IQR = Q3 − Q1 = 13 − 5 = 8
Lower Bound: Q1 - 1.5 × IQR = 5 - 1.5 × 8 = 5 - 12 = -7
Upper Bound: Q3 + 1.5 × IQR = 13 + 1.5 × 8 =13 + 12 = 25
Therefore the interval is [−7, 25]. Since 30 lies outside this interval, it is an outlier.
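A NumPy sketch of the IQR method on the same dataset; note that np.percentile interpolates, so Q1 and Q3 differ slightly from the median-of-halves values above, while the conclusion (30 is an outlier) is unchanged.

import numpy as np

data = np.array([3, 5, 7, 9, 11, 13, 30])

q1, q3 = np.percentile(data, [25, 75])   # interpolated quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(f"IQR = {iqr}, bounds = [{lower}, {upper}], outliers = {outliers}")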
Z-Score Method
The Z-score method calculates the number of standard deviations each data point is from the mean.
A Z-score threshold is set, commonly 3, and any data point with a Z-score exceeding this threshold
is considered an outlier. This method assumes a normal distribution and is sensitive to extreme
values in small datasets.
• Step 1: Calculate the mean.
• Step 2: Compute Standard Deviation
• Step 3:Calculate z-scores
• Step 4: Apply Threshold Rule: Mild outlier: |Z| > 2 and Extreme outlier: |Z| > 3
Suitable for datasets with large sample sizes and where the underlying distribution of the data can
be reasonably approximated by a normal distribution.
Example: X = {4, 5, 5, 6, 7, 8, 20}, find the outlier using the Z-score method.
Mean: x̄ = (4 + 5 + 5 + 6 + 7 + 8 + 20) / 7 = 55/7 ≈ 7.86
Standard deviation: s ≈ 5.36
Z-score: Zᵢ = (xᵢ − 7.86) / 5.36
Z-score for all data points: 4: -0.72, 5: -0.53, 5: -0.53, 6: -0.35, 7: -0.16, 8: 0.03, 20: 2.26
Applying the threshold |Z| > 2: 20 is an outlier, since |2.26| > 2.
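A NumPy sketch of the Z-score method on the same data; it uses the sample standard deviation, so the exact scores differ slightly from the rounded values above, but 20 is still flagged at the |Z| > 2 threshold.

import numpy as np

data = np.array([4, 5, 5, 6, 7, 8, 20])
threshold = 2              # |Z| > 2 flags mild outliers, |Z| > 3 extreme ones

mean = data.mean()
std = data.std(ddof=1)     # sample standard deviation
z_scores = (data - mean) / std

outliers = data[np.abs(z_scores) > threshold]
print(np.round(z_scores, 2), "outliers:", outliers)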
The choice of outlier detection technique depends on the characteristics of the data, the
underlying distribution, and the specific requirements of the analysis.
Challenges with Outlier Detection
Detecting outliers effectively poses several challenges:
• Determining the Threshold: Deciding the correct threshold that accurately separates
outliers from normal data is critical and difficult.
• Distinguishing Noise from Outliers: In datasets with high variability or noise, it can be
particularly challenging to differentiate between noise and actual outliers.
• Balancing Sensitivity: An overly aggressive approach to detecting outliers might
eliminate valid data, reducing the richness of the dataset.
Applications of Outlier Detection
Outlier detection plays a crucial role across various domains, enabling the identification of
anomalies that can indicate errors, fraud, or novel insights. Here are some key applications of
outlier detection with specific examples:
1. Financial Fraud Detection
• Fraud Detection: Outlier detection is extensively used in the financial sector to identify
fraudulent activities. For instance, credit card companies use outlier detection
algorithms to flag unusual spending patterns that may indicate stolen card usage.
• Example: A credit card transaction for a large amount in a foreign country when the
cardholder usually makes small, local purchases could be flagged as an outlier,
triggering a fraud alert.
2. Cybersecurity
• Network Intrusion Detection: Outlier detection is critical in cybersecurity for identifying
unusual patterns of network traffic that could indicate a security breach.
• Example: A sudden increase in data transmission to an external IP address not
previously contacted by the network could be an outlier, suggesting a potential data
exfiltration attack.
3. AI/ML Modeling
• Data Cleaning: To prevent model skewing in training data
• In reducing Bias: Detects biased predictions.
4. Anomaly Detection in Big Data & Cloud Systems
• Cloud Security: To detect unauthorized access in large-scale cloud environments.
• Ensures integrity by flagging corrupted entries.
HANDLING DATA QUALITY IN DATA EXPLORATION AND VISUALIZATION
Handling data quality during data exploration and visualization involves checking for and
addressing issues like missing values, outliers, and inconsistencies, often through techniques like
statistical summaries, data profiling, and visual analysis. This process helps ensure the data is accurate and reliable enough to support sound insights and decision-making, and it includes steps like data cleansing and creating visuals to identify problems.
Steps to handle data quality in exploration and visualization



1. Initial data assessment:
• Load the data carefully, checking for errors.
• Examine the size (rows and columns) to understand the dataset's complexity.
• Check for missing values and how they are distributed.
• Identify data types for each variable (e.g., numerical, categorical).
2. Use statistical summaries:
• Generate descriptive statistics to understand central tendencies, variability, and the
overall distribution of the data.
3. Perform visual analysis:
• Create visualizations like histograms, scatter plots, and box plots to identify outliers,
patterns, and relationships.
• Use visualizations to spot inconsistencies, such as invalid values or mismatched units.
4. Address identified issues:
• Missing values: Decide on a strategy for handling them, such as imputation or
removal.
• Outliers: Determine if they are errors or genuine data points, and handle them
appropriately.
• Inconsistencies: Correct errors, clean up invalid entries, and standardize formats.
5. Focus on visualization principles:
• Choose the right chart type to accurately represent the data and the message you want
to convey.
• Use visual elements like color, size, and labels to draw attention to key information
and patterns, while also being careful not to mislead.
6. Iterate and refine:
• Exploration is an iterative process; refine your questions and techniques as you
uncover more about the data's quality and characteristics.
• Document your findings and the steps you took to handle data quality issues.
Handling data quality
Data quality management is a continuous effort to ensure that data is fit for use. It requires a
structured approach to identify, correct, and prevent errors across the data lifecycle.
Key dimensions of data quality
Data quality is measured across several key dimensions:



• Accuracy: Is the data correct and does it reflect reality? For example, a customer's address
should be valid.
• Completeness: Is all necessary information present? Missing values can lead to inaccurate
analysis.
• Consistency: Is the data uniform across different systems and sources? A customer's birthdate
should be the same in the marketing database and the sales system.
• Validity: Does the data conform to defined business rules and formats? For instance, an age
field should only contain a positive integer.
• Timeliness: Is the data current and available when needed? Outdated inventory figures can
lead to costly mistakes.
• Uniqueness: Is every record unique, with no duplicates? Duplicate customer records can skew
sales reports.
Practical steps for handling data quality
1. Assess current data quality: Perform data profiling to get an initial understanding of your
data's content, structure, and quality issues.
2. Define quality standards: Work with stakeholders to set clear data quality requirements based
on business needs.
3. Cleanse the data: Use automated or manual processes to correct errors. This includes:
1. Correcting or imputing missing values.
2. Standardizing data formats and values.
3. Removing duplicate records.
4. Validate data: Implement automated validation rules to check data against predefined criteria
as it is collected.
5. Monitor continuously: Set up ongoing monitoring to track key data quality metrics over time
and alert the right teams when issues arise.
6. Establish data governance: Define roles and responsibilities for managing data. A data
governance framework ensures everyone follows the established data policies and standards.
Describing data
Describing data, also known as descriptive statistics, is the process of summarizing the characteristics
of a dataset to identify patterns and insights.
Key elements for describing data
• Summary statistics: These are single values that summarize a large dataset.



o Measures of central tendency: Describe the center of the data.
o Mean: The average value.
o Median: The middle value when the data is sorted.
o Mode: The most frequently occurring value.
o Measures of variability: Describe the spread of the data.
o Standard deviation: Measures how spread out the data is relative to the mean.
o Range: The difference between the highest and lowest values.
o Interquartile range (IQR): Measures the spread of the middle 50% of the data.
• Data visualization: Graphs and charts provide visual summaries of data distributions.
o Histograms: Show the frequency distribution of a continuous variable.
o Bar graphs: Compare categorical data.
o Box plots: Visualize the central tendency, variability, and outliers in a dataset.
o Scatter plots: Display the relationship between two variables.
• Metadata: "Data about data" provides context and important documentation. It includes
information such as a description of variables, units of measure, and data sources.
Preparing data tables
Preparing data for analysis often involves structuring and organizing raw data into clean, tidy tables.
This process is crucial for ensuring accuracy and usability.
Step-by-step process for preparing data tables
1. Define the table's purpose: Before you begin, clarify what questions the table should help
answer. This will guide your decisions on which data to include and how to structure it.
2. Collect and prepare your data: Gather all the necessary data from its various sources. As
you do, perform initial data quality checks to ensure you are starting with reliable
information.
3. Choose your tool: Select an appropriate tool based on the complexity of your data, such as
Microsoft Excel for simple tables, or a database management system (like SQL) for large or
complex datasets.
4. Structure the table:
1. Use clear headers: Give each column a descriptive, unique header. A "tidy" table has
each variable in its own column and each observation in its own row.
2. Align and format: Ensure consistent formatting for all numerical values, including the
number of decimal places. Align text and numbers consistently for easy readability.



5. Clean and transform the data:
1. Standardize data: Ensure data is consistently formatted (e.g., "NYC" vs. "New York").
2. Handle missing values: Choose an appropriate strategy, such as removing rows with
too many missing values or imputing them with the mean, median, or a specific label
like "Unknown".
3. Handle duplicates: Check for and remove duplicate rows to prevent skewed analysis.
4. Engineer new features: For machine learning, you may need to transform categorical
variables into a numerical format using techniques like one-hot encoding.
6. Document your work: Make sure to document all steps taken during preparation, including
how missing values were handled and how features were transformed. This ensures your
work is reproducible and transparent.
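A small pandas sketch of preparing a tidy table and building a simple summary table, in the spirit of the practical exercise; the raw values and column names are invented.

import numpy as np
import pandas as pd

# Hypothetical raw table with inconsistent labels, duplicates and a gap.
raw = pd.DataFrame({
    "city":  ["NYC", "New York", "Chennai", "Chennai", "Delhi"],
    "sales": [250, 250, 300, 300, np.nan],
})

tidy = raw.copy()
tidy["city"] = tidy["city"].replace({"NYC": "New York"})        # standardize labels
tidy = tidy.drop_duplicates()                                   # remove duplicate rows
tidy["sales"] = tidy["sales"].fillna(tidy["sales"].median())    # impute missing value

# A simple summary table: one row per group with aggregate statistics.
summary = tidy.groupby("city")["sales"].agg(["count", "mean", "sum"])
print(summary)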
Understanding Relationships, Identifying and Understanding Groups
Understanding relationships and groups in data exploration and visualization involves using
techniques like scatter plots to see how variables relate to each other, and using visualizations like bar
charts or heat maps to identify and understand patterns within groups. Data visualization tools help
make these relationships and groupings easier to see by turning data into charts, graphs, and maps,
revealing trends, anomalies, and the main characteristics of the data.
Understanding relationships
• Scatter plots:
These are used to visualize the relationship between two variables. Each point represents a data
instance, allowing you to see trends, correlations, or the lack thereof.
• Line charts:
Useful for tracking how one variable changes over time, revealing trends and patterns.
• Correlation analysis:
While not a visualization itself, this is a statistical method used to determine the strength and
direction of a relationship between two variables. Visualizations like scatter plots help in verifying
the results of this analysis.
Identifying and understanding groups
• Bar charts:
Excellent for comparing values across different categories, helping to identify which groups have
higher or lower values.
• Pie charts:



Show the proportion of each category as a slice of a circle, making it easy to understand the
contribution of each group to the whole.
• Heat maps:
Use color intensity to represent values in a matrix, making it easy to spot clusters and patterns in a
large dataset.
• Clustering:
This is a statistical method where data points are grouped based on their similarities. Visualizations
are then used to show these clusters.
The role of data exploration and visualization
• Data exploration:
This is the initial process of examining and summarizing a dataset to discover its main
characteristics, spot anomalies, and find patterns, often using visualization.
• Data visualization:
This is the process of putting data into a visual format to help with analysis and interpretation. It is an
important component of data exploration because it allows analysts to "see" the data and understand
the relationships between variables without formal modeling.
• Beyond formal analysis:
Data visualization helps see what the data can reveal beyond predefined hypotheses, guiding the
formal analysis process.
Building models from data is the process of using statistical or machine learning techniques to
discover patterns and relationships in a dataset. Once a model is built, a key step is to evaluate its
findings using statistical inference, which can be done through two major approaches: classical (or
frequentist) analysis and Bayesian analysis.
Building models from data
The general process for building a model from data involves several key stages:
• Define the objective: Clearly state the problem you are trying to solve. This can range from
predicting sales to classifying images.
• Collect and prepare data: Gather relevant data from various sources. This requires cleaning
the data by handling missing values, correcting errors, and removing duplicates. You may
also perform feature engineering to transform raw data into a more useful format for the
model.



• Choose a model: Select an appropriate algorithm based on the problem type (e.g.,
classification, regression, clustering) and data characteristics. For beginners, a simple model
like linear regression is often a good starting point.
• Train the model: Use your prepared data to train the model. During this step, the model
learns the patterns and relationships within the data by adjusting its internal parameters.
• Evaluate the model: Assess the model's performance on a held-out test set to ensure it
generalizes well to new, unseen data.
Significance in statistical analysis
Significance, particularly statistical significance, determines if a result is likely due to a real effect
rather than random chance. It is a central concept in classical hypothesis testing and is typically
assessed by the p-value.
• P-value: In a hypothesis test, researchers start with a null hypothesis (e.g., there is no
relationship between variables). The p-value is the probability of observing the data you did
(or more extreme data) if the null hypothesis were true.
• Significance level (α): This is a threshold the researcher sets before the experiment begins, usually 0.05. If the p-value is less than α, the result is considered statistically significant, and the null hypothesis is rejected.
• Practical vs. statistical significance: It is crucial to distinguish between these two concepts.
A result can be statistically significant (unlikely due to chance) but have a negligible real-
world effect (low practical significance).
Classical (Frequentist) analysis
Classical statistics views probabilities as the long-run frequency of an event occurring over many
repeated trials. In this approach, parameters of a population are treated as fixed but unknown values.
Key concepts
• Hypothesis testing: A formal procedure to determine if enough evidence exists to reject a
null hypothesis.
• Confidence intervals: Provides a range of values within which the true population parameter
is likely to fall, with a specified degree of confidence (e.g., 95%).
• Maximum likelihood estimation: A common technique for estimating parameters that
chooses the parameter values that maximize the probability of the observed data.
Example: Coin toss



If you toss a coin 100 times and get 60 heads, a frequentist analysis would use a hypothesis test to see
if this result is statistically significant. The conclusion would be based only on the observed data and
how often such an outcome would occur over a large number of repeated experiments.
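A hedged sketch of this frequentist test using SciPy's exact binomial test, assuming SciPy is available; the 5% significance level is the usual convention, not something fixed by the example.

from scipy.stats import binomtest

# Null hypothesis: the coin is fair (p = 0.5). Observed: 60 heads in 100 tosses.
result = binomtest(k=60, n=100, p=0.5, alternative="two-sided")

alpha = 0.05
print(f"p-value = {result.pvalue:.4f}")
if result.pvalue < alpha:
    print("Reject the null hypothesis: the result is statistically significant.")
else:
    print("Fail to reject the null hypothesis at the 5% significance level.")

For this example the exact p-value lands very close to the conventional 0.05 cutoff, a useful reminder of the difference between statistical and practical significance.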
Bayesian analysis
In contrast to the classical approach, Bayesian analysis treats unknown parameters as random
variables with their own probability distributions. It provides a framework for updating beliefs as
new evidence becomes available, using Bayes' theorem.

Key concepts
• Prior probability P(A): An initial belief or knowledge about a hypothesis A before observing any data.
• Likelihood P(B|A): The probability of observing the data B given that the hypothesis A is true.
• Posterior probability P(A|B): The updated probability of the hypothesis A after considering the new data B.
Bayes' Theorem
The central formula for Bayesian analysis is:

P(A|B) = P(B|A) · P(A) / P(B)
Example: Coin toss
In the same coin toss example, a Bayesian analysis would start with a prior belief about the coin's
fairness. If you believe the coin is fair, you might use a prior distribution centered at a 50%
probability of heads. After observing the 60 heads out of 100 tosses, you would use Bayes' theorem
to update your belief, resulting in a posterior distribution that is shifted towards a higher probability
of heads. This approach provides a full probability distribution for the parameter, capturing the
uncertainty around the estimate.
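A minimal sketch of the Bayesian update for the same coin, using the standard Beta-Binomial conjugate pair from SciPy; the flat Beta(1, 1) prior is an illustrative assumption.

from scipy.stats import beta

# Prior belief about the probability of heads: Beta(a, b).
# Beta(1, 1) is a flat prior; a larger symmetric prior encodes "probably fair".
a_prior, b_prior = 1, 1
heads, tails = 60, 40

# Conjugate update: the posterior is Beta(a + heads, b + tails).
a_post, b_post = a_prior + heads, b_prior + tails
posterior = beta(a_post, b_post)

print(f"Posterior mean P(heads) = {posterior.mean():.3f}")
lo, hi = posterior.interval(0.95)        # central 95% credible interval
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")

A stronger prior such as Beta(50, 50) would pull the posterior mean closer to 0.5, showing how prior beliefs temper the observed evidence.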
Comparison of classical and Bayesian analysis
• Probability: Classical analysis treats probability as the long-run frequency of an event in repeated trials; Bayesian analysis treats it as a degree of belief or subjective uncertainty about an event.
• Prior information: Classical analysis does not incorporate prior beliefs or knowledge and relies only on the current data; Bayesian analysis incorporates prior knowledge through a prior probability distribution, which is then updated with new data.
• Parameters: Classical analysis treats parameters as fixed but unknown values; Bayesian analysis treats them as random variables with probability distributions.
• Output: Classical analysis provides point estimates (e.g., the sample mean) and confidence intervals; Bayesian analysis provides a full posterior probability distribution for the parameters.
• Computation: Classical analysis often uses simpler analytical methods and statistical formulas; Bayesian analysis can be more computationally intensive, often requiring simulation methods such as Markov Chain Monte Carlo (MCMC).
• Best suited for: Classical analysis suits situations with large datasets and minimal prior knowledge (many standard statistical software packages are frequentist by default); Bayesian analysis suits cases with limited data, a strong prior belief, or complex models where incorporating uncertainty is key.
