FUNDAMENTALS OF AI

UNIT-5
DATA ANALYSIS AND PROCESSING
5.1. Introduction to Data Analysis and Visualization
 Data analysis and visualization are essential components of the field of data science. They involve the
exploration, interpretation, and presentation of data to uncover meaningful insights and facilitate better
decision-making. In this introduction, we'll cover the basic concepts and techniques used in data
analysis and visualization.
 Data Analysis: Data analysis refers to the process of inspecting, transforming, and modeling data to
discover useful information, draw conclusions, and support decision-making. It involves several steps,
including data collection, data cleaning, data transformation, data exploration, and data modeling.
o Data Collection: Gathering relevant data from various sources, such as databases, surveys,
APIs, or web scraping.
o Data Cleaning: Preparing the data for analysis by addressing missing values, handling outliers,
resolving inconsistencies, and standardizing the format.
o Data Transformation: Converting the data into a suitable format for analysis, which may
involve reshaping the data, aggregating it, or creating new variables.
o Data Exploration: Exploring the data to gain an understanding of its characteristics,
relationships, and patterns. This can involve descriptive statistics, data visualization, and
exploratory data analysis (EDA) techniques.
o Data Modelling: Applying statistical and machine learning techniques to build models that
can make predictions, classifications, or identify patterns in the data.
Here are some key aspects of data analysis:
o Descriptive Statistics: Descriptive statistics provide a summary of the main characteristics of
a dataset. This includes measures such as mean, median, mode, standard deviation, range, and
percentiles. Descriptive statistics help understand the central tendency, dispersion, and shape
of the data distribution.
o Inferential Statistics: Inferential statistics allows us to make inferences and draw conclusions
about a larger population based on a sample. Techniques like hypothesis testing, confidence
intervals, and regression analysis are commonly used in inferential statistics.
o Exploratory Data Analysis (EDA): EDA involves visually and quantitatively exploring the
data to gain insights and identify patterns. Techniques such as data visualization, summary
statistics, and correlation analysis are used to understand relationships, detect outliers, and
uncover hidden patterns in the data.
o Data Mining: Data mining is the process of discovering patterns and relationships in large
datasets. It involves applying statistical algorithms, machine learning techniques, and data
visualization to extract valuable information from the data.
o Predictive Analytics: Predictive analytics uses historical data to make predictions or forecasts
about future events or outcomes. Techniques such as regression analysis, time series analysis,
and machine learning algorithms are employed to build predictive models.
o Text and Sentiment Analysis: Text analysis involves extracting information and insights from
textual data. It includes techniques such as text mining, natural language processing (NLP),
and sentiment analysis to analyze and interpret text-based data.
o Machine Learning: Machine learning algorithms are used to build models that can
automatically learn from data and make predictions or take actions without explicit
programming. Supervised learning, unsupervised learning, and reinforcement learning are
common types of machine learning techniques.
o Data Wrangling: Data wrangling, also known as data munging or data pre-processing,
involves cleaning, transforming, and reshaping the data to make it suitable for analysis. This
step ensures that the data is accurate, complete, and formatted correctly.
o Data Integration: Data integration involves combining data from multiple sources into a
single unified dataset. It may require merging, joining, or blending data to create a
comprehensive view for analysis.
o Data Interpretation: Data interpretation is the process of making sense of the analyzed data
and drawing meaningful insights and conclusions. It involves critically analyzing the results,
considering the context, and making data-driven decisions.
o When conducting data analysis, it is important to follow a systematic approach, considering
the specific goals and questions at hand. It often involves an iterative process of refining the
analysis based on the insights gained from each step.
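As a brief illustration of the descriptive statistics and exploratory data analysis steps above, here is a minimal pandas sketch; the DataFrame and its column names are hypothetical.
import pandas as pd
# Hypothetical dataset for illustration
df = pd.DataFrame({'Age': [23, 31, 35, 29, 41, 38],
                   'Income': [32000, 45000, 50000, 41000, 62000, 58000]})
# Descriptive statistics: count, mean, std, min, quartiles, max
print(df.describe())
# Individual measures of central tendency and dispersion
print(df['Age'].mean(), df['Age'].median(), df['Age'].std())
# Simple correlation analysis as part of EDA
print(df.corr())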
 Data Visualization: Data visualization is the graphical representation of data to facilitate
understanding and communicate insights effectively. It involves creating visual representations, such
as charts, graphs, maps, or dashboards, to present data in a visually appealing and informative way.
 Benefits of Data Visualization:
o Easy comprehension: Visual representations make it easier to understand complex data
patterns and relationships.
o Insight discovery: Visualizations can help identify trends, outliers, correlations, and other
patterns that might not be apparent in raw data.
o Effective communication: Visualizations enable the clear and concise communication of
findings and insights to a wider audience.
 Common Data Visualization Techniques (a short Matplotlib sketch follows this list):
o Bar Charts and Histograms: Represent categorical or numerical data by displaying the
frequency or distribution of values.
o Line Charts: Ideal for showing trends and changes over time, typically used for time series or
sequential data.
o Scatter Plots: Depict the relationship between two numerical variables, showing how they
correlate or cluster.
o Pie Charts: Display parts of a whole, useful for illustrating proportions or percentages.
o Heatmaps: Present data in a grid-like format using colors to represent values, commonly used
for matrices or geographic data.
o Geographic Maps: Visualize data on a geographical map, showing regional or spatial patterns.
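The following Matplotlib sketch illustrates a few of the chart types above on small, made-up data; the values are purely illustrative.
import matplotlib.pyplot as plt
# Made-up data for illustration
categories = ['A', 'B', 'C']
counts = [10, 24, 17]
years = [2019, 2020, 2021, 2022]
sales = [120, 135, 160, 180]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
# Bar chart: frequency of categorical values
axes[0].bar(categories, counts)
axes[0].set_title('Bar chart')
# Line chart: trend over time
axes[1].plot(years, sales, marker='o')
axes[1].set_title('Line chart')
# Scatter plot: relationship between two numerical variables
axes[2].scatter(sales, [s * 0.6 + 5 for s in sales])
axes[2].set_title('Scatter plot')
plt.tight_layout()
plt.show()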
 Tools for Data Analysis and Visualization: Several tools and programming languages are commonly
used for data analysis and visualization, including:
o Python: Popular programming language with libraries such as pandas, NumPy, and Matplotlib
for data manipulation and visualization.
o R: Statistical programming language with packages like dplyr, ggplot2, and shiny for data
analysis and visualization.
o Tableau: A powerful data visualization tool that allows for interactive and dynamic dashboards
and reports.
o Power BI: Microsoft's business analytics tool that provides data visualization and interactive
reporting capabilities.
o Excel: Widely used spreadsheet software with built-in data analysis and visualization features.
Remember, data analysis and visualization are iterative processes, where insights gained from
visualization can lead to further analysis and refinement of the data. The ultimate goal is to transform
raw data into actionable insights that drive informed decision-making and problem-solving.
5.2. Types of data
Data can be classified into different types based on its nature and characteristics. The most common types
of data are:
 Numerical Data: Numerical data represents quantitative values and can be further categorized into
two subtypes:
o Continuous Data: Continuous data can take any value within a specific range. It is typically
measured on a continuous scale and can have decimal or fractional values. Examples include
height, weight, temperature, and time.
o Discrete Data: Discrete data consists of whole numbers or distinct values that cannot be
subdivided further. It represents countable or categorical data. Examples include the number of
students in a class, the number of cars in a parking lot, or the number of items sold.
 Categorical Data: Categorical data represents qualitative or categorical variables. It includes
distinct categories or groups without any inherent numerical meaning. Categorical data can be
further divided into two subtypes:
o Nominal Data: Nominal data represents categories with no inherent order or
hierarchy. Examples include gender (male/female), marital status
(single/married/divorced), or eye color (blue/brown/green).
o Ordinal Data: Ordinal data represents categories with a specific order or ranking.
The categories have a relative position or rank but may not have a fixed numerical
difference between them. Examples include education levels (high school,
college, graduate), rating scales (1-star, 2-star, 3-star), or satisfaction levels (low,
medium, high).
 Time Series Data: Time series data consists of observations collected over a
sequence of time intervals. It represents data points recorded at regular intervals, such
as daily, monthly, or yearly. Time series data is commonly used to analyze trends,
patterns, and seasonality over time. Examples include stock prices, weather data, or
website traffic over time.
 Textual Data: Textual data represents unstructured or semi-structured textual
information. It includes documents, articles, social media posts, emails, or any other
form of textual content. Analyzing textual data involves techniques such as text
mining, natural language processing (NLP), and sentiment analysis.
 Spatial Data: Spatial data represents information about geographic locations or
features on the Earth's surface. It includes coordinates, polygons, maps, or any data
associated with a specific location. Spatial data is used in various domains such as
geography, urban planning, environmental science, and GPS navigation systems.
 Binary Data: Binary data consists of only two possible values, typically represented as 0 and 1.
It is often used in computer science and digital systems, representing on/off states, true/false
conditions, or presence/absence of certain characteristics.
Understanding the type of data is crucial for selecting appropriate analysis techniques, visualization
methods, and statistical models. It helps determine the appropriate summary statistics, data
transformations, and inferential methods to apply when analyzing and interpreting the data.
5.3. Introduction to Data Pre-processing
Data pre-processing is a crucial step in the data analysis pipeline. It involves preparing raw data to
ensure it is in a suitable format for analysis. Data pre-processing aims to address common issues such as
missing values, outliers, inconsistent formats, and noise, among others. By performing data pre-
processing, you can enhance the quality of the data and improve the accuracy and effectiveness of
subsequent analysis and modeling.
Here are the key steps involved in data pre-processing:


 Data Cleaning:
o Handling Missing Values: Missing values can occur due to various reasons, such as data
collection errors or incomplete records. Common strategies for handling missing values
include:
 Deleting rows or columns with missing values: This approach is suitable when the
amount of missing data is small and will not significantly impact the analysis.
 Imputing missing values: Missing values can be replaced with estimated or calculated
values. Techniques like mean imputation, median imputation, mode imputation, or
advanced imputation methods (e.g., regression imputation, K-nearest neighbors
imputation) can be used.
o Dealing with Outliers: Outliers are data points that significantly deviate from the normal data
distribution. Outliers can be addressed by:
 Removing outliers: Outliers can be identified using statistical techniques (e.g., z-score,
box plots) and then removed from the dataset if they are deemed irrelevant or erroneous.
 Transforming outliers: For certain situations, transforming the data (e.g., applying
logarithmic transformation) can reduce the impact of outliers without removing them
entirely.
o Handling Noise: Noise refers to irrelevant or erroneous data that may arise due to
measurement errors or data collection issues. Techniques like smoothing, filtering, or using
algorithms (e.g., moving averages, median filtering) can help reduce noise in the data.
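As a small illustration of noise reduction, the sketch below smooths a hypothetical series with a moving average and a median filter using pandas rolling windows.
import pandas as pd
# Hypothetical noisy measurements (the spike at 30 is noise)
noisy = pd.Series([10, 12, 9, 14, 30, 11, 13, 12, 10, 11])
# Moving average over a window of 3 values
smoothed_mean = noisy.rolling(window=3, center=True).mean()
# Median filtering is more robust to isolated spikes
smoothed_median = noisy.rolling(window=3, center=True).median()
print(smoothed_mean)
print(smoothed_median)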
 Data Integration:
o Data integration involves combining data from multiple sources to create a unified dataset. This
step is essential when dealing with data collected from different databases, files, or formats.
Techniques such as merging, joining, or concatenating can be employed to integrate data
effectively.
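A minimal pandas sketch of data integration, merging and concatenating two hypothetical tables:
import pandas as pd
customers = pd.DataFrame({'CustomerID': [1, 2, 3],
                          'Name': ['Asha', 'Ravi', 'Meera']})
orders = pd.DataFrame({'CustomerID': [1, 2, 2],
                       'Amount': [250, 400, 150]})
# Join the two sources on the common key to build a unified dataset
merged = pd.merge(customers, orders, on='CustomerID', how='left')
print(merged)
# Concatenate rows collected from different files or sources
more_customers = pd.DataFrame({'CustomerID': [4], 'Name': ['Omar']})
combined = pd.concat([customers, more_customers], ignore_index=True)
print(combined)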
 Data Transformation:
o Normalization: Normalization ensures that numerical features are on a similar scale,
preventing one feature from dominating the analysis due to its larger values. Common
normalization techniques include:
o Z-score normalization: Transforming the values to have a mean of 0 and a standard deviation
of 1.
o Min-max scaling: Rescaling the values to a specified range, often between 0 and 1.
o Feature Scaling: Feature scaling is particularly useful when working with machine learning
algorithms that are sensitive to the scale of input features. Techniques such as standardization
or robust scaling can be applied to normalize the range of numerical features.
 Standardization (z-score normalization):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
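# df_filled is assumed to be a DataFrame (e.g., after missing-value handling) with numeric 'Age' and 'Income' columns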
df_filled[['Age', 'Income']] = scaler.fit_transform(df_filled[['Age', 'Income']])
print("\nDataFrame after standardization:")
print(df_filled)
 Normalization (min-max scaling):
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
df_filled[['Age', 'Income']] = min_max_scaler.fit_transform(df_filled[['Age', 'Income']])
print("\nDataFrame after min-max scaling:")
print(df_filled)
o Encoding Categorical Variables: Machine learning algorithms generally require numerical
input, so categorical variables need to be encoded. Common encoding techniques include:
o One-hot encoding: Representing each category as a binary feature column.
o Label encoding: Assigning a numerical label to each category.
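A short sketch of both encoding techniques on a hypothetical categorical column:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df_cat = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})
# One-hot encoding: each category becomes a binary feature column
one_hot = pd.get_dummies(df_cat, columns=['Gender'])
print(one_hot)
# Label encoding: each category is assigned an integer label
le = LabelEncoder()
df_cat['Gender_encoded'] = le.fit_transform(df_cat['Gender'])
print(df_cat)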
o Dimensionality Reduction: Dimensionality reduction techniques are employed to reduce the
number of features while retaining the most important information. This helps overcome the
curse of dimensionality and can improve the efficiency and interpretability of analysis. Popular
dimensionality reduction methods include (see the short PCA sketch after these items):
o Principal Component Analysis (PCA): Transforming the original features into a lower-
dimensional space using linear combinations of the original variables.
o Feature Selection: Selecting a subset of relevant features based on statistical techniques,
domain knowledge, or machine learning algorithms.
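A minimal PCA sketch with scikit-learn; the feature matrix here is randomly generated, and in practice features are usually standardized before applying PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Hypothetical feature matrix with 100 samples and 4 features
X = np.random.rand(100, 4)
# Standardize, then project onto 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component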
 Data Discretization:
o Data discretization involves converting continuous data into discrete intervals or bins.
Discretization can be useful when working with algorithms that require categorical or ordinal
data. Techniques like equal-width binning, equal-frequency binning, or entropy-based binning
can be used.
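A brief sketch of equal-width and equal-frequency binning with pandas; the ages are illustrative.
import pandas as pd
ages = pd.Series([18, 22, 25, 31, 35, 42, 47, 55, 63, 70])
# Equal-width binning: four intervals of equal size
equal_width = pd.cut(ages, bins=4)
# Equal-frequency binning: roughly the same number of values per bin
equal_freq = pd.qcut(ages, q=4)
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())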
 Handling Imbalanced Data:
Imbalanced data occurs when the distribution of classes or categories in the dataset is skewed, with
one class being significantly more prevalent than others. Techniques for handling imbalanced data
include:
o Under-sampling: Randomly removing samples from the majority class to achieve a balanced
distribution.
o Oversampling: Creating synthetic samples in the minority class to balance the distribution.
o Class-weighting: Assigning higher weights to the minority class during model training to give
it more importance.
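A small sketch of two of these options with scikit-learn: random oversampling of the minority class using resample, and class weighting during model training. The dataset is synthetic.
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
# Synthetic imbalanced dataset: 90 samples of class 0, 10 samples of class 1
X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)
# Oversampling: resample the minority class up to the majority class size
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=90, random_state=42)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
# Class weighting: give the minority class more importance during training
clf = LogisticRegression(class_weight='balanced').fit(X, y)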
5.4. Handling Missing Values
Handling missing values is an essential part of data pre-processing. Missing values can occur due to
various reasons such as data collection errors, equipment failures, or survey non-responses. Dealing
with missing values appropriately is crucial to ensure the accuracy and reliability of data analysis. Here
are some common strategies for handling missing values:
 Deleting Rows or Columns: If the amount of missing data is small and doesn't significantly impact
the analysis, you can choose to delete the rows or columns with missing values. However, this approach
should be used cautiously, as it can lead to a loss of valuable information.
 Delete Columns
# Drop the 'Name' column
df = df.drop(columns=['Name'])
print("\nDataFrame after dropping 'Name' column:")
print(df)
 Delete Rows
# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with any missing values:")
print(df_dropped_rows)
 Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the
available data for the respective feature. This method assumes that the missing values have a similar
distribution to the observed values.
 Mean Imputation
df_mean_imputed = df.copy()
df_mean_imputed['Age'] = df_mean_imputed['Age'].fillna(df_mean_imputed['Age'].mean())
print("\nDataFrame after mean imputation:")
print(df_mean_imputed)
 Median Imputation
df_median_imputed = df.copy()
df_median_imputed['Age'] = df_median_imputed['Age'].fillna(df_median_imputed['Age'].median())
print("\nDataFrame after median imputation:")
print(df_median_imputed)
 Mode Imputation
df_mode_imputed = df.copy()
df_mode_imputed['Gender'] = df_mode_imputed['Gender'].fillna(df_mode_imputed['Gender'].mode()[0])
print("\nDataFrame after mode imputation:")
print(df_mode_imputed)
 Forward/Backward Filling: Propagate the last known value forward or the next known value
backward to fill in missing values. This approach is suitable when missing values occur in sequences
or time series data.
 Forward Filling
df_ffill = df.copy()
df_ffill = df_ffill.ffill()
print("\nDataFrame after forward filling:")
print(df_ffill)
 Backward Filling
df_bfill = df.copy()
df_bfill = df_bfill.bfill()
print("\nDataFrame after backward filling:")
print(df_bfill)
 K-Nearest Neighbors (KNN) Imputation: Identify the K nearest neighbors based on available
features and use their values to impute missing values. KNN imputation works well when the missing
values are related to the values of their neighboring data points.
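 KNN Imputation (illustrative sketch): the snippet below is a minimal sketch using scikit-learn's KNNImputer on a small hypothetical DataFrame with numeric columns.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
df_knn = pd.DataFrame({'Age': [25, np.nan, 35, 40, np.nan],
                       'Income': [30000, 42000, np.nan, 55000, 61000]})
# Each missing value is filled using the mean of its 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)
df_knn_imputed = pd.DataFrame(imputer.fit_transform(df_knn), columns=df_knn.columns)
print("\nDataFrame after KNN imputation:")
print(df_knn_imputed)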
5.5. Handling Outliers and Inconsistencies
 Handling outliers and inconsistencies is an important step in data pre-processing to ensure the accuracy
and reliability of data analysis. Outliers are extreme values that deviate significantly from the normal
data pattern, while inconsistencies refer to data values that are illogical or contradictory. Here's a
detailed explanation of how to handle outliers and inconsistencies:
 Handling Outliers:
o Identify Outliers: Identifying outliers is an important step in data pre-processing to understand
the distribution of data and identify any extreme values that may significantly deviate from the
normal pattern. Here are several approaches to identifying outliers:
 Visual Inspection: Plot the data using techniques like box plots, histograms, or scatter plots to visually
identify potential outliers. Outliers may appear as points far away from the main distribution or as
values outside a certain range.
 Statistical Methods: Utilize statistical techniques such as z-scores or interquartile range (IQR) to
quantitatively identify outliers. Observations that fall beyond a specified threshold (e.g., z-score > 3
or outside 1.5 times the IQR) can be considered as outliers.
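The sketch below applies both approaches to a hypothetical numeric column, flagging values with a z-score above 3 or lying outside 1.5 times the IQR.
import numpy as np
import pandas as pd
# Mostly values near 12, with one extreme value (95)
values = pd.Series([12, 11, 13, 12, 14, 11, 12, 13, 11, 12,
                    13, 12, 11, 14, 12, 13, 11, 12, 13, 12, 95])
# Z-score method
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]
# IQR method
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(z_outliers)
print(iqr_outliers)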
 Decide on the Treatment Approach:


o Remove Outliers: If the outliers are deemed irrelevant or caused by measurement errors, you
may choose to remove them from the dataset. However, be cautious about removing too many
outliers, as it can affect the representativeness of the data.
o Transform Outliers: Instead of removing outliers, you can transform their values to reduce
their impact. Common transformation techniques include logarithmic transformation, square
root transformation, or Winsorization (replacing extreme values with the nearest values within
a certain range).
o Apply the Chosen Approach: If you decide to remove outliers, you can delete the
corresponding data points. However, ensure that the removal does not introduce bias or
significantly alter the overall data distribution.
 If you choose to transform outliers, apply the appropriate transformation method to adjust the values
while maintaining their relative order and pattern.
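A brief sketch of the transformations mentioned above: a logarithmic transformation and Winsorization by clipping to the 5th and 95th percentiles. The values are illustrative.
import numpy as np
import pandas as pd
values = pd.Series([5, 7, 6, 8, 9, 7, 6, 120])  # 120 is an extreme value
# Log transformation compresses large values and reduces their impact
log_values = np.log1p(values)
# Winsorization: clip extreme values to the 5th/95th percentile values
lower, upper = values.quantile(0.05), values.quantile(0.95)
winsorized = values.clip(lower=lower, upper=upper)
print(log_values)
print(winsorized)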
 Handling Inconsistencies:
o Identify Inconsistencies:
 Perform Data Validation: Check for logical inconsistencies within the data. For
example, verify that age values are within a reasonable range, dates are in a valid
format, or categorical variables contain expected categories.
 Cross-Referencing: Cross-reference data across different sources or variables to
identify inconsistencies. For instance, compare customer addresses with postal code
databases to identify address discrepancies.
o Resolve Inconsistencies:
 Manual Inspection and Correction: Inspect the inconsistent data points and manually
correct them based on available information or expert knowledge.
 Imputation: If inconsistencies cannot be manually resolved, impute missing or
inconsistent values using appropriate imputation techniques (as discussed in the
previous section).
o Document Changes:
 Keep a record of the changes made during the inconsistency handling process. This
documentation ensures transparency and allows others to understand the data pre-
processing steps undertaken.
 When handling outliers and inconsistencies, it is important to consider the context and
domain knowledge. Understanding the data generation process and consulting subject
matter experts can aid in making informed decisions regarding outlier treatment and
resolving inconsistencies.
5.6. Introduction to machine learning


 Machine learning is a subfield of artificial intelligence (AI) that focuses on the development of
algorithms and models that enable computers to learn and make predictions or decisions
without being explicitly programmed. It involves training a computer system to automatically
learn patterns, relationships, and insights from data, and then use that knowledge to perform
tasks or make predictions.
 In traditional programming, developers write explicit instructions for a computer to follow.
However, in machine learning, the computer learns from examples and data, iteratively
improving its performance over time. This learning process can be categorized into three main
types of machine learning:
 Supervised Learning: In supervised learning, the algorithm is trained on labelled data, where
each data point is associated with a known target or output variable. The goal is to learn a
mapping function that can predict the output variable given new, unseen inputs. Examples of
supervised learning algorithms include linear regression, decision trees, random forests,
support vector machines (SVM), and neural networks.
 Unsupervised Learning: In unsupervised learning, the algorithm is trained on unlabelled data,
where there is no predefined target variable. The objective is to find patterns, structures, or
relationships within the data. Unsupervised learning can be used for tasks such as clustering
similar data points, dimensionality reduction, or anomaly detection. Common unsupervised
learning algorithms include k-means clustering, hierarchical clustering, principal component
analysis (PCA), and association rule mining.
 Reinforcement Learning: Reinforcement learning involves an agent that learns to make
decisions in an environment to maximize a cumulative reward. The agent interacts with the
environment, receives feedback in the form of rewards or penalties, and adjusts its actions
based on the received feedback. Reinforcement learning is commonly used in applications such
as game playing, robotics, and autonomous systems.
Machine learning projects typically go through the following steps:
 Data Collection: Gathering relevant data that represents the problem or task at hand. The
quality and quantity of data play a crucial role in the performance of machine learning models.
 Data Pre-processing: Cleaning, transforming, and preparing the data for analysis. This
includes handling missing values, dealing with outliers, encoding categorical variables, and
scaling or normalizing the data.
 Model Selection and Training: Choosing an appropriate machine learning algorithm and
training it on the labelled or unlabelled data. This involves splitting the data into training and
validation sets, feeding the data into the algorithm, and optimizing its parameters.
 Model Evaluation: Assessing the performance of the trained model using evaluation metrics
such as accuracy, precision, recall, or mean squared error, depending on the specific problem
and the type of algorithm used.
 Model Deployment: Once the model is trained and evaluated, it can be deployed to make
predictions or decisions on new, unseen data. This could involve integrating the model into a
larger system or application.
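A compact scikit-learn sketch of the training and evaluation steps described above, using a synthetic dataset; in a real project the data collection and pre-processing steps would come first.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Synthetic labelled data standing in for a collected, pre-processed dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
# Split into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model selection and training (supervised learning)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Model evaluation on unseen data
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))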
 Machine learning has a wide range of applications across various industries, including finance,
healthcare, marketing, image and speech recognition, natural language processing, and
recommendation systems, to name just a few. It continues to advance and evolve, with new
algorithms and techniques being developed to tackle more complex problems and improve
performance.
 It's worth noting that machine learning requires careful consideration of data quality, feature
selection, model evaluation, and ethical considerations to ensure reliable and unbiased results.
The field of machine learning is constantly evolving, with ongoing research and development
focused on enhancing algorithms, handling large-scale datasets, and addressing interpretability
and fairness challenges.
