Data Science and Analytics - Theory Notes
Unit 1: Introduction to Data, Data Science and Analytics
1. Data and Data Science:
- Data refers to raw facts and figures that are collected and processed for analysis.
- Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
2. Data Analytics and Data Analysis:
- Data Analytics is the broader process of examining data sets to draw conclusions and support decision-making.
- Data Analysis is a component of data analytics and refers specifically to the process of inspecting, cleaning, transforming, and modeling data.
3. Classification of Analytics:
- Descriptive Analytics: Summarizes past data to understand what happened.
- Diagnostic Analytics: Investigates why something happened.
- Predictive Analytics: Forecasts future outcomes using historical data.
- Prescriptive Analytics: Recommends actions based on data analysis.
4. Application of Analytics in Business:
- Enhances decision-making
- Improves operational efficiency
- Supports customer behavior analysis
- Assists in market trend identification
- Optimizes resource allocation
5. Types of Data:
- Nominal Data: Categorical data without any order (e.g., gender, colors).
- Ordinal Data: Categorical data with a meaningful order (e.g., rankings).
- Scale Data: Quantitative data, either interval (no true zero, e.g., temperature in °C) or ratio (true zero, e.g., income).
6. Big Data and its Characteristics:
- Big Data refers to extremely large datasets that traditional data processing software cannot handle efficiently.
- Characteristics (5 Vs): Volume, Velocity, Variety, Veracity, Value
7. Applications of Big Data:
- Customer insights and behavior prediction
- Fraud detection in finance
- Personalized recommendations in e-commerce
- Predictive maintenance in manufacturing
- Trend analysis in social media and marketing
8. Challenges in Data Analytics:
- Data privacy and security
- Integration of data from multiple sources
- Managing data quality and consistency
- Shortage of skilled professionals
- High cost of data tools and infrastructure
Unit 2: Data Preparation, Summarisation and Visualisation Using Spreadsheet
1. Data Preparation and Cleaning:
- Identifying and correcting errors or inconsistencies to improve data quality before analysis.
2. Sort and Filter:
- Sorting arranges data in a specific order; filtering displays only the data that meets certain criteria.
3. Conditional Formatting:
- Applies specific formatting to cells that meet certain conditions to visually highlight important information.
4. Text to Columns:
- Splits the content of one cell into multiple cells based on a delimiter (e.g., comma, space).
5. Removing Duplicates:
- Identifies and deletes repeated entries in datasets to maintain data integrity.
6. Data Validation:
- Restricts the type of data that can be entered into a cell, ensuring accuracy and consistency.
7. Identifying Outliers:
- Detects data points that differ significantly from other observations; important for accurate analysis.
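One common rule of thumb flags values lying more than 1.5 times the interquartile range (IQR) below the first quartile or above the third. The notes do not name a specific rule, so the 1.5 x IQR cut-off and the sample figures below are assumptions. A minimal sketch in R (the language introduced in Unit 3):

```r
# Hypothetical sample: daily sales with one unusually large value
sales <- c(12, 15, 14, 13, 16, 15, 14, 95)

# Quartiles and interquartile range
q1 <- quantile(sales, 0.25, names = FALSE)
q3 <- quantile(sales, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- sales[sales < lower | sales > upper]
print(outliers)
```

In a spreadsheet the same fences can be built from quartile functions and then applied with a filter or conditional formatting.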
8. Covariance and Correlation Matrix:
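- Covariance: Measures how two variables change together; the sign gives the direction, but the magnitude depends on the units of the variables.
- Correlation Matrix: A table of pairwise correlation coefficients (each between -1 and +1) showing the strength and direction of linear relationships between variables.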
9. Moving Averages:
- A technique used to smooth out short-term fluctuations and highlight trends in data over time.
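A simple (trailing) moving average replaces each value with the mean of itself and the preceding values in a fixed window. The sketch below assumes a window of 3 periods and made-up demand figures; it uses stats::filter() from base R with equal weights.

```r
# Hypothetical weekly demand figures
demand <- c(10, 12, 11, 15, 14, 18, 17)
window <- 3

# Equal weights of 1/window; sides = 1 uses only the current and past values
ma <- stats::filter(demand, rep(1 / window, window), sides = 1)
ma <- as.numeric(ma)   # drop the time-series class for easier handling

# The first (window - 1) entries are NA because the window is not yet full
print(ma)
```

A wider window smooths more strongly but reacts more slowly to genuine changes in the trend.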
10. Finding Missing Values:
- Identifying and handling gaps in data, using methods like imputation or deletion.
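The two handling methods named above can be sketched in R as follows. The scores are invented, and mean imputation is only one of several possible imputation choices.

```r
# Hypothetical test scores with two gaps (NA marks a missing value)
scores <- c(72, NA, 85, 90, NA, 63)

# Finding the gaps
missing_idx <- which(is.na(scores))   # positions of missing values
n_missing <- sum(is.na(scores))       # how many are missing

# Deletion: keep only the observed values
complete <- scores[!is.na(scores)]

# Imputation: replace each gap with the mean of the observed values
imputed <- scores
imputed[is.na(imputed)] <- mean(scores, na.rm = TRUE)
print(imputed)
```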
11. Summarisation:
- Summarizing data using statistical measures such as mean, median, mode, totals, etc.
12. Visualisation Tools:
- Scatter Plots: Show relationships between two variables.
- Line Charts: Display data trends over time.
- Histograms: Show the frequency distribution of a dataset.
- Pivot Tables: Summarize large datasets by grouping and aggregating data.
- Pivot Charts: Visual representations of pivot tables.
- Interactive Dashboards: Combine visualizations to provide an overview for decision-making.
Unit 3: Getting Started with R
1. Introduction to R:
- R is a programming language and environment specifically designed for statistical computing and graphics.
2. Advantages of R:
- Open-source and free to use
- Extensive libraries for data analysis and visualization
- Strong community support
- Excellent for statistical modeling
3. Installation of R Packages:
- Packages can be installed using install.packages("package_name")
- An installed package must be loaded in each new R session using library("package_name") before its functions can be used
4. Importing Data from Spreadsheet Files:
- Delimited text files (e.g., spreadsheets saved as CSV) can be imported using read.csv() or read.table(); Excel workbooks need a package such as readxl, which provides read_excel().
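A minimal sketch of importing a CSV file with base R. The file contents and column names are hypothetical; the file is written to a temporary location first so that the example runs on its own.

```r
# Create a small hypothetical CSV file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("region,units,price",
             "North,120,9.5",
             "South,95,10.25",
             "East,143,8.75"), csv_path)

# read.csv() assumes a header row and comma separators by default
sales <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(sales)

# read.table() is the general form; the separator must be given explicitly
sales2 <- read.table(csv_path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
```

For a real spreadsheet, csv_path would be replaced by the path to a file exported as CSV.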
5. Commands and Syntax:
- R is case-sensitive and uses functions for most operations.
- Syntax is generally function-based, e.g., mean(data), summary(data)
6. Packages and Libraries:
- R has thousands of packages available via CRAN and other repositories for various types of analysis.
7. Data Structures in R:
- Vectors: One-dimensional data structure with elements of the same type
- Matrices: Two-dimensional data with elements of the same type
- Arrays: Multi-dimensional generalization of matrices
- Lists: Collection of different types of elements
- Factors: Categorical variables
- Data Frames: Tabular data in which each column can hold a different data type
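Each of the six structures can be created with a base R function. All values and names below are invented for illustration.

```r
# Vector: one dimension, all elements of one type
v <- c(2.5, 3.1, 4.8)

# Matrix: two dimensions, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: here three dimensions (2 x 3 x 4)
a <- array(1:24, dim = c(2, 3, 4))

# List: elements may have different types and lengths
l <- list(name = "Asha", scores = v, passed = TRUE)

# Factor: categorical variable with a fixed set of levels
f <- factor(c("low", "high", "medium", "low"),
            levels = c("low", "medium", "high"))

# Data frame: table whose columns can differ in type
df <- data.frame(id = 1:3, score = v, grade = c("C", "B", "A"))
str(df)
```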
8. Conditionals and Control Flows:
- if, else if, else statements for conditional execution
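A short sketch of an if / else if / else chain. The function name grade_for and the cut-off marks are hypothetical.

```r
# Hypothetical grading rule: the first condition that is TRUE decides the result
grade_for <- function(score) {
  if (score >= 75) {
    "Distinction"
  } else if (score >= 40) {
    "Pass"
  } else {
    "Fail"
  }
}

print(grade_for(82))
print(grade_for(40))
print(grade_for(12))
```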
9. Loops:
- for, while, and repeat loops for repetitive tasks
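The three loop forms differ in how they stop: for runs over a fixed sequence, while checks a condition before each pass, and repeat runs until an explicit break. A minimal sketch with arbitrary numbers:

```r
# for: iterate over a known sequence
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while: keep going as long as the condition holds
n <- 1
while (n < 100) {
  n <- n * 2
}

# repeat: no condition of its own, so break is required to exit
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

print(c(total = total, n = n, count = count))
```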
10. Functions and Apply Family:
- User-defined and built-in functions for modular programming
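- Apply family (apply, lapply, sapply, etc.) applies a function over the elements of a vector or list, or over the rows or columns of a matrix, without writing an explicit loop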
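A sketch of a user-defined function alongside three members of the apply family. The function cv and the sample data are hypothetical.

```r
# User-defined function: coefficient of variation (standard deviation / mean)
cv <- function(x) sd(x) / mean(x)

scores <- list(math = c(60, 70, 80), art = c(90, 90, 90))

# lapply returns a list; sapply simplifies the result to a vector where possible
means_list <- lapply(scores, mean)
means_vec <- sapply(scores, mean)
cv_vec <- sapply(scores, cv)

# apply works on matrix margins: 1 = rows, 2 = columns
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

print(means_vec)
print(row_sums)
print(col_sums)
```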
Unit 4: Descriptive Statistics Using R
1. Importing Data File:
- Use functions like read.csv() (base R) or read_excel() (from the readxl package) to load data for analysis
2. Data Visualisation Using Charts:
- Histograms: For frequency distribution
- Bar Charts: For categorical comparisons
- Box Plots: For distribution and outlier detection
- Line Graphs: For trends over time
- Scatter Plots: For relationships between variables
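Each chart type above has a base R plotting function. The sketch below uses the mtcars dataset that ships with R, which is a choice made here for illustration and not something the notes prescribe; the monthly sales figures are invented.

```r
# When run as a script, discard the plots instead of writing Rplots.pdf
if (!interactive()) pdf(NULL)

# Histogram: frequency distribution of a numeric variable
h <- hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")

# Bar chart: counts per category
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, main = "Cars by cylinder count",
        xlab = "Cylinders", ylab = "Count")

# Box plot: distribution and possible outliers, split by group
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon")

# Line graph: trend over time (hypothetical monthly sales)
monthly_sales <- c(20, 22, 25, 24, 28, 31)
plot(seq_along(monthly_sales), monthly_sales, type = "l",
     main = "Monthly sales", xlab = "Month", ylab = "Sales")

# Scatter plot: relationship between two numeric variables
plot(mtcars$wt, mtcars$mpg, main = "mpg vs weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```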
3. Data Description:
- Measure of Central Tendency: Mean, Median, Mode
- Measure of Dispersion: Range, Variance, Standard Deviation
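These measures map onto base R functions, with one exception: base R has no function for the statistical mode (mode() reports an object's storage type), so a small helper is needed. The helper name stat_mode and the sample values are hypothetical.

```r
# Hypothetical sample
x <- c(4, 8, 6, 5, 3, 8, 9, 5, 8)

# Central tendency
x_mean <- mean(x)
x_median <- median(x)

# Helper for the statistical mode: the most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
x_mode <- stat_mode(x)

# Dispersion
x_range <- diff(range(x))   # maximum minus minimum
x_var <- var(x)             # sample variance (divides by n - 1)
x_sd <- sd(x)               # sample standard deviation

print(c(mean = x_mean, median = x_median, mode = x_mode,
        range = x_range, variance = x_var, sd = x_sd))
```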
4. Relationship Between Variables:
- Covariance: Measures how two variables change together
- Correlation: Measures strength and direction of linear relationship
- Coefficient of Determination (R²): Indicates the proportion of variance in the dependent variable that is explained by the model
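The three measures are closely linked: correlation is covariance divided by the product of the two standard deviations, and with a single predictor R² equals the squared correlation. A minimal sketch with invented study-hours data:

```r
# Hypothetical data: hours studied and marks obtained
hours <- c(1, 2, 3, 4, 5)
marks <- c(52, 55, 61, 64, 70)

cov_xy <- cov(hours, marks)   # sample covariance
r <- cor(hours, marks)        # Pearson correlation coefficient

# R-squared from a simple linear regression of marks on hours
fit <- lm(marks ~ hours)
r2 <- summary(fit)$r.squared

# Correlation matrix for several variables at once
cor_matrix <- cor(data.frame(hours, marks))

print(c(covariance = cov_xy, correlation = r, r_squared = r2))
```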
Unit 5: Predictive and Textual Analytics
1. Simple Linear Regression Models:
- Models a continuous dependent variable as a linear function of a single independent variable
2. Confidence and Prediction Intervals:
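- A confidence interval gives a range likely to contain a population parameter (e.g., the mean response) at a chosen confidence level
- A prediction interval gives a range likely to contain a single new observation; it is wider than the confidence interval at the same level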
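A sketch of fitting a simple linear regression and requesting both kinds of interval from predict(). The data and the 95% level are assumptions made for illustration.

```r
# Hypothetical data: hours studied and marks obtained
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
marks <- c(50, 54, 61, 63, 70, 72, 79, 83)

# Simple linear regression: marks = intercept + slope * hours + error
fit <- lm(marks ~ hours)
print(coef(fit))

# Confidence intervals for the intercept and slope
coef_int <- confint(fit, level = 0.95)

# Intervals at a new value of the predictor
new_obs <- data.frame(hours = 5.5)
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

print(conf_int)   # range for the mean mark at 5.5 hours
print(pred_int)   # range for one new student's mark at 5.5 hours
```

Both intervals are centred on the same fitted value; the prediction interval is wider because it also accounts for the scatter of individual observations around the line.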
3. Multiple Linear Regression:
- Models the relationship between one dependent and multiple independent variables
4. Interpretation of Regression Coefficients:
- Each coefficient is the expected change in the dependent variable for a one-unit increase in that independent variable, holding the other independent variables constant
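A sketch of a multiple regression and of reading its coefficients, using the mtcars dataset that ships with R (an illustrative choice, not one made in the notes). The last step checks the "holding the others constant" reading directly: raising weight by one unit while keeping horsepower fixed changes the prediction by exactly the weight coefficient.

```r
# Fuel economy modelled from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit))

# Two hypothetical cars that differ only in weight (by one unit)
base_car <- data.frame(wt = 3, hp = 110)
heavier_car <- data.frame(wt = 4, hp = 110)

# Difference in predicted mpg equals the coefficient on wt
diff_pred <- predict(fit, heavier_car) - predict(fit, base_car)
print(unname(diff_pred))
print(unname(coef(fit)["wt"]))
```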
5. Heteroscedasticity:
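- Occurs when the variance of the errors is not constant across observations; coefficient estimates stay unbiased, but their standard errors become unreliable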
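One way to check for this is a Breusch-Pagan style test: regress the squared residuals on the predictors and compare n times the auxiliary R² with a chi-square distribution. The notes do not name a test, so this choice is an assumption, and the data below are simulated so that the error spread grows with x.

```r
set.seed(42)
n <- 200

# Simulated data: the error standard deviation increases with x
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Test statistic: n * R-squared, chi-square with 1 degree of freedom (one predictor)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value is evidence against constant error variance. A plot of residuals against fitted values is a useful visual companion to the test.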
6. Multi-collinearity:
- Happens when independent variables are highly correlated with each other, which makes individual coefficient estimates unstable and hard to interpret
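A common diagnostic is the variance inflation factor (VIF): regress each independent variable on the others and compute 1 / (1 - R²). The notes do not mention VIF, so it is introduced here as an assumption; the data are simulated so that x2 is almost a copy of x1.

```r
set.seed(1)
n <- 100

# Simulated predictors: x2 is nearly identical to x1, x3 is unrelated
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF for each predictor: 1 / (1 - R-squared) from regressing it on the rest
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif, 2))
```

Values far above 1 indicate that a predictor is largely explained by the others; rules of thumb often treat values above 5 or 10 as a warning sign.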
7. Basics of Textual Data Analysis:
- Analyzing unstructured text data for insights
- Includes understanding context, frequency, and sentiment
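Word frequency and a very basic sentiment score can be sketched in base R. The reviews and the positive and negative word lists are invented, and counting lexicon words is only a crude stand-in for real sentiment analysis.

```r
# Hypothetical customer reviews
reviews <- c("Great product, great price!",
             "Terrible delivery. Product arrived late.",
             "Good price and good support.")

# Tokenise: lower-case the text and split on anything that is not a letter
tokens <- unlist(strsplit(tolower(reviews), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Frequency: count each word, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
print(head(freq, 5))

# Sentiment: positive word count minus negative word count (hypothetical lexicon)
positive <- c("great", "good")
negative <- c("terrible", "late")
score <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  sum(words %in% positive) - sum(words %in% negative)
}
sentiment <- sapply(reviews, score, USE.NAMES = FALSE)
print(sentiment)
```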
8. Significance, Application, and Challenges: