Detailed Notes on Data Analytics and Descriptive Statistics
1. Introduction to Data Analytics
Data Analytics refers to the systematic process of examining datasets to draw meaningful
insights, identify patterns, and support decision-making. It is a foundation of Data Science,
which integrates statistics, computer science, and domain knowledge.
1.1 Big Data and Data Science
Big Data: Refers to datasets so large and complex that traditional data processing tools
cannot handle them efficiently. They are often described by the 5 V’s:
- Volume: Massive amounts of data.
- Velocity: Speed at which data is generated and processed.
- Variety: Different forms (structured, unstructured, semi-structured).
- Veracity: Data reliability and accuracy.
- Value: The potential benefit derived from the data.
Example: Social media platforms generating billions of user interactions daily.
Data Science: A multidisciplinary field combining mathematics, statistics, machine learning,
and programming to extract insights from both big and small data.
Example: Netflix uses Data Science to recommend movies to users based on past viewing
behavior.
1.2 Small Data
Small Data refers to datasets that are small enough to be analyzed by a single computer
using traditional statistical methods. Although smaller in scale, they can be highly
meaningful.
Example: A teacher analyzing the exam scores of 50 students to find class performance
trends.
1.3 Taxonomy of Data Analytics
- Descriptive Analytics: Summarizes historical data. Example: Monthly sales reports.
- Diagnostic Analytics: Explains why something happened. Example: Identifying the causes of
a drop in sales.
- Predictive Analytics: Uses historical data and models to predict outcomes. Example:
Predicting customer churn.
- Prescriptive Analytics: Recommends optimal actions. Example: Suggesting the best pricing
strategy.
1.4 Examples of Data Use
- Healthcare: Predicting disease risks based on medical history.
- Finance: Detecting fraudulent transactions.
- Retail: Personalizing shopping experiences through recommendation systems.
- Manufacturing: Predictive maintenance of machinery.
1.5 Case Studies
Breast Cancer in Wisconsin: A dataset containing features of cell nuclei, used in machine
learning classification tasks to distinguish between benign and malignant tumors.
Polish Company Insolvency Data: Contains financial ratios used to predict whether
companies will go bankrupt.
1.6 A Little History on Methodologies
- 1960s–1980s: Focus on descriptive statistics and hypothesis testing.
- 1990s: Growth of data mining and knowledge discovery.
- 2000s: Emergence of Big Data and machine learning applications.
- Today: Integration of AI and deep learning in Data Analytics.
2. Descriptive Statistics
2.1 Scale Types
- Nominal: Categories without any order. Example: Blood type (A, B, AB, O).
- Ordinal: Ordered categories. Example: Customer satisfaction ratings (Poor, Fair, Good,
Excellent).
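- Interval: Numeric data with meaningful differences but no true zero, so ratios are not
meaningful. Example: Temperature in Celsius.
- Ratio: Numeric data with a true zero, so both differences and ratios are meaningful.
Example: Height, weight, income.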
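The difference between interval and ratio scales can be checked with a small calculation. The sketch below uses two illustrative temperatures (assumed values, not from any dataset) to show that a ratio computed on the Celsius scale changes when the same temperatures are expressed in Kelvin, while the difference does not.

```python
# Two illustrative temperatures in Celsius (assumed values).
c1, c2 = 10.0, 20.0

# On an interval scale the ratio is arithmetically valid but not meaningful:
# 20 degrees Celsius is not "twice as hot" as 10 degrees Celsius.
ratio_celsius = c2 / c1

# Kelvin has a true zero, so it is a ratio scale.
k1, k2 = c1 + 273.15, c2 + 273.15
ratio_kelvin = k2 / k1

# Differences are preserved under the change of scale.
diff_celsius = c2 - c1
diff_kelvin = k2 - k1

print(f"Ratio in Celsius: {ratio_celsius:.3f}")
print(f"Ratio in Kelvin:  {ratio_kelvin:.3f}")
print(f"Difference: {diff_celsius:.1f} (Celsius), {diff_kelvin:.1f} (Kelvin)")
```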
2.2 Descriptive Univariate Analysis
Analyzes one variable at a time. Includes:
- Central Tendency: Mean, Median, Mode.
- Dispersion: Range, Variance, Standard Deviation.
- Distribution Shape: Skewness, Kurtosis.
Example: Analyzing the average salary of employees in a company.
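The measures listed above can be computed with Python's standard statistics module. This is a minimal sketch on made-up salary figures; the skewness helper is an assumption of this example (the population moment definition), and other definitions apply small-sample corrections.

```python
import statistics

# Hypothetical monthly salaries (illustrative values, not real data).
salaries = [3000, 3200, 3200, 3500, 4000, 4100, 7000]

# Central tendency
mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)

# Dispersion
value_range = max(salaries) - min(salaries)
variance = statistics.variance(salaries)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(salaries)      # sample standard deviation


def skewness(values):
    """Population (moment) skewness: the mean of the cubed z-scores."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mu) / sigma) ** 3 for v in values) / n


print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={variance:.1f}, std_dev={std_dev:.1f}")
print(f"skewness={skewness(salaries):.2f}")
```

Here the single high salary pulls the mean (4000) above the median (3500), and the skewness is positive, which is the usual signature of a right-skewed distribution.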
2.3 Univariate Frequencies
Frequency analysis shows how often each value or category appears, either as a count
(absolute frequency) or as a share of the total (relative frequency).
Example: In a survey of favorite fruits, if 40% choose Mango, 30% Apple, 20% Banana, and
10% Grapes, these percentages form the relative frequency distribution.
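A frequency table like the one in the fruit example can be built with collections.Counter. The sketch assumes a hypothetical survey of 10 responses chosen to reproduce the 40/30/20/10 split.

```python
from collections import Counter

# Hypothetical survey responses (10 answers matching the 40/30/20/10 split).
responses = ["Mango"] * 4 + ["Apple"] * 3 + ["Banana"] * 2 + ["Grapes"] * 1

# Absolute frequencies: how many times each category appears.
absolute = Counter(responses)

# Relative frequencies: each count divided by the number of responses.
total = sum(absolute.values())
relative = {fruit: count / total for fruit, count in absolute.items()}

for fruit, count in absolute.most_common():
    print(f"{fruit:<7} {count:>2}  {relative[fruit]:.0%}")
```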
2.4 Common Univariate Probability Distributions
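- Normal Distribution: Symmetric, bell-shaped curve defined by its mean and standard
deviation; common in natural phenomena.
- Binomial Distribution: Number of successes in a fixed number of independent binary
(success/failure) trials with the same success probability.
- Poisson Distribution: Counts of events in a fixed interval. Example: Number of customer
arrivals per hour.
- Exponential Distribution: Time between consecutive events that occur at a constant
average rate.
- Uniform Distribution: All outcomes in a given range equally likely.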
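To make these distributions concrete, the sketch below evaluates a few probabilities using only the standard library. The helper functions and the parameter values (10 coin flips, 4 arrivals per hour) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval with an average of lam events)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def exponential_cdf(t, rate):
    """P(waiting time until the next event is at most t) for a given event rate."""
    return 1 - math.exp(-rate * t)


# Binomial: exactly 5 heads in 10 fair coin flips.
p_five_heads = binomial_pmf(5, 10, 0.5)

# Poisson: exactly 3 customer arrivals in an hour when the average is 4 per hour.
p_three_arrivals = poisson_pmf(3, 4)

# Exponential: next arrival within 15 minutes (0.25 h) at 4 arrivals per hour.
p_within_quarter_hour = exponential_cdf(0.25, 4)

# Normal: share of values within one standard deviation of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
p_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(f"Binomial    P(X = 5)     = {p_five_heads:.4f}")
print(f"Poisson     P(X = 3)     = {p_three_arrivals:.4f}")
print(f"Exponential P(T <= 0.25) = {p_within_quarter_hour:.4f}")
print(f"Normal      P(|Z| <= 1)  = {p_within_one_sd:.4f}")
```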
3. Data Visualization
Data visualization helps represent data graphically for better understanding.
Examples:
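- Histogram: Shows how often the values of a quantitative attribute fall into each bin.
- Box Plot: Shows the median, quartiles, spread, and outliers of a distribution.
- Scatter Plot: Shows relationships between two quantitative attributes.
- Heatmap: Uses color to show correlations or the intensity of values in a table.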
Example: A scatter plot showing the relationship between advertising spend and sales
revenue.
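A box plot is drawn from a handful of summary numbers. The sketch below computes them for made-up data without any plotting library. It assumes the common convention of flagging values more than 1.5 times the interquartile range beyond the quartiles as outliers, and it uses the default quartile method of statistics.quantiles; other tools may compute quartiles slightly differently.

```python
import statistics

# Hypothetical observations (illustrative values), including one extreme value.
data = [12, 15, 17, 18, 20, 21, 23, 24, 26, 60]

# Quartiles: the edges of the box and the line inside it.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences: values outside them are drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values that are still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```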
4. Descriptive Bivariate Analysis
Bivariate analysis explores the relationship between two variables.
4.1 Two Quantitative Attributes
Methods: Scatter plots, correlation coefficients, regression analysis.
Example: Relationship between hours studied and exam scores.
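For two quantitative attributes, the Pearson correlation coefficient and a least-squares line can be computed directly from sums of squares. The data below are hypothetical hours and scores, and the variable names are this sketch's own.

```python
import math

# Hypothetical data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 75]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of squares and cross-products around the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in scores)

# Pearson correlation: strength and direction of the linear relationship.
r = s_xy / math.sqrt(s_xx * s_yy)

# Simple linear regression: score = intercept + slope * hours.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"Pearson r = {r:.4f}")
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
```

With these numbers each additional hour of study is associated with about 5.8 more points, and r is close to 1, indicating a strong positive linear relationship. Correlation alone does not establish that studying causes the higher scores.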
4.2 Two Qualitative Attributes
Methods: Cross-tabulation, Chi-square tests.
Example: Relationship between gender (male/female) and preference for a product
(yes/no).
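The cross-tabulation and chi-square statistic for the gender and preference example can be sketched as follows. The counts are invented for illustration, no continuity correction is applied, and the p-value shortcut used here is valid only for one degree of freedom (a 2x2 table).

```python
import math
from collections import Counter

# Hypothetical survey records: (gender, prefers product).
records = (
    [("male", "yes")] * 30
    + [("male", "no")] * 20
    + [("female", "yes")] * 20
    + [("female", "no")] * 30
)

# Cross-tabulation: observed count for each combination of categories.
observed = Counter(records)
rows = sorted({gender for gender, _ in records})
cols = sorted({pref for _, pref in records})
n = len(records)
row_totals = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(observed[(r, c)] for r in rows) for c in cols}

# Chi-square statistic: compare observed counts with the counts expected
# if the two attributes were independent.
chi2 = 0.0
for r in rows:
    for c in cols:
        expected = row_totals[r] * col_totals[c] / n
        chi2 += (observed[(r, c)] - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(cols) - 1)

# For exactly one degree of freedom, the p-value equals P(|Z| > sqrt(chi2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}, p = {p_value:.4f}")
```

Here chi-square is 4.0, which exceeds the 5% critical value of 3.841 for one degree of freedom, so these made-up counts would suggest an association between gender and preference.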
4.3 At Least One Nominal Attribute
Methods: Bar charts and stacked bar charts; when the other attribute is quantitative, ANOVA
for comparing group means.
Example: Comparing average income levels across different job categories.
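The one-way ANOVA F statistic behind such a group comparison can be computed by hand. The incomes below are hypothetical (in thousands), and the sketch stops at the F statistic; judging significance requires comparing it with an F distribution, which is not done here.

```python
import statistics

# Hypothetical annual incomes (in thousands) for three job categories.
groups = {
    "clerical": [30, 32, 34],
    "technical": [40, 42, 44],
    "managerial": [50, 52, 54],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = statistics.mean(all_values)

# Between-group variation: how far each group mean is from the grand mean.
ss_between = sum(
    len(values) * (statistics.mean(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group variation: how far each value is from its own group mean.
ss_within = sum(
    (v - statistics.mean(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F statistic: ratio of between-group to within-group mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between}, SS within = {ss_within}")
print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
```

A large F value, as in this example, means the group means differ far more than the spread inside each group would explain.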
4.4 Two Ordinal Attributes
Methods: Spearman’s rank correlation, Kendall’s tau.
Example: Relationship between education level (high school, graduate, postgraduate) and
job satisfaction rating (low, medium, high).
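Both rank-based measures can be sketched with the standard library. The six observations and the numeric codes for the ordered categories are assumptions of this example. Spearman's coefficient is computed as the Pearson correlation of average ranks, and the Kendall variant shown is tau-b, which adjusts for ties.

```python
import math
from itertools import combinations

# Numeric codes that preserve the order of the categories (assumed encoding).
EDUCATION = {"high school": 1, "graduate": 2, "postgraduate": 3}
SATISFACTION = {"low": 1, "medium": 2, "high": 3}

# Hypothetical observations for six people.
education = ["high school", "high school", "graduate",
             "graduate", "postgraduate", "postgraduate"]
satisfaction = ["low", "medium", "medium", "high", "medium", "high"]

x = [EDUCATION[e] for e in education]
y = [SATISFACTION[s] for s in satisfaction]


def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def kendall_tau_b(a, b):
    """Kendall's tau-b: concordant minus discordant pairs, adjusted for ties."""
    concordant = discordant = ties_a = ties_b = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        da, db = a1 - a2, b1 - b2
        if da == 0:
            ties_a += 1
        if db == 0:
            ties_b += 1
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / math.sqrt(
        (n_pairs - ties_a) * (n_pairs - ties_b)
    )


rho = pearson(average_ranks(x), average_ranks(y))
tau = kendall_tau_b(x, y)

print(f"Spearman rho  = {rho:.3f}")
print(f"Kendall tau-b = {tau:.3f}")
```

Both coefficients are positive for this made-up sample, indicating that higher education levels tend to go with higher satisfaction ratings.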