
Statistical Analysis For Data Science

Statistical analysis is essential in data science for discovering patterns and making predictions from data. It includes descriptive statistics, which summarize datasets through measures of central tendency and dispersion, and inferential statistics, which use sample data to draw conclusions about larger populations. Key concepts include hypothesis testing, common statistical tests, and probability distributions.

Uploaded by

Shubham Gupta

Statistical Analysis for Data Science


Statistical analysis is a fundamental component of data science that involves collecting,
exploring, and presenting large amounts of data to discover underlying patterns and trends. It
provides the mathematical foundation for making informed decisions and predictions based on
data.

Descriptive Statistics
Descriptive statistics summarize and quantify the main features of a dataset.

Measures of Central Tendency


Mean: The average of all values in a dataset.

Median: The middle value when the data are sorted; with an even number of values, the average of the two middle values.

Mode: The most frequently occurring value in a dataset.

Measures of Dispersion
Range: The difference between the maximum and minimum values.

Variance: The average squared deviation from the mean.

Standard Deviation: The square root of the variance.

Interquartile Range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1), which spans the middle 50% of the data.

import numpy as np
from scipy import stats

# Sample dataset
data = [12, 15, 17, 19, 20, 22, 25, 25, 27, 28, 30, 32, 35, 37, 40, 42, 45]

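# Measures of central tendency
mean = np.mean(data)
median = np.median(data)
# stats.mode returns a scalar mode in recent SciPy releases and a
# one-element array in older ones; np.atleast_1d handles both cases
mode = np.atleast_1d(stats.mode(data).mode)[0]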

# Measures of dispersion
data_range = np.max(data) - np.min(data)
variance = np.var(data)  # population variance (ddof=0); pass ddof=1 for the sample variance
std_dev = np.std(data)
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Range: {data_range}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print(f"Interquartile Range: {iqr}")
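As a cross-check on the definitions above, the following sketch recomputes the variance directly from its formula and compares it with NumPy's result. It also shows the `ddof=1` argument, which divides by n - 1 to give the sample variance; the listing above uses NumPy's default, which divides by n. The variable names are illustrative and not part of the original listing.

```python
import numpy as np

# Same sample dataset as in the listing above
data = np.array([12, 15, 17, 19, 20, 22, 25, 25, 27, 28, 30, 32, 35, 37, 40, 42, 45])
n = data.size

# Population variance: mean of the squared deviations from the mean (divide by n)
deviations = data - data.mean()
pop_variance = np.sum(deviations ** 2) / n

# Sample variance: divide by n - 1 instead (Bessel's correction)
sample_variance = np.sum(deviations ** 2) / (n - 1)

# NumPy computes the same quantities via the ddof argument
assert np.isclose(pop_variance, np.var(data))             # ddof=0 is the default
assert np.isclose(sample_variance, np.var(data, ddof=1))

print(f"Population variance: {pop_variance:.2f}")
print(f"Sample variance: {sample_variance:.2f}")
print(f"Sample standard deviation: {np.sqrt(sample_variance):.2f}")
```

When the data are a sample drawn from a larger population, the `ddof=1` versions are usually the ones to report.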

Inferential Statistics
Inferential statistics use sample data to make inferences about a larger population.

Hypothesis Testing
Hypothesis testing is the process of deciding, based on sample data, whether there is enough evidence to reject a default assumption about a population.

Null Hypothesis (H₀): A statement that there is no effect or difference.

Alternative Hypothesis (H₁): A statement that contradicts the null hypothesis.

p-value: The probability of obtaining results at least as extreme as the observed results,
assuming the null hypothesis is true.

Significance Level (α): The threshold, commonly 0.05, that is chosen before the test; the null hypothesis is rejected when the p-value falls below it.
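The terms above can be tied together in a short sketch of a two-sample test. The measurements are hypothetical values made up for illustration, and Welch's t-test (`equal_var=False`) is one reasonable choice here, not a method prescribed by these notes.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two independent groups (illustrative only)
control = np.array([20.1, 21.4, 19.8, 22.0, 20.7, 21.1, 19.5, 20.9])
treatment = np.array([22.3, 23.1, 21.9, 24.0, 22.8, 23.5, 22.1, 23.0])

alpha = 0.05  # significance level, chosen before looking at the data

# H0: the two group means are equal; H1: they differ (two-sided test).
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"t statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.5f}")

if p_value < alpha:
    print("Reject H0: the difference in means is statistically significant.")
else:
    print("Fail to reject H0: the data do not show a significant difference.")
```

Failing to reject H₀ does not show that H₀ is true; it only means the sample did not provide enough evidence against it.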

Common Statistical Tests

Test | Use Case | Example
t-test | Compare means of two groups | Comparing treatment vs. control group
ANOVA | Compare means of three or more groups | Comparing multiple treatment options
Chi-square | Test relationships between categorical variables | Analyzing survey responses
Correlation | Measure relationship between two variables | Height vs. weight analysis
Regression | Predict a dependent variable based on independent variables | Predicting house prices
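For the tests listed above other than the t-test, the sketch below shows one way each could be run with `scipy.stats`. All numbers are made up for illustration, and the variable names are assumptions rather than part of the original notes.

```python
import numpy as np
from scipy import stats

# All values below are made up for illustration.

# ANOVA: compare the means of three treatment options
option_a = [23, 25, 21, 24, 22]
option_b = [30, 28, 31, 29, 32]
option_c = [26, 27, 25, 28, 26]
f_stat, p_anova = stats.f_oneway(option_a, option_b, option_c)

# Chi-square test of independence on a 2x2 table of survey counts
# (rows: two respondent groups, columns: answered yes / no)
observed = np.array([[30, 10],
                     [15, 25]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)

# Pearson correlation between height (cm) and weight (kg)
height = np.array([150, 160, 165, 170, 175, 180, 185])
weight = np.array([52, 58, 63, 68, 72, 79, 85])
r, p_corr = stats.pearsonr(height, weight)

# Simple linear regression: house price (thousands) against floor area (m^2)
area = np.array([50, 65, 80, 95, 110, 130])
price = np.array([150, 190, 235, 270, 320, 370])
fit = stats.linregress(area, price)

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Chi-square: chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4f}")
print(f"Correlation: r = {r:.3f}, p = {p_corr:.4f}")
print(f"Regression: price = {fit.slope:.2f} * area + {fit.intercept:.2f}")
```

Each function returns a test statistic together with a p-value, which is then compared with the chosen significance level.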

Probability Distributions
Probability distributions describe the likelihood of different outcomes in an experiment.
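As a minimal sketch, the example below uses `scipy.stats` to query two common distributions, the normal and the binomial. The choice of distributions and parameters is an assumption made for illustration; the notes do not name specific distributions.

```python
from scipy import stats

# Normal distribution with mean 0 and standard deviation 1
standard_normal = stats.norm(loc=0, scale=1)
print(f"P(X <= 1.96): {standard_normal.cdf(1.96):.4f}")
print(f"Density at 0: {standard_normal.pdf(0):.4f}")

# Binomial distribution: number of heads in 10 fair coin flips
coin_flips = stats.binom(n=10, p=0.5)
print(f"P(exactly 5 heads): {coin_flips.pmf(5):.4f}")
print(f"P(at most 3 heads): {coin_flips.cdf(3):.4f}")

# Draw reproducible random samples from the normal distribution
samples = standard_normal.rvs(size=1000, random_state=42)
print(f"Sample mean: {samples.mean():.3f}")
```

Continuous distributions expose a density through `pdf`, while discrete ones expose a probability mass through `pmf`; both provide `cdf` for cumulative probabilities.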