Unit V - Descriptive statistics
Contents
Descriptive Statistics with Spark: Examples
1. Using Spark DataFrame API for Descriptive Statistics
2. Basic Descriptive Statistics
3. Using describe() for Summary Statistics
4. Additional Descriptive Statistics
5. Descriptive Statistics Using Spark SQL
6. Additional Insights Using Aggregations
7. Example: Descriptive Statistics with a Larger Dataset
8. Conclusion
Descriptive Statistics with Spark: Examples
Descriptive statistics summarize a dataset using measures such as the mean, median,
variance, standard deviation, minimum, maximum, and count. In Apache Spark, you can
compute these statistics with either the DataFrame API or Spark SQL.
1. Using Spark DataFrame API for Descriptive Statistics
First, you need to initialize the SparkSession and create a DataFrame.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("Descriptive Statistics").getOrCreate()
# Sample data
data = [("Alice", 29), ("Bob", 31), ("Charlie", 35), ("David", 25), ("Eve",
29)]
columns = ["Name", "Age"]
# Create a DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
2. Basic Descriptive Statistics
Spark provides built-in functions to compute basic statistics like mean, min, max, count,
stddev, and variance.
# Import functions
from pyspark.sql.functions import mean, min, max, stddev, variance
# Calculate basic statistics for the "Age" column
df.select(
    mean("Age").alias("mean_age"),
    min("Age").alias("min_age"),
    max("Age").alias("max_age"),
    stddev("Age").alias("stddev_age"),
    variance("Age").alias("variance_age")
).show()
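The same aggregates can also be expressed with agg(), which accepts identical column expressions and is convenient when the list of expressions is built programmatically. A minimal sketch, reusing the imports above:
# agg() over the whole DataFrame is equivalent to select() with aggregate expressions
df.agg(
    mean("Age").alias("mean_age"),
    stddev("Age").alias("stddev_age")
).show()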
3. Using describe() for Summary Statistics
The describe() method is the easiest way to generate a summary of the numeric columns in a
DataFrame. It computes count, mean, stddev, min, and max. For approximate quartiles, use
the related summary() method, shown after the output below.
# Generate descriptive statistics using describe()
df.describe().show()
Output:
+-------+-----+------------------+
|summary| Name| Age|
+-------+-----+------------------+
| count| 5| 5|
| mean| null| 29.8|
| stddev| null| 3.63318042491697|
| min|Alice| 25|
| max| Eve| 35|
+-------+-----+------------------+
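If you also want approximate quartiles, the related summary() method (available since Spark 2.3) reports them alongside count, mean, stddev, min, and max. A minimal sketch on the same DataFrame:
# Extended summary, including approximate 25th/50th/75th percentiles
df.summary().show()
# Or request only the statistics you need
df.summary("count", "mean", "25%", "50%", "75%").show()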
4. Additional Descriptive Statistics
A. Percentiles (Quantiles)
You can calculate specific percentiles using approxQuantile(). Its third argument is the
relative error: 0.0 returns exact quantiles at the cost of a full pass over the data, while
larger values return faster, approximate answers.
# Calculate approximate percentiles (25th, 50th, 75th percentiles)
quantiles = df.approxQuantile("Age", [0.25, 0.5, 0.75], 0.0)
print(f"25th percentile: {quantiles[0]}")
print(f"50th percentile (median): {quantiles[1]}")
print(f"75th percentile: {quantiles[2]}")
B. Mode (Most Frequent Value)
Older Spark versions have no direct mode function (Spark 3.4 added a mode() aggregate), but
you can always compute it with groupBy() and count(); a tie-aware variant follows below.
# Calculate the mode (most frequent value in the "Age" column)
mode_df = df.groupBy("Age").count().orderBy("count", ascending=False).limit(1)
mode_df.show()
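Since limit(1) keeps only one row, ties for the most frequent value are silently dropped. A minimal tie-aware sketch, assuming the same df:
from pyspark.sql import functions as F
# Count occurrences of each Age, then keep every value whose count equals the maximum
counts = df.groupBy("Age").count()
max_count = counts.agg(F.max("count").alias("max_count")).first()["max_count"]
counts.filter(F.col("count") == max_count).show()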
5. Descriptive Statistics Using Spark SQL
You can also run SQL queries for descriptive statistics.
First, register the DataFrame as a temporary SQL table:
# Register DataFrame as a SQL table
df.createOrReplaceTempView("people")
A. SQL Example for Descriptive Stats
Now, run SQL queries for descriptive statistics.
# Run SQL queries for basic descriptive statistics
spark.sql("""
SELECT
AVG(Age) AS mean_age,
MIN(Age) AS min_age,
MAX(Age) AS max_age,
STDDEV(Age) AS stddev_age,
VARIANCE(Age) AS variance_age
FROM people
""").show()
B. SQL Example for Percentiles
You can also calculate exact percentiles in SQL with the PERCENTILE aggregate function.
# Calculate the 25th, 50th, and 75th percentiles
spark.sql("""
SELECT
PERCENTILE(Age, 0.25) AS Q1,
PERCENTILE(Age, 0.5) AS Median,
PERCENTILE(Age, 0.75) AS Q3
FROM people
""").show()
6. Additional Insights Using Aggregations
You can also combine grouping and aggregating functions for more detailed statistics.
A. Grouped Descriptive Statistics
# Example: Group data by Age and count occurrences
df.groupBy("Age").count().show()
# Example: Group data by Age and compute mean, max, and min of another column (if available)
# Assuming we have more columns like "Salary"
# df.groupBy("Age").agg(mean("Salary"), max("Salary"), min("Salary")).show()
7. Example: Descriptive Statistics with a Larger Dataset
Let's simulate a larger dataset for a more practical example:
# Simulated data with more rows
data = [
    ("Alice", 29, 3500),
    ("Bob", 31, 4200),
    ("Charlie", 35, 5500),
    ("David", 25, 3000),
    ("Eve", 29, 3800),
    ("Frank", 32, 4500),
    ("Grace", 28, 3900),
    ("Hannah", 33, 4600)
]
columns = ["Name", "Age", "Salary"]
# Create a DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
# Generate summary statistics using describe
df.describe().show()
# Calculate additional statistics for Age and Salary
df.select(
    mean("Age").alias("mean_age"),
    min("Age").alias("min_age"),
    max("Age").alias("max_age"),
    stddev("Age").alias("stddev_age"),
    mean("Salary").alias("mean_salary"),
    min("Salary").alias("min_salary"),
    max("Salary").alias("max_salary")
).show()
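With the Salary column now available, the grouped aggregation that was only sketched in section 6 can be run for real:
from pyspark.sql.functions import mean, min, max
# Per-age salary statistics
df.groupBy("Age").agg(
    mean("Salary").alias("mean_salary"),
    min("Salary").alias("min_salary"),
    max("Salary").alias("max_salary")
).orderBy("Age").show()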
8. Conclusion
Descriptive statistics are easy to compute in Spark using both the DataFrame API and
Spark SQL.
Functions like describe(), avg(), min(), max(), stddev(), and variance() provide quick
insights into a dataset.
Percentiles and the mode can also be calculated, though they require helpers such as
approxQuantile() or a groupBy()/count() combination.
Using groupBy() with agg() allows for deeper, per-group insights into the data.