Unit V - Descriptive statistics

Contents

Descriptive Statistics with Spark: Examples
1. Using Spark DataFrame API for Descriptive Statistics
2. Basic Descriptive Statistics
3. Using describe() for Summary Statistics
4. Additional Descriptive Statistics
5. Descriptive Statistics Using Spark SQL
6. Additional Insights Using Aggregations
7. Example: Descriptive Statistics with a Larger Dataset
8. Conclusion

Descriptive Statistics with Spark: Examples

Descriptive statistics provide a summary of the data and include measures such as mean, median,
variance, standard deviation, min, max, and count. In Apache Spark, you can compute
descriptive statistics using either the DataFrame API or Spark SQL.

1. Using Spark DataFrame API for Descriptive Statistics

First, you need to initialize the SparkSession and create a DataFrame.

from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder.appName("Descriptive Statistics").getOrCreate()

# Sample data
data = [("Alice", 29), ("Bob", 31), ("Charlie", 35), ("David", 25), ("Eve", 29)]
columns = ["Name", "Age"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

2. Basic Descriptive Statistics

Spark provides built-in functions to compute basic statistics like mean, min, max, count,
stddev, and variance.

# Import aggregate functions
from pyspark.sql.functions import col, mean, min, max, stddev, variance

# Calculate basic statistics for the "Age" column
df.select(
    mean("Age").alias("mean_age"),
    min("Age").alias("min_age"),
    max("Age").alias("max_age"),
    stddev("Age").alias("stddev_age"),
    variance("Age").alias("variance_age")
).show()

3. Using describe() for Summary Statistics

The describe() function is the easiest way to generate a summary of the numeric columns in a
DataFrame. It calculates count, mean, stddev, min, and max; it does not return percentiles
(for those, see the summary() note after the output below).

# Generate descriptive statistics using describe()
df.describe().show()

Output:

+-------+-----+------------------+
|summary| Name| Age|
+-------+-----+------------------+
| count| 5| 5|
| mean| null| 29.8|
| stddev| null| 3.70809909036451|
| min|Alice| 25|
| max| Eve| 35|
+-------+-----+------------------+
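
If you also want quartiles in the same style of report, DataFrame.summary() (available since
Spark 2.3) returns count, mean, stddev, min, the approximate 25th/50th/75th percentiles, and
max by default. A minimal sketch:

# summary() extends describe() with approximate 25%, 50%, and 75% percentiles
df.summary().show()

# Specific statistics can also be requested by name
df.summary("count", "mean", "50%", "max").show()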

4. Additional Descriptive Statistics

A. Percentiles (Quantiles)

You can calculate specific percentiles using approxQuantile(). It takes a column name, a list
of probabilities, and a relative error that controls the accuracy of the approximation.

# Calculate approximate percentiles (25th, 50th, 75th)
quantiles = df.approxQuantile("Age", [0.25, 0.5, 0.75], 0.0)
print(f"25th percentile: {quantiles[0]}")
print(f"50th percentile (median): {quantiles[1]}")
print(f"75th percentile: {quantiles[2]}")

B. Mode (Most Frequent Value)

Spark has no dedicated aggregate for the mode in older releases, but you can derive it with
groupBy() and count().

# Find the mode (most frequent value) of the "Age" column
mode_df = df.groupBy("Age").count().orderBy("count", ascending=False).limit(1)
mode_df.show()
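
If you are on Spark 3.4 or later, there is also a built-in mode aggregate; a minimal sketch,
assuming that version:

# pyspark.sql.functions.mode was added in Spark 3.4
from pyspark.sql.functions import mode

df.select(mode("Age").alias("mode_age")).show()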

5. Descriptive Statistics Using Spark SQL

You can also run SQL queries for descriptive statistics.

First, register the DataFrame as a temporary view:

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

A. SQL Example for Descriptive Stats

Now, run SQL queries for descriptive statistics.

# Run a SQL query for basic descriptive statistics
spark.sql("""
    SELECT
        AVG(Age) AS mean_age,
        MIN(Age) AS min_age,
        MAX(Age) AS max_age,
        STDDEV(Age) AS stddev_age,
        VARIANCE(Age) AS variance_age
    FROM people
""").show()

B. SQL Example for Percentiles

You can also use the PERCENTILE aggregate function in SQL to calculate exact percentiles.

# Calculate the 25th, 50th, and 75th percentiles
spark.sql("""
    SELECT
        PERCENTILE(Age, 0.25) AS Q1,
        PERCENTILE(Age, 0.5) AS Median,
        PERCENTILE(Age, 0.75) AS Q3
    FROM people
""").show()

6. Additional Insights Using Aggregations

You can also combine grouping and aggregating functions for more detailed statistics.

A. Grouped Descriptive Statistics

# Example: group data by Age and count occurrences
df.groupBy("Age").count().show()

# Example: group by Age and compute the mean, max, and min of another column,
# assuming the DataFrame has one (a runnable version follows the larger
# dataset in section 7):
# df.groupBy("Age").agg(mean("Salary"), max("Salary"), min("Salary")).show()

7. Example: Descriptive Statistics with a Larger Dataset

Let's simulate a larger dataset for a more practical example:

# Simulated data with more rows
data = [
    ("Alice", 29, 3500),
    ("Bob", 31, 4200),
    ("Charlie", 35, 5500),
    ("David", 25, 3000),
    ("Eve", 29, 3800),
    ("Frank", 32, 4500),
    ("Grace", 28, 3900),
    ("Hannah", 33, 4600)
]
columns = ["Name", "Age", "Salary"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# Generate summary statistics using describe()
df.describe().show()

# Calculate additional statistics for Age and Salary
df.select(
    mean("Age").alias("mean_age"),
    min("Age").alias("min_age"),
    max("Age").alias("max_age"),
    stddev("Age").alias("stddev_age"),
    mean("Salary").alias("mean_salary"),
    min("Salary").alias("min_salary"),
    max("Salary").alias("max_salary")
).show()
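
With a Salary column now available, the grouped aggregation sketched in section 6 becomes
runnable; for example:

# Per-age salary statistics: mean, max, and min Salary within each Age group
df.groupBy("Age").agg(
    mean("Salary").alias("mean_salary"),
    max("Salary").alias("max_salary"),
    min("Salary").alias("min_salary")
).show()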

8. Conclusion

• Descriptive statistics in Spark are easy to compute using both the DataFrame API and
Spark SQL.
• Functions like describe(), avg(), min(), max(), stddev(), and variance() provide quick
insights into the dataset.
• Percentiles and the mode can also be calculated, though some require extra steps.
• Using groupBy() and agg() allows for deeper, grouped insights into the data.
