Spark SQL with Python - Q&A Summary

Q1: What are the key features of the Pandas package in data analysis?

1. Database-style DataFrames: merge, join, concat

2. RPy interface: integration with R functions

3. Ecosystem: stats, ML, IDEs, APIs, and out-of-core features

4. SQL-like operations: SELECT, WHERE, GROUP BY, JOIN, etc.

5. GroupBy: split-apply-combine operations

6. Size mutability: columns can be added/removed

7. Slicing/Dicing: flexible access using custom axis labels
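
Python example (a minimal sketch with made-up sales data, illustrating merge, GroupBy, and label-based slicing):

import pandas as pd

# Hypothetical sales and store tables
sales = pd.DataFrame({"store": ["A", "A", "B"], "amount": [100, 250, 75]})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Pune", "Delhi"]})

# Database-style join
joined = sales.merge(stores, on="store", how="inner")

# Split-apply-combine: total sales per city
totals = joined.groupby("city")["amount"].sum()

# Label-based slicing using a custom axis label
print(totals.loc["Pune"])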

Q2: What are User Defined Functions (UDFs) in Spark SQL?

- UDFs are custom functions applied on DataFrames

- They operate one row at a time and require per-row serialization/deserialization (SerDe)

- Useful for encapsulating reusable business logic (e.g., finding the best-performing sales showroom)

- Reduce coding effort and support code modularity

- Can be defined in Java, Scala, or Python
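
Python example (a minimal PySpark sketch; the showroom data and the sales_band function are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("S1", 1200), ("S2", 800)], ["showroom", "sales"])

# Row-at-a-time UDF: each value is serialized to Python, processed, and sent back
@udf(returnType=StringType())
def sales_band(amount):
    return "high" if amount >= 1000 else "low"

df.withColumn("band", sales_band("sales")).show()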

Q3: What are Vectorized UDFs (VUDFs) in Spark?

- Introduced in Spark 2.3 with Apache Arrow

- Process entire columns (Pandas Series) at once

- Higher performance, since per-row serialization overhead is reduced and data is transferred in batches via Arrow

- Input and output must be pandas.Series

- The output Series must be the same length as the input; the return data type must be declared
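
Python example (a minimal sketch of a scalar vectorized UDF; it requires PyArrow, and the data and add_tax logic are illustrative):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("vudf-demo").getOrCreate()
df = spark.createDataFrame([("S1", 1200.0), ("S2", 800.0)], ["showroom", "sales"])

# Vectorized UDF: the whole column arrives as a pandas.Series,
# and the result must be a Series of the same length
@pandas_udf("double")
def add_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.18

df.withColumn("with_tax", add_tax("sales")).show()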

Q4: What are Grouped Vectorized UDFs (GVUDFs)?


- Operate on groups of data (grouped by a condition)

- Based on Pandas' split-apply-combine model

- Input and output are pandas.DataFrames

- Return type is a StructType with column names and types

- Used for grouped aggregations and transformations
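
Python example (a minimal sketch using the applyInPandas API of Spark 3.x; in Spark 2.3/2.4 the same idea is written with a GROUPED_MAP pandas UDF. The data and the mean-centering logic are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gvudf-demo").getOrCreate()
df = spark.createDataFrame([("A", 100.0), ("A", 250.0), ("B", 75.0)], ["store", "amount"])

# Each group arrives as a pandas.DataFrame and a pandas.DataFrame is returned
def center_amount(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["amount"] = pdf["amount"] - pdf["amount"].mean()
    return pdf

# The output schema (column names and types) must be declared
df.groupBy("store").applyInPandas(center_amount, schema="store string, amount double").show()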

Q5: What are some common data analysis operations with Spark SQL?

1. Filtering columns (single/multiple)

2. Creating top-ten lists

3. Setting up sub-totals

4. Applying multiple-field filters

5. Creating unique lists

6. Finding and removing duplicates/outliers

7. Sorting on multiple keys

8. Counting unique items

9. Using SUMIF, COUNTIF

10. Using database functions like DSUM, DMAX

11. Converting lists to tables
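
Python example (a minimal sketch with made-up sales data covering a few of these operations: multiple-field filtering, de-duplication, a top-ten list, and sub-totals):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analysis-demo").getOrCreate()
sales = spark.createDataFrame(
    [("A", "Pune", 100), ("A", "Pune", 100), ("B", "Delhi", 75), ("C", "Pune", 300)],
    ["store", "city", "amount"])

# Multiple-field filter
pune_big = sales.filter((F.col("city") == "Pune") & (F.col("amount") > 50))

# Remove duplicates, then build a top-ten list by amount
top_ten = pune_big.dropDuplicates().orderBy(F.desc("amount")).limit(10)

# Sub-totals and unique counts per city
subtotals = sales.groupBy("city").agg(
    F.sum("amount").alias("total"),
    F.countDistinct("store").alias("unique_stores"))

top_ten.show()
subtotals.show()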

Q6: What are outliers and how are they handled in analysis?

- Outliers are extreme or inconsistent data points

- Can be caused by human errors, bugs, or hardware issues

- Affect mean, variance, and analysis accuracy

- Commonly removed using statistical thresholds, e.g., keeping only values within mean ± 2*std

- Python Example:


mean, std = df['sales'].mean(), df['sales'].std()
df[(df['sales'] > mean - 2*std) & (df['sales'] < mean + 2*std)]
