Spark SQL with Python - Q&A Summary
Q1: What are the key features of the Pandas package in data analysis?
1. Database-style DataFrames: merge, join, concat
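2. R interoperability: R functions can be called through the rpy2 package (the older pandas.rpy module has been removed)
3. Ecosystem: integrates with tools for statistics, machine learning, IDEs, APIs, and out-of-core processing
4. SQL-like operations: SELECT, WHERE, GROUP BY, JOIN, etc.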
5. GroupBy: split-apply-combine operations
6. Size mutability: columns can be added/removed
7. Slicing/Dicing: flexible access using custom axis labels
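A minimal sketch of several of the features listed above (merge, GroupBy, size mutability, label-based slicing). The `sales` and `showrooms` tables and their column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data (not from the original summary).
sales = pd.DataFrame({
    "showroom_id": [1, 1, 2, 3],
    "amount": [250.0, 100.0, 400.0, 50.0],
})
showrooms = pd.DataFrame({
    "showroom_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# Database-style join on a shared key.
merged = sales.merge(showrooms, on="showroom_id", how="inner")

# GroupBy: split by city, apply sum, combine into one Series.
city_totals = merged.groupby("city")["amount"].sum()

# Size mutability: add a column, then remove it again.
merged["amount_with_tax"] = merged["amount"] * 1.1
merged = merged.drop(columns=["amount_with_tax"])

# Slicing with custom axis labels: .loc label slices include both ends.
by_id = showrooms.set_index("showroom_id")
subset = by_id.loc[1:2, ["city"]]
```

Note that `.loc[1:2]` selects by label, not by position, so both showroom 1 and showroom 2 are returned.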
Q2: What are User Defined Functions (UDFs) in Spark SQL?
- UDFs are custom functions applied on DataFrames
- They operate one row at a time and incur SerDe (Serialization/Deserialization) overhead on every call
- Useful for reusable business logic (e.g., best sales showroom)
- Reduce coding effort and support code modularity
- Can be defined in Java, Scala, or Python
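A row-at-a-time Python UDF might look like the sketch below. The `sales_tier` rule, the threshold of 300, and the `amount` column are assumptions made for illustration. The Spark wiring sits in a helper that imports pyspark only when it is called, so the business logic can be tried without a cluster.

```python
def sales_tier(amount):
    """Classify one sale amount; Spark calls this once per row."""
    if amount is None:  # SQL NULL arrives in Python as None
        return None
    return "high" if amount >= 300 else "standard"


def add_tier_column(df):
    """Apply sales_tier to a Spark DataFrame with an `amount` column.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    tier_udf = udf(sales_tier, StringType())  # the return type is declared
    return df.withColumn("tier", tier_udf(col("amount")))
```

Keeping the logic in a plain function is one way to get the reuse and modularity mentioned above: the same function can be unit-tested directly and wrapped as a UDF wherever it is needed.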
Q3: What are Vectorized UDFs (VUDFs) in Spark?
- Introduced in Spark 2.3 as pandas UDFs, built on Apache Arrow
- Process entire columns (Pandas Series) at once
- Higher performance due to reduced overhead
- Input and output must be pandas.Series
- Output length must equal input length; the return data type must be declared
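As a hedged sketch of these rules, the function below takes a pandas.Series and returns a Series of the same length. The `add_tax` name, the 10% rate, and the `amount` column are hypothetical. The Spark registration is wrapped in a helper that assumes pyspark and PyArrow are installed.

```python
import pandas as pd


def add_tax(amount: pd.Series) -> pd.Series:
    """Vectorized logic: one call handles a whole batch of rows."""
    return amount * 1.1  # output has the same length as the input


def add_tax_column(df):
    """Apply add_tax to a Spark DataFrame with a numeric `amount` column.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    """
    from pyspark.sql.functions import col, pandas_udf
    from pyspark.sql.types import DoubleType

    add_tax_udf = pandas_udf(add_tax, returnType=DoubleType())
    return df.withColumn("amount_with_tax", add_tax_udf(col("amount")))
```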
Q4: What are Grouped Vectorized UDFs (GVUDFs)?
- Operate on groups of data (grouped by a condition)
- Based on Pandas' split-apply-combine model
- Input and output are pandas.DataFrames
- Return type is a StructType with column names and types
- Used for grouped aggregations and transformations
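The split-apply-combine pattern can be sketched as follows. The function receives one group as a pandas.DataFrame and returns a pandas.DataFrame whose columns match a declared StructType. The `city` and `amount` columns are hypothetical. The helper uses `applyInPandas`, which is available from Spark 3.0. Spark 2.3 expressed the same idea with a `pandas_udf` of type GROUPED_MAP.

```python
import pandas as pd


def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Center `amount` within one group; input and output are DataFrames."""
    centered = pdf["amount"] - pdf["amount"].mean()
    # Return exactly the columns named in the declared schema.
    return pdf[["city"]].assign(amount=centered)


def center_by_city(df):
    """Apply subtract_group_mean per city on a Spark DataFrame.

    Hypothetical helper: `df` is assumed to have `city` and `amount` columns.
    """
    from pyspark.sql.types import (
        DoubleType, StringType, StructField, StructType,
    )

    schema = StructType([
        StructField("city", StringType()),
        StructField("amount", DoubleType()),
    ])
    return df.groupBy("city").applyInPandas(subtract_group_mean, schema=schema)
```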
Q5: What are some common data analysis operations with Spark SQL?
1. Filtering columns (single/multiple)
2. Creating top-ten lists
3. Setting up sub-totals
4. Applying multiple-field filters
5. Creating unique lists
6. Finding and removing duplicates/outliers
7. Multi-key sorting
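8. Counting unique items
9. Conditional aggregation (the equivalent of spreadsheet SUMIF and COUNTIF)
10. Database-style aggregates over filtered rows (the equivalent of spreadsheet DSUM and DMAX)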
11. Converting lists to tables
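Several of these operations can be written as Spark SQL queries. The sketch below assumes a hypothetical `sales` view with `city`, `showroom`, and `amount` columns. The conditional sums and counts use standard `CASE WHEN` expressions rather than any function named in the list above.

```python
# Hypothetical view `sales` with columns: city, showroom, amount.
QUERIES = {
    "top_ten": (
        "SELECT showroom, amount FROM sales ORDER BY amount DESC LIMIT 10"
    ),
    "multi_field_filter": (
        "SELECT * FROM sales WHERE city = 'Pune' AND amount >= 300"
    ),
    "unique_list": "SELECT DISTINCT city FROM sales",
    "count_unique": (
        "SELECT COUNT(DISTINCT showroom) AS n_showrooms FROM sales"
    ),
    "multi_key_sort": "SELECT * FROM sales ORDER BY city ASC, amount DESC",
    "sub_totals": (
        "SELECT city, showroom, SUM(amount) AS total FROM sales "
        "GROUP BY city, showroom WITH ROLLUP"
    ),
    "conditional": (
        "SELECT SUM(CASE WHEN city = 'Pune' THEN amount ELSE 0 END) AS pune_total, "
        "COUNT(CASE WHEN amount >= 300 THEN 1 END) AS high_sales FROM sales"
    ),
}


def run_queries(spark, df):
    """Register `df` as the `sales` view and run every query.

    `spark` is assumed to be an existing SparkSession and `df` a Spark
    DataFrame with the columns listed above.
    """
    df.createOrReplaceTempView("sales")
    return {name: spark.sql(sql) for name, sql in QUERIES.items()}
```

Duplicates can be dropped with `SELECT DISTINCT` as above, or with `DataFrame.dropDuplicates()` when working through the DataFrame API.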
Q6: What are outliers and how are they handled in analysis?
- Outliers are extreme or inconsistent data points
- Can be caused by human errors, bugs, or hardware issues
- Affect mean, variance, and analysis accuracy
- A common statistical rule keeps only values within mean ± 2*std and removes the rest
- Python Example:
mean, std = df['sales'].mean(), df['sales'].std()
filtered = df[(df['sales'] > mean - 2*std) & (df['sales'] < mean + 2*std)]
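A self-contained version of the same rule, with a Spark counterpart, could look like the sketch below. The helper names, the `k` parameter, and the sample values are assumptions for illustration. Both helpers use the sample standard deviation.

```python
import pandas as pd


def remove_outliers(df: pd.DataFrame, column: str, k: float = 2.0) -> pd.DataFrame:
    """Keep rows whose `column` value lies strictly within mean ± k*std."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1)
    lower, upper = mean - k * std, mean + k * std
    return df[(df[column] > lower) & (df[column] < upper)]


def remove_outliers_spark(sdf, column: str, k: float = 2.0):
    """Same rule for a Spark DataFrame; pyspark is imported only when called."""
    from pyspark.sql import functions as F

    stats = sdf.select(
        F.mean(column).alias("mean"),
        F.stddev(column).alias("std"),  # sample standard deviation
    ).first()
    lower = stats["mean"] - k * stats["std"]
    upper = stats["mean"] + k * stats["std"]
    return sdf.filter((F.col(column) > lower) & (F.col(column) < upper))


# Hypothetical data: eight ordinary sales and one extreme value.
sales = pd.DataFrame({"sales": [100, 102, 98, 101, 99, 103, 97, 100, 500]})
cleaned = remove_outliers(sales, "sales")
```

Because an extreme value inflates both the mean and the standard deviation, this rule can miss outliers in very small samples, so it is worth checking how many rows were actually removed.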