Prep chatgpt
26 October 2024 12:56
Why do corrupted records occur?
• While reading data (like JSON, CSV, etc.), sometimes records don’t match the expected schema.
• Example: a malformed JSON string, missing columns, or incorrect data types.
How does Spark handle corrupted records?
Spark provides different modes and options to handle them:
1. Using the mode option (CSV, JSON)
• PERMISSIVE (default): Keeps corrupt record, fills missing values with null.
• DROPMALFORMED: Discards corrupted records.
• FAILFAST: Fails immediately if a corrupt record is found.
✅ Example:
df = spark.read \
.option("mode", "PERMISSIVE") \
.json("/path/data.json")
2. Using columnNameOfCorruptRecord
• For JSON/CSV, Spark can capture bad records in a special column.
• Default column name = _corrupt_record.
• You can rename it using spark.sql.columnNameOfCorruptRecord.
✅ Example:
spark.conf.set("spark.sql.columnNameOfCorruptRecord", "bad_data")
df = spark.read \
    .option("mode", "PERMISSIVE") \
    .json("/path/data.json")
df.cache()  # required: Spark disallows querying only the corrupt-record column from raw JSON/CSV
df.select("bad_data").where("bad_data IS NOT NULL").show()
This way, you can analyze corrupted rows separately.
3. Using badRecordsPath (Databricks)
• A Databricks-specific option: instead of keeping corrupted data in _corrupt_record, Spark writes bad records into a separate path (as JSON files).
✅ Example:
df = spark.read \
.option("badRecordsPath", "/path/to/badRecords") \
.json("/path/input/")
Corrupted rows will be saved for debugging, and valid records will load into the DataFrame.
4. Schema Evolution & Casting
• Sometimes records look corrupted because schema inference fails.
• Best practice: define explicit schema instead of relying on inference.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])
df = spark.read.schema(schema).json("/path/data.json")
Interview Tip Answer (short & professional)
“In Spark, corrupted records are handled using different modes: PERMISSIVE (default), DROPMALFORMED, and FAILFAST. For JSON and CSV, we can capture bad records in a special column (_corrupt_record) or, on Databricks, redirect them to a separate path using badRecordsPath. Best practice is to define an explicit schema to minimize schema mismatches.”
JSON file reading
Q1. How do you read and flatten a nested JSON file in Spark?
Answer:
When we read nested JSON in Spark, the schema often contains struct or array types. To flatten it, we use explode() for arrays and dot notation (col("field.subfield")) for structs.
✅ Example:
from pyspark.sql.functions import col, explode
df = spark.read.json("/path/nested.json")
# Flatten struct fields with dot notation
flat_df = df.select(
    col("id"),
    col("user.name").alias("user_name"),
    col("user.address.city").alias("city"),
    col("user.orders").alias("orders")   # keep the array so it can be exploded next
)
# Explode the orders array so each order becomes its own row
exploded_df = flat_df.withColumn("order", explode(col("orders")))
This gives a tabular structure from nested JSON, which is easier for downstream use.
Q2. JSON file has an array of structs inside it. How do you handle it in Spark?
Answer:
If JSON contains an array of structs, I use explode() to convert the array into multiple rows, and then select inner fields with dot notation.
✅ Example:
df = spark.read.json("/path/orders.json")
df2 = df.withColumn("order", explode("orders")) \
.select("id", "order.order_id", "order.amount")
This way, each order becomes a separate row.
Q3. What’s the difference between reading multiline JSON and line-delimited JSON with nested structures?
Answer:
• For line-delimited JSON, each line is a JSON object, so Spark reads it directly.
• For multiline JSON (where objects are spread across multiple lines), we must set .option("multiline", "true").
df = spark.read.option("multiline", "true").json("/path/nested.json")
Without multiline=true, Spark may throw errors or mark records as _corrupt_record.
Q4. How do you handle dynamic or deeply nested JSON where schema changes?
Answer:
For dynamic nested JSON, I prefer these approaches:
1. Use schema_of_json + from_json to derive and apply the schema dynamically (see the sketch after this list).
2. Use get_json_object or from_json with a schema for selective parsing.
3. If schema evolves frequently, I ingest JSON in raw format (Bronze layer), then apply transformations in Silver/Gold layers.
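✅ Example (a minimal sketch of approach 1, assuming the raw JSON arrives in a string column, here named payload, and that the records share one structure):
from pyspark.sql.functions import col, from_json, lit, schema_of_json
# Read raw lines; "payload" is an illustrative column name
raw_df = spark.read.text("/path/raw_json/").withColumnRenamed("value", "payload")
# Derive a DDL schema string from one sample record
sample = raw_df.select("payload").first()[0]
ddl_schema = raw_df.select(schema_of_json(lit(sample))).first()[0]
# Parse every record with the derived schema and expand the struct into columns
parsed_df = raw_df.withColumn("data", from_json(col("payload"), ddl_schema)).select("data.*")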
Q5. Suppose your JSON has multiple nested arrays (e.g., user -> orders -> items). How would you flatten it?
Answer:
I apply multiple levels of explode() until all arrays are flattened.
✅ Example:
df = spark.read.json("/path/nested.json")
df2 = df.withColumn("order", explode("user.orders")) \
.withColumn("item", explode("order.items")) \
.select("user.id", "order.order_id", "item.item_id", "item.price")
This gives a fully flattened dataset with user, order, and item details.
✅ Interview Tip:
When answering nested JSON questions, always mention:
• Use of explode for arrays.
• Dot notation for struct fields.
• multiline=true for multiline JSON.
• Best practice: Define schema explicitly instead of relying on inference.
Q1. What is Catalyst Optimizer / Spark SQL Engine?
The Catalyst Optimizer is the query optimization engine in Spark SQL. It transforms logical plans into optimized logical plans using rule-based and cost-based optimization. Finally, it generates a physical plan, which Spark executes as RDD operations.
Q2. Why do we get AnalysisException error?
An AnalysisException occurs when Spark fails to resolve references during query analysis — common reasons include missing columns, invalid table names, schema mismatches, or
accessing unregistered temporary views. It happens in the analysis phase before execution.
Q3. What is Catalog?
A Catalog is Spark’s metadata store that keeps information about databases, tables, views, and functions. It is used by the analyzer to resolve references in queries. In Databricks with
Unity Catalog, it also manages permissions and data governance.
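✅ Example (a small sketch of inspecting the Catalog through the PySpark catalog API; the database name is illustrative):
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
print(spark.catalog.currentDatabase())      # database the analyzer resolves unqualified names against
print(spark.catalog.listDatabases())        # databases registered in the catalog
print(spark.catalog.listTables("demo_db"))  # tables and views in a given database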
Q4. What is Physical Planning / Spark Plan?
Physical Planning is the process where Spark converts an optimized logical plan into one or more physical plans (execution strategies). Spark chooses the best plan using cost-based
optimization, and the final selected plan is called the SparkPlan, which executes on the cluster.
Q5. Is Spark SQL Engine a Compiler?
Yes, Spark SQL Engine works like a compiler. It parses SQL queries, generates logical and physical plans, applies optimizations, and finally compiles them into Java bytecode that runs on the JVM over Spark’s execution engine.
Q6. How many phases are involved in Spark SQL engine to convert code into Java bytecode?
There are four main phases:
1. Parsing → SQL/DSL is converted into an unresolved logical plan.
2. Analysis → References are resolved using the Catalog (resolved logical plan).
3. Optimization → Catalyst Optimizer applies rules to create an optimized logical plan.
4. Physical Planning → Generates the best execution plan and compiles it into Java bytecode for execution.
✅ Interview Tip:
Answer in a structured way — Logical Plan → Optimized Plan → Physical Plan → Execution. It shows deep understanding and practical experience.
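✅ Example (a quick sketch of viewing these phases on any DataFrame; explain(mode=...) requires Spark 3.0+):
df = spark.range(10).selectExpr("id", "id * 2 AS doubled").filter("doubled > 5")
# "extended" prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.explain(mode="extended")
# "formatted" prints a cleaner physical plan with per-operator details
df.explain(mode="formatted")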
1. What is Parquet file format?
Parquet is a columnar storage format designed for big data processing. Unlike row-based formats (CSV, JSON), it stores data column-wise, which improves
compression and speeds up analytical queries by reading only the required columns.
2. Why do we need Parquet?
We need Parquet because it:
• Reduces storage cost with efficient compression and encoding.
• Improves performance with column pruning and predicate pushdown.
• Is highly compatible with big data tools like Spark, Hive, ADF, and Databricks.
3. How to read Parquet file?
• In Spark (PySpark): df = spark.read.parquet("path")
• In Pandas: pd.read_parquet("path")
• In SQL engines: Register Parquet tables and query directly.
4. What makes Parquet default choice?
• Columnar storage → faster analytical queries.
• Efficient compression & encoding → smaller storage footprint.
• Schema evolution support → handles changing data structures.
• Compatibility with Hadoop ecosystem & cloud platforms → widely adopted standard.
5. What encoding is done on data?
Parquet uses multiple encodings to save space and improve performance, such as:
• Dictionary Encoding – replaces repeated values with integer keys.
• Run-Length Encoding (RLE) – compresses repeated sequences.
• Delta Encoding – stores differences instead of raw values.
6. What compression techniques are used?
Parquet supports compression codecs like:
• Snappy (default) → fast, balanced.
• Gzip → higher compression, slower.
• LZO, Brotli, ZSTD → trade-offs between speed and compression.
7. How to optimize the Parquet file?
• Use appropriate partitioning and bucketing in data lake.
• Apply ZORDER BY or clustering (Delta Lake/Databricks) for better data skipping.
• Use column pruning by selecting only required columns.
• Use an efficient compression codec depending on the workload (see the write sketch below).
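✅ Example (a minimal sketch of partitioning plus codec choice when writing Parquet from some DataFrame df; the path and column names are illustrative):
(df.write
   .mode("overwrite")
   .partitionBy("event_date")          # enables partition pruning on a common filter column
   .option("compression", "zstd")      # pick a codec to match the workload
   .parquet("/path/output/events"))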
8. What is row group, column, and page?
• Row Group: Logical chunk of rows; all column data for these rows is stored together.
• Column Chunk: Data of a single column within a row group.
• Page: Smallest storage unit (usually 8KB–1MB), storing actual encoded values.
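✅ Example (a sketch of inspecting row groups and column chunks with PyArrow rather than Spark; the file path is illustrative):
import pyarrow.parquet as pq
meta = pq.ParquetFile("/path/output/part-00000.parquet").metadata
print(meta.num_row_groups, meta.num_rows, meta.num_columns)
rg = meta.row_group(0)       # first row group
col_chunk = rg.column(0)     # first column chunk in that row group
print(col_chunk.path_in_schema, col_chunk.compression, col_chunk.total_compressed_size)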
9. How do projection pruning and predicate pushdown work?
• Projection pruning → Reads only the selected columns instead of full dataset. Example: SELECT name FROM table will not read other columns.
• Predicate pushdown → Filters data at storage level before reading into memory. Example: WHERE age > 30 will skip irrelevant row groups.
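✅ Example (a sketch with an illustrative path and columns; both optimizations show up in the FileScan node of the plan):
df = (spark.read.parquet("/path/people")
      .filter("age > 30")     # predicate pushdown: row groups failing the filter can be skipped
      .select("name"))        # projection pruning: only the needed column chunks are read
df.explain()                  # look for ReadSchema and PushedFilters in the scan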
Join Strategies in Spark
1. Shuffle Sort-Merge Join (SMJ)
• Spark shuffles both datasets on join keys, sorts them, and then merges.
• Best for: Large datasets when both sides are big and sorted joins are efficient.
• Downside: Expensive shuffle & sort → can be slow.
2. Shuffle Hash Join
• Spark shuffles both datasets on join keys, then builds a hash table on one side and probes with the other.
• Best for: Medium-sized tables when one side can fit in memory after shuffle.
• Faster than SMJ if data distribution is good.
3. Broadcast Hash Join (BHJ)
• Spark broadcasts the smaller dataset to all executors, builds a hash table in memory, and probes with the larger dataset.
• Best for: Joining a very small dataset with a large dataset.
• Very fast (avoids shuffle), but the small dataset must fit in memory.
4. Cartesian Join
• Produces cross join (every row from left × every row from right).
• Best for: Rare cases like generating combinations.
• Downside: Extremely expensive for large datasets (row explosion).
5. Broadcast Nested Loop Join (BNLJ)
• Spark broadcasts the smaller dataset and does a nested loop scan with the larger dataset.
• Best for: When there is no join condition (e.g., cross join with filters).
• Downside: Very costly if data is large.
✅ Summary for Interviews:
• Broadcast Hash Join → fastest when one dataset is small.
• Shuffle Sort-Merge Join → default for large joins.
• Shuffle Hash Join → good alternative when memory is available.
• Cartesian / Nested Loop Joins → only for special cases, usually avoided.
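✅ Example (a sketch using hypothetical fact_df/dim_df DataFrames; the hint names come from Spark 3.0+ join strategy hints):
from pyspark.sql.functions import broadcast
# Force a Broadcast Hash Join when the dimension table is known to be small
fact_df.join(broadcast(dim_df), "customer_id").explain()
# Strategy hints let you prefer a specific join without resizing data
fact_df.join(dim_df.hint("merge"), "customer_id")          # prefer Shuffle Sort-Merge Join
fact_df.join(dim_df.hint("shuffle_hash"), "customer_id")   # prefer Shuffle Hash Join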
What is OOM in Spark?
OOM stands for Out Of Memory. It happens when Spark’s driver or executor runs out of allocated memory and cannot store more objects
(RDD/DataFrame, shuffle data, broadcast variables, etc.). This usually causes the job to fail with an OutOfMemoryError.
Why do we get driver OOM?
Driver OOM happens when too much data is collected or stored in the driver.
Examples:
• Using .collect() on huge datasets.
• Large result sets being sent back to the driver.
• Storing large broadcast variables or metadata in the driver’s memory.
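✅ Example (a sketch with a hypothetical big_df showing the anti-pattern and safer alternatives):
# Anti-pattern: pulls every row into the driver and can OOM it
all_rows = big_df.collect()
# Safer: keep the data distributed, or bound what reaches the driver
big_df.write.mode("overwrite").parquet("/path/output/")     # write out instead of collecting
preview = big_df.limit(20).collect()                        # bounded preview only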
What is driver overhead memory?
Driver (and executor) overhead memory is extra memory reserved outside the JVM heap to handle things like:
• Garbage collection (GC)
• Spark internal metadata
• Off-heap allocations and user data structures
If this overhead memory is insufficient, Spark may still throw OOM even if heap looks free.
Common reasons to get a driver OOM
• Running collect(), take(), show() on very large DataFrames.
• Improper caching of large datasets in the driver.
• Storing huge logs, metrics, or accumulators in memory.
• Broadcasting very large variables to executors.
How to handle OOM
• Avoid collect() — use actions like take(), limit() or write to storage.
• Increase driver/executor memory using configs (--driver-memory, --executor-memory).
• Use persist() / cache() wisely (only when reused).
• Use broadcast joins only when small enough to fit memory.
• Optimize partitions to balance shuffle and avoid skew.
• Tune spark.memory.fraction and spark.executor.memoryOverhead for workloads.
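✅ Example (a sketch of the settings above; values are placeholders to tune, and driver memory itself normally has to be set via spark-submit --driver-memory before the driver JVM starts):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("oom-tuning-sketch")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.memoryOverhead", "2g")   # off-heap / overhead headroom
         .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
         .getOrCreate())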