1. What is your Job Role?
2. Where are you storing your data after processing? What is the sink
of your pipeline?
3. What volume of data are you dealing with in your pipeline?
4. What types of triggers do you apply?
5. What is a tumbling window trigger?
6. What is incremental load?
Incremental loading in PySpark involves updating a target dataset
with only the new or changed data since the last update. This
approach minimizes processing time and resource consumption. It
typically involves identifying changes, extracting new data,
optionally transforming it, and then appending or merging it into the
target dataset. This process is commonly used in data warehousing
and ETL pipelines for efficient data updates with the help of
timestamps.
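A minimal sketch of a timestamp-based incremental load, assuming a Parquet source with an updated_at column; the paths, column name, and watermark value are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Hypothetical paths and watermark; replace with your own storage and metadata
source_path = "/data/source/orders"
target_path = "/data/target/orders"
last_load_ts = "2024-05-01 00:00:00"  # watermark recorded after the previous run

# 1. Identify and extract only the rows changed since the last load
new_rows = (
    spark.read.parquet(source_path)
    .filter(F.col("updated_at") > F.lit(last_load_ts))
)

# 2. Append the delta to the target (a MERGE, e.g. with Delta Lake, would handle upserts)
new_rows.write.mode("append").parquet(target_path)
```

In practice, the new watermark (the maximum updated_at of the loaded batch) would be persisted for the next run.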
7. What is the difference between a data lake and a data warehouse?
In a data lake, you can store structured, semi-structured, and
unstructured data, offering flexibility for various data types and
formats. On the other hand, data warehouses primarily store
structured data, particularly in organized schemas like facts and
dimensions, optimizing retrieval for analytics and reporting.
8. What is columnar storage, and why is data stored in a
columnar fashion inside a data warehouse?
Columnar storage is advantageous because it stores data by
columns rather than rows, facilitating faster query execution,
especially when accessing a subset of columns. This benefits data
warehouses and analytical workloads by optimizing performance
and enabling efficient compression. However, it's less beneficial for
transactional data where row-oriented storage is typically more
suitable due to frequent updates and inserts.
9. How is the data actually laid out in columnar fashion?
Row-oriented layout:
Row 1: [1, Alice, 25, Female]
Row 2: [2, Bob, 30, Male]
Row 3: [3, Charlie, 35, Male]
Columnar layout:
ID: [1, 2, 3]
Name: [Alice, Bob, Charlie]
Age: [25, 30, 35]
Gender: [Female, Male, Male]
10. While optimizing your PySpark jobs in the data warehouse, what
kinds of challenges and difficulties did you face, including while
creating such pipelines, and how did you resolve them?
1. Performance Tuning: PySpark jobs may face performance issues
due to inefficient code, data skew, resource contention, or
inadequate cluster configuration. Identify performance bottlenecks
using tools like Spark UI, Spark History Server, and monitoring
metrics. Analyze job stages, task durations, and resource usage to
pinpoint areas for improvement.
2. Data Skew: Uneven data distribution across partitions can lead to
performance bottlenecks and uneven resource utilization. Use
appropriate partitioning and clustering strategies to distribute data
evenly across partitions and optimize data locality. Consider
partition pruning techniques to minimize data movement during
query execution.
3. Resource Management: Proper allocation and management of
cluster resources, including memory, CPU cores, and executors, is
crucial for efficient job execution. Configure Spark cluster resources
based on workload requirements, including memory allocation,
executor cores, and parallelism settings (see the configuration
sketch after this list). Monitor resource utilization and adjust
configuration parameters accordingly.
4. Caching and Persistence: Cache intermediate datasets or persist
them to disk when they are reused across multiple stages, to avoid
recomputation and improve performance.
Implement incremental processing techniques to process only new or
changed data, reducing processing overhead and improving job
efficiency.
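As a rough illustration of the resource-management point above, this is one way to set executor memory, cores, and shuffle parallelism when building the session; the values are assumptions and should be tuned to the workload and cluster.

```python
from pyspark.sql import SparkSession

# Example settings only; actual values depend on the workload and cluster size
spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.memory", "4g")          # memory per executor
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.sql.shuffle.partitions", "200")  # parallelism for shuffles
    .getOrCreate()
)
```

The same settings can also be passed on the spark-submit command line instead of in code.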
11. What is parameterization of a pipeline?
Parameterization of a pipeline involves making components,
configurations, and behaviors configurable through parameters.
These parameters can define aspects such as input data sources,
model hyperparameters, pipeline settings, and execution
environments. Parameterization offers flexibility, reusability,
customization, maintainability, and scalability benefits to pipelines.
By adjusting parameter values, users can adapt pipelines to
different datasets, use cases, and environments, facilitating efficient
and adaptable data processing workflows.
12. What is the data duplication issue in a
PySpark pipeline, and what is skewness?
In PySpark pipelines, duplication of data refers to redundant records
within a dataset, while data skewness denotes an uneven
distribution of data across partitions or keys. Duplication impacts
storage and analysis efficiency, while skewness affects processing
performance. Addressing duplication involves removing redundant
records, while skewness requires balancing data distribution. Both
are critical for optimizing PySpark pipeline performance and
efficiency.
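A small sketch of removing duplicate records with dropDuplicates(); the sample data and column names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice"), (1, "Alice"), (2, "Bob")], ["id", "name"]
)

# Remove exact duplicate rows, or deduplicate on a subset of key columns
deduped_all = df.dropDuplicates()
deduped_by_id = df.dropDuplicates(["id"])
deduped_by_id.show()
```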
13. What is load on partition?
The "load on partition" in distributed computing systems like Apache
Spark refers to the distribution of data across partitions for parallel
processing. Even distribution ensures optimal resource utilization,
while skewness can lead to uneven processing and performance
issues. Strategies like data partitioning and monitoring tools help
optimize workload balance for efficient data processing.
14. What is faulting in RDDs?
Faulting in RDDs refers to handling errors or failures during RDD
operations in Apache Spark. Spark employs fault tolerance
mechanisms like lineage tracking and recomputation to recover
from failures automatically. Checkpointing persists intermediate
RDDs to disk for faster recovery. Monitoring, debugging, and retry
mechanisms are crucial for effectively managing faults and ensuring
robustness in Spark RDD-based pipelines.
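A brief sketch of checkpointing an RDD to truncate its lineage, the mechanism mentioned above for faster recovery; the checkpoint directory path is an assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()
sc = spark.sparkContext

# Directory for checkpoint files (hypothetical path; on a cluster use reliable storage like HDFS)
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)

# Persist the RDD to the checkpoint directory and cut its lineage,
# so recovery after a failure does not require full recomputation
rdd.checkpoint()
rdd.count()  # an action materializes the checkpoint
```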
15. What are narrow and wide transformations?
Narrow transformations in Apache Spark involve no shuffling, while
wide transformations require data shuffling across partitions. Narrow
transformations, like map and filter, are efficient and parallelizable,
operating independently on each partition. Wide transformations,
such as groupByKey and join, may lead to performance overhead
due to data movement.
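A small illustrative example, with made-up data, contrasting narrow transformations (map, filter) with a wide one (groupByKey).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)

# Narrow: each output partition depends on exactly one input partition, no shuffle
mapped = rdd.map(lambda kv: (kv[0], kv[1] * 10))
filtered = mapped.filter(lambda kv: kv[1] > 10)

# Wide: all values for the same key must be shuffled to one partition
grouped = filtered.groupByKey().mapValues(list)
print(grouped.collect())
```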
16. What is broadcast join?
17. What is pivoting of data?
Pivoting of data involves restructuring a dataset by transforming
rows into columns or vice versa, typically aggregating data values
based on certain criteria. It facilitates analysis, reporting, and
visualization by providing a more structured representation of the
data.
Before pivoting:
| Date | Product | Quantity Sold |
|-----------|---------|---------------|
| 2024-05-01| Apple | 10 |
| 2024-05-01| Banana | 15 |
| 2024-05-02| Apple | 12 |
| 2024-05-02| Banana | 18 |
After pivoting:
| Date | Apple | Banana |
|-----------|-------|--------|
| 2024-05-01| 10 | 15 |
| 2024-05-02| 12 | 18 |
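The same reshaping can be done in PySpark with groupBy().pivot(); the sample rows below mirror the table above, and the column names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-example").getOrCreate()

sales = spark.createDataFrame(
    [("2024-05-01", "Apple", 10), ("2024-05-01", "Banana", 15),
     ("2024-05-02", "Apple", 12), ("2024-05-02", "Banana", 18)],
    ["date", "product", "quantity_sold"],
)

# One row per date, one column per product, aggregated with sum
pivoted = sales.groupBy("date").pivot("product").sum("quantity_sold")
pivoted.show()
```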
18. [Link]
19. Given the string div-chn_52066, retrieve the location from it,
i.e. chn.
You can use substring() with instr('_') + 1 and instr('-') ... (see the sketch below).
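One possible way to do this in PySpark, assuming the location always sits between '-' and '_'; the column name is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("extract-location").getOrCreate()
df = spark.createDataFrame([("div-chn_52066",)], ["code"])

# substring(str, start, length): start just after '-', length runs up to the '_'
df = df.withColumn(
    "location",
    F.expr("substring(code, instr(code, '-') + 1, instr(code, '_') - instr(code, '-') - 1)"),
)
df.show()  # location -> chn
```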
20. What are Databricks and Data Factory?
21. What is the format of the data you are working with?
22. What optimizations were you doing on your data, and
where were you storing it?
We optimized our data by implementing efficient data partitioning,
columnar storage formats like Parquet, and utilizing caching and
persistence mechanisms. We stored the optimized data in
distributed storage systems such as HDFS (Hadoop Distributed File
System) or cloud storage solutions like Amazon S3 or Google Cloud
Storage for scalability and reliability.
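A rough sketch of one such optimization: writing curated data as Parquet partitioned on a frequently filtered column so that partition pruning can kick in. The paths, bucket, and partition column are all hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimized-write").getOrCreate()

# Hypothetical staging location
df = spark.read.parquet("/data/staging/events")

# Parquet + partitioning on a commonly filtered column enables partition pruning at read time
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .parquet("s3a://my-bucket/curated/events"))
```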
23. What is the PySpark architecture?
PySpark architecture revolves around a driver program that
orchestrates Spark jobs. This program, written in Python, interacts
with SparkSession to define tasks and submit them to the cluster
manager for execution. The cluster manager oversees resource
allocation and task scheduling across the executor nodes, which
execute tasks in parallel. Each executor runs in its JVM and hosts
Python workers for executing Python code. PySpark leverages RDDs
as the core data abstraction for distributed processing, offering
transformations like map and reduce. Additionally, PySpark provides
DataFrame and Dataset APIs for higher-level abstractions, enabling
structured data manipulation and analysis. It also offers libraries like
Spark SQL, MLlib, and GraphX for specific tasks. Data can be read
from and written to external sources such as HDFS and Amazon S3,
ensuring seamless integration with various data storage systems.
This architecture ensures efficient distributed data processing within
the Python ecosystem.
24. What is master-slave architecture?
Master-slave architecture is a distributed computing model where a
central master node controls and coordinates multiple slave nodes.
The master node is responsible for task allocation, scheduling, and
overall management of the system, while the slave nodes execute
tasks and report back to the master. This architecture enables
parallel processing and scalability, with the master node distributing
workloads among the slave nodes for efficient data processing or
computation.
25. What is the execution plan of RDDs, and what is the
mechanism called where the driver node waits for other tasks to
finish, works in the background, and does not consume resources
until you call it?
The execution plan of RDDs in Apache Spark defines the sequence
of transformations and actions that will be applied to the data. It
outlines the steps required to achieve the desired computation,
including the order of transformations and the dependencies
between them.
The mechanism where the driver node waits for other tasks to finish
and works in the background without consuming resources until
needed is known as lazy evaluation. In Spark, transformations on
RDDs are lazily evaluated, meaning they are not executed
immediately upon invocation. Instead, Spark waits until an action is
called to trigger the evaluation of transformations, optimizing
resource usage and computation efficiency.
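A tiny demonstration of lazy evaluation: the transformations below only build the plan, and nothing runs until the count() action.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

df = spark.range(1_000_000)

# Transformations: only the logical plan is built here, no computation happens yet
doubled = df.withColumn("doubled", F.col("id") * 2)
filtered = doubled.filter(F.col("doubled") % 4 == 0)

# Action: triggers execution of the whole plan
print(filtered.count())
```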
26. What is the difference between RDD, DataFrame, and
Dataset, and which should you choose, especially in PySpark?
RDDs (Resilient Distributed Datasets) are low-level, immutable
collections of data objects.
DataFrames are distributed collections of data organized into named
columns, providing a higher-level abstraction for structured data.
Datasets are a hybrid abstraction combining the benefits of RDDs
and DataFrames, offering type safety and optimization (the Dataset
API is available in Scala and Java, not in Python).
Choose RDDs for unstructured data or fine-grained control (for
example, custom logic such as an average that excludes outliers),
DataFrames for structured data processing, and Datasets when you
need type safety along with optimization, such as complex joins
with other typed datasets; in PySpark you work with DataFrames.
27. Can a DataFrame in PySpark be converted into a pandas DataFrame
and vice versa?
Yes, a DataFrame in PySpark can be converted into a pandas
DataFrame and vice versa: pandas_df = spark_df.toPandas()
and spark_df = spark.createDataFrame(pandas_df).
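A self-contained sketch of both conversions; the sample data is made up, and toPandas() collects everything to the driver, so it should only be used on data that fits in driver memory.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-conversion").getOrCreate()

spark_df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Spark DataFrame -> pandas DataFrame (collects to the driver)
pandas_df = spark_df.toPandas()

# pandas DataFrame -> Spark DataFrame
back_to_spark = spark.createDataFrame(pandas_df)
back_to_spark.show()
```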
28. Where do you check your logs if your job fails in PySpark?
In PySpark, you can check logs for failed jobs
in the driver program's standard output/error, the Spark Web UI, the
Spark History Server, and the cluster manager logs. These logs
provide insights into job execution, errors, and performance issues.
29. What are client mode and cluster mode?
In client mode the driver runs on the machine that submits the
application (for example, an edge node), while in cluster mode the
driver runs on a node inside the cluster. In summary, client mode
offers direct access and resource flexibility, while cluster mode
provides fault tolerance and resource efficiency in PySpark
application deployment. The choice depends on factors such as
deployment environment, fault tolerance requirements, and
resource utilization considerations.
30. Can we convert CSV to Parquet format in PySpark and vice
versa?
Yes, you can convert CSV files to Parquet format and vice versa in
PySpark. Use spark.read.csv() to read CSV files and df.write.parquet()
to write Parquet. Conversely, use spark.read.parquet() to read
Parquet files and df.write.csv() to write CSV (see the sketch below).
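A minimal round-trip sketch; the file paths are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-parquet").getOrCreate()

# CSV -> Parquet (hypothetical paths)
csv_df = spark.read.csv("/data/input/sales.csv", header=True, inferSchema=True)
csv_df.write.mode("overwrite").parquet("/data/output/sales_parquet")

# Parquet -> CSV
parquet_df = spark.read.parquet("/data/output/sales_parquet")
parquet_df.write.mode("overwrite").option("header", True).csv("/data/output/sales_csv")
```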
31. Can we fetch the schema as well in PySpark, and vice versa?
32. Complete pipeline for converting data in place to the cloud
33. In time out time sql question
34. What challenges have you faced while working on your
projects?
Query optimization for Power BI
Data type mismatch
API is changed
Rearranged columns => table is already present
Select all the columns
35. What are your contributions?
36. What are gold layers in Azure Data Factory?
37. What are delta tables?
38. What file formats have you used?
CSV, SQL tables, JSON data
Semi-structured => key-value pairs
39. What is Parquet?
40. What are wide and narrow transformations?
Wide means shuffling; narrow means no shuffling.
41. What is lazy evaluation?
Transformations are only executed when an action is hit.
42. What are DAGs in PySpark?
A directed acyclic graph that stores the sequence of operations to be executed.
43. What is fault tolerance?
Fault tolerance means that in case of any failure the RDD is not lost,
because it can be recomputed from its lineage.
44. What is client mode and cluster mode?
Client mode means the driver runs on your (client) machine; cluster
mode means the driver runs on a node inside the cluster.
45. What is real-time data ingestion?
It means data is ingested into your pipeline in real time, as it arrives.
46. Why Spark and not MapReduce?
Because MapReduce writes intermediate results to disk, while Spark
keeps them in memory.
47. What is a broadcast join?
The small table is broadcast to all the executors, so no shuffling is needed.
48. What is reshuffling of data?
Reshuffling means redistributing your data across partitions to other executors.
49. What is memory management in PySpark?
Memory management is about managing the cluster's executor
memory, which Spark splits into storage memory and execution memory.
50. What is a full scan and its disadvantages?
A full scan reads the entire dataset instead of only the needed
partitions; it keeps the CPU and I/O busy and slows the job down,
and the scanned data often then feeds expensive wide transformations.
51. What is cache enabled?
It means intermediate results are stored in the cache.
52. What is out of memory in production?
Out of memory (OOM) means either an executor or the driver runs
out of memory. For example, if we use collect() instead of show()
and retrieve more data than the driver can hold, we get a driver
OOM; if a transformation or a join fills up an executor's memory,
we get an executor OOM error.
53. How do you test your PySpark code?
54. What are performance tuning techniques to tune your PySpark
job?
Use salting, broadcast joins, caching/persist, and enough executor
cores and memory.
55. What is bucketing and salting?
Salting is a technique used to evenly distribute skewed data across
partitions by adding a random prefix or suffix (known as a salt) to
the key values.
For example, instead of directly using customer IDs, you can
prepend a random number to each customer ID before hashing or
partitioning. This helps distribute the data more evenly across
partitions.
Bucketing is a partitioning technique where data is grouped into
predefined buckets based on specific criteria, such as a range of
values.
For example, you can bucket customer IDs into a fixed number of
buckets based on their hash value or a modulo operation on the
customer ID. This ensures that each bucket contains a roughly equal
number of records, reducing data skew.
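A rough salting sketch for a skewed aggregation, assuming a transactions DataFrame keyed by customer_id; the data, column names, and salt count are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-example").getOrCreate()

tx = spark.createDataFrame(
    [(1, 100.0), (1, 80.0), (1, 55.0), (2, 20.0)], ["customer_id", "amount"]
)

num_salts = 8  # number of salt buckets; tune to the degree of skew

# Add a random salt so a single hot key is spread across several partitions
salted = tx.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Aggregate per (key, salt) first, then combine the partial results per key
partial = salted.groupBy("customer_id", "salt").agg(F.sum("amount").alias("partial_sum"))
final = partial.groupBy("customer_id").agg(F.sum("partial_sum").alias("total_amount"))
final.show()
```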
56. In which case should you go for repartition and in which case
coalesce?
We normally use repartition when we have to increase the number of
partitions and coalesce when we have to decrease them, but this is
not a strict rule: coalesce can only decrease the number of
partitions, while repartition can do both. The key difference is that
repartition performs a full shuffle and redistributes the data evenly
into new partitions, whereas coalesce simply merges existing
partitions without a full shuffle. Because coalesce avoids shuffling,
it can leave partitions unevenly sized, so for small datasets
repartition can sometimes give better results even when decreasing
the number of partitions.
57. Explode and group by for data skew?
Identify the heavily skewed customer IDs using profiling or analysis.
Use explode() to flatten transactions associated with heavily
skewed customer IDs into separate rows.
Apply groupBy() to process transactions from heavily skewed
customer IDs separately, redistributing the workload more evenly
across partitions.
58. How do you handle null values?
59. What is NaN?
60. How do you handle duplicate values?
61. What are the different types of joins in Spark?
Shuffle sort-merge join, shuffle hash join, and broadcast join. Both
shuffle sort-merge join and shuffle hash join are expensive because
shuffling takes place in both; shuffle hash join avoids the sort step
by building a hash table instead of merging, but both are wide
transformations and carry those disadvantages. The default is
shuffle sort-merge join.
62. Use cases: which join to use when?
Normally, when one table is small we use a broadcast join, where
the driver broadcasts the small table to all the executors so that
shuffling does not take place (see the sketch below).
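A small sketch of an explicit broadcast join; the tables and columns are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "country_code"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")], ["country_code", "country_name"]
)

# Broadcasting the small dimension table avoids shuffling the large fact table
joined = orders.join(F.broadcast(countries), on="country_code", how="left")
joined.show()
```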
63. How do you deploy your PySpark job?
64. Difference between shuffle sort-merge join and shuffle hash join
Shuffle sort-merge join is CPU-bound (it sorts both sides), while
shuffle hash join is memory-bound (it builds a hash table).
65. Version control in PySpark
66. What is shuffling and why should we minimize it?
Shuffling means our data is moved between partitions across
multiple executors; it has to take place whenever we perform a wide
transformation. To minimize it, we should use techniques like
broadcast joins, salting, bucketing, and the explode-and-group-by
pattern.
67. If your Spark job is running slowly, how do you fix it?
We have to check whether it is because of the CPU cores, the
memory, or some executor issue.
68. What is data spilling?
Data spilling means writing data to disk when it does not fit in
memory, for example during large shuffles or sorts.
69. What are transformations and actions?
Transformations are operations that transform the data, while
actions sit on top of the transformations and gather, save, or collect
the data. For example, groupBy, map, and filter are transformations,
while collect and save are actions.
70. How do you drop duplicates?
71. How do you drop duplicate columns?
72. What is masking?
73. How do you initialize a PySpark session?
You create a SparkSession with SparkSession.builder ... getOrCreate();
spark-submit is then used to launch the application (see the sketch
after question 76).
74. How do you read a Parquet file?
75. Filter on a DataFrame
76. What is withColumn in PySpark?
withColumn returns a new DataFrame by adding a column or
replacing an existing column that has the same name.
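A minimal sketch tying together questions 73-76; the file path and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# 73. Initialize a SparkSession
spark = SparkSession.builder.appName("interview-demo").getOrCreate()

# 74. Read a Parquet file (hypothetical path)
df = spark.read.parquet("/data/input/employees.parquet")

# 75. Filter the DataFrame
adults = df.filter(F.col("age") >= 18)

# 76. withColumn adds a new column or replaces an existing one with the same name
with_bonus = adults.withColumn("bonus", F.col("salary") * 0.10)
with_bonus.show()
```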
77. Are joins wide or narrow transformations?
They are wide transformations.
78. What is a CI/CD pipeline?
79. What are best practices for SQL queries?
Select only the necessary columns
Partition the source tables on relevant columns
Filter early
Normalization
80. What is IN vs CTE, where to use each, and what are the disadvantages?
81. What is the difference between a transformation and an action in
PySpark?
Transformations are lazily evaluated operations that transform the
data: a transformation is a function that returns another RDD. An
action in Spark is any operation that does not return an RDD;
evaluation is executed only when an action is taken. Actions trigger
the scheduler, which builds a directed acyclic graph (DAG) as the
plan of execution.
82. What are the best practices for PySpark?
Broadcast joins
Correct partitioning
Caching and persist
83. PySpark 3.2 advantages
84. What are caching and persistence, and what is the difference
between cache and persist?
Caching and persist both store intermediate RDDs/DataFrames.
cache() is a wrapper around persist() that calls it with the default
storage level (memory and disk, deserialized, for DataFrames). If we
want a different level, say memory only or disk only, or serialized
storage, we have to pass a StorageLevel parameter to persist() (see
the sketch below).
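A short sketch contrasting cache() with persist() and an explicitly chosen StorageLevel.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

df = spark.range(1_000_000)
df.cache()                           # default storage level (MEMORY_AND_DISK for DataFrames)

df2 = spark.range(1_000_000)
df2.persist(StorageLevel.DISK_ONLY)  # explicitly chosen storage level

df.count()    # actions materialize the cached/persisted data
df2.count()

df.unpersist()
df2.unpersist()
```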
85. What is partitioning (low cardinality)?
86. How do you increase or decrease the number of partitions in PySpark?
Both repartition and coalesce work.
87. Which of these, repartition or coalesce, will help in avoiding
skewness?
Repartition can help with data skewness, but for key-based skew
we use salting.
88. What are UNNEST and STRING_AGG?
89. Data quality ingestion
90. How do you ensure in your schema that id, name, and marks are
int, string, and int?
Use inferSchema, or define an explicit schema (see the sketch below).
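inferSchema only guesses the types, so a sketch with an explicit schema is shown below; the file path is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# Declaring the schema guarantees id and marks are int and name is string
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("marks", IntegerType(), True),
])

df = spark.read.csv("/data/input/students.csv", header=True, schema=schema)
df.printSchema()
```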
91. What is the difference between Excel and CSV format?
92. Why not shuffle?
Because it is a wide transformation and it consumes CPU cores (as
well as network and disk I/O).
93. What is storage memory in Spark?
Storage memory is part of the executor's memory. Spark's memory
is divided into two parts, storage memory and execution memory:
execution memory is used while transforming the data (shuffles,
joins, aggregations), while cached/persisted data and intermediate
results that we explicitly keep are stored in storage memory.
94. What is a lineage graph in Spark?
The lineage graph records how an RDD was built, i.e. which
transformations were applied to produce it.
95. Working of Azure Data Factory
96. What are ADLS and Azure Synapse Analytics, and which one to
choose in which scenario?
97. Control flow vs data flow in Azure Data Factory
98. Notification services
1. How to print data lineage?
[Link] will print the data lineage top to bottom
2. How are jobs created in Spark?
A job is created when an action is hit.
3. What is the Catalyst optimizer?
It converts user code/transformations and, using the Spark engine,
gives us the best (most optimized) RDDs.
4. Why do we get an AnalysisException error?
Because, for example, a referenced column does not exist.
5. What is the catalog?
Metadata about our data, used to check, for example, whether the
referenced tables/files and columns exist or not.
6. What is physical planning / the Spark plan?
From the logical plan we get different candidate physical plans, and
the best one is chosen based on cost optimization.
7. Is the Spark SQL engine a compiler?
It acts as a compiler because it converts our code into JVM bytecode.
8. How many phases are involved in the Spark SQL engine to convert
code into Java bytecode?
There are four phases in the Spark SQL engine (see the explain()
sketch after this list):
Analysis phase => checks that the code uses the right column
names, not wrong ones
Logical planning => optimizes the logical plan; e.g. if we retrieve 2
columns from a subquery that first retrieves all columns, it will only
select those 2
Physical planning => the best plan is chosen based on cost
optimization of the logical plan
Code generation => the final RDDs / bytecode are generated
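To see these phases in practice, explain(True) prints the parsed, analyzed, and optimized logical plans along with the physical plan; the tiny DataFrame below is just for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-example").getOrCreate()

df = spark.range(100).withColumn("even", F.col("id") % 2 == 0)
filtered = df.filter(F.col("even"))

# Prints parsed/analyzed/optimized logical plans and the physical plan
filtered.explain(True)
```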
9. When do we need RDDs?
When you need full control over the data, or for unstructured data, use RDDs.
10. Features of an RDD
Immutable, lazily evaluated, fault tolerant
11. Why should we not use RDDs?
Spark does no optimization on RDDs, and the code is much harder to write.
12. Different types of APIs (structured vs unstructured)
Structured => DataFrame and Dataset; Unstructured => RDD
13. What is AQE (Adaptive Query Execution)?
14. Why do we need AQE?