0% found this document useful (0 votes)
111 views6 pages

Data Engineering Interview Prep

The documents provide interview questions for various companies. The questions cover topics like Spark, Python, SQL, data modeling, AWS services, streaming, and more. They are technical questions to assess a candidate's skills and knowledge in big data, cloud, and related domains.

Uploaded by

venket s
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
111 views6 pages

Data Engineering Interview Prep

The documents provide interview questions for various companies. The questions cover topics like Spark, Python, SQL, data modeling, AWS services, streaming, and more. They are technical questions to assess a candidate's skills and knowledge in big data, cloud, and related domains.

Uploaded by

venket s
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Interview Questions-Visa

Manager round

1.Explain your project

2.What are the optimizations you have worked on in spark?

3.What is shuffling?explain

4.any scenario where you deployed a code and experienced failure alerts

5.Difference between coalesce and repartition

6.Any lessons learnt while leading the team

7.Broadcast join

8.Explain your project

Technical round

1.Python question

find all unique pairs of numbers in an array,N which sum to a value s=15
n=[1,9,42,6,2,0,14,15]

2.explain hash table and hash function

3.how do you handle long running jobs in spark?

4.how do you handle data skewness?

5. two tables

merchant_volume,merchant_name,volume are columns

merchant_category,category and merchant_name

sql query to select top merchant of each category

a.What happens if different categories have same merchant name

CELEBAL TECHNOLOGIES

1ST TECHNICAL ROUND

1.explain project story

2.Databricks runtime

3.difference between csv and parquet

4.transformations used in project

5.is it possible to union 2 df with different schema?How can we do it?


6.find non matches between 2 df

7.operators used in airflow

8. what happens if a incremental daily file does not come on a day in databricks

9.How is incremental load done in databricks?

10.higher order functions and anonymous functions in scala

11.is pyspark and sparksql same in terms of execution?difference

2nd round:

AWS Design round

1.Different redshift clusters

2.glue crawlers and what happens if schema changes?

3.different ec2 instances

4.why redshift does not allow primary keys?

5.3 data sources are there and client wants one single source of data?how will the
data modelling be?

6.what is delta load?

7.difference between incremental load and CDC

8.difference between data lake and delta lake

9.difference between athena and redshift

10.how can we increase execution time of lambda?

11.service used to migrate databases?

12.different s3 storage levels and difference

13.how can glue job be triggered?what if one job depends on another?

SMART CUBE

2nd Round:

1.Explain a challenging situation faced in project

2.What is denormalization?

3.difference between union and union all


4.list comprehension

5.lambda functions

6.data structures used

7.find even elements from a list

8.display only the unmatched records from two list

9.why pandas is preferred over spark?

10.how to explode a nested json into row and column in pyspark?

11.what happens internally when we submit spark job

3rd Round:
1.describe any complex architecture you built

2.aws services used

3.find duplicates in a df

4.top 2 customers per month with highest sales(sql)

5.list=[1,2,3,4,5,6]

Find the sum of the odd indexes with and without built-in functions

COFORGE

ROUND 1

1.DIFFERENCE BETWEEN RANK AND DENSE_Rank

2.DEEP COPY VS SHALLOW COPY

3.DATAFRAME VS SERIES

4.LIST VS TUPLE

5.GROUPBYKEY VS REDUCEBYKEY

6.GLUE RESIDES ON MEMORY?

7.RDD VS DATAFRAME

8.SYNCHRONOUS AND ASYNCHRONOUS FUNCTIONS IN LAMBDA


WALMART

ROUND 1

1.find the maximum length of the subset of array having sum as 0

2.find expiry date by adding remaining days to recharge date in pyspark

3.find the count of top trending hashtags but duplicates would not count in the same line
4.spark architecture

5.spark optimizations

6.partitioning in spark

7.yarn architecture

8.shuffle partitions

9.rank vs dense rank

10.data skewness

11.airflow architecture

PUBLICIS

ROUND 2:

1.SERVICES WORKED ON IN AWS

2.PARTITIONING IN HIVE

3.JOBS,STAGE,TASKS IN SPARK

4.SPARK ARCHITECTURE
5.LST=[a,a,b,b,c,c]

Find count of occurrences in python and pyspark

6.airflow architecture

7.project architecture

EPAM

ROUND 2:

1.AWS GLUE ,3,EMR

2.TRANSIENT AND LONG RUNNING JOB IN EMR

3.STEP EXECUTION IN EMR

4.BACKEND OF LAMBDA

5.SYNCHRONOUS AND ASYNCHRONOUS IN LAMBDA

6.spark optimizations

7.relation between cpu cores and partitions

8.ways to solve data skewness

9.can we do repartition on columns

10.adequate query execution in spark

11.generators and decorators

12.list comprehension

13.scd implementation using pyspark

14.args in python

15.checkpointing in spark

16.limitations of lambda

17.where can we see the logs of emr

18.difference between data lake and delta lake

19.serialisation in spark

20.checkpointing in spark

WALMART

ROUND 2:

1.WHAT IS DATA SPILLING?


2.HOW DO YOU DEFINE THE NUMBER OF SHUFFLE PARTITIONS WITH A FILE OF 500 GB AND 10 GB
EXECUTOR MEMORY?

3.Broadcast join and Sort merge join

4.broadcast nested loop join

5.Shuffle partitions concepts

6.spark streaming

7.sql leetcode

8.data spilling

9.how to identify long running jobs

10.how to assign resources to spark jobs

11.case class in scala

12.z-order

13.how to solve out of memory errors in spark?

You might also like