Barclays Data Engineer Interview Questions
(3–4 YOE)
12-16 LPA
SQL
1. How would you optimize a slow-running SQL query?
Query optimization is a critical skill for a Data Engineer. Here are practical strategies:
Steps to Optimize a SQL Query:
1. Check Execution Plan:
o Use EXPLAIN (MySQL/PostgreSQL) or SET SHOWPLAN_ALL ON (SQL Server) to analyze how the query is executed.
o Identify costly operations like full table scans, nested loops, or missing indexes.
2. Use Proper Indexing:
o Create indexes on frequently filtered/joined columns.
o Use composite indexes when filtering on multiple columns together.
o Avoid indexes on columns with very few distinct values (low cardinality) or with frequent updates, where index maintenance outweighs the benefit.
3. Avoid SELECT *:
o Only select required columns to reduce I/O load.
4. Use Joins Efficiently:
o Prefer INNER JOIN over OUTER JOIN when unmatched (NULL-extended) rows are not needed.
o Ensure joined fields are indexed.
5. Filter Early:
o Apply WHERE clauses early to limit the data set before joins and aggregations.
6. Avoid Subqueries When Possible:
o Use JOINs or CTEs (Common Table Expressions) for better performance and readability.
7. Limit Use of Functions in WHERE Clauses:
-- Avoid this:
WHERE YEAR(order_date) = 2024
-- Prefer this:
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
8. Partitioning and Sharding (for big data):
o Use table partitioning to divide large tables logically for faster access.
o Consider sharding for distributed systems.
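To make the execution-plan and indexing points concrete, here is a minimal sketch using SQLite from Python. The orders table, its columns, and the index name are made up for illustration, and the exact plan output wording varies by database and version:
import sqlite3

# In-memory database with a hypothetical orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [(f"2024-03-{10 + i}", 100.0 + i) for i in range(10)],
)

# Sargable date-range filter (no function wrapped around order_date)
query = ("SELECT order_id, amount FROM orders "
         "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")

# Before indexing: expect a full table scan in the plan
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Add an index on the filtered column, then re-check the plan
conn.execute("CREATE INDEX idx_orders_order_date ON orders(order_date)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())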
2. Write a SQL query to find the second highest salary in a table.
Let’s say we have a table called employees(salary).
Query Using Subquery:
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Alternative Using DENSE_RANK() (SQL Server, PostgreSQL, etc.):
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
) ranked
WHERE rnk = 2;
Note: DENSE_RANK() assigns consecutive ranks, so even if the top salary is duplicated the second-highest distinct salary still gets rank 2; RANK() leaves gaps after ties and may skip rank 2 entirely.
3. What’s the difference between the WHERE and HAVING
clauses in SQL?
| Feature | WHERE | HAVING |
|---|---|---|
| Purpose | Filters rows before aggregation | Filters groups after aggregation |
| Usage | Can be used with SELECT, UPDATE, DELETE | Typically used with GROUP BY |
| Aggregate Functions | Cannot use (SUM, AVG, etc.) | Can use |
| Example | WHERE salary > 50000 | HAVING COUNT(*) > 3 |
Example:
-- Using WHERE
SELECT * FROM employees
WHERE department = 'IT';
-- Using HAVING
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;
4. How do you typically handle NULL values in your SQL queries?
Key Strategies:
1. Use IS NULL or IS NOT NULL:
SELECT * FROM employees WHERE manager_id IS NULL;
2. Use COALESCE() or IFNULL():
o Replace NULLs with default values.
SELECT name, COALESCE(department, 'Not Assigned') AS dept FROM employees;
3. Use CASE statements:
SELECT
name,
CASE
WHEN salary IS NULL THEN 'Not Disclosed'
ELSE CAST(salary AS VARCHAR(20)) -- cast so both branches return the same type (use CHAR in MySQL)
END AS salary_status
FROM employees;
4. Avoid NULLs in joins:
o Use INNER JOIN when NULLs are not needed.
o Use LEFT JOIN + COALESCE if necessary.
5. NULL-safe comparison in MySQL:
SELECT * FROM table WHERE column <=> NULL; -- Only TRUE if column is NULL
Data Normalization
1. What is normalization, and why is it important in data
modeling?
Normalization is the process of structuring a relational database to:
• Eliminate data redundancy (duplicate data)
• Ensure data integrity
• Make the database more efficient and easier to maintain
Key Normal Forms:
| Normal Form | Description | Example |
|---|---|---|
| 1NF (First Normal Form) | No repeating groups; atomic columns only | Avoid arrays or multiple values in a single column |
| 2NF (Second Normal Form) | 1NF + no partial dependency on the primary key | Every non-key column depends on the whole key |
| 3NF (Third Normal Form) | 2NF + no transitive dependencies | No non-key column depends on another non-key column |
Why Normalization is Important:
• Reduces data redundancy (e.g., no repeated customer info in each order row)
• Improves data consistency (update in one place only)
• Makes updates, deletions, and insertions safer
• Minimizes storage costs (by avoiding repetition)
However, in OLAP/data warehouses, denormalization (opposite of normalization) is
preferred to optimize for query speed.
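As a quick illustration (a pandas sketch with made-up data, not a full normalization workflow), the snippet below takes a table that violates 1NF because multiple products are stored in one column, splits it into atomic rows, and then moves customer details into their own table referenced by a key:
import pandas as pd

# Hypothetical denormalized orders: one row holds a list of products (violates 1NF)
orders = pd.DataFrame({
    "order_id": [1, 2],
    "customer": ["Asha", "Ravi"],
    "products": [["Laptop", "Mouse"], ["Phone"]],
})

# Toward 1NF: one atomic product value per row
order_items = orders.explode("products").rename(columns={"products": "product"})

# Toward 3NF-style separation: customer details live in their own table,
# and orders reference them by customer_id instead of repeating the name
customers = orders[["customer"]].drop_duplicates().reset_index(drop=True)
customers["customer_id"] = customers.index + 1

order_items = order_items.merge(customers, on="customer")[["order_id", "customer_id", "product"]]
print(order_items)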
2. Explain the difference between a star schema and a
snowflake schema.
Star Schema vs Snowflake Schema:
| Feature | Star Schema | Snowflake Schema |
|---|---|---|
| Structure | Central fact table linked to dimension tables | Central fact table linked to normalized dimension tables |
| Normalization | Denormalized | Normalized |
| Query Performance | Faster (fewer joins) | Slightly slower (more joins) |
| Storage | Uses more space | Uses less space |
| Simplicity | Easier to understand and query | More complex |
Star Schema Example:
Fact Table:
Fact_Transactions (transaction_id, customer_id, product_id, amount, date_id)
Dimension Tables:
• Dim_Customer (customer_id, name, gender, age)
• Dim_Product (product_id, name, category)
• Dim_Date (date_id, full_date, month, year)
Snowflake Schema Example:
• Same as Star Schema but dimension tables are normalized:
o Dim_Product is split into Product, Category
o Dim_Date might be split into Day, Month, Year tables
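To show how a star schema is typically queried, here is a small PySpark sketch; the file paths and column names assume the example tables above and are purely illustrative:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

# Hypothetical tables following the star schema above
fact = spark.read.parquet("dw/fact_transactions")
dim_customer = spark.read.parquet("dw/dim_customer")
dim_date = spark.read.parquet("dw/dim_date")

# Typical star-schema query: join the fact table to its dimensions, then aggregate
monthly_by_customer = (
    fact.join(dim_customer, "customer_id")
        .join(dim_date, "date_id")
        .groupBy("name", "month")
        .agg(F.sum("amount").alias("total_amount"))
)
monthly_by_customer.show()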
3. Designing a data warehouse for a banking system — How
would you approach it?
This is an open-ended system design question. The interviewer is looking for your ability to
think at the architectural level.
Approach:
Step 1: Requirement Gathering
• Understand business KPIs: e.g., number of transactions, loan approvals, daily balances
• Identify stakeholders: finance, fraud detection, compliance, marketing
Step 2: Identify Key Subject Areas (Data Marts)
• Accounts (savings, current, loans)
• Transactions (deposits, withdrawals, transfers)
• Customers
• Cards (credit, debit)
• Loans
Step 3: Design the Schema
• Choose Star Schema for better reporting performance
• Example:
Fact Table:
• Fact_Transactions(transaction_id, customer_id, account_id, amount, date_id, branch_id)
Dimension Tables:
• Dim_Customer(customer_id, name, address, dob, kyc_status)
• Dim_Account(account_id, account_type, open_date)
• Dim_Date(date_id, date, month, quarter, year)
• Dim_Branch(branch_id, branch_name, region)
Step 4: ETL/ELT Design
• Source systems: Core banking systems, customer CRM, external KYC APIs
• Use tools like Apache NiFi, Airflow, or Informatica
• Implement:
o Data cleaning (handle NULLs, outliers)
o Deduplication
o Historical tracking using SCD (Slowly Changing Dimensions)
Step 5: Data Warehouse Layer
• Use cloud DWs like Snowflake, Amazon Redshift, Google BigQuery, or on-premises options like Teradata
• Partition large fact tables
• Use Materialized Views for reporting
Step 6: Reporting Layer
• Build dashboards using Power BI, Tableau, or Looker
• Serve to teams: operations, fraud analytics, compliance
Step 7: Security & Compliance
• Encrypt PII data
• Mask sensitive info (like PAN and Aadhaar numbers)
• Role-based access (RLS)
• Retain logs for audit
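As one small, hedged example that combines Step 5 and Step 7 (the column names, masking rule, and paths are assumptions for illustration), a PySpark load job might mask PAN numbers and write the fact table partitioned by date parts:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bank-dw-load-sketch").getOrCreate()

# Hypothetical staging data produced by the ETL layer
fact = spark.read.parquet("staging/transactions")

# Mask the PAN, keeping only the last 4 characters, and derive partition columns
masked = (
    fact.withColumn("pan_masked", F.concat(F.lit("XXXX-XXXX-"), F.substring("pan", -4, 4)))
        .drop("pan")
        .withColumn("year", F.year("txn_date"))
        .withColumn("month", F.month("txn_date"))
)

# Partition the large fact table for faster date-bounded reporting queries
masked.write.mode("overwrite").partitionBy("year", "month").parquet("dw/fact_transactions")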
Big Data Tools
1. Compare Hadoop and Spark in terms of architecture and
use cases.
1.1 Hadoop Architecture:
• Core Components:
o HDFS (Hadoop Distributed File System): Stores massive data across clusters.
o YARN (Yet Another Resource Negotiator): Manages cluster resources.
o MapReduce: Batch processing framework using map → shuffle → reduce.
• Workflow:
o Data is stored in HDFS → processed using MapReduce → output written back to HDFS.
o Disk I/O intensive (writes intermediate data to disk).
1.2 Spark Architecture:
• Core Components:
o Spark Core: Handles distributed task scheduling.
o RDD (Resilient Distributed Dataset): Immutable, distributed data.
o Catalyst Optimizer: Optimizes Spark SQL query plans.
o DAG Scheduler: Executes jobs in memory using Directed Acyclic Graphs.
• Spark Ecosystem:
o Spark SQL – Structured data
o Spark Streaming – Real-time data
o MLlib – Machine learning
o GraphX – Graph processing
• In-Memory Processing: Stores intermediate data in memory (RAM), making it much faster than MapReduce.
1.3 Comparison Table:
| Feature | Hadoop (MapReduce) | Spark |
|---|---|---|
| Processing Type | Batch only | Batch + real-time |
| Speed | Slower (due to disk I/O) | 10–100x faster (in-memory) |
| Ease of Use | Java-based, verbose | Supports Scala, Python, SQL |
| Fault Tolerance | Yes (via HDFS replication) | Yes (via RDD lineage) |
| Use Cases | Legacy batch ETL | Real-time processing, ML, ETL |
When to Use:
• Hadoop: Archival, cold data storage, traditional batch jobs
• Spark: Real-time analytics, machine learning pipelines, interactive querying
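To illustrate the in-memory difference in practice, here is a small PySpark sketch (the dataset path and columns are made up): caching an intermediate result lets several aggregations reuse it from memory, whereas a MapReduce-style pipeline would write intermediate output back to HDFS between jobs:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-in-memory-sketch").getOrCreate()

# Hypothetical transactions dataset
df = spark.read.parquet("data/transactions")

# cache() keeps the filtered result in memory so both aggregations below reuse it
cleaned = df.filter(F.col("amount") > 0).cache()

cleaned.groupBy("account_id").agg(F.sum("amount").alias("total_amount")).show()
cleaned.groupBy("region").count().show()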
2. Explain how partitioning works in Apache Spark.
Partitioning in Spark:
• Partitioning is how Spark logically divides data across multiple executors or nodes for parallel processing.
• Spark processes each partition in parallel, leading to high performance in distributed environments.
Types of Partitioning:
1. Default Partitioning:
o Automatically based on cluster configuration and number of cores.
o Controlled using spark.default.parallelism.
2. Hash Partitioning (via transformations):
rdd.partitionBy(4)  # key-value (pair) RDDs only: records are hashed by key into 4 partitions
3. Range Partitioning:
o Used in sorted or range-based data.
Repartition vs Coalesce:
| Operation | Description | Use Case |
|---|---|---|
| repartition(n) | Increases or decreases partitions (full shuffle) | When increasing partitions |
| coalesce(n) | Reduces partitions (no full shuffle) | When reducing partitions |
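A short PySpark sketch of the two operations (assuming df is an existing DataFrame and the output path is illustrative):
# Check the current number of partitions
print(df.rdd.getNumPartitions())

# repartition(n) performs a full shuffle; use it to increase parallelism
df_wide = df.repartition(200)

# coalesce(n) merges existing partitions without a full shuffle; commonly used
# to reduce the number of small output files before writing
df_wide.coalesce(10).write.mode("overwrite").parquet("output_path_compacted")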
Why Partitioning Matters:
• Optimizes parallelism
• Reduces data shuffling in joins
• Improves cache efficiency
• Controls skewed data issues
Example: Partitioning in Spark SQL
df.write.partitionBy("country", "year").parquet("output_path")
This creates folders by country/year, making queries faster on those filters.
3. Why might you choose Parquet over CSV for storing large
datasets?
Parquet vs CSV Comparison:
| Feature | CSV | Parquet |
|---|---|---|
| Format Type | Text-based, row-oriented | Columnar binary format |
| Compression | Poor (large file size) | Highly compressed (Snappy, GZIP) |
| Read Performance | Reads entire file | Reads only required columns |
| Schema Support | None | Yes (self-describing metadata) |
| Data Types | Strings (needs manual parsing) | Strongly typed (ints, floats, etc.) |
| Splittable for HDFS | Yes | Yes |
Why Parquet is Preferred:
1. Columnar Storage:
o Efficient for analytical queries (OLAP).
o Only loads relevant columns into memory.
2. Compression:
o Up to 75% smaller than CSV.
o Reduces I/O and storage costs.
3. Schema Enforcement:
o Helps validate and track schema evolution.
4. Integration:
o Well supported in Spark, Hive, AWS Athena, BigQuery.
Use Case Example:
For a banking analytics pipeline, where analysts want to aggregate transactions by
account or region:
• CSV would scan the full dataset, including unused columns.
• Parquet would only load account_id, region, amount columns → faster and cheaper.
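A minimal pandas sketch of that difference (the data is made up, and to_parquet/read_parquet assume pyarrow or fastparquet is installed):
import pandas as pd

df = pd.DataFrame({
    "account_id": [101, 102, 103],
    "region": ["North", "South", "North"],
    "amount": [2500.0, 1200.0, 980.0],
    "notes": ["", "refund", ""],
})

df.to_csv("transactions.csv", index=False)
df.to_parquet("transactions.parquet", index=False)

# CSV: every row is still parsed even though we only keep three columns
csv_subset = pd.read_csv("transactions.csv", usecols=["account_id", "region", "amount"])

# Parquet: the columnar layout lets the reader fetch only the requested columns
parquet_subset = pd.read_parquet("transactions.parquet", columns=["account_id", "region", "amount"])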
Coding
1. Python script to read a large CSV file and apply
transformations
Reading Large CSV Files:
When working with large datasets (e.g., millions of rows), it's efficient to:
• Read data in chunks using pandas.read_csv() with chunksize
• Apply transformations chunk by chunk to avoid memory overflow
Example Code:
import pandas as pd
# Define chunk size
chunk_size = 100000
result = []
# Read CSV in chunks
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    # Transformation: Drop nulls and add a new column
    chunk = chunk.dropna()
    chunk['Total'] = chunk['Price'] * chunk['Quantity']
    result.append(chunk)

# Combine all processed chunks
final_df = pd.concat(result)

# Save to new file
final_df.to_csv("transformed_file.csv", index=False)
Best Practices:
• Use the dtype argument of read_csv to optimize memory usage (see the sketch below)
• Avoid loading full data in RAM if not necessary
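A sketch of how the dtype hint might look for this file (column names and types are assumptions; adjust them to the real schema):
import pandas as pd

# Declaring dtypes up front skips type inference and shrinks memory usage
dtypes = {
    "Price": "float32",
    "Quantity": "int32",      # assumes no missing values; use nullable "Int32" otherwise
    "Category": "category",   # low-cardinality strings compress well as category
}

for chunk in pd.read_csv("large_file.csv", chunksize=100_000, dtype=dtypes):
    chunk = chunk.dropna()
    chunk["Total"] = chunk["Price"] * chunk["Quantity"]
    # ... append or write each processed chunk as in the example above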
2. Handling missing data in Python (Pandas)
Common Missing Data Techniques:
| Technique | Code Example | Use Case |
|---|---|---|
| Drop missing values | df.dropna() | Remove rows/columns with NULLs |
| Fill with constant | df.fillna(0) | Default value like 0 or "Unknown" |
| Forward fill (ffill) | df.fillna(method='ffill') | Time series data |
| Backward fill (bfill) | df.fillna(method='bfill') | Alternative to ffill |
| Fill with mean/median/mode | df['col'].fillna(df['col'].mean()) | Numerical columns |
| Check % of missing data | df.isnull().mean() * 100 | Data quality check |
Example:
# Fill missing age with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Drop rows where 'Salary' is missing
df = df.dropna(subset=['Salary'])
# Fill missing city names with "Unknown"
df['City'] = df['City'].fillna("Unknown")
3. Python Decorators – Explanation & Use Case
What is a Decorator?
• A decorator is a function that modifies another function’s behavior without
changing its code.
• It is widely used in logging, timing, authentication, and caching.
Simple Decorator Example:
def my_decorator(func):
    def wrapper():
        print("Before function runs")
        func()
        print("After function runs")
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()
Output:
Before function runs
Hello!
After function runs
Real Use Case – Logging Execution Time:
import time
def timer_decorator(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper

@timer_decorator
def process_data():
    time.sleep(2)
    print("Data processed")

process_data()
Output:
Data processed
process_data took 2.00 seconds
Decorator Use Cases in Data Engineering:
| Use Case | Purpose |
|---|---|
| Caching results | Avoid recomputation (e.g., @lru_cache) |
| Timing performance | Track execution time of ETL steps |
| Access control | Validate user/session roles |
| Logging | Record inputs, outputs, or errors |
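As a quick illustration of the caching row above, here is a sketch using functools.lru_cache; the exchange-rate lookup and its hard-coded values are stand-ins for a real, more expensive call:
from functools import lru_cache

@lru_cache(maxsize=128)
def exchange_rate(currency):
    print(f"fetching rate for {currency}")  # runs only on a cache miss
    return {"USD": 83.2, "EUR": 90.5}.get(currency, 1.0)  # placeholder rates

exchange_rate("USD")  # computed and cached
exchange_rate("USD")  # served from the cache; no "fetching" message printed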