SQL & Azure Databricks Interview

SQL

In SQL (Structured Query Language), a query is a command or request used to retrieve, manipulate, or manage data within a relational database. It is essentially a question posed to the database, asking for specific information or instructing it to perform an action.
A lingua franca is a shared language of communication adopted by people whose native languages are different, serving as a bridge for mutual understanding in a specific region, field, or context, such as English in global business. The term is used below to describe SQL's role for tabular data.

What is SQL?
SQL (Structured Query Language) is the standard declarative language used to define,
query, modify and control data stored in relational databases. It’s the lingua franca for
working with tabular data (rows × columns) in systems like PostgreSQL, MySQL,
Microsoft SQL Server, Oracle, SQLite, and also powers analytic engines like Databricks
SQL.
SQL is declarative: you describe what you want (e.g., “give me the top 3 sellers”), not how to
fetch it — the database query optimizer figures out the how.
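
For instance, a minimal sketch of that "top 3 sellers" idea (assuming a hypothetical sales table with seller_id and amount columns): the query states the desired result, and the engine picks the access paths and algorithms.

-- Hypothetical sales(seller_id, amount) table: we describe the result, not the procedure.
SELECT seller_id, SUM(amount) AS total_sales
FROM sales
GROUP BY seller_id
ORDER BY total_sales DESC
LIMIT 3;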

Core concepts — in depth (what interviewers expect)
1. Relational model & basic building blocks
• Table: collection of rows (records). Each column has a data type (INT, VARCHAR,
DATE, DECIMAL, BOOLEAN, JSON, etc.).
• Row (tuple) = one record.
• Schema: structure/organization of tables, columns and relationships.
• NULL: absence of a value — important semantics (three-valued logic: TRUE /
FALSE / UNKNOWN).

2. Keys & Constraints


• Primary Key (PK): uniquely identifies rows; usually indexed and NOT NULL.
• Foreign Key (FK): enforces referential integrity between tables (e.g.,
orders.customer_id → customers.id).
• Unique, NOT NULL, CHECK, DEFAULT constraints.
• ON DELETE/UPDATE actions: CASCADE, SET NULL, RESTRICT.
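
A minimal sketch showing several of these constraints together (assumes a customers table with an id primary key; table and column names are illustrative):

CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT NOT NULL,
  amount DECIMAL(10,2) CHECK (amount >= 0),
  status VARCHAR(20) DEFAULT 'NEW',
  FOREIGN KEY (customer_id) REFERENCES customers (id) ON DELETE CASCADE
);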

3. DDL / DML / DCL / TCL


• DDL (Data Definition Language): CREATE, ALTER, DROP, TRUNCATE — defines
schema.
• DML (Data Manipulation Language): SELECT, INSERT, UPDATE, DELETE —
read/write data.
• DCL (Data Control Language): GRANT, REVOKE — permissions.
• TCL (Transaction Control Language): BEGIN/START TRANSACTION, COMMIT,
ROLLBACK, SAVEPOINT.
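
A quick sketch with one statement from each category (object and role names are illustrative):

CREATE TABLE audit_log (id INT PRIMARY KEY, msg VARCHAR(200));  -- DDL
INSERT INTO audit_log VALUES (1, 'first entry');                -- DML
GRANT SELECT ON audit_log TO reporting_role;                    -- DCL
BEGIN;                                                          -- TCL
UPDATE audit_log SET msg = 'updated entry' WHERE id = 1;
COMMIT;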

4. SELECT basics: projection, selection, filtering, sorting


• SELECT column_list FROM table WHERE condition ORDER BY ... LIMIT ...
• Projection — which columns you want.
• Selection — which rows (WHERE).
• DISTINCT removes duplicates.
• ORDER BY sorts; OFFSET / LIMIT or TOP paginates.

5. Joins (core of relational queries)


• INNER JOIN — rows with matching keys in both tables.
• LEFT (LEFT OUTER) JOIN — all left rows, matched right rows or NULLs.
• RIGHT (RIGHT OUTER) JOIN — opposite.
• FULL OUTER JOIN — all rows from both; unmatched sides NULL.
• CROSS JOIN — Cartesian product.
• Self-join — join table to itself (e.g., manager-employee).
Example:

SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

6. Aggregation & GROUP BY


• Aggregates: COUNT(), SUM(), AVG(), MIN(), MAX().
• GROUP BY groups rows before aggregate.
• HAVING filters groups (apply conditions after grouping).

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

7. Subqueries & correlated subqueries


• Scalar subquery: returns single value in SELECT or WHERE.
• Correlated subquery references outer query row — executed per outer row (can be
slower).

SELECT e.name
FROM employees e
WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept);

8. Set operations
• UNION (removes duplicates), UNION ALL (keeps duplicates), INTERSECT, EXCEPT /
MINUS.
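
A small illustration, assuming two order tables that share a customer_id column (INTERSECT/EXCEPT are not available in every engine, e.g. older MySQL versions; Oracle uses MINUS):

SELECT customer_id FROM online_orders
UNION        -- de-duplicated combination of both result sets
SELECT customer_id FROM store_orders;

SELECT customer_id FROM online_orders
EXCEPT       -- ids that appear only in online_orders
SELECT customer_id FROM store_orders;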

9. Window functions (powerful for analytics)


• OVER (PARTITION BY ... ORDER BY ... [frame]):
o Ranking: ROW_NUMBER(), RANK(), DENSE_RANK().
o Offsets: LAG(), LEAD().
o Running totals: SUM(...) OVER (ORDER BY ...).
Example — top N per group:

SELECT *
FROM (
SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) rn
FROM employees s
) t
WHERE rn <= 3;

10. Indexing & how queries become fast


• Index (usually B-tree): speeds lookups, range scans, ORDER BY.
• Clustered index: physical order of rows (e.g., SQL Server primary cluster).
• Non-clustered: separate structure pointing to rows.
• Covering index: index that contains all columns needed for a query → avoids table
access.
• Tradeoffs: indexes speed reads, slow writes and take space; choose selective columns
& proper composite order.
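
For example, a plain index on a join/filter column and a composite index whose column order matches the query; the INCLUDE form for covering indexes is PostgreSQL/SQL Server specific (names are illustrative):

-- Speeds up WHERE / JOIN lookups on customer_id.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Composite index: put the most frequently filtered / most selective column first.
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

-- Covering index: the query can be answered from the index alone.
CREATE INDEX idx_orders_cover ON orders (customer_id) INCLUDE (amount);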

11. Query optimizer & execution plan


• DB produces a logical plan, chooses a physical plan using statistics and cost model.
• Use EXPLAIN / EXPLAIN ANALYZE to inspect plans: look for sequential scans vs index
scans, expensive sorts, join algorithms (nested loop, hash join, merge join).
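
A typical way to inspect a plan (PostgreSQL-style syntax; other engines use EXPLAIN PLAN or graphical plans):

-- EXPLAIN shows the chosen plan; ANALYZE also runs the query and reports actual timings.
EXPLAIN ANALYZE
SELECT c.name, COUNT(*) AS orders
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
-- Look for: Seq Scan vs Index Scan, the join algorithm (Nested Loop / Hash Join / Merge Join),
-- estimated vs actual row counts, and expensive Sort or Hash nodes.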

12. Transactions & ACID


• Atomicity, Consistency, Isolation, Durability.
• Isolation levels (common):
o READ UNCOMMITTED — dirty reads allowed.
o READ COMMITTED — no dirty reads (default in many DBs).
o REPEATABLE READ — no non-repeatable reads.
o SERIALIZABLE — strictest, no phantom reads.
• MVCC (Multi-Version Concurrency Control): PostgreSQL, Oracle style: readers
don’t block writers and vice versa, using snapshots.
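
A short sketch of setting the isolation level explicitly (PostgreSQL-style placement; SQL Server sets the level before BEGIN TRAN):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT SUM(balance) FROM accounts;  -- reads a consistent snapshot under MVCC
COMMIT;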
13. Normalization vs Denormalization
• 1NF/2NF/3NF/BCNF remove redundancy, avoid update anomalies.
• Denormalization used in OLAP/data-warehouses for read performance (star schema,
fact & dimension tables).

14. Views, materialized views, stored procedures, functions, triggers
• View: named query (virtual table).
• Materialized view: stored precomputed result (refresh policy).
• Stored procedures / functions: encapsulate logic inside the DB; use sparingly, since embedding business logic there raises portability concerns.
• Triggers: automatic actions on INSERT/UPDATE/DELETE — use with caution (can
be hidden cost).
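
A brief sketch of a view vs a materialized view (materialized-view syntax shown is PostgreSQL/Oracle style; SQL Server uses indexed views instead):

-- Plain view: a saved query, evaluated at read time.
CREATE VIEW recent_customers AS
SELECT id, email FROM customers WHERE joined_date >= '2025-01-01';

-- Materialized view: precomputed result that must be refreshed.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date;

REFRESH MATERIALIZED VIEW daily_revenue;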

15. Performance best practices (summary)


• Avoid SELECT *.
• Use proper data types.
• Add indexes on columns used in WHERE / JOIN / ORDER BY.
• Avoid functions on indexed columns in WHERE (non-SARGable).
• Batch large writes, use bulk load tools.
• Use EXPLAIN to find bottlenecks.
• Consider partitioning for very large tables.

16. Security & injection prevention


• Use parameterized queries / prepared statements — no string concatenation of
user input.
• Apply least privilege via roles and GRANTs.
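
A sketch of parameterization at the SQL level using PostgreSQL PREPARE/EXECUTE; in application code you would normally rely on the driver's placeholder binding instead of building SQL strings:

-- The user-supplied value is bound as a parameter, never concatenated into the SQL text.
PREPARE find_user (VARCHAR) AS
  SELECT id, email FROM users WHERE email = $1;

EXECUTE find_user('anything the user types is treated as data, not as SQL');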

17. Analytic & Big Data considerations


• For analytic queries on huge datasets: partitioning, columnar stores (Parquet, ORC),
materialized views, distribution keys (in distributed engines like Redshift/Spark), and
cost-based optimization matter.

Practical examples (concise)


Create table + constraints

CREATE TABLE customers (
  id SERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  name VARCHAR(100),
  joined_date DATE DEFAULT CURRENT_DATE
);

Insert / Select / Join / Group

INSERT INTO customers (email, name) VALUES ('a@example.com', 'Amy');

SELECT c.name, COUNT(o.id) AS orders
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE c.joined_date >= '2025-01-01'
GROUP BY c.name
HAVING COUNT(o.id) > 0
ORDER BY orders DESC
LIMIT 10;

Window function — running total

SELECT order_date,
amount,
SUM(amount) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW) AS running_total
FROM orders;

Find duplicates

SELECT email, COUNT(*) AS cnt
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

Nth highest salary (3rd highest)

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2; -- PostgreSQL / MySQL 8+ style
-- OR using window:
SELECT *
FROM (
SELECT e.*, ROW_NUMBER() OVER (ORDER BY salary DESC) rn
FROM employees e
) t
WHERE rn = 3;

Transaction with savepoint

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT deduct_done;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- if something wrong:
ROLLBACK TO SAVEPOINT deduct_done;
COMMIT;
Common interview questions & model
answers (short)
1. What is the difference between WHERE and HAVING?
WHERE filters rows before grouping; HAVING filters after aggregation (on group
results).
2. Primary key vs Unique key?
PK: one per table, uniquely identifies rows, cannot be NULL. Unique key allows
NULLs (in many DBs) and multiple unique constraints allowed.
3. DELETE vs TRUNCATE vs DROP?
DELETE removes rows and can be rolled back (fully logged). TRUNCATE removes all rows
quickly with minimal logging, cannot be rolled back in some DBs, and usually resets
identity counters. DROP removes the table definition and its data.
4. Explain normalization.
Normalization reduces redundancy (1NF atomic values; 2NF removes partial
dependencies; 3NF removes transitive dependencies). Improves updates/inserts but
may require joins.
5. When would you denormalize?
For read-heavy OLAP workloads where joins are expensive; use star schema for
reporting.
6. Why avoid SELECT *?
Selects unnecessary columns, increases I/O, can break clients if schema changes,
prevents covering indexes.
7. What is an index and when would it not be used?
An index supports fast lookups. It may not be used when selectivity is poor (most rows
match), on very small tables, or when the query applies a function to the column or there
is a type mismatch (non-SARGable predicates).
8. Explain JOIN order and performance.
The optimizer decides join order, but the join algorithm matters (nested loop for small,
indexed inputs; hash join for large unsorted inputs; merge join for sorted inputs). Use
indexes on join keys.
9. How do you optimize a slow query?
Check EXPLAIN, add proper indexes, avoid full table scans, check statistics, rewrite
subqueries as joins or use window functions, avoid unnecessary sorting.
10. What is a deadlock and how to resolve?
Deadlock: two transactions block each other. DB detects and aborts one transaction.
Avoid by consistent locking order, keep transactions short, and retry logic.
11. Difference between ROW_NUMBER() and RANK()?
ROW_NUMBER() assigns unique sequential numbers; RANK() gives same rank to ties
and leaves gaps.
12. How to prevent SQL injection?
Parameterized queries / prepared statements, input validation, and least privilege DB
users.

Practice problems (do these in the interview or to prepare)
1. Top 3 products by sales per category — use ROW_NUMBER() partitioned by category
ordered by sum(sales) desc. (Example in window section.)
2. List customers who placed orders every month in 2024 — use aggregation and
COUNT(DISTINCT month) compared to 12, or use EXCEPT against the full set of months
(a sketch follows this list).
3. Given sales table, produce 7-day moving average of revenue — use AVG(amount)
OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
4. Find second highest salary without subquery — using window ROW_NUMBER()
approach shown earlier.
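
A possible sketch for problem 2, assuming an orders(customer_id, order_date) table (date functions vary by engine):

-- Customers with at least one order in all 12 months of 2024.
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
GROUP BY customer_id
HAVING COUNT(DISTINCT EXTRACT(MONTH FROM order_date)) = 12;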


Quick interview checklist (what to say/show)
• Explain things clearly and with examples; when asked to write a query, prefer set-based
solutions (window functions, GROUP BY) over row-by-row loops.
• Mention EXPLAIN and how you'd use it to optimize.
• If given a slow query, show you’d check indexes, stats, and plan; propose adding
covering indexes or rewriting query.
• For systems questions mention ACID, isolation levels, and MVCC (if asked about
Postgres/Oracle).
• For cloud/databricks/azure roles mention familiarity with Azure SQL, Azure
Synapse, Databricks SQL (if appropriate), and data ingestion tools (COPY,
PolyBase, Spark), but don’t over-claim.

Short cheatsheet (one-liners)


• Use EXPLAIN → check seq scan vs index scan.
• WHERE before GROUP BY; HAVING after.
• Prefer JOIN to correlated subqueries for performance (unless optimizer rewrites).
• Window functions = powerful per-group ranking/rolling calculations.
• Index selective columns used in WHERE, JOIN, ORDER BY.
• Avoid SELECT * in production code.
• Always use parameterized queries to prevent injection.

The sections below form a comprehensive interview prep kit:
🔹 Part 1: 20 Likely SQL Interview Questions (with model
answers)
Basics

1. What is SQL and why is it used?


SQL is the standard language for managing and querying data in relational databases.
It allows data definition (DDL), manipulation (DML), control (DCL), and transaction
management (TCL).
2. Difference between DELETE, TRUNCATE, and DROP?
o DELETE: removes rows (can filter with WHERE), logs each row, rollback
possible.
o TRUNCATE: removes all rows quickly, minimal logging, usually resets identity.
o DROP: deletes table schema + data.
3. What are primary key and foreign key?
o PK uniquely identifies a record, can’t be NULL.
o FK enforces relationship between tables, referencing another PK/unique key.
4. What is normalization?
Process of structuring DB to reduce redundancy.
o 1NF: atomic columns
o 2NF: remove partial dependency
o 3NF: remove transitive dependency.
5. Difference between WHERE and HAVING?
WHERE filters before grouping; HAVING filters after grouping.

Intermediate

6. What are joins?


Combine rows from multiple tables: INNER, LEFT, RIGHT, FULL, CROSS, SELF
join.
7. Difference between UNION and UNION ALL?
o UNION: removes duplicates.
o UNION ALL: keeps duplicates.
8. Explain indexes.
Special lookup structures that speed queries. Tradeoff: extra storage and slower
writes. Types: clustered, non-clustered, covering, composite.
9. What are window functions?
Functions applied over partitions of data (OVER clause). Examples: ROW_NUMBER(),
RANK(), LAG(), SUM() for running totals.
10. What is a transaction?
A sequence of SQL statements executed as one unit. Must satisfy ACID properties.

Advanced

11. Explain isolation levels.


• Read Uncommitted (dirty reads allowed)
• Read Committed (default in many DBs, no dirty reads)
• Repeatable Read (no non-repeatable reads)
• Serializable (strictest, avoids phantom reads)

12. What is a deadlock?


When two transactions hold locks the other needs. DB resolves by aborting one.
13. How would you optimize a slow query?

• Check execution plan (EXPLAIN)


• Add indexes
• Avoid SELECT *
• Rewrite subqueries as joins
• Update statistics
• Partition large tables

14. What is the difference between correlated and non-correlated subquery?

• Non-correlated: executes once, independent.


• Correlated: runs for each outer row (slower).

15. What is denormalization? Why use it?


Intentionally adding redundancy for performance, common in OLAP/star schemas.

Hands-On / Scenario Based

16. Find the 2nd highest salary.

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

17. Find employees who don’t have managers.

SELECT e.name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id
WHERE m.id IS NULL;

18. Find duplicate emails.

SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

19. Running total of sales.

SELECT order_date,
SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM sales;

20. Top 3 products per category by revenue.

SELECT *
FROM (
  SELECT p.category, p.name, SUM(o.amount) AS total,
         ROW_NUMBER() OVER (PARTITION BY p.category ORDER BY SUM(o.amount) DESC) rn
  FROM products p
  JOIN orders o ON p.id = o.product_id
  GROUP BY p.category, p.name
) t
WHERE rn <= 3;

🔹 Part 2: 10 SQL Practice Challenges (self-test)


1. List all customers who placed orders in every month of 2024.
2. Find employees earning above department average.
3. Get the highest-selling product in each category.
4. Calculate 7-day moving average of sales.
5. Find customers with no orders.
6. Find customers who ordered only once.
7. Find the most recent order for each customer.
8. Find employees with the same salary.
9. Find total revenue per year.
10. Show top 5 customers by lifetime revenue.

🔹 Part 3: Cloud-Specific Edge (since you added Azure & Databricks in your resume)
• Azure SQL Database:
o Managed SQL Server in cloud, supports T-SQL.
o Scales elastically.
o Security: Active Directory integration, managed identities.
• Databricks SQL:
o SQL interface on top of Delta Lake (big data).
o Handles semi-structured (JSON, Parquet) + structured data.
o Supports ANSI SQL + Spark extensions.
o Use for dashboards, BI reporting, and querying massive datasets.

Interview Tip: When asked, emphasize that the core SQL concepts remain the same, but in
Databricks you deal with huge data volumes and optimized storage formats (Parquet/Delta),
and in Azure you leverage PaaS features (scalability, security, integration with pipelines).
AZURE DATABRICKS
What is Azure Databricks (short answer)
Azure Databricks is a managed, cloud-native data analytics and AI platform that
combines Apache Spark with first-class integrations into Azure (storage, identity,
networking) and Databricks’ Lakehouse features (Delta Lake, Databricks Runtime, SQL
warehouses, ML tools). It’s designed to run scalable ETL, analytics, streaming and ML
workloads on top of your cloud data lake while providing collaboration (notebooks, Repos),
orchestration (Jobs/Workflows) and governance. (Microsoft Learn)

Core concepts — in depth (what interviewers expect)
1) Lakehouse architecture (Databricks + Delta Lake)
• Lakehouse = combine the best of data lakes (cheap, scalable object storage) and data
warehouses (ACID, low-latency reads, schemas). Delta Lake (open format) is the
storage layer Databricks builds on: it provides ACID transactions, schema
enforcement/evolution, time-travel (versioning), and unifies batch & streaming
processing. This is the fundamental data model you’ll work with on Databricks.
(Databricks)

Why it matters: Guarantees correctness for concurrent reads/writes, enables MERGE/UPSERT, and makes analytics & ML workflows simpler and more reliable.

2) Databricks workspace, notebooks & collaboration


• Workspace: web UI for users, groups, permissions, and artifacts (notebooks,
dashboards, libraries, repos).
• Notebooks: interactive development (Python/Scala/SQL/R) with cells, visualizations
and %sql magics.
• Repos: Git integration for code (branches, PR-based workflows).
• DBFS (Databricks File System): abstraction over cloud storage for easy path-based
access. (Microsoft Learn)

Interview tip: Mention collaborative features (comments, versioning via Repos) and how
they speed up reproducible pipelines.
3) Compute: clusters, Databricks Runtime, Photon, and
SQL warehouses
• Clusters: elastic groups of VMs that run Spark. Two main usage modes: all-purpose
(interactive notebooks) and job or serverless (production jobs). Clusters can autoscale
and you can install libraries (PyPI/Maven/CRAN).
• Databricks Runtime (DBR): tuned builds of Spark + optimizations + preinstalled
libraries (ML, GPU support, Delta optimizations).
• Photon: a Databricks-native vectorized query engine/accelerator for faster SQL
workloads (low-latency/high-throughput).
• SQL warehouses (formerly SQL endpoints): SQL-optimized compute for BI and
dashboards (connect to Power BI/Tableau, etc.). (Databricks Documentation)

What to know for interviews: differences between all-purpose vs job clusters, when to use
Photon/SQL warehouses, and how DBR simplifies dependency management.

4) Data ingestion & streaming (Auto Loader, Structured Streaming)
• Auto Loader: file-based incremental ingestion tool that discovers new files in cloud
storage and brings them in reliably (uses checkpoints, scalable discovery). Great for
near-real-time and micro-batch ingestion.
• Structured Streaming: Spark’s high-level streaming API supported by Databricks
for continuous processing; integrates smoothly with Delta for exactly-once semantics.
(Microsoft Learn)

Practical detail: Auto Loader uses a checkpointing mechanism (e.g., RocksDB) to track
discovered files and can be combined with Delta tables for transactional sinks.

5) Delta Lake features & table management


• ACID transactions across reads and writes (important for concurrent
ETL/BI/streaming).
• Time travel: read historical snapshots of a table (for debugging or audits).
• MERGE INTO: easy UPSERT semantics for CDC/slowly-changing dimensions.
• Optimize / Z-ORDER / VACUUM: table maintenance for layout & performance
(file compaction and data skipping). (Databricks)

Example: MERGE INTO to apply CDC from staging to a dimension table — common
interview task.
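
A few of these maintenance and time-travel commands in Databricks SQL, as a sketch (the table name and version number are illustrative):

-- Compact small files and co-locate a frequently filtered column for data skipping.
OPTIMIZE sales.orders ZORDER BY (customer_id);

-- Inspect table versions, then query an older snapshot (time travel).
DESCRIBE HISTORY sales.orders;
SELECT * FROM sales.orders VERSION AS OF 42;

-- Remove data files no longer referenced by the table (subject to the retention period).
VACUUM sales.orders;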
6) Data governance & security (Unity Catalog, access
control, identity)
• Unity Catalog: centralized governance layer for catalogs/schemas/tables — unified
access control, auditing, lineage and discovery across workspaces. It’s the
recommended way to manage permissions and governance at scale in Databricks on
Azure.
• Identity & auth: Azure Databricks integrates with Microsoft Entra ID (Azure AD),
supports managed identities and service principals, and offers credential passthrough
options so compute can access ADLS Gen2 with user identities. (Microsoft Learn)

Interview angle: Explain object-level vs data-level permissions in Unity Catalog and why
that’s superior to legacy metastore approaches.
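
A small example of Unity Catalog-style permissions on the three-level namespace (catalog.schema.table); the catalog, schema and group names are illustrative:

-- Grant read access on a governed table to an account-level group.
GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`;

-- Let the group browse the schema (USE CATALOG on the catalog is also needed).
GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`;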

7) Machine learning & model lifecycle (MLflow, Feature Store)
• MLflow: integrated model tracking, experiments, and model registry — first-class in
Databricks for experiment reproducibility and deployment.
• Feature Store: central store to register, discover and serve ML features, integrated
with Unity Catalog for governance. (Microsoft Learn)

Tip: Interviewers like to hear how you move from data → features → model training →
registry → serving (end-to-end pipeline).

8) Databricks SQL, BI & dashboards


• Databricks SQL: interactive SQL editor, dashboards and BI integration. Uses SQL
warehouses to power high-concurrency BI queries with features like materialized
views, caching and query optimizations. Commonly used to serve analysts and
dashboards. (Databricks)

9) Performance & optimization patterns


• Partitioning: choose partition columns to prune scans.
• File sizing & compaction: avoid too many small files; use OPTIMIZE to compact.
• Z-Order: multi-dimensional clustering for data skipping.
• Caching and Delta caching to speed repeated reads.
• Use EXPLAIN / Spark UI / Ganglia to diagnose shuffles, skew, executor utilization
and expensive stages. (Databricks)
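
As a sketch, a partitioned Delta table followed by targeted compaction (Databricks SQL; names are illustrative):

-- Partition by a low-cardinality column that queries commonly filter on.
CREATE TABLE sales.events (
  event_id BIGINT,
  event_date DATE,
  payload STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Periodically compact small files, limited to recent partitions.
OPTIMIZE sales.events WHERE event_date >= '2025-01-01';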
10) Operations & integrations
• Workflows / Jobs: orchestrate multi-task, parameterized pipelines (notebook, JAR,
Python, SQL tasks).
• APIs & automation: REST API / CLI / Terraform providers to create clusters, jobs,
and SQL warehouses.
• Integrations: ADLS Gen2, Azure Data Factory, Event Hubs, Synapse, Power BI,
third-party BI tools and Delta Sharing for cross-org data exchange. (Databricks
Documentation)

Short practical examples (common interview snippets)
Auto Loader (Python)

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", "/checkpoints/auto_loader/schema")  # needed for schema inference
  .load("abfss://mycontainer@<storage-account>.dfs.core.windows.net/incoming/")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/auto_loader")
  .toTable("bronze.events"))  # example target Delta table name

(See Auto Loader docs for production settings.) (Microsoft Learn)

MERGE into Delta (UPSERT)

MERGE INTO silver.customers t            -- example target table name
USING bronze.customer_updates s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

(Delta provides transactional semantics and time travel.) (Databricks)

Likely interview questions (Azure Databricks) + short model answers
1. What is Databricks Lakehouse and why prefer Delta Lake?
Short: Lakehouse unifies analytics & BI on a single storage layer; Delta gives ACID,
time travel, and efficient batch+stream processing. (Databricks)
2. Explain Databricks Runtime and why use it.
DBR is Databricks’ optimized Spark distribution with tuned libraries and performance
patches — easier operations and faster jobs. (Databricks Documentation)
3. When to use Photon vs standard Spark?
Photon accelerates SQL workloads and vectorized execution — use it for
low-latency/high-concurrency SQL. (Databricks)
4. How does Auto Loader guarantee exactly-once ingestion?
It checkpoints discovered files and uses transactional writes to Delta; recommended
for scalable incremental ingestion. (Microsoft Learn)
5. What is Unity Catalog and how does it differ from workspace metastore?
Unity Catalog centralizes governance across workspaces, with unified RBAC,
auditing and lineage — unlike per-workspace metastores. (Microsoft Learn)
6. How do you optimize a slow job?
Check Spark UI & EXPLAIN, identify shuffles/skew, tune executors and memory,
add partitioning/OPTIMIZE/Z-ORDER, or change join strategy (broadcast vs
shuffle). (Databricks)
7. How to secure data in Azure Databricks?
Use Azure AD + Unity Catalog, managed identities/service principals, storage ACLs
on ADLS Gen2, Key Vault-backed secrets, and credential passthrough where needed.
(Microsoft Learn)
8. What are common pitfalls for Delta tables?
Too many small files, wrong partitioning, not running OPTIMIZE, ignoring file
compaction and not vacuuming old files. (Microsoft Learn)
9. Explain how Databricks supports MLops.
MLflow tracking + model registry + Feature Store + Jobs make training, registering,
promoting and serving models manageable. (Microsoft Learn)
10. How to connect BI tools to Databricks?
Use Databricks SQL (SQL warehouses) via JDBC/ODBC connectors (Power BI,
Tableau, Looker) or use Delta Share for cross-org sharing. (Databricks)

Quick interview prep checklist (one-liners)


• Be able to explain Lakehouse vs DW vs Data Lake. (Microsoft Learn)
• Know Delta Lake features (ACID, time travel, MERGE, OPTIMIZE). (Databricks)
• Explain clusters vs SQL warehouses and Databricks Runtime. (Databricks
Documentation)
• Security: Unity Catalog + Azure AD + managed identities. (Microsoft Learn)
• Walk through end-to-end: ingest → bronze/silver/gold Delta tables → train model
(MLflow) → deploy. (Microsoft Learn)
The next section goes deeper into Azure Databricks so you can explain it clearly in
your interview.

🔹 What is Azure Databricks?


Azure Databricks is a cloud-based unified data and AI platform, built on Apache Spark
and integrated with Microsoft Azure.
It provides a collaborative environment for data engineers, data scientists, and analysts to
process big data, build machine learning models, and perform advanced analytics.

Think of it as a one-stop solution for:

• Data Engineering → Ingest, clean, transform, and prepare massive amounts of data.
• Data Science / ML → Train and deploy ML models at scale.
• Data Analytics → Query structured/semi-structured data and visualize results.

It’s designed to handle big data workloads that traditional SQL databases struggle with.

🔹 Core Concepts of Azure Databricks


1. Apache Spark Foundation
• Databricks is built on Apache Spark, a distributed data processing framework.
• It supports batch processing, streaming, machine learning, and graph computation.
• Spark’s main advantage → in-memory distributed computing for massive speed
gains over traditional MapReduce.

2. Workspaces
• A collaborative environment where teams can share notebooks, libraries, and
dashboards.
• Supports multi-language notebooks (Python, SQL, Scala, R).
• Enables role-based access control (RBAC) for security.

3. Clusters
• A set of virtual machines (VMs) where Spark jobs run.
• Two main types:
o Interactive Clusters → Used for exploration, development, and running
notebooks.
o Job Clusters → Automatically spun up for scheduled jobs and terminated
after completion.
• Autoscaling ensures efficient cost management.

4. Databricks Runtime
• An optimized Spark runtime provided by Databricks.
• Includes pre-configured libraries for ML (MLlib, TensorFlow, PyTorch, scikit-learn).
• Special runtimes exist:
o ML Runtime → For training ML models.
o Genomics Runtime → For bioinformatics workloads.
o Photon Runtime → High-performance query engine for SQL workloads.

5. Data Lake Integration


• Azure Databricks integrates seamlessly with Azure Data Lake Storage (ADLS),
Azure Blob Storage, and Azure SQL Database.
• Supports Delta Lake for ACID transactions on big data.
o Without Delta Lake → big data is append-only and hard to update.
o With Delta Lake → You can insert, update, delete, and merge data like in a
relational database.

6. Delta Lake
• A key feature of Databricks → brings data reliability and consistency to big data.
• Adds ACID transactions, schema enforcement, and time travel (rollback to older
versions of data).
• Solves the “data swamp” problem of data lakes by organizing it into a data
lakehouse (blend of data lake + data warehouse).

7. Databricks SQL
• Provides a SQL-native interface to query and analyze data stored in Delta Lake.
• Enables BI integrations (Power BI, Tableau).
• Supports data analysts who prefer SQL instead of Python/Scala.
8. Jobs & Workflows
• You can schedule ETL pipelines, machine learning model training, or batch jobs.
• Workflows can be triggered via UI, APIs, or Azure Data Factory.
• Job clusters are provisioned only for execution, saving cost.

9. Machine Learning & AI


• Built-in ML environment with MLflow for model tracking, deployment, and
monitoring.
• Supports AutoML for quick prototyping.
• Integration with Azure ML for deployment.

10. Security & Governance


• Deep integration with Azure Active Directory for identity and access management.
• Role-based access, network security (VNET injection), and data encryption.
• Unity Catalog → centralized governance for data access, metadata, and lineage
tracking.

🔹 Why Azure Databricks is Important


• Scalable → Handles petabytes of structured/unstructured data.
• Unified Platform → Combines data engineering, ML, and BI in one place.
• Cost-Efficient → Pay-as-you-go clusters with autoscaling.
• Collaboration → Shared workspaces for engineers, scientists, and analysts.

🔹 Interview Tip
You can position Azure Databricks as:
👉 “A collaborative, cloud-based platform for data engineering, data science, and advanced
analytics, built on Apache Spark, with key strengths in Delta Lake, scalability, and seamless
Azure integration.”
🌟 Self-Introduction (Interview Version)

“Good morning/afternoon. My name is Kanakalakshme, and I joined TCS as a fresher on
8th May this year. My technical skill set includes SQL, Azure Databricks, Pega 8.8, and SAP
Basis and Security. I have also completed certifications in Pega as both a System Architect
and a Senior System Architect, and I earned an SAP certification during my training in the
company.

Along with my technical expertise, I bring strong soft skills such as multitasking, teamwork,
and organizational abilities, which help me manage tasks efficiently and collaborate
effectively in diverse teams.

My career goal is to continuously upskill myself and become a critical resource to the
organization by contributing to impactful projects, solving problems effectively, and ensuring
high-quality outcomes.”
