SQL & Azure Databricks Interview

SQL

In SQL (Structured Query Language), a query is a command or request used to retrieve, manipulate, or manage data within a relational database. It is essentially a question posed to the database, asking for specific information or instructing it to perform an action.
A lingua franca is a shared language of communication adopted by people whose native languages are different, serving as a bridge for mutual understanding in a specific region, field, or context, such as English in global business. The term is used below to describe SQL's role for tabular data.

What is SQL?
SQL (Structured Query Language) is the standard declarative language used to define,
query, modify and control data stored in relational databases. It’s the lingua franca for
working with tabular data (rows × columns) in systems like PostgreSQL, MySQL,
Microsoft SQL Server, Oracle, SQLite, and also powers analytic engines like Databricks
SQL.
SQL is declarative: you describe what you want (e.g., “give me the top 3 sellers”), not how to
fetch it — the database query optimizer figures out the how.
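
For instance, a minimal sketch of that "top 3 sellers" idea (assuming a hypothetical sales table with seller_id and amount columns): the query states the desired result, and the engine picks the access paths and algorithms.

-- Hypothetical sales(seller_id, amount) table: we describe the result, not the procedure.
SELECT seller_id, SUM(amount) AS total_sales
FROM sales
GROUP BY seller_id
ORDER BY total_sales DESC
LIMIT 3;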

Core concepts — in depth (what interviewers expect)
1. Relational model & basic building blocks
• Table: collection of rows (records). Each column has a data type (INT, VARCHAR,
DATE, DECIMAL, BOOLEAN, JSON, etc.).
• Row (tuple) = one record.
• Schema: structure/organization of tables, columns and relationships.
• NULL: absence of a value — important semantics (three-valued logic: TRUE /
FALSE / UNKNOWN).

2. Keys & Constraints


• Primary Key (PK): uniquely identifies rows; usually indexed and NOT NULL.
• Foreign Key (FK): enforces referential integrity between tables (e.g.,
orders.customer_id → customers.id).
• Unique, NOT NULL, CHECK, DEFAULT constraints.
• ON DELETE/UPDATE actions: CASCADE, SET NULL, RESTRICT.
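
A minimal sketch showing several of these constraints together (assumes a customers table with an id primary key; table and column names are illustrative):

CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT NOT NULL,
  amount DECIMAL(10,2) CHECK (amount >= 0),
  status VARCHAR(20) DEFAULT 'NEW',
  FOREIGN KEY (customer_id) REFERENCES customers (id) ON DELETE CASCADE
);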

3. DDL / DML / DCL / TCL


• DDL (Data Definition Language): CREATE, ALTER, DROP, TRUNCATE — defines
schema.
• DML (Data Manipulation Language): SELECT, INSERT, UPDATE, DELETE —
read/write data.
• DCL (Data Control Language): GRANT, REVOKE — permissions.
• TCL (Transaction Control Language): BEGIN/START TRANSACTION, COMMIT,
ROLLBACK, SAVEPOINT.
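
A quick sketch with one statement from each category (object and role names are illustrative):

CREATE TABLE audit_log (id INT PRIMARY KEY, msg VARCHAR(200));  -- DDL
INSERT INTO audit_log VALUES (1, 'first entry');                -- DML
GRANT SELECT ON audit_log TO reporting_role;                    -- DCL
BEGIN;                                                          -- TCL
UPDATE audit_log SET msg = 'updated entry' WHERE id = 1;
COMMIT;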

4. SELECT basics: projection, selection, filtering, sorting


• SELECT column_list FROM table WHERE condition ORDER BY ... LIMIT ...
• Projection — which columns you want.
• Selection — which rows (WHERE).
• DISTINCT removes duplicates.
• ORDER BY sorts; OFFSET / LIMIT or TOP paginates.

5. Joins (core of relational queries)


• INNER JOIN — rows with matching keys in both tables.
• LEFT (LEFT OUTER) JOIN — all left rows, matched right rows or NULLs.
• RIGHT (RIGHT OUTER) JOIN — opposite.
• FULL OUTER JOIN — all rows from both; unmatched sides NULL.
• CROSS JOIN — Cartesian product.
• Self-join — join table to itself (e.g., manager-employee).
Example:

SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

6. Aggregation & GROUP BY


• Aggregates: COUNT(), SUM(), AVG(), MIN(), MAX().
• GROUP BY groups rows before aggregate.
• HAVING filters groups (apply conditions after grouping).

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

7. Subqueries & correlated subqueries


• Scalar subquery: returns single value in SELECT or WHERE.
• Correlated subquery references outer query row — executed per outer row (can be
slower).

SELECT e.name
FROM employees e
WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept);

8. Set operations
• UNION (removes duplicates), UNION ALL (keeps duplicates), INTERSECT, EXCEPT /
MINUS.
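
A small illustration, assuming two order tables that share a customer_id column (INTERSECT/EXCEPT are not available in every engine, e.g. older MySQL versions; Oracle uses MINUS):

SELECT customer_id FROM online_orders
UNION        -- de-duplicated combination of both result sets
SELECT customer_id FROM store_orders;

SELECT customer_id FROM online_orders
EXCEPT       -- ids that appear only in online_orders
SELECT customer_id FROM store_orders;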

9. Window functions (powerful for analytics)


• OVER (PARTITION BY ... ORDER BY ... [frame]):
o Ranking: ROW_NUMBER(), RANK(), DENSE_RANK().
o Offsets: LAG(), LEAD().
o Running totals: SUM(...) OVER (ORDER BY ...).
Example — top N per group:

SELECT *
FROM (
SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) rn
FROM employees s
) t
WHERE rn <= 3;

10. Indexing & how queries become fast


• Index (usually B-tree): speeds lookups, range scans, ORDER BY.
• Clustered index: physical order of rows (e.g., SQL Server primary cluster).
• Non-clustered: separate structure pointing to rows.
• Covering index: index that contains all columns needed for a query → avoids table
access.
• Tradeoffs: indexes speed reads, slow writes and take space; choose selective columns
& proper composite order.
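
For example, a plain index on a join/filter column and a composite index whose column order matches the query; the INCLUDE form for covering indexes is PostgreSQL/SQL Server specific (names are illustrative):

-- Speeds up WHERE / JOIN lookups on customer_id.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Composite index: put the most frequently filtered / most selective column first.
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

-- Covering index: the query can be answered from the index alone.
CREATE INDEX idx_orders_cover ON orders (customer_id) INCLUDE (amount);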

11. Query optimizer & execution plan


• DB produces a logical plan, chooses a physical plan using statistics and cost model.
• Use EXPLAIN / EXPLAIN ANALYZE to inspect plans: look for sequential scans vs index
scans, expensive sorts, join algorithms (nested loop, hash join, merge join).
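
A typical way to inspect a plan (PostgreSQL-style syntax; other engines use EXPLAIN PLAN or graphical plans):

-- EXPLAIN shows the chosen plan; ANALYZE also runs the query and reports actual timings.
EXPLAIN ANALYZE
SELECT c.name, COUNT(*) AS orders
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
-- Look for: Seq Scan vs Index Scan, the join algorithm (Nested Loop / Hash Join / Merge Join),
-- estimated vs actual row counts, and expensive Sort or Hash nodes.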

12. Transactions & ACID


• Atomicity, Consistency, Isolation, Durability.
• Isolation levels (common):
o READ UNCOMMITTED — dirty reads allowed.
o READ COMMITTED — no dirty reads (default in many DBs).
o REPEATABLE READ — no non-repeatable reads.
o SERIALIZABLE — strictest, no phantom reads.
• MVCC (Multi-Version Concurrency Control): PostgreSQL, Oracle style: readers
don’t block writers and vice versa, using snapshots.
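
A short sketch of setting the isolation level explicitly (PostgreSQL-style placement; SQL Server sets the level before BEGIN TRAN):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT SUM(balance) FROM accounts;  -- reads a consistent snapshot under MVCC
COMMIT;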
13. Normalization vs Denormalization
• 1NF/2NF/3NF/BCNF remove redundancy, avoid update anomalies.
• Denormalization used in OLAP/data-warehouses for read performance (star schema,
fact & dimension tables).

14. Views, materialized views, stored procedures, functions, triggers
• View: named query (virtual table).
• Materialized view: stored precomputed result (refresh policy).
• Stored procedures / functions: encapsulate logic inside the DB; use sparingly, since embedding business logic there raises portability concerns.
• Triggers: automatic actions on INSERT/UPDATE/DELETE — use with caution (can
be hidden cost).
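
A brief sketch of a view vs a materialized view (materialized-view syntax shown is PostgreSQL/Oracle style; SQL Server uses indexed views instead):

-- Plain view: a saved query, evaluated at read time.
CREATE VIEW recent_customers AS
SELECT id, email FROM customers WHERE joined_date >= '2025-01-01';

-- Materialized view: precomputed result that must be refreshed.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date;

REFRESH MATERIALIZED VIEW daily_revenue;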

15. Performance best practices (summary)


• Avoid SELECT *.
• Use proper data types.
• Add indexes on columns used in WHERE / JOIN / ORDER BY.
• Avoid functions on indexed columns in WHERE (non-SARGable).
• Batch large writes, use bulk load tools.
• Use EXPLAIN to find bottlenecks.
• Consider partitioning for very large tables.

16. Security & injection prevention


• Use parameterized queries / prepared statements — no string concatenation of
user input.
• Apply least privilege via roles and GRANTs.
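
A sketch of parameterization at the SQL level using PostgreSQL PREPARE/EXECUTE; in application code you would normally rely on the driver's placeholder binding instead of building SQL strings:

-- The user-supplied value is bound as a parameter, never concatenated into the SQL text.
PREPARE find_user (VARCHAR) AS
  SELECT id, email FROM users WHERE email = $1;

EXECUTE find_user('anything the user types is treated as data, not as SQL');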

17. Analytic & Big Data considerations


• For analytic queries on huge datasets: partitioning, columnar stores (Parquet, ORC),
materialized views, distribution keys (in distributed engines like Redshift/Spark), and
cost-based optimization matter.

Practical examples (concise)


Create table + constraints

CREATE TABLE customers (
  id SERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  name VARCHAR(100),
  joined_date DATE DEFAULT CURRENT_DATE
);

Insert / Select / Join / Group

INSERT INTO customers (email, name) VALUES ('a@example.com', 'Amy');

SELECT c.name, COUNT(o.id) AS orders
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE c.joined_date >= '2025-01-01'
GROUP BY c.name
HAVING COUNT(o.id) > 0
ORDER BY orders DESC
LIMIT 10;

Window function — running total

SELECT order_date,
amount,
SUM(amount) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW) AS running_total
FROM orders;

Find duplicates

SELECT email, COUNT(*) AS cnt
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

Nth highest salary (3rd highest)

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2; -- PostgreSQL / MySQL 8+ style
-- OR using window:
SELECT *
FROM (
SELECT e.*, ROW_NUMBER() OVER (ORDER BY salary DESC) rn
FROM employees e
) t
WHERE rn = 3;

Transaction with savepoint

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT deduct_done;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- if something wrong:
ROLLBACK TO SAVEPOINT deduct_done;
COMMIT;
Common interview questions & model
answers (short)
1. What is the difference between WHERE and HAVING?
WHERE filters rows before grouping; HAVING filters after aggregation (on group
results).
2. Primary key vs Unique key?
PK: one per table, uniquely identifies rows, cannot be NULL. Unique key allows
NULLs (in many DBs) and multiple unique constraints allowed.
3. DELETE vs TRUNCATE vs DROP?
DELETE removes rows and can be rolled back (fully logged). TRUNCATE removes all rows
quickly with minimal logging, cannot be rolled back in some DBs, and usually resets
identity counters. DROP removes the table definition and its data.
4. Explain normalization.
Normalization reduces redundancy (1NF atomic values; 2NF removes partial
dependencies; 3NF removes transitive dependencies). Improves updates/inserts but
may require joins.
5. When would you denormalize?
For read-heavy OLAP workloads where joins are expensive; use star schema for
reporting.
6. Why avoid SELECT *?
Selects unnecessary columns, increases I/O, can break clients if schema changes,
prevents covering indexes.
7. What is an index and when would it not be used?
An index supports fast lookups. It may not be used when selectivity is poor (most rows
match), on very small tables, or when the query applies a function to the column or there
is a type mismatch (non-SARGable predicates).
8. Explain JOIN order and performance.
The optimizer decides join order, but the join algorithm matters (nested loop for small,
indexed inputs; hash join for large unsorted inputs; merge join for sorted inputs). Use
indexes on join keys.
9. How do you optimize a slow query?
Check EXPLAIN, add proper indexes, avoid full table scans, check statistics, rewrite
subqueries as joins or use window functions, avoid unnecessary sorting.
10. What is a deadlock and how to resolve?
Deadlock: two transactions block each other. DB detects and aborts one transaction.
Avoid by consistent locking order, keep transactions short, and retry logic.
11. Difference between ROW_NUMBER() and RANK()?
ROW_NUMBER() assigns unique sequential numbers; RANK() gives same rank to ties
and leaves gaps.
12. How to prevent SQL injection?
Parameterized queries / prepared statements, input validation, and least privilege DB
users.

Practice problems (do these in the interview or to prepare)
1. Top 3 products by sales per category — use ROW_NUMBER() partitioned by category
ordered by sum(sales) desc. (Example in window section.)
2. List customers who placed orders every month in 2024 — use aggregation and
COUNT(DISTINCT month) compared to 12, or use EXCEPT against the full set of months
(a sketch follows this list).
3. Given sales table, produce 7-day moving average of revenue — use AVG(amount)
OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
4. Find second highest salary without subquery — using window ROW_NUMBER()
approach shown earlier.
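
A possible sketch for problem 2, assuming an orders(customer_id, order_date) table (date functions vary by engine):

-- Customers with at least one order in all 12 months of 2024.
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
GROUP BY customer_id
HAVING COUNT(DISTINCT EXTRACT(MONTH FROM order_date)) = 12;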


Quick interview checklist (what to say/show)
• Explain things clearly and with examples; when asked to write a query, prefer set-based
solutions (window functions, GROUP BY) over row-by-row loops.
• Mention EXPLAIN and how you'd use it to optimize.
• If given a slow query, show you’d check indexes, stats, and plan; propose adding
covering indexes or rewriting query.
• For systems questions mention ACID, isolation levels, and MVCC (if asked about
Postgres/Oracle).
• For cloud/databricks/azure roles mention familiarity with Azure SQL, Azure
Synapse, Databricks SQL (if appropriate), and data ingestion tools (COPY,
PolyBase, Spark), but don’t over-claim.

Short cheatsheet (one-liners)


• Use EXPLAIN → check seq scan vs index scan.
• WHERE before GROUP BY; HAVING after.
• Prefer JOIN to correlated subqueries for performance (unless optimizer rewrites).
• Window functions = powerful per-group ranking/rolling calculations.
• Index selective columns used in WHERE, JOIN, ORDER BY.
• Avoid SELECT * in production code.
• Always use parameterized queries to prevent injection.

The sections below form a comprehensive interview prep kit:
🔹 Part 1: 20 Likely SQL Interview Questions (with model
answers)
Basics

1. What is SQL and why is it used?


SQL is the standard language for managing and querying data in relational databases.
It allows data definition (DDL), manipulation (DML), control (DCL), and transaction
management (TCL).
2. Difference between DELETE, TRUNCATE, and DROP?
o DELETE: removes rows (can filter with WHERE), logs each row, rollback
possible.
o TRUNCATE: removes all rows quickly, minimal logging, usually resets identity.
o DROP: deletes table schema + data.
3. What are primary key and foreign key?
o PK uniquely identifies a record, can’t be NULL.
o FK enforces relationship between tables, referencing another PK/unique key.
4. What is normalization?
Process of structuring DB to reduce redundancy.
o 1NF: atomic columns
o 2NF: remove partial dependency
o 3NF: remove transitive dependency.
5. Difference between WHERE and HAVING?
WHERE filters before grouping; HAVING filters after grouping.

Intermediate

6. What are joins?


Combine rows from multiple tables: INNER, LEFT, RIGHT, FULL, CROSS, SELF
join.
7. Difference between UNION and UNION ALL?
o UNION: removes duplicates.
o UNION ALL: keeps duplicates.
8. Explain indexes.
Special lookup structures that speed queries. Tradeoff: extra storage and slower
writes. Types: clustered, non-clustered, covering, composite.
9. What are window functions?
Functions applied over partitions of data (OVER clause). Examples: ROW_NUMBER(),
RANK(), LAG(), SUM() for running totals.
10. What is a transaction?
A sequence of SQL statements executed as one unit. Must satisfy ACID properties.

Advanced

11. Explain isolation levels.


• Read Uncommitted (dirty reads allowed)
• Read Committed (default in many DBs, no dirty reads)
• Repeatable Read (no non-repeatable reads)
• Serializable (strictest, avoids phantom reads)

12. What is a deadlock?


When two transactions hold locks the other needs. DB resolves by aborting one.
13. How would you optimize a slow query?

• Check execution plan (EXPLAIN)


• Add indexes
• Avoid SELECT *
• Rewrite subqueries as joins
• Update statistics
• Partition large tables

14. What is the difference between correlated and non-correlated subquery?

• Non-correlated: executes once, independent.


• Correlated: runs for each outer row (slower).

15. What is denormalization? Why use it?


Intentionally adding redundancy for performance, common in OLAP/star schemas.

Hands-On / Scenario Based

16. Find the 2nd highest salary.

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

17. Find employees who don’t have managers.

SELECT e.name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id
WHERE m.id IS NULL;

18. Find duplicate emails.

SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

19. Running total of sales.

SELECT order_date,
SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM sales;

20. Top 3 products per category by revenue.

SELECT *
FROM (
  SELECT p.category, p.name, SUM(o.amount) AS total,
         ROW_NUMBER() OVER (PARTITION BY p.category ORDER BY SUM(o.amount) DESC) rn
  FROM products p
  JOIN orders o ON p.id = o.product_id
  GROUP BY p.category, p.name
) t
WHERE rn <= 3;

🔹 Part 2: 10 SQL Practice Challenges (self-test)


1. List all customers who placed orders in every month of 2024.
2. Find employees earning above department average.
3. Get the highest-selling product in each category.
4. Calculate 7-day moving average of sales.
5. Find customers with no orders.
6. Find customers who ordered only once.
7. Find the most recent order for each customer.
8. Find employees with the same salary.
9. Find total revenue per year.
10. Show top 5 customers by lifetime revenue.

🔹 Part 3: Cloud-Specific Edge (since you added Azure & Databricks in your resume)
• Azure SQL Database:
o Managed SQL Server in cloud, supports T-SQL.
o Scales elastically.
o Security: Active Directory integration, managed identities.
• Databricks SQL:
o SQL interface on top of Delta Lake (big data).
o Handles semi-structured (JSON, Parquet) + structured data.
o Supports ANSI SQL + Spark extensions.
o Use for dashboards, BI reporting, and querying massive datasets.

Interview Tip: When asked, emphasize that the core SQL concepts remain the same, but in
Databricks you deal with huge data volumes and optimized storage formats (Parquet/Delta),
and in Azure you leverage PaaS features (scalability, security, integration with pipelines).
AZURE DATABRICKS
What is Azure Databricks (short answer)
Azure Databricks is a managed, cloud-native data analytics and AI platform that
combines Apache Spark with first-class integrations into Azure (storage, identity,
networking) and Databricks’ Lakehouse features (Delta Lake, Databricks Runtime, SQL
warehouses, ML tools). It’s designed to run scalable ETL, analytics, streaming and ML
workloads on top of your cloud data lake while providing collaboration (notebooks, Repos),
orchestration (Jobs/Workflows) and governance. (Microsoft Learn)

Core concepts — in depth (what interviewers expect)
1) Lakehouse architecture (Databricks + Delta Lake)
• Lakehouse = combine the best of data lakes (cheap, scalable object storage) and data
warehouses (ACID, low-latency reads, schemas). Delta Lake (open format) is the
storage layer Databricks builds on: it provides ACID transactions, schema
enforcement/evolution, time-travel (versioning), and unifies batch & streaming
processing. This is the fundamental data model you’ll work with on Databricks.
(Databricks)

Why it matters: Guarantees correctness for concurrent reads/writes, enables MERGE/UPSERT, and makes analytics & ML workflows simpler and more reliable.

2) Databricks workspace, notebooks & collaboration


• Workspace: web UI for users, groups, permissions, and artifacts (notebooks,
dashboards, libraries, repos).
• Notebooks: interactive development (Python/Scala/SQL/R) with cells, visualizations
and %sql magics.
• Repos: Git integration for code (branches, PR-based workflows).
• DBFS (Databricks File System): abstraction over cloud storage for easy path-based
access. (Microsoft Learn)

Interview tip: Mention collaborative features (comments, versioning via Repos) and how
they speed up reproducible pipelines.
3) Compute: clusters, Databricks Runtime, Photon, and
SQL warehouses
• Clusters: elastic groups of VMs that run Spark. Two main usage modes: all-purpose
(interactive notebooks) and job or serverless (production jobs). Clusters can autoscale
and you can install libraries (PyPI/Maven/CRAN).
• Databricks Runtime (DBR): tuned builds of Spark + optimizations + preinstalled
libraries (ML, GPU support, Delta optimizations).
• Photon: a Databricks-native vectorized query engine/accelerator for faster SQL
workloads (low-latency/high-throughput).
• SQL warehouses (formerly SQL endpoints): SQL-optimized compute for BI and
dashboards (connect to Power BI/Tableau, etc.). (Databricks Documentation)

What to know for interviews: differences between all-purpose vs job clusters, when to use
Photon/SQL warehouses, and how DBR simplifies dependency management.

4) Data ingestion & streaming (Auto Loader, Structured Streaming)
• Auto Loader: file-based incremental ingestion tool that discovers new files in cloud
storage and brings them in reliably (uses checkpoints, scalable discovery). Great for
near-real-time and micro-batch ingestion.
• Structured Streaming: Spark’s high-level streaming API supported by Databricks
for continuous processing; integrates smoothly with Delta for exactly-once semantics.
(Microsoft Learn)

Practical detail: Auto Loader uses a checkpointing mechanism (e.g., RocksDB) to track
discovered files and can be combined with Delta tables for transactional sinks.

5) Delta Lake features & table management


• ACID transactions across reads and writes (important for concurrent
ETL/BI/streaming).
• Time travel: read historical snapshots of a table (for debugging or audits).
• MERGE INTO: easy UPSERT semantics for CDC/slowly-changing dimensions.
• Optimize / Z-ORDER / VACUUM: table maintenance for layout & performance
(file compaction and data skipping). (Databricks)

Example: MERGE INTO to apply CDC from staging to a dimension table — common
interview task.
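
A few of these maintenance and time-travel commands in Databricks SQL, as a sketch (the table name and version number are illustrative):

-- Compact small files and co-locate a frequently filtered column for data skipping.
OPTIMIZE sales.orders ZORDER BY (customer_id);

-- Inspect table versions, then query an older snapshot (time travel).
DESCRIBE HISTORY sales.orders;
SELECT * FROM sales.orders VERSION AS OF 42;

-- Remove data files no longer referenced by the table (subject to the retention period).
VACUUM sales.orders;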
6) Data governance & security (Unity Catalog, access
control, identity)
• Unity Catalog: centralized governance layer for catalogs/schemas/tables — unified
access control, auditing, lineage and discovery across workspaces. It’s the
recommended way to manage permissions and governance at scale in Databricks on
Azure.
• Identity & auth: Azure Databricks integrates with Microsoft Entra ID (Azure AD),
supports managed identities and service principals, and offers credential passthrough
options so compute can access ADLS Gen2 with user identities. (Microsoft Learn)

Interview angle: Explain object-level vs data-level permissions in Unity Catalog and why
that’s superior to legacy metastore approaches.
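
A small example of Unity Catalog-style permissions on the three-level namespace (catalog.schema.table); the catalog, schema and group names are illustrative:

-- Grant read access on a governed table to an account-level group.
GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`;

-- Let the group browse the schema (USE CATALOG on the catalog is also needed).
GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`;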

7) Machine learning & model lifecycle (MLflow, Feature Store)
• MLflow: integrated model tracking, experiments, and model registry — first-class in
Databricks for experiment reproducibility and deployment.
• Feature Store: central store to register, discover and serve ML features, integrated
with Unity Catalog for governance. (Microsoft Learn)

Tip: Interviewers like to hear how you move from data → features → model training →
registry → serving (end-to-end pipeline).

8) Databricks SQL, BI & dashboards


• Databricks SQL: interactive SQL editor, dashboards and BI integration. Uses SQL
warehouses to power high-concurrency BI queries with features like materialized
views, caching and query optimizations. Commonly used to serve analysts and
dashboards. (Databricks)

9) Performance & optimization patterns


• Partitioning: choose partition columns to prune scans.
• File sizing & compaction: avoid too many small files; use OPTIMIZE to compact.
• Z-Order: multi-dimensional clustering for data skipping.
• Caching and Delta caching to speed repeated reads.
• Use EXPLAIN / Spark UI / Ganglia to diagnose shuffles, skew, executor utilization
and expensive stages. (Databricks)
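
As a sketch, a partitioned Delta table followed by targeted compaction (Databricks SQL; names are illustrative):

-- Partition by a low-cardinality column that queries commonly filter on.
CREATE TABLE sales.events (
  event_id BIGINT,
  event_date DATE,
  payload STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Periodically compact small files, limited to recent partitions.
OPTIMIZE sales.events WHERE event_date >= '2025-01-01';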
10) Operations & integrations
• Workflows / Jobs: orchestrate multi-task, parameterized pipelines (notebook, JAR,
Python, SQL tasks).
• APIs & automation: REST API / CLI / Terraform providers to create clusters, jobs,
and SQL warehouses.
• Integrations: ADLS Gen2, Azure Data Factory, Event Hubs, Synapse, Power BI,
third-party BI tools and Delta Sharing for cross-org data exchange. (Databricks
Documentation)

Short practical examples (common interview snippets)
Auto Loader (Python)

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", "/checkpoints/auto_loader/schema")  # needed for schema inference
  .load("abfss://mycontainer@<storage-account>.dfs.core.windows.net/incoming/")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/auto_loader")
  .toTable("bronze.events"))  # example target Delta table name

(See Auto Loader docs for production settings.) (Microsoft Learn)

MERGE into Delta (UPSERT)

MERGE INTO silver.customers t            -- example target table name
USING bronze.customer_updates s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

(Delta provides transactional semantics and time travel.) (Databricks)

Likely interview questions (Azure Databricks) + short model answers
1. What is Databricks Lakehouse and why prefer Delta Lake?
Short: Lakehouse unifies analytics & BI on a single storage layer; Delta gives ACID,
time travel, and efficient batch+stream processing. (Databricks)
2. Explain Databricks Runtime and why use it.
DBR is Databricks’ optimized Spark distribution with tuned libraries and performance
patches — easier operations and faster jobs. (Databricks Documentation)
3. When to use Photon vs standard Spark?
Photon accelerates SQL workloads and vectorized execution — use it for
low-latency/high-concurrency SQL. (Databricks)
4. How does Auto Loader guarantee exactly-once ingestion?
It checkpoints discovered files and uses transactional writes to Delta; recommended
for scalable incremental ingestion. (Microsoft Learn)
5. What is Unity Catalog and how does it differ from workspace metastore?
Unity Catalog centralizes governance across workspaces, with unified RBAC,
auditing and lineage — unlike per-workspace metastores. (Microsoft Learn)
6. How do you optimize a slow job?
Check Spark UI & EXPLAIN, identify shuffles/skew, tune executors and memory,
add partitioning/OPTIMIZE/Z-ORDER, or change join strategy (broadcast vs
shuffle). (Databricks)
7. How to secure data in Azure Databricks?
Use Azure AD + Unity Catalog, managed identities/service principals, storage ACLs
on ADLS Gen2, Key Vault-backed secrets, and credential passthrough where needed.
(Microsoft Learn)
8. What are common pitfalls for Delta tables?
Too many small files, wrong partitioning, not running OPTIMIZE, ignoring file
compaction and not vacuuming old files. (Microsoft Learn)
9. Explain how Databricks supports MLops.
MLflow tracking + model registry + Feature Store + Jobs make training, registering,
promoting and serving models manageable. (Microsoft Learn)
10. How to connect BI tools to Databricks?
Use Databricks SQL (SQL warehouses) via JDBC/ODBC connectors (Power BI,
Tableau, Looker) or use Delta Share for cross-org sharing. (Databricks)

Quick interview prep checklist (one-liners)


• Be able to explain Lakehouse vs DW vs Data Lake. (Microsoft Learn)
• Know Delta Lake features (ACID, time travel, MERGE, OPTIMIZE). (Databricks)
• Explain clusters vs SQL warehouses and Databricks Runtime. (Databricks
Documentation)
• Security: Unity Catalog + Azure AD + managed identities. (Microsoft Learn)
• Walk through end-to-end: ingest → bronze/silver/gold Delta tables → train model
(MLflow) → deploy. (Microsoft Learn)
The next section goes deeper into Azure Databricks so you can explain it clearly in
your interview.

🔹 What is Azure Databricks?


Azure Databricks is a cloud-based unified data and AI platform, built on Apache Spark
and integrated with Microsoft Azure.
It provides a collaborative environment for data engineers, data scientists, and analysts to
process big data, build machine learning models, and perform advanced analytics.

Think of it as a one-stop solution for:

• Data Engineering → Ingest, clean, transform, and prepare massive amounts of data.
• Data Science / ML → Train and deploy ML models at scale.
• Data Analytics → Query structured/semi-structured data and visualize results.

It’s designed to handle big data workloads that traditional SQL databases struggle with.

🔹 Core Concepts of Azure Databricks


1. Apache Spark Foundation
• Databricks is built on Apache Spark, a distributed data processing framework.
• It supports batch processing, streaming, machine learning, and graph computation.
• Spark’s main advantage → in-memory distributed computing for massive speed
gains over traditional MapReduce.

2. Workspaces
• A collaborative environment where teams can share notebooks, libraries, and
dashboards.
• Supports multi-language notebooks (Python, SQL, Scala, R).
• Enables role-based access control (RBAC) for security.

3. Clusters
• A set of virtual machines (VMs) where Spark jobs run.
• Two main types:
o Interactive Clusters → Used for exploration, development, and running
notebooks.
o Job Clusters → Automatically spun up for scheduled jobs and terminated
after completion.
• Autoscaling ensures efficient cost management.

4. Databricks Runtime
• An optimized Spark runtime provided by Databricks.
• Includes pre-configured libraries for ML (MLlib, TensorFlow, PyTorch, scikit-learn).
• Special runtimes exist:
o ML Runtime → For training ML models.
o Genomics Runtime → For bioinformatics workloads.
o Photon Runtime → High-performance query engine for SQL workloads.

5. Data Lake Integration


• Azure Databricks integrates seamlessly with Azure Data Lake Storage (ADLS),
Azure Blob Storage, and Azure SQL Database.
• Supports Delta Lake for ACID transactions on big data.
o Without Delta Lake → big data is append-only and hard to update.
o With Delta Lake → You can insert, update, delete, and merge data like in a
relational database.

6. Delta Lake
• A key feature of Databricks → brings data reliability and consistency to big data.
• Adds ACID transactions, schema enforcement, and time travel (rollback to older
versions of data).
• Solves the “data swamp” problem of data lakes by organizing it into a data
lakehouse (blend of data lake + data warehouse).

7. Databricks SQL
• Provides a SQL-native interface to query and analyze data stored in Delta Lake.
• Enables BI integrations (Power BI, Tableau).
• Supports data analysts who prefer SQL instead of Python/Scala.
8. Jobs & Workflows
• You can schedule ETL pipelines, machine learning model training, or batch jobs.
• Workflows can be triggered via UI, APIs, or Azure Data Factory.
• Job clusters are provisioned only for execution, saving cost.

9. Machine Learning & AI


• Built-in ML environment with MLflow for model tracking, deployment, and
monitoring.
• Supports AutoML for quick prototyping.
• Integration with Azure ML for deployment.

10. Security & Governance


• Deep integration with Azure Active Directory for identity and access management.
• Role-based access, network security (VNET injection), and data encryption.
• Unity Catalog → centralized governance for data access, metadata, and lineage
tracking.

🔹 Why Azure Databricks is Important


• Scalable → Handles petabytes of structured/unstructured data.
• Unified Platform → Combines data engineering, ML, and BI in one place.
• Cost-Efficient → Pay-as-you-go clusters with autoscaling.
• Collaboration → Shared workspaces for engineers, scientists, and analysts.

🔹 Interview Tip
You can position Azure Databricks as:
👉 “A collaborative, cloud-based platform for data engineering, data science, and advanced
analytics, built on Apache Spark, with key strengths in Delta Lake, scalability, and seamless
Azure integration.”
🌟 Self-Introduction (Interview Version)

“Good morning/afternoon. My name is Kanakalakshme, and I joined TCS as a fresher on
8th May this year. My technical skill set includes SQL, Azure Databricks, Pega 8.8, and SAP
Basis and Security. I have also completed certifications in Pega as both a System Architect
and a Senior System Architect, and I earned an SAP certification during my training in the
company.

Along with my technical expertise, I bring strong soft skills such as multitasking, teamwork,
and organizational abilities, which help me manage tasks efficiently and collaborate
effectively in diverse teams.

My career goal is to continuously upskill myself and become a critical resource to the
organization by contributing to impactful projects, solving problems effectively, and ensuring
high-quality outcomes.”
