Indexing in Databases - Infosys SP Interview Preparation
## 💡 What is an Index?
An index is like a shortcut or table of contents in a book. Instead of scanning every row, the
database can jump directly to where your data lives.
---
## ⚙️What Happens Internally When You Create an Index
Example:
CREATE INDEX idx_name ON Employees(name);
### 🧩 Step 1: Database scans the table
The DB engine reads all rows from the Employees table and looks at the indexed column.
### 🧱 Step 2: The DB builds a separate data structure (usually a B-Tree)
The database creates a balanced tree (B-Tree) — a sorted structure that stores:
- The indexed column’s value (e.g., name)
- A pointer (address) to the actual row in the table where that value is stored
This allows O(log n) lookup time.
### ⚡ Step 3: The index file is stored separately
Indexes are stored in a separate file or data segment. Queries using indexed columns use
this structure to find data faster.
### 🔄 Step 4: Updates are tracked
When data is inserted, updated, or deleted, indexes must also be updated, which adds
overhead.
---
## 🧠 Types of Index Structures
| Type | Description | Example Use |
|------|--------------|--------------|
| **B-Tree Index** | Default in most DBs, supports range queries (`>`, `<`) | `salary > 50000` |
| **Hash Index** | Fast for equality lookups (`=`) | `email = '
[email protected]'` |
| **Bitmap Index** | Uses bitmaps for low-cardinality columns | `gender`, `status` |
| **Clustered Index** | Data stored physically in index order (only one per table) | Primary
key |
| **Non-clustered Index** | Separate structure with pointers to rows | Secondary indexes |
---
## 🧩 Example
### Table
| ID | Name | Age |
|----|------|-----|
| 1 | Bob | 30 |
| 2 | Alice| 25 |
| 3 | John | 28 |
After creating an index on `Name`:
Index (B-Tree):
"Alice" → Row 2
"Bob" → Row 1
"John" → Row 3
So a query `SELECT * FROM Employees WHERE Name = 'John';` directly uses the index to
jump to Row 3.
---
## 🧩 What is Index Fragmentation?
Index fragmentation happens when the logical order of pages does not match their physical
order on disk.
### Real-Life Analogy:
Imagine a book where Chapter 1 is on page 10, Chapter 2 is on page 100, and Chapter 3 is
on page 25. Reading it would take longer — that’s fragmentation!
### Causes:
- Inserts (new rows in the middle)
- Deletes (holes in pages)
- Updates (data grows larger and moves)
### Types of Fragmentation:
| Type | Description | Example |
|------|--------------|----------|
| **Internal Fragmentation** | Empty space inside pages | A page only 60% full |
| **External Fragmentation** | Pages out of order | Page 1 → Page 100 → Page 20 |
### Fixing Fragmentation:
- **Reorganize**: Light defragmentation (`ALTER INDEX REORGANIZE`)
- **Rebuild**: Full rebuild (`ALTER INDEX REBUILD`)
- **MySQL**: `OPTIMIZE TABLE table_name;`
- **PostgreSQL**: `VACUUM FULL table_name;`
---
## 💬 Common Interview Questions on Indexing
### 🔹 Basic Level
1. What is an index and why is it used?
2. What are clustered and non-clustered indexes?
3. What are advantages and disadvantages of indexes?
4. What happens internally when you create an index?
5. Why does inserting or deleting data slow down when you have many indexes?
### 🔹 Intermediate Level
6. What data structure is used to implement an index?
7. What is a composite (multi-column) index?
8. What’s the difference between a unique index and a primary key?
9. What’s the difference between a B-Tree and Hash index?
10. When should you avoid creating an index?
### 🔹 Advanced Level
11. What is index fragmentation?
12. How does fragmentation affect performance?
13. What is the difference between internal and external fragmentation?
14. How do databases maintain indexes after updates?
15. What is an index scan vs index seek?
16. How do covering indexes improve performance?
17. How can you check if your query uses an index (e.g., using EXPLAIN)?
---
## 💭 Summary
Indexes improve read performance dramatically but come with a write overhead.
Understanding how they work internally (especially B-Trees and fragmentation) is key for
database optimization and interview success.
What is Indexing?
👉 Definition (simple version):
An index in a database is like the index of a book — it helps you find information faster
without scanning every page (record).
💡 Real-Life Analogy
Imagine a dictionary — if you want to find the word “Database”,
you don’t read every word — you look it up using the alphabetical index.
Similarly, in a database:
Without an index → the database scans every row (slow).
With an index → it jumps directly to the required rows (fast).
⚙️How Does Indexing Work?
When you create an index on a column (say student_id),
the database creates a separate data structure (usually a B-tree) that stores:
student_id → pointer to the row in the table
This allows quick lookups, just like searching in a sorted list.
🧮 Example
Suppose we have a table:
ID Name Age City
1 Raj 22 Delhi
2 Anu 23 Mumbai
3 Ravi 21 Delhi
... ... ... ...
Now if you run:
SELECT * FROM students WHERE ID = 1000;
Without index: Database scans all rows → O(n) time.
With index on ID: Database jumps directly to ID = 1000 → O(log n) time.
🧰 Types of Indexes
Type Description Example
Automatically created on the primary key.
Primary Index PRIMARY KEY (id)
Unique and not null.
Ensures all values are unique but not CREATE UNIQUE INDEX
Unique Index
necessarily a primary key. idx_email ON users(email);
Clustered Physically rearranges table data according
Often on primary key.
Index to index order (only one allowed per table).
Non-Clustered Separate structure; table data remains CREATE INDEX idx_city ON
Index unchanged (can have many). students(city);
Composite CREATE INDEX idx_name_city
Index on multiple columns.
Index ON students(name, city);
🚀 Advantages of Indexing
✅ Faster data retrieval
✅ Improved performance for SELECT, WHERE, JOIN, ORDER BY
✅ Helps maintain uniqueness when using UNIQUE INDEX
⚠️Disadvantages
❌ Slows down INSERT, UPDATE, DELETE (because index must also be updated)
❌ Takes extra memory/storage space
❌ Poorly chosen indexes can actually hurt performance
💬 Example in SQL
-- Create an index
CREATE INDEX idx_student_city ON students(city);
-- Drop an index
DROP INDEX idx_student_city;
-- View indexes (MySQL)
SHOW INDEX FROM students;
🔍 Clustered vs Non-Clustered Index (Very Common Question)
Feature Clustered Index Non-Clustered Index
Data storage Rearranges the table rows physically Keeps a separate structure
Number allowed Only 1 per table Many per table
Access speed Faster (direct access) Slightly slower (extra lookup)
Example Primary Key index Secondary indexes on other columns
🧠 Analogy:
Clustered index → Sorted phonebook itself.
Non-clustered index → A separate list with names and page numbers.
🧠 Interview Q&A Summary
💬 Basic Questions
1. What is an index in a database?
2. Why do we need indexing?
3. What happens when we don’t use an index?
4. What type of data structure is used in indexing (B-tree / Hash)?
5. Can a table have multiple indexes?
6. What is the difference between Clustered and Non-Clustered index?
7. What is a composite index?
⚙️Intermediate Questions
1. How does indexing affect INSERT/UPDATE/DELETE operations?
2. Can we create an index on multiple columns?
3. What is the drawback of using too many indexes?
4. What happens internally when we create an index?
5. What is the difference between Primary Key and Unique Index?
🔥 Advanced / SP-Level Questions
1. How do indexes improve the performance of a JOIN query?
2. How are B-tree and Hash indexes different?
3. What is an index scan vs a table scan?
4. When should you not use an index?
5. How can you check if your query is using an index (EXPLAIN PLAN)?
⚡ Tip for Interviews
“An index is like a shortcut — it improves search speed in large databases, but overusing it
can slow down updates and take extra space.”
Would you like me to make a Word file or 1-page revision sheet (with diagrams + SQL
examples + Q&A) for indexing just like I made for transactions?
It’ll be perfect to revise right before your interview.
You said:
What happens internally when we create an index?
ChatGPT said:
Excellent question — this one often impresses interviewers when you answer it clearly.
Let’s break it down in simple terms, then go a bit deeper technically so you’re interview-
ready 👇
💡 First, a quick recap — What is an Index?
An index is like a shortcut or table of contents in a book.
Instead of scanning every row (like reading every page), the database can jump directly to
where your data lives.
⚙️What Happens Internally When You Create an Index
Let’s say you run:
CREATE INDEX idx_name ON Employees(name);
Here’s what happens step by step 👇
🧩 Step 1: Database scans the table
The DB engine reads all rows from the Employees table.
It looks at the name column (the one you want to index).
🧱 Step 2: The DB builds a separate data structure (usually a B-Tree)
The database creates a balanced tree (B-Tree) — a sorted structure that stores:
o The indexed column’s value (name)
o A pointer (address) to the actual row in the table where that value is stored.
So internally it looks like:
Index (B-Tree)
---------------
| "Alice" → RowID 105 |
| "Bob" → RowID 210 |
| "Charlie"→ RowID 314 |
This structure allows binary search, so finding "Bob" is O(log n), not O(n).
⚡ Step 3: The index file is stored separately
The index is not part of the main table — it’s a separate file or data segment.
Every time you query using the indexed column:
SELECT * FROM Employees WHERE name = 'Bob';
the DB uses the index file to jump directly to RowID 210 instead of scanning every row.
🔄 Step 4: Updates are tracked
When you INSERT, UPDATE, or DELETE data:
o The index must also be updated.
o This adds some overhead (why too many indexes can slow down writes).
🧠 Types of Index Structures
Type Description Example Use
B-Tree Index Default in most DBs, supports range queries (>, <) salary > 50000
email =
Hash Index Fast for equality (=) lookups only
'[email protected]'
Bitmap Index Uses bitmaps for low-cardinality columns gender, status
Data stored physically in index order (only one
Clustered Index Primary key
per table)
Non-clustered
Separate structure with pointers to rows Secondary indexes
Index
🧩 Example
Table
ID Name Age
1 Bob 30
2 Alice 25
3 John 28
After creating an index on Name
The DB creates:
Index (B-Tree):
"Alice" → Row 2
"Bob" → Row 1
"John" → Row 3
So if you run:
SELECT * FROM Employees WHERE Name = 'John';
The database directly follows the pointer from the index instead of scanning all 3 rows.
💬 Common Interview Questions on Indexing
🔹 Basic Level
1. What is an index and why is it used?
2. What are clustered and non-clustered indexes?
3. What are advantages and disadvantages of indexes?
4. What happens internally when you create an index?
5. Why does inserting or deleting data slow down when you have many indexes?
🔹 Intermediate Level
6. What data structure is used to implement an index?
7. What is a composite (multi-column) index?
8. What’s the difference between a unique index and a primary key?
9. What’s the difference between a B-Tree and Hash index?
10. When should you avoid creating an index?
🔹 Advanced Level
11. What is index fragmentation?
12. How do databases maintain indexes after updates?
13. What is an index scan vs index seek?
14. How do covering indexes improve performance?
15. How do you check if your query uses an index (e.g., using EXPLAIN in SQL)?
Would you like me to make a Word file of this explanation + all interview questions
(formatted for quick revision before interview)?
You said:
give word file for the last two chats
ChatGPT said:
Analysis paused
You said:
What is index fragmentation?
ChatGPT said:
Excellent question — this is one of the commonly asked advanced database interview
topics. Let’s break it down very simply 👇
🧩 What is Index Fragmentation?
👉 Definition:
Index fragmentation happens when the logical order of pages (how they’re linked in the
index) does not match the physical order of pages on disk.
In simpler words:
Fragmentation means your index’s data pages are scattered all over the place, making it
slower for the database to read sequentially.
📚 Real-Life Analogy
Imagine a book where:
Chapter 1 is on page 10,
Chapter 2 is on page 100,
Chapter 3 is on page 25.
📖 When you read the book, you keep flipping back and forth — that’s fragmentation.
It slows down reading because the content isn’t stored in order.
⚙️How Does Fragmentation Happen?
Fragmentation usually occurs over time as data changes (insertions, deletions, updates):
1. INSERTS — New rows are added in the middle, and if there’s no space, new pages
are created elsewhere.
2. DELETES — Rows are removed, leaving “holes” in pages.
3. UPDATES — When a row grows larger than its current space, it may be moved to
another page.
Over time, these actions scatter the index pages across the disk.
📊 Types of Index Fragmentation
Type Description Example
Internal Wasted space inside pages (due to A page has only 60% filled
Fragmentation deletes or updates). data.
External Page 1 points to Page 100,
Pages are out of order on disk.
Fragmentation then Page 20.
🐢 Why Fragmentation is a Problem
Slower range scans (BETWEEN, <, >)
More I/O operations (disk heads jump around)
Cache inefficiency
Increased query latency
🧹 How to Fix It
You can fix fragmentation using database maintenance commands:
Database Command Description
SQL Server ALTER INDEX REORGANIZE Defrag pages (lightweight)
Database Command Description
SQL Server ALTER INDEX REBUILD Recreate index (heavier, more effective)
MySQL (InnoDB) OPTIMIZE TABLE table_name; Rebuilds table and indexes
PostgreSQL VACUUM FULL table_name; Reorganizes storage
💬 How to Check Fragmentation
In SQL Server:
SELECT
OBJECT_NAME(OBJECT_ID) AS TableName,
index_id,
avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL)
WHERE avg_fragmentation_in_percent > 10;
🧠 Interview-Focused Summary
Index fragmentation = pages out of logical order.
Caused by inserts, deletes, and updates.
Leads to slow read performance.
Fixed using rebuild or reorganize operations.
Checked via system views or maintenance tools.
💬 Frequently Asked Interview Questions
1. What is index fragmentation?
2. What causes index fragmentation?
3. How does fragmentation affect performance?
4. What’s the difference between internal and external fragmentation?
5. How can you reduce fragmentation?
6. How often should indexes be rebuilt?
7. What’s the difference between REBUILD and REORGANIZE?
8. How can you check the fragmentation level of an index?
SQL and NoSQL databases differ fundamentally in their data models, scalability, schema
flexibility, and query languages.
SQL Databases (Relational Databases):
Data Model:
Organize data into tables with predefined columns and rows, establishing relationships
between tables through foreign keys.
Schema:
Employ a strict, fixed schema that requires defining data types and relationships before
data entry.
Scalability:
Primarily scale vertically, meaning increasing the resources (CPU, RAM) of a single
server. Horizontal scaling can be achieved with more complex architectures like sharding.
Query Language:
Use Structured Query Language (SQL) for data definition, manipulation, and retrieval.
ACID Properties:
Emphasize Atomicity, Consistency, Isolation, and Durability (ACID) to ensure data integrity
and reliability, especially in transactional environments.
NoSQL Databases (Non-Relational Databases):
Data Model:
Offer diverse data models, including document-oriented (e.g., MongoDB), key-value stores
(e.g., Redis), wide-column stores (e.g., Cassandra), and graph databases (e.g., Neo4j).
Schema:
Typically have a dynamic or flexible schema, allowing for easier adaptation to changing data
structures and types.
Scalability:
Designed for horizontal scalability, distributing data across multiple servers or nodes in a
cluster.
Query Language:
Use various query languages specific to their data model, often involving API calls or
specialized query syntaxes.
BASE Properties:
Often prioritize Basic Availability, Soft-state, and Eventual consistency (BASE) over strict
ACID compliance, favoring availability and partition tolerance in distributed systems.
Key Differences Summarized:
Feature SQL Databases NoSQL Databases
Data Model Tabular (relational) Document, Key-Value, Column, Graph
Schema Fixed, predefined Dynamic, flexible
Scalability Primarily vertical Primarily horizontal
Query Language SQL Diverse (e.g., JSON, API calls)
Properties ACID (Consistency, Reliability) BASE (Availability, Scalability)
.
🧩 What is Sharding?
Sharding means splitting a large database into smaller pieces (called shards) and
storing them on different servers.
Each shard holds a portion of the total data, and together they make up the complete
database.
You can think of it like splitting a huge book into multiple smaller volumes — each volume
has a part of the story, but not all of it.
🎯 Why Sharding?
As the amount of data grows, a single database server:
Becomes slow
Runs out of storage
Can’t handle too many requests (high load)
To solve this, companies use sharding — dividing the data across multiple machines so
that:
Each server stores less data
Queries run faster
System can scale horizontally (add more servers easily)
⚙️How Sharding Works (Simple Example)
Let’s say you have a Users table with millions of records.
user_id name city
1 Raj Delhi
2 Anita Mumbai
3 John London
4 Mary New York
Instead of storing everything in one database, you can split it like this:
Shard 1: user_id 1–1000 → stored on Server A
Shard 2: user_id 1001–2000 → stored on Server B
Shard 3: user_id 2001–3000 → stored on Server C
Now, if you need to fetch user with ID = 1500 → you directly go to Shard 2.
🔑 Sharding Key
The sharding key decides how data is split between shards.
Common choices:
User ID
Region/Country
Customer ID
Example:
“All users from India go to Shard 1, users from the USA go to Shard 2.”
A good sharding key ensures data is evenly distributed and queries are predictable.
🧠 Benefits of Sharding
✅ Scalability — Add more servers as data grows
✅ Performance — Smaller data per shard → faster queries
✅ Availability — If one shard fails, others still work
✅ Cost-effective — Use multiple smaller servers instead of one giant one
⚠️Challenges in Sharding
❌ Complexity — Application must know which shard to query
❌ Joins become hard — Data from two shards can’t be joined easily
❌ Resharding — Moving data when shards fill up is difficult
❌ Uneven load — If the sharding key isn’t balanced, one shard may get too much traffic
Real-World Example
Instagram — shards user data by user ID
YouTube — shards video metadata
Twitter — shards tweets by user ID
This allows them to handle billions of users efficiently
Clustering in DBMS (Server-Level or Database Clustering)
🧩 What is Database Clustering?
Database Clustering (also known as Replica Sets) means using multiple servers (nodes)
that are connected to the same database or hold replicas of the same dataset, to improve
performance, availability, and reliability.
👉 It’s like having multiple machines serving the same data, so if one fails, others continue
serving requests.
⚙️Why Clustering is Needed
Sometimes, a single server cannot handle:
Large amount of data
Too many user requests
Need for high uptime
So, multiple servers are grouped together — forming a Database Cluster — to share load
and provide backup.
🧬 How It Works
1. The same dataset is replicated across multiple database servers.
2. A load balancer distributes incoming requests among these servers.
3. If one server fails, another server automatically handles the requests.
So, the system remains available and responsive.
⚙️Architecture Example
Imagine your app connects to:
Server A
Server B
Server C
All these servers have the same database.
If Server A goes down, requests automatically go to Server B or C.
This ensures:
✅ No downtime
✅ Continuous data availability
🪣 Advantages of Database Clustering
1️⃣ Data Redundancy
The same data is stored on multiple servers.
If one crashes, others have the data.
💡 This is useful redundancy, not the kind that causes anomalies.
2️⃣ Load Balancing
Requests are distributed evenly among servers.
Prevents overloading a single machine.
Improves scalability — more servers = more users supported.
💡 Example: If your website suddenly gets 10,000 users, the load can be shared among 3–4
servers seamlessly.
3️⃣ High Availability
Even if one or more servers fail, the system keeps running.
The database stays online almost 100% of the time.
Perfect for mission-critical systems (e.g., banking, e-commerce).
🧭 How Clustering Works (Step-by-Step)
1. A cluster contains multiple servers (called nodes).
2. A load balancer divides incoming requests between nodes.
3. Each node has a replica of the same data.
4. If a node fails:
o The request is redirected to another node.
o Synchronization ensures all nodes remain consistent.
Hence, there are few or no chances of total failure.
🧠 Analogy
Think of a restaurant with many chefs cooking the same menu.
If one chef gets sick (server failure), others keep cooking — customers (users) won’t notice.
⚙️Types of Database Clustering
Type Description
Active-Active All nodes actively handle requests at the same time.
Active- Only one node handles requests; others are on standby, ready to take over in
Passive case of failure.
⚠️Challenges of Clustering
❌ Complex to set up and maintain
❌ Needs synchronization between servers
❌ Expensive hardware resources
❌ Conflicts may occur if replication isn’t properly managed
💬 Common Interview Questions on Clustering
1. What is database clustering?
2. How does clustering improve database performance?
3. What is the difference between Active-Active and Active-Passive clustering?
4. How does clustering ensure high availability?
5. What is the role of replication in clustering?
6. How is clustering different from sharding?
7. What are the main advantages and disadvantages of clustering?
8. How does load balancing work in database clusters?
9. What happens if one node in the cluster fails?
10. Can you give a real-life example of database clustering?