0% found this document useful (0 votes)

12 views21 pages

Infosys SP Indexing and Fragmentation Preparation

The document provides a comprehensive overview of indexing in databases, explaining what an index is, how it is created, and its internal workings, particularly focusing on B-Trees and fragmentation. It outlines various types of index structures, their advantages and disadvantages, and common interview questions related to indexing. Additionally, it discusses index fragmentation, its causes, effects on performance, and methods to fix it.

Uploaded by

Manish singh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views21 pages

Infosys SP Indexing and Fragmentation Preparation

Uploaded by

Manish singh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 21

Indexing in Databases - Infosys SP Interview Preparation

## 💡 What is an Index?
An index is like a shortcut or table of contents in a book. Instead of scanning every row, the
database can jump directly to where your data lives.

---

## ⚙️What Happens Internally When You Create an Index

Example:
CREATE INDEX idx_name ON Employees(name);

### 🧩 Step 1: Database scans the table

The DB engine reads all rows from the Employees table and looks at the indexed column.

### 🧱 Step 2: The DB builds a separate data structure (usually a B-Tree)

The database creates a balanced tree (B-Tree) — a sorted structure that stores:
- The indexed column’s value (e.g., name)
- A pointer (address) to the actual row in the table where that value is stored

This allows O(log n) lookup time.

### ⚡ Step 3: The index file is stored separately

Indexes are stored in a separate file or data segment. Queries using indexed columns use
this structure to find data faster.

### 🔄 Step 4: Updates are tracked

When data is inserted, updated, or deleted, indexes must also be updated, which adds
overhead.

---

## 🧠 Types of Index Structures

| Type | Description | Example Use |

|------|--------------|--------------|
| **B-Tree Index** | Default in most DBs, supports range queries (`>`, `<`) | `salary > 50000` |
| **Hash Index** | Fast for equality lookups (`=`) | `email = '[email protected]'` |
| **Bitmap Index** | Uses bitmaps for low-cardinality columns | `gender`, `status` |
| **Clustered Index** | Data stored physically in index order (only one per table) | Primary
key |
| **Non-clustered Index** | Separate structure with pointers to rows | Secondary indexes |

---

## 🧩 Example

### Table
| ID | Name | Age |
|----|------|-----|
| 1 | Bob | 30 |
| 2 | Alice| 25 |
| 3 | John | 28 |

After creating an index on `Name`:

Index (B-Tree):
"Alice" → Row 2
"Bob" → Row 1
"John" → Row 3

So a query `SELECT * FROM Employees WHERE Name = 'John';` directly uses the index to
jump to Row 3.

---

## 🧩 What is Index Fragmentation?

Index fragmentation happens when the logical order of pages does not match their physical
order on disk.

### Real-Life Analogy:

Imagine a book where Chapter 1 is on page 10, Chapter 2 is on page 100, and Chapter 3 is
on page 25. Reading it would take longer — that’s fragmentation!

### Causes:
- Inserts (new rows in the middle)
- Deletes (holes in pages)
- Updates (data grows larger and moves)

### Types of Fragmentation:

| Type | Description | Example |
|------|--------------|----------|
| **Internal Fragmentation** | Empty space inside pages | A page only 60% full |
| **External Fragmentation** | Pages out of order | Page 1 → Page 100 → Page 20 |
### Fixing Fragmentation:
- **Reorganize**: Light defragmentation (ÀLTER INDEX REORGANIZE`)
- **Rebuild**: Full rebuild (ÀLTER INDEX REBUILD`)
- **MySQL**: ÒPTIMIZE TABLE table_name;`
- **PostgreSQL**: `VACUUM FULL table_name;`

---

## 💬 Common Interview Questions on Indexing

### 🔹 Basic Level

1. What is an index and why is it used?
2. What are clustered and non-clustered indexes?
3. What are advantages and disadvantages of indexes?
4. What happens internally when you create an index?
5. Why does inserting or deleting data slow down when you have many indexes?

### 🔹 Intermediate Level

6. What data structure is used to implement an index?
7. What is a composite (multi-column) index?
8. What’s the difference between a unique index and a primary key?
9. What’s the difference between a B-Tree and Hash index?
10. When should you avoid creating an index?

### 🔹 Advanced Level

11. What is index fragmentation?
12. How does fragmentation affect performance?
13. What is the difference between internal and external fragmentation?
14. How do databases maintain indexes after updates?
15. What is an index scan vs index seek?
16. How do covering indexes improve performance?
17. How can you check if your query uses an index (e.g., using EXPLAIN)?

---

## 💭 Summary
Indexes improve read performance dramatically but come with a write overhead.
Understanding how they work internally (especially B-Trees and fragmentation) is key for
database optimization and interview success.
What is Indexing?

👉 Definition (simple version):

An index in a database is like the index of a book — it helps you find information faster
without scanning every page (record).

💡 Real-Life Analogy

Imagine a dictionary — if you want to find the word “Database”,

you don’t read every word — you look it up using the alphabetical index.

Similarly, in a database:

 Without an index → the database scans every row (slow).

 With an index → it jumps directly to the required rows (fast).

⚙️How Does Indexing Work?

When you create an index on a column (say student_id),

the database creates a separate data structure (usually a B-tree) that stores:

student_id → pointer to the row in the table

This allows quick lookups, just like searching in a sorted list.

🧮 Example

Suppose we have a table:

ID Name Age City

1 Raj 22 Delhi

2 Anu 23 Mumbai

3 Ravi 21 Delhi

... ... ... ...

Now if you run:

SELECT * FROM students WHERE ID = 1000;

 Without index: Database scans all rows → O(n) time.

 With index on ID: Database jumps directly to ID = 1000 → O(log n) time.

🧰 Types of Indexes

Type Description Example

Automatically created on the primary key.

Primary Index PRIMARY KEY (id)
Unique and not null.

Ensures all values are unique but not CREATE UNIQUE INDEX
Unique Index
necessarily a primary key. idx_email ON users(email);

Clustered Physically rearranges table data according

Often on primary key.
Index to index order (only one allowed per table).

Non-Clustered Separate structure; table data remains CREATE INDEX idx_city ON

Index unchanged (can have many). students(city);

Composite CREATE INDEX idx_name_city

Index on multiple columns.
Index ON students(name, city);

🚀 Advantages of Indexing

✅ Faster data retrieval

✅ Improved performance for SELECT, WHERE, JOIN, ORDER BY
✅ Helps maintain uniqueness when using UNIQUE INDEX

⚠️Disadvantages

❌ Slows down INSERT, UPDATE, DELETE (because index must also be updated)
❌ Takes extra memory/storage space
❌ Poorly chosen indexes can actually hurt performance

💬 Example in SQL

-- Create an index

CREATE INDEX idx_student_city ON students(city);

-- Drop an index

DROP INDEX idx_student_city;

-- View indexes (MySQL)

SHOW INDEX FROM students;

🔍 Clustered vs Non-Clustered Index (Very Common Question)

Feature Clustered Index Non-Clustered Index

Data storage Rearranges the table rows physically Keeps a separate structure

Number allowed Only 1 per table Many per table

Access speed Faster (direct access) Slightly slower (extra lookup)

Example Primary Key index Secondary indexes on other columns

🧠 Analogy:

 Clustered index → Sorted phonebook itself.

 Non-clustered index → A separate list with names and page numbers.

🧠 Interview Q&A Summary

💬 Basic Questions

1. What is an index in a database?

2. Why do we need indexing?

3. What happens when we don’t use an index?

4. What type of data structure is used in indexing (B-tree / Hash)?

5. Can a table have multiple indexes?

6. What is the difference between Clustered and Non-Clustered index?

7. What is a composite index?

⚙️Intermediate Questions

1. How does indexing affect INSERT/UPDATE/DELETE operations?

2. Can we create an index on multiple columns?

3. What is the drawback of using too many indexes?

4. What happens internally when we create an index?

5. What is the difference between Primary Key and Unique Index?

🔥 Advanced / SP-Level Questions

1. How do indexes improve the performance of a JOIN query?

2. How are B-tree and Hash indexes different?

3. What is an index scan vs a table scan?

4. When should you not use an index?

5. How can you check if your query is using an index (EXPLAIN PLAN)?

⚡ Tip for Interviews

“An index is like a shortcut — it improves search speed in large databases, but overusing it
can slow down updates and take extra space.”

Would you like me to make a Word file or 1-page revision sheet (with diagrams + SQL
examples + Q&A) for indexing just like I made for transactions?
It’ll be perfect to revise right before your interview.

You said:

What happens internally when we create an index?

ChatGPT said:

Excellent question — this one often impresses interviewers when you answer it clearly.
Let’s break it down in simple terms, then go a bit deeper technically so you’re interview-
ready 👇
💡 First, a quick recap — What is an Index?

An index is like a shortcut or table of contents in a book.

Instead of scanning every row (like reading every page), the database can jump directly to
where your data lives.

⚙️What Happens Internally When You Create an Index

Let’s say you run:

CREATE INDEX idx_name ON Employees(name);

Here’s what happens step by step 👇

🧩 Step 1: Database scans the table

 The DB engine reads all rows from the Employees table.

 It looks at the name column (the one you want to index).

🧱 Step 2: The DB builds a separate data structure (usually a B-Tree)

 The database creates a balanced tree (B-Tree) — a sorted structure that stores:

o The indexed column’s value (name)

o A pointer (address) to the actual row in the table where that value is stored.

So internally it looks like:

Index (B-Tree)

---------------

| "Alice" → RowID 105 |

| "Bob" → RowID 210 |

| "Charlie"→ RowID 314 |

This structure allows binary search, so finding "Bob" is O(log n), not O(n).

⚡ Step 3: The index file is stored separately

 The index is not part of the main table — it’s a separate file or data segment.

 Every time you query using the indexed column:

 SELECT * FROM Employees WHERE name = 'Bob';

the DB uses the index file to jump directly to RowID 210 instead of scanning every row.

🔄 Step 4: Updates are tracked

 When you INSERT, UPDATE, or DELETE data:

o The index must also be updated.

o This adds some overhead (why too many indexes can slow down writes).

🧠 Types of Index Structures

Type Description Example Use

B-Tree Index Default in most DBs, supports range queries (>, <) salary > 50000

email =
Hash Index Fast for equality (=) lookups only
'[email protected]'

Bitmap Index Uses bitmaps for low-cardinality columns gender, status

Data stored physically in index order (only one

Clustered Index Primary key
per table)

Non-clustered
Separate structure with pointers to rows Secondary indexes
Index

🧩 Example

Table

ID Name Age

1 Bob 30

2 Alice 25

3 John 28

After creating an index on Name

The DB creates:

Index (B-Tree):
"Alice" → Row 2

"Bob" → Row 1

"John" → Row 3

So if you run:

SELECT * FROM Employees WHERE Name = 'John';

The database directly follows the pointer from the index instead of scanning all 3 rows.

💬 Common Interview Questions on Indexing

🔹 Basic Level

1. What is an index and why is it used?

2. What are clustered and non-clustered indexes?

3. What are advantages and disadvantages of indexes?

4. What happens internally when you create an index?

5. Why does inserting or deleting data slow down when you have many indexes?

🔹 Intermediate Level

6. What data structure is used to implement an index?

7. What is a composite (multi-column) index?

8. What’s the difference between a unique index and a primary key?

9. What’s the difference between a B-Tree and Hash index?

10. When should you avoid creating an index?

🔹 Advanced Level

11. What is index fragmentation?

12. How do databases maintain indexes after updates?

13. What is an index scan vs index seek?

14. How do covering indexes improve performance?

15. How do you check if your query uses an index (e.g., using EXPLAIN in SQL)?
Would you like me to make a Word file of this explanation + all interview questions
(formatted for quick revision before interview)?

You said:

give word file for the last two chats

ChatGPT said:

Analysis paused

You said:

What is index fragmentation?

ChatGPT said:

Excellent question — this is one of the commonly asked advanced database interview
topics. Let’s break it down very simply 👇

🧩 What is Index Fragmentation?

👉 Definition:
Index fragmentation happens when the logical order of pages (how they’re linked in the
index) does not match the physical order of pages on disk.

In simpler words:

Fragmentation means your index’s data pages are scattered all over the place, making it
slower for the database to read sequentially.

📚 Real-Life Analogy

Imagine a book where:

 Chapter 1 is on page 10,

 Chapter 2 is on page 100,

 Chapter 3 is on page 25.

📖 When you read the book, you keep flipping back and forth — that’s fragmentation.
It slows down reading because the content isn’t stored in order.
⚙️How Does Fragmentation Happen?

Fragmentation usually occurs over time as data changes (insertions, deletions, updates):

1. INSERTS — New rows are added in the middle, and if there’s no space, new pages
are created elsewhere.

2. DELETES — Rows are removed, leaving “holes” in pages.

3. UPDATES — When a row grows larger than its current space, it may be moved to
another page.

Over time, these actions scatter the index pages across the disk.

📊 Types of Index Fragmentation

Type Description Example

Internal Wasted space inside pages (due to A page has only 60% filled
Fragmentation deletes or updates). data.

External Page 1 points to Page 100,

Pages are out of order on disk.
Fragmentation then Page 20.

🐢 Why Fragmentation is a Problem

 Slower range scans (BETWEEN, <, >)

 More I/O operations (disk heads jump around)

 Cache inefficiency

 Increased query latency

🧹 How to Fix It

You can fix fragmentation using database maintenance commands:

Database Command Description

SQL Server ALTER INDEX REORGANIZE Defrag pages (lightweight)

Database Command Description

SQL Server ALTER INDEX REBUILD Recreate index (heavier, more effective)

MySQL (InnoDB) OPTIMIZE TABLE table_name; Rebuilds table and indexes

PostgreSQL VACUUM FULL table_name; Reorganizes storage

💬 How to Check Fragmentation

In SQL Server:

SELECT

OBJECT_NAME(OBJECT_ID) AS TableName,

index_id,

avg_fragmentation_in_percent

FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL)

WHERE avg_fragmentation_in_percent > 10;

🧠 Interview-Focused Summary

 Index fragmentation = pages out of logical order.

 Caused by inserts, deletes, and updates.

 Leads to slow read performance.

 Fixed using rebuild or reorganize operations.

 Checked via system views or maintenance tools.

💬 Frequently Asked Interview Questions

1. What is index fragmentation?

2. What causes index fragmentation?

3. How does fragmentation affect performance?

4. What’s the difference between internal and external fragmentation?

5. How can you reduce fragmentation?

6. How often should indexes be rebuilt?

7. What’s the difference between REBUILD and REORGANIZE?

8. How can you check the fragmentation level of an index?

SQL and NoSQL databases differ fundamentally in their data models, scalability, schema
flexibility, and query languages.

SQL Databases (Relational Databases):

 Data Model:

Organize data into tables with predefined columns and rows, establishing relationships
between tables through foreign keys.

 Schema:

Employ a strict, fixed schema that requires defining data types and relationships before
data entry.

 Scalability:

Primarily scale vertically, meaning increasing the resources (CPU, RAM) of a single
server. Horizontal scaling can be achieved with more complex architectures like sharding.

 Query Language:

Use Structured Query Language (SQL) for data definition, manipulation, and retrieval.

 ACID Properties:

Emphasize Atomicity, Consistency, Isolation, and Durability (ACID) to ensure data integrity
and reliability, especially in transactional environments.

NoSQL Databases (Non-Relational Databases):

 Data Model:

Offer diverse data models, including document-oriented (e.g., MongoDB), key-value stores
(e.g., Redis), wide-column stores (e.g., Cassandra), and graph databases (e.g., Neo4j).

 Schema:
Typically have a dynamic or flexible schema, allowing for easier adaptation to changing data
structures and types.

 Scalability:

Designed for horizontal scalability, distributing data across multiple servers or nodes in a
cluster.

 Query Language:

Use various query languages specific to their data model, often involving API calls or
specialized query syntaxes.

 BASE Properties:

Often prioritize Basic Availability, Soft-state, and Eventual consistency (BASE) over strict
ACID compliance, favoring availability and partition tolerance in distributed systems.

Key Differences Summarized:

Feature SQL Databases NoSQL Databases

Data Model Tabular (relational) Document, Key-Value, Column, Graph

Schema Fixed, predefined Dynamic, flexible

Scalability Primarily vertical Primarily horizontal

Query Language SQL Diverse (e.g., JSON, API calls)

Properties ACID (Consistency, Reliability) BASE (Availability, Scalability)

.
🧩 What is Sharding?

Sharding means splitting a large database into smaller pieces (called shards) and
storing them on different servers.

Each shard holds a portion of the total data, and together they make up the complete
database.

You can think of it like splitting a huge book into multiple smaller volumes — each volume
has a part of the story, but not all of it.

🎯 Why Sharding?

As the amount of data grows, a single database server:

 Becomes slow

 Runs out of storage

 Can’t handle too many requests (high load)

To solve this, companies use sharding — dividing the data across multiple machines so
that:

 Each server stores less data

 Queries run faster

 System can scale horizontally (add more servers easily)

⚙️How Sharding Works (Simple Example)

Let’s say you have a Users table with millions of records.

user_id name city

1 Raj Delhi

2 Anita Mumbai

3 John London

4 Mary New York

Instead of storing everything in one database, you can split it like this:
 Shard 1: user_id 1–1000 → stored on Server A

 Shard 2: user_id 1001–2000 → stored on Server B

 Shard 3: user_id 2001–3000 → stored on Server C

Now, if you need to fetch user with ID = 1500 → you directly go to Shard 2.

🔑 Sharding Key

The sharding key decides how data is split between shards.

Common choices:

 User ID

 Region/Country

 Customer ID

Example:

“All users from India go to Shard 1, users from the USA go to Shard 2.”

A good sharding key ensures data is evenly distributed and queries are predictable.

🧠 Benefits of Sharding

✅ Scalability — Add more servers as data grows

✅ Performance — Smaller data per shard → faster queries
✅ Availability — If one shard fails, others still work
✅ Cost-effective — Use multiple smaller servers instead of one giant one

⚠️Challenges in Sharding

❌ Complexity — Application must know which shard to query

❌ Joins become hard — Data from two shards can’t be joined easily
❌ Resharding — Moving data when shards fill up is difficult
❌ Uneven load — If the sharding key isn’t balanced, one shard may get too much traffic

Real-World Example

 Instagram — shards user data by user ID

 YouTube — shards video metadata

 Twitter — shards tweets by user ID

This allows them to handle billions of users efficiently

Clustering in DBMS (Server-Level or Database Clustering)

🧩 What is Database Clustering?

Database Clustering (also known as Replica Sets) means using multiple servers (nodes)
that are connected to the same database or hold replicas of the same dataset, to improve
performance, availability, and reliability.

👉 It’s like having multiple machines serving the same data, so if one fails, others continue
serving requests.

⚙️Why Clustering is Needed

Sometimes, a single server cannot handle:

 Large amount of data

 Too many user requests

 Need for high uptime

So, multiple servers are grouped together — forming a Database Cluster — to share load
and provide backup.

🧬 How It Works

1. The same dataset is replicated across multiple database servers.

2. A load balancer distributes incoming requests among these servers.

3. If one server fails, another server automatically handles the requests.

So, the system remains available and responsive.

⚙️Architecture Example

Imagine your app connects to:

 Server A

 Server B

 Server C

All these servers have the same database.

If Server A goes down, requests automatically go to Server B or C.

This ensures:
✅ No downtime
✅ Continuous data availability

🪣 Advantages of Database Clustering

1️⃣ Data Redundancy

 The same data is stored on multiple servers.

 If one crashes, others have the data.

💡 This is useful redundancy, not the kind that causes anomalies.

2️⃣ Load Balancing

 Requests are distributed evenly among servers.

 Prevents overloading a single machine.

 Improves scalability — more servers = more users supported.

💡 Example: If your website suddenly gets 10,000 users, the load can be shared among 3–4
servers seamlessly.

3️⃣ High Availability

 Even if one or more servers fail, the system keeps running.

 The database stays online almost 100% of the time.

 Perfect for mission-critical systems (e.g., banking, e-commerce).

🧭 How Clustering Works (Step-by-Step)

1. A cluster contains multiple servers (called nodes).

2. A load balancer divides incoming requests between nodes.

3. Each node has a replica of the same data.

4. If a node fails:

o The request is redirected to another node.

o Synchronization ensures all nodes remain consistent.

Hence, there are few or no chances of total failure.

🧠 Analogy

Think of a restaurant with many chefs cooking the same menu.

If one chef gets sick (server failure), others keep cooking — customers (users) won’t notice.

⚙️Types of Database Clustering

Type Description

Active-Active All nodes actively handle requests at the same time.

Active- Only one node handles requests; others are on standby, ready to take over in
Passive case of failure.

⚠️Challenges of Clustering

❌ Complex to set up and maintain

❌ Needs synchronization between servers
❌ Expensive hardware resources
❌ Conflicts may occur if replication isn’t properly managed

💬 Common Interview Questions on Clustering

1. What is database clustering?

2. How does clustering improve database performance?

3. What is the difference between Active-Active and Active-Passive clustering?

4. How does clustering ensure high availability?

5. What is the role of replication in clustering?

6. How is clustering different from sharding?

7. What are the main advantages and disadvantages of clustering?

8. How does load balancing work in database clusters?

9. What happens if one node in the cluster fails?

10. Can you give a real-life example of database clustering?

ACD - Eca.2408 038 (PIS) EnerG X Park-Educational Trip 20240823-EP
No ratings yet
ACD - Eca.2408 038 (PIS) EnerG X Park-Educational Trip 20240823-EP
3 pages
SQL Server Resume & CV Builder Tools
No ratings yet
SQL Server Resume & CV Builder Tools
9 pages
My Contacts 1
No ratings yet
My Contacts 1
10 pages
Adani Group: India's Business Powerhouse
No ratings yet
Adani Group: India's Business Powerhouse
3 pages
RFQ - Welding Consumble Lot Testing Brief
No ratings yet
RFQ - Welding Consumble Lot Testing Brief
2 pages
Monthly Current Affairs Quiz - February 2019
No ratings yet
Monthly Current Affairs Quiz - February 2019
40 pages
Software Engineering Project A...
No ratings yet
Software Engineering Project A...
16 pages
Mobile Gaming and Academic Performance of The Students
100% (1)
Mobile Gaming and Academic Performance of The Students
14 pages
Eight-Week Academic Calendar for Primary
No ratings yet
Eight-Week Academic Calendar for Primary
162 pages
SP - Building Construction
No ratings yet
SP - Building Construction
9 pages
Lectures in International Marketing 2019
No ratings yet
Lectures in International Marketing 2019
61 pages
HSG 6
No ratings yet
HSG 6
4 pages
Modular Kitchen Design Details
No ratings yet
Modular Kitchen Design Details
25 pages
Pop Culture
No ratings yet
Pop Culture
270 pages
Indian Data Privacy in Digital Marketing
No ratings yet
Indian Data Privacy in Digital Marketing
31 pages
Carol Rivers'board Review Flashcards
No ratings yet
Carol Rivers'board Review Flashcards
440 pages
Vendor Pre Qualification Form
No ratings yet
Vendor Pre Qualification Form
6 pages
MSME
No ratings yet
MSME
3 pages
Al115c Mio Sporty Exhaust
No ratings yet
Al115c Mio Sporty Exhaust
2 pages
Angelarium Oracle of Emanations
No ratings yet
Angelarium Oracle of Emanations
145 pages
Scope of Work
100% (1)
Scope of Work
2 pages
Dechow Dichev TAR 2002-1
No ratings yet
Dechow Dichev TAR 2002-1
25 pages
Union Properties Case Study
No ratings yet
Union Properties Case Study
2 pages
Build A Large Space Saving CNC Router For Under 60
No ratings yet
Build A Large Space Saving CNC Router For Under 60
10 pages
ECE3009
No ratings yet
ECE3009
2 pages
CA Intermediate FM SM Exam
No ratings yet
CA Intermediate FM SM Exam
6 pages
ABB Low Voltage Coils & Kits Pricing
No ratings yet
ABB Low Voltage Coils & Kits Pricing
1 page
Siemens Containerized Substation
No ratings yet
Siemens Containerized Substation
10 pages
Stacks and Subroutines in 8085
No ratings yet
Stacks and Subroutines in 8085
25 pages
40 (3ph) KVA-WC
No ratings yet
40 (3ph) KVA-WC
5 pages