
An
Industrial Training Report
ON
"Data Analysis"

Submitted in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology
in the Department of Computer Science

Submitted to:
Mr. Pawan Sen
HOD (Head of Department)
Department of Computer Science

Submitted by:
Neeshu Yadav
Roll No.: 22EAICS198

Arya College of Engineering, Kukas, Jaipur (2025-26)


Certificate Of Completion
DECLARATION BY THE CANDIDATE

I, Neeshu Yadav, hereby declare that the training work titled "Data Analysis" is an original and
authentic piece of work carried out by me at Arya College of Engineering, Jaipur in partial
fulfilment of the requirements for the award of the degree of Bachelor of Technology (B.Tech.).
I further affirm that this work has not been submitted, either fully or partially, for the award
of any other degree or diploma at any institution.

Date: 09 Sep. 2025
Name: Neeshu Yadav


Acknowledgement

It gives me immense pleasure to express my heartfelt gratitude to all those who have
supported and guided me throughout the successful completion of this final year training.
I would first like to extend my sincere thanks to Dr. Arvind Agarwal, Chairman, and
Dr. Puja Agarwal, Vice-President of Arya College of Engineering, Kukas, Jaipur, for
providing an excellent academic environment and all the necessary facilities that
foster learning and innovation.
I am deeply grateful to our respected Principal, Dr. Himanshu Arora, for his
continuous encouragement and administrative support.
I would also like to express my special thanks to Er. Pawan Sen, Head of the
Department, Computer Science and Engineering, for his visionary leadership,
valuable suggestions, and motivation throughout the course of this project.
I am also thankful to my In-charge, Er. Narender Narwal, whose timely feedback,
guidance, and coordination played a vital role in ensuring the quality and
completion of this work.
Last but not least, I sincerely thank all the faculty members, lab assistants, and my
peers who contributed directly or indirectly with their support and inspiration
during the entire training duration.

This training has been a significant learning experience and a stepping stone in my
professional development.

Neeshu Yadav
22EAICS198
B.Tech., 7th Semester
PREFACE

The present report is the outcome of the online practical training on "Data Analysis"
provided by Seldom India Pvt. Ltd., Jaipur. The objective of this training is to learn
technologies through practical application and to gain a firm grip on them. The course
provided hands-on experience with the complete analysis workflow, from collecting and
cleaning data to analyzing and visualizing it, and helped build a strong foundation for
working with real-world datasets.
During the tenure of 45 days, I learned about and worked on different technologies such as
Python, MS Excel, and SQL (including query writing and DML).

CONTENTS

CHAPTER 1: INTRODUCTION ABOUT THE COURSE

1.1 Objectives

1.2 Motivation

1.3 Layout of the Report

CHAPTER 2: BASICS OF DATA ANALYSIS

CHAPTER 3: PYTHON

3.1 Data Collection and Cleaning

3.2 Data Analysis and Manipulation

3.3 Data Visualization and Reporting

CHAPTER 4: MS EXCEL

4.1 Basics and Foundation

4.2 Formulas and Functions

4.3 Data Management

4.4 Data Analysis Tools

4.5 Data Visualization

CHAPTER 5: SQL

5.1 SQL Basics and Foundations

5.2 Data Retrieval and Filtering

5.3 Aggregation and Grouping

5.4 Joins and Relationships

5.5 Subqueries and Set Operations

CHAPTER 6: PROJECT

6.1 About the Project

6.2 Methodology

6.3 Skills Gained

CHAPTER 7: REFERENCES

7.1 Data Sources
CHAPTER 1: INTRODUCTION ABOUT THE COURSE
1.1 Objectives
The primary objective of this industrial training on Data Analysis is to enable students to bridge the gap
between theoretical concepts and practical applications. The training emphasizes hands-on learning of tools
such as Python, MS Excel, and SQL. These tools empower students to handle, analyze, and visualize real-
world data effectively.

Specific objectives include:


1. To develop proficiency in data handling and cleaning techniques.
2. To gain practical experience with Python, Excel, and SQL.
3. To understand the importance of visualization in decision making.
4. To prepare students for real-world challenges in the field of data science.
5. To cultivate problem-solving skills through case studies and practical projects.

1.2 Motivation
The motivation for choosing Data Analysis as a training area arises from the increasing demand for data-
driven decision-making in every sector. Companies today rely on insights derived from large volumes of
data to optimize operations, increase efficiency, and enhance customer satisfaction. By acquiring these
skills during training, students position themselves for better career opportunities and future research
prospects.

In addition, analyzing data cultivates a logical mindset and helps students approach problems with critical
thinking. This training therefore not only prepares students technically but also enhances their professional
competence.
1.3 Layout of the Report
The report is divided into multiple chapters for clarity and systematic presentation:
- Chapter 1 introduces the objectives, motivation, and structure of the report.
- Chapter 2 provides the basics of Data Analysis.
- Chapter 3 covers Python programming for data handling and visualization.
- Chapter 4 explains the use of MS Excel as a data analysis tool.
- Chapter 5 demonstrates SQL for data retrieval and management.
Each chapter includes explanations, code examples, and case studies where relevant.

CHAPTER 2: BASICS OF DATA ANALYSIS
Data Analysis refers to the process of examining raw data with the purpose of drawing meaningful conclusions. It
is a multidisciplinary field that integrates mathematics, statistics, computer science, and domain-specific
knowledge. The objective is to discover useful information, identify patterns, and support better decision-making.

Key stages in data analysis include:


1. Data Collection – Gathering data from primary and secondary sources.
2. Data Cleaning – Removing inconsistencies and preparing the data.
3. Data Exploration – Understanding data through summary statistics and visualizations.
4. Data Modeling – Applying statistical or machine learning models.
5. Interpretation – Converting analysis results into actionable insights.

Types of Data Analysis:


- **Descriptive Analysis:** Describes historical data to understand what has happened.
- **Diagnostic Analysis:** Investigates why something happened.
- **Predictive Analysis:** Uses models to forecast future outcomes.
- **Prescriptive Analysis:** Suggests actions based on analysis.

Tools for Data Analysis:


1. **Python:** Rich libraries like pandas, NumPy, and matplotlib.
2. **MS Excel:** Widely used for its simplicity and built-in formulas.
3. **SQL:** Essential for querying structured databases.
4. **R:** Popular in statistical and research communities.
5. **Tableau/Power BI:** Visualization and business intelligence tools.

Example: If a retail store collects sales data for every day of the year, data analysis can help answer questions
such as:
- What were the total sales for each month?
- Which product categories contributed the most revenue?
- Can future demand be predicted using past sales patterns?
Answering these questions enables businesses to improve planning and profitability.
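As a brief sketch, the first question could be answered in pandas (the file and column names here are assumed for illustration):

import pandas as pd

# Daily sales records for the year (hypothetical file and columns)
sales = pd.read_csv('sales.csv', parse_dates=['Date'])

# Total sales for each month
monthly_totals = sales.groupby(sales['Date'].dt.to_period('M'))['Revenue'].sum()
print(monthly_totals)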
CHAPTER 3: PYTHON
3.1 Data Collection and Cleaning
Data collection and cleaning form the backbone of every data analysis pipeline. Without clean, reliable data, the
conclusions drawn from analysis will be misleading. In real-world scenarios, data often comes from multiple
sources and is rarely in perfect condition.

Sources of data:
- Relational databases (MySQL, PostgreSQL, Oracle).
- Flat files such as CSV, JSON, and Excel.
- APIs (Application Programming Interfaces) for real-time data.
- Web scraping using BeautifulSoup and Scrapy.
- Sensor/IoT devices in industries like healthcare and manufacturing.
- Survey forms and manual entries.

Challenges in raw data include missing entries, inconsistent formats, duplicate values, and outliers. Python
provides libraries like pandas and NumPy to clean structured data, and regex to clean textual data.

Example – Cleaning student marks data:

import pandas as pd

# Load dataset (filename assumed)
df = pd.read_csv('students.csv')

# Fill missing attendance with the column mean
df['Attendance'] = df['Attendance'].fillna(df['Attendance'].mean())

# Standardize text fields
df['Name'] = df['Name'].str.strip().str.title()

# Remove duplicate entries
df = df.drop_duplicates()

# Detect outliers using the IQR method
Q1 = df['Marks'].quantile(0.25)
Q3 = df['Marks'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Marks'] < Q1 - 1.5*IQR) | (df['Marks'] > Q3 + 1.5*IQR)]

This code not only handles missing and duplicate values but also identifies students with extremely low or
high marks, which could be due to data entry errors.

Case Study: In a healthcare project, patient datasets contained missing values for blood pressure and
glucose levels. By imputing missing data with median values and removing erroneous readings (like BP =
400), the dataset became reliable for further analysis.
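A minimal sketch of that cleaning step (column names and the validity threshold are assumed):

# Impute missing vitals with the column median
for col in ['BloodPressure', 'Glucose']:
    df[col] = df[col].fillna(df[col].median())

# Drop physiologically impossible readings such as BP = 400
df = df[df['BloodPressure'] < 300]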
3.2 Data Analysis and Manipulation
Once cleaned, data must be analyzed and transformed into meaningful insights. Pandas and NumPy provide
rich functionality for manipulation, while SciPy is useful for statistical analysis.

Key operations:
- Filtering rows using conditional statements.
- Selecting specific columns for targeted analysis.
- Grouping and aggregating data for summaries.
- Joining datasets to enrich information.
- Applying mathematical transformations.
- Generating pivot tables for multi-dimensional analysis.

Example – Sales dataset analysis:

# Filter high-revenue records
high_sales = sales[sales['Revenue'] > 50000]

# Average revenue by region
avg_sales = sales.groupby('Region')['Revenue'].mean()

# Merge with the employee dataset
merged = pd.merge(sales, employees, on='EmpID')

# Add a profit margin column
sales['ProfitMargin'] = sales['Profit'] / sales['Revenue']

Through such operations, decision makers can identify the best performing regions, products, or
employees.
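The pivot-table operation listed above can also be expressed in pandas; a short sketch, assuming the sales data has a Product column:

# Revenue summarized by region and product
pivot = pd.pivot_table(sales, values='Revenue', index='Region',
                       columns='Product', aggfunc='sum')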

NumPy also supports vectorized operations for performance. For example, calculating normalized marks:

import numpy as np

df['Normalized'] = (df['Marks'] - np.mean(df['Marks'])) / np.std(df['Marks'])


Case Study: In an e-commerce dataset, customer purchase patterns were grouped by location. It revealed
that metro cities had higher average order values, influencing marketing strategies.

Another tool is 'pandas profiling', which generates automated reports summarizing data distributions,
correlations, and missing values. This reduces manual effort in exploratory analysis.
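A minimal sketch (the package has since been renamed, so the successor ydata-profiling is assumed here):

from ydata_profiling import ProfileReport  # successor to pandas-profiling

report = ProfileReport(df, title='Student Marks Profile')
report.to_file('profile_report.html')  # covers distributions, correlations, missing values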
3.3 Data Visualization and Reporting
Visualization converts numerical outputs into easily understandable visuals. This step is crucial for both
students and professionals as it enables better storytelling with data.

Common visualization types:


- Line chart: Show trends over time (e.g., monthly sales).
- Bar chart: Compare categories (e.g., department marks).
- Scatter plot: Explore relationships (e.g., hours studied vs marks).
- Histogram: Show distribution of a variable.
- Boxplot: Identify spread and outliers.
- Heatmap: Show correlation between variables.

Example – Plotting marks distribution and correlation heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
sns.histplot(df['Marks'], bins=10, kde=True)
plt.title('Distribution of Marks')
plt.show()

# Correlation Heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Case Study: In an educational institute, visualizing average scores by department in a bar chart highlighted
that the Computer Science department consistently outperformed others. This insight helped management
replicate successful teaching strategies across weaker departments.

Reporting is the final stage. Analysts can use Jupyter Notebook to combine code, output, and observations
in a single notebook. For professional reporting, visualizations are exported as PDFs, PowerPoints, or
integrated into dashboards using Plotly or Power BI.
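For instance, a matplotlib figure can be exported directly for such reports (continuing the earlier imports; file names assumed):

fig, ax = plt.subplots(figsize=(6, 4))
df['Marks'].plot(kind='hist', bins=10, ax=ax, title='Distribution of Marks')
fig.savefig('marks_report.pdf')            # vector output for print-quality reports
fig.savefig('marks_report.png', dpi=150)   # raster output for slide decks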

Thus, Python provides end-to-end capabilities from collection to reporting. Its rich ecosystem of libraries
ensures that even complex industrial problems can be tackled efficiently.
CHAPTER 4: MS EXCEL
4.1 Basics and Foundation
Microsoft Excel is one of the most widely used applications in the Microsoft Office suite. It was first
released in 1985 and has since become the backbone of spreadsheet-based data management. Excel
provides a flexible environment to store, organize, and analyze numerical and textual data.

Excel Interface Components:


- **Workbook:** A file containing multiple worksheets.
- **Worksheet:** A single spreadsheet with rows and columns.
- **Cell:** The intersection of a row and column (e.g., A1).
- **Ribbon:** Toolbar with tabs such as Home, Insert, Formulas, Data.
- **Formula Bar:** Displays and allows editing of cell formulas.

Advantages of Excel:
- Easy to learn and widely available.
- Suitable for both small and medium datasets.
- Provides instant feedback and visualization.
- Compatible with other tools like SQL and Python for integration.

Limitations of Excel:
- Not efficient for handling very large datasets (>1 million rows).
- Limited advanced statistical and machine learning capabilities.
- Manual work can sometimes introduce errors.

Despite limitations, Excel remains an entry point for most students and professionals to develop analytical
thinking.

4.2 Formulas and Functions
Excel provides a wide range of formulas and functions that enable quick calculations. Formulas are user-
defined calculations, while functions are prebuilt operations.

Categories of Functions:
1. **Arithmetic Functions:** SUM(), PRODUCT(), ROUND().
2. **Statistical Functions:** AVERAGE(), MEDIAN(), STDEV().
3. **Logical Functions:** IF(), AND(), OR(), NOT().
4. **Text Functions:** LEN(), CONCAT(), UPPER(), LOWER().
5. **Date & Time Functions:** TODAY(), NOW(), DATEDIF().
6. **Lookup Functions:** VLOOKUP(), HLOOKUP(), INDEX(), MATCH().

Examples:
- =IF(B2>=40,"Pass","Fail") → Returns 'Pass' if marks ≥ 40.
- =AVERAGE(C2:C50) → Finds class average.
- =INDEX(A2:D20,5,3) → Returns value in row 5, column 3.
- =VLOOKUP(101,A2:D20,2,FALSE) → Finds student with ID 101.

Nested Formulas:
Excel allows combining functions, e.g., =IF(AND(B2>=40,C2>=40),"Pass","Fail") → Passes only if both
theory and practical marks ≥ 40.

Practical Example: In a finance dataset, Excel formulas can calculate monthly loan payments, interest, and
remaining balances using PMT() and related financial functions.
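Example (loan figures assumed for illustration):
- =PMT(8%/12,60,-500000) → Monthly payment on a ₹5,00,000 loan at 8% annual interest over 60 months (approximately ₹10,138).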
4.3 Data Management
Managing data efficiently is one of Excel’s core strengths. It ensures that large datasets can be transformed
into meaningful summaries.

1. **Sorting & Filtering:**


- Sorting allows ascending or descending arrangement.
- Filtering shows only relevant records, e.g., students with marks above 80.

2. **Conditional Formatting:**
- Highlights important data visually.
- Example: Red background for marks < 40, green for marks ≥ 40.

3. **Data Validation:**
- Ensures correctness of data entry.
- Example: Restricting input to numbers between 0–100 for marks.

4. **PivotTables:**
- Summarize data across multiple dimensions.
- Example: Summarize total sales by region and quarter.

Case Study: In a student attendance dataset, PivotTables were used to calculate average attendance per
department. Conditional formatting then highlighted students falling below 75%, assisting faculty in
compliance monitoring.
4.4 Data Analysis Tools
Excel offers built-in and add-in tools for advanced analysis. The Analysis ToolPak enhances capabilities for
statistical operations.

1. **Goal Seek:**
- Determines the input value required to achieve a target output.
- Example: Finding the required final exam marks to reach 60% overall.

2. **Solver:**
- Optimization tool for maximizing or minimizing outcomes under constraints.
- Example: Optimizing production to maximize profit within resource limits.

3. **Scenario Manager:**
- Stores and compares different input scenarios.
- Example: Analyzing profits under optimistic, realistic, and pessimistic sales.

4. **Descriptive Statistics (Analysis ToolPak):**


- Provides summary statistics such as mean, median, standard deviation, and variance.
- Useful for exploratory data analysis.

5. **Regression Analysis:**
- Helps identify relationships between dependent and independent variables.
- Example: Predicting sales based on advertising expenditure.

Case Study: A retail company used Solver to decide its product mix under a limited budget. By optimizing
within the constraints, profitability increased by 12%.

4.5 Data Visualization
Visualization converts raw data into graphical representations that are easy to understand. Excel provides
multiple chart types that cater to different analytical needs.

Charts in Excel:
- **Column/Bar Chart:** Compare quantities across categories.
- **Line Chart:** Show trends over time.
- **Pie Chart:** Display proportions within a whole.
- **Scatter Plot:** Visualize correlation between variables.
- **Box Plot (via add-ins):** Identify outliers and distribution spread.

Dashboards:
- Dashboards combine charts, PivotTables, and slicers in one view.
- Used in organizations to track KPIs (Key Performance Indicators).

Example: A placement dashboard for a college displayed:


- Total number of students placed.
- Department-wise placement percentage.
- Highest and average packages.
- Company-wise hiring trends.

Case Study: An e-commerce firm created a sales dashboard in Excel. It highlighted peak sales during
festive seasons, helping the company to prepare inventory in advance.

In conclusion, Excel not only serves as a foundation for data analysis but also acts as a bridge between
basic manual calculations and advanced programming-based analytics. Its wide usage and accessibility
make it an indispensable tool for students and professionals.
CHAPTER 5: SQL
5.1 SQL Basics and Foundations
Structured Query Language (SQL) is the standard language for interacting with relational databases. It
allows users to define, manipulate, and query data stored in tables. SQL was developed in the 1970s and
has since become the foundation of database systems such as MySQL, PostgreSQL, Oracle, and SQL
Server.

SQL is broadly divided into categories:


- **DDL (Data Definition Language):** CREATE, ALTER, DROP.
- **DML (Data Manipulation Language):** INSERT, UPDATE, DELETE.
- **DQL (Data Query Language):** SELECT.
- **DCL (Data Control Language):** GRANT, REVOKE.
- **TCL (Transaction Control Language):** COMMIT, ROLLBACK, SAVEPOINT.

Creating a table in SQL:

CREATE TABLE Students (


StudentID INT PRIMARY KEY,
Name VARCHAR(50) NOT NULL,
Department VARCHAR(30),
Marks INT,
Attendance DECIMAL(5,2)
);

Constraints ensure data integrity. For example, PRIMARY KEY uniquely identifies a record, FOREIGN
KEY links tables, and CHECK enforces conditions like Marks >= 0.

Case Study: A student database was designed with tables for Students, Courses, and Enrollments.
Constraints were used to ensure no student could enroll in the same course twice.
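A minimal sketch of such a design (table and column names assumed):

CREATE TABLE Enrollments (
    StudentID INT REFERENCES Students(StudentID),
    CourseID INT,
    PRIMARY KEY (StudentID, CourseID) -- blocks enrolling twice in the same course
);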

5.2 Data Retrieval and Filtering
The SELECT statement is the most commonly used SQL command. It retrieves data from one or more
tables.

Basic SELECT query:

SELECT Name, Marks FROM Students;

Filtering with WHERE clause:

SELECT * FROM Students


WHERE Marks > 75;

Operators used in filtering:


- Comparison (=, >, <, >=, <=, !=).
- Range (BETWEEN).
- Membership (IN).
- Pattern matching (LIKE).
- Logical (AND, OR, NOT).
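Example: Pattern matching with LIKE, retrieving students whose names start with 'A':

SELECT Name FROM Students
WHERE Name LIKE 'A%';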

Example: Retrieve students with attendance between 70 and 90:

SELECT Name, Attendance FROM Students


WHERE Attendance BETWEEN 70 AND 90;

Sorting results with ORDER BY:

SELECT Name, Marks FROM Students


ORDER BY Marks DESC;

Case Study: In a company database, filtering queries were used to find employees earning more than
50,000 in the Sales department. This helped HR identify high-performing employees for promotions.

5.3 Aggregation and Grouping
Aggregation functions summarize large datasets. SQL provides built-in functions like COUNT(), SUM(),
AVG(), MIN(), and MAX().

Example: Count total students:

SELECT COUNT(*) FROM Students;

Example: Find average marks of students:

SELECT AVG(Marks) FROM Students;

Grouping allows aggregation across categories using GROUP BY.

Example: Average marks by department:

SELECT Department, AVG(Marks)


FROM Students
GROUP BY Department;

HAVING clause is used to filter groups after aggregation.

Example: Departments with average marks above 70:

SELECT Department, AVG(Marks)


FROM Students
GROUP BY Department
HAVING AVG(Marks) > 70;

Case Study: A sales company grouped revenue by region and used HAVING to filter only those regions
where total sales exceeded 10 million.
5.4 Joins and Relationships
Databases are relational by nature. Relationships are built between tables using keys. Joins allow retrieval
of related data spread across multiple tables.

Types of Joins:
- **INNER JOIN:** Returns matching records from both tables.
- **LEFT JOIN:** Returns all records from left table and matched records from right.
- **RIGHT JOIN:** Returns all records from right table and matched records from left.
- **FULL OUTER JOIN:** Returns all records from both tables, with NULL where no match exists.
- **SELF JOIN:** Joins a table with itself.
- **CROSS JOIN:** Produces Cartesian product.

Example: Joining Students and Departments:

SELECT s.Name, d.DeptName
FROM Students s
INNER JOIN Departments d
ON s.DeptID = d.DeptID;
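A LEFT JOIN over the same (assumed) columns would additionally list students with no matching department, showing NULL for the department name:

SELECT s.Name, d.DeptName
FROM Students s
LEFT JOIN Departments d
ON s.DeptID = d.DeptID;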

Case Study: A university database linked Students with Courses through an Enrollments table. INNER
JOIN was used to fetch details of which student was enrolled in which course.

Another Case: In a hospital database, Doctors and Patients tables were joined to analyze appointments and
identify busiest doctors.
5.5 Subqueries and Set Operations
Subqueries are queries inside other queries. They provide flexibility and allow stepwise analysis.

Example: Find students scoring above the average marks:

SELECT Name, Marks FROM Students


WHERE Marks > (SELECT AVG(Marks) FROM Students);

Correlated subqueries reference columns from the outer query.

Example: Find students whose marks are above department average:

SELECT Name, Department, Marks
FROM Students s
WHERE Marks > (SELECT AVG(Marks)
               FROM Students
               WHERE Department = s.Department);

Set operations combine results of multiple queries.


- **UNION:** Combines results, removes duplicates.
- **UNION ALL:** Combines results, keeps duplicates.
- **INTERSECT:** Returns common rows.
- **EXCEPT/MINUS:** Returns rows from first query not in second.

Example: Find students who enrolled in either Data Science or AI:

SELECT StudentID FROM Enrollments WHERE Course='Data Science'


UNION
SELECT StudentID FROM Enrollments WHERE Course='AI';

Case Study: In internship applications, UNION was used to combine lists of students applying to multiple
companies, while INTERSECT identified students applying to both Google and Microsoft.

In summary, SQL provides the backbone for structured data analysis. Its ability to store, retrieve, aggregate,
and combine data makes it indispensable for analysts and developers alike.

CHAPTER 6: PROJECT

6.1 About the Project

This project involves developing an interactive dashboard for Vrinda Store using real-world (or
pseudo-real) data collected from open sources and/or GitHub/Kaggle repositories. The primary
aim is to provide a consolidated visual interface to help stakeholders (store management,
marketing teams, etc.) explore, analyze, and draw actionable insights from the store’s
transaction and customer data.

Key motivations and objectives include:

 To transform raw sales and customer data into meaningful visualizations and insights, enabling
better decision-making.

 To identify trends and patterns in sales over time, customer behavior across demographics,
order fulfillment status, and product performance.

 To support the store management in answering business-critical questions such as:

o Which months saw the highest sales and orders?

o What are the peak product categories?

o What is the distribution of customers by age, gender, and geography?

o Which sales channels contribute most to revenue?

o How efficient is the order fulfillment (delivered vs returned/cancelled orders)?

 To provide an interactive experience via slicers/filters so that users can dynamically explore
data slices such as time period, channel, category, customer segments, etc.

 To generate a dashboard that is intuitive, visually appealing, and aids in quick comprehension
of key KPIs (key performance indicators).

In this project, I used MS Excel, supported by basic SQL and Python, to perform data
cleaning, transformation, exploratory analysis, and dashboard creation. The dashboard was
designed to present the following elements:
 Monthly/quarterly trend of sales vs orders

 Top-selling product categories

 Regional/state-wise contribution to revenue

 Customer segmentation by age/gender and corresponding ordering behavior

 Order status distribution (delivered, cancelled, returned, etc.)

 Comparison of different sales channels (online/offline, marketplaces, etc.)

The end product is a consolidated, interactive dashboard that helps Vrinda Store’s management
quickly identify areas of strength and weakness, understand customer behavior, and make data-
driven decisions for future growth.

6.2 Methodology

The methodology adopted for this project followed a structured approach to ensure
accuracy, clarity, and relevance of the final dashboard. The process can be summarized
in the following steps:

1. Data Collection

o Obtained Vrinda Store sales and customer data from open sources (CSV/Excel format).

o Verified completeness and scope of the dataset for analysis.

2. Data Cleaning & Preparation

o Removed duplicate records and handled missing values.

o Standardized date formats, product categories, and customer demographic fields.

o Ensured consistency in metrics such as revenue, order ID, and status.

3. Data Transformation

o Created calculated fields such as total sales, order counts, and profit margin.

o Categorized customers by age group and gender (see the sketch after this list).

o Grouped products into categories and mapped regions/geographies.

4. Exploratory Data Analysis (EDA)

o Performed statistical summaries to understand sales trends.

o Identified top products, repeat customers, and seasonal demand variations.

o Detected anomalies such as unusually high returns or cancellations.

5. Dashboard Design

o Selected key performance indicators (KPIs) such as monthly sales, order trends, top
categories, customer demographics, regional performance, and sales channel
distribution.

o Designed interactive charts (line, bar, pie, and maps) for better visualization.

o Added slicers/filters for time period, category, and channel to enable dynamic
exploration.

6. Implementation & Testing

o Built the dashboard using MS Excel (PivotTables, charts, and slicers).

o Tested dashboard usability and responsiveness with sample queries.

o Validated accuracy by cross-checking results with raw dataset.

7. Insights & Reporting

o Generated actionable insights (e.g., peak sales months, best-performing categories, and
customer behavior patterns).

o Documented the workflow and prepared the final project report for academic
submission.
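As an illustration of the age-group field from step 3 (the column reference and age thresholds are assumed):

=IF(C2<30,"Young",IF(C2<50,"Adult","Senior")) → Assigns each customer an age bracket from the Age value in column C.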
6.3 Skills Gained

During the course of this project, I gained both technical and analytical skills that will
be valuable in my academic and professional journey. The key skills acquired are:

1. Data Handling Skills

o Cleaning, organizing, and preparing raw datasets for analysis.

o Using Excel functions, formulas, and PivotTables for data transformation.

2. Analytical Thinking

o Ability to interpret sales and customer data to identify trends and anomalies.

o Developing a logical workflow to move from raw data to meaningful insights.

3. Dashboard Development

o Designing interactive dashboards with slicers and charts.

o Presenting KPIs such as revenue, sales trends, and customer demographics in a clear and visual
format.

4. Technical Proficiency

o Improved command over Excel (advanced formulas, conditional formatting, charts).

o Exposure to SQL and Python basics for query writing and initial data exploration.

5. Business Insight Generation

o Understanding how data analysis supports decision-making in retail.

o Gaining knowledge about customer segmentation, seasonal demand, and sales channel performance.

6. Soft Skills

o Time management and planning to complete the project in stages.

o Documentation and report writing for academic submission.

o Presentation of results in a structured, professional format.

CHAPTER 7: REFERENCES

7.1 Data Sources

The data for this project was taken from the official Kaggle website: [Link]

Link: [Link]