A Data Analytics Project
Submitted in partial fulfilment of the requirements for the degree of
Bachelor of Technology in Computer Science
Department of Computer Science
I, Neeshu Yadav, hereby declare that the training work titled “Data Analysis” is an original and
authentic piece of work carried out by me at Arya College of Engineering, Jaipur in partial
fulfilment of the requirements for the award of the degree of Bachelor of Technology ([Link]).
I further affirm that this work has not been submitted, either fully or partially, for the award
of any other degree or diploma at any institution.
This training has been a significant learning experience and a stepping stone in my
professional development.
Neeshu Yadav
22EAICS198
[Link] 7th Sem
PREFACE
The present report is the outcome of the online practical training on “Data Analysis”
provided by Seldom India Pvt. Ltd., Jaipur. The objective of this training is to learn
technologies with practical applications and to gain a strong command of them. This course
provided me with hands-on experience and exposure to developing Front-End and Back-End
applications for browsers. It also helped build a strong foundation in Front-End development,
which gave me the tools to build a responsive web application.
During the tenure of 45 days, I learned about and worked on different technologies such as
MS Excel and SQL, including query writing and DML operations.
CONTENTS
CHAPTER 1: Introduction About the Course
1.1 Objectives
1.2 Motivation
1.3 Layout of the Report
CHAPTER 2: Basics of Data Analysis
CHAPTER 3: Python
CHAPTER 4: MS Excel
CHAPTER 5: SQL
CHAPTER 6: About the Project
CHAPTER 7: References
CHAPTER 1: INTRODUCTION ABOUT THE COURSE
1.1 Objectives
The primary objective of this industrial training on Data Analysis is to enable students to bridge the gap
between theoretical concepts and practical applications. The training emphasizes hands-on learning of tools
such as Python, MS Excel, and SQL. These tools empower students to handle, analyze, and visualize real-
world data effectively.
1.2 Motivation
The motivation for choosing Data Analysis as a training area arises from the increasing demand for data-
driven decision-making in every sector. Companies today rely on insights derived from large volumes of
data to optimize operations, increase efficiency, and enhance customer satisfaction. By acquiring these
skills during training, students position themselves for better career opportunities and future research
prospects.
In addition, analyzing data cultivates a logical mindset and helps students approach problems with critical
thinking. This training therefore not only prepares students technically but also enhances their professional
competence.
1.3 Layout of the Report
The report is divided into multiple chapters for clarity and systematic presentation:
- Chapter 1 introduces the objectives, motivation, and structure of the report.
- Chapter 2 provides the basics of Data Analysis.
- Chapter 3 covers Python programming for data handling and visualization.
- Chapter 4 explains the use of MS Excel as a data analysis tool.
- Chapter 5 demonstrates SQL for data retrieval and management.
- Chapter 6 describes the Vrinda Store dashboard project carried out during the training.
- Chapter 7 lists the references.
Each chapter includes explanations, code examples, and case studies where relevant.
CHAPTER 2: BASICS OF DATA ANALYSIS
Data Analysis refers to the process of examining raw data with the purpose of drawing meaningful conclusions. It
is a multidisciplinary field that integrates mathematics, statistics, computer science, and domain-specific
knowledge. The objective is to discover useful information, identify patterns, and support better decision-making.
Example: If a retail store collects sales data for every day of the year, data analysis can help answer questions
such as:
- What were the total sales for each month?
- Which product categories contributed the most revenue?
- Can future demand be predicted using past sales patterns?
Answering these questions enables businesses to improve planning and profitability.
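As a small illustration, the first two questions can be answered with a few lines of pandas; the file and column names below are assumed for the example:
import pandas as pd

# Hypothetical daily sales data with Date, Category, and Sales columns
sales = pd.read_csv('daily_sales.csv', parse_dates=['Date'])

# Total sales for each month
monthly_totals = sales.groupby(sales['Date'].dt.to_period('M'))['Sales'].sum()

# Revenue contribution by product category
by_category = sales.groupby('Category')['Sales'].sum().sort_values(ascending=False)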
CHAPTER 3: PYTHON
3.1 Data Collection and Cleaning
Data collection and cleaning form the backbone of every data analysis pipeline. Without clean, reliable data, the
conclusions drawn from analysis will be misleading. In real-world scenarios, data often comes from multiple
sources and is rarely in perfect condition.
Sources of data:
- Relational databases (MySQL, PostgreSQL, Oracle).
- Flat files such as CSV, JSON, and Excel.
- APIs (Application Programming Interfaces) for real-time data.
- Web scraping using BeautifulSoup and Scrapy.
- Sensor/IoT devices in industries like healthcare and manufacturing.
- Survey forms and manual entries.
Challenges in raw data include missing entries, inconsistent formats, duplicate values, and outliers. Python
provides libraries like pandas and NumPy to clean structured data, and regex to clean textual data.
import pandas as pd

# Load dataset
df = pd.read_csv('[Link]')

# Handle missing and duplicate values
df = df.dropna(subset=['Marks'])
df = df.drop_duplicates()

# Detect outliers using the IQR method
Q1 = df['Marks'].quantile(0.25)
Q3 = df['Marks'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Marks'] < Q1 - 1.5 * IQR) | (df['Marks'] > Q3 + 1.5 * IQR)]
This code not only handles missing and duplicate values but also identifies students with extremely low or
high marks, which could be due to data entry errors.
Case Study: In a healthcare project, patient datasets contained missing values for blood pressure and
glucose levels. By imputing missing data with median values and removing erroneous readings (like BP =
400), the dataset became reliable for further analysis.
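A minimal sketch of this kind of cleaning, assuming the columns are named 'BP' and 'Glucose':
import pandas as pd

df = pd.read_csv('patients.csv')  # hypothetical patient dataset

# Keep rows where BP is missing (to be imputed) or plausible; drop erroneous readings like 400
df = df[df['BP'].isna() | (df['BP'] < 300)]

# Impute remaining missing values with the column median
df['BP'] = df['BP'].fillna(df['BP'].median())
df['Glucose'] = df['Glucose'].fillna(df['Glucose'].median())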
3.2 Data Analysis and Manipulation
Once cleaned, data must be analyzed and transformed into meaningful insights. Pandas and NumPy provide
rich functionality for manipulation, while SciPy is useful for statistical analysis.
Key operations:
- Filtering rows using conditional statements.
- Selecting specific columns for targeted analysis.
- Grouping and aggregating data for summaries.
- Joining datasets to enrich information.
- Applying mathematical transformations.
- Generating pivot tables for multi-dimensional analysis.
Through such operations, decision makers can identify the best performing regions, products, or
employees.
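A short sketch of a few of these operations, using small made-up tables:
import pandas as pd

# Hypothetical sales data and a lookup table of regional targets
df = pd.DataFrame({'Region': ['North', 'South', 'North', 'East'],
                   'Quarter': ['Q1', 'Q1', 'Q2', 'Q1'],
                   'Sales': [12000, 8000, 15000, 9000]})
targets = pd.DataFrame({'Region': ['North', 'South', 'East'],
                        'Target': [25000, 10000, 12000]})

high_value = df[df['Sales'] > 10000]                       # filtering rows
region_totals = df.groupby('Region')['Sales'].sum()        # grouping and aggregating
enriched = df.merge(targets, on='Region', how='left')      # joining to enrich data
pivot = df.pivot_table(values='Sales', index='Region',
                       columns='Quarter', aggfunc='sum')   # pivot table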
NumPy also supports vectorized operations for performance. For example, calculating normalized marks:
import numpy as np

marks = df['Marks'].to_numpy()
# Vectorized min-max normalization: no explicit Python loop
df['Normalized'] = (marks - marks.min()) / (marks.max() - marks.min())
Another tool is pandas-profiling (now maintained as ydata-profiling), which generates automated
reports summarizing data distributions, correlations, and missing values. This reduces manual
effort in exploratory analysis.
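A minimal sketch, assuming the package is installed under its current name, ydata-profiling:
from ydata_profiling import ProfileReport

# df is the cleaned DataFrame from the earlier examples
profile = ProfileReport(df, title='Exploratory Report')
profile.to_file('report.html')  # covers distributions, correlations, and missing values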
3.3 Data Visualization and Reporting
Visualization converts numerical outputs into easily understandable visuals. This step is crucial for both
students and professionals as it enables better storytelling with data.
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
sns.histplot(df['Marks'], bins=10, kde=True)
plt.title('Distribution of Marks')
plt.show()

# Correlation Heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Case Study: In an educational institute, visualizing average scores by department in a bar chart highlighted
that the Computer Science department consistently outperformed others. This insight helped management
replicate successful teaching strategies across weaker departments.
Reporting is the final stage. Analysts can use Jupyter Notebook to combine code, output, and observations
in a single notebook. For professional reporting, visualizations are exported as PDFs, PowerPoints, or
integrated into dashboards using Plotly or Power BI.
Thus, Python provides end-to-end capabilities from collection to reporting. Its rich ecosystem of libraries
ensures that even complex industrial problems can be tackled efficiently.
CHAPTER 4: MS EXCEL
4.1 Basics and Foundation
Microsoft Excel is one of the most widely used applications in the Microsoft Office suite. It was first
released in 1985 and has since become the backbone of spreadsheet-based data management. Excel
provides a flexible environment to store, organize, and analyze numerical and textual data.
Advantages of Excel:
- Easy to learn and widely available.
- Suitable for both small and medium datasets.
- Provides instant feedback and visualization.
- Compatible with other tools like SQL and Python for integration.
Limitations of Excel:
- Not efficient for handling very large datasets (>1 million rows).
- Limited advanced statistical and machine learning capabilities.
- Manual work can sometimes introduce errors.
Despite limitations, Excel remains an entry point for most students and professionals to develop analytical
thinking.
4.2 Formulas and Functions
Excel provides a wide range of formulas and functions that enable quick calculations. Formulas are user-
defined calculations, while functions are prebuilt operations.
Categories of Functions:
1. **Arithmetic Functions:** SUM(), PRODUCT(), ROUND().
2. **Statistical Functions:** AVERAGE(), MEDIAN(), STDEV().
3. **Logical Functions:** IF(), AND(), OR(), NOT().
4. **Text Functions:** LEN(), CONCAT(), UPPER(), LOWER().
5. **Date & Time Functions:** TODAY(), NOW(), DATEDIF().
6. **Lookup Functions:** VLOOKUP(), HLOOKUP(), INDEX(), MATCH().
Examples:
- =IF(B2>=40,"Pass","Fail") → Returns 'Pass' if marks ≥ 40.
- =AVERAGE(C2:C50) → Finds class average.
- =INDEX(A2:D20,5,3) → Returns value in row 5, column 3.
- =VLOOKUP(101,A2:D20,2,FALSE) → Returns the second-column value for the student with ID 101.
Nested Formulas:
Excel allows combining functions, e.g., =IF(AND(B2>=40,C2>=40),"Pass","Fail") → Passes only if both
theory and practical marks ≥ 40.
Practical Example: In a finance dataset, Excel formulas can calculate monthly loan payments, interest, and
remaining balances using PMT() and related financial functions.
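As a sketch of the arithmetic behind PMT(), the standard annuity formula can be computed directly in Python; the loan figures below are invented for illustration:
# Monthly payment = P*r / (1 - (1 + r)^-n), the formula PMT() implements
principal = 500000        # assumed loan amount
annual_rate = 0.08        # assumed 8% annual interest rate
months = 60               # assumed 5-year term

r = annual_rate / 12
payment = principal * r / (1 - (1 + r) ** -months)
print(round(payment, 2))  # roughly 10138 per month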
4.3 Data Management
Managing data efficiently is one of Excel’s core strengths. It ensures that large datasets can be transformed
into meaningful summaries.
1. **Conditional Formatting:**
- Highlights important data visually.
- Example: Red background for marks < 40, green for marks ≥ 40.
2. **Data Validation:**
- Ensures correctness of data entry.
- Example: Restricting input to numbers between 0–100 for marks.
3. **PivotTables:**
- Summarize data across multiple dimensions.
- Example: Summarize total sales by region and quarter.
Case Study: In a student attendance dataset, PivotTables were used to calculate average attendance per
department. Conditional formatting then highlighted students falling below 75%, assisting faculty in
compliance monitoring.
4.4 Data Analysis Tools
Excel offers built-in and add-in tools for advanced analysis. The Analysis ToolPak enhances capabilities for
statistical operations.
1. **Goal Seek:**
- Determines the input value required to achieve a target output.
- Example: Finding the required final exam marks to reach 60% overall.
2. **Solver:**
- Optimization tool for maximizing or minimizing outcomes under constraints.
- Example: Optimizing production to maximize profit within resource limits.
3. **Scenario Manager:**
- Stores and compares different input scenarios.
- Example: Analyzing profits under optimistic, realistic, and pessimistic sales.
4. **Regression Analysis:**
- Helps identify relationships between dependent and independent variables.
- Example: Predicting sales based on advertising expenditure.
Case Study: A retail company used Solver to decide product mix under limited budget. By optimizing the
constraints, profitability increased by 12%.
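Solver itself is an Excel feature, but the same kind of product-mix optimization can be sketched in Python with SciPy's linprog; the profits and resource limits below are invented for illustration:
from scipy.optimize import linprog

# Maximize profit 40x + 30y subject to:
#   2x + y <= 100  (machine hours)
#   x + 3y <= 90   (raw material)
# linprog minimizes, so the profit coefficients are negated
res = linprog(c=[-40, -30],
              A_ub=[[2, 1], [1, 3]],
              b_ub=[100, 90],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)  # optimal units of each product and the maximum profit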
4.5 Data Visualization
Visualization converts raw data into graphical representations that are easy to understand. Excel provides
multiple chart types that cater to different analytical needs.
Charts in Excel:
- **Column/Bar Chart:** Compare quantities across categories.
- **Line Chart:** Show trends over time.
- **Pie Chart:** Display proportions within a whole.
- **Scatter Plot:** Visualize correlation between variables.
- **Box Plot (via add-ins):** Identify outliers and distribution spread.
Dashboards:
- Dashboards combine charts, PivotTables, and slicers in one view.
- Used in organizations to track KPIs (Key Performance Indicators).
Case Study: An e-commerce firm created a sales dashboard in Excel. It highlighted peak sales during
festive seasons, helping the company to prepare inventory in advance.
In conclusion, Excel not only serves as a foundation for data analysis but also acts as a bridge between
basic manual calculations and advanced programming-based analytics. Its wide usage and accessibility
make it an indispensable tool for students and professionals.
CHAPTER 5: SQL
5.1 SQL Basics and Foundations
Structured Query Language (SQL) is the standard language for interacting with relational databases. It
allows users to define, manipulate, and query data stored in tables. SQL was developed in the 1970s and
has since become the foundation of database systems such as MySQL, PostgreSQL, Oracle, and SQL
Server.
Constraints ensure data integrity. For example, PRIMARY KEY uniquely identifies a record, FOREIGN
KEY links tables, and CHECK enforces conditions like Marks >= 0.
Case Study: A student database was designed with tables for Students, Courses, and Enrollments.
Constraints were used to ensure no student could enroll in the same course twice.
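A minimal sketch of such a schema, written with Python's built-in sqlite3 module; the table and column names are assumed:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE Students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE Courses (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE Enrollments (
    student_id INTEGER REFERENCES Students(student_id),
    course_id  INTEGER REFERENCES Courses(course_id),
    marks      INTEGER CHECK (marks >= 0),
    UNIQUE (student_id, course_id)  -- no student can enroll in the same course twice
);
""")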
5.2 Data Retrieval and Filtering
The SELECT statement is the most commonly used SQL command. It retrieves data from one or more
tables.
Case Study: In a company database, filtering queries were used to find employees earning more than
50,000 in the Sales department. This helped HR identify high-performing employees for promotions.
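The case study's query might look like the following sketch, again run through sqlite3 with an assumed Employees table:
import sqlite3

conn = sqlite3.connect('company.db')  # hypothetical database file
rows = conn.execute(
    "SELECT name, salary FROM Employees "
    "WHERE department = 'Sales' AND salary > 50000"
).fetchall()
for name, salary in rows:
    print(name, salary)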
5.3 Aggregation and Grouping
Aggregation functions summarize large datasets. SQL provides built-in functions like COUNT(), SUM(),
AVG(), MIN(), and MAX().
Case Study: A sales company grouped revenue by region and used HAVING to filter only those regions
where total sales exceeded 10 million.
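A sketch of that grouping query, with assumed table and column names:
import sqlite3

conn = sqlite3.connect('company.db')  # hypothetical database file
query = """
SELECT region, SUM(revenue) AS total_sales
FROM Sales
GROUP BY region
HAVING SUM(revenue) > 10000000
"""
for region, total in conn.execute(query):
    print(region, total)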
5.4 Joins and Relationships
Databases are relational by nature. Relationships are built between tables using keys. Joins allow retrieval
of related data spread across multiple tables.
Types of Joins:
- **INNER JOIN:** Returns matching records from both tables.
- **LEFT JOIN:** Returns all records from left table and matched records from right.
- **RIGHT JOIN:** Returns all records from right table and matched records from left.
- **FULL OUTER JOIN:** Returns all records from both tables, with NULL where no match exists.
- **SELF JOIN:** Joins a table with itself.
- **CROSS JOIN:** Produces Cartesian product.
Case Study: A university database linked Students with Courses through an Enrollments table. INNER
JOIN was used to fetch details of which student was enrolled in which course.
Another Case: In a hospital database, Doctors and Patients tables were joined to analyze appointments and
identify busiest doctors.
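The university case study's INNER JOIN could be sketched as follows, using the schema assumed in Section 5.1:
import sqlite3

conn = sqlite3.connect('university.db')  # hypothetical database file
query = """
SELECT s.name, c.title
FROM Students s
INNER JOIN Enrollments e ON s.student_id = e.student_id
INNER JOIN Courses c ON e.course_id = c.course_id
"""
for student, course in conn.execute(query):
    print(student, course)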
5.5 Subqueries and Set Operations
Subqueries are queries inside other queries. They provide flexibility and allow stepwise analysis.
Case Study: In internship applications, UNION was used to combine lists of students applying to multiple
companies, while INTERSECT identified students applying to both Google and Microsoft.
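A sketch of both ideas, assuming an Applications(student_id, company) table:
import sqlite3

conn = sqlite3.connect('internships.db')  # hypothetical database file

# Subquery: names of students who applied to Google
applicants = conn.execute("""
SELECT name FROM Students
WHERE student_id IN (SELECT student_id FROM Applications WHERE company = 'Google')
""").fetchall()

# Set operation: students who applied to both Google and Microsoft
both = conn.execute("""
SELECT student_id FROM Applications WHERE company = 'Google'
INTERSECT
SELECT student_id FROM Applications WHERE company = 'Microsoft'
""").fetchall()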
In summary, SQL provides the backbone for structured data analysis. Its ability to store, retrieve, aggregate,
and combine data makes it indispensable for analysts and developers alike.
CHAPTER 6: ABOUT THE PROJECT
6.1 Project
This project involves developing an interactive dashboard for Vrinda Store using real-world (or
pseudo-real) data collected from open sources and/or GitHub/Kaggle repositories. The primary
aim is to provide a consolidated visual interface to help stakeholders (store management,
marketing teams, etc.) explore, analyze, and draw actionable insights from the store’s
transaction and customer data.
The objectives of the project are:
- To transform raw sales and customer data into meaningful visualizations and insights, enabling better decision-making.
- To identify trends and patterns in sales over time, customer behavior across demographics, order fulfillment status, and product performance.
- To provide an interactive experience via slicers/filters so that users can dynamically explore data slices such as time period, channel, category, customer segments, etc.
- To generate a dashboard that is intuitive, visually appealing, and aids in quick comprehension of key KPIs (key performance indicators).
In this project, I used Python, Excel, and SQL to perform data cleaning, transformation,
exploratory analysis, and dashboard creation. The dashboard was designed to present the
following elements:
- Monthly/quarterly trend of sales vs. orders
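A sketch of how such a trend can be derived with pandas; the file and column names are assumed:
import pandas as pd

df = pd.read_csv('vrinda_store.csv')  # hypothetical export of the store data
df['Date'] = pd.to_datetime(df['Date'])

# Total sales and distinct order counts per month
monthly = df.groupby(df['Date'].dt.to_period('M')).agg(
    total_sales=('Amount', 'sum'),
    orders=('Order ID', 'nunique'),
)
print(monthly)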
The end product is a consolidated, interactive dashboard that helps Vrinda Store’s management
quickly identify areas of strength and weakness, understand customer behavior, and make data-
driven decisions for future growth.
6.2 Methodology
The methodology adopted for this project followed a structured approach to ensure
accuracy, clarity, and relevance of the final dashboard. The process can be summarized
in the following steps:
1. Data Collection
o Obtained Vrinda Store sales and customer data from open sources (CSV/Excel format).
2. Data Transformation
o Created calculated fields such as total sales, order counts, and profit margin.
3. Dashboard Design
o Selected key performance indicators (KPIs) such as monthly sales, order trends, top
categories, customer demographics, regional performance, and sales channel
distribution.
o Designed interactive charts (line, bar, pie, and maps) for better visualization.
o Added slicers/filters for time period, category, and channel to enable dynamic
exploration.
o Generated actionable insights (e.g., peak sales months, best-performing categories, and
customer behavior patterns).
o Documented the workflow and prepared the final project report for academic
submission.
6.3 Skills Gained
During the course of this project, I gained both technical and analytical skills that will
be valuable in my academic and professional journey. The key skills acquired are:
1. Analytical Thinking
o Ability to interpret sales and customer data to identify trends and anomalies.
2. Dashboard Development
o Presenting KPIs such as revenue, sales trends, and customer demographics in a clear and visual format.
3. Technical Proficiency
o Exposure to SQL and Python basics for query writing and initial data exploration.
o Gaining knowledge about customer segmentation, seasonal demand, and sales channel performance.
4. Soft Skills
CHAPTER 7: REFERENCES
The data for this project was taken from the official Kaggle website: [Link]
Dataset link: [Link]