Bda Report

This micro-project report focuses on data cleaning and handling missing values using the Pandas library in Python, emphasizing the importance of accurate datasets in Big Data Analytics. It outlines the challenges of missing data, methods for detection and imputation, and the automation of data cleaning processes. The project aims to enhance technical skills, problem-solving abilities, and project management through practical application and teamwork.

Uploaded by

vanjaresiddhi08

BIG DATA ANALYTICS (BDA)-22684

MICRO-PROJECT REPORT
DATA CLEANING AND HANDLING MISSING VALUES IN PANDAS

1.0 RATIONALE:

Data cleaning is a fundamental process in Big Data Analytics that ensures datasets are accurate,
consistent, and free from errors before analysis. In large-scale data processing, raw data often
contains missing values, duplicates, inconsistencies, and incorrect data types, which can impact
the quality of insights derived from analytics models. One of the key challenges in BDA is handling
missing values efficiently, as large datasets may have incomplete records due to data collection
issues or system failures. Pandas, a powerful Python library, provides various methods to handle
missing values, such as detecting them using df.isnull().sum(), removing them with df.dropna(),
or filling them with appropriate values using df.fillna(). For numerical data, missing values are
commonly replaced with the mean, median, or mode, while categorical data can be filled with the
most frequent value. In big data environments, automated data cleaning techniques, including data
imputation algorithms and machine learning models, are often used to handle large volumes of
missing or corrupted data efficiently.
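The detection and imputation calls named above can be sketched as follows. This is a minimal illustration, not the project's own code: the DataFrame, its column names (age, city), and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Mumbai", None, "Mumbai"],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Numerical column: fill with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(missing_counts)
print(df)
```

The median is used here for the numerical column because it is less sensitive to extreme values than the mean; either choice fits the description above.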

Since Big Data Analytics relies on high-quality data to generate valuable insights, proper data
preprocessing, including cleaning, transformation, and standardization, is essential for ensuring
accurate and meaningful results in predictive modeling, business intelligence, and decision-making
processes.

2.0 AIM OF THE MICRO-PROJECT:

The aim of this micro-project is to implement data cleaning techniques in Pandas by handling
missing values, detecting duplicates, and separating categorical and numerical data.

3.0 COURSE OUTCOMES ADDRESSED:

CO-a: Describe Big data and Big data analytics.

CO-b: Apply the big data analytics procedure to work on datasets.
4.0 LITERATURE REVIEW:

Data cleaning is a critical step in data preprocessing that ensures datasets are accurate, complete,
and free of inconsistencies. Handling missing values is an essential part of this process, as missing
data can lead to incorrect analyses and unreliable models. Pandas, a powerful Python library,
provides various methods for detecting, handling, and imputing missing values. This review covers
the key aspects of data cleaning, missing value handling techniques, and the role of Pandas in
automating these processes efficiently.
Key Points:
1. Importance of Data Cleaning in Data Analysis:
• Ensures that data used for analysis and machine learning is accurate and reliable.
• Helps in reducing biases and errors caused by incomplete or incorrect data.
• Prevents misleading conclusions that arise from poor data quality.
2. Common Causes of Missing Data:
• Human errors during data entry or collection.
• System errors or failures in data storage.
• Non-responses in survey data or missing values in large datasets.
3. Pandas Methods for Handling Missing Values:
• Detection: df.isnull().sum() helps identify missing values in each column.
• Removal: df.dropna() removes rows with missing values by default; df.dropna(axis=1) removes
columns instead.
• Imputation: df.fillna(value) replaces missing values with a specified value.
4. Automation of Data Cleaning Using Pandas:
• Automating data cleaning helps improve efficiency in handling large datasets.
• Python scripts can be used to detect, clean, and impute missing values automatically.
• Reduces manual errors and speeds up the preprocessing workflow.
5. Security and Integrity of Data Cleaning:
• Ensures consistency and reliability of data for decision-making.
• Helps in maintaining data integrity by preventing loss of critical information.
• Reduces errors that could impact business intelligence and machine learning models.
6. Scalability in Large Datasets:
• Handling missing values in big data environments requires efficient methods.
• Pandas works in memory, so for large datasets it is often combined with libraries like NumPy
and Scikit-learn, or applied to the data in smaller chunks.
• Automating cleaning pipelines makes it easier to process large datasets efficiently.
7. Future Research and Development:
• Research focuses on improving imputation techniques using machine learning.
• Enhancing automation tools for better detection and handling of missing values.
• Developing more advanced and scalable solutions for big data applications.
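The removal and imputation methods listed under point 3 take options that control what is dropped or filled. A small sketch, assuming a hypothetical DataFrame with columns id, score, and grade:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one fully populated row, two partial rows, one empty row.
df = pd.DataFrame({
    "id": [1, 2, 3, np.nan],
    "score": [88.0, np.nan, 75.0, np.nan],
    "grade": ["A", None, None, None],
})

rows_complete = df.dropna()                     # drop rows with any missing value (default)
cols_complete = df.dropna(axis=1)               # drop columns with any missing value
rows_not_empty = df.dropna(how="all")           # drop only rows where every value is missing
rows_min_two = df.dropna(thresh=2)              # keep rows with at least 2 non-missing values
rows_with_score = df.dropna(subset=["score"])   # only the "score" column is checked

# fillna also accepts a dict mapping column names to fill values.
filled = df.fillna({"score": 0.0, "grade": "unknown"})

print(rows_min_two)
print(filled)
```

None of these calls modifies df; each returns a new DataFrame, so the original data stays available for comparison.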
Conclusion
Data cleaning and handling missing values are crucial for ensuring the quality of datasets used in
analytics and machine learning. Pandas provides robust tools to detect, remove, and impute
missing values efficiently, making it an essential library for data preprocessing. Automating data
cleaning not only improves accuracy but also enhances efficiency in handling large datasets. As
data-driven decision-making continues to grow, optimizing data cleaning techniques will remain
a key focus in research and industry applications.
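One way to apply these ideas to a file too large to load at once is to read it in chunks. The sketch below is an assumption rather than a method taken from the report: the CSV text, the column names, and the chunk size are hypothetical, and an in-memory string stands in for a large file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file too large to load at once.
csv_text = "id,amount\n1,10.0\n2,\n3,30.0\n4,\n5,50.0\n"

# First pass: accumulate the sum and count of observed values chunk by chunk.
total, count = 0.0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["amount"].sum()      # sum() skips missing values by default
    count += chunk["amount"].count()    # count() counts non-missing values
global_mean = total / count

# Second pass: fill each chunk with the global mean and collect the results.
cleaned_chunks = [
    chunk.fillna({"amount": global_mean})
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
cleaned = pd.concat(cleaned_chunks, ignore_index=True)
print(cleaned)
```

Two passes are used so that every chunk is filled with the same mean. With a truly large file, each cleaned chunk would be written out as it is produced instead of being concatenated in memory.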
5.0 ACTUAL PROCEDURE FOLLOWED:

First, the topic for the micro-project was selected and a group of three members was formed.
Each member studied the topic in detail, after which the group discussed it together. The work
was then divided into three parts: the proposal, the report, and the execution of the code. Each
part was assigned to the team member best suited to it, with the project deadline taken into
account from the start. Information on data cleaning and handling missing values in Pandas was
gathered from the internet and various websites. The team members examined this information and
made changes where needed. It was then passed to the respective team member for typing and
editing. The draft document was re-examined by the group, and further changes were made
accordingly. We concluded the project by executing the code. The micro-project was finally
completed with all corrections made, and we submitted it on the due date.

6.0 ACTUAL RESOURCES USED:

Sr. No   Names Of The Resources   Specification      Quantity   Remarks
1        Computer/Laptop          -                  1
2        Google                   Various websites   -

7.0 OUTPUT OF THE MICRO-PROJECT:

Code:
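The project's own listing is not reproduced in this copy. As a minimal sketch of the three steps named in the aim (detecting duplicates, separating categorical and numerical data, and handling missing values), the following may serve as an illustration. The dataset, its column names, and the choice of mean and mode for filling are assumptions, not the team's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; the report's actual data is not shown.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena", "Kiran"],
    "dept": ["IT", "HR", "HR", None, "IT"],
    "salary": [50000.0, 42000.0, 42000.0, np.nan, 58000.0],
})

# 1. Detect and remove exact duplicate rows.
duplicate_count = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# 2. Separate numerical and categorical columns by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# 3. Handle missing values: mean for numerical, most frequent value for categorical.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print("Duplicate rows removed:", duplicate_count)
print("Numerical columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
print(df)
```

Duplicates are removed before imputation so that repeated rows do not skew the mean or the most frequent value.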
8.0 SKILLS DEVELOPED/LEARNING OUTCOME OF THE MICRO-PROJECT:

Working on a micro-project enhances practical knowledge and technical proficiency in a focused
domain. It helps in applying theoretical concepts to real-world problems, fostering
problem-solving abilities and hands-on experience. This section highlights the key skills
developed during the micro-project, ranging from technical competencies to teamwork and project
management skills.
Key Points:
1. Technical Skills:
• Gained proficiency in Python and data handling using Pandas.
• Learned data cleaning techniques, including handling missing values and outlier
detection.
• Applied machine learning or statistical methods for data preprocessing and analysis.
2. Problem-Solving Abilities:
• Developed critical thinking by addressing inconsistencies in datasets.
• Implemented logical strategies to resolve missing or incorrect data.
• Enhanced debugging skills by troubleshooting issues in scripts and models.
3. Automation and Efficiency:
• Learned how to automate repetitive data processing tasks using Python.
• Implemented workflows to clean and manage large datasets efficiently.
• Used scripting techniques to improve the speed and accuracy of data handling.
4. Project Management Skills:
• Gained experience in planning and structuring a project timeline.
• Managed tasks effectively by prioritizing data cleaning, analysis, and reporting.
• Maintained documentation to ensure clear understanding and reproducibility of work.
5. Collaboration and Communication:
• Worked in a team environment, improving coordination and task delegation.
• Developed the ability to present findings and explain data preprocessing steps.
• Improved written communication skills by documenting methodologies and results.
6. Adaptability and Continuous Learning:
• Learned new techniques and best practices for data cleaning and preprocessing.
• Adapted to new tools and libraries based on project requirements.
• Engaged in self-learning and research to optimize data handling strategies.
7. Future Applications and Career Growth:
• The skills gained can be applied to larger data science projects and real-world datasets.
• Understanding data cleaning strengthens the foundation for machine learning and
analytics.
• Enhances employability by demonstrating practical experience in data preprocessing.
Conclusion
The micro-project provided valuable hands-on experience in data handling, automation, and
problem-solving. It improved technical expertise in Python and Pandas, along with essential soft
skills like collaboration and project management. These skills are fundamental for future data-
driven roles and research in data science and analytics.
9.0 APPLICATION OF THE MICRO-PROJECT:
A micro-project serves as a practical implementation of theoretical knowledge, allowing students
to understand real-world applications of their work. The skills and insights gained during the
micro-project can be applied across various domains, enhancing efficiency, decision-making, and
technological advancements. This review highlights the key applications of the micro-project in
different fields.
Key Points:
1. Data Analysis and Decision-Making:
• The data cleaning and preprocessing techniques learned can be used in business analytics
to make informed decisions.
• Helps in structuring raw data into meaningful insights for organizations and researchers.
• Improves data accuracy, leading to better forecasting and trend analysis.
2. Machine Learning and AI Development:
• Clean and structured data is crucial for training machine learning models.
• The micro-project techniques can be used in predictive modeling, sentiment analysis, and
recommendation systems.
• Reduces bias in datasets, leading to improved model performance.
3. Industry Applications:
• Used in industries such as finance, healthcare, and marketing to process large volumes of
data.
• Helps in fraud detection, customer segmentation, and patient record management.
• Ensures regulatory compliance by maintaining high data quality standards.
4. Automation and Efficiency Enhancement:
• Automating data cleaning processes reduces manual workload and improves efficiency.
• Can be integrated into ETL (Extract, Transform, Load) pipelines for real-time data
processing.
• Enhances data pipeline management in cloud computing environments.
5. Research and Academic Use:
• Facilitates better research outcomes by providing clean datasets for analysis.
• Helps in conducting statistical studies, experiments, and survey data processing.
• Can be applied to thesis projects and case studies in data science and analytics.
6. Software and Application Development:
• The techniques learned can be used in building data-driven applications.
• Essential for developing dashboards, reporting tools, and decision-support systems.
• Enhances user experience by ensuring reliable and accurate data presentation.
7. Future Scope and Advancements:
• The micro-project techniques can be extended to handle big data challenges.
• Integration with AI-powered automation for advanced data preprocessing.
• Continuous improvement of data cleaning methodologies for real-time applications.
Conclusion
The micro-project has practical applications across various domains, including business analytics,
machine learning, automation, and research. The skills and techniques learned help in improving
data quality, enhancing efficiency, and driving better decision-making. As industries continue to
rely on data-driven insights, the importance of effective data cleaning and handling will remain
crucial in technological advancements.

NAME OF TEAM MEMBERS:

SR.NO   NAME             ROLL NO
1.      Vibha Shetty     22957
2.      Ichchha Singh    22959
3.      Siddhi Vanjare   22962
