Data Analytics and Reporting (DAR) - Module 1 Assignment
Module 1: Introduction to Data Science and Analytics
Duration: 5 hours
Problem Statement
Implement a complete data analytics pipeline demonstrating data preprocessing, cleaning, feature
selection, and basic reporting on a real-world dataset.
You are required to select a dataset from the provided sources, perform comprehensive data
preprocessing including cleaning and feature selection techniques (PCA/LDA), and create meaningful
analytical reports that demonstrate your understanding of data science fundamentals.
Data Sources
Choose ONE dataset from the following sources:
Primary Sources:
1. Kaggle Datasets (https://www.kaggle.com/datasets)
- Customer Purchase Behavior Dataset
- Employee Attrition Dataset
- Housing Price Prediction Dataset
- Student Performance Dataset
2. UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html)
- Wine Quality Dataset
- Breast Cancer Wisconsin Dataset
- Iris Dataset (for beginners)
- Adult Income Dataset
3. Government Open Data Portals
- data.gov (US Government)
- data.gov.in (Indian Government)
- Choose any socio-economic dataset
4. Alternative Sources
- Google Dataset Search
- AWS Open Data
- Microsoft Azure Open Datasets
Note: The dataset should contain at least 1,000 records and a minimum of 8 features, including both
numerical and categorical variables.
Assignment Requirements
Part A: Data Understanding and Exploration (45 minutes)
Tasks:
1. Dataset Selection and Justification (10 minutes)
- Select and download your chosen dataset
- Provide brief justification for selection
- Document data source and basic metadata
2. Initial Data Exploration (35 minutes)
- Load data using appropriate Python libraries (pandas, numpy)
- Display basic dataset information (shape, columns, data types)
- Generate descriptive statistics
- Identify missing values, duplicates, and outliers
- Create initial visualizations (histograms, scatter plots, correlation matrix)
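The exploration steps above can be sketched as follows. The DataFrame here is synthetic placeholder data, since the real columns depend on the dataset you select; replace it with a pandas read call (e.g. pd.read_csv) once your dataset is downloaded:

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the chosen dataset; the column names
# ("age", "income", "segment") are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=1000),
    "income": rng.normal(50_000, 12_000, size=1000),
    "segment": rng.choice(["A", "B", "C"], size=1000),
})

# Basic dataset information
print(df.shape)          # (rows, columns)
print(df.dtypes)         # column data types
print(df.describe())     # descriptive statistics for numeric columns

# Data quality checks
n_missing = df.isna().sum()           # missing values per column
n_duplicates = df.duplicated().sum()  # exact duplicate rows
```

Histograms, scatter plots, and a correlation heatmap then follow naturally from `df.hist()`, `df.plot.scatter(...)`, and `df.corr()` with seaborn's `heatmap`.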
Expected Deliverables:
• Jupyter notebook with documented code
• Summary report of dataset characteristics
• Initial data quality assessment
Part B: Data Preprocessing and Cleaning (90 minutes)
Tasks:
1. Data Cleaning (45 minutes)
- Handle missing values using appropriate techniques (mean/median imputation, forward fill, etc.)
- Remove or treat duplicate records
- Identify and handle outliers using statistical methods
- Standardize data formats and correct inconsistencies
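A minimal sketch of the imputation and outlier steps, using a toy Series (the values are invented for illustration). Median imputation and the 1.5*IQR rule are only two of the acceptable techniques; choose what suits your data:

```python
import numpy as np
import pandas as pd

# Toy column with missing values and one obvious outlier (400.0).
s = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 400.0, np.nan, 12.5])

# Median imputation is robust to the outlier still present in the column.
s_imputed = s.fillna(s.median())

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s_imputed.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (s_imputed < lower) | (s_imputed > upper)

# One treatment option: clip (winsorize) rather than drop rows.
s_clean = s_imputed.clip(lower, upper)
```

Document which rows were flagged and why, so the before/after comparison in your deliverables is reproducible.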
2. Data Preprocessing (45 minutes)
- Encode categorical variables (one-hot encoding, label encoding)
- Normalize/standardize numerical features
- Create new features through feature engineering
- Handle date/time variables if present
- Split data into training and testing sets
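The encoding, scaling, and splitting steps can be combined as below. The data and column names are placeholders; note the ordering, which splits before scaling so that test-set statistics never leak into the training transform:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data; "hours", "dept", and "target" are hypothetical columns.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "hours": rng.uniform(0, 40, size=200),
    "dept": rng.choice(["sales", "tech", "hr"], size=200),
    "target": rng.integers(0, 2, size=200),
})

# One-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns="target"), columns=["dept"])
y = df["target"]

# Split first; stratify keeps class proportions consistent across splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Fit the scaler on the training set only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```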
Expected Deliverables:
• Clean, preprocessed dataset
• Documentation of all cleaning steps taken
• Before/after comparison statistics
• Visualization of data distribution changes
Part C: Feature Selection Implementation (75 minutes)
Tasks:
1. Principal Component Analysis (PCA) (40 minutes)
- Implement PCA on numerical features
- Determine optimal number of components using explained variance
- Visualize principal components
- Transform data using selected components
- Interpret results and component loadings
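A sketch of the PCA workflow, using scikit-learn's bundled Wine data as a stand-in for the numerical features of whatever dataset you chose. The 95% variance threshold is a common convention, not a requirement:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# PCA is scale-sensitive, so standardize the features first.
X_std = StandardScaler().fit_transform(X)

# Fit with all components to inspect the explained-variance curve.
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance.
n_components = int(np.argmax(cum_var >= 0.95)) + 1

# Transform the data using only the selected components.
X_reduced = PCA(n_components=n_components).fit_transform(X_std)
```

Plotting `cum_var` against component index gives the scree/elbow visualization, and `pca.components_` holds the loadings for interpretation.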
2. Linear Discriminant Analysis (LDA) (35 minutes)
- Apply LDA for dimensionality reduction (if classification target exists)
- Compare LDA results with PCA
- Evaluate discriminant power
- Visualize class separability
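LDA can be sketched the same way; again the Wine data is a placeholder for your dataset's features and classification target. Unlike PCA, LDA is supervised and yields at most (number of classes - 1) components:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

# Wine has 3 classes, so LDA can produce at most 2 discriminant components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# One rough measure of discriminant power: accuracy of the fitted
# LDA classifier on the training data itself.
train_accuracy = lda.score(X, y)
```

Scatter-plotting the two columns of `X_lda` colored by class gives the class-separability visualization, directly comparable to a scatter of the first two PCA components.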
Expected Deliverables:
• PCA implementation with variance explanation analysis
• LDA implementation with discriminant analysis
• Comparative analysis of both techniques
• Visualizations showing dimensionality reduction results
Part D: Analytics Components - Reporting and Analysis (60 minutes)
Tasks:
1. Descriptive Analytics (30 minutes)
- Create comprehensive statistical summary
- Generate key performance indicators (KPIs)
- Build interactive dashboards using matplotlib/seaborn/plotly
- Perform correlation analysis and trend identification
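A minimal sketch of KPI computation and correlation analysis; the sales-style columns and KPI definitions are hypothetical, since meaningful KPIs depend entirely on your dataset's domain:

```python
import numpy as np
import pandas as pd

# Placeholder daily sales data; "revenue" and "units" are invented columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "revenue": rng.normal(1000, 200, size=365),
    "units": rng.integers(10, 100, size=365),
})

# Example KPIs a descriptive-analytics report might surface.
kpis = {
    "total_revenue": df["revenue"].sum(),
    "mean_daily_units": df["units"].mean(),
    "revenue_per_unit": df["revenue"].sum() / df["units"].sum(),
}

# Pairwise Pearson correlations between numeric columns.
corr = df.corr()
```

Feeding `corr` into seaborn's `heatmap`, or the raw columns into plotly, yields the interactive dashboard components listed above.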
2. Analytical Reporting (30 minutes)
- Develop insights from processed data
- Create executive summary with key findings
- Generate automated reports with visualizations
- Present actionable recommendations based on analysis
Expected Deliverables:
• Interactive dashboard/visualizations
• Executive summary report
• Key insights and recommendations
• Analytical methodology documentation
Part E: Integration and Presentation (30 minutes)
Tasks:
• Combine all components into coherent analysis
• Create final presentation summarizing methodology and findings
• Ensure reproducibility of entire pipeline
• Document lessons learned and potential improvements
Evaluation Rubrics
1. Data Understanding and Exploration (20 points)
• Excellent (18-20): Comprehensive dataset analysis with meaningful insights
• Good (14-17): Adequate exploration with basic statistical analysis
• Satisfactory (10-13): Basic understanding demonstrated with minimal analysis
• Needs Improvement (0-9): Incomplete or superficial data exploration
2. Data Preprocessing and Cleaning (25 points)
• Excellent (23-25): Thorough cleaning with appropriate techniques and documentation
• Good (18-22): Good cleaning practices with minor gaps
• Satisfactory (13-17): Basic cleaning implemented but lacks depth
• Needs Improvement (0-12): Inadequate cleaning or poor technique selection
3. Feature Selection Implementation (25 points)
• Excellent (23-25): Correct implementation of both PCA and LDA with proper interpretation
• Good (18-22): Good implementation with minor technical issues
• Satisfactory (13-17): Basic implementation but lacks proper analysis
• Needs Improvement (0-12): Incorrect implementation or poor understanding
4. Analytics and Reporting (20 points)
• Excellent (18-20): Insightful analysis with professional reporting
• Good (14-17): Good analytical skills with adequate reporting
• Satisfactory (10-13): Basic analysis with simple reporting
• Needs Improvement (0-9): Poor analysis or inadequate reporting
5. Integration and Presentation (10 points)
• Excellent (9-10): Seamless integration with clear, professional presentation
• Good (7-8): Good integration with minor presentation issues
• Satisfactory (5-6): Basic integration with adequate presentation
• Needs Improvement (0-4): Poor integration or unprofessional presentation
Technical Requirements
Required Python Libraries:
• pandas, numpy (data manipulation)
• matplotlib, seaborn, plotly (visualization)
• scikit-learn (machine learning and preprocessing)
• scipy (statistical analysis)
• jupyter notebook (development environment)
Submission Format:
1. Jupyter Notebook (.ipynb) with complete code and documentation
2. Executive Summary (PDF) - 2-3 pages maximum
3. Dataset (original and processed versions)
4. Presentation Slides (PowerPoint/PDF) - 8-10 slides maximum
Time Management Guidelines
Task                         | Allocated Time | Key Focus Areas
Data Exploration             | 45 minutes     | Understanding dataset characteristics
Data Cleaning                | 90 minutes     | Quality assurance and preprocessing
Feature Selection            | 75 minutes     | PCA and LDA implementation
Analytics & Reporting        | 60 minutes     | Insights generation and visualization
Integration & Presentation   | 30 minutes     | Final documentation and presentation
Success Criteria
To pass this assignment, students must:
• Demonstrate understanding of data science fundamentals
• Successfully implement data preprocessing techniques
• Correctly apply PCA and LDA for feature selection
• Generate meaningful analytical insights
• Present findings in a professional manner
• Complete all tasks within the 5-hour timeframe
Bonus Points Available:
• Creative feature engineering (+5 points)
• Advanced visualization techniques (+5 points)
• Statistical significance testing (+5 points)
• Industry-relevant insights (+5 points)
Additional Resources
Documentation References:
• Pandas Documentation: https://pandas.pydata.org/docs/
• Scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
• Matplotlib Documentation: https://matplotlib.org/stable/contents.html
Best Practices:
• Follow PEP 8 coding standards
• Include comprehensive comments in code
• Use version control (Git) for project management
• Validate assumptions with statistical tests
• Ensure reproducibility of results
Note: This assignment is designed to evaluate practical understanding of data science concepts within a
realistic industry timeframe. Focus on demonstrating core competencies rather than perfect
optimization.