
Data Analytics and Reporting (DAR) - Module 1 Assignment

Module 1: Introduction to Data Science and Analytics

Duration: 5 hours

Problem Statement

Implement a complete data analytics pipeline demonstrating data preprocessing, cleaning, feature
selection, and basic reporting on a real-world dataset.

You are required to select a dataset from the provided sources, perform comprehensive data
preprocessing including cleaning, apply feature selection/extraction techniques (PCA and LDA), and
create meaningful analytical reports that demonstrate your understanding of data science fundamentals.

Data Sources

Choose ONE dataset from the following sources:

Primary Sources:

1. Kaggle Datasets (https://www.kaggle.com/datasets)

- Customer Purchase Behavior Dataset

- Employee Attrition Dataset

- Housing Price Prediction Dataset

- Student Performance Dataset

2. UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html)

- Wine Quality Dataset

- Breast Cancer Wisconsin Dataset

- Iris Dataset (for beginners)

- Adult Income Dataset

3. Government Open Data Portals

- data.gov (US Government)

- data.gov.in (Indian Government)

- Choose any socio-economic dataset

4. Alternative Sources

- Google Dataset Search

- AWS Open Data

- Microsoft Azure Open Datasets

Note: The dataset should contain at least 1,000 records and a minimum of 8 features, including both
numerical and categorical variables.

Assignment Requirements

Part A: Data Understanding and Exploration (45 minutes)

Tasks:

1. Dataset Selection and Justification (10 minutes)

- Select and download your chosen dataset

- Provide brief justification for selection

- Document data source and basic metadata

2. Initial Data Exploration (35 minutes; a starter sketch follows this task list)

- Load data using appropriate Python libraries (pandas, numpy)

- Display basic dataset information (shape, columns, data types)

- Generate descriptive statistics

- Identify missing values, duplicates, and outliers

- Create initial visualizations (histograms, scatter plots, correlation matrix)
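
A minimal starter sketch of these exploration steps; the file name data.csv is a placeholder for whatever dataset you choose:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset ("data.csv" is a placeholder for your chosen file)
    df = pd.read_csv("data.csv")

    # Basic structure: shape, columns, data types
    print(df.shape)
    print(df.dtypes)

    # Descriptive statistics for all columns, numerical and categorical
    print(df.describe(include="all"))

    # Missing values and duplicate rows
    print(df.isna().sum())
    print("duplicate rows:", df.duplicated().sum())

    # Initial visualizations: histograms and a correlation heatmap
    df.hist(figsize=(12, 8))
    plt.figure(figsize=(8, 6))
    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
    plt.show()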

Expected Deliverables:

• Jupyter notebook with documented code
• Summary report of dataset characteristics
• Initial data quality assessment

Part B: Data Preprocessing and Cleaning (90 minutes)


Tasks:

1. Data Cleaning (45 minutes; a sketch follows this task list)

- Handle missing values using appropriate techniques (mean/median imputation, forward fill, etc.)

- Remove or treat duplicate records

- Identify and handle outliers using statistical methods

- Standardize data formats and correct inconsistencies
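
A sketch of these cleaning steps; data.csv is again a placeholder, and median/mode imputation and the 1.5 * IQR cutoff are illustrative choices rather than the only valid ones:

    import pandas as pd

    df = pd.read_csv("data.csv")  # placeholder file name

    # Impute missing values: median for numerical columns, mode for categorical
    for col in df.columns[df.isna().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

    # Remove exact duplicate records
    df = df.drop_duplicates()

    # Treat outliers with the IQR rule (1.5 * IQR is a common, not mandatory, cutoff)
    num_cols = df.select_dtypes(include="number").columns
    q1 = df[num_cols].quantile(0.25)
    q3 = df[num_cols].quantile(0.75)
    iqr = q3 - q1
    outlier_mask = ((df[num_cols] < q1 - 1.5 * iqr) | (df[num_cols] > q3 + 1.5 * iqr)).any(axis=1)
    df_clean = df[~outlier_mask]
    print(f"Removed {outlier_mask.sum()} outlier rows")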

2. Data Preprocessing (45 minutes; a sketch follows this task list)

- Encode categorical variables (one-hot encoding, label encoding)

- Normalize/standardize numerical features

- Create new features through feature engineering

- Handle date/time variables if present

- Split data into training and testing sets
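
A sketch of the preprocessing step, continuing from the df_clean frame above; the label column name target is a placeholder, and stratified splitting assumes a categorical target:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # "target" is a placeholder for your dataset's label column
    X = df_clean.drop(columns=["target"])
    y = df_clean["target"]

    # One-hot encode categorical features
    X = pd.get_dummies(X, drop_first=True)

    # Split before scaling, so test-set statistics never leak into the scaler
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Standardize features: fit on the training set only
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

Fitting the scaler on the training split alone is the key design choice here: it keeps the evaluation honest by preventing information from the test set from influencing preprocessing.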

Expected Deliverables:

• Clean, preprocessed dataset
• Documentation of all cleaning steps taken
• Before/after comparison statistics
• Visualization of data distribution changes

Part C: Feature Selection Implementation (75 minutes)

Tasks:

1. Principal Component Analysis (PCA) (40 minutes; a sketch follows this task list)

- Implement PCA on numerical features

- Determine optimal number of components using explained variance

- Visualize principal components

- Transform data using selected components

- Interpret results and component loadings
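
A PCA sketch continuing from the scaled training data above; the 95% variance threshold is an illustrative choice, not a rule:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Fit a full PCA on the standardized training features from Part B
    pca_full = PCA().fit(X_train_scaled)

    # Pick the smallest number of components explaining ~95% of the variance
    cum_var = np.cumsum(pca_full.explained_variance_ratio_)
    n_components = int(np.argmax(cum_var >= 0.95)) + 1
    print(f"{n_components} components explain {cum_var[n_components - 1]:.1%} of variance")

    # Scree plot of cumulative explained variance
    plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
    plt.xlabel("Number of components")
    plt.ylabel("Cumulative explained variance")
    plt.show()

    # Transform the data with the selected components
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    # Loadings: how strongly each original feature contributes to each component
    loadings = pd.DataFrame(pca.components_.T, index=X.columns)
    print(loadings.head())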

2. Linear Discriminant Analysis (LDA) (35 minutes; a sketch follows this task list)


- Apply LDA for dimensionality reduction (if classification target exists)

- Compare LDA results with PCA

- Evaluate discriminant power

- Visualize class separability
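
An LDA sketch, again continuing from the scaled data; it assumes the target has at least three classes so that two discriminant axes exist (use n_components=1 for a binary target):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # LDA is supervised: unlike PCA it uses the class labels, and it yields
    # at most (number of classes - 1) components.
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_train_lda = lda.fit_transform(X_train_scaled, y_train)

    print("Explained variance ratio per discriminant:", lda.explained_variance_ratio_)

    # Visualize class separability in the discriminant space
    y_arr = np.asarray(y_train)
    for label in np.unique(y_arr):
        pts = X_train_lda[y_arr == label]
        plt.scatter(pts[:, 0], pts[:, 1], label=str(label), alpha=0.6)
    plt.xlabel("LD1")
    plt.ylabel("LD2")
    plt.legend()
    plt.show()

For the comparison task, note the conceptual difference: PCA maximizes retained variance without looking at labels, while LDA maximizes class separability, so the same 2-D projections can look very different.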

Expected Deliverables:

• PCA implementation with variance explanation analysis
• LDA implementation with discriminant analysis
• Comparative analysis of both techniques
• Visualizations showing dimensionality reduction results

Part D: Analytics Components - Reporting and Analysis (60 minutes)

Tasks:

1. Descriptive Analytics (30 minutes; a sketch follows this task list)

- Create comprehensive statistical summary

- Generate key performance indicators (KPIs)

- Build interactive dashboards using matplotlib/seaborn/plotly

- Perform correlation analysis and trend identification
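
A sketch of KPI computation, correlation analysis, and an interactive plotly view, continuing from df_clean; the KPIs shown are generic examples to adapt to your dataset:

    import plotly.express as px

    # KPI-style summary of the cleaned data
    kpis = {
        "records": len(df_clean),
        "features": df_clean.shape[1],
        "missing_cells": int(df_clean.isna().sum().sum()),
    }
    print(kpis)

    # Correlation analysis on numerical columns
    corr = df_clean.corr(numeric_only=True)
    print(corr.round(2))

    # Interactive scatter matrix with plotly (renders inline in Jupyter)
    fig = px.scatter_matrix(df_clean.select_dtypes(include="number"))
    fig.show()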

2. Analytical Reporting (30 minutes; a sketch follows this task list)

- Develop insights from processed data

- Create executive summary with key findings

- Generate automated reports with visualizations

- Present actionable recommendations based on analysis
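
One lightweight way to automate a report, sketched below: save figures to disk and assemble an HTML page from pandas tables (report.html and figure1.png are placeholder file names, and df_clean is the frame from Part B):

    import matplotlib.pyplot as plt

    # Save a key figure to disk
    fig, ax = plt.subplots()
    df_clean.select_dtypes(include="number").iloc[:, 0].hist(ax=ax)
    fig.savefig("figure1.png")

    # Assemble a simple HTML report from pandas tables and saved figures
    with open("report.html", "w") as f:
        f.write("<h1>Executive Summary</h1>")
        f.write(df_clean.describe().to_html())
        f.write('<img src="figure1.png">')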

Expected Deliverables:

• Interactive dashboard/visualizations
• Executive summary report
• Key insights and recommendations
• Analytical methodology documentation

Part E: Integration and Presentation (30 minutes)


Tasks:

• Combine all components into coherent analysis
• Create final presentation summarizing methodology and findings
• Ensure reproducibility of the entire pipeline (a seed-setting sketch follows this list)
• Document lessons learned and potential improvements
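
A small sketch of the seed-setting that makes the pipeline reproducible (42 is an arbitrary choice):

    import random
    import numpy as np

    # Fix all seeds up front so re-running the notebook gives identical results
    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    # ...and pass random_state=SEED to scikit-learn calls such as train_test_split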

Evaluation Rubrics

1. Data Understanding and Exploration (20 points)

• Excellent (18-20): Comprehensive dataset analysis with meaningful insights
• Good (14-17): Adequate exploration with basic statistical analysis
• Satisfactory (10-13): Basic understanding demonstrated with minimal analysis
• Needs Improvement (0-9): Incomplete or superficial data exploration

2. Data Preprocessing and Cleaning (25 points)

• Excellent (23-25): Thorough cleaning with appropriate techniques and documentation
• Good (18-22): Good cleaning practices with minor gaps
• Satisfactory (13-17): Basic cleaning implemented but lacks depth
• Needs Improvement (0-12): Inadequate cleaning or poor technique selection

3. Feature Selection Implementation (25 points)

• Excellent (23-25): Correct implementation of both PCA and LDA with proper interpretation
• Good (18-22): Good implementation with minor technical issues
• Satisfactory (13-17): Basic implementation but lacks proper analysis
• Needs Improvement (0-12): Incorrect implementation or poor understanding

4. Analytics and Reporting (20 points)

• Excellent (18-20): Insightful analysis with professional reporting
• Good (14-17): Good analytical skills with adequate reporting
• Satisfactory (10-13): Basic analysis with simple reporting
• Needs Improvement (0-9): Poor analysis or inadequate reporting

5. Integration and Presentation (10 points)

• Excellent (9-10): Seamless integration with clear, professional presentation
• Good (7-8): Good integration with minor presentation issues
• Satisfactory (5-6): Basic integration with adequate presentation
• Needs Improvement (0-4): Poor integration or unprofessional presentation

Technical Requirements

Required Python Libraries:


• pandas, numpy (data manipulation)
• matplotlib, seaborn, plotly (visualization)
• scikit-learn (machine learning and preprocessing)
• scipy (statistical analysis)
• jupyter notebook (development environment)

Submission Format:

1. Jupyter Notebook (.ipynb) with complete code and documentation
2. Executive Summary (PDF) - 2-3 pages maximum
3. Dataset (original and processed versions)
4. Presentation Slides (PowerPoint/PDF) - 8-10 slides maximum

Time Management Guidelines

Task                          Allocated Time   Key Focus Areas
Data Exploration              45 minutes       Understanding dataset characteristics
Data Cleaning                 90 minutes       Quality assurance and preprocessing
Feature Selection             75 minutes       PCA and LDA implementation
Analytics & Reporting         60 minutes       Insights generation and visualization
Integration & Presentation    30 minutes       Final documentation and presentation

Success Criteria

To pass this assignment, students must:

• Demonstrate understanding of data science fundamentals
• Successfully implement data preprocessing techniques
• Correctly apply PCA and LDA for feature selection
• Generate meaningful analytical insights
• Present findings in a professional manner
• Complete all tasks within the 5-hour timeframe

Bonus Points Available:

• Creative feature engineering (+5 points)
• Advanced visualization techniques (+5 points)
• Statistical significance testing (+5 points)
• Industry-relevant insights (+5 points)

Additional Resources

Documentation References:

• Pandas Documentation: https://pandas.pydata.org/docs/
• Scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
• Matplotlib Documentation: https://matplotlib.org/stable/contents.html

Best Practices:

• Follow PEP 8 coding standards
• Include comprehensive comments in code
• Use version control (Git) for project management
• Validate assumptions with statistical tests (see the sketch below)
• Ensure reproducibility of results
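
A sketch of such assumption checks with scipy; the column names price and category and the group labels are placeholders for your dataset:

    from scipy import stats

    # Normality check on a numerical column ("price" is a placeholder name)
    stat, p = stats.shapiro(df_clean["price"].dropna())
    print(f"Shapiro-Wilk p-value: {p:.4f}")  # p < 0.05 suggests non-normality

    # Group comparison ("category" and its levels are placeholders):
    # Welch's t-test, which does not assume equal variances
    group_a = df_clean.loc[df_clean["category"] == "A", "price"]
    group_b = df_clean.loc[df_clean["category"] == "B", "price"]
    stat, p = stats.ttest_ind(group_a, group_b, equal_var=False)
    print(f"Welch t-test p-value: {p:.4f}")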

Note: This assignment is designed to evaluate practical understanding of data science concepts within a
realistic industry timeframe. Focus on demonstrating core competencies rather than perfect
optimization.
