Data Analytics and Reporting (DAR) - Module 1 Assignment
Module 1: Introduction to Data Science and Analytics
Duration: 5 hours
Problem Statement
Implement a complete data analytics pipeline demonstrating data preprocessing, cleaning, feature
selection, and basic reporting on a real-world dataset.
You are required to select a dataset from the provided sources, perform comprehensive data
preprocessing including cleaning and feature selection techniques (PCA/LDA), and create meaningful
analytical reports that demonstrate your understanding of data science fundamentals.
Data Sources
Choose ONE dataset from the following sources:
Primary Sources:
1. Kaggle Datasets (https://www.kaggle.com/datasets)
- Customer Purchase Behavior Dataset
- Employee Attrition Dataset
- Housing Price Prediction Dataset
- Student Performance Dataset
2. UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html)
- Wine Quality Dataset
- Breast Cancer Wisconsin Dataset
- Iris Dataset (for beginners)
- Adult Income Dataset
3. Government Open Data Portals
- data.gov (US Government)
- data.gov.in (Indian Government)
- Choose any socio-economic dataset
4. Alternative Sources
- Google Dataset Search
- AWS Open Data
- Microsoft Azure Open Datasets
Note: The dataset should contain at least 1,000 records and a minimum of 8 features, including both
numerical and categorical variables.
Assignment Requirements
Part A: Data Understanding and Exploration (45 minutes)
Tasks:
1. Dataset Selection and Justification (10 minutes)
- Select and download your chosen dataset
- Provide brief justification for selection
- Document data source and basic metadata
2. Initial Data Exploration (35 minutes)
- Load data using appropriate Python libraries (pandas, numpy)
- Display basic dataset information (shape, columns, data types)
- Generate descriptive statistics
- Identify missing values, duplicates, and outliers
- Create initial visualizations (histograms, scatter plots, correlation matrix)
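The exploration steps above can be sketched as follows. The DataFrame here is synthetic placeholder data, since the real columns depend on the dataset you select; replace it with a pandas read call (e.g. pd.read_csv) once your dataset is downloaded:

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the chosen dataset; the column names
# ("age", "income", "segment") are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=1000),
    "income": rng.normal(50_000, 12_000, size=1000),
    "segment": rng.choice(["A", "B", "C"], size=1000),
})

# Basic dataset information
print(df.shape)          # (rows, columns)
print(df.dtypes)         # column data types
print(df.describe())     # descriptive statistics for numeric columns

# Data quality checks
n_missing = df.isna().sum()           # missing values per column
n_duplicates = df.duplicated().sum()  # exact duplicate rows
```

Histograms, scatter plots, and a correlation heatmap then follow naturally from `df.hist()`, `df.plot.scatter(...)`, and `df.corr()` with seaborn's `heatmap`.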
Expected Deliverables:
• Jupyter notebook with documented code
• Summary report of dataset characteristics
• Initial data quality assessment
Part B: Data Preprocessing and Cleaning (90 minutes)
Tasks:
1. Data Cleaning (45 minutes)
- Handle missing values using appropriate techniques (mean/median imputation, forward fill, etc.)
- Remove or treat duplicate records
- Identify and handle outliers using statistical methods
- Standardize data formats and correct inconsistencies
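A minimal sketch of the imputation and outlier steps, using a toy Series (the values are invented for illustration). Median imputation and the 1.5*IQR rule are only two of the acceptable techniques; choose what suits your data:

```python
import numpy as np
import pandas as pd

# Toy column with missing values and one obvious outlier (400.0).
s = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 400.0, np.nan, 12.5])

# Median imputation is robust to the outlier still present in the column.
s_imputed = s.fillna(s.median())

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s_imputed.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (s_imputed < lower) | (s_imputed > upper)

# One treatment option: clip (winsorize) rather than drop rows.
s_clean = s_imputed.clip(lower, upper)
```

Document which rows were flagged and why, so the before/after comparison in your deliverables is reproducible.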
2. Data Preprocessing (45 minutes)
- Encode categorical variables (one-hot encoding, label encoding)
- Normalize/standardize numerical features
- Create new features through feature engineering
- Handle date/time variables if present
- Split data into training and testing sets
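The encoding, scaling, and splitting steps can be combined as below. The data and column names are placeholders; note the ordering, which splits before scaling so that test-set statistics never leak into the training transform:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data; "hours", "dept", and "target" are hypothetical columns.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "hours": rng.uniform(0, 40, size=200),
    "dept": rng.choice(["sales", "tech", "hr"], size=200),
    "target": rng.integers(0, 2, size=200),
})

# One-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns="target"), columns=["dept"])
y = df["target"]

# Split first; stratify keeps class proportions consistent across splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Fit the scaler on the training set only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```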
Expected Deliverables:
• Clean, preprocessed dataset
• Documentation of all cleaning steps taken
• Before/after comparison statistics
• Visualization of data distribution changes
Part C: Feature Selection Implementation (75 minutes)
Tasks:
1. Principal Component Analysis (PCA) (40 minutes)
- Implement PCA on numerical features
- Determine optimal number of components using explained variance
- Visualize principal components
- Transform data using selected components
- Interpret results and component loadings
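A sketch of the PCA workflow, using scikit-learn's bundled Wine data as a stand-in for the numerical features of whatever dataset you chose. The 95% variance threshold is a common convention, not a requirement:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# PCA is scale-sensitive, so standardize the features first.
X_std = StandardScaler().fit_transform(X)

# Fit with all components to inspect the explained-variance curve.
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance.
n_components = int(np.argmax(cum_var >= 0.95)) + 1

# Transform the data using only the selected components.
X_reduced = PCA(n_components=n_components).fit_transform(X_std)
```

Plotting `cum_var` against component index gives the scree/elbow visualization, and `pca.components_` holds the loadings for interpretation.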
2. Linear Discriminant Analysis (LDA) (35 minutes)
- Apply LDA for dimensionality reduction (if classification target exists)
- Compare LDA results with PCA
- Evaluate discriminant power
- Visualize class separability
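LDA can be sketched the same way; again the Wine data is a placeholder for your dataset's features and classification target. Unlike PCA, LDA is supervised and yields at most (number of classes - 1) components:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

# Wine has 3 classes, so LDA can produce at most 2 discriminant components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# One rough measure of discriminant power: accuracy of the fitted
# LDA classifier on the training data itself.
train_accuracy = lda.score(X, y)
```

Scatter-plotting the two columns of `X_lda` colored by class gives the class-separability visualization, directly comparable to a scatter of the first two PCA components.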
Expected Deliverables:
• PCA implementation with variance explanation analysis
• LDA implementation with discriminant analysis
• Comparative analysis of both techniques
• Visualizations showing dimensionality reduction results
Part D: Analytics Components - Reporting and Analysis (60 minutes)
Tasks:
1. Descriptive Analytics (30 minutes)
- Create comprehensive statistical summary
- Generate key performance indicators (KPIs)
- Build interactive dashboards using matplotlib/seaborn/plotly
- Perform correlation analysis and trend identification
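A minimal sketch of KPI computation and correlation analysis; the sales-style columns and KPI definitions are hypothetical, since meaningful KPIs depend entirely on your dataset's domain:

```python
import numpy as np
import pandas as pd

# Placeholder daily sales data; "revenue" and "units" are invented columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "revenue": rng.normal(1000, 200, size=365),
    "units": rng.integers(10, 100, size=365),
})

# Example KPIs a descriptive-analytics report might surface.
kpis = {
    "total_revenue": df["revenue"].sum(),
    "mean_daily_units": df["units"].mean(),
    "revenue_per_unit": df["revenue"].sum() / df["units"].sum(),
}

# Pairwise Pearson correlations between numeric columns.
corr = df.corr()
```

Feeding `corr` into seaborn's `heatmap`, or the raw columns into plotly, yields the interactive dashboard components listed above.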
2. Analytical Reporting (30 minutes)
- Develop insights from processed data
- Create executive summary with key findings
- Generate automated reports with visualizations
- Present actionable recommendations based on analysis
Expected Deliverables:
• Interactive dashboard/visualizations
• Executive summary report
• Key insights and recommendations
• Analytical methodology documentation
Part E: Integration and Presentation (30 minutes)
Tasks:
• Combine all components into coherent analysis
• Create final presentation summarizing methodology and findings
• Ensure reproducibility of entire pipeline
• Document lessons learned and potential improvements
Evaluation Rubrics
1. Data Understanding and Exploration (20 points)
• Excellent (18-20): Comprehensive dataset analysis with meaningful insights
• Good (14-17): Adequate exploration with basic statistical analysis
• Satisfactory (10-13): Basic understanding demonstrated with minimal analysis
• Needs Improvement (0-9): Incomplete or superficial data exploration
2. Data Preprocessing and Cleaning (25 points)
• Excellent (23-25): Thorough cleaning with appropriate techniques and documentation
• Good (18-22): Good cleaning practices with minor gaps
• Satisfactory (13-17): Basic cleaning implemented but lacks depth
• Needs Improvement (0-12): Inadequate cleaning or poor technique selection
3. Feature Selection Implementation (25 points)
• Excellent (23-25): Correct implementation of both PCA and LDA with proper interpretation
• Good (18-22): Good implementation with minor technical issues
• Satisfactory (13-17): Basic implementation but lacks proper analysis
• Needs Improvement (0-12): Incorrect implementation or poor understanding
4. Analytics and Reporting (20 points)
• Excellent (18-20): Insightful analysis with professional reporting
• Good (14-17): Good analytical skills with adequate reporting
• Satisfactory (10-13): Basic analysis with simple reporting
• Needs Improvement (0-9): Poor analysis or inadequate reporting
5. Integration and Presentation (10 points)
• Excellent (9-10): Seamless integration with clear, professional presentation
• Good (7-8): Good integration with minor presentation issues
• Satisfactory (5-6): Basic integration with adequate presentation
• Needs Improvement (0-4): Poor integration or unprofessional presentation
Technical Requirements
Required Python Libraries:
• pandas, numpy (data manipulation)
• matplotlib, seaborn, plotly (visualization)
• scikit-learn (machine learning and preprocessing)
• scipy (statistical analysis)
• jupyter notebook (development environment)
Submission Format:
1. Jupyter Notebook (.ipynb) with complete code and documentation
2. Executive Summary (PDF) - 2-3 pages maximum
3. Dataset (original and processed versions)
4. Presentation Slides (PowerPoint/PDF) - 8-10 slides maximum
Time Management Guidelines
Task                         | Allocated Time | Key Focus Areas
Data Exploration             | 45 minutes     | Understanding dataset characteristics
Data Cleaning                | 90 minutes     | Quality assurance and preprocessing
Feature Selection            | 75 minutes     | PCA and LDA implementation
Analytics & Reporting        | 60 minutes     | Insights generation and visualization
Integration & Presentation   | 30 minutes     | Final documentation and presentation
Success Criteria
To pass this assignment, students must:
• Demonstrate understanding of data science fundamentals
• Successfully implement data preprocessing techniques
• Correctly apply PCA and LDA for feature selection
• Generate meaningful analytical insights
• Present findings in a professional manner
• Complete all tasks within the 5-hour timeframe
Bonus Points Available:
• Creative feature engineering (+5 points)
• Advanced visualization techniques (+5 points)
• Statistical significance testing (+5 points)
• Industry-relevant insights (+5 points)
Additional Resources
Documentation References:
• Pandas Documentation: https://pandas.pydata.org/docs/
• Scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
• Matplotlib Documentation: https://matplotlib.org/stable/contents.html
Best Practices:
• Follow PEP 8 coding standards
• Include comprehensive comments in code
• Use version control (Git) for project management
• Validate assumptions with statistical tests
• Ensure reproducibility of results
Note: This assignment is designed to evaluate practical understanding of data science concepts within a
realistic industry timeframe. Focus on demonstrating core competencies rather than perfect
optimization.