Course Plan: Python, Data Analytics, and Generative AI
Session 1: Python Refresher
Topics
- Python essentials (data types, loops, conditionals)
- Functions and modules
- File handling (CSV/Excel)
- Pythonic coding practices
Practical Focus
- Write a function to load a CSV file and summarize basic statistics.
Session 2: Data Analysis with pandas and NumPy
Topics
- pandas DataFrame basics: loading, slicing, merging
- NumPy arrays: indexing, slicing, reshaping
- Descriptive statistics (mean, median, variance)
Practical Focus
- Analyze a CSV dataset (e.g., sales data) to extract summary statistics.
Session 3: Data Wrangling and Cleaning
Topics
- Handling missing data: dropna, fillna
- String manipulations and date conversions
- Combining and reshaping datasets (merge, concat, pivot)
Practical Focus
- Clean a messy dataset by handling missing values, converting data types, and merging files.
Session 4: Exploratory Data Analysis (EDA)
Topics
- Visualizing distributions (histograms, box plots)
- Correlation analysis and heatmaps
- Identifying patterns and outliers
Practical Focus
- Perform EDA on a dataset (e.g., customer data) to identify trends and relationships.
Session 5: Introduction to Machine Learning
Topics
- Machine Learning basics: supervised vs. unsupervised
- Overview of the ML workflow: data preprocessing → model training → evaluation
- Common ML use cases in business
Practical Focus
- Discuss business-relevant ML use cases and map them to available datasets.
Session 6: Supervised Learning – Regression
Topics
- Linear Regression: simple and multiple
- Scikit-learn ML pipeline
- Model evaluation: MSE, RMSE, MAE
Practical Focus
- Build a linear regression model to predict sales/revenue from a dataset.
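The three evaluation metrics listed in this session can be written out by hand to make their definitions concrete. This is a sketch in plain Python; the function name regression_errors is an assumption, and in the session itself scikit-learn's sklearn.metrics module would normally be used instead.

```python
import math


def regression_errors(y_true, y_pred):
    """Compute MSE, RMSE, and MAE for paired actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    residuals = [actual - predicted for actual, predicted in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / len(residuals)   # mean squared error
    mae = sum(abs(r) for r in residuals) / len(residuals)  # mean absolute error
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}
```

RMSE is the square root of MSE, so it is expressed in the same units as the target (for example, currency for revenue), which tends to make it easier to interpret than MSE.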
Session 7: Supervised Learning – Classification
Topics
- Logistic Regression, Decision Trees
- Evaluation metrics: accuracy, precision, recall, F1-score
- Confusion matrix interpretation
Practical Focus
- Train a logistic regression model to classify customers as likely churners or not.
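To connect the confusion matrix to the metrics above, the sketch below counts the four outcomes for a binary problem and derives accuracy, precision, recall, and F1 from them. The function name binary_metrics and the convention that label 1 means "churner" are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Derive classification metrics from confusion-matrix counts."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # true positive
        elif predicted == positive:
            fp += 1  # false positive
        elif actual == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Returning 0.0 when a denominator is zero is a design choice of this sketch; other conventions (such as raising an error or emitting a warning) are equally defensible.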
Session 8: Model Evaluation and Validation
Topics
- Cross-validation (K-Fold)
- Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
- Bias-variance tradeoff
Practical Focus
- Perform cross-validation and hyperparameter tuning on a classification or regression model.
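The mechanics of K-Fold splitting can be shown without any library: each sample lands in the test fold exactly once and in the training set for every other fold. This is a minimal sketch; the name k_fold_indices is an assumption, and no shuffling is done here, whereas real data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

A model would be trained on each train split and scored on the matching test split, with the k scores averaged to estimate generalization performance.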
Session 9: Feature Engineering
Topics
- Encoding categorical variables
- Feature scaling (standardization, normalization)
- Creating new features from existing data
Practical Focus
- Engineer features (e.g., date-based, interactions) to improve a machine learning model.
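The two scaling approaches named in this session differ in what they preserve: standardization centers values at zero with unit standard deviation, while min-max normalization maps them onto the range 0 to 1. The sketch below uses only the standard library; the function names are assumptions, and mapping a constant column to zeros is a choice made here to avoid division by zero.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - low) / (high - low) for v in values]
```

For example, min_max_normalize([10, 20, 30]) returns [0.0, 0.5, 1.0].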
Session 10: Unsupervised Learning – Clustering
Topics
- K-means clustering
- Applications: customer segmentation, anomaly detection
- Cluster evaluation: silhouette score
Practical Focus
- Perform K-means clustering to segment customers and analyze cluster profiles.
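The K-means loop itself is short: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The sketch below is a bare-bones version for illustration only. It assumes the caller supplies the starting centroids and a fixed iteration count, and the function name k_means is hypothetical; in the session a library implementation such as scikit-learn's KMeans would normally be used.

```python
import math


def k_means(points, centroids, iterations=10):
    """Run a fixed number of assignment and update steps on tuples of numbers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    # clusters holds the assignment made during the final iteration.
    return centroids, clusters
```

Profiling a segment then amounts to summarizing the points in each returned cluster, for instance by average spend or visit frequency.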
Session 11: Ensemble Methods
Topics
- Random Forest and Gradient Boosting (XGBoost/LightGBM)
- Bagging vs. Boosting
- Practical tips for tuning ensembles
Practical Focus
- Train a Gradient Boosting model to improve classification accuracy.
Session 12: SQL for Business Analytics
Topics
- Writing advanced SQL queries (joins, subqueries, window functions)
- Query optimization and indexing
- Integrating SQL queries into Python (using sqlite3 or SQLAlchemy)
Practical Focus
- Query an SQL database from a Python script and analyze the results.
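As a starting point for this exercise, the sketch below runs a join with an aggregate through Python's built-in sqlite3 module against an in-memory database. The table names (customers, orders), their columns, and the sample rows are invented for illustration only.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
        INSERT INTO orders VALUES (1, 100.0), (3, 250.0), (2, 80.0);
        """
    )
    return conn


def revenue_by_region(conn):
    """Join orders to customers and total the order amounts per region."""
    query = """
        SELECT c.region, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY revenue DESC
    """
    return conn.execute(query).fetchall()
```

With the sample rows above, revenue_by_region(build_demo_db()) returns one (region, revenue) tuple per region, highest revenue first.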
Session 13: Introduction to Generative AI (GenAI)
Topics
- Overview of Generative AI (text generation, summarization)
- Working with pre-trained LLMs (e.g., Hugging Face transformers)
- Introduction to prompt engineering
Practical Focus
- Generate text summaries or insights from a dataset using an LLM.
Session 14: Retrieval-Augmented Generation (RAG) – Concepts
Topics
- How RAG combines retrieval systems with generative models
- Use cases for RAG in business (Q&A, report generation, decision support)
- Overview of vector databases (e.g., FAISS, Pinecone)
Practical Focus
- Sketch a workflow where queries fetch relevant data to feed into a generative model.
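The retrieve-then-generate flow can be sketched in a few lines. The version below is deliberately simplified: it scores documents with word-count cosine similarity as a stand-in for embeddings and a vector database, and it stops at building the prompt rather than calling a model. All function names and the prompt wording are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents that share vocabulary with the query."""
    query_vector = Counter(tokenize(query))
    scored = [(cosine(query_vector, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Place the retrieved documents ahead of the question as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The string returned by build_prompt is what would be passed to the generative model; swapping retrieve for an embedding-based search changes the retrieval quality but not the overall workflow.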
Session 15: Building the CSV Analytics Tool – Design
Topics
- Requirements for a CSV analytics tool (querying, summarizing, filtering)
- Efficient file handling for large datasets (chunking)
- Designing user-friendly outputs (charts, tables)
Practical Focus
- Draft the logic for a CSV analytics module that summarizes key metrics interactively.
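Chunking keeps memory use bounded by reading a fixed number of rows at a time and updating running totals instead of loading the whole file. The sketch below shows the idea with the standard library; the function names and the default chunk size are assumptions, and pandas offers a comparable option through the chunksize parameter of read_csv.

```python
import csv
from itertools import islice


def iter_chunks(path, chunk_size=1000):
    """Yield lists of at most chunk_size rows (as dicts) from a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


def column_mean(path, column, chunk_size=1000):
    """Average a numeric column using running totals, one chunk at a time."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(path, chunk_size):
        for row in chunk:
            value = row.get(column)
            if value not in (None, ""):
                # Assumption: every non-empty cell in this column is numeric.
                total += float(value)
                count += 1
    return total / count if count else None
```

Only one chunk is held in memory at any moment, so the same logic applies to files far larger than available RAM.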
Session 16: Implementing the CSV Analytics Tool
Topics
- Building core functionalities: query execution, metric calculations, visualizations
- Error handling and logging
- Exporting insights (e.g., saving summaries to Excel/CSV)
Practical Focus
- Build the CSV analytics tool and test it on real-world datasets.
Session 17: SQL Integration for RAG
Topics
- Querying SQL databases for context retrieval
- Converting SQL results into context for LLMs
- Handling large datasets and dynamic query results
Practical Focus
- Write Python code to retrieve data from SQL, format it, and prepare it for a generative model.
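One way to approach this exercise is to turn each result row into a short "column=value" line and cap the number of rows so the context stays within a model's input limit. This is a sketch built on sqlite3; the function name rows_to_context, the line format, and the default cap of 20 rows are assumptions.

```python
import sqlite3


def rows_to_context(conn, query, params=(), max_rows=20):
    """Run a query and format the first max_rows rows as plain-text lines."""
    cursor = conn.execute(query, params)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchmany(max_rows)
    lines = [
        ", ".join(f"{column}={value}" for column, value in zip(columns, row))
        for row in rows
    ]
    # Tell the model when the result set was cut short.
    if cursor.fetchone() is not None:
        lines.append(f"(results truncated to {max_rows} rows)")
    return "\n".join(lines)
```

Passing user-supplied values through params rather than string formatting keeps the query parameterized, which matters once queries are built dynamically from user input.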
Session 18: Building the RAG-Based Chatbot
Topics
- Connecting the chatbot to SQL and CSV modules
- Structuring prompts dynamically based on user queries
- Handling missing data or ambiguous queries
Practical Focus
- Build an initial RAG-based chatbot pipeline that retrieves context and generates responses.
Session 19: Testing and Refining the GenAI Project
Topics
- Testing edge cases for the CSV tool and RAG chatbot
- Handling incomplete user inputs or noisy data
- Improving performance and response accuracy
Practical Focus
- Test the combined system, focusing on query accuracy, response quality, and speed.
Session 20: Advanced Features and Final Review
Topics
- Adding advanced features: embedding-based similarity search, interactive filtering
- Business scalability considerations (security, multi-user support)
- Future enhancements: extending RAG or adding predictive analytics
Practical Focus
- Explore extensions, such as adding ML-driven recommendations or summarization features to the chatbot.