1. Data Collection
Objective:
Collect and clean all data related to stock prices and financial reports for NIFTY 50 companies.
Tasks:
Get historical stock prices (10-15 years)
Scrape/download quarterly financial reports (PDF/HTML) from company websites or sources such as screener.in and TradingView
Normalize financial indicators (Revenue, EPS, ROCE, etc.)
Parse PDFs to text for LLM training using tools like PyMuPDF or pdfplumber
Store cleaned data in a structured format such as CSV
Where to Do:
Local machine or Google Colab for small batches
Technologies/Tools:
yfinance, nsetools, pandas, requests, BeautifulSoup, PyMuPDF, pdfplumber
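As a minimal sketch of these steps, the snippet below pulls daily prices with yfinance, extracts report text with pdfplumber, and z-scores one indicator; the ticker, date range, and file names are placeholder assumptions.

    import pandas as pd
    import pdfplumber
    import yfinance as yf

    # Daily prices for one NIFTY 50 constituent; NSE tickers use the ".NS"
    # suffix on Yahoo Finance. The date range is an assumption (~15 years).
    prices = yf.download("RELIANCE.NS", start="2010-01-01", auto_adjust=True)
    prices.to_csv("RELIANCE_prices.csv")

    # Extract raw text from a quarterly report PDF for the later NLP phases.
    # "reliance_q4_2023.pdf" is a placeholder filename.
    with pdfplumber.open("reliance_q4_2023.pdf") as pdf:
        report_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    # Example normalization: z-score an indicator column across companies,
    # assuming a hypothetical cleaned fundamentals sheet with an "EPS" column.
    fundamentals = pd.read_csv("fundamentals.csv")
    fundamentals["EPS_z"] = (
        (fundamentals["EPS"] - fundamentals["EPS"].mean()) / fundamentals["EPS"].std()
    )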
What to Learn:
Data scraping
Pandas & NumPy
File parsing (PDF, HTML)
Data normalization techniques
Exploratory Data Analysis (EDA)
2. NLP Pipeline for Financial Understanding
Objective:
Convert raw financial text into embeddings for semantic search and LLM fine-tuning.
Tasks:
Clean financial language (remove junk, segment paragraphs, remove headers)
Convert cleaned texts into vector embeddings
Use a sentence-embedding model from the sentence-transformers library
Start building your vector database (FAISS or ChromaDB)
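The cleaning and chunking tasks above can start from a sketch like the following, applied to the report_text extracted in Phase 1; the footer pattern and the ~180-word chunk size are illustrative assumptions.

    import re

    def clean_and_chunk(raw_text, max_words=180):
        # Strip page footers and collapse whitespace (patterns are illustrative).
        text = re.sub(r"Page \d+ of \d+", " ", raw_text)
        text = re.sub(r"\s+", " ", text).strip()
        # Split into sentences, then pack them into ~180-word chunks so each
        # chunk fits comfortably in an embedding model's input window.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks, current = [], []
        for sentence in sentences:
            current.append(sentence)
            if sum(len(s.split()) for s in current) >= max_words:
                chunks.append(" ".join(current))
                current = []
        if current:
            chunks.append(" ".join(current))
        return chunks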
Technologies/Tools:
transformers, sentence-transformers, faiss, ChromaDB
NLTK, spaCy for preprocessing
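Embedding and indexing the chunks then takes only a few lines. The sketch below reuses clean_and_chunk and report_text from the earlier sketches; the model name is a common general-purpose choice, not a requirement.

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    chunks = clean_and_chunk(report_text)
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedder.encode(chunks, normalize_embeddings=True)

    # Inner product over normalized vectors equals cosine similarity.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(np.asarray(embeddings, dtype="float32"))

    # Sanity check: retrieve the 3 chunks nearest to a test query.
    query = embedder.encode(["What drove revenue growth this quarter?"],
                            normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query, dtype="float32"), 3)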
What to Learn:
Basic NLP pipeline
Tokenization, stopword removal, chunking
Embeddings and vector similarity
Basics of vector databases
3. Fine-Tuning the LLM
Objective:
Fine-tune a pre-trained language model on domain-specific data (financial reports,
news).
Tasks:
Choose a base model: LLaMA 2, Mistral, or Falcon (7B/13B models)
Fine-tune using LoRA or QLoRA (memory-efficient methods)
Train on structured financial questions & answers or extracted paragraphs
Validate with sample queries
Where to Do:
Must be done on PARAM with A100/V100 GPUs
Technologies/Tools:
Hugging Face Transformers, PEFT, bitsandbytes, accelerate
wandb for experiment tracking
datasets for preprocessing training data
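A minimal QLoRA setup might look like the sketch below; the base checkpoint and the LoRA hyperparameters (r, alpha, target modules) are illustrative starting points, not tuned values.

    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    base = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint

    # 4-bit quantization (the "Q" in QLoRA) keeps the 7B base model in GPU memory.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                    bnb_4bit_quant_type="nf4",
                                    bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(base,
                                                 quantization_config=bnb_config,
                                                 device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(base)

    # Low-rank adapters on the attention projections; only these are trained,
    # so the memory footprint stays small enough for a single A100.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of total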
What to Learn:
LLM architecture basics
Transfer learning vs fine-tuning
LoRA/QLoRA training
Prompt engineering basics
Hugging Face model training
4. RAG Architecture + Vector Database
Objective:
Enable document-level question-answering using Retrieval-Augmented Generation
(RAG).
Tasks:
Index embeddings in FAISS or Chroma
Set up a retriever and connect it to the fine-tuned LLM
Build a prompt template and generation chain
Validate by asking real-world finance questions (e.g., "Compare Q4 results of Reliance
2023 vs 2022")
Where to Do:
Colab for testing the pipeline on a small corpus
PARAM for full integration and testing with the financial corpus
Technologies/Tools:
LangChain, Haystack, FAISS, Chroma, Transformers
Your fine-tuned LLM from Phase 3
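The core RAG loop is retrieve, assemble a prompt, generate; LangChain's retriever and chain classes wrap these same steps. A hand-rolled sketch on top of the Phase 2 index and Phase 3 model (embedder, index, chunks, np, model, and tokenizer are all carried over from the earlier sketches):

    def answer(question, k=4):
        # Retrieve: embed the question and pull the k nearest chunks.
        q_vec = embedder.encode([question], normalize_embeddings=True)
        _, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
        context = "\n\n".join(chunks[i] for i in ids[0])

        # Augment: ground the prompt template in the retrieved context.
        prompt = ("Answer using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

        # Generate with the fine-tuned LLM from Phase 3, returning only
        # the newly generated tokens.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=256)
        return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)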
What to Learn:
What is RAG and how it works
Retriever and Generator roles
Prompt engineering for domain-specific Q&A
LangChain workflows
5. Advanced Stock Prediction Engine
Objective:
Use multivariate time series, sentiment, and technical indicators to forecast stock performance.
Tasks:
Build prediction models (e.g., LSTM, Temporal Fusion Transformer, Prophet)
Combine technical indicators + sentiment features
Train and validate models per company or sector
Backtest using historical data
Where to Do:
PARAM for large-scale training
Colab for experimenting on small datasets
Technologies/Tools:
scikit-learn, XGBoost, Prophet, TensorFlow / PyTorch, TA-Lib
Sentiment scores from Phase 2 as additional features
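A baseline forecasting experiment is sketched below with XGBoost and hand-rolled features; the indicator formulas and the 5-day horizon are simplifications, and TA-Lib indicators, sentiment columns, or an LSTM would slot into the same frame.

    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error

    df = pd.read_csv("RELIANCE_prices.csv", index_col=0, parse_dates=True)

    # Technical features; a per-day sentiment score from Phase 2 could be
    # merged in here as an extra column.
    df["ret_1d"] = df["Close"].pct_change()
    df["sma_20"] = df["Close"].rolling(20).mean() / df["Close"]
    df["vol_20"] = df["ret_1d"].rolling(20).std()
    df["target"] = df["Close"].shift(-5) / df["Close"] - 1  # 5-day forward return
    df = df.dropna()

    # Chronological split -- never shuffle a time series before backtesting.
    split = int(len(df) * 0.8)
    features = ["ret_1d", "sma_20", "vol_20"]
    model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(df[features].iloc[:split], df["target"].iloc[:split])

    preds = model.predict(df[features].iloc[split:])
    rmse = mean_squared_error(df["target"].iloc[split:], preds) ** 0.5
    print(f"Test RMSE: {rmse:.4f}")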
What to Learn:
Time-series forecasting
Feature engineering (technical + NLP-based)
Deep learning for forecasting
Evaluation metrics (RMSE, MAPE, etc.)
6. Evaluation, Visualization, and Final Integration
Objective:
Integrate all modules into a coherent system, validate performance, and prepare demo.
Tasks:
Validate predictions and LLM answers
Create final RAG + prediction interface
Optional UI using Streamlit, Flask, or Gradio
Create visualizations (e.g., trend lines, company comparison, sector forecasts)
Prepare documentation and final presentation
Where to Do:
Local machine or cloud; the UI and visualizations are lighter workloads that do not need PARAM
Technologies/Tools:
Streamlit, Dash, Gradio
Plotly, Matplotlib, Seaborn
Integration with APIs and vector DBs
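A minimal Streamlit front end can tie the pieces together; answer is the RAG function from the Phase 4 sketch, and load_prices is a hypothetical helper that reads the Phase 1 CSVs.

    import streamlit as st

    st.title("NIFTY 50 Research Assistant")

    # Q&A over the financial corpus via the Phase 4 RAG pipeline.
    question = st.text_input("Ask about a company's financials")
    if question:
        st.write(answer(question))

    # Price visualization from the Phase 1 data.
    ticker = st.selectbox("Ticker", ["RELIANCE.NS", "TCS.NS", "INFY.NS"])
    st.line_chart(load_prices(ticker)["Close"])  # load_prices: hypothetical CSV loader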
What to Learn:
Data visualization
UI basics
Integration techniques
System testing and deployment
Bonus: Skills to Learn Throughout
Skill                     Use Case
Linux + Shell Scripting   Running jobs on PARAM
Git & GitHub              Version control