I built this project to understand how digital payment fraud works in India's UPI ecosystem. It started as a data analysis project and evolved into a full production-grade platform with real-time fraud scoring, vector similarity search, causal inference, and containerized deployment.
India's UPI system processes 16+ billion transactions every month (as of Feb 2025). With that kind of volume, even a tiny fraud rate means crores of rupees lost. I wanted to see if I could build a system that not only detects fraudulent transactions but also measures the causal impact of security interventions like 2FA.
Since real UPI data isn't publicly available (obvious privacy reasons), I generated synthetic data calibrated against official NPCI ecosystem stats and RBI fraud reports. The distributions, bank market shares, and fraud patterns all mirror what actually happens in the real world.
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ FastAPI │───▶│ Celery │───▶│ Fraud Scoring │
│ Gateway │ │ Workers │ │ (Ensemble + │
│ (api.py) │ │ (Redis Q) │ │ FAISS + DiD) │
└──────┬───────┘ └──────────────┘ └─────────────────┘
│ │
▼ ▼
┌──────────────┐ ┌───────────────────┐
│ /causal/* │ │ FAISS Index │
│ /fraud/* │ │ (fraud pattern │
│ /health │ │ embeddings) │
└──────────────┘ └───────────────────┘
Three-service stack (Docker Compose):
- API (FastAPI + Uvicorn) — handles all HTTP requests, serves fraud scores and causal results
- Worker (Celery) — processes async fraud scoring tasks from the Redis queue
- Redis — message broker + result backend for Celery
The generator produces 1M transactions with realistic patterns (spending peaks at lunch and in the evening, more P2M on weekdays, festival-season spikes, etc.). I injected 6 types of fraud that actually occur in UPI, including rapid-fire transactions, round-number structuring, unusual-hour activity, and geographic impossibility.
Loaded everything into SQLite and wrote 12 queries covering daily trends, bank failure rates, fraud patterns by city/time/type, threshold monitoring etc. Not textbook SQL — actual analytical queries you'd write at a job.
| Method | What it does | Why it works |
|---|---|---|
| Z-Score | Flags amounts 3+ σ from a user's average | Catches sudden behavior changes |
| Isolation Forest | Unsupervised ML anomaly detection | Finds outliers without labels |
| Rule-Based | Manual rules from RBI fraud reports | Catches known patterns (structuring, odd hours) |
| FAISS Similarity | Finds K-nearest fraud neighbors in vector space | Catches variants of known fraud types |
Combined into an ensemble that flags a transaction only when 2+ methods agree, which cuts false positives compared with running the statistical detectors on their own.
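The voting rule itself is simple enough to sketch in a few lines. This is a minimal illustration, not the code in fraud_detector.py; the function name `ensemble_flag` and the `min_votes` parameter are assumptions made for the example.

```python
from typing import Sequence


def ensemble_flag(method_flags: Sequence[bool], min_votes: int = 2) -> bool:
    """Return True when at least `min_votes` detectors flagged the transaction."""
    return sum(bool(f) for f in method_flags) >= min_votes


# One tuple per transaction; each position is one detector's verdict
# (hypothetical order: z-score, isolation forest, rule-based).
votes = [
    (True, False, False),   # only one detector fired, so not flagged
    (True, True, False),    # two detectors agree, so flagged
    (False, False, False),  # nothing fired, so not flagged
]
flags = [ensemble_flag(v) for v in votes]
print(flags)
```

Raising `min_votes` trades recall for precision, which is the same trade-off the ensemble makes against the single detectors.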
Every transaction is embedded into an 11-dimensional feature vector:
[amount_norm, hour_sin, hour_cos, dow_sin, dow_cos, is_night,
is_weekend, velocity_norm, amount_zscore, city_encoded, txn_type_encoded]
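The `hour_sin`/`hour_cos` and `dow_sin`/`dow_cos` pairs suggest a cyclical encoding, where a periodic value is placed on the unit circle so that 23:00 and 00:00 end up close together. The sketch below assumes periods of 24 (hours) and 7 (days) and a helper named `cyclical`; both are illustrative choices, not taken from the project code.

```python
import math
from typing import Tuple


def cyclical(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


hour_sin, hour_cos = cyclical(6, 24)  # 06:00 is a quarter of the way round
dow_sin, dow_cos = cyclical(0, 7)     # day 0 of the week (numbering is an assumption)

# 23:00 and 00:00 are one hour apart, and their encodings stay close too,
# which a raw 0-23 integer feature would not capture.
gap = math.dist(cyclical(23, 24), cyclical(0, 24))
print(round(hour_sin, 3), round(hour_cos, 3), round(gap, 3))
```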
Built a FAISS IndexFlatL2 over all 29,139 confirmed fraud transactions. When a new transaction comes in, query for the K most similar historical fraud cases in sub-millisecond time. Returns a fraud_similarity_score (0-1) based on average L2 distance to nearest fraud neighbors. FAISS flagged 20,383 transactions (1.98%) — closely tracking the actual fraud rate.
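What an exhaustive L2 index does can be shown without FAISS itself. The sketch below is a pure-Python stand-in: it computes squared L2 distances to every stored fraud vector (the metric `IndexFlatL2` reports) and keeps the K smallest. The mapping from distance to a 0-1 score, `1 / (1 + mean_distance)`, is an assumption for illustration, since the text above does not say which formula the project uses. The vectors are 3-dimensional toy data, not the real 11-dimensional features.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(query: Sequence[float],
              index_vectors: Sequence[Sequence[float]],
              k: int) -> List[Tuple[float, int]]:
    """Exhaustive search: (distance, position) of the k closest stored vectors."""
    scored = ((squared_l2(query, v), i) for i, v in enumerate(index_vectors))
    return heapq.nsmallest(k, scored)


def similarity_score(query: Sequence[float],
                     index_vectors: Sequence[Sequence[float]],
                     k: int = 2) -> float:
    """Hypothetical 0-1 score: higher means closer to known fraud."""
    neighbours = k_nearest(query, index_vectors, k)
    if not neighbours:  # empty index: nothing to compare against
        return 0.0
    mean_dist = sum(d for d, _ in neighbours) / len(neighbours)
    return 1.0 / (1.0 + mean_dist)


fraud_vectors = [(0.9, 1.0, 0.8), (1.0, 1.0, 1.0), (0.1, 0.0, 0.2)]
close = similarity_score((1.0, 1.0, 0.9), fraud_vectors)
far = similarity_score((5.0, 5.0, 5.0), fraud_vectors)
print(round(close, 3), round(far, 3))
```

A flat index like this scans every stored vector per query, which is fine at roughly 29K fraud vectors; approximate index types only start to matter at much larger scales.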
Why FAISS over basic KNN? FAISS is Meta's open-source library for production-scale vector search. Even though our dataset is only ~1M rows, using FAISS demonstrates familiarity with the same kind of infrastructure that powers RAG pipelines, recommendation systems, and semantic search.
Benford's Law says that in many naturally occurring datasets, the leading digit is "1" about 30% of the time and "9" only about 5% of the time. Fabricated amounts often don't follow this distribution, so fraudulent transactions tend to deviate from it. I used chi-squared tests to quantify the deviation.
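The expected frequencies come from the formula P(d) = log10(1 + 1/d), and the deviation can be measured with a chi-squared goodness-of-fit statistic. The sketch below uses only the standard library; the helper names and the toy datasets (powers of 2 as a Benford-like sample, amounts just under 10,000 as a structured sample) are assumptions for illustration, not the project's data.

```python
import math
from collections import Counter
from typing import Iterable


def benford_expected(digit: int) -> float:
    """Benford's Law: P(first digit = d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / digit)


def first_digit(amount: float) -> int:
    """Leading non-zero digit of a non-zero amount."""
    return int(str(abs(amount)).lstrip("0.")[0])


def benford_chi2(amounts: Iterable[float]) -> float:
    """Chi-squared statistic of observed first digits against Benford."""
    counts = Counter(first_digit(a) for a in amounts if a)
    n = sum(counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


natural = [2 ** k for k in range(1, 201)]     # powers of 2 follow Benford closely
structured = [9500 + i for i in range(200)]   # every amount starts with 9

chi2_natural = benford_chi2(natural)
chi2_structured = benford_chi2(structured)
print(round(benford_expected(1), 3), round(benford_expected(9), 3))
```

With 9 digit categories there are 8 degrees of freedom, so a statistic above roughly 15.51 rejects conformity to Benford at the 5% level.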
Three statistical tests with proper significance:
- Fraud amounts are higher than normal (Mann-Whitney U, p < 0.001)
- Fraud rate is higher at night 1-5 AM (Chi-squared, p < 0.001)
- Fraud rate varies across transaction types (Chi-squared, p = 0.011)
This is the key evolution — shifting from "is this fraudulent?" (prediction) to "does 2FA cause fraud to decrease?" (causal impact measurement).
What is the causal effect of introducing 2-Factor Authentication (2FA) on the likelihood of a UPI transaction being fraudulent?
DiD is a quasi-experimental method that isolates the causal effect of a policy intervention by comparing treatment and control groups before and after the intervention. It's the same technique economists use to evaluate minimum wage laws, healthcare mandates, and tax policy changes.
Model specification:
is_fraud = β₀ + β₁(has_2fa) + β₂(post_treatment) + β₃(has_2fa × post_treatment)
+ β₄(amount_log) + β₅(is_night) + β₆(is_weekend) + β₇(city_tier)
+ β₈(user_risk_score) + ε
β₃ (the coefficient on the interaction term) = Average Treatment Effect (ATE)

- Estimated with OLS using HC1 heteroscedasticity-robust standard errors
- Treatment date: July 1, 2024 (simulated RBI mandate for mandatory 2FA)
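To see what β₃ captures, it helps to drop the controls: in that case the OLS interaction coefficient equals the change in the treated group's fraud rate minus the change in the control group's fraud rate. The sketch below computes that double difference on toy counts. The function names and the numbers are made up for illustration; the project's actual estimate comes from the full regression with covariates and robust standard errors.

```python
from statistics import mean
from typing import Iterable, List, Tuple

Row = Tuple[int, int, int]  # (has_2fa, post_treatment, is_fraud), all 0/1


def did_estimate(rows: Iterable[Row]) -> float:
    """Difference-in-differences of mean fraud rates (no covariates)."""
    cells = {(g, t): [] for g in (0, 1) for t in (0, 1)}
    for has_2fa, post, is_fraud in rows:
        cells[(has_2fa, post)].append(is_fraud)
    # mean() raises StatisticsError if any of the four cells is empty
    rate = {key: mean(vals) for key, vals in cells.items()}
    treated_change = rate[(1, 1)] - rate[(1, 0)]
    control_change = rate[(0, 1)] - rate[(0, 0)]
    return treated_change - control_change


def make_cell(has_2fa: int, post: int, n: int, n_fraud: int) -> List[Row]:
    """Build n toy transactions for one group/period, n_fraud of them fraudulent."""
    return [(has_2fa, post, 1)] * n_fraud + [(has_2fa, post, 0)] * (n - n_fraud)


rows = (
    make_cell(0, 0, 1000, 30)    # control, before: 3.0% fraud
    + make_cell(0, 1, 1000, 40)  # control, after:  4.0% fraud
    + make_cell(1, 0, 1000, 20)  # treated, before: 2.0% fraud
    + make_cell(1, 1, 1000, 20)  # treated, after:  2.0% fraud
)
effect = did_estimate(rows)
print(f"{effect:.4f}")
```

In this toy data the treated group starts at a lower fraud rate, which is fine for DiD: the method compares changes, not levels, so only the trends need to match before treatment.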
| Metric | Value |
|---|---|
| ATE | -0.0073 (2FA reduces fraud probability by 0.73 percentage points) |
| 95% CI | [-0.013, -0.0016] |
| p-value | 0.013 → Statistically significant |
| Parallel Trends | Valid (F = 1.02, p = 0.40) |
| Placebo Test | Passed (ATE ≈ 0, p = 0.99) |
| R² | 0.118 |
| N | 100,000 |
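The ATE, confidence interval, and p-value in the table are tied together by the standard error. As a rough consistency check, assuming a normal approximation (the 1.96 multiplier), the standard error can be backed out of the interval width and used to recompute a two-sided p-value. The inputs are the rounded table values, so the result only needs to land near the reported 0.013.

```python
import math

ate = -0.0073
ci_low, ci_high = -0.013, -0.0016

# A 95% normal-approximation interval is estimate +/- 1.96 * SE,
# so the interval width is 2 * 1.96 * SE.
se = (ci_high - ci_low) / (2 * 1.96)
z = ate / se

# Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
print(f"SE ~ {se:.4f}, z ~ {z:.2f}, p ~ {p_two_sided:.3f}")
```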
ATE remains stable across all 4 model specifications — robust finding:
| Specification | ATE | p-value | R² |
|---|---|---|---|
| No controls | -0.0077 | 0.013 | 0.0001 |
| Time controls | -0.0077 | 0.011 | 0.031 |
| Amount controls | -0.0070 | 0.017 | 0.116 |
| Full controls | -0.0073 | 0.013 | 0.118 |
2FA adoption is not randomly assigned — this is a deliberate design choice. In reality:
- Tier-1 city users adopt 2FA more (higher fintech literacy)
- PhonePe/GPay users get it first (fintech-first banks roll out faster)
- Power users adopt faster than infrequent users
This creates confounding that the DiD design and control variables have to handle, which makes the analysis closer to real observational data than a simple randomized comparison would be.
DiD rests on several assumptions. Here's how I validated each:
| Assumption | What it means | How I validated |
|---|---|---|
| Parallel Trends | Treatment and control groups had the same fraud trend before 2FA | F-test on has_2fa × month interactions in pre-treatment data. Non-significant (p=0.40) = assumption holds. |
| No Anticipation | Users didn't change behavior before the 2FA launch | By construction — the treatment date is not known to users in advance |
| SUTVA | One user's 2FA doesn't affect another user's fraud risk | Reasonable for individual fraud scoring (not for network-level fraud) |
| Common Support | Treatment and control groups overlap in characteristics | Verified by design — both groups span all cities, banks, and transaction types |
- Placebo Test: Re-ran DiD with a fake treatment date (April 2024). ATE ≈ 0 with p = 0.99, so no spurious effect shows up at a made-up date ✓
- Sensitivity Analysis: Ran 4 model specifications (no controls → full controls) and ATE stayed in the narrow range of -0.007 to -0.008, a robust finding ✓
- Treatment is simulated, not from a real natural experiment
- Complier-level effects (LATE) may differ from population-level ATE
- No geographic or time fixed effects (could improve precision)
- Cross-contamination possible if fraud rings span treatment/control groups
Transactions come in at thousands per second in production. You can't block the API while running ensemble + FAISS + causal models. The architecture uses:
- Celery for async task processing with Redis as the message broker
- Graceful degradation: if Redis isn't running, everything falls back to synchronous execution
- Task monitoring: check task status by ID via the API
- Periodic tasks: FAISS index rebuild can run nightly via Celery Beat
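The graceful-degradation idea can be sketched independently of Celery: try to enqueue the job, and if the broker is unreachable, score the transaction inline instead. Everything named below (`BrokerUnavailable`, `submit`, `score_transaction`, the two fake enqueue functions) is hypothetical and stands in for the real task code in tasks.py; with Celery, the `enqueue` callable would typically wrap a task's `.delay()` call and catch the broker's connection error.

```python
from typing import Any, Callable, Dict


class BrokerUnavailable(Exception):
    """Stand-in for the connection error a real broker client would raise."""


def score_transaction(txn: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder scorer; the real one would run the ensemble and FAISS lookup."""
    return {"txn_id": txn["txn_id"], "fraud_score": 0.0}


def submit(txn: Dict[str, Any],
           enqueue: Callable[[Dict[str, Any]], str]) -> Dict[str, Any]:
    """Queue the job if possible; otherwise fall back to synchronous scoring."""
    try:
        task_id = enqueue(txn)
        return {"mode": "async", "task_id": task_id}
    except BrokerUnavailable:
        return {"mode": "sync", "result": score_transaction(txn)}


def working_enqueue(txn: Dict[str, Any]) -> str:
    return "task-123"  # pretend the broker accepted the job


def broken_enqueue(txn: Dict[str, Any]) -> str:
    raise BrokerUnavailable("redis not reachable")


queued = submit({"txn_id": "T1"}, working_enqueue)
fallback = submit({"txn_id": "T1"}, broken_enqueue)
print(queued["mode"], fallback["mode"])
```

Returning the mode alongside the payload lets the caller know whether to poll for a task status or read the result directly.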
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check + component status |
| `/fraud/score` | POST | Score a single transaction (ensemble + FAISS) |
| `/fraud/batch` | POST | Score multiple transactions (sync or async) |
| `/fraud/similar/{txn_id}` | GET | FAISS K-nearest fraud pattern search |
| `/causal/summary` | GET | Full DiD results (ATE, CI, p-value, robustness) |
| `/causal/estimate` | POST | DiD estimate for a filtered subset |
| `/causal/parallel-trends` | GET | Pre-treatment fraud rate time series |
| `/task/{id}/status` | GET | Async Celery task status |
Interactive docs at http://localhost:8000/docs (Swagger UI).
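A client call to the scoring endpoint might look like the sketch below, using only the standard library. The endpoint path and port come from the table and URL above, but the request fields (`txn_id`, `amount`, `hour`, `city`, `txn_type`) are guesses for illustration; the Swagger UI is the place to check the real schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_score_request(txn: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for /fraud/score."""
    body = json.dumps(txn).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/fraud/score",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: field names are assumptions, not the documented schema.
example_txn = {
    "txn_id": "T0001",
    "amount": 9999.0,
    "hour": 3,
    "city": "Mumbai",
    "txn_type": "P2P",
}
request = build_score_request(example_txn)

# To actually send it (requires the API to be running locally):
# with urllib.request.urlopen(request, timeout=5) as resp:
#     print(json.load(resp))
```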
Built a 3-page interactive Power BI dashboard with a star schema data model using real NPCI data (Apr 2023 - Feb 2025) and RBI Annual Report fraud statistics (FY2020-2025).
Page 1 - Overview
Page 2 - Fraud Intelligence
Page 3 - Real Data & Benford's Law
Interactive Streamlit web dashboard with 6 pages: overview, fraud intelligence, bank/city analytics, real market context with NPCI data, Benford's Law visualization, and transaction lookup.
6-sheet Excel report for non-technical stakeholders with executive summary, fraud details, statistical results, method comparison, bank performance, and recommendations.
| Method | Precision | Recall | F1 |
|---|---|---|---|
| Z-Score | 34.4% | 36.7% | 35.5% |
| Isolation Forest | 57.6% | 61.0% | 59.3% |
| Rule-Based | 73.9% | 54.4% | 62.7% |
| Ensemble (2/3) | 69.1% | 53.9% | 60.6% |
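The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so it can be recomputed from the other two columns as a sanity check. The snippet below does that for the four rows of the table.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) copied from the results table
results = {
    "Z-Score": (34.4, 36.7),
    "Isolation Forest": (57.6, 61.0),
    "Rule-Based": (73.9, 54.4),
    "Ensemble (2/3)": (69.1, 53.9),
}
f1_scores = {name: round(f1(p, r), 1) for name, (p, r) in results.items()}
for name, score in f1_scores.items():
    print(f"{name}: {score}%")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a method with lopsided precision and recall scores lower than its simple average would suggest.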
Key findings:
- Fraud transactions have significantly higher amounts than normal ones
- 1-5 AM window has 3x higher fraud rate than daytime
- Structuring (amounts just under Rs 10K reporting threshold) is the most common pattern
- Benford's Law deviation is way more pronounced in fraudulent transactions
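- Ensemble (69.1% precision, 60.6% F1) is far more precise than Z-Score or Isolation Forest alone, though the rule-based detector on its own edges it out on both precision and F1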
- 2FA causally reduces fraud by 0.73 percentage points (p = 0.013)
- NPCI UPI Ecosystem Statistics — monthly transaction volumes Apr 2023 to Feb 2025
- RBI Annual Report on Banking — fraud case counts and values FY2020-2025
- Bank market shares calibrated to NPCI Q3 2024 data (PhonePe 47%, GPay 34% etc)
FY25 saw a 53% drop in card/internet fraud cases after banks adopted AI/ML detection systems (source: RBI Annual Report 2025).
Python, Pandas, NumPy, SQLite, Scikit-learn, SciPy, statsmodels, FAISS, FastAPI, Uvicorn, Celery, Redis, Docker, Plotly, Streamlit, Matplotlib, Seaborn, OpenPyXL, Power BI, GitHub Actions
upi-fraud-detection/
├── src/
│ ├── data_generator.py # synthetic data generation (1M rows)
│ ├── data_cleaning.py # cleaning + feature engineering
│ ├── db_utils.py # SQLite database + 12 analytical queries
│ ├── fraud_detector.py # z-score, isolation forest, rules, ensemble
│ ├── fraud_similarity.py # FAISS vector similarity search engine
│ ├── causal_inference.py # Difference-in-Differences analysis
│ ├── tasks.py # Celery async task definitions
│ └── report_generator.py # Excel report
├── api.py # FastAPI gateway (8 endpoints)
├── dashboard/
│ └── app.py # Streamlit dashboard (6 pages)
├── data/
│ ├── raw/ # generated CSV
│ ├── processed/ # cleaned, flagged data + causal results
│ ├── faiss_index/ # FAISS index + artifacts
│ ├── real_npci_stats.csv # real NPCI monthly data
│ └── real_rbi_fraud_data.csv # real RBI annual fraud data
├── database/ # SQLite DB
├── reports/ # Excel + Power BI + parallel trends plot
├── screenshots/ # dashboard screenshots
├── tests/
│ ├── test_causal.py # causal inference tests (17 tests)
│ └── test_similarity.py # FAISS similarity tests (12 tests)
├── Dockerfile # containerized API
├── docker-compose.yml # 3-service stack (API + Worker + Redis)
├── .github/workflows/ci.yml # CI pipeline (lint + test + docker build)
├── .dockerignore
├── run_pipeline.py # runs everything end to end (7 steps)
├── requirements.txt
└── README.md
# setup
python -m venv venv
venv\Scripts\activate
# on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt
# run full pipeline (~5-10 min for 1M rows)
python run_pipeline.py
# launch streamlit dashboard
streamlit run dashboard/app.py
# start the API
uvicorn api:app --reload
# then visit http://localhost:8000/docs

# build and start everything (API + Celery Worker + Redis)
docker-compose up -d
# check logs
docker-compose logs -f api
# API available at http://localhost:8000
# Swagger docs at http://localhost:8000/docs
# stop
docker-compose down

pytest tests/ -v

The synthetic data isn't random; it's calibrated against real published numbers:
- NPCI UPI Ecosystem Statistics — bank market shares, volumes, monthly growth
- RBI Annual Report on Trends and Progress of Banking — fraud rates, common patterns
- NPCI monthly reports — transaction volumes, value trends
If I had access to real labeled data or more time:
- Add real-time streaming with Kafka
- Try deep learning (autoencoders) for anomaly detection
- Add network graph analysis to detect fraud rings
- Implement Instrumental Variable (IV) estimation as an alternative to DiD
- Add geographic and time fixed effects to the causal model
- Deploy to AWS ECS with CloudWatch monitoring



