Problem Statement 1: Student Habits vs Academic Performance
Dataset: Student Habits vs Academic Performance
Understanding the impact of lifestyle habits on students’ academic performance is critical for
educators, policymakers, and learners alike. However, real-world data on such sensitive and
multidimensional topics is often inaccessible due to privacy concerns. To address this gap, the
provided simulated dataset offers a comprehensive view of how various lifestyle factors—such as
study hours, sleep patterns, social media usage, diet quality, and mental health—correlate with
academic outcomes, specifically final exam scores.
With 1,000 synthetic student records and over 15 behavioural and academic features, this dataset
enables data-driven exploration and hypothesis testing. It serves as a foundation for machine learning
models, regression analysis, clustering techniques, and data visualization aimed at uncovering key
patterns and predictive factors that influence student success.
Objective:
To analyze and model the relationship between students' lifestyle habits and their academic
performance using synthetic data, with the goal of identifying which factors most significantly predict
or influence final exam scores.
Key Use Cases:
Predicting academic performance using regression or classification models
Identifying student lifestyle clusters using unsupervised learning
Exploring correlations through exploratory data analysis (EDA)
Visualizing the impact of habits such as screen time, sleep, and diet on final exam scores
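As a starting point for the EDA and regression use cases above, the following minimal sketch ranks features by their correlation with the exam score and then fits an ordinary least squares model. It runs on randomly generated placeholder data, not on the provided dataset: the column names (study_hours, sleep_hours, social_media_hours), the helper names, and every coefficient in the generator are assumptions made purely for illustration.

```python
import numpy as np


def rank_by_correlation(X, y, names):
    """Return (feature, Pearson r) pairs, strongest absolute correlation first."""
    pairs = [(name, float(np.corrcoef(X[:, j], y)[0, 1]))
             for j, name in enumerate(names)]
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefs, R^2)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return float(beta[0]), beta[1:], float(r2)


# Placeholder data standing in for the real records.
# All names and numbers below are assumptions, not values from the dataset.
rng = np.random.default_rng(0)
n = 1000
names = ["study_hours", "sleep_hours", "social_media_hours"]
X = np.column_stack([
    rng.uniform(0, 8, n),    # hypothetical daily study hours
    rng.uniform(4, 10, n),   # hypothetical nightly sleep hours
    rng.uniform(0, 6, n),    # hypothetical daily social media hours
])
exam_score = 40 + 6 * X[:, 0] + 2 * X[:, 1] - 3 * X[:, 2] + rng.normal(0, 5, n)

ranking = rank_by_correlation(X, exam_score, names)
intercept, coefs, r2 = fit_ols(X, exam_score)

for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
print("coefficients:", dict(zip(names, np.round(coefs, 2))), "R^2:", round(r2, 3))
```

With the real data, X and exam_score would be read from the dataset's own columns instead of being generated. Correlation and regression coefficients describe association only, so any ranking of "most influential" habits should be treated as a hypothesis to test rather than a causal finding.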
Problem Statement 2: Impact of AI on Digital Media (2020-2025)
Dataset: 🌍 Impact of AI on Digital Media (2020-2025)
As AI-generated content becomes increasingly integrated into industries such as journalism, social
media, entertainment, and marketing, understanding its broader influence is essential for informed
decision-making and responsible innovation. However, the rapid evolution and cross-sectoral nature
of this phenomenon present challenges in tracking trends, assessing public sentiment, and anticipating
regulatory or economic impacts.
This dataset addresses these challenges by compiling structured insights on the proliferation of AI-
generated content across multiple domains. It captures key indicators including public sentiment,
engagement patterns, economic effects, and policy responses over time, offering a multi-faceted view
of how AI-generated media is reshaping content ecosystems.
Objective:
To analyze the societal, economic, and regulatory impacts of AI-generated content across industries,
and to uncover trends, biases, and potential future directions of AI adoption using data-driven
methods.
Key Use Cases:
Sentiment and trend analysis of public perception toward AI content
Forecasting AI content adoption in different sectors
Identifying biases or disparities in how AI-generated content is received
Studying the correlation between AI adoption and economic or regulatory shifts
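The trend and forecasting use cases above could begin with something as simple as a per-sector linear trend. The sketch below fits a straight line to a yearly indicator and extrapolates one year ahead. The sector names, the adoption_pct values, and the linear_trend helper are made-up placeholders chosen for illustration; they are not taken from the dataset, whose actual columns should be substituted.

```python
import numpy as np


def linear_trend(years, values):
    """Fit a straight line to a yearly series; return (forecast function, slope)."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    center = years.mean()  # centre the years to keep the fit well conditioned
    slope, level = np.polyfit(years - center, values, deg=1)

    def forecast(year):
        return float(slope * (year - center) + level)

    return forecast, float(slope)


# Placeholder yearly indicator per sector (assumed numbers, not real data).
years = [2020, 2021, 2022, 2023, 2024, 2025]
adoption_pct = {
    "journalism": [10, 14, 18, 22, 26, 30],
    "marketing": [20, 23, 29, 31, 36, 41],
}

forecasts = {}
slopes = {}
for sector, series in adoption_pct.items():
    forecast, slope = linear_trend(years, series)
    forecasts[sector] = forecast(2026)
    slopes[sector] = slope
    print(f"{sector}: {slope:+.2f} points/year, 2026 estimate {forecasts[sector]:.1f}")
```

A straight line is only a baseline: it ignores saturation (a percentage cannot exceed 100) and any policy shocks, so it is best used as a reference against which richer time-series models are compared.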
Problem Statement 3: Financial Transactions Dataset for Fraud Detection
Dataset: Financial Transactions Dataset for Fraud Detection
Financial fraud remains a major threat to global economies, costing billions annually and growing
steadily more complex. Detecting fraudulent transactions in real time requires sophisticated models
trained on large, realistic datasets, yet access to such high-quality, labelled financial data is often
limited due to privacy, legal, and security concerns.
To overcome this challenge, this dataset of 5 million synthetically generated financial transactions
simulates real-world behaviour and fraud patterns. It includes detailed transaction information,
behavioural analytics (e.g., velocity and anomaly scores), metadata (e.g., device, location, IP), and
comprehensive fraud labels (both binary and multiclass), offering a rich environment for developing
and testing fraud detection solutions.
Objective:
To analyze and model financial transaction data to accurately detect fraudulent activity using binary
and multiclass classification techniques, time-series anomaly detection, and feature-based model
interpretation.
Key Use Cases:
Building and evaluating fraud detection models using supervised learning
Identifying behavioural and transactional anomalies with time-series methods
Engineering features to improve detection accuracy and reduce false positives
Enhancing explainability of fraud prediction models for regulatory compliance
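Because fraud is typically rare, plain accuracy says little about a detector; the trade-off between catching fraud and raising false alarms is usually examined at a chosen score threshold. The sketch below computes precision, recall, and false positive rate for a binary fraud label. The threshold_metrics helper and the six example scores are hypothetical and exist only to illustrate the calculation; model scores and labels from the actual dataset would take their place.

```python
import numpy as np


def threshold_metrics(scores, labels, threshold):
    """Precision, recall, and false positive rate when flagging scores >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)   # True marks a fraudulent transaction
    flagged = scores >= threshold

    tp = int(np.sum(flagged & labels))        # fraud correctly flagged
    fp = int(np.sum(flagged & ~labels))       # legitimate transactions flagged
    fn = int(np.sum(~flagged & labels))       # fraud missed
    tn = int(np.sum(~flagged & ~labels))      # legitimate transactions passed

    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Tiny illustrative example (assumed scores, not model output on the dataset).
scores = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20]
labels = [0, 0, 1, 1, 1, 0]

for t in (0.3, 0.5):
    print(t, threshold_metrics(scores, labels, t))
```

Lowering the threshold from 0.5 to 0.3 in this toy example catches the remaining fraud case at the cost of one false alarm, which is the kind of trade-off the feature-engineering use case aims to improve.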
Problem Statement 4: HR Analytics: Job Change of Data Scientists
Dataset: HR Analytics: Job Change of Data Scientists
Recruiting the right talent is critical for companies operating in high-demand fields like Big Data and
Data Science. However, training large numbers of candidates is both time-consuming and
resource-intensive, especially when a significant portion of them may not intend to work for the
company after training. To optimize recruitment, training, and talent pipeline planning, a company that
trains candidates in this way seeks to predict which of them are genuinely interested in employment
after completing their courses.
This dataset captures candidate-level information including demographics, education, work
experience, and enrollment details, and is designed to support predictive modeling of a candidate’s
likelihood to seek new employment. It helps organizations improve hiring efficiency and course
design, and it contributes to HR research by identifying key factors that influence job-seeking
behaviour.
Objective:
To develop a predictive model that estimates the probability of a candidate seeking new job
opportunities after training, based on available demographic, educational, and professional features—
thereby assisting the company in identifying serious candidates likely to join post-training.
Key Use Cases:
Predicting candidate intent to change jobs using classification models
Identifying influential features that affect job-seeking behaviour
Segmenting candidates for personalized training or career planning
Reducing training costs by filtering out unlikely hires early in the pipeline
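Since the objective is a probability rather than a hard yes/no label, logistic regression is a natural baseline for the classification use case above. The sketch below fits one by plain gradient descent on generated placeholder data. The feature names (experience_years, training_hours), the generating coefficients, and the helper functions are all assumptions for illustration; the real dataset's features, including its categorical ones once encoded, would be used instead.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Logistic regression by batch gradient descent; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)   # gradient of the mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)             # gradient of the mean log-loss w.r.t. b
    return w, b


# Placeholder candidates (assumed features and relationship, not real data).
rng = np.random.default_rng(1)
n = 2000
experience_years = rng.uniform(0, 20, n)
training_hours = rng.uniform(0, 100, n)
true_logit = 1.0 - 0.2 * experience_years + 0.01 * training_hours
job_change = (rng.uniform(size=n) < sigmoid(true_logit)).astype(float)

# Standardize features so a single learning rate suits both columns.
raw = np.column_stack([experience_years, training_hours])
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

w, b = fit_logistic(X, job_change)
proba = sigmoid(X @ w + b)   # estimated probability of seeking a new job

print("weights (standardized features):", np.round(w, 2), "bias:", round(b, 2))
print("mean predicted probability:", round(float(proba.mean()), 3))
```

On standardized inputs, the sign and size of each weight give a first, rough indication of which features push the predicted probability up or down, which connects to the use case of identifying influential features.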