Proactive Disaster Detection
Sandhya L, Prof., Computer Science and Engineering (Data Science)
Manoj M, Computer Science and Engineering (Data Science)
Swarna Lohith, Computer Science and Engineering (Data Science)
Sanjay S, Computer Science and Engineering (Data Science)
Abstract—Floods are one of nature's most catastrophic calamities, causing irreversible and immense damage to human life, agriculture, infrastructure and socio-economic systems. Several studies on flood catastrophe management and flood forecasting systems have been conducted. The accurate prediction of the onset and progression of floods in real time remains challenging. To estimate water levels and velocities across a large area, it is necessary to combine data with computationally demanding flood propagation models. This paper aims to reduce the extreme risks of this natural disaster and also contributes to policy suggestions by providing flood predictions using different machine learning models. This research uses Binary Logistic Regression, K-Nearest Neighbor (KNN), Support Vector Classifier (SVC) and Decision Tree Classifier to provide an accurate prediction. With the outcomes, a comparative analysis is conducted to understand which model delivers better accuracy.
I. INTRODUCTION physical processes of rainfall-runoff relationships and river
Floods are one of the most destructive natural disasters, resulting in severe loss of life and damage to infrastructure and socio-economic systems. Floods have become increasingly frequent and intense globally, driven mainly by climate change, deforestation, urbanization, and a lack of proper regulation in flood-prone areas. As a state in India with extraordinary geographical and climatic conditions, Kerala is especially vulnerable to floods. Its location along the Western Ghats, its numerous rivers, and intense monsoon showers result in recurrent flooding. The havoc wreaked by the catastrophic 2018 floods brought into sharp relief the urgent need for advanced flood prediction systems that reduce risk and improve readiness. Traditional hydrological models based on the physical processes of rainfall-runoff relationships and river flow dynamics have long been the backbone of flood prediction. However, these models are often computationally expensive, require significant computational resources, and lack real-time forecasting capabilities; they also fail to represent modern climatic conditions in their full complexity. Machine learning has emerged as a powerful alternative to these approaches, offering data-driven methods to identify patterns and predict floods with high accuracy.

This paper centres on the development of a flood prediction system using ML models, based on a dataset from Kerala covering the years 1900 to 2018. The dataset contains historical records of rainfall, recorded flood occurrences, and other hydrological parameters. Using ML algorithms including Binary Logistic Regression, SVC, KNN, and DTC, this research aims to discover the trends and thresholds that lead to flood events.
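As an illustration of how such a dataset could be loaded and framed as a binary classification problem, consider the following sketch. The file name kerala.csv, the monthly rainfall columns, and the YES/NO flood annotation are assumptions made for illustration rather than a description of the actual data files.

```python
# Minimal sketch of loading the Kerala rainfall records described above.
# File name and column names are assumptions made for illustration.
import pandas as pd

df = pd.read_csv("kerala.csv")  # hypothetical export of the 1900-2018 records

month_cols = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
X = df[month_cols]                                      # monthly rainfall as features
y = df["FLOODS"].str.strip().map({"YES": 1, "NO": 0})   # assumed YES/NO flood label per year

print(X.shape)
print(y.value_counts())
```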
II. LITERATURE REVIEW

Flood prediction has been extensively studied using historical data, statistical techniques, and machine learning models to enhance disaster preparedness and response strategies. Ruslan et al. (2014) utilized a Nonlinear Autoregressive Network with Exogenous Inputs (NNARX) model to predict flood levels in Kuala Lumpur, achieving 73.54% accuracy. The study highlighted the importance of upstream river water levels as a key input variable. Similarly, Adnan et al. (2012) used Back Propagation Neural Networks (BPNN) to forecast water levels in Johor, Malaysia. Despite initial limitations, incorporating an Extended Kalman Filter (EKF) improved prediction accuracy, demonstrating the potential of hybrid models in flood forecasting. Sankaranarayanan et al. (2019) applied Support Vector Machines (SVM) to classify flood risks using weather parameters, achieving high classification accuracy. This research underscored the ability of SVM models to handle complex datasets with non-linear relationships, making them suitable for flood prediction. In Bangladesh, Khatun (2020) analyzed rainfall and flood datasets to develop a Decision Tree Classifier (DTC) for flood prediction. The study found that DTC effectively explained decision-making processes through interpretable rules but noted the need to address overfitting in certain cases. Ahmed et al. (2020) conducted a comprehensive review of machine learning applications in flood prediction, emphasizing the advantages of long-term datasets in improving model robustness. The study concluded that combining multiple data sources, such as rainfall, river discharge, and land-use data, enhances predictive capabilities. Recent research has also explored the applicability of shorter data timelines for more accurate predictions. For example, studies have shown that focusing on the last decade of rainfall data often improves accuracy, as it accounts for recent climatic changes. This approach aligns with findings by Brown et al. (2021), who emphasized the need for adaptive models that incorporate evolving flood thresholds. This literature demonstrates the growing importance of machine learning in flood prediction and provides a foundation for leveraging long-term datasets like Kerala's (1900–2018) to enhance accuracy and resilience in disaster management.

The objectives of this work are:
Develop a robust machine learning-based system to predict flood occurrences in Kerala using historical data from 1900 to 2018 with high precision and reliability.
Data Preprocessing and Feature Engineering: Clean, preprocess, and engineer features from raw rainfall and flood data to ensure high-quality inputs for machine learning models.
Apply Multiple Machine Learning Algorithms: Compare and evaluate a number of algorithms, such as Logistic Regression, Decision Tree Classifier, SVM, Random Forest, and KNN, to determine the most effective model (a sketch of this comparison is given below).
Make predictions region-specific for Kerala by applying its climatic and geographical characteristics in the modeling procedure.
Real-Time Prediction System: Design a scalable architecture that can incorporate real-time rainfall data for instant flood forecasts and alerts.
Optimization of Model Performance.
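The algorithm comparison called for above could be set up along the following lines. This is an illustrative sketch rather than the exact experimental code; it assumes the feature matrix X and label vector y prepared as in the earlier data-loading sketch.

```python
# Sketch of the model-comparison objective: train several classifiers on the
# rainfall features and report held-out accuracy for each.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVC": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```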
By fine-tuning BERT on the FIN-FACT dataset and evaluating explanation quality using both linguistic and semantic metrics, we aim to set a benchmark for financial misinformation detection systems.

III. METHODOLOGY

A. DATASET

Overview of the FIN-FACT Dataset
The FIN-FACT dataset is specifically designed for the task of financial misinformation detection. It comprises a diverse collection of financial claims, each annotated with one of four labels:
True: Claims supported by factual financial data or verified evidence.
False: Claims refuted by verified data or proven to be misleading.
Not Enough Information (NEI): Claims that cannot be verified due to a lack of sufficient evidence.
Neutral: Claims that are neither explicitly true nor false but reflect subjective opinions or general statements.
Each claim is paired with supporting evidence or explanations that provide additional context. This makes the dataset suitable not only for classification tasks but also for generating and evaluating explanations.

Preprocessing the Dataset
To prepare the dataset for fine-tuning BERT, several preprocessing steps were undertaken:
Label Mapping: String labels were mapped to numerical values (e.g., True: 0, False: 1, NEI: 2, Neutral: 3) to ensure compatibility with the classification model.
Data Cleaning: Claims were checked for inconsistencies, such as redundant whitespaces or erroneous entries, to ensure high-quality input data.
Splitting: The dataset was split into training (80%) and testing (20%) subsets to evaluate model performance. Stratified sampling was used to maintain label distribution across subsets.
Tokenization: Each claim was tokenized using BERT's tokenizer, which breaks down text into subword tokens. To handle varying claim lengths, tokenization was performed with the following parameters:
Maximum Sequence Length: 64 tokens (sufficient for most claims).
Truncation: Claims exceeding 64 tokens were truncated.
Padding: Claims shorter than 64 tokens were padded to maintain uniform input size.
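A minimal sketch of these preprocessing steps is shown below. The label mapping, the stratified 80/20 split, and the 64-token tokenization settings follow the description above; the file name finfact.csv and the column names claim and label are assumptions made for illustration.

```python
# Sketch of label mapping, stratified splitting, and BERT tokenization.
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

data = pd.read_csv("finfact.csv")  # hypothetical export of the FIN-FACT claims

label_map = {"True": 0, "False": 1, "NEI": 2, "Neutral": 3}
data["label_id"] = data["label"].map(label_map)
data["claim"] = data["claim"].str.strip()  # basic cleaning of stray whitespace

train_df, test_df = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["label_id"])

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_enc = tokenizer(
    train_df["claim"].tolist(),
    max_length=64,          # maximum sequence length
    truncation=True,        # truncate longer claims
    padding="max_length",   # pad shorter claims to a uniform size
    return_tensors="pt",
)
```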
Dataset Challenges: The dataset posed several challenges that influenced model design and training:
Imbalanced Labels: NEI and Neutral labels were under-represented compared to True and False claims. To address this, techniques such as weighted loss functions and oversampling were considered.
Ambiguity in Explanations: Human-generated explanations occasionally lacked clarity or contained inconsistencies, which impacted the evaluation of generated explanations.
B. PROPOSED METHODOLOGY
Pre-trained BERT: At the core of the proposed framework
is BERT (Bidirectional Encoder Representations from
Transformers), a transformer-based model pre-trained on large-
scale corpora like Wikipedia and BooksCorpus. Its bidirectional
attention mechanism enables it to capture contextual nuances,
making it ideal for financial claims where subtle differences in
phrasing can alter meaning.
Fine-Tuning for Classification: A classification head was
added on top of BERT for fine-tuning. This consisted of:
A fully connected layer to map the pooled [CLS] token output
to four logits corresponding to the class labels. A softmax
activation function to generate probabilities for each class. The
fine-tuning process allowed BERT to adapt to the domain-
specific characteristics of the FIN-FACT dataset.
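One way to realize this setup with the HuggingFace library is sketched below: a pre-trained BERT is loaded with a four-way sequence classification head, and a softmax over its logits yields class probabilities. The example claim is invented for illustration, and this is a sketch of the described architecture rather than the exact training code.

```python
# Sketch: BERT with a fully connected head mapping the pooled [CLS] output
# to four logits (one per label), followed by a softmax.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)  # linear head producing one logit per class

inputs = tokenizer("Company X doubled its revenue last quarter.",  # invented example
                   max_length=64, truncation=True, padding="max_length",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, 4)
    probs = torch.softmax(logits, dim=-1)     # per-class probabilities
print(probs)
```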
Explainability via Attention Mechanisms: To generate
explanations for predictions, we leveraged the attention weights
within BERT. These weights indicate the importance assigned
to different parts of the input text during classification,
providing insights into the model’s reasoning.
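A sketch of how such attention weights can be extracted is given below. Averaging the heads of the final layer and reading off the attention paid by the [CLS] token is one simple choice among several, not necessarily the exact procedure used here.

```python
# Sketch: request attention weights from BERT and inspect which input tokens
# receive the most attention from the [CLS] position in the last layer.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4, output_attentions=True)

inputs = tokenizer("Company X will double its stock price by next quarter.",
                   return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
last_layer = outputs.attentions[-1]
cls_attention = last_layer.mean(dim=1)[0, 0]    # average heads; attention from [CLS]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
top = torch.topk(cls_attention, k=5)
print([tokens[i] for i in top.indices])         # highest-attention tokens
```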
Training Setup:
Training Configuration: The model was fine-tuned using the
HuggingFace Transformers library. Key hyperparameters included:
Learning Rate: 2e-5, selected for stability during fine-tuning.
Batch Size: 16 for both training and evaluation to balance
computational efficiency and performance.
Number of Epochs: 3, sufficient for convergence given the
dataset size.
Evaluation Strategy: Performance was evaluated after each
epoch using the test set. The AdamW optimizer was used to
update model weights, with a linear learning rate scheduler for
gradual warmup.
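This configuration can be expressed with the HuggingFace Trainer API roughly as follows. train_dataset and test_dataset are assumed to be PyTorch datasets built from the tokenized claims and their numeric labels; AdamW is the Trainer's default optimizer, and the warmup_ratio value is an assumption, since the exact warmup length is not stated.

```python
# Sketch of the training configuration described above (HuggingFace Trainer).
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./finfact-bert",        # hypothetical output directory
    learning_rate=2e-5,                 # stable fine-tuning rate
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",        # evaluate after each epoch
    lr_scheduler_type="linear",         # linear schedule with warmup
    warmup_ratio=0.1,                   # assumed warmup fraction
)

trainer = Trainer(
    model=model,                        # the BertForSequenceClassification model
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()
trainer.save_model("./finfact-bert")       # save the fine-tuned model locally
tokenizer.save_pretrained("./finfact-bert")
```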
Hardware and Tools:
Training was conducted on an NVIDIA GPU to expedite
computations. The fine-tuned model and tokenizer were saved
locally for deployment.

Fig. 1. Proposed Model
Handling Imbalanced Data: To address the imbalanced label distribution, the following strategies were employed:
Class Weights: During training, loss weights were adjusted inversely proportional to class frequencies.
Data Augmentation: Synonyms and paraphrases were introduced for underrepresented classes to increase diversity.
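An inverse-frequency weighting of this kind might be computed as follows. train_labels is assumed to be the array of numeric labels (0–3) for the training split, and the weighted loss would replace the default cross-entropy during fine-tuning.

```python
# Sketch: class weights inversely proportional to class frequency,
# used in a weighted cross-entropy loss.
import numpy as np
import torch
from torch import nn

counts = np.bincount(train_labels, minlength=4)       # per-class frequencies
weights = counts.sum() / (len(counts) * counts)       # inverse-frequency weights
class_weights = torch.tensor(weights, dtype=torch.float)

loss_fn = nn.CrossEntropyLoss(weight=class_weights)   # weighted loss for training
```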
Prediction and Explanation:
Claim Classification: When a claim is passed to the model, it is tokenized and converted into input embeddings. The embeddings are processed through BERT and the classification head to generate logits. The class with the highest probability (logit) is selected as the predicted label.
Explanation Generation: Explanations are generated by analyzing attention weights. For each prediction, tokens with the highest attention scores are identified. These tokens are mapped back to the original claim to highlight key phrases influencing the prediction. The explanation is framed as a textual summary, emphasizing these key phrases.
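A sketch of this prediction step, reusing the tokenizer and fine-tuned model from the earlier sketches, could look as follows; the ID2LABEL mapping simply inverts the label mapping defined during preprocessing.

```python
# Sketch: classify a single claim and return the predicted label with its probability.
import torch

ID2LABEL = {0: "True", 1: "False", 2: "NEI", 3: "Neutral"}

def classify_claim(claim):
    inputs = tokenizer(claim, max_length=64, truncation=True,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    pred = int(torch.argmax(probs))           # highest-scoring class
    return ID2LABEL[pred], float(probs[pred])

label, confidence = classify_claim("The stock price of Company X will double by next quarter.")
print(label, round(confidence, 3))
```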
IV. RESULTS AND DISCUSSION

Evaluation Metrics
Classification Metrics:
Accuracy: Measures the proportion of correctly classified claims.
Precision: Evaluates the model's ability to avoid false positives.
Recall: Assesses the model's ability to identify all relevant claims.
Micro-F1 Score: Combines precision and recall for a holistic performance measure.
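These quantities can be computed with scikit-learn as sketched below. y_true and y_pred are assumed to be the numeric test labels and the model's predictions; the macro averaging used for precision and recall is an assumption, since only the F1 score is explicitly micro-averaged above.

```python
# Sketch of the classification metrics listed above.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy  = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")  # averaging choice assumed
recall    = recall_score(y_true, y_pred, average="macro")     # averaging choice assumed
micro_f1  = f1_score(y_true, y_pred, average="micro")

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Micro-F1:  {micro_f1:.3f}")
```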
Explanation Metrics
ROUGE: Measures lexical overlap between generated explanations and reference explanations. ROUGE-1, ROUGE-2, and ROUGE-L are reported.
BERTScore: Evaluates semantic similarity between generated and reference explanations using BERT embeddings.
BARTScore: Assesses explanation quality using a pre-trained BART model, which considers fluency and informativeness.
These metrics ensure that both the classification and interpretability aspects of the model are rigorously evaluated.
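One possible way to compute the first two of these metrics is with the rouge-score and bert-score packages, as sketched below; the tooling actually used is not named here, and BARTScore, which needs its own pre-trained scorer, is omitted. generated and references are assumed to be parallel lists of explanation strings.

```python
# Sketch of explanation-quality scoring with ROUGE and BERTScore.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], generated[0])   # per-pair ROUGE scores
print({k: round(v.fmeasure, 3) for k, v in rouge.items()})

P, R, F1 = bert_score(generated, references, lang="en")
print("BERTScore F1:", round(F1.mean().item(), 3))
```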
The high performance metrics validate the robustness of our model. Explanations scored well on ROUGE, BERTScore, and BARTScore, underscoring the framework's interpretability. However, performance varied for Neutral and Not Enough Information labels, suggesting room for improvement.
TABLE I
RESULT TABLE

Evaluation Metric    Value
Accuracy             0.892
Precision            0.875
Recall               0.868
Micro-F1             0.871
Explanation Quality
ROUGE Scores: ROUGE-1: 78.4, ROUGE-2: 65.3, ROUGE-L: 73.9
BERTScore: Precision: 85.2, Recall: 84.7, F1: 84.9
BARTScore: 81.4

Case Studies
Example:
Claim: "The stock price of Company X will double by next quarter."
Prediction: False
Explanation: "The claim was classified as 'False' due to a lack of supporting evidence in recent financial reports."
V. CONCLUSION
In this paper, we presented a comprehensive framework for detecting and explaining financial misinformation. The flood prediction system for Kerala, developed using historical rainfall and flood occurrence data (1900–2018) and machine learning algorithms, has successfully achieved its objective of providing accurate and reliable flood predictions. By integrating multiple models, such as Logistic Regression, Decision Tree Classifier, Random Forest, SVC, and KNN, the system allows users to identify and select the most effective algorithm for predicting floods. The system's data preprocessing, feature engineering, and real-time prediction capabilities ensure that users receive actionable insights with minimal effort. Visualization tools further enhance the usability of the system by providing clear and interpretable results, enabling informed decision-making for disaster preparedness and mitigation. The user-centric design of the system, implemented within Google Colab, ensures accessibility for users with varying technical expertise. The seamless process for uploading data, training models, and visualizing results simplifies complex machine learning workflows, making it a practical tool for researchers and policymakers alike. As the system is scaled for larger datasets and more frequent usage, future improvements will include incorporating additional data sources (e.g., river discharge and soil moisture), exploring advanced algorithms like LSTM for temporal pattern analysis, and optimizing computational efficiency. These enhancements will further solidify the system's role as a critical tool for managing flood risks in Kerala.
REFERENCES
[1] Ruchansky, N., Seo, S., & Liu, Y. ”CSI: A Hybrid Deep
Model for Fake News Detection.” CIKM, 2017.
[2] Mihalcea, R., & Strapparava, C. ”The Lie Detector:
Explorations in the Automatic Recognition of Deceptive
Language.” ACL, 2009.
[3] Hochreiter, S., & Schmidhuber, J. ”Long Short-Term
Memory.” Neural Computation, 1997.
[4] Vaswani, A., et al. ”Attention Is All You Need.”
NeurIPS, 2017.
[5] Devlin, J., et al. ”BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding.”
NAACL-HLT, 2019.
[6] Tetlock, P. C. ”Giving Content to Investor Sentiment:
The Role of Media in the Stock Market.” The Journal of
Finance, 2007.
[7] Bollen, J., Mao, H., & Zeng, X. ”Twitter Mood Predicts
the Stock Market.” Journal of Computational Science, 2011.
[8] Wang, W. Y. ”Liar, Liar Pants on Fire: A New
Benchmark Dataset for Fake News Detection.” ACL, 2017.
[9] Shu, K., et al. ”FakeNewsNet: A Data Repository with
News Content, Social Context, and Dynamic Information for
Fake News Research.” Big Data, 2020.
[10] ”FIN-FACT Dataset.” [Online]. Available:
https://coling2025fmd.thefin.ai
[11] Ribeiro, M. T., Singh, S., & Guestrin, C. ”Why Should
I Trust You? Explaining the Predictions of Any Classifier.”
KDD, 2016.
[12] Lundberg, S. M., & Lee, S.-I. ”A Unified Approach to
Interpreting Model Predictions.” NeurIPS, 2017.
[13] Lin, C.-Y. ”ROUGE: A Package for Automatic
Evaluation of Summaries.” ACL, 2004.
[14] Zhang, T., et al. ”BERTScore: Evaluating Text
Generation with BERT.” ICLR, 2020.