Disease Prediction with Machine Learning
A PROJECT REPORT
Submitted by
S GOKUL (210519205052)
of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
MAY 2023
ANNA UNIVERSITY :: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report “MULTIPLE DISEASE PREDICTION
SIGNATURE SIGNATURE
Palanchur, Chennai-600123. Palanchur, Chennai-600123.
ACKNOWLEDGEMENT
At the outset from the core of our heart, we thank the LORD ALMIGHTY for
the manifold blessings showered on us and strengthening us to complete our
project work without any hurdles.
We thank our parents, our family members and friends for their moral
support.
ABSTRACT
TABLE OF CONTENTS
3 SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
3.1.1 DISADVANTAGES
3.2 PROPOSED SYSTEM
3.2.1 ADVANTAGES
3.3 SYSTEM REQUIREMENTS
4 SYSTEM DESIGNS
4.1 ARCHITECTURE DIAGRAM
4.2 MODULES
4.3 COLLECTION OF DATA
4.4 PRE-PROCESSING THE DATA
4.4.1 FORMATTING
4.4.2 CLEANING
4.4.3 SAMPLING
4.5 EXTRACTION OF FEATURES
4.6 EVALUATING THE MODEL
4.7 DATA FLOW DIAGRAM
4.8 UML DIAGRAM
4.8.1 USE CASE DIAGRAM
4.8.2 CLASS DIAGRAM
4.8.3 SEQUENCE DIAGRAM
4.8.4 ACTIVITY DIAGRAM
5 IMPLEMENTATION
5.1 DOMAIN SPECIFICATION
5.1.1 MACHINE LEARNING
5.1.2 MACHINE LEARNING VS TRADITIONAL PROGRAMMING
5.1.3 SUPERVISED LEARNING
5.1.4 UNSUPERVISED LEARNING
5.1.5 REINFORCEMENT LEARNING
5.2 TENSORFLOW
5.2.1 TENSORFLOW ARCHITECTURE
5.3 PYTHON OVERVIEW
5.4 ANACONDA NAVIGATOR
6 TESTING
6.1 TESTING
7 RESULT
8 CONCLUSION AND DISCUSSION
8.1 CONCLUSION
8.2 FUTURE WORK
9 APPENDIX
CHAPTER 1
INTRODUCTION
1.1 OBJECTIVE
The objective of using Machine Learning (ML) for health education during
an infectious disease outbreak is to create effective and targeted educational
materials that can help prevent the spread of the disease. By analyzing the data
from social media conversations and other sources, ML algorithms can identify
patterns and insights that help understand the concerns and questions of different
populations. This information can then be used to create personalized educational
materials that specifically address those concerns and provide relevant information.
1.2 METHODOLOGY
DATA COLLECTION : Gather relevant data from sources such as social media
platforms, online forums, news articles, and other relevant sources related to the
infectious disease outbreak. Ensure that the data collected is representative
of different populations and contains diverse perspectives.
DATA ANALYSIS : Perform exploratory data analysis to gain insights into
the collected data. This analysis can involve statistical measures,
visualizations, and topic modeling techniques to understand the trends,
patterns, and common themes present in the data. This step helps in
understanding the concerns, questions, and misconceptions of different
populations.
LABELING AND ANNOTATION : Annotate the data by assigning
appropriate labels or categories to different pieces of text. This can involve
classifying the text into predefined categories, identifying key topics,
sentiment labeling, or any other relevant annotations based on the project's
objectives. The annotation process may require domain expertise and can be
done manually or through automated techniques.
MODEL DEVELOPMENT : Select appropriate ML algorithms and
develop models based on the project objectives. This can include techniques
such as classification algorithms (e.g., logistic regression, random forest,
support vector machines), clustering algorithms (e.g., K-means, hierarchical
clustering), or recommendation algorithms (e.g., collaborative filtering).
Consideration should be given to model performance, interpretability, and
scalability.
MODEL TRAINING AND EVALUATION : Split the annotated data into
training and evaluation sets. Train the ML models using the training data and
fine-tune them to optimize their performance. Evaluate the models using
appropriate evaluation metrics such as accuracy, precision, recall, F1-score,
or area under the receiver operating characteristic curve (AUC-ROC). Iterate
on the model development and evaluation process to improve the models'
performance.
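As a concrete illustration of the model development, training, and evaluation steps
described above, the following minimal sketch uses scikit-learn on a labelled pandas
DataFrame. The file name dataset.csv and the label column name disease are
placeholder assumptions, not part of the project's actual data.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load the annotated data; the file name is a placeholder assumption.
df = pd.read_csv("dataset.csv")
X = df.drop(columns=["disease"])   # feature columns
y = df["disease"]                  # hypothetical label column

# Hold out part of the annotated data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred, average="weighted"))

The same split and metrics can be reused when iterating on alternative models such
as logistic regression or support vector machines.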
1.3 SCOPE OF FUTURE WORK
Gather relevant data from sources such as social media platforms, online
forums, news articles, and other relevant sources.
Clean and preprocess the collected data by removing noise, irrelevant
information, and duplicate entries.
Apply techniques such as text normalization, stop word removal, and
sentiment analysis to enhance the quality of the data, as illustrated in the
sketch after this list.
Perform statistical analysis and visualizations to gain insights into the
collected data.
Identify common concerns, questions, and misconceptions related to the
infectious disease outbreak.
Assign appropriate labels or categories to the collected text data based on the
project objectives.
Perform manual or automated annotation processes to label the data for
training the ML models.
Ensure the quality and accuracy of the annotated data.
Utilize the trained ML models to generate personalized educational
materials.
Create informative and engaging content such as infographics, videos,
articles, or interactive platforms.
Ensure that the educational materials effectively address concerns, provide
accurate information, and promote healthy behaviors.
Deploy the generated educational materials and monitor their effectiveness.
Collect user feedback, track engagement metrics, and conduct surveys or
interviews to evaluate the impact of the materials.
Incorporate user feedback to refine the ML models and improve the
educational materials.
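A minimal sketch of the text normalization and stop-word removal mentioned above is
given below; the stop-word list and the sample post are illustrative assumptions only,
not the project's actual vocabulary or data.

import re

STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in", "for"}

def normalize(text):
    text = text.lower()                       # case normalization
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("The outbreak of an infectious disease is spreading in the city!"))
# ['outbreak', 'infectious', 'disease', 'spreading', 'city']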
CHAPTER 2
LITERATURE SURVEY
Text Mining and Social Media Analytics for Improved Health Management:
A Review by C. Castillo et al. (2019): This review article focuses on text mining
and social media analytics for health management. It discusses the use of ML and
NLP techniques to analyze social media conversations and extract valuable insights
for public health interventions, including disease education and prevention
strategies.
CHAPTER 3
SYSTEM ANALYSIS
3.1.1 DISADVANTAGES
Axillary,
Deep learning,
Electronic nose,
Feature extraction,
Infectious respiratory disease, stacked.
We will use the input to read the dataset, clean it, and then check whether there
are any null values. We then perform feature engineering, fit the model, and
forecast the results. Machine learning, Natural Language Processing (NLP), and
other applications of artificial intelligence are required for making these
predictions. The purpose of this study is to forecast the onset of infectious
disease utilizing two machine learning techniques, namely data visualization and
prediction algorithms based on neural networks, recurrent neural networks, and
boosting.
3.2.1 ADVANTAGES
3.3 SYSTEM REQUIREMENTS
Hardware Requirements
Software Requirements
CHAPTER 4
SYSTEM DESIGNS
4.2 MODULES
COLLECTION OF DATA
PRE-PROCESSING THE DATA
EXTRACTION OF FEATURES
EVALUATING THE MODEL
4.3 COLLECTION OF DATA
Data collection is a process that gathers information on health education
from a variety of sources, which is then utilised to create machine learning
models. A set of cervical cancer data with features is the type of data used in this
work. The selection of the subset of all accessible data that you will be working
with is the focus of this stage. Ideally, ML challenges begin with a large amount of
data (examples or observations) for which you already know the desired solution.
Labelled data is information for which you already know the desired outcome.
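As a small illustration of this step, the collected labelled data could be loaded and
inspected with pandas as follows; the file name dataset.csv is a placeholder assumption.

import pandas as pd

# Load the collected, labelled data; the file name is a placeholder.
df = pd.read_csv("dataset.csv")
print(df.shape)            # number of examples and attributes
print(df.head())           # first few labelled records
print(df.isnull().sum())   # missing values per attribute, checked before modelling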
4.4 PRE-PROCESSING THE DATA
Format, clean, and sample from your chosen data to organise it. There are
three typical steps in data pre-processing:
4.4.1 FORMATTING
It's possible that the format of the data you've chosen is not one that allows
you to deal with it. The data may be in a proprietary file format and you would like
it in a relational database or text file, or the data may be in a relational database
and you would like it in a flat file.
4.4.2 CLEANING
Data cleaning is the process of removing or replacing missing data. There can be
data instances that are incomplete and lack the information you think you need to
address the issue. These instances might need to be removed.
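A minimal cleaning sketch, assuming the pandas DataFrame df loaded in the data
collection step; the 50% threshold and the median fill are arbitrary illustrative choices.

# Drop rows that are missing most of their fields, then fill the remaining
# numeric gaps with the column median.
df = df.dropna(thresh=int(0.5 * df.shape[1]))
df = df.fillna(df.median(numeric_only=True))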
4.4.3 SAMPLING
You may have access to much more data than you actually need that has
been carefully chosen. Algorithms may require more compute and memory to run
as well as take significantly longer to process larger volumes of data. You can
choose a smaller representative sample of the chosen data, which may be much
faster for exploring and testing ideas, rather than thinking about the complete
dataset.
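For example, a smaller representative sample can be drawn with pandas; the 10%
fraction is an arbitrary illustrative choice.

# Work with a random 10% subset of df while exploring and testing ideas.
sample_df = df.sample(frac=0.1, random_state=42)
print(len(sample_df), "of", len(df), "records kept for exploration")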
4.5 EXTRACTION OF FEATURES
The next step is feature extraction, a process of attribute reduction.
Feature extraction actually transforms the attributes, as opposed to feature
selection, which ranks the existing attributes according to their predictive
relevance. The original attributes are linearly combined to generate the
transformed attributes, or features. Finally, the classifier algorithm is used to
train our models. We make use of the acquired labelled dataset, and the models
will be assessed using the remaining labelled data we have. The pre-processed
data was categorised using a few machine learning methods, and random forest
classifiers were selected.
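The sketch below shows one common way to realize this step, assuming the train/test
split from the earlier training sketch; PCA stands in here as an example of linearly
combining the original attributes, and the number of components is an arbitrary choice.

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Linearly combine the original attributes into a smaller set of features,
# then train the selected random forest classifier on them.
clf = make_pipeline(PCA(n_components=5),
                    RandomForestClassifier(n_estimators=100, random_state=42))
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))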
3. Pre-process the dataset before using it.
4. Distinguish training from testing data.
5. Analyse the testing dataset using the classification algorithm after training the
model with training data.
6. You will receive results as accuracy metrics at the end.
Dataset Collection → Pre-processing → Random Selection → Trained & Testing Dataset
LEVEL 1: Dataset Collection → Pre-processing → Apply Algorithm → Feature Extraction
LEVEL 2: Classify the Dataset → Accuracy of Result → Prediction of Infectious Disease Outbreak → Finalize the Accuracy of Infectious Disease Outbreak
4.8 UML DIAGRAM
Unified Modelling Language (UML) is used to specify, visualize, modify,
build, and document the artefacts of object-oriented software-intensive systems
under development. UML provides a standard way to visualize a system's
architectural blueprint, including elements such as:
● Actors
● Business processes
● (Logical) components
● Activities
● Programming language statements
● Database schemas
● Reusable software components
FIG 4.8.1 USE CASE DIAGRAM
FIG 4.8.2 CLASS DIAGRAM
FIG 4.8.3 SEQUENCE DIAGRAM
FIG 4.8.4 ACTIVITY DIAGRAM
CHAPTER 5
IMPLEMENTATION
5.1 DOMAIN SPECIFICATION
5.1.1 MACHINE LEARNING
Machine Learning is a system that can learn from example through self-improvement
and without being explicitly coded by programmers. The breakthrough is
based on the idea that machines can learn independently from data (that is,
samples) to produce accurate results. Machine learning combines data with
statistical tools to predict output. These results are used by companies to generate
actionable insights. Machine learning is closely related to data mining and
Bayesian predictive modelling. A machine takes data as input and uses an
algorithm to create a response. A typical machine learning task is to provide a
recommendation. For those who have a Netflix account, all recommendations of
movies or series are based on the user's historical data. Tech companies are using
unsupervised learning to improve the user experience by personalizing
recommendations.
Machine learning is also used for a variety of tasks like fraud detection,
predictive maintenance, portfolio optimization, automating tasks and so on.
Machines learn by
discovering patterns. This discovery is thanks to data. It's important for data
scientists to choose carefully which data to make available to machines. A list of
attributes used to solve a problem is called a feature vector. A feature vector can be
thought of as a subset of data used to address a problem.
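For instance, a feature vector for one record might look like the following; the
attribute names and values are purely illustrative and not taken from the project's
dataset.

# One example's attributes collected into a feature vector.
feature_vector = {"age": 45, "fever": 1, "cough": 1, "days_symptomatic": 3}
print(list(feature_vector.values()))   # [45, 1, 1, 3] is the vector fed to the model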
5.1.2 MACHINE LEARNING VS TRADITIONAL PROGRAMMING
Traditional programming differs significantly from machine learning. In
traditional programming, a programmer codes all the rules in consultation with an
expert in the industry for which software is being developed. The machine
executes the output after the logical statements. As your system becomes more
complex, you will need to create more rules. Similarly, the odds of success in
unfamiliar situations are lower than in known ones.
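To make the contrast concrete, the following sketch places a hand-written rule next to
a model that learns a similar decision from labelled examples; the temperature
threshold and the toy data are illustrative assumptions.

from sklearn.linear_model import LogisticRegression

def rule_based_flag(temperature):
    # Traditional programming: the rule is written explicitly by the programmer.
    return temperature >= 38.0

# Machine learning: the rule is inferred from labelled examples instead.
X = [[36.5], [37.0], [38.2], [39.1], [37.4], [40.0]]
y = [0, 0, 1, 1, 0, 1]
model = LogisticRegression().fit(X, y)
print(rule_based_flag(38.5), model.predict([[38.5]])[0])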
The following bullet points sum up the straightforward nature of machine learning
programs:
1. Define a question
2. Collect data
3. Visualize data
4. Train the algorithm
5. Test the algorithm
6. Collect feedback
The algorithm applies this knowledge to new sets of data once it becomes
proficient at arriving at the correct conclusions.
Regression
When the output is a continuous value, the task is a regression. For instance, a
financial analyst may need to forecast the price of a stock based on a range of
features such as equity, previous stock performance, and macroeconomic indices.
The system will be trained to estimate the price of the stocks with the lowest
possible error.
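The following sketch of a regression fit uses synthetic data in place of real stock
features, which are an assumption here purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # stand-ins for equity, past performance, macro index
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

reg = LinearRegression().fit(X, y)
print("Learned coefficients:", reg.coef_)   # close to the true weights 2.0, -1.5, 0.5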
In clustering, you do not know how to classify the data and want the algorithm to
find patterns and categorize the data for you.
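A minimal clustering sketch, using synthetic data and an assumed choice of three
clusters:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Let the algorithm find groups in unlabelled data.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(labels[:10])   # cluster id assigned to the first ten samples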
5.2 TENSORFLOW
TensorFlow is an open-source machine learning library that is widely used by:
● Researchers
● Data scientists
● Programmers
5.2.1 TENSORFLOW ARCHITECTURE
TensorFlow's architecture works in three parts:
● Data pre-processing
● Build the model
● Train and estimate the model
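A minimal TensorFlow/Keras sketch following these three parts; the layer sizes and
the synthetic data are illustrative assumptions rather than the project's actual model.

import numpy as np
import tensorflow as tf

# 1. Data pre-processing (synthetic data for illustration).
X = np.random.rand(200, 10).astype("float32")
y = (X.sum(axis=1) > 5.0).astype("float32")

# 2. Build the model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 3. Train and estimate the model.
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
loss, acc = model.evaluate(X, y, verbose=0)
print("Training-set accuracy:", acc)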
5.3 PYTHON OVERVIEW
● Python is interactive. You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
Python was developed by Guido van Rossum in the late 1980s and early 1990s at the
National Research Institute for Mathematics and Computer Science in the
Netherlands. Python is derived from many other languages, including ABC,
Modula-3, C, C++, Algol-68, SmallTalk, the Unix shell, and other scripting
languages. Python is copyrighted; like Perl, Python source code is now available
under the GNU General Public License (GPL). Python is now maintained by a core
development team at the institute, although Guido van Rossum still plays a vital
role in directing its progress.
5.4 ANACONDA NAVIGATOR
Anaconda Navigator lets you find the packages you want, install them in an
environment, run the packages, and update them, all inside Navigator. The
following applications are available by default in Navigator:
JupyterLab
Jupyter Notebook
QT Console
Spyder
VS Code
Glue viz
Orange 3 App
Rodeo
RStudio
Advanced conda users can also build their own Navigator applications. How do I
run code in Navigator? The simplest way is with Spyder. From the Navigator Home
tab, click Spyder to start writing and executing code. You can also use Jupyter
Notebook. A Jupyter Notebook is a widely used format that combines code,
explanations, outputs, and figures into a single document that can be edited,
viewed, and run in a web browser.
CHAPTER 6
TESTING
6.1 TESTING
● Software testing is an investigation conducted to provide stakeholders with
information about the quality of the product or service under test. Software
testing also provides an objective, independent view of the software so that the
business can appreciate and understand the risks of putting the system into
operation. Testing techniques include, but are not limited to, the process of
executing a program with the intent of finding software bugs. Software testing can
also be stated as the process of validating and verifying that a software program,
application, or product works as expected and meets the requirements that guided
its design and development.
Using NLP
1. Text classification: Used to categorize text data into predefined categories based on
its content.
2. Sentiment analysis: Used to determine the sentiment expressed in a piece of text,
such as positive, negative, or neutral.
3. Named entity recognition: Used to extract named entities from text, such as people,
organizations, and locations.
4. Part-of-speech tagging: Used to identify the parts of speech in a sentence, such as
nouns, verbs, and adjectives.
5. Summarization: Used to condense a large piece of text into a shorter, more concise
summary.
6. Question answering: Used to automatically answer questions posed in natural
language.
7. Machine translation: Used to automatically translate text from one language to
another.
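As a small illustration of task 1 (text classification), a TF-IDF representation with a
linear classifier can be assembled in scikit-learn; the tiny labelled corpus below is an
illustrative assumption, not project data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "How do I protect my family from the outbreak?",
    "The vaccine rollout starts next week in our district.",
    "Is hand washing really effective against this virus?",
    "Local clinic announces extended opening hours.",
]
labels = ["question", "announcement", "question", "announcement"]

# TF-IDF features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["When will the vaccine be available?"]))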
NLP is a rapidly growing field, and recent advances in deep learning have led to
significant improvements in NLP performance. However, NLP still faces many
challenges, including dealing with ambiguity and context, understanding sarcasm
and irony, and handling different languages and dialects.
Overall, NLP has the potential to transform the way people interact with computers
and to enable new applications in areas such as customer service, e-commerce, and
information retrieval.
Boosting algorithms are among the most powerful machine learning algorithms,
often achieving the highest performance and accuracy. All boosting algorithms
work by learning from the errors of the previously trained models and try to avoid
repeating the mistakes made by those earlier weak learners.
The differences between them are also a common data science interview topic. This
section describes the main differences between the GradientBoosting, AdaBoost,
XGBoost, CatBoost and LightGBM algorithms, along with their working mechanics
and mathematics.
Gradient boosting
A first weak learner is fitted to the data and its residual errors are computed.
Following the same pattern, a second weak learner is trained on these residuals,
and new residuals are computed. The residuals serve as the target values for the
next weak learner, and the process continues in this way until the residuals
approach zero.
For gradient boosting, the dataset must be in the form of numeric or categorical
data, and the loss function used to compute the residuals must be differentiable at
all points.
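A minimal gradient boosting sketch on synthetic regression data (the data and
hyperparameters are illustrative choices):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each new weak learner is fitted to the residuals of the ensemble so far.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3,
                                random_state=42)
gbr.fit(X_train, y_train)
print("R^2 on test data:", gbr.score(X_test, y_test))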
XGBoost
One of the main features of XGBoost is efficient missing value handling.
This allows you to handle real data with missing values without the need for
significant preprocessing. Additionally, XGBoost includes built-in parallel
processing support, enabling models to be trained on sizable datasets quickly.
Applications for XGBoost include click-through rate prediction, recommendation
engines, and Kaggle competitions among others. Additionally, you can fine-tune
various model parameters to enhance performance thanks to its high degree of
customizability.
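The sketch below shows XGBoost fitted on synthetic data with deliberately injected
missing values; it assumes the separate xgboost package is installed (pip install
xgboost), and the data and hyperparameters are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan   # inject ~10% missing values

# NaNs can be left in the feature matrix; XGBoost routes them along a
# learned default direction at each split.
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))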
Decision tree:
With CatBoost, the main difference that makes it stand out from the rest is
how the decision tree is grown: in CatBoost, the grown decision trees are
symmetric. This library can be easily installed with the following command:
pip install catboost
After fitting the data to the model, all algorithms give roughly similar results. Here
LightGBM seems to perform poorly compared to other algorithms, but XGBoost
works well in this case.
To visualize the performance of all algorithms on the same data, we can also plot
graphs between y_test and y_pred for all algorithms.
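A sketch of such a comparison plot for one fitted model, assuming y_test and y_pred
are available from the earlier evaluation step:

import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], "r--")  # perfect-prediction line
plt.xlabel("y_test (actual)")
plt.ylabel("y_pred (predicted)")
plt.title("Predicted vs actual values")
plt.show()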
The network takes a single time step of the input at a time. The current state is
then computed from the combination of the current input and the previous state.
The current state ht becomes ht-1 for the next time step. Depending on the
problem, we can take as many time steps as needed and combine the information
from all previous states. Once all time steps are complete, the final current state is
used to compute the output. The output is then compared with the actual (target)
output, and an error is generated. The error is then back-propagated through the
network to update the weights, and in this way the recurrent neural network
(RNN) is trained.
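A minimal recurrent network sketch matching this description, built with Keras; the
sequence length, layer sizes, and synthetic data are illustrative assumptions.

import numpy as np
import tensorflow as tf

X = np.random.rand(100, 20, 1).astype("float32")   # 100 sequences of 20 time steps
y = (X.mean(axis=(1, 2)) > 0.5).astype("float32")  # toy target derived from each sequence

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20, 1)),
    tf.keras.layers.SimpleRNN(16),                 # hidden state h_t carried across steps
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)               # errors back-propagated to update weights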
A software program, application, or product should:
● Meet the business and technical requirements that guided its design and
development.
TESTING METHODS:
1. Functional Testing
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
2. Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to expose failures caused by
interface defects.
Test Case 1:
Code:
Output:
System Requirements:
Hardware:
1. OS – Windows 7, 8 and 10 (32 and 64 bit)
2. RAM – 4 GB
Software:
1. Python / Anaconda Navigator
2. Python language
3. Jupyter Notebook
CHAPTER 7
RESULT
Multiple disease prediction using machine learning can have several positive
outcomes. Here are some potential results:
Multilingual and accessible content: ML-based translation and content generation
can help ensure that health information is accessible and relevant to individuals
from diverse linguistic and cultural backgrounds, improving their understanding
of and compliance with preventive measures.
Early detection and monitoring: ML algorithms can analyze social media posts,
online forums, and other digital platforms to detect early signals of an infectious
disease outbreak. By monitoring trends and detecting relevant keywords or
patterns, health education systems can provide early warnings and alerts to the
public and healthcare authorities. Early detection can help initiate timely
responses, such as increased surveillance, contact tracing, or vaccination
campaigns.
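As one illustrative sketch of such monitoring (the keywords, posts, and spike
threshold are assumptions, not a validated surveillance method):

from collections import Counter

KEYWORDS = {"fever", "outbreak", "cough", "hospital"}

def daily_keyword_counts(posts_by_day):
    # posts_by_day maps a date string to a list of post texts for that day.
    counts = {}
    for day, posts in posts_by_day.items():
        words = " ".join(posts).lower().split()
        counts[day] = sum(Counter(words)[k] for k in KEYWORDS)
    return counts

def flag_spikes(counts, factor=3.0):
    # Flag days whose keyword count greatly exceeds the average of earlier days.
    days = sorted(counts)
    baseline = sum(counts[d] for d in days[:-1]) / max(len(days) - 1, 1)
    return [d for d in days if counts[d] > factor * max(baseline, 1)]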
CHAPTER 8
CONCLUSION AND DISCUSSION
8.1 CONCLUSION
A widespread infectious disease has a negative impact on human life and the
global economic infrastructure, and recovery from such an outbreak consumes a
tremendous amount of time and resources and can take decades. Containment is
the first step in dealing with an infectious disease outbreak. In such cases, speed is
of the essence, because any delay could result in the exponential destruction of
both the economy and human life. In order to predict the potential pace of
escalation of the infectious disease, governments and health ministries all over the
world must always be one step ahead. The majority of countries are ill-equipped
to deal with such unforeseen outbreaks. Predictive modelling brings about a
revolutionary transformation, since it can serve as the first line of defense in
containing an illness in its early stages.
Future work can extend the system and make it more precise by adding further
components and modules, including tracking regional traffic and international
flight data. To understand more about a current epidemic, advanced ML methods
and algorithms can be utilized in conjunction with large amounts of data. With
features like periodic data extraction, analysis, and prediction, the entire model
can be fully automated.
CHAPTER 9
APPENDIX