
IMPLEMENTATION OF BREAST CANCER USING MACHINE LEARNING

WITH WISCONSIN BREAST CANCER DATASET (WBCD)

A project report submitted to Skyline University Nigeria in

partial fulfilment for the award of the Degree of

BACHELOR OF SCIENCE IN SOFTWARE ENGINEERING


Submitted by

KAWTHAR MUHAMMAD
(ID: 1384)
Under the Guidance of

Dr. Vijay Arputharaj

HEAD OF DEPARTMENT

DEPARTMENT OF COMPUTER SCIENCE & INFORMATION SYSTEM

SCHOOL OF SCIENCE AND INFORMATION TECHNOLOGY

SKYLINE UNIVERSITY NIGERIA

JULY 2023
CERTIFICATE

This is to certify that the project work entitled “IMPLEMENTATION OF BREAST CANCER USING MACHINE LEARNING WITH WISCONSIN BREAST CANCER DATASET (WBCD)” is a bona fide record of project work done by KAWTHAR MUHAMMAD (1384), submitted to Skyline University in partial fulfilment of the requirement for the award of the Degree of Bachelor of Science in Computer Science and Information Systems.

Signature of the Guide Signature of the HOD

Dr. Vijay Arputharaj Dr. Vijay Arputharaj

Supervisor Head of the Department

Submitted for the University Project Viva-voce held on____________

Internal Examiner External Examiner

DECLARATION

I, Kawthar Muhammad (1384), hereby declare that the project work entitled
“IMPLEMENTATION OF BREAST CANCER USING MACHINE LEARNING WITH
WISCONSIN BREAST CANCER DATASET (WBCD)”, submitted to Skyline
University, Nigeria in partial fulfilment of the requirements for the award of the
Degree of Bachelor of Science in Computer Science and Information Systems, is a
record of original work done by me during 2022-2023 under the supervision and
guidance of Dr. Vijay Arputharaj, and that it has not formed the basis for the
award of any Degree or other similar title to any candidate of any University.

Signature of the Candidate

Place :

Date :

ACKNOWLEDGEMENT

I express my sincere thanks to my parents, Alh. Usman, for their support and encouragement
towards my studies.

It is my pleasure to express my profound gratitude to the Management for admitting me into
this project. I extend my sincere thanks to our honourable Vice Chancellor and Registrar for
supporting and encouraging me in carrying out my project.

I express my thanks to Dr. VIJAY ARPUTHARAJ, Head of the Department of Computer
Science and Software Engineering, Skyline University, Nigeria, for his constant suggestions and
persistent encouragement.

I also wish to extend my sincere thanks to our beloved guide Dr. A. SENTHIL KUMAR,
MCA., M.Phil., MBA., Ph.D., Dean-SSIT, for his support and words of encouragement.

I also thank all the faculty members and lab in-charges of our department who helped me to
complete my project successfully. I wish to thank all the hands and minds that helped me, and all
my friends and well-wishers who are behind the success of my project.

KAWTHAR MUHAMMAD

ABSTRACT

Breast cancer is the most common type of cancer in Nigeria. A mammogram is an X-ray of the
breast. Screening mammograms are evaluated by human readers. The reading process is tedious,
tiring, lengthy, costly and, most importantly, prone to errors. Multiple studies have shown that
20-30% of diagnosed cancers could be found retrospectively on the previous negative
screening exam by blinded reviewers. This research proposes a novel model to classify breast
cancer using the Wisconsin Breast Cancer Dataset (WBCD). To do this, the dataset was first
analysed and the model was then developed using the Python programming language.

The analysis showed that a correlation exists between some of the attributes of the dataset.

The proposed model was evaluated using performance measures including accuracy,
precision, recall and F1-score.

Future research can be carried out for better optimization of the model, hyperparameter
selection and better validation.

TABLE OF CONTENTS

CHAPTER ONE: INTRODUCTION

1.0 RESEARCH BACKGROUND
1.1 PROBLEM STATEMENT
1.2 AIM AND OBJECTIVES
1.3 SCOPE
1.4 PROJECT OUTLINE

CHAPTER TWO: LITERATURE REVIEW

2.0 INTRODUCTION
2.1 CONCEPTUAL FRAMEWORK
2.1.1 MACHINE LEARNING
2.1.1.1 TYPES OF MACHINE LEARNING
2.2 THEORETICAL FRAMEWORK
2.2.1 MACHINE LEARNING THEORY
2.3 LITERATURE REVIEW
2.4 SUMMARY

CHAPTER THREE: RESEARCH METHODOLOGY

3.0 INTRODUCTION
3.1 DATASET
3.1.1 DESCRIPTION OF DATASET
3.2 RESEARCH FRAMEWORK
3.2.1 DATA COLLECTION
3.2.2 DATA PRE-PROCESSING
3.2.3 MODEL TRAINING
3.2.4 MODEL TESTING
3.2.5 MODEL EVALUATION
3.3 EXPERIMENTAL FRAMEWORK
3.4 IMPLEMENTATION OF PROJECT

CHAPTER FOUR: ANALYSIS AND DISCUSSION

4.0 INTRODUCTION
4.1 RESULTS
4.2 DISCUSSION

CHAPTER FIVE: SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.0 SUMMARY
5.1 CONCLUSION
5.2 RECOMMENDATIONS

REFERENCES

APPENDICES

CHAPTER ONE

INTRODUCTION

1.0 Research Background

Cancer occurs when changes called mutations take place in genes that regulate cell growth. The
mutations let the cells divide and multiply in an uncontrolled way.

Anything that may cause a normal body cell to develop abnormally can potentially cause cancer.
General categories of cancer-related or causative agents are as follows: chemical or toxic
compound exposures, ionizing radiation, some pathogens, and human genetics.

Cancer symptoms and signs depend on the specific type and grade of cancer. Although general
signs and symptoms are not very specific, the following can be found in patients with different
cancers: fatigue, weight loss, pain, skin changes, change in bowel or bladder function, unusual
bleeding, persistent cough or voice change, fever, and lumps or tissue masses.

The most common type of cancer in the US is breast cancer, followed by lung and prostate
cancers, according to the National Cancer Institute, which excludes nonmelanoma skin cancers
from these findings. The focus of this project is on breast cancer.

Breast cancer is cancer that develops in breast cells. Typically, the cancer forms in either the
lobules or the ducts of the breast.

Breast cancer is the most common cancer in women and it is the main cause of death from cancer
among women in the world. Screening mammography was shown to reduce breast cancer
mortality by 38–48% among participants. Although they generally have less of it, men have
breast tissue just like women do. Men can develop breast cancer too, but it’s much rarer.

In its early stages, breast cancer may not cause any symptoms. In many cases, a tumor may be
too small to be felt, but an abnormality can still be seen on a mammogram.

Each type of breast cancer can cause a variety of symptoms. Many of these symptoms are
similar, but some can be different. Symptoms such as a lump or thickening that feels different
from the surrounding tissue, breast pain, red or pitted skin over the breast, and nipple discharge
other than milk are among the most common symptoms of breast cancer.

There are many types of breast cancer, and they are broken down into two main categories:
“invasive” and “noninvasive” (in situ). While invasive cancer has spread from the breast ducts or
glands to other parts of the breast, noninvasive cancer has not spread from the original tissue.

These two categories are used to describe the most common types of breast cancer, which
include ductal carcinoma in situ, lobular carcinoma in situ, invasive ductal carcinoma and
invasive lobular carcinoma. Other, less common types of breast cancer include Paget disease of
the nipple, phyllodes tumor and angiosarcoma.

COMPUTER-AIDED DIAGNOSIS

Computer-aided detection (CADe) and computer-aided diagnosis (CADx) are systems
that assist doctors in the interpretation of medical images. CAD is an interdisciplinary
technology combining elements of artificial intelligence and computer vision with
radiological and pathological image processing. A typical application is the detection of a tumor.
For instance, some hospitals use CAD to support preventive medical check-ups in
mammography (diagnosis of breast cancer), the detection of polyps in the colon, and lung
cancer screening.

The benefits of using CAD are controversial. Initially, several studies showed promising
results with CAD. A large clinical trial in the United Kingdom showed that single reading
with CAD assistance has similar performance to double reading. However, in the last decade
multiple studies concluded that currently used CAD technologies do not improve the
performance of radiologists in everyday practice in the United States. These controversial results
indicate that CAD systems need to be improved before radiologists can ultimately benefit from
using the technology in everyday practice. Currently used CAD approaches are based on
describing the X-ray image with meticulously designed, hand-crafted features, and on machine
learning for classification on top of these features.

1.1 Problem Statement

A mammogram is an X-ray of the breast. It’s a screening tool used to detect breast cancer.
Together with regular clinical examination and monthly breast self-examination, mammograms
are a key element in the early diagnosis of breast cancer.

Screening mammograms are evaluated by human readers. The reading process is tedious, tiring,
lengthy, costly and, most importantly, prone to errors. Multiple studies have shown that 20-30%
of diagnosed cancers could be found retrospectively on the previous negative screening exam
by blinded reviewers. The problem of missed cancers still persists despite modern full-field
digital mammography (FFDM). The sensitivity and specificity of screening mammography are
reported to be between 77-87% and 89-97% respectively. These metrics describe the average
performance of readers, and there is substantial variance in the performance of individual
physicians, with reported false positive rates between 1-29% and sensitivities between 29-97%.
Double reading was found to improve the performance of mammographic evaluation, and it has
been implemented in many countries. Multiple reading can further improve diagnostic
performance with up to more than 10 readers, showing that there is room for improvement in
mammogram evaluation beyond double reading. With the evolution of medical research,
numerous new systems have been developed for the detection and classification of breast cancer.
The research associated with this area is outlined in brief as follows.

Wang, D. Zhang and Y. H. Huang used logistic regression and achieved an accuracy of
96.4%. Akbugday et al. performed classification on the Breast Cancer Dataset using KNN and SVM
and achieved an accuracy of 96.85%. Kaya Keles, in the paper titled “Breast Cancer
Prediction and Detection Using Data Mining”, used Random Forest and achieved an accuracy of
92.2%. Vikas Chaurasia et al. compared the performance criteria of supervised learning
classifiers, such as Naïve Bayes, SVM with RBF kernel, RBF neural networks, decision trees (J48)
and simple CART, to find the best classifier on breast cancer datasets. Delen, Walker et al.
used AdaBoost and achieved an accuracy of 97.5%, better than Random Forest. Kavitha et al.
used ensemble methods with neural networks and achieved an accuracy of 96.3%, lower than
previous studies. Sinthia et al. used the backpropagation method with 94.2% accuracy.
The experimental results show that the SVM-RBF kernel is more accurate than the other
classifiers; it scores an accuracy of 96.84% on the Wisconsin Breast Cancer (original) dataset.

1.2 Aim and Objectives

The aim of this research project is to design and implement a breast cancer classification
model.

The key objectives of this project are:

1. To propose a novel model to classify breast cancer using the Wisconsin Breast Cancer Dataset
(WBCD).

2. To provide an in-depth exploratory data analysis of the dataset.

1.3 Scope

Considering how vast the area of machine learning is, this project focuses on one of the three
general classes of machine learning, namely supervised learning. Algorithms such as linear
regression, naïve Bayes, decision trees, support vector machines and so on will be considered.

1.4 Project Outline

This project is divided into five parts:

Chapter one provides an introduction to the topic.

Chapter two discusses the literature review.

Chapter three gives an overview of the methodology used.

Details of the results are provided in chapter four.

Chapter five concludes the project, summary and recommendations are also stated.

CHAPTER 2

LITERATURE REVIEW

2.0 Introduction

This chapter is grouped into four parts. It starts with the conceptual framework, which gives an
insight into what machine learning is all about. It is followed by the theoretical framework,
which explains the theory underlying the subject of interest. The literature review, the third part
of this chapter, reviews the existing academic works on the research topic. The last part of this
chapter, the summary, summarizes the entire content of the chapter.

2.1 Conceptual Framework

2.1.1 Machine Learning

Two famous definitions of machine learning are:

Arthur Samuel (1959) defined machine learning as the field of study that gives computers the
ability to learn without being explicitly programmed. This definition is considered old and
informal.

Tom Mitchell (1998) defined a well-posed learning problem as follows: a computer program is said to
learn from experience E, with respect to some task T and some performance measure P, if its
performance on T, as measured by P, improves with experience E.

2.1.1.1 Types of machine Learning

There are basically three types of machine learning. They are:

1. Supervised learning.
2. Unsupervised learning.
3. Reinforcement learning.

Supervised learning is a type of machine learning that deals with labeled data. That is to say, the
algorithm is trained on input data that has been labeled for a particular output. The model is
trained until it can detect the underlying patterns and relationships between the input data and
the output labels, enabling it to yield accurate labeling results when presented with new data.
Supervised learning can be applied to two classes of problems: classification and regression.

• Classification: A classification problem is when the output variable is a category,
such as “red” or “blue”, or “disease” and “no disease”. Examples of classification
problems include image classification, diagnosis, customer retention and identity
fraud detection.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”. Examples of regression problems include weather forecasting,
market forecasting, population growth prediction, estimating life expectancy and so
on.

Unsupervised learning uses algorithms to identify patterns in data sets containing data points that
are neither classified nor labeled. The algorithms thus classify, label and/or group the data points
contained within the data sets without any external guidance in performing the task.
Unsupervised learning problems can be further grouped into clustering and association problems.

• Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior. Examples of clustering
problems include recommender systems, targeted marketing, customer segmentation and
so on.
• Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.

Reinforcement learning is a type of machine learning technique that enables an agent to learn in
an interactive environment by trial and error, using feedback from its own actions and
experiences. Reinforcement learning differs from supervised learning in not needing labelled
input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected.
Instead, the focus is on finding a balance between exploration (of uncharted territory) and
exploitation (of current knowledge).

Figure 2.1 Types of machine learning

2.2 Theoretical Framework

2.2.1 Machine Learning Theory

Machine learning theory, also known as computational learning theory, aims to understand the
fundamental principles of learning as a computational process. This field seeks to understand, at a
precise mathematical level, what capabilities and information are fundamentally needed to learn
different kinds of tasks successfully, and to understand the basic algorithmic principles involved in
getting computers to learn from data and to improve performance with feedback. The goals of
this theory are both to aid in the design of better automated learning methods and to understand
fundamental issues in the learning process itself.

2.3 Literature review

With the evolution of medical research, numerous systems have been developed for the detection
and classification of breast cancer. A great deal of research has been performed in the area of
diagnosis and classification of breast cancer, using many techniques and methods, to obtain high
accuracy and to develop a system fully capable of diagnosing the disease. Kavitha et al. (2014)
used ensemble methods with neural networks and achieved an accuracy of 96.3%, lower than
previous studies. Sinthia et al. used the backpropagation method with 94.2% accuracy. The
experimental results show that the SVM-RBF kernel is more accurate than other classifiers; it
scores an accuracy of 96.84% on the Wisconsin Breast Cancer (original) dataset. Akbugday et al.
(2019) performed classification on the Breast Cancer Dataset using KNN and SVM and achieved
an accuracy of 96.85%.

The following is the summary of the existing works on the given domain:

AUTHOR | DATASET USED | TOOL USED | TECHNIQUE USED | ADVANTAGES | ACCURACY
Wang et al. (2018) | Electronic health records | WEKA | Logistic Regression | 5-year survivability prediction using logistic regression | 96.4%
Chaurasia (2014) | Wisconsin dataset | WEKA | Statistical feature selection | Patient features from the data are statistically tested based on the type of individual feature; 51 attributes or features are then selected and a feature importance score is calculated | 92.3%
Akbugday (2019) | Breast Cancer Wisconsin dataset | WEKA | KNN and SVM | - | KNN 96.85%, SVM 96.85%
Keles, M. Kaya (2019) | Wisconsin Diagnostic Breast Cancer dataset | Python | SVM vs KNN, decision trees and Naïve Bayes | SVMs map the input vector into a feature space of higher dimensionality and identify the hyperplane that separates the data points into two classes | 96.91%
Chaurasia et al. (2016) | UC Irvine machine learning repository | WEKA | Naïve Bayes, J48 Decision Tree and Bagging algorithm | Decision tree (C5) is the best predictor on the holdout sample (this prediction accuracy is better than any reported in the literature) | 96.5%
Kavitha et al. (2014) | Cancer Society | MATLAB | Ensemble method with Logistic and Neural Network | Multiple learners are combined, giving higher accuracy | 96.3%
Sinthia et al. (2014) | Wisconsin Diagnosis Breast Cancer BCI dataset | CAD system | Logistic Regression and the Backpropagation neural network | - | 94.2%
Khourdifi et al. (2018) | Wisconsin Breast Cancer dataset | WEKA | Fast Correlation Based Filter with SVM, Random Forest, Naïve Bayes, K-NN and MLP | Attributes are reduced by deleting irrelevant and redundant attributes, which have no meaning in the classification task | 96.1%
Khuriwal et al. (2018) | Haberman's Survival dataset | WEKA | Naïve Bayes and SVM | Helps in marginalizing the hyper-parameters and differentiating classes | 74.44%
Shravya et al. (2019) | UCI repository | Spyder | SVM | Hyperplane separates two classes, which helps in higher accuracy | 92.7%
Bellaachia et al. (2016) | SEER Public-Use Data | WEKA | Naïve Bayes | Gives a probabilistic model for classification, helping in classification | 96.3%
Mohana et al. (2019) | Wisconsin Breast Cancer dataset | WEKA | Decision Tree | Helps in splitting and choosing the best attributes | 96.3%
Kibeom et al. (2017) | Gene Expression Dataset Collection | WEKA | C4.5, Bagging and AdaBoost decision trees | Ensemble method helps to combine multiple learners | Single C4.5: 95.6%, Bagging C4.5: 93.29%, AdaBoost C4.5: 92.62%
Cruz et al. (2007) | PubMed (biomedical literature), the Science Citation | MATLAB | SVM, Naïve Bayes | Helps to form a decision boundary and helps in classification | 97.3%
Medjahed et al. (2013) | Wisconsin breast cancer dataset | WEKA | Decision Trees | Helps in splitting | 96.1%

2.4 SUMMARY
A great deal of research has been performed in the area of diagnosis and classification of breast
cancer, using many techniques and methods, to obtain high accuracy and to develop a system
fully capable of diagnosing the aforementioned disease. However, there is still a lot to be done in
the field; large datasets are often not available for most tasks. Because of this, many deep learning
techniques cannot be applied due to the possibility of overfitting and poor generalization. As a
result, simpler models such as random forests or logistic regression, which do not require large
amounts of data, are often used instead.

CHAPTER 3
RESEARCH METHODOLOGY
3.0 INTRODUCTION
This chapter presents a description of the research process. It provides information
concerning the method that was used in undertaking this research, as well as a justification for the
use of this method.
3.1 DATASET
The dataset used for this research was obtained online. It has 569 instances and 32
attributes. It can be found on the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
3.1.1 DESCRIPTION OF DATASET
The features of the dataset were computed from a digitized image of a fine needle aspirate (FNA)
of a breast mass. They describe characteristics of the cell nuclei present in the image. Ten real-
valued features are computed for each cell nucleus. The mean, standard error and "worst" or
largest (mean of the three largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE and field 23 is
Worst Radius. The table below provides the attributes of the dataset.
S/N | ATTRIBUTE
1 | ID Number
2 | Diagnosis (M = malignant, B = benign)
3 | Radius (mean of distances from center to points on the perimeter)
4 | Texture (standard deviation of gray-scale values)
5 | Perimeter
6 | Area
7 | Smoothness (local variation in radius lengths)
8 | Compactness (perimeter^2 / area - 1.0)
9 | Concavity (severity of concave portions of the contour)
10 | Concave points (number of concave portions of the contour)
11 | Symmetry
12 | Fractal dimension ("coastline approximation" - 1)

The “Diagnosis” column contains the classification of the cancer, which has two values: either
malignant or benign. The class distribution is 357 benign and 212 malignant.
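
As an illustrative sketch, the dataset can be loaded and its class distribution confirmed with pandas as shown below. The file name data.csv and the lowercase column name diagnosis are assumptions based on the common CSV export of this dataset and may differ in other copies.

import pandas as pd

# Load the WBCD file; "data.csv" is an assumed local file name for the CSV
# export of the Wisconsin Diagnostic Breast Cancer dataset.
df = pd.read_csv("data.csv")

print(df.shape)                         # expected: 569 rows, 32 columns plus any empty trailing column
print(df["diagnosis"].value_counts())   # expected: B = 357, M = 212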

3.2 RESEARCH FRAMEWORK


Research in machine learning is mainly quantitative. Therefore, it uses standardized
approaches and statistical models in the various segments of the research.

The general approach used in conducting experiments in machine learning research is as follows:
data collection, data pre-processing, model training, model testing and model evaluation. These
steps are summarized in the diagram below.

Data Collection → Data Pre-Processing → Model Training → Model Testing → Model Evaluation

Figure 3.1 Machine Learning Research Approach.

3.2.1 Data Collection
The method by which the data was collected, and its description, have been provided in section 3.1.

3.2.2 Data Pre-processing

Data cleaning is the process of removing incorrect, incomplete and inaccurate data from the
dataset; it also replaces missing values. Since the source of data used for this research is
secondary, the data is already in a structured format, i.e. in the form of rows and columns, and
there are no missing values in the data. However, the column “ID” and the last column, titled
“Unnamed: 32”, were removed from the dataset, as the former is just a means of identification
while the latter contains no values. Apart from the issues mentioned, the data is already
pre-processed. The issue of data normalization will be discussed in the next section, which is
model training. The issue of data reduction is not discussed here, as the dataset is not that big.
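
A minimal sketch of this cleaning step is given below; the column names id and Unnamed: 32 follow the naming used in the CSV export and may differ in other copies of the dataset.

# Drop the identifier column and the empty trailing column; errors="ignore"
# keeps the step safe if a particular copy of the file does not contain them.
df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")

# Confirm that no missing values remain in the other columns.
print(df.isnull().sum().sum())  # expected: 0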
3.2.3 Model Training
To train our model, we split the dataset into a training set (to train the model) and a testing set
(for the model to predict on). We then replace the “Diagnosis” column of the dataset, as it contains
objects, with numbers (1 for malignant and 0 for benign) so it can be passed into our model.
Before passing the dataset into the model, we scale it using the standardization method.
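
The sketch below illustrates these steps (encoding the target, splitting, and standardizing) with scikit-learn. The 75/25 split ratio and the random seed are assumptions, since the report does not state them explicitly, although a 25% test set is consistent with the 143 test samples reported in Chapter 4.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Encode the target: 1 for malignant, 0 for benign.
y = df["diagnosis"].map({"M": 1, "B": 0})
X = df.drop(columns=["diagnosis"])

# Split into training and testing sets (75/25 split and seed are assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Standardize the features: fit the scaler on the training set only, then
# apply the same transformation to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)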

3.2.4 Model Testing

After the model was trained, we tested it by passing the testing set of our data to the model,
where it predicts the diagnosis for each sample in the set.
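
In code, this step amounts to asking the trained classifier for predictions on the held-out set, roughly as follows; model here stands for the fitted classifier (the support vector machine whose training sketch appears later in this chapter).

# "model" is the classifier fitted on X_train / y_train; a training sketch
# for the SVM used in this project is shown under the SVM heading below.
y_pred = model.predict(X_test)
print(y_pred[:10])  # first ten predicted diagnoses (1 = malignant, 0 = benign)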
3.2.5 Model Evaluation
To evaluate our model, we use metrics designed for classification models. Classification
metrics evaluate a model’s performance and tell you how good or bad the classification is, but
each of them evaluates it in a different way. The following metrics are used:
• Confusion Matrix
• Accuracy
• Precision and Recall
• F1-score

CONFUSION MATRIX
A confusion matrix is a tabular visualization of the ground-truth labels versus the model
predictions. Each row of the confusion matrix represents the instances of an actual class and each
column represents the instances of a predicted class. The confusion matrix is not exactly a
performance metric, but rather a basis on which other metrics evaluate the results.

Actual \ Predicted | Predicted Negative | Predicted Positive
Actual Negative | TN (True Negative) | FP (False Positive)
Actual Positive | FN (False Negative) | TP (True Positive)

Each prediction can be one of four outcomes, based on how it matches up to the actual value
(a minimal sketch of obtaining these counts follows the list):
• True Positive (TP) signifies how many positive instances the model predicted correctly.
• True Negative (TN) signifies how many negative instances the model predicted correctly.
• False Positive (FP) signifies how many negative instances the model predicted as positive.
• False Negative (FN) signifies how many positive instances the model predicted as negative.
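
The sketch below shows how these four counts can be obtained with scikit-learn; y_test and y_pred are the held-out labels and model predictions from the previous sections.

from sklearn.metrics import confusion_matrix

# With the label order [0, 1] (benign, malignant), the 2x2 matrix has rows as
# actual classes and columns as predicted classes, so ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")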

ACCURACY
Accuracy is a common evaluation metric for classification problems. It is the number of correct
predictions made as a ratio of all predictions made.
Overall, how often is the classifier correct?
Accuracy = (TP + TN) / total
PRECISION
Precision is the ratio of true positives to the total predicted positives.

Precision = TP / (TP + FP)

A precision score close to 1 signifies that almost all the samples the model labelled as positive are
truly positive, i.e. the model distinguishes well between correct and incorrect labelling of cancer
patients. A low precision score means that the model has a high number of false positives.
RECALL
Recall is essentially the ratio of true positives to all the positives in the ground truth.

Recall = TP / (TP + FN)

A recall close to 1 signifies that the model did not miss many true positives and is able to
correctly identify the cancer patients. A low recall means the model has a high number of false
negatives, which can be an outcome of an imbalanced class or untuned model hyperparameters.

F1 SCORE
The F1 score metric combines precision and recall. In fact, the F1 score is the harmonic
mean of the two. The formula is:
F1 = 2 × (precision × recall) / (precision + recall)
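
All four metrics can be computed directly with scikit-learn, as in the sketch below (again using y_test and y_pred from the earlier sections).

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_test, y_pred))          # harmonic mean of precision and recall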

SUPPORT VECTOR MACHINE

The algorithm used to build the model for this research is known as the Support Vector
Machine (SVM). The objective of a support vector machine is to find a hyperplane in an N-
dimensional space (N being the number of features) that distinctly classifies the data points. To
separate the two classes of data points, there are many possible hyperplanes that could be chosen.
Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between
data points of both classes. Maximizing the margin distance provides some reinforcement so that
future data points can be classified with more confidence. Hyperplanes are decision boundaries
that help classify data points. Data points falling on either side of the hyperplane can be attributed
to different classes. The dimension of the hyperplane depends upon the number of features: if
the number of input features is 2, then the hyperplane is just a line; if the number of features is 3,
then the hyperplane becomes a two-dimensional plane. Support vectors are the data points that are
closest to the hyperplane and influence its position and orientation. Using these support vectors,
we maximize the margin of the classifier.

Algorithm: BCCA
Input: Array of classes
Output: Classified classes
1. Select the hyperplane that divides the classes best.
2. To find the better hyperplane, calculate the distance between the plane and the nearest data
points; this distance is called the margin.
3. If the distance between the classes is low, then
(a) the chance of misclassification is high, and vice versa, so we
(b) select the hyperplane with the highest margin.
Margin = distance to positive point - distance to negative point.
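
As a sketch, the scikit-learn implementation of this classifier can be trained on the scaled data from the model training section as follows. No hyperparameters are set explicitly because Chapter 4 notes that the default hyperparameters were used.

from sklearn.svm import SVC

# Support vector classifier with scikit-learn's default hyperparameters;
# fit() finds the margin-maximizing hyperplane described above.
model = SVC()
model.fit(X_train, y_train)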

3.3 Experimental Framework


This project was performed on a Windows laptop with a processor speed of 2.30 GHz and a
memory of 4.0 GB. The programming language used was Python. Python is a general-purpose
programming language. It is the most useful language for data science because it is simple, easy
and quicker to learn than most other languages. Secondly, it is very scalable. Also, Python has a lot
of built-in libraries to analyze and work with data; some of these libraries include NumPy,
Pandas, Scikit-learn, etc. Python libraries are powerful and very broad.
• NumPy is great for linear algebra, high-level mathematical functions and random number
crunching.
• Pandas provides a range of functions for handling data structures and operations, such as
manipulating tables and time series.
• Scikit-learn features various classification, regression and clustering algorithms,
including support vector machines, random forests, gradient boosting, etc., and it is designed
to interoperate with Python's numerical and scientific libraries.

3.4 IMPLEMENTATION OF PROJECT


CATEGORY | DESCRIPTION
Database | MySQL
Technology | Python

CHAPTER FOUR
ANALYSIS AND DISCUSSION
4.0 INTRODUCTION
In this chapter, the results of the experiments conducted are presented and discussed. Graphical
presentations of the results are also given.
4.1 RESULTS
To analyze our data, we begin by exploring the class distribution, where we found that out of the
569 samples, 357 were benign while 212 were malignant. The sample distribution is roughly 63 to 37
percent.

Figure 4.1 Distribution of samples

Figure 4.2 Percentage of samples
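
The two figures above can be reproduced with a short snippet along these lines; the exact styling of the original figures is not known, so this is only an approximation using the dataframe loaded in Chapter 3.

import matplotlib.pyplot as plt

counts = df["diagnosis"].value_counts()  # expected: B = 357, M = 212

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0], title="Distribution of samples")                    # cf. Figure 4.1
counts.plot(kind="pie", ax=axes[1], autopct="%1.0f%%", title="Percentage of samples")   # cf. Figure 4.2
plt.tight_layout()
plt.show()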

We then try to find out whether there exists any correlation between the attributes in our dataset.
To do that, we divide the dataset into three parts (the mean, the standard error and the worst
columns) for easy analysis. After examining the first part of our dataset, the following
observations were made.

The columns

(a) Radius, perimeter, area and concave points are related to one another.

(b) Compactness, concavity and concave points are related to one another.

(c) Texture, smoothness, symmetry and fractal_dimension do not relate to any other column.

Figure 4.3 Heatmap of the mean columns

Figure 4.4 Pair plot of the mean columns.

We then take the second part of our dataset (i.e. the standard error (SE) columns) and the
observations also hold true for them.

Figure 4.5 Heatmap of the SE columns.

We then concluded that the same correlation also holds for the worst columns.
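
A sketch of the correlation analysis for the mean columns is shown below; the *_mean column-name suffix is an assumption based on the usual CSV export of this dataset, and the same code applies to the SE and worst columns by changing the suffix.

import seaborn as sns
import matplotlib.pyplot as plt

# Select the ten "mean" columns (assumes the usual *_mean naming of the CSV export).
mean_cols = [c for c in df.columns if c.endswith("_mean")]

# Correlation matrix and heatmap, as in Figure 4.3.
corr = df[mean_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between the mean features")
plt.show()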

To develop our model, we first remove two columns from our dataset, namely “ID” and
“Unnamed: 32”. We then separate the dataset into target and features, split the data for training
and testing, scale the data and pass it into our model. We then evaluate the model based on the
performance measures stated in Chapter 3, and the results of the evaluation are as follows.

1. Accuracy: The accuracy of the model is 96.5%


2. Precision: The precision of the model is 91.1%
3. Recall : The recall of the model is 97.6%
4. F1 Score: The F1 score of the model is 94.6%
5. The true positive, true negative, false positive and false negative counts are:
• TP: 41
• FN: 4
• FP: 1
• TN: 97

Figure 4.6 Confusion matrix scores

4.2 DISCUSSION

Understanding how well a machine learning model is going to perform on unseen data is the
ultimate purpose of evaluating a model using the metrics above. Seeing that all the
performance metrics used to evaluate our model give values greater than 90 percent, and
seeing also that almost all of the values in the confusion matrix were predicted correctly (only 5
instances were predicted incorrectly), we can safely say that the model performed very well,
considering the distribution of the classes and also considering the fact that default
hyperparameters were used for our model.

CHAPTER 5

SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.0 SUMMARY
The aim of this research was to develop a model for classifying breast cancer as either
malignant or benign. To do that, the Jupyter platform was used with the Python programming
language and the Wisconsin Breast Cancer Dataset. The model was built using the support
vector machine algorithm. An exploratory data analysis of the dataset was also carried out. The
results of the analysis show that some of the features of the dataset are positively correlated with
each other while others are not. To evaluate the model, we used four performance measures that
were computed using the confusion matrix. These performance measures are

• Accuracy
• Precision
• Recall
• F1-score

The accuracy of the model was around 96 percent, while the precision, recall and F1-score were

91.1%, 97.6% and 94.6% respectively.

5.1 CONCLUSION

It was observed that some of the attributes of the dataset exhibit correlation with one another
while others do not. The model was developed, and it was observed that the accuracy of the
model is around 96 percent. In summary, the algorithm performed effectively and efficiently on
the performance measures used to evaluate it. The uneven distribution of the target column
should also be taken into consideration, as one class (benign) contains 63 percent of the dataset
while the other class (malignant) contains only 37 percent.

5.2 RECOMMENDATION

Future research should be carried out for better performance of this model. This can be achieved
by performing hyperparameter optimization, seeing as the model was built using default
hyperparameters. K-fold cross-validation techniques can also be used to better evaluate the model
and obtain a more reliable accuracy estimate.

REFERENCES

Abdelghani Bellaachia, Erhan Guven, “Predicting Breast Cancer Survivability Using Data Mining Techniques”.

B. Akbugday, "Classification of Breast Cancer Data Using Machine Learning Algorithms," 2019 Medical Technologies Congress (TIPTEKNO), Izmir, Turkey, 2019, pp. 1-4.

Ch. Shravya, K. Pravalika, Shaik Subhani, “Prediction of Breast Cancer Using Supervised Machine Learning Techniques”, International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume-8, Issue-6, April 2019.

Delen, D.; Walker, G.; Kadam, A., "Predicting breast cancer survivability: A comparison of three data mining methods", Artif. Intell. Med. 2005, 34, 113–127.

Joseph A. Cruz and David S. Wishart, “Applications of Machine Learning in Cancer Prediction and Prognosis”, Cancer Informatics, 2(3):59-77, February 2007.

Keles, M. Kaya, "Breast Cancer Prediction and Detection Using Data Mining Classification Algorithms: A Comparative Study", Tehnicki Vjesnik - Technical Gazette, vol. 26, no. 1, 2019, p. 149+.

Kibeom Jang, Minsoon Kim, Candace A. Gilbert, Fiona Simpkins, Tan A. Ince, Joyce M. Slingerland, “VEGFA activates an epigenetic pathway regulating ovarian cancer-initiating cells”, EMBO Molecular Medicine, Volume 9, Issue 3, 2017.

N. Khuriwal, N. Mishra, “A Review on Breast Cancer Diagnosis in Mammography Images Using Deep Learning Techniques”, 2018, Vol. 1, No. 1.

P. Sinthia, R. Devi, S. Gayathri and R. Sivasankari, “Breast Cancer Detection Using PCPCET and ADEWNN”, CIEEE '17, pp. 63-65.

R. K. Kavitha, D. D. Rangasamy, “Breast Cancer Survivability Using Adaptive Voting Ensemble Machine Learning Algorithm Adaboost and CART Algorithm”, Volume 3, Special Issue 1, February 2014.

R. M. Mohana, R. Delshi Howsalya Devi, Anita Bai, “Lung Cancer Detection Using Nearest Neighbour Classifier”, International Journal of Recent Technology and Engineering (IJRTE), Volume-8, Issue-2S11, September 2019.

S. A. Medjahed, T. A. Saadi, A. Benyettou, “Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules”, International Journal of Computer Applications, 62(1), 2013.

V. Chaurasia and S. Pal, “Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability”, IJCSMC, Vol. 3, Issue 1, January 2014, pp. 10–22.

Vikas Chaurasia and S. Pal, “Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis”, FAMS 2016, 83 (2016), 1064–1069.

Wang, D. Zhang and Y. H. Huang, “Breast Cancer Prediction Using Machine Learning”, 2018, Vol. 66, No. 7.

Y. Khourdifi and M. Bahaj, "Feature Selection with Fast Correlation-Based Filter for Breast Cancer Prediction and Classification Using Machine Learning Algorithms," 2018 International Symposium on Advanced Electrical and Communication Technologies (ISAECT), Rabat, Morocco, 2018, pp. 1-6.

APPENDICES

APPENDIX – 1 : SAMPLE SOURCE CODES

APPENDIX – 2 : SAMPLE SCREENSHOTS

