
IMPLEMENTATION OF BREAST CANCER USING MACHINE LEARNING

WITH WISCONSIN BREAST CANCER DATASET (WBCD)

A project report submitted to Skyline University Nigeria in

partial fulfilment for the award of the Degree of

BACHELOR OF SCIENCE IN SOFTWARE ENGINEERING


Submitted by

KAWTHAR MUHAMMAD
(ID: 1384)
Under the Guidance of

Dr. Vijay Arputharaj

HEAD OF DEPARTMENT

DEPARTMENT OF COMPUTER SCIENCE & INFORMATION SYSTEM

SCHOOL OF SCIENCE AND INFORMATION TECHNOLOGY

SKYLINE UNIVERSITY NIGERIA

JULY 2023
CERTIFICATE

This is to certify that the project work entitled “IMPLEMENTATION OF BREAST CANCER USING MACHINE LEARNING WITH WISCONSIN BREAST CANCER DATASET (WBCD)” is a bona fide record of project work done by KAWTHAR MUHAMMAD (1384), submitted to Skyline University in partial fulfilment of the requirement for the award of the Degree of Bachelor of Science in Computer Science and Information Systems.

Signature of the Guide Signature of the HOD

Dr. Vijay Arputharaj Dr. Vijay Arputharaj

Supervisor Head of the Department

Submitted for the University Project Viva-voce held on____________

Internal Examiner External Examiner

DECLARATION

I, Kawthar Muhammad (1384), hereby declare that the project work entitled
“IMPLEMENTATION OF BREAST CANCER USING MACHINE LEARNING WITH
WISCONSIN BREAST CANCER DATASET (WBCD)”, submitted to Skyline
University, Nigeria in partial fulfilment of the requirements for the award of the
Degree of Bachelor of Science in Computer Science and Information Systems, is a
record of original work done by me during 2022-2023 under the supervision and
guidance of Dr. Vijay Arputharaj, and that it has not formed the basis for the
award of any Degree or other similar title to any candidate of any University.

Signature of the Candidate

Place :

Date :

ACKNOWLEDGEMENT

I express my sincere thanks to my parents, Alh. Usman, for their support and encouragement
towards my studies.

It is my pleasure to express my profound gratitude to the Management for admitting me into
this project. I extend my sincere thanks to our honourable Vice Chancellor and Registrar for
supporting and encouraging me in carrying out my project.

I express my thanks to Dr. VIJAY ARPUTHARAJ, Head of the Department of Computer
Science and Software Engineering, Skyline University, Nigeria, for his constant suggestions and
persistent encouragement.

I also wish to extend my sincere thanks to our beloved guide Dr. A. SENTHIL KUMAR,
MCA., M.Phil., MBA., Ph.D., Dean-SSIT, for his support and words of encouragement.

I also thank all the faculty members and lab in-charges of our department who helped me to
complete my project successfully. I wish to thank all the hands and minds that helped me, and all
my friends and well-wishers who are behind the success of my project.

KAWTHAR MUHAMMAD

ABSTRACT

Breast cancer is the most common type of cancer in Nigeria. A mammogram is an X-ray of the
breast. Screening mammograms are evaluated by human readers. The reading process is tedious,
tiring, lengthy, costly and, most importantly, prone to errors. Multiple studies have shown that
20-30% of diagnosed cancers could be found retrospectively on the previous negative
screening exam by blinded reviewers. This research proposes a novel model to classify breast
cancer using the Wisconsin Breast Cancer Dataset (WBCD). To do this, the dataset was first
analysed and the model was then developed using the Python programming language.

The analysis showed that a correlation exists between some of the attributes of the dataset.

The proposed model was evaluated using performance measures including accuracy,
precision, recall and F1-score.

Future research can be carried out for better optimization of the model, hyperparameter
selection and better validation.

TABLE OF CONTENTS

CHAPTER ONE: INTRODUCTION

1.0 RESEARCH BACKGROUND
1.1 PROBLEM STATEMENT
1.2 AIM AND OBJECTIVES
1.3 SCOPE
1.4 PROJECT OUTLINE

CHAPTER TWO: LITERATURE REVIEW

2.0 INTRODUCTION
2.1 CONCEPTUAL FRAMEWORK
2.1.1 MACHINE LEARNING
2.1.1.1 TYPES OF MACHINE LEARNING
2.2 THEORETICAL FRAMEWORK
2.2.1 MACHINE LEARNING THEORY
2.3 LITERATURE REVIEW
2.4 SUMMARY

CHAPTER THREE: RESEARCH METHODOLOGY

3.0 INTRODUCTION
3.1 DATASET
3.1.1 DESCRIPTION OF DATASET
3.2 RESEARCH FRAMEWORK
3.2.1 DATA COLLECTION
3.2.2 DATA PRE-PROCESSING
3.2.3 MODEL TRAINING
3.2.4 MODEL TESTING
3.2.5 MODEL EVALUATION
3.3 EXPERIMENTAL FRAMEWORK
3.4 IMPLEMENTATION OF PROJECT

CHAPTER FOUR: ANALYSIS AND DISCUSSION

4.0 INTRODUCTION
4.1 RESULTS
4.2 DISCUSSION

CHAPTER FIVE: SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.0 SUMMARY
5.1 CONCLUSION
5.2 RECOMMENDATIONS

REFERENCES

APPENDICES

CHAPTER ONE

INTRODUCTION

1.0 Research Background

Cancer occurs when changes called mutations take place in genes that regulate cell growth. The
mutations let the cells divide and multiply in an uncontrolled way.

Anything that may cause a normal body cell to develop abnormally can potentially cause cancer.
General categories of cancer-related or causative agents are as follows: chemical or toxic
compound exposures, ionizing radiation, some pathogens, and human genetics.

Cancer symptoms and signs depend on the specific type and grade of cancer. Although general
signs and symptoms are not very specific, the following can be found in patients with different
cancers: fatigue, weight loss, pain, skin changes, change in bowel or bladder function, unusual
bleeding, persistent cough or voice change, fever, and lumps or tissue masses.

The most common type of cancer in the US is breast cancer, followed by lung and prostate
cancers, according to the National Cancer Institute, which excludes nonmelanoma skin cancers
from these findings. The focus of this project is on breast cancer.

Breast cancer is cancer that develops in breast cells. Typically, the cancer forms in either the
lobules or the ducts of the breast.

Breast cancer is the most common cancer in women and it is the main cause of death from cancer
among women in the world. Screening mammography was shown to reduce breast cancer
mortality by 38–48% among participants. Although they generally have less of it, men have
breast tissue just like women do. Men can develop breast cancer too, but it’s much rarer.

In its early stages, breast cancer may not cause any symptoms. In many cases, a tumor may be
too small to be felt, but an abnormality can still be seen on a mammogram.

Each type of breast cancer can cause a variety of symptoms. Many of these symptoms are
similar, but some can be different. Symptoms such as a lump or thickening that feels different
from the surrounding tissue, breast pain, red or pitted skin over the breast, and nipple discharge
other than milk are among the most common symptoms of breast cancer.

There are many types of breast cancer, and they are broken down into two main categories:
“invasive” and “noninvasive” (in situ). While invasive cancer has spread from the breast ducts or
glands to other parts of the breast, noninvasive cancer has not spread from the original tissue.

These two categories are used to describe the most common types of breast cancer, which
include ductal carcinoma in situ, lobular carcinoma in situ, invasive ductal carcinoma and
invasive lobular carcinoma. Other, less common types of breast cancer include Paget disease of
the nipple, phyllodes tumor and angiosarcoma.

COMPUTER-AIDED DIAGNOSIS

Computer-aided detection (CADe) and computer-aided diagnosis (CADx) are systems
that assist doctors in the interpretation of medical images. CAD is an interdisciplinary
technology combining elements of artificial intelligence and computer vision with
radiological and pathological image processing. A typical application is the detection of a tumor.
For instance, some hospitals use CAD to support preventive medical check-ups in
mammography (diagnosis of breast cancer), the detection of polyps in the colon, and lung
cancer screening.

The benefits of using CAD are controversial. Initially, several studies showed promising
results with CAD. A large clinical trial in the United Kingdom showed that single reading
with CAD assistance has similar performance to double reading. However, in the last decade
multiple studies concluded that currently used CAD technologies do not improve the
performance of radiologists in everyday practice in the United States. These controversial results
indicate that CAD systems need to be improved before radiologists can ultimately benefit from
using the technology in everyday practice. Currently used CAD approaches are based on
describing the X-ray image with meticulously designed, hand-crafted features, and on machine
learning for classification on top of these features.

1.1 Problem Statement

A mammogram is an X-ray of the breast. It’s a screening tool used to detect breast cancer.
Together with regular clinical examination and monthly breast self-examination, mammograms
are a key element in the early diagnosis of breast cancer.

Screening mammograms are evaluated by human readers. The reading process is tedious, tiring,
lengthy, costly and, most importantly, prone to errors. Multiple studies have shown that 20-30%
of diagnosed cancers could be found retrospectively on the previous negative screening exam
by blinded reviewers. The problem of missed cancers still persists despite modern full-field
digital mammography (FFDM). The sensitivity and specificity of screening mammography are
reported to be between 77-87% and 89-97% respectively. These metrics describe the average
performance of readers, and there is substantial variance in the performance of individual
physicians, with reported false positive rates between 1-29% and sensitivities between 29-97%.
Double reading was found to improve the performance of mammographic evaluation, and it has
been implemented in many countries. Multiple reading can further improve diagnostic
performance with up to more than 10 readers, showing that there is room for improvement in
mammogram evaluation beyond double reading. With the evolution of medical research,
numerous new systems have been developed for the detection and classification of breast cancer.
The research associated with this area is outlined in brief as follows.

Wang, D. Zhang and Y. H. Huang used logistic regression and achieved an accuracy of
96.4%. Akbugday et al. performed classification on the Breast Cancer Dataset using KNN and SVM
and achieved an accuracy of 96.85%. Kaya Keles, in the paper titled “Breast Cancer
Prediction and Detection Using Data Mining”, used Random Forest and achieved an accuracy of
92.2%. Vikas Chaurasia et al. compared the performance criteria of supervised learning
classifiers, such as Naïve Bayes, SVM with RBF kernel, RBF neural networks, decision trees (J48)
and simple CART, to find the best classifier on breast cancer datasets. Delen, Walker et al.
used AdaBoost and achieved an accuracy of 97.5%, better than Random Forest. Kavitha et al.
used ensemble methods with neural networks and achieved an accuracy of 96.3%, lower than
previous studies. Sinthia et al. used the backpropagation method with 94.2% accuracy.
The experimental results show that the SVM-RBF kernel is more accurate than the other
classifiers; it scores an accuracy of 96.84% on the Wisconsin Breast Cancer (original) dataset.

1.2 Aim and Objectives

The aim of this research project is to design and implement a breast cancer classification
model.

The key objectives of this project are:

1. To propose a novel model to classify breast cancer using the Wisconsin Breast Cancer Dataset
(WBCD).

2. To provide an in-depth exploratory data analysis of the dataset.

1.3 Scope

Considering how vast the area of machine learning is, this project focuses on one of the three
general classes of machine learning, namely supervised learning. Algorithms such as linear
regression, naïve Bayes, decision trees, support vector machines and so on will be considered.

1.4 Project Outline

This project is divided into five parts:

Chapter one provides an introduction to the topic.

Chapter two discusses the literature review.

Chapter three gives an overview of the methodology used.

Details of the results are provided in chapter four.

Chapter five concludes the project, summary and recommendations are also stated.

CHAPTER 2

LITERATURE REVIEW

2.0 Introduction

This chapter is grouped into four parts. It starts with the conceptual framework, which gives an
insight into what machine learning is all about. It is followed by the theoretical framework,
which explains the theory underlying the subject of interest. The literature review, the third part
of this chapter, reviews the existing academic works on the research topic. The last part of this
chapter, the summary, summarizes the entire content of the chapter.

2.1 Conceptual Framework

2.1.1 Machine Learning

Two famous definitions of machine learning are:

Arthur Samuel (1959) defined machine learning as the field of study that gives computers the
ability to learn without being explicitly programmed. This definition is considered old and
informal.

Tom Mitchell (1998) defined a well-posed learning problem as follows: a computer program is said to
learn from experience E, with respect to some task T and some performance measure P, if its
performance on T, as measured by P, improves with experience E.

2.1.1.1 Types of machine Learning

There are basically three types of machine learning. They are:

1. Supervised learning.
2. Unsupervised learning.
3. Reinforcement learning.

Supervised learning is a type of machine learning that deals with labeled data. That is to say, the
algorithm is trained on input data that has been labeled for a particular output. The model is
trained until it can detect the underlying patterns and relationships between the input data and
the output labels, enabling it to yield accurate labeling results when presented with new data.
Supervised learning can be applied to two classes of problems: classification and regression.

• Classification: A classification problem is when the output variable is a category,
such as “red” or “blue”, or “disease” and “no disease”. Examples of classification
problems include image classification, diagnosis, customer retention and identity
fraud detection.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”. Examples of regression problems include weather forecasting,
market forecasting, population growth prediction, estimating life expectancy and so
on.

Unsupervised learning uses algorithms to identify patterns in data sets containing data points that
are neither classified nor labeled. The algorithms thus classify, label and/or group the data points
contained within the data sets without any external guidance in performing the task.
Unsupervised learning problems can be further grouped into clustering and association problems.

• Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior. Examples of clustering
problems include recommender systems, targeted marketing, customer segmentation and
so on.
• Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.

Reinforcement learning is a type of machine learning technique that enables an agent to learn in
an interactive environment by trial and error, using feedback from its own actions and
experiences. Reinforcement learning differs from supervised learning in not needing labelled
input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected.
Instead, the focus is on finding a balance between exploration (of uncharted territory) and
exploitation (of current knowledge).

Figure 2.1 Types of machine learning

2.2 Theoretical Framework

2.2.1 Machine Learning Theory

Machine learning theory, also known as computational learning theory, aims to understand the
fundamental principles of learning as a computational process. This field seeks to understand, at a
precise mathematical level, what capabilities and information are fundamentally needed to learn
different kinds of tasks successfully, and to understand the basic algorithmic principles involved in
getting computers to learn from data and to improve performance with feedback. The goals of
this theory are both to aid in the design of better automated learning methods and to understand
fundamental issues in the learning process itself.

2.3 Literature review

With the evolution of medical research, numerous systems have been developed for the detection
and classification of breast cancer. A great deal of research has been performed in the area of
diagnosis and classification of breast cancer, using many techniques and methods, to obtain high
accuracy and to develop a system fully capable of diagnosing the disease. Kavitha et al. (2014)
used ensemble methods with neural networks and achieved an accuracy of 96.3%, lower than
previous studies. Sinthia et al. used the backpropagation method with 94.2% accuracy. The
experimental results show that the SVM-RBF kernel is more accurate than other classifiers; it
scores an accuracy of 96.84% on the Wisconsin Breast Cancer (original) dataset. Akbugday et al.
(2019) performed classification on the Breast Cancer Dataset using KNN and SVM and achieved
an accuracy of 96.85%.

The following is the summary of the existing works on the given domain:

AUTHOR | DATASET USED | TOOL USED | TECHNIQUE USED | ADVANTAGES | ACCURACY
Wang et al. (2018) | Electronic health records | WEKA | Logistic Regression | 5-year survivability prediction using logistic regression | 96.4%
Chaurasia (2014) | Wisconsin dataset | WEKA | Statistical feature selection | Patient features from the data are statistically tested based on the type of individual feature; 51 attributes or features are then selected and a feature importance score is calculated | 92.3%
Akbugday (2019) | Breast Cancer Wisconsin dataset | WEKA | KNN and SVM | - | KNN 96.85%, SVM 96.85%
Keles, M. Kaya (2019) | Wisconsin Diagnostic Breast Cancer dataset | Python | SVM vs KNN, decision trees and Naïve Bayes | SVMs map the input vector into a feature space of higher dimensionality and identify the hyperplane that separates the data points into two classes | 96.91%
Chaurasia et al. (2016) | UC Irvine machine learning repository | WEKA | Naïve Bayes, J48 Decision Tree and Bagging algorithm | Decision tree (C5) is the best predictor on the holdout sample (this prediction accuracy is better than any reported in the literature) | 96.5%
Kavitha et al. (2014) | Cancer Society | MATLAB | Ensemble method with Logistic and Neural Network | Multiple learners are combined, giving higher accuracy | 96.3%
Sinthia et al. (2014) | Wisconsin Diagnosis Breast Cancer BCI dataset | CAD system | Logistic Regression and the Backpropagation neural network | - | 94.2%
Khourdifi et al. (2018) | Wisconsin Breast Cancer dataset | WEKA | Fast Correlation Based Filter with SVM, Random Forest, Naïve Bayes, K-NN and MLP | Attributes are reduced by deleting irrelevant and redundant attributes, which have no meaning in the classification task | 96.1%
Khuriwal et al. (2018) | Haberman's Survival dataset | WEKA | Naïve Bayes and SVM | Helps in marginalizing the hyper-parameters and differentiating classes | 74.44%
Shravya et al. (2019) | UCI repository | Spyder | SVM | Hyperplane separates two classes, which helps in higher accuracy | 92.7%
Bellaachia et al. (2016) | SEER Public-Use Data | WEKA | Naïve Bayes | Gives a probabilistic model for classification, helping in classification | 96.3%
Mohana et al. (2019) | Wisconsin Breast Cancer dataset | WEKA | Decision Tree | Helps in splitting and choosing the best attributes | 96.3%
Kibeom et al. (2017) | Gene Expression Dataset Collection | WEKA | C4.5, Bagging and AdaBoost decision trees | Ensemble method helps to combine multiple learners | Single C4.5: 95.6%, Bagging C4.5: 93.29%, AdaBoost C4.5: 92.62%
Cruz et al. (2007) | PubMed (biomedical literature), the Science Citation | MATLAB | SVM, Naïve Bayes | Helps to form a decision boundary and helps in classification | 97.3%
Medjahed et al. (2013) | Wisconsin breast cancer dataset | WEKA | Decision Trees | Helps in splitting | 96.1%

2.4 SUMMARY
A great deal of research has been performed in the area of diagnosis and classification of breast
cancer, using many techniques and methods, to obtain high accuracy and to develop a system
fully capable of diagnosing the aforementioned disease. However, there is still a lot to be done in
the field; large datasets are often not available for most tasks. Because of this, many deep learning
techniques cannot be applied due to the possibility of overfitting and poor generalization. As a
result, simpler models such as random forests or logistic regression, which do not require large
amounts of data, are often used instead.

CHAPTER 3
RESEARCH METHODOLOGY
3.0 INTRODUCTION
This chapter presents a description of the research process. It provides information
concerning the method that was used in undertaking this research, as well as a justification for the
use of this method.
3.1 DATASET
The dataset used for this research was obtained online. It has 569 instances and 32
attributes. It can be found on the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
3.1.1 DESCRIPTION OF DATASET
The features of the dataset were computed from a digitized image of a fine needle aspirate (FNA)
of a breast mass. They describe characteristics of the cell nuclei present in the image. Ten real-
valued features are computed for each cell nucleus. The mean, standard error and "worst" or
largest (mean of the three largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE and field 23 is
Worst Radius. The table below provides the attributes of the dataset.
S/N | ATTRIBUTE
1 | ID Number
2 | Diagnosis (M = malignant, B = benign)
3 | Radius (mean of distances from center to points on the perimeter)
4 | Texture (standard deviation of gray-scale values)
5 | Perimeter
6 | Area
7 | Smoothness (local variation in radius lengths)
8 | Compactness (perimeter^2 / area - 1.0)
9 | Concavity (severity of concave portions of the contour)
10 | Concave points (number of concave portions of the contour)
11 | Symmetry
12 | Fractal dimension ("coastline approximation" - 1)

The “Diagnosis” column contains the classification of the cancer, which has two values: either
malignant or benign. The class distribution is 357 benign and 212 malignant.
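
As an illustrative sketch, the dataset can be loaded and its class distribution confirmed with pandas as shown below. The file name data.csv and the lowercase column name diagnosis are assumptions based on the common CSV export of this dataset and may differ in other copies.

import pandas as pd

# Load the WBCD file; "data.csv" is an assumed local file name for the CSV
# export of the Wisconsin Diagnostic Breast Cancer dataset.
df = pd.read_csv("data.csv")

print(df.shape)                         # expected: 569 rows, 32 columns plus any empty trailing column
print(df["diagnosis"].value_counts())   # expected: B = 357, M = 212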

3.2 RESEARCH FRAMEWORK


Research in machine learning is mainly quantitative. Therefore, it uses standardized
approaches and statistical models in the various segments of the research.

The general approach used in conducting experiments in machine learning research is as follows:
data collection, data pre-processing, model training, model testing and model evaluation. These
steps are summarized in the diagram below.

Data Collection → Data Pre-Processing → Model Training → Model Testing → Model Evaluation

Figure 3.1 Machine Learning Research Approach.

3.2.1 Data Collection
The method by which the data was collected, and its description, have been provided in section 3.1.

3.2.2 Data Pre-processing

Data cleaning is the process of removing incorrect, incomplete and inaccurate data from the
dataset; it also replaces missing values. Since the source of data used for this research is
secondary, the data is already in a structured format, i.e. in the form of rows and columns, and
there are no missing values in the data. However, the column “ID” and the last column, titled
“Unnamed: 32”, were removed from the dataset, as the former is just a means of identification
while the latter contains no values. Apart from the issues mentioned, the data is already
pre-processed. The issue of data normalization will be discussed in the next section, which is
model training. The issue of data reduction is not discussed here, as the dataset is not that big.
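
A minimal sketch of this cleaning step is given below; the column names id and Unnamed: 32 follow the naming used in the CSV export and may differ in other copies of the dataset.

# Drop the identifier column and the empty trailing column; errors="ignore"
# keeps the step safe if a particular copy of the file does not contain them.
df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")

# Confirm that no missing values remain in the other columns.
print(df.isnull().sum().sum())  # expected: 0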
3.2.3 Model Training
To train our model, we split the dataset into a training set (to train the model) and a testing set
(for the model to predict on). We then replace the “Diagnosis” column of the dataset, as it contains
objects, with numbers (1 for malignant and 0 for benign) so it can be passed into our model.
Before passing the dataset into the model, we scale it using the standardization method.
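
The sketch below illustrates these steps (encoding the target, splitting, and standardizing) with scikit-learn. The 75/25 split ratio and the random seed are assumptions, since the report does not state them explicitly, although a 25% test set is consistent with the 143 test samples reported in Chapter 4.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Encode the target: 1 for malignant, 0 for benign.
y = df["diagnosis"].map({"M": 1, "B": 0})
X = df.drop(columns=["diagnosis"])

# Split into training and testing sets (75/25 split and seed are assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Standardize the features: fit the scaler on the training set only, then
# apply the same transformation to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)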

3.2.4 Model Testing

After the model was trained, we tested it by passing the testing set of our data to the model,
where it predicts the diagnosis for each sample in the set.
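
In code, this step amounts to asking the trained classifier for predictions on the held-out set, roughly as follows; model here stands for the fitted classifier (the support vector machine whose training sketch appears later in this chapter).

# "model" is the classifier fitted on X_train / y_train; a training sketch
# for the SVM used in this project is shown under the SVM heading below.
y_pred = model.predict(X_test)
print(y_pred[:10])  # first ten predicted diagnoses (1 = malignant, 0 = benign)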
3.2.5 Model Evaluation
To evaluate our model, we use metrics designed for classification models. Classification
metrics evaluate a model’s performance and tell you how good or bad the classification is, but
each of them evaluates it in a different way. The following metrics are used:
• Confusion Matrix
• Accuracy
• Precision and Recall
• F1-score

CONFUSION MATRIX
A confusion matrix is a tabular visualization of the ground-truth labels versus the model
predictions. Each row of the confusion matrix represents the instances of an actual class and each
column represents the instances of a predicted class. The confusion matrix is not exactly a
performance metric, but rather a basis on which other metrics evaluate the results.

Actual \ Predicted | Predicted Negative | Predicted Positive
Actual Negative | TN (True Negative) | FP (False Positive)
Actual Positive | FN (False Negative) | TP (True Positive)

Each prediction can be one of four outcomes, based on how it matches up to the actual value
(a minimal sketch of obtaining these counts follows the list):
• True Positive (TP) signifies how many positive instances the model predicted correctly.
• True Negative (TN) signifies how many negative instances the model predicted correctly.
• False Positive (FP) signifies how many negative instances the model predicted as positive.
• False Negative (FN) signifies how many positive instances the model predicted as negative.
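
The sketch below shows how these four counts can be obtained with scikit-learn; y_test and y_pred are the held-out labels and model predictions from the previous sections.

from sklearn.metrics import confusion_matrix

# With the label order [0, 1] (benign, malignant), the 2x2 matrix has rows as
# actual classes and columns as predicted classes, so ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")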

ACCURACY
Accuracy is a common evaluation metric for classification problems. It is the number of correct
predictions made as a ratio of all predictions made.
Overall, how often is the classifier correct?
Accuracy = (TP + TN) / total
PRECISION
Precision is the ratio of true positives to the total predicted positives.

Precision = TP / (TP + FP)

A precision score close to 1 signifies that almost all the samples the model labelled as positive are
truly positive, i.e. the model distinguishes well between correct and incorrect labelling of cancer
patients. A low precision score means that the model has a high number of false positives.
RECALL
Recall is essentially the ratio of true positives to all the positives in the ground truth.

Recall = TP / (TP + FN)

A recall close to 1 signifies that the model did not miss many true positives and is able to
correctly identify the cancer patients. A low recall means the model has a high number of false
negatives, which can be an outcome of an imbalanced class or untuned model hyperparameters.

F1 SCORE
The F1 score metric combines precision and recall. In fact, the F1 score is the harmonic
mean of the two. The formula is:
F1 = 2 × (precision × recall) / (precision + recall)
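
All four metrics can be computed directly with scikit-learn, as in the sketch below (again using y_test and y_pred from the earlier sections).

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_test, y_pred))          # harmonic mean of precision and recall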

SUPPORT VECTOR MACHINE

The algorithm used to build the model for this research is known as the Support Vector
Machine (SVM). The objective of a support vector machine is to find a hyperplane in an N-
dimensional space (N being the number of features) that distinctly classifies the data points. To
separate the two classes of data points, there are many possible hyperplanes that could be chosen.
Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between
data points of both classes. Maximizing the margin distance provides some reinforcement so that
future data points can be classified with more confidence. Hyperplanes are decision boundaries
that help classify data points. Data points falling on either side of the hyperplane can be attributed
to different classes. The dimension of the hyperplane depends upon the number of features: if
the number of input features is 2, then the hyperplane is just a line; if the number of features is 3,
then the hyperplane becomes a two-dimensional plane. Support vectors are the data points that are
closest to the hyperplane and influence its position and orientation. Using these support vectors,
we maximize the margin of the classifier.

Algorithm: BCCA
Input: Array of classes
Output: Classified classes
1. Select the hyperplane that divides the classes best.
2. To find the better hyperplane, calculate the distance between the plane and the nearest data
points; this distance is called the margin.
3. If the distance between the classes is low, then
(a) the chance of misclassification is high, and vice versa, so we
(b) select the hyperplane with the highest margin.
Margin = distance to positive point - distance to negative point.
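
As a sketch, the scikit-learn implementation of this classifier can be trained on the scaled data from the model training section as follows. No hyperparameters are set explicitly because Chapter 4 notes that the default hyperparameters were used.

from sklearn.svm import SVC

# Support vector classifier with scikit-learn's default hyperparameters;
# fit() finds the margin-maximizing hyperplane described above.
model = SVC()
model.fit(X_train, y_train)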

3.3 Experimental Framework


This project was performed on a Windows laptop with a processor speed of 2.30 GHz and a
memory of 4.0 GB. The programming language used was Python. Python is a general-purpose
programming language. It is the most useful language for data science because it is simple, easy
and quicker to learn than most other languages. Secondly, it is very scalable. Also, Python has a lot
of built-in libraries to analyze and work with data; some of these libraries include NumPy,
Pandas, Scikit-learn, etc. Python libraries are powerful and very broad.
• NumPy is great for linear algebra, high-level mathematical functions and random number
crunching.
• Pandas provides a range of functions for handling data structures and operations, such as
manipulating tables and time series.
• Scikit-learn features various classification, regression and clustering algorithms,
including support vector machines, random forests, gradient boosting, etc., and it is designed
to interoperate with Python's numerical and scientific libraries.

3.4 IMPLEMENTATION OF PROJECT


CATEGORY | DESCRIPTION
Database | MySQL
Technology | Python

CHAPTER FOUR
ANALYSIS AND DISCUSSION
4.0 INTRODUCTION
In this chapter, the results of the experiments conducted are presented and discussed. Graphical
presentations of the results are also given.
4.1 RESULTS
To analyze our data, we begin by exploring the class distribution, where we found that out of the
569 samples, 357 were benign while 212 were malignant. The sample distribution is roughly 63 to 37
percent.

Figure 4.1 Distribution of samples

Figure 4.2 Percentage of samples
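
The two figures above can be reproduced with a short snippet along these lines; the exact styling of the original figures is not known, so this is only an approximation using the dataframe loaded in Chapter 3.

import matplotlib.pyplot as plt

counts = df["diagnosis"].value_counts()  # expected: B = 357, M = 212

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0], title="Distribution of samples")                    # cf. Figure 4.1
counts.plot(kind="pie", ax=axes[1], autopct="%1.0f%%", title="Percentage of samples")   # cf. Figure 4.2
plt.tight_layout()
plt.show()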

We then try to find out whether there exists any correlation between the attributes in our dataset.
To do that, we divide the dataset into three parts (the mean, the standard error and the worst
columns) for easy analysis. After examining the first part of our dataset, the following
observations were made.

The columns

(a) Radius, perimeter, area and concave points are related to one another.

(b) Compactness, concavity and concave points are related to one another.

(c) Texture, smoothness, symmetry and fractal_dimension do not relate to any other column.

Figure 4.3 Heatmap of the mean columns

Figure 4.4 Pair plot of the mean columns.

We then take the second part of our dataset (i.e. the standard error (SE) columns) and the
observations also hold true for them.

Figure 4.5 Heatmap of the SE columns.

We then concluded that the same correlation also holds for the worst columns.
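
A sketch of the correlation analysis for the mean columns is shown below; the *_mean column-name suffix is an assumption based on the usual CSV export of this dataset, and the same code applies to the SE and worst columns by changing the suffix.

import seaborn as sns
import matplotlib.pyplot as plt

# Select the ten "mean" columns (assumes the usual *_mean naming of the CSV export).
mean_cols = [c for c in df.columns if c.endswith("_mean")]

# Correlation matrix and heatmap, as in Figure 4.3.
corr = df[mean_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between the mean features")
plt.show()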

To develop our model, we first remove two columns from our dataset, namely “ID” and
“Unnamed: 32”. We then separate the dataset into target and features, split the data for training
and testing, scale the data and pass it into our model. We then evaluate the model based on the
performance measures stated in Chapter 3, and the results of the evaluation are as follows.

1. Accuracy: The accuracy of the model is 96.5%


2. Precision: The precision of the model is 91.1%
3. Recall : The recall of the model is 97.6%
4. F1 Score: The F1 score of the model is 94.6%
5. The true positive, true negative, false positive and false negative counts are:
• TP: 41
• FN: 4
• FP: 1
• TN: 97

Figure 4.6 Confusion matrix scores

4.2 DISCUSSION

Understanding how well a machine learning model is going to perform on unseen data is the
ultimate purpose of evaluating a model using the metrics above. Seeing that all the
performance metrics used to evaluate our model give values greater than 90 percent, and
seeing also that almost all of the values in the confusion matrix were predicted correctly (only 5
instances were predicted incorrectly), we can safely say that the model performed very well,
considering the distribution of the classes and also considering the fact that default
hyperparameters were used for our model.

CHAPTER 5

SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.0 SUMMARY
The aim of this research was to develop a model for classifying breast cancer as either
malignant or benign. To do that, the Jupyter platform was used with the Python programming
language and the Wisconsin Breast Cancer Dataset. The model was built using the support
vector machine algorithm. An exploratory data analysis of the dataset was also carried out. The
results of the analysis show that some of the features of the dataset are positively correlated with
each other while others are not. To evaluate the model, we used four performance measures that
were computed using the confusion matrix. These performance measures are

• Accuracy
• Precision
• Recall
• F1-score

The accuracy of the model was around 96 percent, while the precision, recall and F1-score were

91.1%, 97.6% and 94.6% respectively.

5.1 CONCLUSION

It was observed that some of the attributes of the dataset exhibit correlation with one another
while others do not. The model was developed, and it was observed that the accuracy of the
model is around 96 percent. In summary, the algorithm performed effectively and efficiently on
the performance measures used to evaluate it. The uneven distribution of the target column
should also be taken into consideration, as one class (benign) contains 63 percent of the dataset
while the other class (malignant) contains only 37 percent.

5.2 RECOMMENDATION

Future research should be carried out for better performance of this model. This can be achieved
by performing hyperparameter optimization, seeing as the model was built using default
hyperparameters. K-fold cross-validation techniques can also be used to better evaluate the model
and obtain a more reliable accuracy estimate.

REFERENCES

Abdelghani Bellaachia, Erhan Guven, “Predicting Breast Cancer Survivability Using Data Mining Techniques”.

B. Akbugday, "Classification of Breast Cancer Data Using Machine Learning Algorithms," 2019 Medical Technologies Congress (TIPTEKNO), Izmir, Turkey, 2019, pp. 1-4.

Ch. Shravya, K. Pravalika, Shaik Subhani, “Prediction of Breast Cancer Using Supervised Machine Learning Techniques”, International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume-8, Issue-6, April 2019.

Delen, D.; Walker, G.; Kadam, A., "Predicting breast cancer survivability: A comparison of three data mining methods", Artif. Intell. Med. 2005, 34, 113–127.

Joseph A. Cruz and David S. Wishart, “Applications of Machine Learning in Cancer Prediction and Prognosis”, Cancer Informatics, 2(3):59-77, February 2007.

Keles, M. Kaya, "Breast Cancer Prediction and Detection Using Data Mining Classification Algorithms: A Comparative Study", Tehnicki Vjesnik - Technical Gazette, vol. 26, no. 1, 2019, p. 149+.

Kibeom Jang, Minsoon Kim, Candace A. Gilbert, Fiona Simpkins, Tan A. Ince, Joyce M. Slingerland, “VEGFA activates an epigenetic pathway regulating ovarian cancer-initiating cells”, EMBO Molecular Medicine, Volume 9, Issue 3, 2017.

N. Khuriwal, N. Mishra, “A Review on Breast Cancer Diagnosis in Mammography Images Using Deep Learning Techniques”, 2018, Vol. 1, No. 1.

P. Sinthia, R. Devi, S. Gayathri and R. Sivasankari, “Breast Cancer Detection Using PCPCET and ADEWNN”, CIEEE '17, pp. 63-65.

R. K. Kavitha, D. D. Rangasamy, “Breast Cancer Survivability Using Adaptive Voting Ensemble Machine Learning Algorithm Adaboost and CART Algorithm”, Volume 3, Special Issue 1, February 2014.

R. M. Mohana, R. Delshi Howsalya Devi, Anita Bai, “Lung Cancer Detection Using Nearest Neighbour Classifier”, International Journal of Recent Technology and Engineering (IJRTE), Volume-8, Issue-2S11, September 2019.

S. A. Medjahed, T. A. Saadi, A. Benyettou, “Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules”, International Journal of Computer Applications, 62(1), 2013.

V. Chaurasia and S. Pal, “Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability”, IJCSMC, Vol. 3, Issue 1, January 2014, pp. 10–22.

Vikas Chaurasia and S. Pal, “Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis”, FAMS 2016, 83 (2016), 1064–1069.

Wang, D. Zhang and Y. H. Huang, “Breast Cancer Prediction Using Machine Learning”, 2018, Vol. 66, No. 7.

Y. Khourdifi and M. Bahaj, "Feature Selection with Fast Correlation-Based Filter for Breast Cancer Prediction and Classification Using Machine Learning Algorithms," 2018 International Symposium on Advanced Electrical and Communication Technologies (ISAECT), Rabat, Morocco, 2018, pp. 1-6.

APPENDICES

APPENDIX – 1 : SAMPLE SOURCE CODES

APPENDIX – 2 : SAMPLE SCREENSHOTS

