Report of Mini Project

ABSTRACT

Demand forecasting is a key aspect of successfully managing restaurants, supermarkets and staff canteens. In particular, properly predicting the future sales of menu items allows for precise ordering of food items. This ensures a low level of pre-consumer food waste, which is critical to the profitability of the restaurant. Hence, this project is interested in predicting future values of the daily sold quantities of given menu items.

The time series show multiple strong seasonalities, trend changes, data gaps, and outliers. This project proposes a forecasting approach that is based solely on data retrieved from point-of-sale systems and allows for a straightforward human interpretation. It therefore proposes two generalized models for predicting future sales. In an extensive evaluation, data sets consisting of supermarket sales data are used.

The main motivation for this project is to present a sales prediction model for supermarket data. Further, this research work aims to identify the best classification algorithm for sales analysis.

In this work, data mining classification algorithms such as SVM, KNN and Naïve Bayes are used to develop a prediction system that analyzes and predicts the sales volume. In addition, CNN classification is also used in the proposed system for better classification results. The project is designed using Python 3.11.
CONTENTS
CHAPTER NO. TITLE
ABSTRACT
1 INTRODUCTION
1.1 Objectives
1.2 About the project

2 LITERATURE SURVEY
2.1 Related Work

3 SYSTEM ANALYSIS
3.1 Existing System
3.2 Drawbacks of Existing System
3.3 Proposed System
3.4 Advantages of Proposed System
3.5 Feasibility Study
3.5.1 Economical Feasibility
3.5.2 Operational Feasibility
3.5.3 Technical Feasibility
4 SYSTEM SPECIFICATION
4.1 Hardware Requirements
4.2 Software Requirements
5 SOFTWARE DESCRIPTION
5.1 Front End
5.2 Back End
6 PROJECT DESCRIPTION
6.1 Problem Definition
6.2 Overview of the project
6.3 Module Description
6.4 Input Design
6.5 Output Design
6.6 System Flow Diagram
6.7 Use Case Diagram
7 SYSTEM TESTING
8 SYSTEM IMPLEMENTATION
9 CONCLUSION AND FUTURE ENHANCEMENT
10 BIBLIOGRAPHY
10.1 Book References
10.2 Web References
10.3 Journal References
APPENDIX
A. Sample Screens
B. Sample Code
CHAPTER 1

INTRODUCTION

1.1 OBJECTIVES

The main objectives of the project are:

▪ To apply Naïve Bayes classification for finding the conditional probability of outlet sales.
▪ To apply SVM/KNN classification, which generally handles a larger number of instances, so that its randomization concept works well and it generalizes to novel data.
▪ To apply SVM/KNN classification, which is preferred when the data set grows larger.
▪ To apply SVM/KNN classification, which is preferred when there is more outlier data.
▪ To plot various charts for outlet sales.

1.2 ABOUT THE PROJECT

Machine Learning is a category of algorithms that allows software applications to


become more accurate in predicting outcomes without being explicitly programmed. The basic
premise of machine learning is to build models and employ algorithms that can receive input
data and use statistical analysis to predict an output while updating outputs as new data becomes
available. These models can be applied in different areas and trained to match the expectations of
management so that accurate steps can be taken to achieve the organization’s target.

In today's world, large shopping centers such as big malls and marts record data related to the sales of items or products, along with their various dependent and independent factors, as an important step towards predicting future demand and managing inventory.

The dataset, built with various dependent and independent variables, is a composite of item attributes, data gathered from customers, and data related to inventory management in a data warehouse. The data is then refined in order to obtain accurate predictions and to discover new and interesting results that shed new light on our knowledge of the task's data.

Demand forecasting is a key aspect of successfully managing restaurants, supermarkets and staff canteens. In particular, properly predicting the future sales of menu items allows for precise ordering of food items. This ensures a low level of pre-consumer food waste, which is critical to the profitability of the restaurant. Hence, this project is interested in predicting future values of the daily sold quantities of given menu items.

SVM and K-Nearest Neighbors (KNN) learning algorithms are used; both are based on instances and the knowledge gained through them [4]. Unlike mining in data stream scenarios, where every sample can simultaneously belong to multiple classes as in hierarchical multi-label classification problems, k-NN is proposed here to predict outputs in structured form.
CHAPTER 2
LITERATURE SURVEY

2.1 RELATED WORK


1. INTRODUCTION TO MACHINE LEARNING

AUTHORS
ALEX SMOLA
S.V.N. VISHWANATHAN
In this paper [1], the authors stated that over the past two decades Machine Learning has become one of the mainstays of information technology and, with that, a rather central, albeit usually hidden, part of our life. With the ever-increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress. The purpose of this chapter is to provide the reader with an overview of the vast range of applications which have at their heart a machine learning problem and to bring some degree of order to the zoo of problems. After that, they discussed some basic tools from statistics and probability theory, since they form the language in which many machine learning problems must be phrased to become amenable to solving. Finally, they outlined a set of fairly basic yet effective algorithms to solve an important problem, namely that of classification. More sophisticated tools, a discussion of more general problems and a detailed analysis follow in later parts of the book.

A Taste of Machine Learning


Machine learning can appear in many guises. The authors discussed a number of
applications, the types of data they deal with, and finally, they formalized the problems in a
somewhat more stylized fashion. The latter is key if they want to avoid reinventing the wheel for
every new application. Instead, much of the art of machine learning is to reduce a range of fairly
disparate problems to a set of fairly narrow prototypes. Much of the science of machine learning
is then to solve those problems and provide good guarantees for the solutions.
2. AN INTRODUCTION TO DATA SCIENCE
AUTHORS
JEFFREY S. SALTZ
JEFFREY M. STANTON

In this paper [2], the authors answered the question: What Is Data Science? For some, the
term data science evokes images of statisticians in white lab coats staring fixedly at blinking
computer screens filled with scrolling numbers. Nothing could be farther from the truth. First,
statisticians do not wear lab coats: this fashion statement is reserved for biologists, physicians,
and others who have to keep their clothes clean in environments filled with unusual fluids.
Second, much of the data in the world is non-numeric and unstructured.

In this context, unstructured means that the data are not arranged in neat rows and
columns. Think of a web page full of photographs and short messages among friends: very few
numbers to work with there. While it is certainly true that companies, schools, and governments
use plenty of numeric information—sales of products, grade point averages, and tax assessments
are a few examples—there is lots of other information in the world that mathematicians and
statisticians look at and cringe. So, while it is always useful to have great math skills, there is
much to be accomplished in the world of data science for those of us who are presently more
comfortable working with words, lists, photographs, sounds, and other kinds of information.

In addition, data science is much more than simply analyzing data. There are many
people who enjoy analyzing data and who could happily spend all day looking at histograms and
averages, but for those who prefer other activities, data science offers a range of roles and
requires a range of skills. Let’s consider this idea by thinking about some of the data involved in
buying a box of cereal. Whatever your cereal preferences—fruity, chocolaty, fibrous, or nutty—
you prepare for the purchase by writing “cereal” on your grocery list. Already your planned
purchase is a piece of data, also called a datum, albeit a pencil scribble on the back of an
envelope that only you can read. When you get to the grocery store, you use your datum as a
reminder to grab that jumbo box of FruityChocoBoms off the shelf and put it in your cart. At the
checkout line, the cashier scans the barcode on your box, and the cash register logs the price.
Back in the warehouse, a computer tells the stock manager that it is time to request
another order from the distributor, because your purchase was one of the last boxes in the store.
You also have a coupon for your big box, and the cashier scans that, giving you a predetermined
discount. At the end of the week, a report of all the scanned manufacturer coupons gets uploaded
to the cereal company so they can issue a reimbursement to the grocery store for all of the
coupon discounts they have handed out to customers. Finally, at the end of the month a store
manager looks at a colorful collection of pie charts showing all the different kinds of cereal that
were sold and, on the basis of strong sales, decides which ones to keep stocking.

3. NOTES ON MACHINE LEARNING

AUTHOR
CAIO CORRO

In this paper [3], the author stated that the goal of the paper is to give the reader tools to understand how to train machine learning models. This is an optimization problem, and therefore it is treated as such. The author only considers linear models, switching between different tasks: regression, binary classification, multiclass classification and structured prediction. Many important topics are not covered, e.g. proofs of convergence rates of different optimization techniques. Some definitions, e.g. lower semi-continuity, and some proofs are also skipped. The goal is that the reader gets the main idea behind the different concepts, so definitions and theorems may be "handwavy" so as not to burden the material. The main reason is that time is limited, and the author prefers to focus on things that are simpler to understand and that can be coded. However, the author hopes that, overall, the course provides a strong background for reading the literature and exploring optimization techniques for machine learning.

Linear models

The author considers the following scenario: we are given feature values and we want to predict an output. The feature values are represented as a vector x ∈ R^d, where d is the number of features. In this course, feature values are always real numbers; however, a given feature may take only the values 0 and 1 to indicate the presence or absence of that feature. The output is either a scalar y ∈ R (regression, binary classification) or a vector y ∈ R^k (multiclass classification and structured prediction). Again, the output is written as lying in R but, for example, for binary classification it is actually a strict subset of R: y ∈ {−1, 1}. The general framework studied is machine learning models that compute the output in two steps:

• a parameterized scoring function s(x) that maps from the input space to the score space (also called weights or logits). This course focuses only on functions that apply a linear transformation to the input, e.g. s(x) = Ax + b, where A and b are the parameters of the scoring function.
• a prediction function q(w) that maps a value w from the score space to the output space. Note that, as usual, the bias term in the linear transformation is often ignored by assuming there is a special feature that is always set to one.
The author considers both the hard-decision and the likelihood-estimation cases. When the prediction function outputs a hard decision, it does not give any information about its own uncertainty. This setting can be understood as a probability distribution over the output space where all mass is concentrated in a single point. Hence, the two cases are not really distinguished, and it is shown that they can be understood under the same framework. The main problem addressed in this course is the training problem, i.e. how to fix the parameters of the scoring function. The focus is on the supervised training scenario, where a labeled training set {(x^(i), y^(i))}, i = 1, ..., n, of n pairs (x^(i), y^(i)) is assumed to be available.
The idea is to select the element of a restricted set of functions that:
• classifies correctly all (or most of) the datapoints in the training data,
• or maximizes the probability of the training data (probabilistic setting).
It is often important to also regularize the training objective to avoid overfitting on the training data. In this case, the set of functions is a set of parameterized linear models of a given dimension: F(d, k) = {s(·; A) | A ∈ R^(k×d)}, where d is the input dimension (the number of features, including the bias), k is the output dimension (1 for regression and binary classification, larger for other problems) and s(x; A) = Ax. The training problem, that is, selecting the best function in F(d, k), is an optimization problem, so it is treated as such.
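To make the two-step framework concrete, the following small sketch (not part of the cited paper; written here with NumPy and illustrative dimensions) implements a linear scoring function s(x) = Ax + b and a hard-decision prediction function q(w):

import numpy as np

d, k = 4, 3                         # number of features and number of classes (assumed values)
rng = np.random.default_rng(0)
A = rng.normal(size=(k, d))         # parameters of the scoring function
b = np.zeros(k)

def score(x):
    """Scoring function s(x) = Ax + b: maps the input space to the score space."""
    return A @ x + b

def predict(w):
    """Prediction function q(w): hard decision, the index of the highest score."""
    return int(np.argmax(w))

x = rng.normal(size=d)              # one feature vector
print(score(x), predict(score(x)))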
CHAPTER 3
SYSTEM ANALYSIS

3.1 EXISTING SYSTEM AND ITS DRAWBACKS

In the existing system, the supermarket dataset, which contains attributes such as branch, gender, quantity, unit price and total along with rating, is taken, and classification algorithms are applied for classification/prediction purposes. The Naïve Bayes algorithm is used for finding conditional probabilities. In addition, SVM and KNN classification is also carried out. 75% of the whole data set is taken as training data and the model is built from it. The remaining 25% of the data is then taken as test data and checked against the predicted model.
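A minimal sketch of the 75%/25% split described above is shown below; it assumes scikit-learn is installed, reuses the supermarket.csv file from the appendix, and the 'Rating' column name and the rating threshold are illustrative assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('supermarket.csv')              # file name as used in the appendix
y = (data['Rating'] >= 5.0).astype(int)            # assumed class: rating below/above 5.0
X = data.select_dtypes(include='number').drop(columns=['Rating'], errors='ignore')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)         # 75% training data, 25% test data
print(len(X_train), 'training records,', len(X_test), 'test records')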

3.1.1. DRAWBACKS

▪ The Naïve Bayes classification yields conditional probability values only for the existing dataset; new test data has to be added for classification.
▪ CNN is not applied, so the outlier data could not be predicted well.
▪ Naïve Bayes classification is not preferred when there is more outlier data.

3.2 PROPOSED SYSTEM

All the approaches of the existing system are carried over into the proposed system. In addition, SVM and K-Nearest Neighbor based classification are used to predict the model, as they help in various ways, along with random forest. They are found to be especially suitable when the data set has a larger number of records or contains outlier data. A wide variety of sales records can be taken for classification across all branches, predicting a new model and at the same time increasing the efficiency.
3.2.1 ADVANTAGES

The proposed system has the following advantages:

▪ CNN generally handles a larger number of instances, so that its randomization concept works well and it generalizes to novel data.
▪ CNN is preferred when the data set grows larger.
▪ CNN is preferred when there is more outlier data.

3.3 FEASIBILITY STUDY

The feasibility study deals with all the analysis taken up in developing the project. Each structure has to be thought through while developing the project, as it has to serve the end user in a user-friendly manner. One must know the type of information to be gathered, and the system analysis consists of collecting, organizing and evaluating facts about the system and its environment.
The main objective of the system analysis is to study the existing operation and to learn and accomplish the processing activities. The record classification with uncertain data through the Python application needs to be analyzed well. The details are processed through the code itself and are controlled by the programs alone.

3.3.1 ECONOMIC FEASIBILITY

The organization has to buy a personal computer with a keyboard and a mouse; this is a direct cost. There are many direct benefits of converting the manual system to a computerized system. The user can be given responses to the questions asked. The justification of any capital outlay is that it will reduce expenditure or improve the quality of service or goods, which in turn may be expected to provide increased profits.
3.3.2 OPERATIONAL FEASIBILITY

The proposed system solves the problems that occurred in the existing system. The current day-to-day operations of the organization can be fitted into this system. Operational feasibility mainly includes an analysis of how the proposed system will affect the organizational structures and procedures.

3.3.3 TECHNICAL FEASIBILITY

The cost and benefit analysis leads to the conclusion that a computerized system is favorable in today's fast-moving world. The assessment of technical feasibility must be based on an outline design of the system requirements in terms of input, output, files, programs and procedures.

The project aims to predict sales from the given supermarket data. The system is tested for technical feasibility to improve the effectiveness of classification using the classification routines. The current system aims to overcome the problems of the existing system and to reduce the technical skill requirements.
CHAPTER 4
SYSTEM SPECIFICATION

4.1 HARDWARE REQUIREMENTS


This section gives the details and specification of the hardware on which the system is
expected to work.
Processor : Intel Core 2 Quad
Hard Disk Capacity : 500 GB
RAM : 4 GB SD
Monitor : 17-inch Color
Keyboard : 102 keys
Mouse : Optical Mouse

4.2 SOFTWARE REQUIREMENTS


This section gives the details of the software that are used for the development.
Operating System : Windows 10 Pro
Environment : Python 3.11
Language : Python
CHAPTER 5
SOFTWARE DESCRIPTION
5.1 PYTHON

You could write a Unix shell script or Windows batch file for some of these tasks, but shell scripts are best at moving files around and changing text data, and are not well suited for GUI applications or games. You could write a C/C++/Java program, but it can take a lot of development time to get even a first-draft program. Python is simpler to use, available on Windows, Mac OS X, and Unix operating systems, and will help get the job done more quickly.

Python is simple to use, but it is a real programming language, offering much more
structure and support for large programs than shell scripts or batch files can offer. On the other
hand, Python also offers much more error checking than C, and, being a very-high-level
language, it has high-level data types built in, such as flexible arrays and dictionaries. Because of
its more general data types Python is applicable to a much larger problem domain than Awk or
even Perl, yet many things are at least as easy in Python as in those languages.

Python allows you to split your program into modules that can be reused in other Python programs. It comes with a large collection of standard modules that you can use as the basis of your programs, or as examples to start learning to program in Python. Some of these modules provide things like file I/O, system calls, sockets, and even interfaces to graphical user interface toolkits like Tk.

Python is an interpreted language, which can save considerable time during program
development because no compilation and linking is necessary. The interpreter can be used
interactively, which makes it easy to experiment with features of the language, to write throw-
away programs, or to test functions during bottom-up program development. It is also a handy
desk calculator.
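For instance, the interpreter can be started and used interactively as a simple calculator:

>>> 2 + 2
4
>>> (50 - 5 * 6) / 4
5.0
>>> round(17.5 * 0.18, 2)   # quick 18% calculation
3.15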
5.1.1. Invoking the Interpreter

The Python interpreter is usually installed as /usr/local/bin/python3.3 on those machines where it is available; putting /usr/local/bin in your Unix shell's search path makes it possible to start it by typing the command python3.3 to the shell.

Since the choice of the directory where the interpreter lives is an installation option, other places are possible; check with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternative location.) On Windows machines, the Python installation is usually placed in C:\Python33, though you can change this when you're running the installer. To add this directory to your path, you can type the following command into the command prompt in a DOS box: set path=%path%;C:\python33

Typing an end-of-file character (Control-D on Unix, Control-Z on Windows) at the primary prompt causes the interpreter to exit with a zero exit status. If that doesn't work, you can exit the interpreter by typing the following command: quit().

The interpreter's line-editing features usually aren't very sophisticated. On Unix, whoever installed the interpreter may have enabled support for the GNU readline library, which adds more elaborate interactive editing and history features. Perhaps the quickest check to see whether command line editing is supported is typing Control-P to the first Python prompt you get. If it beeps, you have command line editing; if nothing appears to happen, or if ^P is echoed, command line editing isn't available; you'll only be able to use backspace to remove characters from the current line.

The interpreter operates somewhat like the Unix shell: when called with standard input
connected to a tty device, it reads and executes commands interactively; when called with a file
name argument or with a file as standard input, it reads and executes a script from that file.
A second way of starting the interpreter is python -c command [arg] ..., which executes the
statement(s) in command, analogous to the shell’s -c option. Since Python statements often
contain spaces or other characters that are special to the shell, it is usually advised to quote
command in its entirety with single quotes. Some Python modules are also useful as scripts.
5.1.2 Error Handling
When an error occurs, the interpreter prints an error message and a stack trace. In interactive mode, it then returns to the primary prompt; when input came from a file, it exits with a nonzero exit status after printing the stack trace. (Exceptions handled by an except clause in a try statement are not errors in this context.) Some errors are unconditionally fatal and cause an exit with a nonzero exit status; this applies to internal inconsistencies and some cases of running out of memory. All error messages are written to the standard error stream; normal output from executed commands is written to standard output. Typing the interrupt character (usually Control-C or DEL) at the primary or secondary prompt cancels the input and returns to the primary prompt. Typing an interrupt while a command is executing raises the KeyboardInterrupt exception, which may be handled by a try statement.
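The behaviour described above can be reproduced with a small sketch (not part of the report) that handles both ordinary exceptions and a KeyboardInterrupt with a try statement:

try:
    value = int(input('Enter a number: '))   # raises ValueError on non-numeric input
    print(100 / value)                       # raises ZeroDivisionError when value is 0
except ValueError:
    print('That was not a number.')
except ZeroDivisionError:
    print('Cannot divide by zero.')
except KeyboardInterrupt:
    print('Interrupted by the user (Control-C).')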
5.1.3. Informal Introduction to Python
In the following examples, input and output are distinguished by the presence or absence
of prompts (>>> and ...): to repeat the example, you must type everything after the prompt,
when the prompt appears; lines that do not begin with a prompt are output from the interpreter.
Note that a secondary prompt on a line by itself in an example means you must type a blank line;
this is used to end a multi-line command.
Many of the examples in this manual, even those entered at the interactive prompt,
include comments. Comments in Python start with the hash character, #, and extend to the end of
the physical line. A comment may appear at the start of a line or following whitespace or code,
but not within a string literal. A hash character within a string literal is just a hash character.
Since comments are to clarify code and are not interpreted by Python, they may be omitted when
typing in examples.

# this is the first comment


spam = 1 # and this is the second comment
# ... and now a third!
text = "# This is not a comment because it's inside quotes."
5.1.4 Documentation Strings

The first line should always be a short, concise summary of the object’s purpose. For
brevity, it should not explicitly state the object’s name or type, since these are available by other
means (except if the name happens to be a verb describing a function’s operation). This line
should begin with a capital letter and end with a period.

If there are more lines in the documentation string, the second line should be blank,
visually separating the summary from the rest of the description. The following lines should be
one or more paragraphs describing the object’s calling conventions, its side effects, etc.

The Python parser does not strip indentation from multi-line string literals in Python, so
tools that process documentation have to strip indentation if desired. This is done using the
following convention. The first non-blank line after the first line of the string determines the
amount of indentation for the entire documentation string. (We can’t use the first line since it is
generally adjacent to the string’s opening quotes so its indentation is not apparent in the string
literal.) Whitespace “equivalent” to this indentation is then stripped from the start of all lines of
the string. Lines that are indented less should not occur, but if they occur all their leading
whitespace should be stripped. Equivalence of whitespace should be tested after expansion of
tabs (to 8 spaces, normally).

>>> def my_function():
...     """Do nothing, but document it.
...
...     No, really, it doesn't do anything.
...     """
...     pass
...
>>> print(my_function.__doc__)
Do nothing, but document it.

    No, really, it doesn't do anything.


5.1.5 Coding Style
In Python, PEP 8 has emerged as the style guide that most projects adhere to; it promotes a very readable and eye-pleasing coding style. Every Python developer should read it at some point; here are the most important points, extracted for your programs (a short example follows the list):
• Use 4-space indentation, and no tabs.
o 4 spaces are a good compromise between small indentation (allows greater
nesting depth) and large indentation (easier to read). Tabs introduce confusion,
and are best left out.
• Wrap lines so that they don’t exceed 79 characters.
o This helps users with small displays and makes it possible to have several code
files side-by-side on larger displays.
• Use blank lines to separate functions and classes, and larger blocks of code inside
functions.
• When possible, put comments on a line of their own.
• Use docstrings.
• Use spaces around operators and after commas, but not directly inside bracketing
constructs: a = f(1, 2) + g(3, 4).
• Name your classes and functions consistently; the convention is to use CamelCase for
classes and lower_case_with_underscores for functions and methods. Always use self as
the name for the first method argument (see A First Look at Classes for more on classes
and methods).
• Don’t use fancy encodings if your code is meant to be used in international environments.
Python’s default, UTF-8, or even plain ASCII work best in any case.
• Likewise, don’t use non-ASCII characters in identifiers if there is only the slightest
chance people speaking a different language will read or maintain the code.
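A short illustrative class (not taken from the project code) that follows the conventions listed above:

class SalesReport:
    """Collect daily sales figures for one store and report their total."""

    def __init__(self, store_name):
        self.store_name = store_name
        self.daily_sales = []

    def add_day(self, amount):
        """Record the sales amount for one day."""
        self.daily_sales.append(amount)

    def total_sales(self):
        """Return the total of all recorded daily sales."""
        return sum(self.daily_sales)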
CHAPTER 6
PROJECT DESCRIPTION
6.1 PROBLEM DEFINITION

“To find out what role certain properties of an item play and how they affect their sales by understanding Big Mart sales.” In order to help Big Mart achieve this goal, a predictive model can be built to find out, for every store, the key factors that can increase sales and what changes could be made to the product or to the store's characteristics.

The basic premise of machine learning is to build models and employ algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data becomes available. If these models are applied in different areas and trained to match the expectations of management, then accurate steps can be taken to achieve the organization's target.

Hence, in the case of Big Mart, a one-stop shopping center, the aim is to predict the sales of different types of items and to understand the effects of different factors on the items' sales. By taking various aspects of a dataset collected for Big Mart and following the methodology for building a predictive model, results with high levels of accuracy are generated, and these observations can be employed to take decisions to improve sales.
6.2 OVERVIEW OF THE PROJECT

Demand forecasting is a key aspect of successfully managing restaurants, supermarkets and staff canteens. In particular, properly predicting the future sales of menu items allows for precise ordering of food items. This ensures a low level of pre-consumer food waste, which is critical to the profitability of the restaurant. Hence, this project is interested in predicting future values of the daily sold quantities of given menu items.

The time series show multiple strong seasonalities, trend changes, data gaps, and outliers. This project proposes a forecasting approach that is based solely on data retrieved from point-of-sale systems and allows for a straightforward human interpretation. It therefore proposes two generalized models for predicting future sales. In an extensive evaluation, data sets consisting of supermarket sales data are used.

The main motivation for this project is to present a sales prediction model for supermarket data. Further, this research work aims to identify the best classification algorithm for sales analysis.

In this work, the data mining classification algorithm Naïve Bayes is addressed and used to develop a prediction system in order to analyze and predict the sales volume. In addition, SVM and KNN classification are also used in the proposed system for better classification results. The project is designed using Python 3.7.
6.3 MODULE DESCRIPTION

The following modules are present in the project.


1. DATASET COLLECTION
2. NAÏVE BAYES CLASSIFICATION
3. SVM CLASSIFICATION
4. KNN CLASSIFICATION
5. CNN CLASSIFICATION

1. DATASET COLLECTION
In this module, the sales dataset from Kaggle, which contains attributes such as branch, gender, quantity, unit price and total along with rating, is taken. Records with null values are eliminated during preprocessing. In addition, the Big Mart Sales data is also taken.

2. NAÏVE BAYES CLASSIFICATION


The Naive Bayes classifier is based on Bayes' theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.
(1) P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
(2) P(c) is the prior probability of the class.
(3) P(x|c) is the likelihood, which is the probability of the predictor given the class.
(4) P(x) is the prior probability of the predictor.

In this module, the branch-wise and gender-wise rating similarity (conditional probability) is found for ratings both below and above 5.0. Moreover, the number of items sold per Outlet_Size is found. An item-wise comparison chart is also prepared for the sold quantity.
3. SVM CLASSIFICATION

SVM stands for Support Vector Machine. It is a machine learning approach used for classification and regression analysis. It relies on supervised learning models trained by learning algorithms, which analyze large amounts of data to identify patterns. An SVM separates the categories of data in a high-dimensional space, using almost all attributes, and generates flat, linear partitions in a single pass. It divides the two categories by a clear gap that should be as wide as possible; this partitioning is done by a plane called a hyperplane. An SVM creates hyperplanes that have the largest margin in a high-dimensional space to separate the given data into classes. The margin between the two classes represents the longest distance between the closest data points of those classes. The larger the margin, the lower the generalization error of the classifier. In this module, the records are classified into the two sales classes using SVM classification.
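A minimal sketch of this module, mirroring the appendix code (the classfactor column is assumed to already exist in the CSV, and min-max normalization is applied as in the appendix):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

data = pd.read_csv('BigMartSales.csv').fillna(0)
X = data[['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales']]
y = data['classfactor']                                   # assumed binary class column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
scaler = MinMaxScaler().fit(X_train)                      # normalize to the [0, 1] range
svc = SVC().fit(scaler.transform(X_train), y_train)
y_pred = svc.predict(scaler.transform(X_test))
print('SVM accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))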

4. KNN CLASSIFICATION
In this module, KNN classification is performed with the K value set to 6 and the type column (generated from the rating value as below or above 5.0) used as the binary classification factor. 75% of the data is given as training data and 25% as testing data. The record number and the predicted type of each testing record are found and displayed as the result.
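A minimal sketch of this module, again assuming scikit-learn and the same assumed classfactor column, with K = 6 and a 75%/25% split as stated above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv('BigMartSales.csv').fillna(0)
X = data[['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year', 'Item_Outlet_Sales']]
y = data['classfactor']                                   # assumed binary class column

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=6).fit(scaler.transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))
print('KNN accuracy:', accuracy_score(y_test, y_pred))
for record_number, predicted_type in zip(X_test.index, y_pred):
    print(record_number, ':', predicted_type)             # record number and predicted type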

5) CNN BASED MODEL PREDICTION


Here the dataset is taken first. The image data is stored in the form of pixel values, but data cannot be fed to the CNN model in this format, so it is converted into NumPy arrays. The categorical data also has to be converted into one-hot encodings. The data is then reshaped and cast to the float32 type so that it can be used conveniently. Preprocessing is finished by normalizing the data. Normalizing image data maps all the pixel values in each image to values between 0 and 1, which helps to reduce inconsistencies in the data. Before normalizing, the image data can have large variations in pixel values, which can lead to unusual behavior during the training process.

A Convolutional Neural Network is built for modeling the image data. CNNs are modified versions of regular neural networks, adapted specifically for image data. Feeding images to regular neural networks would require the network to have a large number of input neurons. The main purpose of convolutional neural networks is to take advantage of the spatial structure of the image, to extract high-level features from it, and then to train on those features. It does so by performing a convolution operation on the matrix of pixel values.

Dropout, Batch Normalization and Flatten layers are used in addition to the convolutional layers. The Flatten layer converts the output of the convolutional layers into a one-dimensional feature vector. It is important to flatten the outputs because Dense (fully connected) layers only accept a feature vector as input. The Dropout and Batch Normalization layers prevent the model from overfitting.

Once the model is created, it can be compiled using model.compile. The model is trained for just five epochs, but the number of epochs can be increased. After the training process is completed, predictions can be made on the test set. The accuracy value is displayed during the iterations. Multi-class image labelling is also possible here.
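The appendix does not include the CNN code; the following minimal sketch, assuming TensorFlow/Keras and synthetic image-like data standing in for the actual dataset, illustrates the preprocessing and the layers described above.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Synthetic stand-in data: 200 grayscale 28x28 images belonging to 3 classes.
x = np.random.randint(0, 256, size=(200, 28, 28, 1)).astype('float32') / 255.0  # normalize to [0, 1]
y = tf.keras.utils.to_categorical(np.random.randint(0, 3, size=200), 3)         # one-hot encodings

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                          # convert feature maps to a 1-D feature vector
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),                       # helps prevent overfitting
    layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x, y, epochs=5, validation_split=0.25)   # five epochs, as described above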
6.4 INPUT DESIGN

Input design is the process of converting user-originated inputs into a computer-understandable format. Input design is one of the most expensive phases of operating a computerized system and is often a major source of problems in a system. A large number of problems with a system can usually be traced back to faulty input design and methods. Every element of input design should be analyzed and designed with utmost care.

The system takes input from the users, processes it and produces an output. Input design is the link that ties the information system into the world of its users. The system should be user-friendly so that appropriate information reaches the user. The decisions made during input design are:

✓ To provide a cost-effective method of input.
✓ To achieve the highest possible level of accuracy.
✓ To ensure that the input is understood by the user.
The system analysis decides the following input design details: what data to input, what medium to use, how the data should be arranged or coded, which data items and transactions need validation to detect errors, and finally the dialogue to guide the user in providing input.

The input data of a system is not necessarily raw data captured in the system from scratch; it can also be the output of another system or subsystem. The design of input covers all the phases of input, from the creation of the initial data to the actual entry of the data into the system for processing. The design of inputs involves identifying the data needed, specifying the characteristics of each data item, capturing and preparing data for computer processing, and ensuring the correctness of the data. Any ambiguity in the input leads to a total fault in the output. The goal of designing the input data is to make data entry as easy and error-free as possible.

6.5 OUTPUT DESIGN

Output design generally refers to the results and information that are generated by the system. For many end-users, output is the main reason for developing the system and the basis on which they evaluate the usefulness of the application.

The output is designed in such a way that it is attractive, convenient and informative. As the outputs are the most important source of information for the users, a better design should improve the system's relationship with its users and will also help in decision-making. Form design elaborates the way output is presented and the layout available for capturing information.
6.6 SYSTEM FLOW DIAGRAM

[Figure: the data set (with sales attributes) is collected and passed on for classification using CNN (Convolutional Neural Network), SVM and K-Nearest Neighbor.]

Fig 6.6.1 System Flow Diagram


6.7 USE CASE DIAGRAM

[Figure: the Administrator interacts with the following use cases: take training data and test data set, display sample records, find conditional probability using Naïve Bayes, SVM/KNN classification with accuracy, CNN classification with accuracy, and display charts.]

Fig 6.7.1 Use Case Diagram


CHAPTER 7
SYSTEM TESTING

7.1 SYSTEM TESTING


After the source code has been completed and documented along with the related data structures, the project has to undergo testing and validation, where there is a subtle and definite attempt to find errors. The project developer tends to tread lightly, designing and executing tests that demonstrate that the program works rather than tests that uncover errors; unfortunately, errors will be present, and if the project developer does not find them, the user will. The project developer is always responsible for testing the individual units, i.e. the modules of the program. In many cases the developer also conducts integration testing, i.e. the testing step that leads to the construction of the complete program structure. This project has undergone the following testing procedures to ensure its correctness.
1. Unit testing
2. User Acceptance Testing
7.1.1 UNIT TESTING
In unit testing, the programs making up the system are tested. For this reason, unit testing is sometimes called program testing. The software units in a system are the modules and routines that are assembled and integrated to perform a specific function. Unit testing focuses first on the modules, independently of one another, to locate errors. This enables the detection of errors in the coding and logic contained within each module alone. The testing was carried out during the programming stage itself.

7.1.2 USER ACCEPTANCE TESTING


In this testing procedure, the project is given to the customer to test whether all the requirements have been fulfilled; only after the user is fully satisfied is the project considered ready. If the user requests any changes, or if any errors are found, all of them have to be taken into consideration and corrected to make the project complete.
CHAPTER 8
SYSTEM IMPLEMENTATION

When the initial design was done for the system, the client was consulted for acceptance of the design so that further system development could be carried out. After the development of the system, a demonstration of its working was given to the client. The aim of the demonstration was to identify any malfunction of the system.

Implementation is the process of converting a new or revised system design into an operational one. When the initial design was completed, a demonstration was given to the end user about the working system. This process is used to verify the working of the system and to identify any logical errors by feeding in various combinations of test data. After the approval of the system by both the end user and management, the system was implemented. System implementation is made up of many activities; the major ones are as follows.
Coding
Coding is the process whereby the physical design specifications created by the analysis team are turned into working computer code by the programming team.
Testing
Once coding has begun, testing can proceed in parallel, as each program module can be tested.
Installation
Installation is the process during which the current system is replaced by the new system. This includes the conversion of existing data, software, documentation and work procedures to those consistent with the new system.
Documentation
Documentation results from the installation process; user guides provide information on how to use the system and on its flow.
Training and support
A training plan is a strategy for training users so that they quickly learn the new system. The development of the training plan probably began earlier in the project. The best-suited application package to develop the system is Python under a Windows 8/10 environment.
CHAPTER 9

CONCLUSION AND SCOPE FOR FUTURE ENHANCEMENTS

The project implements a sales prediction model to predict supermarket data.

In addition, this research work also aims at identifying the best classification algorithm for sales analysis. In this work, algorithms like Naïve Bayes and SVM/KNN classification are used to develop a prediction method to analyze and predict the sales volume. In addition, CNN classification is also used in the proposed system to yield better classification results. The project is designed using Python 3.11.
CHAPTER 10
BIBLIOGRAPHY

10.1 BOOKS

1. CyberPunk Architects, “PYTHON: THE BLUEPRINT TO PYTHON PROGRAMMING:


A Beginners Guide: Everything You Need to Know to Get Started (CyberPunk Blueprint
Series)” Kindle Edition.
2. David Mertz, “Functional Programming in Python”, O’Reilly, May 2015, First Edition.

10.2 WEB REFERENCES


http://www.python.org
http://www.w3schools.com
http://www.tutorialspoint.com

10.3 JOURNAL REFERENCES

[1] Smola, A., & Vishwanathan, S. V. N. (2008). Introduction to Machine Learning. Cambridge University Press, Cambridge, UK.

[2] Saltz, J. S., & Stanton, J. M. (2017). An Introduction to Data Science. Sage Publications.

[3] Daumé III, H. (2012). A Course in Machine Learning. ciml.info.

[4] Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.

[5] Neal, R. M., and Hinton, G. E. (1998) A new view of the EM algorithm that justifies incremental, sparse, and
other variants. In Learning in Graphical Models, ed. by M. I. Jordan, NATO Science Series, pp. 355–368. Kluwer.

[6] Nielsen, M., and Chuang, I. (2000) Quantum Computation and Quantum Information. Cambridge Univ. Press.

[7] Offer, E., and Soljanin, E. (2000) An algebraic description of iterative decoding schemes. In Codes, Systems and
Graphical Models, ed. by B. Marcus and J. Rosenthal, volume 123 of IMA Volumes in Mathematics and its
Applications, pp. 283–298. Springer.

[8] Offer, E., and Soljanin, E. (2001) LDPC codes: a group algebra formulation. In Proc. Internat. Workshop on
Coding and Cryptography WCC 2001, 8-12 Jan. 2001, Paris
APPENDIX
A. SAMPLE SCREENS

SUPERMARKET SAMPLE RECORDS


DATASET COLUMNS

DATA SET WITH ITEM OUTLET SALES VALUES IN RANGE


DATA SET WITH ITEM FAT CONTENT VALUES
DATA SET WITH ITEM OUTLET SALES VALUES
DATA SET WITH ITEM OUTLET SALES GROUPED

DATA SET WITH ITEM OUTLET SALES GROUPED


NBC WITH ACCURACY

SVM WITH ACCURACY


KNN WITH ACCURACY
B. SAMPLE CODE
SAMPLE RECORDS
#Python 3.7 32 bit
#python -m pip install numpy
#python -m pip install pandas
import pandas as pd
#from sklearn import datasets,metrics
#from sklearn.model_selection import train_test_split
#from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv('supermarket.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
print()
print()
print()
print('Super Market Data Set Sample Records')
print('----------------------------------')
print(X.head())
print('---------------------------------')
dataset = pd.read_csv('BigMartSales.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
print()
print()
print()
print('Big Mart Sales Data Set Sample Records')
print('----------------------------------')
print(X.head())
print('----------------------------------')
#Python 3.7 32 bit
import pandas as pd
#from sklearn import datasets,metrics
#from sklearn.model_selection import train_test_split
#from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv('supermarket.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
print()
print()
print()
print('Data Set Columns')
print('----------------------------------')
lst=X.columns.values.tolist()
for i in range(0, len(lst)):
    print(lst[i].ljust(30), end='\t')
print('----------------------------------')
dataset = pd.read_csv('BigMartSales.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
print()
print()
print()
print('Data Set Columns')
print('----------------------------------')
lst=X.columns.values.tolist()
for i in range(0, len(lst)):
    print(lst[i].ljust(30), end='\t')
print('----------------------------------')
#Python 3.7 32 bit
import pandas as pd
# # Loading the Big Mart Sales Dataset
# In[ ]:
dataset = pd.read_csv('BigMartSales.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
print()
print()
print()
print('Data Set with Item Outlet Sales Value <=1000')
print('----------------------------------')
con1 = X.loc[X['Item_Outlet_Sales'] <=1000]
print('No. of Records: ',end='');
print(len(con1))
print('----------------')
print(con1)
print('----------------------------------')
print('Data Set with Item Outlet Sales Value >1000')
print('----------------------------------')
con1 = X.loc[X['Item_Outlet_Sales'] >1000]
print('No. of Records: ',end='');
print(len(con1))
print('----------------')
print(con1)
#Python 3.7 32 bit
import pandas as pd
#from sklearn import datasets,metrics
#from sklearn.model_selection import train_test_split
#from sklearn.preprocessing import StandardScaler
# # Loading the Big Mart Sales Dataset
# In[ ]:
dataset = pd.read_csv('BigMartSales.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
print()
print()
print()
print('Data Set with Item Fat Content Value ==Low Fat')
print('----------------------------------')
con1 = X.loc[X['Item_Fat_Content'] =='Low Fat']
print('No. of Records: ',end='');
print(len(con1))
print('----------------')
print(con1)
print('----------------------------------')
print('Data Set with Item Fat Content Value ==Regular')
print('----------------------------------')
con1 = X.loc[X['Item_Fat_Content'] =='Regular']
print('No. of Records: ',end='');
print(len(con1))
print('----------------')
print(con1)
#Python 3.7 32 bit
import pandas as pd
# # Loading the Big Mart Sales Dataset
# In[ ]:
dataset = pd.read_csv('BigMartSales.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
print()
print()
print()
print('Data Set with Outlet Size Value ==High')
print('----------------------------------')
con1 = X.loc[X['Outlet_Size'] =='High']
print('No. of Records: ',end='');
print(len(con1))
print('----------------')
print(con1)
print('----------------------------------')
print('Data Set with Outlet Size Value ==Medium')
print('----------------------------------')
con1 = X.loc[X['Outlet_Size'] =='Medium']
print('No. of Records: ',end='');
print(len(con1))
print('----------------')
print(con1)
print('----------------------------------')
print('Data Set with Outlet Size Value ==Small')
print('----------------------------------')
con1 = X.loc[X['Outlet_Size'] =='Small']
print('No. of Records: ',end='');
print(len(con1))
print('----------------')
print(con1)
#Python 3.7 32 bit
import pandas as pd
dataset = pd.read_csv('BigMartSales.csv')
# In[ ]:
X = pd.DataFrame(data = dataset)
# Sum of Item_Outlet_Sales per Outlet_Type
df1 = X.groupby(["Outlet_Type"])[['Item_Outlet_Sales']].sum()
pd.options.display.float_format = '{:.0f}'.format
#print(df1)
print()
print()
#print(df1.columns)
print()
print('Output Data [Sum of Sales]')
print('----------------------------------')
#print(type(df1))
print(df1['Item_Outlet_Sales'])
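CHART OF OUTLET SALES (illustrative sketch, not part of the original listing)
The chart plotting code is not included in the listing above; a minimal sketch, assuming matplotlib is installed and reusing the BigMartSales.csv file, could look like the following.
# Plot the total Item_Outlet_Sales per Outlet_Type as a bar chart
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('BigMartSales.csv')
sales_by_outlet = data.groupby('Outlet_Type')['Item_Outlet_Sales'].sum()
sales_by_outlet.plot(kind='bar', title='Total Item Outlet Sales per Outlet Type')
plt.ylabel('Total sales')
plt.tight_layout()
plt.show()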
NAIVE BAYES CLASSIFICATION
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('BigMartSales.csv')
#dataset.fillna(0)
dataset = dataset.replace(np.nan, 0)
X = dataset.iloc[:, [1, 3,12]].values
y = dataset.iloc[:, -1].values
indices =range(8523)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, tr, te = train_test_split(
    X, y, indices, test_size=0.20, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#print(y_test)
# Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
ac = accuracy_score(y_test,y_pred)
cm = confusion_matrix(y_test, y_pred)
leng=len(te)
print("------");
for i in range(0, leng):
    print(te[i], ":", y_pred[i])
print('Accuracy')
print (ac)
print('Confusion Matrix')
print(cm)
SVM CLASSIFICATION
#!/usr/bin/env python
#python 3.7 32 bit
# coding: utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#get_ipython().run_line_magic('matplotlib', 'inline')
#Import Cancer data from the Sklearn library
# Dataset can also be found here:
# http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29
#from sklearn.datasets import load_breast_cancer
#cancer = load_breast_cancer()
#dataset = pd.read_csv('BigMartSales1000Records.csv')
dataset = pd.read_csv('BigMartSales.csv')
df = pd.DataFrame(data = dataset)
#X = X.filter(['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year',
#              'Item_Outlet_Sales', 'classfactor'])
# In[2]:
#X
print(df.columns.values)
df=df.iloc[:, [1,3,5,7,11,12]] #11
# As we can see above, not much can be done in the current form of the dataset.
# We need to view the data in a better format.
# # Let's view the data in a dataframe.
# In[3]:
#df_Sales = pd.DataFrame(np.c_[X['Item_Weight'], X['Item_Visibility'],
#                              X['Outlet_Establishment_Year'], X['Item_Outlet_Sales'],
#                              X['classfactor']],
#                        columns=['Item_Weight', 'Item_Visibility',
#                                 'Outlet_Establishment_Year', 'Item_Outlet_Sales',
#                                 'classfactor'])
df_Sales = df
X=df_Sales
print(X.head())
print(X.columns.values)
# # Let's Explore Our Dataset
# In[4]:
print(X.shape)
# The shape above shows how many rows (instances) and columns (features) we have.
# In[5]:
print(X.columns)
# Above is the name of each columns in our dataframe.
# # The next step is to Visualize our data
# In[6]:
# Let's plot out just the first few variables (features)
#sns.pairplot(df_Sales, vars=['Item_Weight', 'Item_Visibility',
#                             'Outlet_Establishment_Year', 'Item_Outlet_Sales'])
# The above plots would show the relationship between our features, but they do not show
# which of the points belongs to which sales class.
# This issue can be addressed by using the "classfactor" variable as the "hue" for the plots.
# In[7]:
#sns.pairplot(df_Sales, hue='classfactor', vars=['Item_Weight', 'Item_Fat_Content',
#                                                'Outlet_Establishment_Year', 'Item_Outlet_Sales'])
# Note: the two values of 'classfactor' denote the two sales classes (Low/High).
# # How many records of each class do we have in our dataset?
# In[8]:
print(X['classfactor'])
print(X['classfactor'].value_counts())
# The value counts above show how many records fall into each class.
# Let's visualize our counts
# In[9]:
sns.countplot(X['classfactor'], label = "Count")
# # Let's check the correlation between our features
# In[10]:
plt.figure(figsize=(20,12))
sns.heatmap(df_Sales.corr(), annot=True)
#X = X.drop(['classfactor'], axis=1)  # We drop our "target" feature and use all the
#                                     # remaining features in our dataframe to train the model.
#print(X.head())
# In[12]:
y = X['classfactor']
X = X.drop(['classfactor'], axis = 1)
print(y.head())
from sklearn.model_selection import train_test_split
# Let's split our data using 80% for training and the remaining 20% for testing.
# In[14]:
indices =range(len(X))
X_train, X_test, y_train, y_test, tr, te = train_test_split(
    X, y, indices, test_size=0.2, random_state=20)
# Let now check the size our training and testing data.
# In[15]:
print ('The size of our training "X" (input features) is', X_train.shape)
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)
print ('\n')
print ('The size of our training "y" (output feature) is', y_train.shape)
print ('\n')
print ('The size of our testing "y" (output features) is', y_test.shape)
# # Import Support Vector Machine (SVM) Model
# In[16]:
from sklearn.svm import SVC
# In[17]:
svc_model = SVC()
# # Now, let's train our SVM model with our "training" dataset.
# In[18]:
svc_model.fit(X_train, y_train)
# # Let's use our trained model to make a prediction using our testing data
# In[19]:
y_predict = svc_model.predict(X_test)
print(y_predict)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# In[21]:
cm = np.array(confusion_matrix(y_test, y_predict, labels=[1,0]))
confusion = pd.DataFrame(cm, index=['is_Low', 'is_High'],
columns=['predicted_Low','predicted_High'])
print(confusion)
#sns.heatmap(confusion, annot=True)
#print(classification_report(y_test, y_predict))
X_train_min = X_train.min()
print(X_train_min)
# In[25]:
X_train_max = X_train.max()
print(X_train_max)
# In[26]:
X_train_range = (X_train_max- X_train_min)
print(X_train_range)
# In[27]:
X_train_scaled = (X_train - X_train_min)/(X_train_range)
print('X_train_scaled:')
print(X_train_scaled.head())
# # Normalize Training Data
# In[28]:
X_test_min = X_test.min()
X_test_range = (X_test - X_test_min).max()
X_test_scaled = (X_test - X_test_min)/X_test_range
print('X_test_scaled:')
print(X_test_scaled.head())
# In[29]:
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)
# In[30]:
y_predict = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict)
print('Confusion Matrix:')
print(cm)
# # SVM with Normalized data
# In[31]:
cm = np.array(confusion_matrix(y_test, y_predict, labels=[1,0]))
confusion = pd.DataFrame(cm, index=['is_Low', 'is_High'],
columns=['predicted_Low','predicted_High'])
ac = accuracy_score(y_test,y_predict)
leng=len(te)
print("------");
for i in range(0, leng):
    print(te[i], ":", y_predict[i])
print('Accuracy:')
print(ac)
print('Confusion Matrix')
print(confusion_matrix(y_test, y_predict))
exit()
#End of module SVM
KNN CLASSIFICATION
#!/usr/bin/env python
#python 3.7 32 bit
# coding: utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#get_ipython().run_line_magic('matplotlib', 'inline')
#Import Cancer data from the Sklearn library
# Dataset can also be found here:
# http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29
#from sklearn.datasets import load_breast_cancer
#cancer = load_breast_cancer()
#dataset = pd.read_csv('BigMartSales1000Records.csv')
dataset = pd.read_csv('BigMartSales.csv')
df = pd.DataFrame(data = dataset)
#X = X.filter(['Item_Weight', 'Item_Visibility', 'Outlet_Establishment_Year',
#              'Item_Outlet_Sales', 'classfactor'])
# In[2]:
#X
print(df.columns.values)
df=df.iloc[:, [1,3,5,7,11,12]]
# As we can see above, not much can be done in the current form of the dataset.
# We need to view the data in a better format.
# # Let's view the data in a dataframe.
# In[3]:
#df_Sales = pd.DataFrame(np.c_[X['Item_Weight'], X['Item_Visibility'],
#                              X['Outlet_Establishment_Year'], X['Item_Outlet_Sales'],
#                              X['classfactor']],
#                        columns=['Item_Weight', 'Item_Visibility',
#                                 'Outlet_Establishment_Year', 'Item_Outlet_Sales',
#                                 'classfactor'])
df_Sales = df
X=df_Sales
print(X.head())
print(X.columns.values)
# # Let's Explore Our Dataset
# In[4]:
print(X.shape)
# The shape above shows how many rows (instances) and columns (features) we have.
# In[5]:
print(X.columns)
# Above is the name of each columns in our dataframe.
# # The next step is to Visualize our data
# In[6]:
# Let's plot out just the first few variables (features)
#sns.pairplot(df_Sales, vars=['Item_Weight', 'Item_Visibility',
#                             'Outlet_Establishment_Year', 'Item_Outlet_Sales'])
# The above plots would show the relationship between our features, but they do not show
# which of the points belongs to which sales class.
# This issue can be addressed by using the "classfactor" variable as the "hue" for the plots.
# In[7]:
#sns.pairplot(df_Sales, hue='classfactor', vars=['Item_Weight', 'Item_Fat_Content',
#                                                'Outlet_Establishment_Year', 'Item_Outlet_Sales'])
# Note: the two values of 'classfactor' denote the two sales classes (Low/High).
# # How many records of each class do we have in our dataset?
# In[8]:
print(X['classfactor'])
print(X['classfactor'].value_counts())
# The value counts above show how many records fall into each class.
# Let's visualize our counts
# In[9]:
sns.countplot(X['classfactor'], label = "Count")
# # Let's check the correlation between our features
# In[10]:
plt.figure(figsize=(20,12))
sns.heatmap(df_Sales.corr(), annot=True)
#X = X.drop(['classfactor'], axis=1)  # We drop our "target" feature and use all the
#                                     # remaining features in our dataframe to train the model.
#print(X.head())
# In[12]:
y = X['classfactor']
X = X.drop(['classfactor'], axis = 1)
print(y.head())
from sklearn.model_selection import train_test_split
# Let's split our data using 80% for training and the remaining 20% for testing.
# In[14]:
indices =range(len(X))
X_train, X_test, y_train, y_test, tr, te = train_test_split(
    X, y, indices, test_size=0.1, random_state=20)
# Let now check the size our training and testing data.
# In[15]:
print ('The size of our training "X" (input features) is', X_train.shape)
print ('\n')
print ('The size of our testing "X" (input features) is', X_test.shape)
print ('\n')
print ('The size of our training "y" (output feature) is', y_train.shape)
print ('\n')
print ('The size of our testing "y" (output features) is', y_test.shape)
# # Import the K-Nearest Neighbors (KNN) Model
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
#iris = datasets.load_iris()
#X, y = iris.data[:, :], iris.target
indices =range(len(X))
Xtrain, Xtest, y_train, y_test, tr, te = train_test_split(
    X, y, indices, stratify=y, random_state=0, train_size=0.7)
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)
# In[16]:
leng=len(te)
print("------");
for i in range(0, leng):
    print(te[i], ":", y_pred[i])
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('Confusion Matrix [KNN]')
print('----------------------')
print(confusion_matrix(y_test, y_pred))
exit()
