Report of Mini Project
ABSTRACT
The time series shows multiple strong seasonalities, trend changes, data gaps, and outliers. This project proposes a forecasting approach that is based solely on data retrieved from point-of-sale systems and allows for straightforward human interpretation. It therefore proposes two generalized models for predicting future sales. In an extensive evaluation, datasets consisting of supermarket sales data are used.
The main motivation of this project is to present a sales prediction model for supermarket data. Further, this work aims to identify the best classification algorithm for sales analysis.
In this work, data mining classification algorithms such as SVM, KNN and Naïve Bayes are applied to develop a prediction system that analyzes and predicts the sales volume. In addition, CNN classification is used in the proposed system for better classification results. The project is designed using Python 3.11.
CONTENTS
CHAPTER NO. TITLE
ABSTRACT
1 INTRODUCTION
1.1 Objectives
1.2 About the project
2 LITERATURE SURVEY
2.1 Related Work
3 SYSTEM ANALYSIS
3.1 Existing System
3.2 Drawbacks of Existing System
3.3 Proposed System
3.4 Advantages of Proposed System
3.5 Feasibility Study
3.5.1 Economical Feasibility
3.5.2 Operational Feasibility
3.5.3 Technical Feasibility
4 SYSTEM SPECIFICATION
4.1 Hardware Requirements
4.2 Software Requirements
5 SOFTWARE DESCRIPTION
5.1 Front End
5.2 Back End
6 PROJECT DESCRIPTION
6.1 Problem Definition
6.2 Overview of the project
6.3 Module Description
6.4 Input Design
6.5 Output Design
6.6 System Flow Diagram
6.7 Use Case Diagram
7 SYSTEM TESTING
8 SYSTEM IMPLEMENTATION
9 CONCLUSION AND FUTURE ENHANCEMENT
10 BIBLIOGRAPHY
10.1 Book References
10.2 Web References
10.3 Journal References
APPENDIX
[1] Screen shots
[2] Source Code
CHAPTER 1
INTRODUCTION
1.1 OBJECTIVES
▪ To apply Naïve Bayes classification to find the conditional probability of outlet sales.
▪ To apply SVM/KNN, which generally handle a larger number of instances and generalize well to novel data.
▪ To apply SVM/KNN, which are preferred when the dataset grows larger.
▪ To apply SVM/KNN, which are preferred when the data contain more outliers.
▪ To plot various charts for outlet sales.
In today’s modern world, huge shopping centers such as big malls and marts record data related to the sales of items or products, together with their various dependent or independent factors, as an important step towards predicting future demand and managing inventory.
The dataset, built from various dependent and independent variables, is a composite of item attributes, data gathered from customers, and data related to inventory management in a data warehouse. The data is then refined in order to obtain accurate predictions and to uncover new and interesting results that shed new light on our knowledge of the task’s data.
SVM and K-Nearest Neighbors (KNN) learning algorithms are used; these are instance-based and learn from the knowledge gained through those instances [4]. Beyond data stream mining scenarios, and hierarchical multi-label classification problems where every sample can simultaneously belong to multiple classes, k-NN has also been proposed for predicting outputs in structured form.
CHAPTER 2
LITERATURE SURVEY
AUTHORS
ALEX SMOLA
S.V.N. VISHWANATHAN
In this paper [1], the authors stated that over the past two decades Machine Learning has become one of the mainstays of information technology and, with that, a rather central, albeit usually hidden, part of our life. With the ever-increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress. The purpose of this chapter is to provide the reader with an overview of the vast range of applications which have at their heart a machine learning problem and to bring some degree of order to the zoo of problems. After that, they
discussed some basic tools from statistics and probability theory, since they form the language in
which many machine learning problems must be phrased to become amenable to solving.
Finally, they outlined a set of fairly basic yet effective algorithms to solve an important problem,
namely that of classification. More sophisticated tools, a discussion of more general problems
and a detailed analysis will follow in later parts of the book.
In this paper [2], the authors answered the question: What Is Data Science? For some, the
term data science evokes images of statisticians in white lab coats staring fixedly at blinking
computer screens filled with scrolling numbers. Nothing could be farther from the truth. First,
statisticians do not wear lab coats: this fashion statement is reserved for biologists, physicians,
and others who have to keep their clothes clean in environments filled with unusual fluids.
Second, much of the data in the world is non-numeric and unstructured.
In this context, unstructured means that the data are not arranged in neat rows and
columns. Think of a web page full of photographs and short messages among friends: very few
numbers to work with there. While it is certainly true that companies, schools, and governments
use plenty of numeric information—sales of products, grade point averages, and tax assessments
are a few examples—there is lots of other information in the world that mathematicians and
statisticians look at and cringe. So, while it is always useful to have great math skills, there is
much to be accomplished in the world of data science for those of us who are presently more
comfortable working with words, lists, photographs, sounds, and other kinds of information.
In addition, data science is much more than simply analyzing data. There are many
people who enjoy analyzing data and who could happily spend all day looking at histograms and
averages, but for those who prefer other activities, data science offers a range of roles and
requires a range of skills. Let’s consider this idea by thinking about some of the data involved in
buying a box of cereal. Whatever your cereal preferences—fruity, chocolaty, fibrous, or nutty—
you prepare for the purchase by writing “cereal” on your grocery list. Already your planned
purchase is a piece of data, also called a datum, albeit a pencil scribble on the back of an
envelope that only you can read. When you get to the grocery store, you use your datum as a
reminder to grab that jumbo box of FruityChocoBoms off the shelf and put it in your cart. At the
checkout line, the cashier scans the barcode on your box, and the cash register logs the price.
Back in the warehouse, a computer tells the stock manager that it is time to request
another order from the distributor, because your purchase was one of the last boxes in the store.
You also have a coupon for your big box, and the cashier scans that, giving you a predetermined
discount. At the end of the week, a report of all the scanned manufacturer coupons gets uploaded
to the cereal company so they can issue a reimbursement to the grocery store for all of the
coupon discounts they have handed out to customers. Finally, at the end of the month a store manager looks at a colorful collection of pie charts showing all the different kinds of cereal that were sold and, on the basis of strong sales, decides what to stock in the coming month.
AUTHOR
CAIO CORRO
In this paper [3], the author stated that the goal of the paper is to give tools to understand how to train machine learning models. This problem is an optimization problem, and therefore it is treated as such. The author only considers linear models, switching between different tasks: regression, binary classification, multiclass classification and structured prediction. Many important topics are not covered, e.g. proofs of convergence rates of different optimization techniques. Some definitions, e.g. lower semi-continuity, and some proofs are also skipped. The goal is that the reader gets the main idea behind the different concepts, so definitions and theorems may be "handwavy" so as not to burden the material. The main reason is that time is limited and the author prefers to focus on things that are simpler to understand and that the reader can code. However, the author hopes that overall the course will provide a strong background for reading the literature and exploring optimization techniques for machine learning.
Linear models
The author considers the following scenario: we are given feature values and we want to predict an output. The feature values will be represented as a vector x ∈ R^d where d is the number of features. In this course, feature values will always be real numbers. However, it may be that a given feature takes only the values 0 and 1 to indicate the presence or absence of the feature. The output will be either a scalar y ∈ R (regression, binary classification) or a vector y ∈ R^k (multiclass classification and structured prediction). Again, we write that the output is in R but, for example, for binary classification it will actually be a strict subset of R: y ∈ {−1, 1}. The general framework we will study is machine learning models that compute the output in two steps:
• a parameterized scoring function s(x) that maps from the input space to the score space (also called weights or logits). In this course, we will only focus on functions that apply a linear transformation to the input, e.g. s(x) = Ax + b where A and b are the parameters of the scoring function.
• a prediction function q(w) that maps a value w from the score space to the output space. Note that, as usual, we will often ignore the bias term in the linear transformation and assume there is a special feature that is always set to one.
The author considers both the hard decision and the likelihood estimation cases. When the prediction function outputs a hard decision, it doesn’t give any information about its own uncertainty. This setting can be understood as a probability distribution over the output space where all the mass is concentrated in a single point. Hence, we won’t really make a difference between the two cases and we will actually show that they can be understood under the same framework. The main problem addressed in this course is the training problem, i.e. how we fix the parameters of the scoring function. We will focus on the supervised training scenario where we assume we have access to a labeled training set {⟨x(i), y(i)⟩, i = 1, …, n} of n pairs ⟨x(i), y(i)⟩.
The idea is that we want to select the element of a restricted set of functions that:
• classifies correctly all (or most of) the datapoints in the training data,
• or that maximizes the probability of the training data (probabilistic setting).
It is often important to also regularize the training objective to avoid overfitting on the training data. In this case, the set of functions will be a set of parameterized linear models of a given dimension: F(d, k) = {s(·; A) | A ∈ R^(k×d)} where d is the input dimension (the number of features including the bias), k is the output dimension (1 for regression and binary classification, larger for other problems) and s(x; A) = Ax. The training problem, that is selecting the best function in F(d, k), is an optimization problem so we will treat it as such.
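As a concrete illustration of this framework, the following is a minimal sketch (not taken from the cited course) of training a linear binary classifier s(x; w) = w·x with the logistic loss and an L2 regularizer by gradient descent; the synthetic data, loss and hyperparameters are assumptions for illustration only.

import numpy as np

# Synthetic binary classification data: n samples, d features (last feature is the bias, set to 1).
rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
true_w = np.array([1.5, -2.0, 0.3])
y = np.where(X @ true_w + 0.1 * rng.normal(size=n) > 0, 1.0, -1.0)  # labels in {-1, +1}

# Parameters of the scoring function s(x; w) = w.x
w = np.zeros(d)
lam = 0.01   # L2 regularization strength
lr = 0.1     # gradient descent step size
for epoch in range(200):
    scores = X @ w                                        # score space
    # Gradient of (1/n) * sum log(1 + exp(-y * s(x))) + (lam/2) * ||w||^2
    grad = -(X * (y / (1.0 + np.exp(y * scores)))[:, None]).mean(axis=0) + lam * w
    w -= lr * grad

# Prediction function: hard decision q(s) = sign(s)
pred = np.sign(X @ w)
print("training accuracy:", (pred == y).mean())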
CHAPTER 3
SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
In the existing system, the supermarket dataset, which contains attributes such as branch, gender, quantity, unit price, total and rating, is taken, and classification algorithms are applied for classification/prediction purposes. The Naïve Bayes algorithm is used for finding conditional probabilities. In addition, SVM and KNN classification is also carried out. 75% of the whole dataset is taken as training data and the model is built from it; the remaining 25% is then taken as test data and checked against the predicted model.
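A minimal sketch of this 75%/25% evaluation protocol using scikit-learn is given below; the file name, column names and the rating-based target are assumptions and should be adapted to the actual dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Assumed file name and columns; adapt to the actual supermarket dataset.
df = pd.read_csv("supermarket_sales.csv").dropna()
df["Rating_class"] = (df["Rating"] >= 5.0).astype(int)   # target: below / above 5.0
for col in ["Branch", "Gender"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df[["Branch", "Gender", "Quantity", "Unit price", "Total"]]
y = df["Rating_class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

for name, model in [("Naive Bayes", GaussianNB()),
                    ("SVM", SVC()),
                    ("KNN", KNeighborsClassifier(n_neighbors=6))]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))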
3.2 DRAWBACKS OF EXISTING SYSTEM
▪ Naïve Bayes classification yields conditional probability values only for the existing dataset; new test data has to be added before it can be classified.
▪ CNN is not applied, so outlier data could not be predicted well.
▪ Naïve Bayes classification is not preferred when the data contain many outliers.
3.3 PROPOSED SYSTEM
All the approaches of the existing system are carried over into the proposed system. In addition, SVM and K-nearest neighbor based classification are used to build the model, as they help in various ways, along with random forest. They are found to be suitable especially when the dataset has a larger number of records and contains outlier data. A wide variety of sales records can be taken from all branches for classification, predicting a new model and at the same time increasing the efficiency.
3.4 ADVANTAGES OF PROPOSED SYSTEM
▪ CNN generally handles a larger number of instances, so its randomization concept works well and it generalizes to novel data.
▪ CNN is preferred when the dataset grows larger.
▪ CNN is preferred when the data contain more outliers.
3.5 FEASIBILITY STUDY
The feasibility study deals with all the analysis that takes place in developing the project. Each structure has to be thought through while developing the project, as it has to serve the end user in a user-friendly manner. One must know the type of information to be gathered, and the system analysis consists of collecting, organizing and evaluating facts about the system and its environment.
The main objective of the system analysis is to study the existing operation and to learn and accomplish the processing activities. The classification of records with uncertain data through the Python application needs to be analyzed well. The details are processed through the code itself and are controlled by the programs alone.
3.5.1 ECONOMICAL FEASIBILITY
The organization has to buy a personal computer with a keyboard and a mouse; this is a direct cost. There are many direct benefits in converting the manual system to a computerized system. The user can be given responses on asking questions, and the justification of any capital outlay is that it will reduce expenditure or improve the quality of service or goods, which in turn may be expected to provide increased profits.
3.5.2 OPERATIONAL FEASIBILITY
The proposed system’s processes solve the problems that occurred in the existing system. The current day-to-day operations of the organization can be fitted into this system. Mainly, operational feasibility should include an analysis of how the proposed system will affect the organizational structures and procedures.
From the cost and benefit analysis it may be concluded that a computerized system is favorable in today’s fast-moving world.
3.5.3 TECHNICAL FEASIBILITY
The assessment of technical feasibility must be based on an outline design of the system requirements in terms of input, output, files, programs and procedures. The project aims to classify and predict supermarket sales from the given dataset, and the system is tested for whether it is technically feasible to improve the effectiveness of classification using the classification routines. The current system aims to overcome the problems of the existing system and to reduce the technical skill requirements.
CHAPTER 4
SYSTEM SPECIFICATION
You could write a Unix shell script or Windows batch files for some of these tasks, but shell scripts are best at moving around files and changing text data, and are not well-suited for GUI applications or games. You could write a C/C++/Java program, but it can take a lot of development time to get even a first-draft program. Python is simpler to use, available on Windows, Mac OS X, and Unix operating systems, and will help get the job done more quickly.
Python is simple to use, but it is a real programming language, offering much more
structure and support for large programs than shell scripts or batch files can offer. On the other
hand, Python also offers much more error checking than C, and, being a very-high-level
language, it has high-level data types built in, such as flexible arrays and dictionaries. Because of
its more general data types Python is applicable to a much larger problem domain than Awk or
even Perl, yet many things are at least as easy in Python as in those languages.
Python allows you to split your program into modules that can be reused in other Python programs. It comes with a large collection of standard modules that you can use as the basis of your programs — or as examples to start learning to program in Python. Some of these modules provide things like file I/O, system calls, sockets, and even interfaces to graphical user interface toolkits like Tk.
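As a small illustration of these built-in data types and standard modules (the branch names and figures are made up), a dictionary of flexible lists can be combined with a standard-library module in a few lines:

import statistics  # one of the many standard modules shipped with Python

# A dict mapping branch names to lists of sale totals.
sales = {"Yangon": [120.5, 98.0, 143.2], "Mandalay": [88.4, 110.9]}
for branch, totals in sales.items():
    print(branch, "mean total:", statistics.mean(totals))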
Python is an interpreted language, which can save considerable time during program
development because no compilation and linking is necessary. The interpreter can be used
interactively, which makes it easy to experiment with features of the language, to write throw-
away programs, or to test functions during bottom-up program development. It is also a handy
desk calculator.
5.1.1. Invoking the Interpreter
Since the choice of the directory where the interpreter lives is an installation option, other places are possible; check with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternative location.) On Windows machines, the Python installation is usually placed in C:\Python33, though you can change this when you are running the installer. To add this directory to your path, you can type the following command into the command prompt in a DOS box:

set path=%path%;C:\python33
The interpreter operates somewhat like the Unix shell: when called with standard input
connected to a tty device, it reads and executes commands interactively; when called with a file
name argument or with a file as standard input, it reads and executes a script from that file.
A second way of starting the interpreter is python -c command [arg] ..., which executes the
statement(s) in command, analogous to the shell’s -c option. Since Python statements often
contain spaces or other characters that are special to the shell, it is usually advised to quote
command in its entirety with single quotes. Some Python modules are also useful as scripts.
5.1.2 Error Handling
When an error occurs, the interpreter prints an error message and a stack trace. In
interactive mode, it then returns to the primary prompt; when input came from a file, it exits with
a nonzero exit status after printing the stack trace. (Exceptions handled by an except clause in a
try statement are not errors in this context.) Some errors are unconditionally fatal and cause an exit with a nonzero exit status; this applies to internal inconsistencies and some cases of running out of memory. All error messages are written to the standard error stream; normal output from executed commands is written to standard output. Typing the interrupt character (usually Control-C or DEL) at the primary or secondary prompt cancels the input and returns to the primary prompt. Typing an interrupt while a command is executing raises the KeyboardInterrupt exception, which may be handled by a try statement.
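For example, a short sketch of catching an interrupt raised while a command is executing:

# Press Control-C while this loop runs to raise KeyboardInterrupt.
try:
    while True:
        pass
except KeyboardInterrupt:
    print("Interrupted; returning control to the program.")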
5.1.3. Informal Introduction to Python
In the following examples, input and output are distinguished by the presence or absence
of prompts (>>> and ...): to repeat the example, you must type everything after the prompt,
when the prompt appears; lines that do not begin with a prompt are output from the interpreter.
Note that a secondary prompt on a line by itself in an example means you must type a blank line;
this is used to end a multi-line command.
Many of the examples in this manual, even those entered at the interactive prompt,
include comments. Comments in Python start with the hash character, #, and extend to the end of
the physical line. A comment may appear at the start of a line or following whitespace or code,
but not within a string literal. A hash character within a string literal is just a hash character.
Since comments are to clarify code and are not interpreted by Python, they may be omitted when
typing in examples.
The first line should always be a short, concise summary of the object’s purpose. For
brevity, it should not explicitly state the object’s name or type, since these are available by other
means (except if the name happens to be a verb describing a function’s operation). This line
should begin with a capital letter and end with a period.
If there are more lines in the documentation string, the second line should be blank,
visually separating the summary from the rest of the description. The following lines should be
one or more paragraphs describing the object’s calling conventions, its side effects, etc.
The Python parser does not strip indentation from multi-line string literals in Python, so
tools that process documentation have to strip indentation if desired. This is done using the
following convention. The first non-blank line after the first line of the string determines the
amount of indentation for the entire documentation string. (We can’t use the first line since it is
generally adjacent to the string’s opening quotes so its indentation is not apparent in the string
literal.) Whitespace “equivalent” to this indentation is then stripped from the start of all lines of
the string. Lines that are indented less should not occur, but if they occur all their leading
whitespace should be stripped. Equivalence of whitespace should be tested after expansion of
tabs (to 8 spaces, normally).
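A short example following these documentation string conventions (the function itself is only illustrative):

def scale(values, factor):
    """Return a new list with every value multiplied by factor.

    The input sequence is not modified.  Non-numeric entries raise a
    TypeError, matching the behaviour of the * operator.
    """
    return [v * factor for v in values]

print(scale.__doc__)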
6.1 PROBLEM DEFINITION
“To find out what role certain properties of an item play and how they affect their sales by understanding Big Mart sales.” In order to help Big Mart achieve this goal, a predictive model can be built to find out, for every store, the key factors that can increase sales and what changes could be made to the product’s or store’s characteristics.
The basic premise of machine learning is to build models and employ algorithms that can receive input data and use statistical analysis to predict an output, while updating the outputs as new data becomes available. If these models are applied in different areas and trained to match the expectations of management, accurate steps can be taken to achieve the organization’s target.
6.2 OVERVIEW OF THE PROJECT
The time series shows multiple strong seasonalities, trend changes, data gaps, and outliers. This project proposes a forecasting approach that is based solely on data retrieved from point-of-sale systems and allows for straightforward human interpretation. It therefore proposes two generalized models for predicting future sales. In an extensive evaluation, datasets consisting of supermarket sales data are used.
The main motivation of this project is to present a sales prediction model for supermarket data. Further, this work aims to identify the best classification algorithm for sales analysis.
In this work, data mining classification algorithms such as Naïve Bayes classification are addressed and used to develop a prediction system in order to analyze and predict the sales volume. In addition, SVM and KNN classification are also used in the proposed system for better classification results. The project is designed using Python 3.7.
6.3 MODULE DESCRIPTION
1. DATASET COLLECTION
In this module, the sales dataset from Kaggle, which contains attributes such as branch, gender, quantity, unit price, total and rating, is taken. Records with null values are eliminated during the preprocessing work. In addition, the Big Mart sales data is also taken.
In this module, the branch-wise and gender-wise rating similarity (conditional probability) is found for ratings both below and above 5.0. Moreover, the number of items sold per Outlet_Size is found, and an item-wise comparison chart is prepared for the sold quantities.
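A possible sketch of this step with pandas is shown below; the file names, the 5.0 rating threshold and the Big Mart column names are assumptions that may need to be adapted:

import pandas as pd

df = pd.read_csv("supermarket_sales.csv").dropna()       # assumed file name
df["Above_5"] = df["Rating"] >= 5.0

# Conditional probability of a rating above 5.0 given branch and gender,
# i.e. P(rating >= 5 | Branch, Gender).
cond_prob = df.groupby(["Branch", "Gender"])["Above_5"].mean()
print(cond_prob)

# Items sold per outlet size (Big Mart data), used for the comparison charts.
big_mart = pd.read_csv("big_mart_sales.csv").dropna()     # assumed file name
print(big_mart.groupby("Outlet_Size")["Item_Outlet_Sales"].sum())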
3. SVM CLASSIFICATION
SVM stands for Support Vector Machine. It is a machine learning approach used for classification and regression analysis. It depends on supervised learning models trained by learning algorithms, which analyze large amounts of data to identify patterns. An SVM separates each category of data in a high-dimensional space, using almost all attributes, and it separates the space in a single pass to generate flat, linear partitions. The two categories are divided by a clear gap that should be as wide as possible; this partitioning is done by a plane called a hyperplane. An SVM creates the hyperplane that has the largest margin in a high-dimensional space to separate the given data into classes. The margin between the two classes represents the distance between the closest data points of those classes; the larger the margin, the lower the generalization error of the classifier. In this module, the records are classified into the two rating classes using SVM classification.
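A minimal sketch of this module using scikit-learn's SVC is given below; the file name, feature columns and rating-based target are assumptions:

import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed file and columns; the binary target is the rating class described above.
df = pd.read_csv("supermarket_sales.csv").dropna()
X = df[["Quantity", "Unit price", "Total"]]
y = (df["Rating"] >= 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
svm_model = SVC(kernel="linear", C=1.0)   # maximum-margin hyperplane in feature space
svm_model.fit(X_train, y_train)
print(classification_report(y_test, svm_model.predict(X_test)))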
4. KNN CLASSIFICATION
In this module, KNN classification is performed with the K value given as 6 and the type column (generated from the rating value as below or above 5.0) as the binary classification factor. 75% of the data is given as training data and 25% as testing data. For each testing record, the record number and the predicted record type are found and displayed as the result.
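A possible sketch of this module with scikit-learn's KNeighborsClassifier, under the same assumed file and columns:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Assumed file and columns; "type" is the binary class derived from the rating value.
df = pd.read_csv("supermarket_sales.csv").dropna()
df["type"] = (df["Rating"] >= 5.0).astype(int)

X = df[["Quantity", "Unit price", "Total"]]
X_train, X_test, y_train, y_test = train_test_split(X, df["type"],
                                                    test_size=0.25, random_state=42)
knn = KNeighborsClassifier(n_neighbors=6)        # K value given as 6
knn.fit(X_train, y_train)
for record_no, predicted in zip(X_test.index, knn.predict(X_test)):
    print("record", record_no, "-> predicted type", predicted)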
5. CNN CLASSIFICATION
A Convolutional Neural Network is built for modeling the image data. CNNs are modified versions of regular neural networks, adapted specifically for image data. Feeding images to regular neural networks would require the network to have a large number of input neurons. The main purpose of a convolutional neural network is to take advantage of the spatial structure of the image, to extract high-level features from it, and then to train on those features. It does so by performing a convolution operation on the matrix of pixel values.
Dropout, Batch-normalization and Flatten layers are also used in addition to the convolutional layers. The Flatten layer converts the output of the convolutional layers into a one-dimensional feature vector. It is important to flatten the outputs because Dense (fully connected) layers only accept a feature vector as input. The Dropout and Batch-normalization layers prevent the model from overfitting.
Once the model is created, it can be compiled using ‘model.compile’. The model is trained for just five epochs, but the number of epochs can be increased. After the training process is completed, predictions can be made on the test set. The accuracy value is displayed during the iterations. Multi-class image labelling is also possible here.
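A minimal Keras sketch of such a network is given below; the input shape, layer sizes, number of classes and the commented-out training call are assumptions for illustration:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # convert feature maps to a 1-D feature vector
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                    # reduce overfitting
    layers.Dense(10, activation="softmax")  # multi-class output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))  # trained for five epochs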
6.4 INPUT DESIGN
The system takes input from the users, processes it and produces an output. Input design is the link that ties the information system into the world of its users. The system should be user-friendly so that the appropriate information can be obtained from the user. The decisions made during the input design are outlined below.
The input data of a system need not necessarily be raw data captured in the system from scratch; they can also be the output of another system or subsystem. The design of input covers all the phases of input, from the creation of the initial data to the actual entering of the data into the system for processing. It involves identifying the data needed, specifying the characteristics of each data item, capturing and preparing data for computer processing, and ensuring the correctness of the data. Any ambiguity in the input leads to a total fault in the output. The goal of designing the input data is to make data entry as easy and error-free as possible.
6.5 OUTPUT DESIGN
Output design generally refers to the results and information that are generated by the system. For many end-users, output is the main reason for developing the system and the basis on which they evaluate the usefulness of the application.
The output is designed in such a way that it is attractive, convenient and informative. As the outputs are the most important sources of information for the users, a better design improves the system’s relationship with the users and also helps in decision-making. Form design elaborates the way output is presented and the layout available for capturing information.
6.6 SYSTEM FLOW DIAGRAM
[System flow diagram: the dataset is classified (including K-Nearest Neighbor classification) and the resulting charts are displayed.]
When the initial design was done for the system, the client was consulted for acceptance of the design so that further development of the system could be carried out. After the development of the system, a demonstration of its working was given to the client. The aim of this illustration of the system was to identify any malfunction of the system.
10.1 BOOK REFERENCES
[1] Smola, A., & Vishwanathan, S. V. N. (2008). Introduction to machine learning. Cambridge University, UK, 32,
34.
[2] Saltz, J. S., & Stanton, J. M. (2017). An introduction to data science. Sage Publications.
[3] Daumé III, H. (2012). A course in machine learning. ciml.info, 5, 69.
[5] Neal, R. M., and Hinton, G. E. (1998) A new view of the EM algorithm that justifies incremental, sparse, and
other variants. In Learning in Graphical Models, ed. by M. I. Jordan, NATO Science Series, pp. 355–368. Kluwer.
[6] Nielsen, M., and Chuang, I. (2000) Quantum Computation and Quantum Information. Cambridge Univ. Press.
[7] Offer, E., and Soljanin, E. (2000) An algebraic description of iterative decoding schemes. In Codes, Systems and
Graphical Models, ed. by B. Marcus and J. Rosenthal, volume 123 of IMA Volumes in Mathematics and its
Applications, pp. 283–298. Springer.
[8] Offer, E., and Soljanin, E. (2001) LDPC codes: a group algebra formulation. In Proc. Internat. Workshop on
Coding and Cryptography WCC 2001, 8-12 Jan. 2001, Paris
APPENDIX
A. SAMPLE SCREENS