MAJOR PROJECT Documentation
BACHELOR OF TECHNOLOGY
IN
Computer Science & Engineering
Submitted By
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
CH SAIRAM (17E41A0587)
Under the Esteemed guidance of
[Link]
Assistant Professor, Department of CSE
2(f) recognition by UGC
Accredited by NAAC for 5 years
CERTIFICATE
This is to certify that the major project report entitled “FAKE NEWS DETECTION ON
TWITTER” is being submitted by
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
J KARUNAKARA (17E41A05A0)
CH SAIRAM (17E41A0587)
in partial fulfillment of the requirement for the award of the degree of B.Tech. in Computer
Science and Engineering to the Sree Dattha Institute of Engineering & Science, Hyderabad, is a
record of bonafide work carried out by them under my/our guidance.
The results presented in this report have been verified and are found to be satisfactory. The results
embodied in this report have not been submitted to any other University for the award of any other
degree or diploma.
Internal Guide                                        HOD In-Charge
[Link]
External Examiner
ii
SREE DATTHA INSTITUTE OF ENGINEERING & SCIENCE
(Approved by AICTE, Affiliated to JNTU, Hyderabad)
Sheriguda, Ibrahimpatnam, R.R. Dist. Hyderabad.
2(f) recognition by UGC
Accredited by NAAC for 5 years
DECLARATION
We hereby declare that the project report titled “FAKE NEWS DETECTION ON
TWITTER”, submitted in partial fulfillment of the requirement for the award of the degree of
Bachelor of Technology in Computer Science & Engineering at Sree Dattha Institute of
Engineering and Science, affiliated to Jawaharlal Nehru Technological University, Hyderabad, is a
record of bona fide work carried out by us, and the results embodied in this project have not been
reproduced or copied from any source.
The results embodied in this project report have not been submitted to any other University
or Institute for the award of any Degree or Diploma.
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
J KARUNAKARA REDDY(17E41A05A0)
CH SAIRAM (17E41A0587)
iii
ACKNOWLEDGMENT
We are extremely grateful to Shri G. PanduRanga Reddy, Chairman; Dr. Md.
Sameeruddin Khan, Principal; and Dr. Amol Purohit, Head of the Department of
CSE, Sree Dattha Institute of Engineering & Science.
We express our thanks to all staff members and friends for all the help and
coordination extended in bringing out this project successfully and on time.
Finally, we are very much thankful to our parents who guided us for every step.
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
J KARUNAKARA REDDY (17E41A05A0)
CH SAIRAM (17E41A0587)
Date:
Place:
iv
TABLE OF CONTENTS
S.NO    CONTENTS    PAGE NO
TITLE PAGE ⅰ
CERTIFICATION ⅱ
DECLARATION ⅲ
ACKNOWLEDGEMENT ⅳ
ABSTRACT Ⅶ
1 INTRODUCTION 1
2 LITERATURE SURVEY 2
3 SYSTEM ANALYSIS 14
3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
4 SYSTEM OVERVIEW AND REQUIREMENTS 18
5 SOFTWARE TOOLS AND DESCRIPTION 21
6 OBJECTIVES 26
v
7 SYSTEM DESIGN 28
8 CODING 35
8.1 FRONT END
8.2 BACK END
9 SYSTEM TESTING 42
9.1 TESTING
9.2 PRINCIPLES OF TESTING
9.3 TYPES OF TESTING
10 RESULT 45
11 CONCLUSION 52
12 FUTURE SCOPE 55
13 REFERENCES 57
vi
ABSTRACT
Problem statement:
The project is concerned with identifying a solution that could be used to detect and filter out sites
containing fake news, for the purpose of helping users avoid being lured by clickbait. It is imperative
that such solutions are identified, as they will prove useful to both readers and the tech companies
involved in the issue.
Solution:
Social media provide a platform for quick and seamless access to information. However, the
propagation of false information raises major concerns, especially given that social
media are the primary source of information for a large percentage of the population. False
information may manipulate people’s beliefs and have real-life consequences. Therefore, one
major challenge is to automatically identify false information by categorizing it into different
types and notify users about the credibility of different articles shared online.
Recent political events have led to an increase in the popularity and spread of fake news.
As demonstrated by the widespread effects of the large onset of fake news, humans are
inconsistent if not outright poor detectors of fake news. With this, efforts have been made to
automate the process of fake news detection. The most popular of such attempts include
“blacklists” of sources and authors that are unreliable. While these tools are useful, in order
to create a more complete end-to-end solution, we need to account for the more difficult cases
where reliable sources and authors release fake news.
The growth of social media has revolutionized the way people access information.
Although platforms like Facebook and Twitter allow for quicker, wider and less restricted
access to information, they also constitute a breeding ground for the dissemination of fake
news. Most of the existing literature on fake news detection on social media proposes user-
based or content-based approaches. However, recent research revealed that real and fake
news also propagate significantly differently on Twitter.
As such, the goal of this project was to create a tool for detecting the language patterns that
characterize fake and real news through the use of machine learning and natural language
processing techniques. The results of this project demonstrate that machine learning can
be useful for this task. We have built a model that catches many intuitive indications of real
and fake news.
vii
1. INTRODUCTION
viii
1. INTRODUCTION
1.1 Purpose of Project
Fake news creates chaos in our societies, as people spread fake news in order to conspire against
others. People are losing their trust in the media, as news channels sometimes cover such
stories to increase their ratings. Fake news misleads people, and it is getting harder day by day to
separate fact from fiction. It impacts the decisions of the youth, letting them believe
things that are not true, and political parties take advantage of fake news to manipulate
voters.
A large body of recent works has focused on understanding and detecting fake news stories
that are disseminated on social media. To accomplish this goal, these works explore several
types of features extracted from news stories, including source and posts from social media.
In addition to exploring the main features proposed in the literature for fake news detection,
we present a new set of features and measure the prediction performance of current
approaches and features for automatic detection of fake news. Our results reveal interesting
findings on the usefulness and importance of features for detecting false news. Finally, we
discuss how fake news detection approaches can be used in practice. A number of Machine
Learning (ML) algorithms, such as Decision Tree, Random Forest, and Logistic Regression, were
applied for the classification and prediction of the fake news dataset, and many
promising results have been presented in the literature.
1
[Link] SURVEY
2
2. LITERATURE SURVEY
The name machine learning was coined in 1959 by Arthur Samuel. Machine learning
explores the study and construction of algorithms that can learn from and make predictions
on data. Machine learning is closely related to (and often overlaps with) computational
statistics, which also focuses on prediction-making through the use of computers. It has
strong ties to mathematical optimization, which delivers methods, theory and application
domains to the field. Machine learning is sometimes conflated with data mining, where the
latter subfield focuses more on exploratory data analysis and is known as unsupervised
learning.
Within the field of data analytics, machine learning is a method used to devise complex
models and algorithms that lend themselves to prediction; in commercial use, this is known
as predictive analytics. These analytical models allow researchers, data scientists, engineers,
and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden
insights" through learning from historical relationships and trends in the data
Machine learning tasks
Machine learning tasks are typically classified into several broad categories:
Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs. As special cases, the input signal can be only partially available, or restricted to
special feedback.
Active learning: The computer can only obtain training labels for a limited set of
instances (based on a budget), and also has to optimize its choice of objects to acquire labels
for. When used interactively, these can be presented to the user for labelling.
3
Reinforcement learning: Data (in form of rewards and punishments) are given only as
feedback to the program's actions in a dynamic environment, such as driving a vehicle or
playing a game against an opponent.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its
own to find structure in its input. Unsupervised learning can be a goal in itself (discovering
hidden patterns in data) or a means towards an end (feature learning).
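As a brief, hedged illustration (toy data, not the project's dataset), the contrast between supervised and unsupervised learning comes down to whether labels are supplied:

from sklearn.cluster import KMeans                   # unsupervised: inputs only
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression  # supervised: inputs plus labels

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
supervised = LogisticRegression().fit(X, y)                            # learns a rule mapping inputs to labels
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # finds structure without labels
print(supervised.predict(X[:3]), unsupervised.labels_[:3])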
Fig 2.1 - Machine Learning Categories
• Machine learning models involve machines learning from data without human help
or any kind of human intervention.
4
• Machine Learning is the science of making computers learn and act like humans
by feeding them data and information without being explicitly programmed.
• Machine Learning is a combination of Algorithms, Datasets, and Programs.
• Machine Learning is quite different from traditional programming: here, data and output
are given to the computer, and in return it gives us the program which provides solutions
to various problems, as shown in the figure below.
Decision trees are constructed via an algorithmic approach that identifies ways to split a
data set based on different conditions. It is one of the most widely used and practical methods
for supervised learning. Decision Trees are a non-parametric supervised learning method
used for both classification and regression tasks.
Tree models where the target variable can take a discrete set of values are called
classification trees. Decision trees where the target variable can take continuous values
(typically real numbers) are called regression trees. Classification And Regression Tree
(CART) is the general term for this.
Data comes in records of the form (x, Y) = (x1, x2, x3, ..., xk, Y). The dependent variable,
Y, is the target variable that we are trying to understand, classify or generalize. The vector x is
composed of the features x1, x2, x3, etc., that are used for that task. While building the decision
tree, at each node of the tree we ask different types of questions. Based on the question asked,
we calculate the corresponding information gain.
5
Information Gain
Information gain is used to decide which feature to split on at each step in building the
tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should
choose the split that results in the purest daughter nodes. A commonly used measure of purity
is called information. For each node of the tree, the information value measures how much
information a feature gives us about the class. The split with the highest information gain will
be taken as the first split, and the process will continue until all child nodes are pure, or
until the information gain is 0.
Algorithms for constructing decision trees usually work top-down, choosing at each step the
variable that best splits the set of items. Different algorithms use different metrics for
measuring the best split.
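As a minimal sketch (illustrative only, not taken from the project code), entropy-based information gain for one candidate split can be computed as follows:

import numpy as np

def entropy(labels):
    # Shannon entropy of a collection of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy of the parent node minus the weighted entropy of the daughter nodes
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A split that produces two pure daughters yields the maximum gain of 1.0
print(information_gain(['fake', 'fake', 'real', 'real'], ['fake', 'fake'], ['real', 'real']))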
Gini Impurity
Pure:
Pure means that, in a selected sample of the dataset, all data belongs to the same class.
Impure:
Impure means that the data is a mixture of different classes.
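A similarly small, hedged sketch shows how Gini impurity separates pure from impure samples:

import numpy as np

def gini_impurity(labels):
    # 0 for a pure node; grows as the classes mix
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(['fake', 'fake', 'fake']))          # 0.0 -> pure
print(gini_impurity(['fake', 'real', 'fake', 'real']))  # 0.5 -> maximally impure for two classes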
6
Disadvantages of Decision Trees
Prone to overfitting.
Require some kind of measurement of how well they are doing.
Need careful parameter tuning.
Can create biased trees if some classes dominate.
The decision tree is a decision support tool. It uses a tree-like graph to show the possible
consequences. If you input a training dataset with targets and features into the decision tree, it
will formulate some set of rules. These rules can be used to perform predictions. Through the
decision tree algorithm, you can generate the rules. You can then input the features of this
movie and see whether it will be liked by your daughter. The process of calculating these
nodes and forming the rules uses information gain and Gini index calculations.
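As a hedged illustration of this workflow (the movie-preference table below is invented for the example, not taken from the project), scikit-learn's DecisionTreeClassifier derives such rules from labelled data and applies them to a new case:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: a numeric genre code and the duration in minutes
X = pd.DataFrame({"genre_code": [0, 0, 1, 1, 2], "duration": [90, 120, 95, 150, 100]})
y = ["liked", "liked", "liked", "disliked", "disliked"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))                    # the learned if/else rules
print(tree.predict(pd.DataFrame({"genre_code": [1], "duration": [100]})))  # prediction for a new movie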
The difference between the Random Forest algorithm and the decision tree algorithm is that in
Random Forest, the processes of finding the root node and splitting the feature nodes
run randomly.
Overfitting is one critical problem that may make the results worse, but for Random Forest
algorithm, if there are enough trees in the forest, the classifier won’t overfit the model. The
third advantage is the classifier of Random Forest can handle missing values, and the last
advantage is that the Random Forest classifier can be modeled for categorical values.
There are two stages in Random Forest algorithm, one is random forest creation, the other
is to make a prediction from the random forest classifier created in the first stage.
1. Randomly select “K” features from total “m” features where k << m
2. Among the “K” features, calculate the node “d” using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps 1 to 3 until “l” number of nodes has been reached
5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
In the next stage, with the random forest classifier created, we will make the prediction.
The random forest prediction pseudocode is shown below:
Take the test features and use the rules of each randomly created decision tree to predict
the outcome, and store the predicted outcome (target). Calculate the votes for each predicted
target.
Consider the highest-voted predicted target as the final prediction from the random forest
algorithm.
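A hedged sketch of these two stages with scikit-learn (synthetic data and illustrative parameter values, not the project's):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,     # stage 1: build "n" randomized trees
                                max_features="sqrt",  # sample "K" of the "m" features per split
                                random_state=0).fit(X_train, y_train)
print(forest.predict(X_test[:5]))    # stage 2: each prediction is the majority vote of the 100 trees
print(forest.score(X_test, y_test))  # overall accuracy on the held-out set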
Advantages:
• Compared with other classification techniques, there are three advantages as the author
mentioned.
• For applications in classification problems, the Random Forest algorithm avoids the
overfitting problem.
• For both classification and regression tasks, the same Random Forest algorithm can be used.
• The Random Forest algorithm can be used to identify the most important features of
the training dataset; in other words, feature engineering.
8
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify observations using different
types of data and can easily determine the most effective variables for
the classification. The image below shows the logistic (sigmoid) function, which
maps any real value to another value within the range of 0 and 1.
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
To implement the Logistic Regression using Python, we will use the same steps as we have
done in previous topics of Regression. Below are the steps:
9
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result (a brief sketch of these steps follows below).
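The listed steps map to a few lines of scikit-learn; the following is a hedged sketch on synthetic data rather than the project's dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

clf = LogisticRegression().fit(X_train, y_train)  # 1. fit Logistic Regression to the training set
pred = clf.predict(X_test)                        # 2. predict the test result
print(accuracy_score(y_test, pred))               # 3. test accuracy of the result
print(confusion_matrix(y_test, pred))             #    creation of the confusion matrix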
Advantages
Disadvantages
Non-linear problems can’t be solved with logistic regression because it has a linear
decision surface. Linearly separable data is rarely found in real-world scenarios.
10
exception) comes from the basic assumptions they work with: in machine learning,
performance is usually evaluated with respect to the ability to reproduce known
knowledge, while in knowledge discovery and data mining (KDD) the key task is the
discovery of previously unknown knowledge.
2.6 JupyterLab
Jupyter Lab is a web-based interactive development environment for Jupyter
notebooks, code, and data. JupyterLab is flexible: configure and arrange the user
interface to support a wide range of workflows in data science, scientific computing,
and machine learning. JupyterLab is extensible and modular: you can write plugins that add
new components and integrate with existing ones. JupyterLab is a next-generation
web-based user interface for Project Jupyter. JupyterLab enables you to work with documents
and activities such as Jupyter notebooks, text editors, terminals, and custom
components in a flexible, integrated, and extensible manner.
You can arrange multiple documents and activities side by side in the work area using
tabs and splitters. Documents and activities integrate with each other, enabling new
workflows for interactive computing, for example: Code Consoles provide transient
scratchpads for running code interactively, with full support for rich output. A code
console can be linked to a notebook kernel as a computation log from the notebook,
for example. Kernel-backed documents enable code in any text file (Markdown,
Python, R, LaTeX, etc.) to be run interactively in any Jupyter kernel. Notebook cell
outputs can be mirrored into their own tab, side by side with the notebook, enabling
simple dashboards with interactive controls backed by a kernel. Multiple views of
documents with different editors or viewers enable live editing of documents reflected
in other viewers. For example, it is easy to have live preview of Markdown,
Delimiter-separated Values, or Vega/Vega-Lite documents. JupyterLab also offers a
unified model for viewing and handling data formats. JupyterLab understands many
file formats (images, CSV, JSON, Markdown, PDF, Vega, Vega-Lite, etc.) and can
also display rich kernel output in these formats. See File and Output Formats for more
information.
Packages
Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for
manipulating numerical tables and time series. It is free software released under the
three-clause BSD license. The name is derived from the term "panel data", an
econometrics term for data sets that include observations over multiple time periods
for the same individuals. Its name is a play on the phrase "Python data analysis" itself.
Library features
Tools for reading and writing data between in-memory data structures and different
file formats.
Data alignment and integrated handling of missing data.
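A brief, hedged illustration of these two features (the file name and values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=4),
                   "count": [120.0, np.nan, 95.0, 101.0]})
df.to_csv("example.csv", index=False)                   # writing an in-memory structure to a file format
df2 = pd.read_csv("example.csv", parse_dates=["date"])  # reading it back
print(df2.isna().sum())                                 # integrated detection of missing data
print(df2["count"].fillna(df2["count"].mean()))         # e.g. impute the missing value with the mean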
Seaborn
MatPlotLib
Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its
use is discouraged. SciPy makes use of Matplotlib.
Matplotlib was originally written by John D. Hunter, since then it has an active
development community,[4] and is distributed under a BSD-style license. Michael
Droettboom was nominated as matplotlib's lead developer shortly before John
Hunter's death in August 2012, and further joined by Thomas Caswell.
Matplotlib 2.0.x supports Python versions 2.7 through 3.6. Python 3 support started
with Matplotlib 1.2. Matplotlib 1.4 is the last version to support Python 2.6.
Matplotlib has pledged not to support Python 2 past 2020 by signing the Python 3
Statement.
Several toolkits are available which extend Matplotlib functionality. Some are
separate downloads, others ship with the Matplotlib source code but have external
dependencies.
Basemap: map plotting with various map projections, coastlines, and political
boundaries.
Natgrid: interface to the natgrid library for gridding irregularly spaced data.
Seaborn: provides an API on top of Matplotlib that offers sane choices for plot style
and color defaults, defines simple high-level functions for common statistical plot
types, and integrates with the functionality provided by Pandas.
Numpy
13
NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric,
was originally created by Jim Hugunin with contributions from several other
developers. In 2005, Travis Oliphant created NumPy by incorporating features of the
competing Numarray into Numeric, with extensive modifications.
NumPy is open-source software and has many contributors. NumPy targets the
CPython reference implementation of Python, which is a non-optimizing bytecode
interpreter. Mathematical algorithms written for this version of Python often run much
slower than compiled equivalents. NumPy addresses the slowness problem partly by
providing multidimensional arrays and functions and operators that operate efficiently
on arrays, requiring rewriting some code, mostly inner loops, using NumPy. Using
NumPy in Python gives functionality comparable to MATLAB since they are both
interpreted, and they both allow the user to write fast programs as long as most
operations work on arrays or matrices instead of scalars. In comparison, MATLAB
boasts a large number of additional toolboxes, notably Simulink, whereas NumPy is
intrinsically integrated with Python, a more modern and complete programming
language.
Moreover, complementary Python packages are available; SciPy is a library that adds
more MATLAB- like functionality and Matplotlib is a plotting package that provides
MATLAB-like plotting functionality. Internally, both MATLAB and NumPy rely on
BLAS and LAPACK for efficient linear algebra computations.
Python bindings of the widely used computer vision library OpenCV utilize NumPy
arrays to store and operate on data. Since images with multiple channels are simply
represented as three-dimensional arrays, indexing, slicing or masking with other
arrays are very efficient ways to access specific pixels of an image. The NumPy array
as the universal data structure in OpenCV for images, extracted feature points, filter
kernels and many more vastly simplifies the programming workflow and debugging.
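A short, hedged sketch of the array-oriented style described above (shapes and values are illustrative):

import numpy as np

# Vectorized arithmetic replaces explicit Python loops
a = np.arange(1_000_000, dtype=np.float64)
b = np.sqrt(a) + 2 * a                 # evaluated in compiled loops, not Python bytecode

# Images as arrays: slicing and boolean masking select pixels efficiently
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)  # height x width x channels
top_left = img[:32, :32, :]            # slicing a region of the image
img[img > 200] = 255                   # masking: saturate the brightest values
print(b[:3], top_left.shape)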
Kaggle
Our dataset is extracted from Kaggle. Kaggle, a subsidiary of
Google LLC, is an online community of data scientists and machine learning
practitioners. Kaggle allows users to find and publish data sets, explore and build
models in a web-based data-science environment, work with other data scientists and
machine learning engineers, and enter competitions to solve data science challenges.
Kaggle's services:
Machine learning competitions: this was Kaggle's first product. Companies post
problems and machine learners compete to build the best algorithm, typically with
cash prizes.
Kaggle Kernels: a cloud-based workbench for data science and machine learning.
Allows data scientists to share code and analysis in Python, R and R Markdown. Over
150K "kernels" (code snippets) have been shared on Kaggle covering everything from
sentiment analysis to object detection.
Public datasets platform: community members share datasets with each other. Has
datasets on everything from bone x-rays to results from boxing bouts.
Kaggle has run hundreds of machine learning competitions since the company was
founded. Competitions have ranged from improving gesture recognition for
Microsoft Kinect to making a football AI for Manchester City to improving the
search for the Higgs boson at CERN.
DATASET:
[Link]
The dataset consists of 28,000 real news articles and 31,000 fake news articles. There are 4 columns in the dataset.
ATTRIBUTES:
15
3. SYSTEM ANALYSIS
16
3. SYSTEM ANALYSIS
17
4. SYSTEM REQUIREMENTS AND OVERVIEW
18
4. SYSTEM REQUIREMENT SPECIFICATION
4.1 What is SRS?
Software Requirement Specification (SRS) is the starting point of the software development
activity. As systems grew more complex, it became evident that the goals of the entire system
could not be easily comprehended; hence the need for the requirements phase arose. The
software project is initiated by the client's needs. The SRS is the means of translating the ideas
in the minds of the clients (the input) into a formal document (the output of the requirements
phase). The SRS phase consists of two basic activities. Problem/Requirement Analysis: this
process, the harder and more nebulous of the two, deals with understanding the problem, the goals
and the constraints. Requirement Specification: here, the focus is on specifying what has been
found during analysis; issues such as representation, specification languages and tools, and checking
of the specifications are addressed during this activity. The requirements phase terminates with
the production of the validated SRS document. Producing the SRS document is the basic goal
of this phase.
The purpose of the Software Requirement Specification is to reduce the communication gap
between the clients and the developers. The SRS is the medium
through which the client and user needs are accurately specified. It forms the basis of software
development. A good SRS should satisfy all the parties involved in the system.
19
4.4 Software Requirements
Platform :Jupyter-Lab
RAM : 1 GB or above.
20
5. SOFTWARE TOOLS AND DESCRIPTION
21
5. SOFTWARE TOOLS AND DESCRIPTION
5.1 PYTHON
Python is one of the most popular programming languages in both the coding and
Data Science communities. Guido van Rossum created it in 1991, and ever since its
inception it has been one of the most widely used languages, along with C++, Java, etc.
Python is an open-source, high-level, general-purpose programming language that
incorporates the features of object-oriented, structural, and functional programming.
Python’s simple syntax allows for writing readable code, which can be further
applied to complex software development processes to facilitate test-driven software
application development, machine learning, and data analytics. Python can run on all
the major operating systems, including Windows, Linux, and macOS.
• Python has prebuilt libraries like NumPy for scientific computation, SciPy for
advanced computing and PyBrain for machine learning (Python Machine Learning),
making it one of the best languages for AI.
• Python developers around the world provide comprehensive support and assistance
via forums and tutorials, making the job of the coder easier than in any other popular
language.
• Python is platform-independent and is hence one of the most flexible and popular
choices for use across different platforms and technologies, with the least tweaks in
basic coding.
• Python is one of the most flexible languages, with the option to choose between an
OOP approach and scripting. You can also use the IDE itself to check most code, which is a
boon for developers struggling with different algorithms.
Python also supports data analysis and visualization, thereby further simplifying
the process of creating custom solutions without extra effort and time investment.
5.2 LIBRARIES
NUMPY:
1. Installing Numpy:
pip install numpy
Or
conda install numpy
2. Importing Numpy:
import numpy as np
• Pandas:
import pandas as pd
By using the above command, you can easily import pandas library. Using
Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of data — load, prepare,
manipulate, model, and analyze. Python with Pandas is used in a wide
range of fields including academic and commercial domains including
finance, economics, Statistics, analytics, etc.
• SCIKIT LEARN
pip install -U scikit-learn
If you like conda, you can also use conda for package installation:
conda install scikit-learn
Once you are done with the installation, you can use scikit-learn easily
in your Python code by importing it as:
import sklearn
24
• MATPLOTLIB :
1. Installing Matplotlib:
# Windows, Linux, and macOS users can install this library using the
following command:
python -m pip install -U matplotlib
2. Importing Matplotlib:
import matplotlib.pyplot as plt
or
import matplotlib.pyplot as pl
25
6. OBJECTIVES
26
6. OBJECTIVES
Many Believe Fake News Articles
Studies have shown that many Americans cannot tell what news is fake and what news is
real. This can create confusion and misunderstanding about important social and political
issues.
Fake News Can Affect Your Grades
ACC Professors require that you use quality sources of information for your research
assignments and papers. If you use sources that have false or misleading information, you
may get a lower grade.
Fake News Can Be Harmful to Your Health
There are many fake and misleading news stories related to medical treatments and major
diseases like cancer or diabetes. Trusting these false stories could lead you to make decisions
that may be harmful to your health.
Fake News Makes It Harder For People To See the Truth
A Research Center study found that those on the right and the left of the political spectrum
have different ideas about the definition of 'fake news', "The study suggests that fake-news
panic, rather than driving people to abandon ideological outlets and the fringe, may actually
be accelerating the process of polarization: It’s driving consumers to drop some outlets, to
simply consume less information overall, and even to cut out social relationships."
This is why it is important for people to seek out news with as little bias as humanly possible.
News services strive to provide accurate, neutral coverage of major events.
27
7. SYSTEM DESIGN
28
7. SYSTEM DESIGN
These internal and external agents are known as actors. So use case diagrams
consist of actors, use cases and their relationships. The diagram is used to
model the system/subsystem of an application. A single use case diagram
captures a particular functionality of a system. So, to model an entire system,
a number of use case diagrams are used.
Use case diagrams are used to gather the requirements of a system including
internal and external influences. These requirements are mostly design
requirements. So when a system is analysed to gather its functionalities use cases
are prepared and actors are identified. In brief, the purposes of use case diagrams
can be as follows:
a. Used to gather requirements of a system.
b. Used to get an outside view of a system.
c. Identify external and internal factors influencing the system.
d. Show the interactions among the requirements and actors.
30
Sequence diagrams describe interactions among classes in terms of an exchange of messages
over time. They're also called event diagrams. A sequence diagram is a good way to visualize
and validate various runtime scenarios. These can help to predict how a system will behave
and to discover responsibilities a class may need to have in the process of modelling a new
system.
The aim of a sequence diagram is to define event sequences, which would have a desired
outcome. The focus is more on the order in which messages occur than on the message per se.
However, the majority of sequence diagrams will communicate what messages are sent and
the order in which they tend to occur.
Activation boxes represent the time an object needs to complete a task. When an object is
busy executing a process or waiting for a reply message, use a thin grey rectangle placed
vertically on its lifeline.
Messages
Messages are arrows that represent communication between objects. Use half-arrowed lines
to represent asynchronous messages. Asynchronous messages are sent from an object that
will not wait for a response from the receiver before continuing its tasks.
Lifelines
Lifelines are vertical dashed lines that indicate the object's presence over time.
Destroying Objects
Objects can be terminated early using an arrow labelled "<< destroy >>" that points to an X.
This object is removed from memory. When that object's lifeline ends, you can place an X at
the end of its lifeline to denote a destruction occurrence.
Loops
A repetition or loop within a sequence diagram is depicted as a rectangle. Place the condition
for exiting the loop at the bottom left corner in square brackets [].
Guards
When modelling object interactions, there will be times when a condition must be met for a
message to be sent to an object. Guards are conditions that need to be used throughout UML
diagrams to control flow.
31
Fig 7.2.2 – Sequence Diagram
The class diagram is the only UML diagram that can appropriately depict various aspects of the OOP concept.
With class diagrams, design and analysis of an application can be faster and more efficient.
The class diagram is the base for deployment and component diagrams.
Each class is represented by a rectangle subdivided into three compartments:
name, attributes and operations.
32
Fig 7.2.3-Class Diagram
The use of object diagrams is fairly limited, mainly to show examples of data structures.
During the analysis phase of a project, you might create a class diagram to describe the
structure of a system and then create a set of object diagrams as test cases to verify the
accuracy and completeness of the class diagram. Before you create a class diagram, you
might create an object diagram to discover facts about specific model elements and their
links, or to illustrate specific examples of the classifiers that are required.
An object diagram shows this relation between the instantiated classes and the defined class,
and the relation between these objects in the system. They can be useful for explaining smaller
portions of your system when the system class diagram is very complex, and also
for modeling recursive relationships.
Object Names:
33
Every object is symbolized as a rectangle, which shows the name of the object and
its class, underlined and separated by a colon.
Object Attributes:
Similar to classes, you can list object attributes in a separate compartment.
However, unlike classes, object attributes must have values assigned to them.
Links:
Links are instances of associations. You can draw a link using the
lines used in class diagrams.
34
8. CODING
35
8. CODING
8.1 Pseudo Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import feature_extraction, linear_model, model_selection, preprocessing, metrics
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Read the datasets (file names assumed; they were garbled in the original listing)
fake = pd.read_csv("data/Fake.csv")
true = pd.read_csv("data/True.csv")
fake.shape  # quick look at the dataframes (the original calls here were garbled)
true.shape
fake['target'] = 'fake'
true['target'] = 'true'
# Concatenate dataframes
data = pd.concat([fake, true]).reset_index(drop=True)
data.shape
# Convert to lowercase
data['text'] = data['text'].apply(lambda x: x.lower())
# Remove punctuation
import string

def punctuation_removal(text):
    all_list = [char for char in text if char not in string.punctuation]
    clean_str = ''.join(all_list)
    return clean_str
data['text'] = data['text'].apply(punctuation_removal)
# Check
data.head()
# Removing stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
# Drop the stop words from each article (this step was missing from the listing)
data['text'] = data['text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))
# Word clouds for the fake and the true news (the WordCloud construction was missing from the listing)
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, max_font_size=110, collocations=False).generate(' '.join(data[data["target"] == "fake"].text))
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordcloud = WordCloud(width=800, height=500, max_font_size=110, collocations=False).generate(' '.join(data[data["target"] == "true"].text))
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
from nltk import tokenize
token_space = tokenize.WhitespaceTokenizer()
def counter(text, column_text, quantity):
    # Plot the `quantity` most frequent words of a dataframe column
    all_words = ' '.join([text for text in text[column_text]])
    token_phrase = token_space.tokenize(all_words)
    frequency = nltk.FreqDist(token_phrase)
    df_frequency = pd.DataFrame({"Word": list(frequency.keys()),
                                 "Frequency": list(frequency.values())})
    df_frequency = df_frequency.nlargest(columns="Frequency", n=quantity)
    plt.figure(figsize=(12,8))
    ax = sns.barplot(data=df_frequency, x="Word", y="Frequency", color='blue')
    ax.set(ylabel="Count")
    plt.xticks(rotation='vertical')
    plt.show()
# Modeling
# Helper to visualise results. The function header and the first plotting lines were missing
# from the listing; they are reconstructed here from the standard scikit-learn example.
import itertools

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Logistic regression
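# The train/test split and the fitted model are not shown in the original listing. A minimal
# sketch (assumed 80/20 split and default hyperparameters) so that the accuracy cell below
# has a fitted `model` to call:
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'],
                                                    test_size=0.2, random_state=42)
model = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', linear_model.LogisticRegression())])
model.fit(X_train, y_train)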
# Accuracy
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
cm = metrics.confusion_matrix(y_test, prediction)
plot_confusion_matrix(cm, classes=['Fake', 'Real'])
# Accuracy
# The same evaluation block is repeated in the report for the other classifiers that were
# trained (e.g. the Decision Tree and Random Forest models), each fitted in place of `model`:
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
cm = metrics.confusion_matrix(y_test, prediction)
plot_confusion_matrix(cm, classes=['Fake', 'Real'])
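Since the conclusion chapter reports the Decision Tree as the best-performing model, the sketch below shows how that classifier could be slotted into the same pipeline; the hyperparameter values are assumed for illustration and are not taken from the original work.

from sklearn.tree import DecisionTreeClassifier
dt_model = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', DecisionTreeClassifier(criterion='entropy', max_depth=20, random_state=42))])
dt_model.fit(X_train, y_train)
prediction = dt_model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))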
41
9. SYSTEM TESTING
42
9. SYSTEM TESTING
9.1 Introduction to testing
Testing is the process of evaluating a system or its component(s) with the intent to
find whether it satisfies the specified requirements or not. Testing is executing a
system in order to identify any gaps, errors, or missing requirements contrary to the
actual requirements.
It depends on the process and the associated stakeholders of the project(s). In the IT
industry, large companies have a team with responsibilities to evaluate the developed
software in context of the given requirements. Moreover, developers also conduct
testing, which is called Unit Testing. In most cases, the following professionals are
involved in testing a system within their respective capacities:
• Software Tester
• Software Developer
• Project Lead/Manager
• End User
Levels of testing include different methodologies that can be used while conducting
software testing. The main levels of software testing are:
• Functional Testing
• Non-functional Testing
Functional Testing
This is a type of black-box testing that is based on the specifications of the software
that is to be tested. The application is tested by providing input, and the results are then
examined to verify that they conform to the functionality the software was intended for. Functional
testing of a software product is conducted on a complete, integrated system to evaluate the
system's compliance with its specified requirements.
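For instance, a minimal unit-level functional test of this project's text-cleaning step could look like the hedged sketch below (pytest is assumed as the test runner; punctuation_removal is the helper from the coding chapter):

# test_preprocessing.py -- run with `pytest`
import string

def punctuation_removal(text):
    return ''.join(ch for ch in text if ch not in string.punctuation)

def test_punctuation_is_stripped():
    assert punctuation_removal("Breaking!!! News???") == "Breaking News"

def test_plain_text_is_unchanged():
    assert punctuation_removal("no punctuation here") == "no punctuation here"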
The process of testing software in a well-planned and systematic way is known as the
software testing life cycle (STLC).
Different organizations have different phases in the STLC; however, a generic Software Test
Life Cycle (STLC) for the waterfall development model consists of the following phases.
1. Requirements Analysis
2. Test Planning
43
3. Test Analysis
4. Test Design
• Requirements Analysis
In this phase, testers analyze the customer requirements and work with developers
during the design phase to see which requirements are testable and how they are going
to test those requirements.
It is very important to start testing activities from the requirements phase itself,
because the cost of fixing a defect is much lower if it is found in the requirements phase
rather than in later phases.
• Test Planning
In this phase, all the planning about testing is done: what needs to be tested, how
the testing will be done, the test strategy to be followed, what the test environment
will be, what test methodologies will be followed, hardware and software
availability, resources, risks, etc. A high-level test plan document is created which
includes all the planning inputs mentioned above and is circulated to the stakeholders.
• Test Analysis
After the test planning phase is over, the test analysis phase starts. In this phase we need to
dig deeper into the project and figure out what testing needs to be carried out in each
SDLC phase. Automation activities are also decided in this phase: whether automation needs
to be done for the software product, how the automation will be done, how much time it will
take to automate, and which features need to be automated. Non-functional
testing areas (stress and performance testing) are also analyzed and defined
in this phase.
• Test Design
In this phase, various black-box and white-box test design techniques are used to
design the test cases. Testers start writing test cases by following those
design techniques; if automation testing needs to be done, then automation scripts also
need to be written in this phase.
44
10. RESULT
45
10. RESULT
52
11. CONCLUSION
53
11. CONCLUSION
In this research, we studied how fake messages were used on Twitter during the
year 2016. A classifier was developed to detect potential fake messages. A dataset of
Twitter messages from 2016 was classified, and a qualitative content analysis was
performed on these classified tweets. In this chapter, the results of our research are
discussed. We trained and compared three different supervised Machine Learning
algorithms. These algorithms were trained and tested on a data sample of tweets.
The sample contained an equal number of ‘true’ and ‘false’ tweets and was used to
train multiple supervised Machine Learning algorithms. The performance of these
algorithms was compared using accuracy. The best-performing algorithm, the Decision
Tree algorithm, was used to label 613,033 tweets from 2016. The tweets labelled as false
were analysed in detail. Distinctive features of the different false tweets were found, and
the tweets were categorized into six different categories. As a result, we can conclude
that the Decision Tree algorithm is the best algorithm for the classification of true and
false messages. The algorithm performed best, with a weighted accuracy of 99%. Since
our research needed a classifier that can handle a considerable amount of data, this
algorithm is the most suitable for the research. The Decision Tree algorithm was used to
classify the different tweets in the database as either true messages or false messages. In
total, 613,033 tweets were classified, of which 328,897 were classified as true and
284,136 were classified as false.
54
12. FUTURE SCOPE
55
12. FUTURE SCOPE
Our research has limitations that can be addressed and improved in future research.
We used a dataset from previous research that contained Twitter messages from 2016.
These messages are five years old, which meant that many tweets identified as false
were no longer online; the user accounts had been deleted or suspended. As a result,
information about the user, such as Twitter followers and the number of sent messages,
could not be retrieved. Therefore, the Machine Learning algorithm was trained only on
the content of the Twitter message. It could be interesting to investigate the detection of
potential fake messages using a combination of both the content of the tweet and the
account data of the user that tweeted the message. As researched by Camisani-Calzolari,
potential fake messages can also be identified using multiple attributes of the account.
An algorithm trained on a combination of the tweet content and the account data could
have a higher validity. Due to the limited information on the user accounts, it was hard
to build a substantial training set for training the machine learning model. A more
extensive training set could improve the validity of the Machine Learning algorithm and
therefore decrease the number of false-positive classifications.
Through the work done in this project, we have shown that machine learning certainly
does have the capacity to pick up on sometimes subtle language patterns that may be
difficult for humans to pick up on. The next steps for this project come in two
different aspects. The first aspect that could be improved in this project is
augmenting and increasing the size of the dataset. We feel that more data would be
beneficial in ridding the model of any bias based on specific patterns in the source.
There is also a question as to whether or not the size of our dataset is sufficient.
second aspect in which this project could be expanded is by comparing it to humans
performing the same task. Comparing the accuracies would be beneficial in deciding
whether or not the dataset is representative of how difficult the task of separating fake
from real news is. If humans are more accurate than the model, it may mean that we
need to choose more deceptive fake news examples. Because we acknowledge that
this is only one tool in a toolbox that would really be required for an end-to-end
system for classifying fake news, we expect that its accuracy will never reach perfect.
However, it may be beneficial as a stand-alone application if its accuracy is already
higher than human accuracy at the same task. In addition to comparing the accuracy
to human accuracy, it would also be interesting to compare the phrases/trigrams that a
human would point out if asked what they based their classification decision on.
Then, we could quantify how similar these patterns are to those that humans find
indicative of fake and real news. Finally, as we have mentioned throughout, this
project is only one that would be necessary in a larger toolbox that could function as
a highly accurate fake news classifier. Other tools that would need to be built may
include a fact detector and a stance detector. In order to combine all of these
“routines,” there would need to be some type of model that combines all of the tools
and learns how to weight each of them in its final decision.
56
13. REFERENCES
57
13. REFERENCES
[1] Allcott, H., & Gentzkow, M. (2017). Social Media and Fake News in the 2016
Election. The Journal of Economic Perspectives: A Journal of the American
Economic Association, 31(2), 211–236. [Link]
[2] Boididou, C., Papadopoulos, S., Zampoglou, M., Apostolidis, L., Papadopoulou,
O., & Kompatsiaris, Y. (2018). Detection and visualisation of misleading content on
Twitter. International Journal of Multimedia Information Retrieval, 7(1), 71–86.
[Link]
[3] Brownlee, J. (2017, September 29). How to Prepare Text Data for Machine
Learning with scikit-learn. Retrieved June 24, 2018, from
https://[Link]/prepare-text-datamachine-learning-scikit-learn
[5] Ceron, A., Curini, L., Iacus, S. M., & Porro, G. (2014). Every tweet counts? How
sentiment analysis of social media can improve our knowledge of citizens’ political
preferences with an application to Italy and France. New Media and Society, 16(2),
340–358. [Link]
[6] Craig Silverman, J. S.-V. (2016, December 7). Most Americans Who See Fake
News Believe It, New Survey Says. Retrieved May 29, 2018, from
[Link] survey
[7] Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., & Tesconi, M. (2015).
Fame for sale: Efficient detection of fake Twitter followers. Decision Support
Systems, 80, 56– 71. [Link]
[9] Dudovskiy, J. (n.d.). Snowball sampling. Retrieved June 25, 2018, from
[Link]
sampling/
[10] Epstein, R., & Robertson, R. E. (2015). The search engine manipulation effect
(SEME) and its possible impact on the outcomes of elections. Proceedings of the
National Academy of Sciences of the United States of America, 112(33), E4512–
E4521. [Link]
58
[11] Exposed: Undercover secrets of Trump’s data firm. (2018, March 20). Retrieved
May 4, 2018, from [Link]
donald-trump-data-firm-cambridge-analytica
[13] Graham-Harrison, E., & Cadwalladr, C. (2018, March 21). Cambridge Analytica
execs boast of role in getting Donald Trump elected. The Guardian. Retrieved from
[Link]
of role-in-getting-trump-elected
[14] Hosch-Dayican, B., Amrit, C., Aarts, K., & Dassen, A. (2014). How Do Online
Citizens Persuade Fellow Voters? Using Twitter During the 2012 Dutch
Parliamentary Election Campaign. Social Science Computer Review, 34(2), 135–152.
[Link]
59