MAJOR PROJECT Documentation
BACHELOR OF TECHNOLOGY
IN
Computer Science & Engineering
Submitted By
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
CH SAIRAM (17E41A0587)
Under the Esteemed guidance of
[Link]
Assistant Professor, Department of CSE
2(f) recognition by UGC
Accredited by NAAC for 5 years
CERTIFICATE
This is to certify that the major project report entitled “FAKE NEWS DETECTION ON
TWITTER” is being submitted by
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
J KARUNAKARA (17E41A05A0)
CH SAIRAM (17E41A0587)
in partial fulfillment of the requirement for the award of the degree of B.Tech. in Computer
Science and Engineering to the Sree Dattha Institute of Engineering & Science, Hyderabad, is a
record of bonafide work carried out by them under my/our guidance.
The results presented in this report have been verified and are found to be satisfactory. The results
embodied in this report have not been submitted to any other University for the award of any other
degree or diploma.
Internal Guide                                        HOD In-Charge
[Link]
External Examiner
ii
SREE DATTHA INSTITUTE OF ENGINEERING & SCIENCE
(Approved by AICTE, Affiliated to JNTU, Hyderabad)
Sheriguda, Ibrahimpatnam, R.R. Dist. Hyderabad.
2(f) recognition by UGC
Accredited by NAAC for 5 years
DECLARATION
We hereby declare that the project report titled “FAKE NEWS DETECTION ON
TWITTER”, submitted in partial fulfillment of the requirement for the award of the degree of
Bachelor of Technology in Computer Science & Engineering at Sree Dattha Institute of
Engineering and Science, affiliated to Jawaharlal Nehru Technological University, Hyderabad, is a
record of bona fide work carried out by us, and the results embodied in this project have not been
reproduced or copied from any source.
The results embodied in this project report have not been submitted to any other University
or Institute for the award of any Degree or Diploma.
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
J KARUNAKARA REDDY(17E41A05A0)
CH SAIRAM (17E41A0587)
iii
ACKNOWLEDGMENT
We are extremely grateful to Shri G. PanduRanga Reddy, Chairman; Dr. Md.
Sameeruddin Khan, Principal; and Dr. Amol Purohit, Head of the Department of
CSE, Sree Dattha Institute of Engineering & Science.
We express our thanks to all staff members and friends for all the help and
coordination extended in bringing out this project successfully and on time.
Finally, we are very much thankful to our parents who guided us for every step.
M SHILPA (17E41A0585)
P SWATHI (17E41A0599)
J KARUNAKARA REDDY (17E41A05A0)
CH SAIRAM (17E41A0587)
Date:
Place:
iv
TABLE OF CONTENTS
S.NO    CONTENTS    PAGE NO
TITLE PAGE ⅰ
CERTIFICATION ⅱ
DECLARATION ⅲ
ACKNOWLEDGEMENT ⅳ
ABSTRACT Ⅶ
1 INTRODUCTION 1
2 LITERATURE SURVEY 2
3 SYSTEM ANALYSIS 14
3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
4 SYSTEM OVERVIEW AND REQUIREMENTS 18
5 SOFTWARE TOOLS AND DESCRIPTION 21
6 OBJECTIVES 26
v
7 SYSTEM DESIGN 28
8 CODING 35
8.1 FRONT END
8.2 BACK END
9 SYSTEM TESTING 42
9.1 TESTING
9.2 PRINCIPLES OF TESTING
9.3 TYPES OF TESTING
10 RESULT 45
11 CONCLUSION 52
12 FUTURE SCOPE 55
13 REFERENCES 57
vi
ABSTRACT
Problem statement:
The project is concerned with identifying a solution that could be used to detect and filter out sites
containing fake news, for the purpose of helping users avoid being lured by clickbait. It is imperative
that such solutions are identified, as they will prove useful to both readers and the tech companies
involved in the issue.
Solution:
Social media provide a platform for quick and seamless access to information. However, the
propagation of false information raises major concerns, especially given that social
media are the primary source of information for a large percentage of the population. False
information may manipulate people’s beliefs and have real-life consequences. Therefore, one
major challenge is to automatically identify false information by categorizing it into different
types and notify users about the credibility of different articles shared online.
Recent political events have led to an increase in the popularity and spread of fake news.
As demonstrated by the widespread effects of the large onset of fake news, humans are
inconsistent if not outright poor detectors of fake news. With this, efforts have been made to
automate the process of fake news detection. The most popular of such attempts include
“blacklists” of sources and authors that are unreliable. While these tools are useful, in order
to create a more complete end-to-end solution, we need to account for the more difficult cases
where reliable sources and authors release fake news.
The growth of social media has revolutionized the way people access information.
Although platforms like Facebook and Twitter allow for quicker, wider and less restricted
access to information, they also constitute a breeding ground for the dissemination of fake
news. Most of the existing literature on fake news detection on social media proposes user-
based or content-based approaches. However, recent research revealed that real and fake
news also propagate significantly differently on Twitter.
As such, the goal of this project was to create a tool for detecting the language patterns that
characterize fake and real news through the use of machine learning and natural language
processing techniques. The results of this project demonstrate that machine learning can
be useful for this task. We have built a model that catches many intuitive indications of real
and fake news.
vii
1. INTRODUCTION
viii
1. INTRODUCTION
1.1 Purpose of Project
Fake news creates chaos in our societies, as people spread fake news in order to conspire against
others. People are losing their trust in the media, as news channels sometimes cover such
stories to increase their ratings. Fake news misleads people, and it is getting harder day by day to
separate fact from fiction. It impacts the decisions of the youth, letting them believe
things that are not true, and political parties take advantage of fake news to manipulate
voters.
A large body of recent works has focused on understanding and detecting fake news stories
that are disseminated on social media. To accomplish this goal, these works explore several
types of features extracted from news stories, including source and posts from social media.
In addition to exploring the main features proposed in the literature for fake news detection,
we present a new set of features and measure the prediction performance of current
approaches and features for automatic detection of fake news. Our results reveal interesting
findings on the usefulness and importance of features for detecting false news. Finally, we
discuss how fake news detection approaches can be used in practice. A number of Machine
Learning (ML) algorithms, such as Decision Tree, Random Forest, and Logistic Regression, were
applied for the classification and prediction of the fake news dataset, and many
promising results have been presented in the literature.
1
[Link] SURVEY
2
2. LITERATURE SURVEY
The name machine learning was coined in 1959 by Arthur Samuel. Machine learning
explores the study and construction of algorithms that can learn from and make predictions
on data. Machine learning is closely related to (and often overlaps with) computational
statistics, which also focuses on prediction-making through the use of computers. It has
strong ties to mathematical optimization, which delivers methods, theory and application
domains to the field. Machine learning is sometimes conflated with data mining, where the
latter subfield focuses more on exploratory data analysis and is known as unsupervised
learning.
Within the field of data analytics, machine learning is a method used to devise complex
models and algorithms that lend themselves to prediction; in commercial use, this is known
as predictive analytics. These analytical models allow researchers, data scientists, engineers,
and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden
insights" through learning from historical relationships and trends in the data
Machine learning tasks
Machine learning tasks are typically classified into several broad categories:
Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs. As special cases, the input signal can be only partially available, or restricted to
special feedback.
Active learning: The computer can only obtain training labels for a limited set of
instances (based on a budget), and also has to optimize its choice of objects to acquire labels
for. When used interactively, these can be presented to the user for labelling.
3
Reinforcement learning: Data (in form of rewards and punishments) are given only as
feedback to the program's actions in a dynamic environment, such as driving a vehicle or
playing a game against an opponent.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its
own to find structure in its input. Unsupervised learning can be a goal in itself (discovering
hidden patterns in data) or a means towards an end (feature learning).
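As a brief, hedged illustration (toy data, not the project's dataset), the contrast between supervised and unsupervised learning comes down to whether labels are supplied:

from sklearn.cluster import KMeans                   # unsupervised: inputs only
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression  # supervised: inputs plus labels

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
supervised = LogisticRegression().fit(X, y)                            # learns a rule mapping inputs to labels
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # finds structure without labels
print(supervised.predict(X[:3]), unsupervised.labels_[:3])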
Fig 2.1 - Machine Learning Categories
• Machine learning models involve machines learning from data without human help
or any kind of human intervention.
4
• Machine Learning is the science of making computers learn and act like humans
by feeding them data and information without being explicitly programmed.
• Machine Learning is a combination of Algorithms, Datasets, and Programs.
• Machine Learning is quite different from traditional programming: here, data and output
are given to the computer, and in return it gives us the program which provides solutions
to various problems, as shown in the figure below.
Decision trees are constructed via an algorithmic approach that identifies ways to split a
data set based on different conditions. It is one of the most widely used and practical methods
for supervised learning. Decision Trees are a non-parametric supervised learning method
used for both classification and regression tasks.
Tree models where the target variable can take a discrete set of values are called
classification trees. Decision trees where the target variable can take continuous values
(typically real numbers) are called regression trees. Classification And Regression Tree
(CART) is the general term for this.
Data comes in records of the form (x, Y) = (x1, x2, x3, ..., xk, Y). The dependent variable,
Y, is the target variable that we are trying to understand, classify or generalize. The vector x is
composed of the features x1, x2, x3, etc., that are used for that task. While building the decision
tree, at each node of the tree we ask different types of questions. Based on the question asked,
we calculate the corresponding information gain.
5
Information Gain
Information gain is used to decide which feature to split on at each step in building the
tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should
choose the split that results in the purest daughter nodes. A commonly used measure of purity
is called information. For each node of the tree, the information value measures how much
information a feature gives us about the class. The split with the highest information gain will
be taken as the first split, and the process will continue until all child nodes are pure, or
until the information gain is 0.
Algorithms for constructing decision trees usually work top-down, choosing at each step the
variable that best splits the set of items. Different algorithms use different metrics for
measuring the best split.
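As a minimal sketch (illustrative only, not taken from the project code), entropy-based information gain for one candidate split can be computed as follows:

import numpy as np

def entropy(labels):
    # Shannon entropy of a collection of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy of the parent node minus the weighted entropy of the daughter nodes
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A split that produces two pure daughters yields the maximum gain of 1.0
print(information_gain(['fake', 'fake', 'real', 'real'], ['fake', 'fake'], ['real', 'real']))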
Gini Impurity
Pure:
Pure means that, in a selected sample of the dataset, all data belongs to the same class.
Impure:
Impure means that the data is a mixture of different classes.
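A similarly small, hedged sketch shows how Gini impurity separates pure from impure samples:

import numpy as np

def gini_impurity(labels):
    # 0 for a pure node; grows as the classes mix
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(['fake', 'fake', 'fake']))          # 0.0 -> pure
print(gini_impurity(['fake', 'real', 'fake', 'real']))  # 0.5 -> maximally impure for two classes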
6
Disadvantages of Decision Trees
Prone to overfitting.
Require some kind of measurement of how well they are doing.
Need careful parameter tuning.
Can create biased trees if some classes dominate.
The decision tree is a decision support tool. It uses a tree-like graph to show the possible
consequences. If you input a training dataset with targets and features into the decision tree, it
will formulate some set of rules. These rules can be used to perform predictions. Through the
decision tree algorithm, you can generate the rules. You can then input the features of this
movie and see whether it will be liked by your daughter. The process of calculating these
nodes and forming the rules uses information gain and Gini index calculations.
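As a hedged illustration of this workflow (the movie-preference table below is invented for the example, not taken from the project), scikit-learn's DecisionTreeClassifier derives such rules from labelled data and applies them to a new case:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: a numeric genre code and the duration in minutes
X = pd.DataFrame({"genre_code": [0, 0, 1, 1, 2], "duration": [90, 120, 95, 150, 100]})
y = ["liked", "liked", "liked", "disliked", "disliked"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))                    # the learned if/else rules
print(tree.predict(pd.DataFrame({"genre_code": [1], "duration": [100]})))  # prediction for a new movie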
The difference between the Random Forest algorithm and the decision tree algorithm is that in
Random Forest, the processes of finding the root node and splitting the feature nodes
run randomly.
Overfitting is one critical problem that may make the results worse, but for Random Forest
algorithm, if there are enough trees in the forest, the classifier won’t overfit the model. The
third advantage is the classifier of Random Forest can handle missing values, and the last
advantage is that the Random Forest classifier can be modeled for categorical values.
There are two stages in Random Forest algorithm, one is random forest creation, the other
is to make a prediction from the random forest classifier created in the first stage.
1. Randomly select “K” features from total “m” features where k << m
2. Among the “K” features, calculate the node “d” using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps 1 to 3 until “l” number of nodes has been reached
5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
In the next stage, with the random forest classifier created, we will make the prediction.
The random forest prediction pseudocode is shown below:
Take the test features and use the rules of each randomly created decision tree to predict
the outcome, and store the predicted outcome (target). Calculate the votes for each predicted
target.
Consider the highest-voted predicted target as the final prediction from the random forest
algorithm.
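A hedged sketch of these two stages with scikit-learn (synthetic data and illustrative parameter values, not the project's):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,     # stage 1: build "n" randomized trees
                                max_features="sqrt",  # sample "K" of the "m" features per split
                                random_state=0).fit(X_train, y_train)
print(forest.predict(X_test[:5]))    # stage 2: each prediction is the majority vote of the 100 trees
print(forest.score(X_test, y_test))  # overall accuracy on the held-out set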
Advantages:
• Compared with other classification techniques, there are three advantages as the author
mentioned.
• For applications in classification problems, the Random Forest algorithm avoids the
overfitting problem.
• For both classification and regression tasks, the same Random Forest algorithm can be used.
• The Random Forest algorithm can be used to identify the most important features of
the training dataset; in other words, feature engineering.
8
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify observations using different
types of data and can easily determine the most effective variables for
the classification. The image below shows the logistic (sigmoid) function, which
maps any real value to another value within the range of 0 and 1.
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
To implement the Logistic Regression using Python, we will use the same steps as we have
done in previous topics of Regression. Below are the steps:
9
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result (a brief sketch of these steps follows below).
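The listed steps map to a few lines of scikit-learn; the following is a hedged sketch on synthetic data rather than the project's dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

clf = LogisticRegression().fit(X_train, y_train)  # 1. fit Logistic Regression to the training set
pred = clf.predict(X_test)                        # 2. predict the test result
print(accuracy_score(y_test, pred))               # 3. test accuracy of the result
print(confusion_matrix(y_test, pred))             #    creation of the confusion matrix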
Advantages
Disadvantages
Non-linear problems can’t be solved with logistic regression because it has a linear
decision surface. Linearly separable data is rarely found in real-world scenarios.
10
exception) comes from the basic assumptions they work with: in machine learning,
performance is usually evaluated with respect to the ability to reproduce known
knowledge, while in knowledge discovery and data mining (KDD) the key task is the
discovery of previously unknown knowledge.
2.6 JupyterLab
Jupyter Lab is a web-based interactive development environment for Jupyter
notebooks, code, and data. JupyterLab is flexible: configure and arrange the user
interface to support a wide range of workflows in data science, scientific computing,
and machine learning. JupyterLab is extensible and modular: you can write plugins that add
new components and integrate with existing ones. JupyterLab is a next-generation
web-based user interface for Project Jupyter. JupyterLab enables you to work with documents
and activities such as Jupyter notebooks, text editors, terminals, and custom
components in a flexible, integrated, and extensible manner.
You can arrange multiple documents and activities side by side in the work area using
tabs and splitters. Documents and activities integrate with each other, enabling new
workflows for interactive computing, for example: Code Consoles provide transient
scratchpads for running code interactively, with full support for rich output. A code
console can be linked to a notebook kernel as a computation log from the notebook,
for example. Kernel-backed documents enable code in any text file (Markdown,
Python, R, LaTeX, etc.) to be run interactively in any Jupyter kernel. Notebook cell
outputs can be mirrored into their own tab, side by side with the notebook, enabling
simple dashboards with interactive controls backed by a kernel. Multiple views of
documents with different editors or viewers enable live editing of documents reflected
in other viewers. For example, it is easy to have live preview of Markdown,
Delimiter-separated Values, or Vega/Vega-Lite documents. JupyterLab also offers a
unified model for viewing and handling data formats. JupyterLab understands many
file formats (images, CSV, JSON, Markdown, PDF, Vega, Vega-Lite, etc.) and can
also display rich kernel output in these formats. See File and Output Formats for more
information.
Packages
Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for
manipulating numerical tables and time series. It is free software released under the
three-clause BSD license. The name is derived from the term "panel data", an
econometrics term for data sets that include observations over multiple time periods
for the same individuals. Its name is a play on the phrase "Python data analysis" itself.
Library features
Tools for reading and writing data between in-memory data structures and different
file formats.
Data alignment and integrated handling of missing data.
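A brief, hedged illustration of these two features (the file name and values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=4),
                   "count": [120.0, np.nan, 95.0, 101.0]})
df.to_csv("example.csv", index=False)                   # writing an in-memory structure to a file format
df2 = pd.read_csv("example.csv", parse_dates=["date"])  # reading it back
print(df2.isna().sum())                                 # integrated detection of missing data
print(df2["count"].fillna(df2["count"].mean()))         # e.g. impute the missing value with the mean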
Seaborn
MatPlotLib
Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its
use is discouraged. SciPy makes use of Matplotlib.
Matplotlib was originally written by John D. Hunter, since then it has an active
development community,[4] and is distributed under a BSD-style license. Michael
Droettboom was nominated as matplotlib's lead developer shortly before John
Hunter's death in August 2012, and further joined by Thomas Caswell.
Matplotlib 2.0.x supports Python versions 2.7 through 3.6. Python 3 support started
with Matplotlib 1.2. Matplotlib 1.4 is the last version to support Python 2.6.
Matplotlib has pledged not to support Python 2 past 2020 by signing the Python 3
Statement.
Several toolkits are available which extend Matplotlib functionality. Some are
separate downloads, others ship with the Matplotlib source code but have external
dependencies.
Basemap: map plotting with various map projections, coastlines, and political
boundaries.
Natgrid: interface to the natgrid library for gridding irregularly spaced data.
Seaborn: provides an API on top of Matplotlib that offers sane choices for plot style
and color defaults, defines simple high-level functions for common statistical plot
types, and integrates with the functionality provided by Pandas.
Numpy
13
NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric,
was originally created by Jim Hugunin with contributions from several other
developers. In 2005, Travis Oliphant created NumPy by incorporating features of the
competing Numarray into Numeric, with extensive modifications.
NumPy is open-source software and has many contributors. NumPy targets the
CPython reference implementation of Python, which is a non-optimizing bytecode
interpreter. Mathematical algorithms written for this version of Python often run much
slower than compiled equivalents. NumPy addresses the slowness problem partly by
providing multidimensional arrays and functions and operators that operate efficiently
on arrays, requiring rewriting some code, mostly inner loops, using NumPy. Using
NumPy in Python gives functionality comparable to MATLAB since they are both
interpreted, and they both allow the user to write fast programs as long as most
operations work on arrays or matrices instead of scalars. In comparison, MATLAB
boasts a large number of additional toolboxes, notably Simulink, whereas NumPy is
intrinsically integrated with Python, a more modern and complete programming
language.
Moreover, complementary Python packages are available; SciPy is a library that adds
more MATLAB- like functionality and Matplotlib is a plotting package that provides
MATLAB-like plotting functionality. Internally, both MATLAB and NumPy rely on
BLAS and LAPACK for efficient linear algebra computations.
Python bindings of the widely used computer vision library OpenCV utilize NumPy
arrays to store and operate on data. Since images with multiple channels are simply
represented as three-dimensional arrays, indexing, slicing or masking with other
arrays are very efficient ways to access specific pixels of an image. The NumPy array
as the universal data structure in OpenCV for images, extracted feature points, filter
kernels and many more vastly simplifies the programming workflow and debugging.
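A short, hedged sketch of the array-oriented style described above (shapes and values are illustrative):

import numpy as np

# Vectorized arithmetic replaces explicit Python loops
a = np.arange(1_000_000, dtype=np.float64)
b = np.sqrt(a) + 2 * a                 # evaluated in compiled loops, not Python bytecode

# Images as arrays: slicing and boolean masking select pixels efficiently
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)  # height x width x channels
top_left = img[:32, :32, :]            # slicing a region of the image
img[img > 200] = 255                   # masking: saturate the brightest values
print(b[:3], top_left.shape)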
Kaggle
Our dataset is extracted from Kaggle. Kaggle, a subsidiary of
Google LLC, is an online community of data scientists and machine learning
practitioners. Kaggle allows users to find and publish data sets, explore and build
models in a web-based data-science environment, work with other data scientists and
machine learning engineers, and enter competitions to solve data science challenges.
Kaggle's services:
Machine learning competitions: this was Kaggle's first product. Companies post
problems and machine learners compete to build the best algorithm, typically with
cash prizes.
Kaggle Kernels: a cloud-based workbench for data science and machine learning.
Allows data scientists to share code and analysis in Python, R and R Markdown. Over
150K "kernels" (code snippets) have been shared on Kaggle covering everything from
sentiment analysis to object detection.
Public datasets platform: community members share datasets with each other. Has
datasets on everything from bone x-rays to results from boxing bouts.
Kaggle has run hundreds of machine learning competitions since the company was
founded. Competitions have ranged from improving gesture recognition for
Microsoft Kinect to making a football AI for Manchester City to improving the
search for the Higgs boson at CERN.
DATASET:
[Link]
The dataset consists of 28,000 real news articles and 31,000 fake news articles. There are 4 columns in the dataset.
ATTRIBUTES:
15
3. SYSTEM ANALYSIS
16
3. SYSTEM ANALYSIS
17
4. SYSTEM REQUIREMENTS AND OVERVIEW
18
4. SYSTEM REQUIREMENT SPECIFICATION
4.1 What is SRS?
Software Requirement Specification (SRS) is the starting point of the software development
activity. As systems grew more complex, it became evident that the goals of the entire system
could not be easily comprehended; hence the need for the requirements phase arose. The
software project is initiated by the client's needs. The SRS is the means of translating the ideas
in the minds of the clients (the input) into a formal document (the output of the requirements
phase). The SRS phase consists of two basic activities. Problem/Requirement Analysis: this
process, the harder and more nebulous of the two, deals with understanding the problem, the goals
and the constraints. Requirement Specification: here, the focus is on specifying what has been
found during analysis; issues such as representation, specification languages and tools, and checking
of the specifications are addressed during this activity. The requirements phase terminates with
the production of the validated SRS document. Producing the SRS document is the basic goal
of this phase.
The purpose of the Software Requirement Specification is to reduce the communication gap
between the clients and the developers. The SRS is the medium
through which the client and user needs are accurately specified. It forms the basis of software
development. A good SRS should satisfy all the parties involved in the system.
19
4.4 Software Requirements
Platform :Jupyter-Lab
RAM : 1 GB or above.
20
5. SOFTWARE TOOLS AND DESCRIPTION
21
5. SOFTWARE TOOLS AND DESCRIPTION
5.1 PYTHON
Python is one of the most popular programming languages in both the coding and
Data Science communities. Guido van Rossum created it in 1991, and ever since its
inception it has been one of the most widely used languages, along with C++, Java, etc.
Python is an open-source, high-level, general-purpose programming language that
incorporates the features of object-oriented, structural, and functional programming.
Python’s simple syntax allows for writing readable code, which can be further
applied to complex software development processes to facilitate test-driven software
application development, machine learning, and data analytics. Python can run on all
the major operating systems, including Windows, Linux, and macOS.
• Python has prebuilt libraries like NumPy for scientific computation, SciPy for
advanced computing and PyBrain for machine learning (Python Machine Learning),
making it one of the best languages for AI.
• Python developers around the world provide comprehensive support and assistance
via forums and tutorials, making the job of the coder easier than in any other popular
language.
• Python is platform-independent and is hence one of the most flexible and popular
choices for use across different platforms and technologies, with the least tweaks in
basic coding.
• Python is one of the most flexible languages, with the option to choose between an
OOP approach and scripting. You can also use the IDE itself to check most code, which is a
boon for developers struggling with different algorithms.
Python also supports data analysis and visualization, thereby further simplifying
the process of creating custom solutions without extra effort and time investment.
5.2 LIBRARIES
NUMPY:
1. Installing Numpy:
pip install numpy
Or
conda install numpy
2. Importing Numpy:
import numpy as np
• Pandas:
import pandas as pd
By using the above command, you can easily import pandas library. Using
Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of data — load, prepare,
manipulate, model, and analyze. Python with Pandas is used in a wide
range of fields including academic and commercial domains including
finance, economics, Statistics, analytics, etc.
• SCIKIT LEARN
pip install -U scikit-learn
If you like conda, you can also use conda for package installation:
conda install scikit-learn
Once you are done with the installation, you can use scikit-learn easily
in your Python code by importing it as:
import sklearn
24
• MATPLOTLIB :
1. Installing Matplotlib:
# Windows, Linux, and macOS users can install this library using the
following command:
python -m pip install -U matplotlib
2. Importing Matplotlib:
import matplotlib.pyplot as plt
or
import matplotlib.pyplot as pl
25
6. OBJECTIVES
26
6. OBJECTIVES
Many Believe Fake News Articles
Studies have shown that many Americans cannot tell what news is fake and what news is
real. This can create confusion and misunderstanding about important social and political
issues.
Fake News Can Affect Your Grades
ACC Professors require that you use quality sources of information for your research
assignments and papers. If you use sources that have false or misleading information, you
may get a lower grade.
Fake News Can Be Harmful to Your Health
There are many fake and misleading news stories related to medical treatments and major
diseases like cancer or diabetes. Trusting these false stories could lead you to make decisions
that may be harmful to your health.
Fake News Makes It Harder For People To See the Truth
A Research Center study found that those on the right and the left of the political spectrum
have different ideas about the definition of 'fake news', "The study suggests that fake-news
panic, rather than driving people to abandon ideological outlets and the fringe, may actually
be accelerating the process of polarization: It’s driving consumers to drop some outlets, to
simply consume less information overall, and even to cut out social relationships."
This is why it is important for people to seek out news with as little bias as humanly possible.
News services strive to provide accurate, neutral coverage of major events.
27
7. SYSTEM DESIGN
28
7. SYSTEM DESIGN
These internal and external agents are known as actors. So use case diagrams
consist of actors, use cases and their relationships. The diagram is used to
model the system/subsystem of an application. A single use case diagram
captures a particular functionality of a system. So, to model an entire system,
a number of use case diagrams are used.
Use case diagrams are used to gather the requirements of a system including
internal and external influences. These requirements are mostly design
requirements. So when a system is analysed to gather its functionalities use cases
are prepared and actors are identified. In brief, the purposes of use case diagrams
can be as follows:
a. Used to gather requirements of a system.
b. Used to get an outside view of a system.
c. Identify external and internal factors influencing the system.
d. Show the interactions among the requirements and actors.
30
Sequence diagrams describe interactions among classes in terms of an exchange of messages
over time. They're also called event diagrams. A sequence diagram is a good way to visualize
and validate various runtime scenarios. These can help to predict how a system will behave
and to discover responsibilities a class may need to have in the process of modelling a new
system.
The aim of a sequence diagram is to define event sequences, which would have a desired
outcome. The focus is more on the order in which messages occur than on the message per se.
However, the majority of sequence diagrams will communicate what messages are sent and
the order in which they tend to occur.
Activation boxes represent the time an object needs to complete a task. When an object is
busy executing a process or waiting for a reply message, use a thin grey rectangle placed
vertically on its lifeline.
Messages
Messages are arrows that represent communication between objects. Use half-arrowed lines
to represent asynchronous messages. Asynchronous messages are sent from an object that
will not wait for a response from the receiver before continuing its tasks.
Lifelines
Lifelines are vertical dashed lines that indicate the object's presence over time.
Destroying Objects
Objects can be terminated early using an arrow labelled "<< destroy >>" that points to an X.
This object is removed from memory. When that object's lifeline ends, you can place an X at
the end of its lifeline to denote a destruction occurrence.
Loops
A repetition or loop within a sequence diagram is depicted as a rectangle. Place the condition
for exiting the loop at the bottom left corner in square brackets [].
Guards
When modelling object interactions, there will be times when a condition must be met for a
message to be sent to an object. Guards are conditions that need to be used throughout UML
diagrams to control flow.
31
Fig 7.2.2 – Sequence Diagram
The class diagram is the only UML diagram that can appropriately depict various aspects of the OOP concept.
With class diagrams, design and analysis of an application can be faster and more efficient.
The class diagram is the base for deployment and component diagrams.
Each class is represented by a rectangle subdivided into three compartments:
name, attributes and operations.
32
Fig 7.2.3-Class Diagram
The use of object diagrams is fairly limited, mainly to show examples of data structures.
During the analysis phase of a project, you might create a class diagram to describe the
structure of a system and then create a set of object diagrams as test cases to verify the
accuracy and completeness of the class diagram. Before you create a class diagram, you
might create an object diagram to discover facts about specific model elements and their
links, or to illustrate specific examples of the classifiers that are required.
An object diagram shows this relation between the instantiated classes and the defined class,
and the relation between these objects in the system. They can be useful for explaining smaller
portions of your system when the system class diagram is very complex, and also
for modeling recursive relationships.
Object Names:
33
Every object is symbolized as a rectangle, which shows the name of the object and
its class, underlined and separated by a colon.
Object Attributes:
Similar to classes, you can list object attributes in a separate compartment.
However, unlike classes, object attributes must have values assigned to them.
Links:
Links are instances of associations. You can draw a link using the
lines used in class diagrams.
34
8. CODING
35
8. CODING
8.1 Pseudo Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import feature_extraction, linear_model, model_selection, preprocessing, metrics
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Read the datasets (file names assumed; they were garbled in the original listing)
fake = pd.read_csv("data/Fake.csv")
true = pd.read_csv("data/True.csv")
fake.shape  # quick look at the dataframes (the original calls here were garbled)
true.shape
fake['target'] = 'fake'
true['target'] = 'true'
# Concatenate dataframes
data = pd.concat([fake, true]).reset_index(drop=True)
data.shape
# Convert to lowercase
data['text'] = data['text'].apply(lambda x: x.lower())
# Remove punctuation
import string

def punctuation_removal(text):
    all_list = [char for char in text if char not in string.punctuation]
    clean_str = ''.join(all_list)
    return clean_str
data['text'] = data['text'].apply(punctuation_removal)
# Check
data.head()
# Removing stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
# Drop the stop words from each article (this step was missing from the listing)
data['text'] = data['text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))
# Word clouds for the fake and the true news (the WordCloud construction was missing from the listing)
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, max_font_size=110, collocations=False).generate(' '.join(data[data["target"] == "fake"].text))
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordcloud = WordCloud(width=800, height=500, max_font_size=110, collocations=False).generate(' '.join(data[data["target"] == "true"].text))
plt.figure(figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
from nltk import tokenize
token_space = tokenize.WhitespaceTokenizer()
def counter(text, column_text, quantity):
    # Plot the `quantity` most frequent words of a dataframe column
    all_words = ' '.join([text for text in text[column_text]])
    token_phrase = token_space.tokenize(all_words)
    frequency = nltk.FreqDist(token_phrase)
    df_frequency = pd.DataFrame({"Word": list(frequency.keys()),
                                 "Frequency": list(frequency.values())})
    df_frequency = df_frequency.nlargest(columns="Frequency", n=quantity)
    plt.figure(figsize=(12,8))
    ax = sns.barplot(data=df_frequency, x="Word", y="Frequency", color='blue')
    ax.set(ylabel="Count")
    plt.xticks(rotation='vertical')
    plt.show()
# Modeling
# Helper to visualise results. The function header and the first plotting lines were missing
# from the listing; they are reconstructed here from the standard scikit-learn example.
import itertools

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Logistic regression
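# The train/test split and the fitted model are not shown in the original listing. A minimal
# sketch (assumed 80/20 split and default hyperparameters) so that the accuracy cell below
# has a fitted `model` to call:
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'],
                                                    test_size=0.2, random_state=42)
model = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', linear_model.LogisticRegression())])
model.fit(X_train, y_train)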
# Accuracy
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
cm = metrics.confusion_matrix(y_test, prediction)
plot_confusion_matrix(cm, classes=['Fake', 'Real'])
# Accuracy
# The same evaluation block is repeated in the report for the other classifiers that were
# trained (e.g. the Decision Tree and Random Forest models), each fitted in place of `model`:
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
cm = metrics.confusion_matrix(y_test, prediction)
plot_confusion_matrix(cm, classes=['Fake', 'Real'])
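Since the conclusion chapter reports the Decision Tree as the best-performing model, the sketch below shows how that classifier could be slotted into the same pipeline; the hyperparameter values are assumed for illustration and are not taken from the original work.

from sklearn.tree import DecisionTreeClassifier
dt_model = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', DecisionTreeClassifier(criterion='entropy', max_depth=20, random_state=42))])
dt_model.fit(X_train, y_train)
prediction = dt_model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100, 2)))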
41
9. SYSTEM TESTING
42
9. SYSTEM TESTING
9.1 Introduction to testing
Testing is the process of evaluating a system or its component(s) with the intent to
find whether it satisfies the specified requirements or not. Testing is executing a
system in order to identify any gaps, errors, or missing requirements contrary to the
actual requirements.
It depends on the process and the associated stakeholders of the project(s). In the IT
industry, large companies have a team with responsibilities to evaluate the developed
software in context of the given requirements. Moreover, developers also conduct
testing, which is called Unit Testing. In most cases, the following professionals are
involved in testing a system within their respective capacities:
• Software Tester
• Software Developer
• Project Lead/Manager
• End User
Levels of testing include different methodologies that can be used while conducting
software testing. The main levels of software testing are:
• Functional Testing
• Non-functional Testing
Functional Testing
This is a type of black-box testing that is based on the specifications of the software
that is to be tested. The application is tested by providing input, and the results are then
examined to verify that they conform to the functionality the software was intended for. Functional
testing of a software product is conducted on a complete, integrated system to evaluate the
system's compliance with its specified requirements.
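For instance, a minimal unit-level functional test of this project's text-cleaning step could look like the hedged sketch below (pytest is assumed as the test runner; punctuation_removal is the helper from the coding chapter):

# test_preprocessing.py -- run with `pytest`
import string

def punctuation_removal(text):
    return ''.join(ch for ch in text if ch not in string.punctuation)

def test_punctuation_is_stripped():
    assert punctuation_removal("Breaking!!! News???") == "Breaking News"

def test_plain_text_is_unchanged():
    assert punctuation_removal("no punctuation here") == "no punctuation here"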
The process of testing software in a well-planned and systematic way is known as the
software testing life cycle (STLC).
Different organizations have different phases in the STLC; however, a generic Software Test
Life Cycle (STLC) for the waterfall development model consists of the following phases.
1. Requirements Analysis
2. Test Planning
43
3. Test Analysis
4. Test Design
• Requirements Analysis
In this phase, testers analyze the customer requirements and work with developers
during the design phase to see which requirements are testable and how they are going
to test those requirements.
It is very important to start testing activities from the requirements phase itself,
because the cost of fixing a defect is much lower if it is found in the requirements phase
rather than in later phases.
• Test Planning
In this phase, all the planning about testing is done: what needs to be tested, how
the testing will be done, the test strategy to be followed, what the test environment
will be, what test methodologies will be followed, hardware and software
availability, resources, risks, etc. A high-level test plan document is created which
includes all the planning inputs mentioned above and is circulated to the stakeholders.
• Test Analysis
After the test planning phase is over, the test analysis phase starts. In this phase we need to
dig deeper into the project and figure out what testing needs to be carried out in each
SDLC phase. Automation activities are also decided in this phase: whether automation needs
to be done for the software product, how the automation will be done, how much time it will
take to automate, and which features need to be automated. Non-functional
testing areas (stress and performance testing) are also analyzed and defined
in this phase.
• Test Design
In this phase, various black-box and white-box test design techniques are used to
design the test cases. Testers start writing test cases by following those
design techniques; if automation testing needs to be done, then automation scripts also
need to be written in this phase.
44
10. RESULT
45
10. RESULT
52
11. CONCLUSION
53
11. CONCLUSION
In this research, we studied how fake messages were used on Twitter during the
year 2016. A classifier was developed to detect potential fake messages. A dataset of
Twitter messages from 2016 was classified, and a qualitative content analysis was
performed on these classified tweets. In this chapter, the results of our research are
discussed. We trained and compared three different supervised Machine Learning
algorithms. These algorithms were trained and tested on a data sample of tweets.
The sample contained an equal number of ‘true’ and ‘false’ tweets and was used to
train multiple supervised Machine Learning algorithms. The performance of these
algorithms was compared using accuracy. The best-performing algorithm, the Decision
Tree algorithm, was used to label 613,033 tweets from 2016. The tweets labelled as false
were analysed in detail. Distinctive features of the different false tweets were found, and
the tweets were categorized into six different categories. As a result, we can conclude
that the Decision Tree algorithm is the best algorithm for the classification of true and
false messages. The algorithm performed best, with a weighted accuracy of 99%. Since
our research needed a classifier that can handle a considerable amount of data, this
algorithm is the most suitable for the research. The Decision Tree algorithm was used to
classify the different tweets in the database as either true messages or false messages. In
total, 613,033 tweets were classified, of which 328,897 were classified as true and
284,136 were classified as false.
54
12. FUTURE SCOPE
55
12. FUTURE SCOPE
Our research has limitations that can be addressed and improved in future research.
We used a dataset from previous research that contained Twitter messages from 2016.
These messages are five years old, which meant that many tweets identified as false
were no longer online; the user accounts had been deleted or suspended. As a result,
information about the user, such as Twitter followers and the number of sent messages,
could not be retrieved. Therefore, the Machine Learning algorithm was trained only on
the content of the Twitter message. It could be interesting to investigate the detection of
potential fake messages using a combination of both the content of the tweet and the
account data of the user that tweeted the message. As researched by Camisani-Calzolari,
potential fake messages can also be identified using multiple attributes of the account.
An algorithm trained on a combination of the tweet content and the account data could
have a higher validity. Due to the limited information on the user accounts, it was hard
to build a substantial training set for training the machine learning model. A more
extensive training set could improve the validity of the Machine Learning algorithm and
therefore decrease the number of false-positive classifications.
Through the work done in this project, we have shown that machine learning certainly
does have the capacity to pick up on sometimes subtle language patterns that may be
difficult for humans to pick up on. The next steps for this project come in two
different aspects. The first aspect that could be improved in this project is
augmenting and increasing the size of the dataset. We feel that more data would be
beneficial in ridding the model of any bias based on specific patterns in the source.
There is also a question as to whether or not the size of our dataset is sufficient.
second aspect in which this project could be expanded is by comparing it to humans
performing the same task. Comparing the accuracies would be beneficial in deciding
whether or not the dataset is representative of how difficult the task of separating fake
from real news is. If humans are more accurate than the model, it may mean that we
need to choose more deceptive fake news examples. Because we acknowledge that
this is only one tool in a toolbox that would really be required for an end-to-end
system for classifying fake news, we expect that its accuracy will never reach perfect.
However, it may be beneficial as a stand-alone application if its accuracy is already
higher than human accuracy at the same task. In addition to comparing the accuracy
to human accuracy, it would also be interesting to compare the phrases/trigrams that a
human would point out if asked what they based their classification decision on.
Then, we could quantify how similar these patterns are to those that humans find
indicative of fake and real news. Finally, as we have mentioned throughout, this
project is only one that would be necessary in a larger toolbox that could function as
a highly accurate fake news classifier. Other tools that would need to be built may
include a fact detector and a stance detector. In order to combine all of these
“routines,” there would need to be some type of model that combines all of the tools
and learns how to weight each of them in its final decision.
56
13. REFERENCES
57
13. REFERENCES
[1] Allcott, H., & Gentzkow, M. (2017). Social Media and Fake News in the 2016
Election. The Journal of Economic Perspectives: A Journal of the American
Economic Association, 31(2), 211–236. [Link]
[2] Boididou, C., Papadopoulos, S., Zampoglou, M., Apostolidis, L., Papadopoulou,
O., & Kompatsiaris, Y. (2018). Detection and visualisation of misleading content on
Twitter. International Journal of Multimedia Information Retrieval, 7(1), 71–86.
[Link]
[3] Brownlee, J. (2017, September 29). How to Prepare Text Data for Machine
Learning with scikit-learn. Retrieved June 24, 2018, from
https://[Link]/prepare-text-datamachine-learning-scikit-learn
[5] Ceron, A., Curini, L., Iacus, S. M., & Porro, G. (2014). Every tweet counts? How
sentiment analysis of social media can improve our knowledge of citizens’ political
preferences with an application to Italy and France. New Media and Society, 16(2),
340–358. [Link]
[6] Craig Silverman, J. S.-V. (2016, December 7). Most Americans Who See Fake
News Believe It, New Survey Says. Retrieved May 29, 2018, from
[Link] survey
[7] Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., & Tesconi, M. (2015).
Fame for sale: Efficient detection of fake Twitter followers. Decision Support
Systems, 80, 56– 71. [Link]
[9] Dudovskiy, J. (n.d.). Snowball sampling. Retrieved June 25, 2018, from
[Link]
sampling/
[10] Epstein, R., & Robertson, R. E. (2015). The search engine manipulation effect
(SEME) and its possible impact on the outcomes of elections. Proceedings of the
National Academy of Sciences of the United States of America, 112(33), E4512–
E4521. [Link]
58
[11] Exposed: Undercover secrets of Trump’s data firm. (2018, March 20). Retrieved
May 4, 2018, from [Link]
donald-trump-data-firm-cambridge-analytica
[13] Graham-Harrison, E., & Cadwalladr, C. (2018, March 21). Cambridge Analytica
execs boast of role in getting Donald Trump elected. The Guardian. Retrieved from
[Link]
of role-in-getting-trump-elected
[14] Hosch-Dayican, B., Amrit, C., Aarts, K., & Dassen, A. (2014). How Do Online
Citizens Persuade Fellow Voters? Using Twitter During the 2012 Dutch
Parliamentary Election Campaign. Social Science Computer Review, 34(2), 135–152.
[Link]
59