Fake Job Detection

This notebook builds a fake job posting detector in Python. It installs wordcloud and spaCy, imports NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, wordcloud, and spaCy, and loads the fake_job_postings.csv dataset. It then explores the data, merges the text fields into a single column, vectorizes that column with TF-IDF, and trains a random forest classifier to flag fraudulent postings.

Uploaded by

Shruti Saxena

In [1]:

!pip install wordcloud

Requirement already satisfied: wordcloud in c:\users\hp\anaconda3\lib\site-packages (1.8.1)


Requirement already satisfied: pillow in c:\users\hp\anaconda3\lib\site-packages (from wordcloud) (8.2.0)
Requirement already satisfied: matplotlib in c:\users\hp\anaconda3\lib\site-packages (from wordcloud) (3.3.4)
Requirement already satisfied: numpy>=1.6.1 in c:\users\hp\anaconda3\lib\site-packages (from wordcloud) (1.20.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\hp\anaconda3\lib\site-packages (from matplotlib->wordcloud) (2.4.7)
Requirement already satisfied: cycler>=0.10 in c:\users\hp\anaconda3\lib\site-packages (from matplotlib->wordcloud) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\hp\anaconda3\lib\site-packages (from matplotlib->wordcloud) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\hp\anaconda3\lib\site-packages (from matplotlib->wordcloud) (1.3.1)
Requirement already satisfied: six in c:\users\hp\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib->wordcloud) (1.15.0)

In [2]:
!pip install -U spacy

Requirement already satisfied: spacy in c:\users\hp\anaconda3\lib\site-packages (3.2.4)


Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.8 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (3.0.9)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.7.7)
Requirement already satisfied: packaging>=20.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (20.9)
Requirement already satisfied: numpy>=1.15.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.20.1)
Requirement already satisfied: pathy>=0.3.5 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.6.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.25.1)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.9.1)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.0.6)
Requirement already satisfied: jinja2 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.11.3)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.4.2)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (3.3.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.0.6)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.0.2)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (3.0.6)
Requirement already satisfied: thinc<8.1.0,>=8.0.12 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (8.0.15)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.4.1)
Requirement already satisfied: click<8.1.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (7.1.2)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.0.7)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.8.2)
Requirement already satisfied: setuptools in c:\users\hp\anaconda3\lib\site-packages (from spacy) (52.0.0.post20210125)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (4.59.0)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\hp\anaconda3\lib\site-packages (from packaging>=20.0->spacy) (2.4.7)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in c:\users\hp\anaconda3\lib\site-packages (from pathy>=0.3.5->spacy) (5.2.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\hp\anaconda3\lib\site-packages (from pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4->spacy) (3.7.4.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.26.4)
Requirement already satisfied: idna<3,>=2.5 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2020.12.5)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\hp\anaconda3\lib\site-packages (from jinja2->spacy) (1.1.1)

In [3]:
import re
import string
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from sklearn.metrics import accuracy_score, plot_confusion_matrix, classification_report, confusion_matrix
from wordcloud import WordCloud
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [4]:
df=pd.read_csv('fake_job_postings.csv')

In [5]:
df.head()

Out[5]:

| | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experien |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Marketing Intern | US, NY, New York | Marketing | NaN | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | NaN | 0 | 1 | 0 | Other | Internsh |
| 1 | 2 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | NaN | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | 0 | 1 | 0 | Full-time | Not Applicab |
| 2 | 3 | Commissioning Machinery Assistant (CMA) | US, IA, Wever | NaN | NaN | Valor Services provides Workforce Solutions th... | Our client, located in Houston, is actively se... | Implement pre-commissioning and commissioning ... | NaN | 0 | 1 | 0 | NaN | Na |
| 3 | 4 | Account Executive - Washington DC | US, DC, Washington | Sales | NaN | Our passion for improving quality of life thro... | THE COMPANY: ESRI – Environmental Systems Rese... | EDUCATION: Bachelor’s or Master’s in GIS, busi... | Our culture is anything but corporate—we have ... | 0 | 1 | 0 | Full-time | Mid-Senior lev |
| 4 | 5 | Bill Review Manager | US, FL, Fort Worth | NaN | NaN | SpotSource Solutions LLC is a Global Human Cap... | JOB TITLE: Itemization Review ManagerLOCATION:... | QUALIFICATIONS:RN license in the State of Texa... | Full Benefits Offered | 0 | 1 | 1 | Full-time | Mid-Senior lev |

In [6]:
df.shape

Out[6]: (17880, 18)

In [7]:
df.isnull().sum()

Out[7]: job_id 0
title 0
location 346
department 11547
salary_range 15012
company_profile 3308
description 1
requirements 2695
benefits 7210
telecommuting 0
has_company_logo 0
has_questions 0
employment_type 3471
required_experience 7050
required_education 8105
industry 4903
function 6455
fraudulent 0
dtype: int64
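
The raw null counts are easier to compare as fractions of the 17,880 rows reported by df.shape. The following is a minimal sketch that uses only the numbers printed in Out[6] and Out[7]; the names `null_counts` and `missing_fraction` are illustrative and not part of the original notebook. With the DataFrame loaded, `df.isnull().mean()` should give the same fractions directly.

```python
# Null counts copied from Out[7]; total row count from Out[6].
TOTAL_ROWS = 17880
null_counts = {
    "location": 346,
    "department": 11547,
    "salary_range": 15012,
    "company_profile": 3308,
    "description": 1,
    "requirements": 2695,
    "benefits": 7210,
    "employment_type": 3471,
    "required_experience": 7050,
    "required_education": 8105,
    "industry": 4903,
    "function": 6455,
}


def missing_fraction(counts, total):
    """Return {column: fraction of rows that are null}, largest first."""
    fractions = {col: n / total for col, n in counts.items()}
    return dict(sorted(fractions.items(), key=lambda kv: kv[1], reverse=True))


fractions = missing_fraction(null_counts, TOTAL_ROWS)
for col, frac in fractions.items():
    print(f"{col:<20} {frac:6.1%}")
```

Columns such as salary_range and department are missing in well over half of the rows, which is one possible reason to drop or ignore them later.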

In [8]:
columns = ['job_id', 'telecommuting', 'has_company_logo', 'has_questions', 'salary_range', 'employment_type']
for colu in columns:
    del df[colu]

In [9]:
df.head()

Out[9]:

| | title | location | department | company_profile | description | requirements | benefits | required_experience | required_education | industry | function | fraudulent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Marketing Intern | US, NY, New York | Marketing | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | NaN | Internship | NaN | NaN | Marketing | 0 |
| 1 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | Not Applicable | NaN | Marketing and Advertising | Customer Service | 0 |
| 2 | Commissioning Machinery Assistant (CMA) | US, IA, Wever | NaN | Valor Services provides Workforce Solutions th... | Our client, located in Houston, is actively se... | Implement pre-commissioning and commissioning ... | NaN | NaN | NaN | NaN | NaN | 0 |
| 3 | Account Executive - Washington DC | US, DC, Washington | Sales | Our passion for improving quality of life thro... | THE COMPANY: ESRI – Environmental Systems Rese... | EDUCATION: Bachelor’s or Master’s in GIS, busi... | Our culture is anything but corporate—we have ... | Mid-Senior level | Bachelor's Degree | Computer Software | Sales | 0 |
| 4 | Bill Review Manager | US, FL, Fort Worth | NaN | SpotSource Solutions LLC is a Global Human Cap... | JOB TITLE: Itemization Review ManagerLOCATION:... | QUALIFICATIONS:RN license in the State of Texa... | Full Benefits Offered | Mid-Senior level | Bachelor's Degree | Hospital & Health Care | Health Care Provider | 0 |

In [10]:
df.fillna('', inplace=True)

In [11]:
plt.figure(figsize=(15,5))
sns.countplot(y='fraudulent', data=df)
plt.show()

In [12]:
df.groupby('fraudulent')['fraudulent'].count()

Out[12]: fraudulent
0 17014
1 866
Name: fraudulent, dtype: int64
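
These two counts show a heavy class imbalance, which matters when reading the accuracy reported later. As a rough sketch using only the numbers in Out[12] (the variable names are illustrative), the fraud rate and the accuracy of a trivial "always real" model can be computed as follows.

```python
# Class counts copied from Out[12].
class_counts = {0: 17014, 1: 866}

total = sum(class_counts.values())
fraud_rate = class_counts[1] / total
# Accuracy of a trivial model that labels every posting as real (class 0).
majority_baseline = max(class_counts.values()) / total

print(f"fraud rate:        {fraud_rate:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Any classifier on this data should be judged against that baseline, and on recall for class 1, rather than on raw accuracy alone.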

In [13]:
exp = dict(df.required_experience.value_counts())
del exp['']

In [14]:
exp

Out[14]: {'Mid-Senior level': 3809,
'Entry level': 2697,
'Associate': 2297,
'Not Applicable': 1116,
'Director': 389,
'Internship': 381,
'Executive': 141}

In [15]:
plt.figure(figsize=(10,5))
sns.set_theme(style='whitegrid')
plt.bar(exp.keys(), exp.values())
plt.title('No. of jobs with Experience', size=20)
plt.xlabel('Experience', size=10)
plt.ylabel('No. of jobs', size=10)
plt.xticks(rotation=30)
plt.show()

In [16]:
def split(location):
    l = location.split(',')
    return l[0]
df['country'] = df.location.apply(split)
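
To see what this helper produces, including for the rows whose missing location was filled with an empty string in cell 10, here is a small standalone sketch. The function name `country_of` and the sample list are illustrative; the first two strings are locations shown in the df.head() output.

```python
def country_of(location):
    """Return the text before the first comma, as the notebook's split() does."""
    return location.split(",")[0]


# Two locations from the head of the dataset, plus an empty (previously missing) one.
samples = ["US, NY, New York", "NZ, , Auckland", ""]
countries = [country_of(loc) for loc in samples]
print(countries)
```

An empty location yields an empty country, which is why the later cell removes the `''` key before plotting. In pandas, `df.location.str.split(',').str[0]` should be an equivalent vectorized form.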

In [17]:
df.head()

Out[17]:

| | title | location | department | company_profile | description | requirements | benefits | required_experience | required_education | industry | function | fraudulent | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Marketing Intern | US, NY, New York | Marketing | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | | Internship | | | Marketing | 0 | US |
| 1 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | Not Applicable | | Marketing and Advertising | Customer Service | 0 | NZ |
| 2 | Commissioning Machinery Assistant (CMA) | US, IA, Wever | | Valor Services provides Workforce Solutions th... | Our client, located in Houston, is actively se... | Implement pre-commissioning and commissioning ... | | | | | | 0 | US |
| 3 | Account Executive - Washington DC | US, DC, Washington | Sales | Our passion for improving quality of life thro... | THE COMPANY: ESRI – Environmental Systems Rese... | EDUCATION: Bachelor’s or Master’s in GIS, busi... | Our culture is anything but corporate—we have ... | Mid-Senior level | Bachelor's Degree | Computer Software | Sales | 0 | US |
| 4 | Bill Review Manager | US, FL, Fort Worth | | SpotSource Solutions LLC is a Global Human Cap... | JOB TITLE: Itemization Review ManagerLOCATION:... | QUALIFICATIONS:RN license in the State of Texa... | Full Benefits Offered | Mid-Senior level | Bachelor's Degree | Hospital & Health Care | Health Care Provider | 0 | US |

In [18]:
countr = dict(df.country.value_counts()[:14])
del countr['']
countr

Out[18]: {'US': 10656,
'GB': 2384,
'GR': 940,
'CA': 457,
'DE': 383,
'NZ': 333,
'IN': 276,
'AU': 214,
'PH': 132,
'NL': 127,
'BE': 117,
'IE': 114,
'SG': 80}

In [19]:
plt.figure(figsize=(8,6))
plt.title('Country-wise Job Posting',size=20)
plt.bar(countr.keys(), countr.values())
plt.ylabel('No. of jobs', size=10)
plt.xlabel('Countries', size=10)

Out[19]: Text(0.5, 0, 'Countries')

In [20]:
edu = dict(df.required_education.value_counts()[:7])
del edu['']
edu

Out[20]: {"Bachelor's Degree": 5145,
'High School or equivalent': 2080,
'Unspecified': 1397,
"Master's Degree": 416,
'Associate Degree': 274,
'Certification': 170}

In [21]:
plt.figure(figsize=(15,6))
plt.title('Job postings based on Education', size=20)
plt.bar(edu.keys(), edu.values())
plt.ylabel('No. of Jobs', size=10)
plt.xlabel('Education', size=10)

Out[21]: Text(0.5, 0, 'Education')

In [22]:
print(df[df.fraudulent==0].title.value_counts()[:10])

English Teacher Abroad 311
Customer Service Associate 146
Graduates: English Teacher Abroad (Conversational) 144
English Teacher Abroad 95
Software Engineer 86
English Teacher Abroad (Conversational) 83
Customer Service Associate - Part Time 76
Account Manager 73
Web Developer 66
Project Manager 62
Name: title, dtype: int64

In [23]:
print(df[df.fraudulent==1].title.value_counts()[:10])

Cruise Staff Wanted *URGENT* 21
Data Entry Admin/Clerical Positions - Work From Home 21
Home Based Payroll Typist/Data Entry Clerks Positions Available 21
Customer Service Representative 17
Administrative Assistant 16
Home Based Payroll Data Entry Clerk Position - Earn $100-$200 Daily 12
Account Sales Managers $80-$130,000/yr 10
Network Marketing 10
Payroll Data Coordinator Positions - Earn $100-$200 Daily 10
Payroll Clerk 10
Name: title, dtype: int64

In [24]:
df['text']=df['title']+' '+df['company_profile']+' '+df['description']+' '+df['requirements']+' '+df['benefits']
del df['title']
del df['location']
del df['department']
del df['company_profile']
del df['description']
del df['requirements']
del df['benefits']
del df['required_experience']
del df['required_education']
del df['industry']
del df['function']
del df['country']
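
Cell 24 joins five text columns and then deletes the remaining columns one at a time. A more compact variant is sketched below on a tiny made-up frame; the rows are illustrative stand-ins and not taken from the dataset, and `text_columns` is a name introduced here. It assumes missing values have already been replaced with empty strings, as cell 10 does.

```python
import pandas as pd

# Tiny stand-in frame; the real notebook uses fake_job_postings.csv.
df = pd.DataFrame({
    "title": ["Marketing Intern", "Payroll Clerk"],
    "company_profile": ["We're Food52", ""],
    "description": ["Food52, a fast-growing", "Earn daily"],
    "requirements": ["", ""],
    "benefits": ["", "Work from home"],
    "location": ["US, NY, New York", ""],
    "fraudulent": [0, 1],
})

text_columns = ["title", "company_profile", "description", "requirements", "benefits"]
# Join the text fields row by row with single spaces, as cell 24 does with `+`.
df["text"] = df[text_columns].apply(" ".join, axis=1)
# Keep only the label and the combined text instead of deleting columns one by one.
df = df[["fraudulent", "text"]]
print(df)
```

Selecting the two columns to keep avoids having to list every column to delete, so the cell keeps working if the set of other columns changes.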

In [25]:
df.head()

Out[25]: fraudulent text

0 0 Marketing Intern We're Food52, and we've creat...

1 0 Customer Service - Cloud Video Production 90 S...

2 0 Commissioning Machinery Assistant (CMA) Valor ...

3 0 Account Executive - Washington DC Our passion ...

4 0 Bill Review Manager SpotSource Solutions LLC i...

In [26]:
fraudjobs_text = df[df.fraudulent==1].text
realjobs_text = df[df.fraudulent==0].text

In [27]:
STOPWORDS = spacy.lang.en.stop_words.STOP_WORDS
plt.figure(figsize=(16,14))
wc = WordCloud(min_font_size = 3, max_words = 3000, width = 1500, height = 800, stopwords= STOPWORDS).generate(str(" ".join(fraudjobs_text)))
plt.imshow(wc, interpolation = 'bilinear')

Out[27]: <matplotlib.image.AxesImage at 0x157155b68b0>

In [28]:
STOPWORDS = spacy.lang.en.stop_words.STOP_WORDS
plt.figure(figsize=(16,14))
wc = WordCloud(min_font_size = 3, max_words = 3000, width = 1500, height = 800, stopwords= STOPWORDS).generate(str(" ".join(realjobs_text)))
plt.imshow(wc, interpolation = 'bilinear')

Out[28]: <matplotlib.image.AxesImage at 0x157148d7040>
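
A word cloud sizes words roughly by how often they occur once stop words are removed. Since the images are not reproduced here, the following sketch shows the underlying counting idea with the standard library. The stop word set, the regular expression, and the two sample titles (shortened from the fraudulent titles listed in cell 23) are assumptions for illustration; WordCloud's own tokenization differs in details.

```python
import re
from collections import Counter

# Small illustrative stop word list; the notebook uses spaCy's STOP_WORDS instead.
STOP = {"a", "and", "the", "to", "from", "of", "in", "for"}


def top_words(texts, n=3):
    """Count lowercase word tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP)
    return counts.most_common(n)


fraud_samples = [
    "Data Entry Clerk - Work From Home",
    "Home Based Payroll Data Entry Clerk Position",
]
print(top_words(fraud_samples, n=4))
```

Running the same count on `fraudjobs_text` and `realjobs_text` would give a numeric view of the terms that dominate each cloud.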

In [29]:
!pip install spacy && python -m spacy download en

Requirement already satisfied: spacy in c:\users\hp\anaconda3\lib\site-packages (3.2.4)


Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.8 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (3.0.9)
Requirement already satisfied: setuptools in c:\users\hp\anaconda3\lib\site-packages (from spacy) (52.0.0.post20210125)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (3.3.0)
Requirement already satisfied: thinc<8.1.0,>=8.0.12 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (8.0.15)
Requirement already satisfied: jinja2 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.11.3)
Requirement already satisfied: click<8.1.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (7.1.2)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (3.0.6)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.8.2)
Requirement already satisfied: packaging>=20.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (20.9)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.25.1)
Requirement already satisfied: numpy>=1.15.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.20.1)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.7.7)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.0.6)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.0.2)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (4.59.0)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.0.7)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.4.1)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (2.4.2)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (1.0.6)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.9.1)
Requirement already satisfied: pathy>=0.3.5 in c:\users\hp\anaconda3\lib\site-packages (from spacy) (0.6.1)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\hp\anaconda3\lib\site-packages (from packaging>=20.0->spacy) (2.4.7)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in c:\users\hp\anaconda3\lib\site-packages (from pathy>=0.3.5->spacy) (5.2.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\hp\anaconda3\lib\site-packages (from pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4->spacy) (3.7.4.3)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (4.0.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.26.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2020.12.5)
Requirement already satisfied: idna<3,>=2.5 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\hp\anaconda3\lib\site-packages (from jinja2->spacy) (1.1.1)
Collecting en-core-web-sm==3.2.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Requirement already satisfied: spacy<3.3.0,>=3.2.0 in c:\users\hp\anaconda3\lib\site-packages (from en-core-web-sm==3.2.0) (3.2.4)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.3.0)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.0.2)
Requirement already satisfied: packaging>=20.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (20.9)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.9.1)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.7)
Requirement already satisfied: pathy>=0.3.5 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.6.1)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.7.7)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.6)
Requirement already satisfied: click<8.1.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (7.1.2)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.6)
Requirement already satisfied: thinc<8.1.0,>=8.0.12 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (8.0.15)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.4.1)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.8 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.9)
Requirement already satisfied: numpy>=1.15.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.20.1)
Requirement already satisfied: setuptools in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (52.0.0.post20210125)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.8.2)
Requirement already satisfied: jinja2 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.11.3)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.25.1)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.4.2)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (4.59.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\users\hp\anaconda3\lib\site-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.0.6)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\hp\anaconda3\lib\site-packages (from packaging>=20.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.4.7)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in c:\users\hp\anaconda3\lib\site-packages (from pathy>=0.3.5->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (5.2.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\hp\anaconda3\lib\site-packages (from pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.7.4.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2020.12.5)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\hp\anaconda3\lib\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.26.4)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\hp\anaconda3\lib\site-packages (from jinja2->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.1.1)
[!] As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the full
pipeline package name 'en_core_web_sm' instead.
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

In [30]:
punctuations = string.punctuation

nlp = spacy.load("en_core_web_sm")
stop_words = spacy.lang.en.stop_words.STOP_WORDS

parser = English()

def spacy_tokenizer(sentence):
    # Tokenize the sentence with the blank English pipeline
    mytokens = parser(sentence)

    # Lemmatize and lowercase each token; spaCy v2 lemmatized pronouns to "-PRON-", so keep their lowercase text instead
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]

    # Drop stop words and punctuation
    mytokens = [word for word in mytokens if word not in stop_words and word not in punctuations]

    return mytokens

# Custom transformer using spacy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing leading and trailing whitespace and converting text to lowercase
    return text.strip().lower()
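
The notebook imports Pipeline and defines the predictors transformer but never combines them. One way they could fit together is sketched below. `TextCleaner` is a hypothetical stand-in for predictors, the four texts and labels are made up, and the small random forest is only there to complete the pipeline; none of this is the author's actual setup.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the notebook's `predictors` class."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Same cleaning as clean_text(): trim whitespace and lowercase.
        return [text.strip().lower() for text in X]


pipe = Pipeline([
    ("clean", TextCleaner()),
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Made-up examples: label 1 marks a fraudulent posting.
texts = [
    "  Earn $200 daily from HOME ",
    "Senior engineer, platform team",
    "Work from home data entry",
    "Account manager for enterprise sales",
]
labels = [1, 0, 1, 0]

pipe.fit(texts, labels)
prediction = pipe.predict(["home based data entry, earn daily"])
print(prediction)
```

Fitting the vectorizer inside a pipeline means its vocabulary is learned only from the rows passed to `fit`, whereas cell 32 fits TF-IDF on the full dataset before the train/test split.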

In [31]:
df['text'] = df['text'].apply(clean_text)

In [32]:
cv = TfidfVectorizer(max_features = 100)
x = cv.fit_transform(df['text'])
df1 = pd.DataFrame(x.toarray(), columns = cv.get_feature_names())
df.drop(['text'], axis=1, inplace=True)
main_df = pd.concat([df1,df], axis=1)
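
With `max_features = 100` and no stop word filtering, the retained features appear to be mostly common function words, as the column names in Out[33] suggest ("about", "all", "an", "and", ...). The sketch below, on three made-up sentences, shows how passing `stop_words="english"` changes which terms are kept. It also uses `get_feature_names_out()` when available, since `get_feature_names()` was removed in newer scikit-learn releases; the fallback keeps the sketch runnable on older ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, not rows from the dataset.
docs = [
    "we are hiring a data entry clerk to work from home",
    "the software engineer will work with the platform team",
    "earn money daily and work from home as a payroll clerk",
]

# stop_words="english" drops common function words before the top terms are chosen.
vectorizer = TfidfVectorizer(max_features=5, stop_words="english")
matrix = vectorizer.fit_transform(docs)

# get_feature_names_out() exists in scikit-learn >= 1.0; older releases expose get_feature_names().
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
features = list(get_names())

print(features)
print(matrix.shape)
```

Whether filtering stop words helps the classifier on this dataset is an open question that would need to be checked against the metrics in cell 38.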

In [33]:
main_df.head()

Out[33]: ability about all also amp an and are as at ... who will with work working world years you your fraudulent

0 0.000000 0.041120 0.000000 0.042424 0.036488 0.000000 0.755238 0.000000 0.078653 0.000000 ... 0.000000 0.000000 0.186067 0.051026 0.068029 0.000000 0.000000 0.000000 0.000000 0

1 0.021895 0.094183 0.035394 0.024292 0.041787 0.029771 0.490896 0.056626 0.060050 0.052431 ... 0.000000 0.078004 0.165735 0.043827 0.116862 0.099327 0.000000 0.204854 0.130452 0

2 0.000000 0.000000 0.176807 0.000000 0.041749 0.089231 0.397029 0.113149 0.000000 0.000000 ... 0.000000 0.062346 0.307512 0.058383 0.000000 0.000000 0.000000 0.094462 0.074476 0

3 0.023267 0.000000 0.018806 0.000000 0.000000 0.094909 0.695542 0.000000 0.031906 0.037144 ... 0.023132 0.049735 0.075480 0.046573 0.000000 0.105551 0.019806 0.050236 0.059411 0

4 0.000000 0.000000 0.068009 0.000000 0.040147 0.028602 0.606379 0.081605 0.115386 0.000000 ... 0.000000 0.000000 0.159230 0.028071 0.037425 0.000000 0.035814 0.030279 0.107427 0

5 rows × 101 columns

In [34]:
Y = main_df.iloc[:, -1]
X = main_df.iloc[:, :-1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(12516, 100)
(12516,)
(5364, 100)
(5364,)
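
The split above is random and unseeded, so the numbers in later cells change from run to run, and the share of fraudulent rows in each part can drift. A possible refinement, sketched on synthetic labels with roughly the notebook's 5% fraud rate, is to pass `stratify` and `random_state`. The data and the seed value are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 5% positives, roughly the notebook's fraud rate.
X = [[i] for i in range(1000)]
y = [1] * 50 + [0] * 950

# stratify=y keeps the class ratio the same in both parts; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(sum(y_train), sum(y_test))
```

In the notebook this would correspond to adding `stratify=Y` and a fixed `random_state` to the existing call.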

In [35]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_jobs=3,oob_score=True,n_estimators=100,criterion="entropy")
model = rfc.fit(X_train,Y_train)
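
The classifier is created with `oob_score=True`, but the out-of-bag estimate is never read. The sketch below, on synthetic imbalanced data generated with `make_classification`, shows how to read it from the fitted model through the `oob_score_` attribute. The use of `class_weight="balanced"` is an assumption added here as one common response to class imbalance; it is not part of the original cell.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the TF-IDF features (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)

rfc = RandomForestClassifier(
    n_estimators=100,
    criterion="entropy",
    oob_score=True,           # score each sample using trees that did not see it in training
    class_weight="balanced",  # assumption: reweight classes to counter the imbalance
    random_state=0,
)
rfc.fit(X, y)

print(f"OOB accuracy: {rfc.oob_score_:.3f}")
```

The out-of-bag score gives a validation-style estimate without touching the test set, which can be compared with the test accuracy in cell 37.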

In [36]:
print(X_test)

ability about all also amp an and \
4319 0.000000 0.000000 0.000000 0.082903 0.000000 0.050800 0.358994
8997 0.000000 0.000000 0.000000 0.000000 0.052054 0.000000 0.436794
4062 0.141852 0.101699 0.038218 0.000000 0.000000 0.128586 0.429105
9844 0.000000 0.000000 0.102455 0.000000 0.000000 0.086178 0.304500
1854 0.042142 0.090640 0.068125 0.000000 0.120646 0.057302 0.584916
... ... ... ... ... ... ... ...
14190 0.000000 0.000000 0.132536 0.000000 0.000000 0.037160 0.583560
2631 0.000000 0.000000 0.000000 0.000000 0.000000 0.123859 0.615942
11486 0.000000 0.000000 0.032025 0.000000 0.340289 0.080812 0.465325
13669 0.000000 0.000000 0.056661 0.000000 0.200687 0.047659 0.523907
2763 0.000000 0.070304 0.000000 0.072533 0.000000 0.044446 0.558382

are as at ... well who will \
4319 0.144939 0.000000 0.059645 ... 0.000000 0.074289 0.212966
8997 0.000000 0.074804 0.000000 ... 0.000000 0.000000 0.038868
4062 0.122290 0.000000 0.000000 ... 0.049710 0.000000 0.000000
9844 0.081958 0.086913 0.050591 ... 0.000000 0.000000 0.135479
1854 0.081745 0.028896 0.000000 ... 0.000000 0.000000 0.030028
... ... ... ... ... ... ... ...
14190 0.000000 0.074954 0.000000 ... 0.000000 0.000000 0.077892
2631 0.078530 0.041639 0.000000 ... 0.063844 0.000000 0.173082
11486 0.000000 0.054335 0.000000 ... 0.000000 0.000000 0.028232
13669 0.090651 0.000000 0.000000 ... 0.000000 0.000000 0.049950
2763 0.000000 0.000000 0.000000 ... 0.000000 0.064997 0.000000

with work working world years you your
4319 0.121203 0.099714 0.199411 0.000000 0.000000 0.161335 0.063600
8997 0.117975 0.036397 0.000000 0.000000 0.092872 0.039260 0.000000
4062 0.153395 0.063099 0.000000 0.053627 0.040252 0.000000 0.000000
9844 0.034268 0.000000 0.000000 0.000000 0.000000 0.182460 0.000000
1854 0.159501 0.168714 0.149956 0.000000 0.000000 0.030331 0.107610
... ... ... ... ... ... ... ...
14190 0.059106 0.109410 0.000000 0.061990 0.046529 0.000000 0.000000
2631 0.197008 0.121559 0.108044 0.000000 0.000000 0.087413 0.051689
11486 0.085692 0.026437 0.000000 0.000000 0.000000 0.199615 0.067449
13669 0.000000 0.000000 0.000000 0.000000 0.059675 0.100906 0.000000
2763 0.106042 0.043621 0.000000 0.000000 0.000000 0.282309 0.111289

[5364 rows x 100 columns]

In [37]:
pred = rfc.predict(X_test)
score = accuracy_score(Y_test, pred)
score

Out[37]: 0.9737136465324385

In [38]:
print('Classification_Report\n')
print(classification_report(Y_test, pred))
print('Confusion Matrix\n')
print(confusion_matrix(Y_test, pred))

Classification_Report

              precision    recall  f1-score   support

           0       0.97      1.00      0.99      5117
           1       1.00      0.43      0.60       247

    accuracy                           0.97      5364
   macro avg       0.99      0.71      0.79      5364
weighted avg       0.97      0.97      0.97      5364

Confusion Matrix

[[5117 0]
[ 141 106]]
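
The figures in the classification report can be recomputed from the confusion matrix, which makes it clearer where the high accuracy comes from. This is a small sketch using only the four numbers printed above; the variable names are illustrative.

```python
# Confusion matrix copied from Out[38]: rows are true labels, columns are predictions.
tn, fp = 5117, 0
fn, tp = 141, 106

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# Accuracy of always predicting class 0 on this same test split.
baseline = (tn + fp) / total

print(f"accuracy  {accuracy:.4f}")
print(f"precision {precision:.2f}")
print(f"recall    {recall:.2f}")
print(f"f1        {f1:.2f}")
print(f"baseline  {baseline:.4f}")
```

The model never flags a real posting as fraudulent, but it misses 141 of the 247 fraudulent ones, so its accuracy is only about two percentage points above the always-real baseline.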
