Text Mining and Preprocessing Guide

This document walks through R code for preprocessing text documents for text mining: cleaning the text (removing punctuation, numbers and stopwords, and stemming words), building a document-term matrix, examining term frequencies, finding the most and least frequent terms and correlations between terms, and plotting a histogram and word cloud of frequent terms.
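The walkthrough uses the tm, SnowballC, ggplot2 and wordcloud packages; if any are missing, a one-time install with the standard CRAN package names is:

install.packages(c("tm", "SnowballC", "ggplot2", "wordcloud"))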

library(tm)

#Create Corpus - CHANGE PATH AS NEEDED
docs <- Corpus(DirSource("C:/Users/<YourPath>/Documents/TextMining"))
#Check details
inspect(docs)
#inspect a particular document (index 30 assumes the corpus contains at least 30 documents)
writeLines(as.character(docs[[30]]))
#Start preprocessing
toSpace <- content_transformer(function(x, pattern) { return(gsub(pattern, " ", x)) })
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, "'")
docs <- tm_map(docs, toSpace, "’")   #curly apostrophe - easily lost when code is copied as plain text
docs <- tm_map(docs, toSpace, " -")
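The repeated calls above could also be collapsed into a single pass using a regular-expression character class; a minimal sketch (equivalent for the characters shown, and purely an optional variant):

#variant: replace several separator characters in one pass
docs <- tm_map(docs, toSpace, "[-:'’]")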
#Good practice to check after each step.
writeLines(as.character(docs[[30]]))
#Remove punctuation - replace punctuation marks with " "
docs <- tm_map(docs, removePunctuation)
#Transform to lower case
docs <- tm_map(docs,content_transformer(tolower))
#Strip digits
docs <- tm_map(docs, removeNumbers)
#Remove stopwords from standard stopword list (how to check this? how to add your own? see the sketch below)
docs <- tm_map(docs, removeWords, stopwords("english"))
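To answer the questions in the comment above: stopwords("english") is just a character vector, so it can be inspected directly and extended by concatenation. A minimal sketch (the extra words "can" and "will" are purely illustrative):

#check the built-in list - it is an ordinary character vector
length(stopwords("english"))
head(stopwords("english"), 20)
#add your own stopwords and remove them the same way
myStopwords <- c(stopwords("english"), "can", "will")
docs <- tm_map(docs, removeWords, myStopwords)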
#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)
#inspect output
writeLines(as.character(docs[[30]]))
#Need SnowballC library for stemming
library(SnowballC)
#Stem document
docs <- tm_map(docs,stemDocument)
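If you are curious what the stemmer does to individual words, SnowballC's wordStem() can be called directly on a character vector (an optional aside, not part of the original pipeline):

#inspect how individual words are stemmed; note that stems are often not dictionary words
wordStem(c("organization", "organizing", "enterprise", "projects"))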
#some clean up
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organiz", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organis", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "andgovern", replacement = "govern")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "inenterpris", replacement = "enterpris")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "team-", replacement = "team")
#inspect
writeLines(as.character(docs[[30]]))
#Create document-term matrix
dtm <- DocumentTermMatrix(docs)
#inspect a segment of the document-term matrix (indices assume at least 2 documents and 1005 terms)
inspect(dtm[1:2,1000:1005])
#collapse matrix by summing over columns - this gets total counts (over all docs) for each term
freq <- colSums(as.matrix(dtm))
#length should be total number of terms
length(freq)
#create sort order (descending)
ord <- order(freq,decreasing=TRUE)
#inspect most frequently occurring terms
freq[head(ord)]
#inspect least frequently occurring terms
freq[tail(ord)]
#filter the matrix: keep words of 4 to 20 characters that appear in 3 to 27 documents
#(this drops very rare and very common terms)
dtmr <- DocumentTermMatrix(docs, control=list(wordLengths=c(4, 20),
                           bounds=list(global=c(3,27))))
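A quick sanity check (not in the original script, but safe to run): compare the dimensions of the unfiltered and filtered matrices to see how much the vocabulary shrank.

#rows = documents, columns = distinct terms
dim(dtm)
dim(dtmr)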
freqr <- colSums(as.matrix(dtmr))
#length should be total number of terms
length(freqr)
#create sort order (descending)
ordr <- order(freqr,decreasing=TRUE)
#inspect most frequently occurring terms
freqr[head(ordr)]
#inspect least frequently occurring terms
freqr[tail(ordr)]
#list most frequent terms. Lower bound specified as second argument
findFreqTerms(dtmr,lowfreq=80)
#find terms that correlate with a given term above the threshold given as the third argument
findAssocs(dtmr,"project",0.6)
findAssocs(dtmr,"enterprise",0.6)
findAssocs(dtmr,"system",0.6)
#histogram of terms occurring more than 100 times
wf <- data.frame(term=names(freqr), occurrences=freqr)
library(ggplot2)
p <- ggplot(subset(wf, occurrences>100), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
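As an optional variant (not in the original code), reorder() sorts the bars by frequency, which usually makes the chart easier to read:

#variant: bars ordered by descending frequency
p2 <- ggplot(subset(wf, occurrences>100), aes(reorder(term, -occurrences), occurrences)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  xlab("term")
p2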
#wordcloud
library(wordcloud)
#setting the same seed each time ensures consistent look across clouds
set.seed(42)
#limit words by specifying min frequency
wordcloud(names(freqr),freqr, min.freq=70)
#...add color
wordcloud(names(freqr),freqr,min.freq=70,colors=brewer.pal(6,"Dark2"))
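If the cloud is too crowded, max.words (a standard wordcloud argument) caps how many terms are drawn; this tweak is optional and not part of the original script. brewer.pal() comes from RColorBrewer, which wordcloud loads automatically.

#cap the number of words drawn
wordcloud(names(freqr), freqr, min.freq=70, max.words=50, colors=brewer.pal(6,"Dark2"))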
