Statistical Computing
with Python
Jason Anastasopoulos, Ph.D.
Upcoming Seminar:
October 22-24, 2020, Remote Seminar
Statistical Computing in
Python
Semi-Structured Data and Databases
HTML and Markup Languages
- Tree-structured (hierarchical) format
- Elements surrounded by opening & closing tags.
- Values embedded in open tags: <tag-name attr-name="attribute"> data </tag-name>
Web Scraping Basics
- Use HTML and JSON data structures to build databases
- JSON is the format used for:
- Most data from APIs
- Data exchange with systems like databases (SQL and MongoDB)
Getting data
Easiest: JSON from APIs
HTML scraping - more difficult; a last resort when the data is not available through an API (rare now)
Other options:
- Write a bot.
- Pretend to be a browser (Selenium)
HTML
Hyper Text Markup Language
Formatting Web pages
Uses tags
Example
<html>
<head>
<title>Page title here</title>
</head>
<body>
This is sample text...
<!--We use this syntax to write comments -->
<p>This is text within a paragraph.</p>
<em>I <strong>really</strong> mean that</em>
<img src="smileyface.jpg" alt="Smiley face" >
</body>
</html>
Webpage example
view-source:https://anastasopoulos.io/research
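Extracting data from HTML pages can be done with Python's built-in html.parser module. A minimal sketch, using a made-up sample page modeled on the HTML example above (real projects often use BeautifulSoup instead):

```python
# Minimal HTML extraction with the standard-library html.parser module.
# The sample page below is hypothetical, for illustration only.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text inside every <title> and <p> tag."""
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self.in_target = True

    def handle_endtag(self, tag):
        if tag in ("title", "p"):
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.texts.append(data)

sample_html = ("<html><head><title>Page title here</title></head>"
               "<body><p>This is text within a paragraph.</p></body></html>")
parser = TextExtractor()
parser.feed(sample_html)
print(parser.texts)
```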
urllib package
- urllib.request module retrieves files over HTTP and FTP
- Connects to web servers using the HTTP protocol
- Works with Request and Response objects
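A sketch of the urllib workflow, using the research-page URL from the slides. Since urlopen() performs a real network call, the example only builds and inspects the Request object; the actual fetch is shown in a comment:

```python
# Building an HTTP request with the standard-library urllib package.
import urllib.request

url = "https://anastasopoulos.io/research"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

print(req.get_full_url())             # the URL the request targets
print(req.get_header("User-agent"))   # header names are case-normalized

# To actually fetch the page (requires a network connection):
# with urllib.request.urlopen(req) as response:
#     html = response.read().decode("utf-8")
```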
JSON
JavaScript Object Notation
Data interchange format
“Lightweight” format
- Data representations
- Easy for users to read
- Easy for parsers to translate
Main Structures
Object
- Uses {}; works like a dictionary, with comma-separated key-value pairs.
Array
- List structure
- Uses []
- Contains values.
Value
- Lowest level.
- Values such as strings, numbers etc.
Simple JSON Sample
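A small hypothetical JSON document (the names and values are made up) showing all three structures at once: an object, a nested array, and values of several types:

```json
{
  "name": "Jane Doe",
  "age": 31,
  "languages": ["Python", "R"],
  "address": { "city": "Athens", "state": "GA" }
}
```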
Accessing JSON Data
Data accessed via APIs formatted in JSON
Easy to access using Python ‘json’ package
Data accessed as in a dictionary.
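A minimal sketch of parsing JSON with the `json` package; the response string is hypothetical, standing in for what an API call would return:

```python
# Parsing JSON text into Python objects with the standard-library json package.
import json

raw = '{"user": "jdoe", "followers": 42, "tags": ["python", "nlp"]}'
data = json.loads(raw)     # parse JSON text into Python objects

print(data["user"])        # JSON objects become dicts
print(data["followers"])   # JSON numbers become ints/floats
print(data["tags"][0])     # JSON arrays become lists

# Going the other way: serialize Python objects back to JSON text
print(json.dumps(data))
```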
Databases
- Means of exchanging information.
- SQL: Structured Query Language.
- MongoDB: NoSQL database, uses JSON-like ways of storing data.
- Brief code demonstration, but each of these databases requires more time to cover.
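For the SQL side, a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table and rows are made up for illustration:

```python
# A tiny SQL demonstration with the standard-library sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory database, nothing on disk
cur = conn.cursor()

cur.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
cur.executemany("INSERT INTO tweets (text) VALUES (?)",
                [("hello world",), ("python is fun",)])
conn.commit()

cur.execute("SELECT text FROM tweets ORDER BY id")
rows = cur.fetchall()                # list of one-element tuples
print(rows)
conn.close()
```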
Statistical Computing in
Python
Unstructured Data and Natural Language Processing
Text processing
1. Tokenization - splits the document into tokens, which can be words or n-grams (phrases).
2. Formatting - punctuation, numbers, case, spacing.
3. Stop-word removal - filtering out very common "stop words".
Tokenization
“Bag of words” model - most text analysis methods treat documents as a big
bunch of words or terms.
Order is generally not taken into account, just word and term frequencies.
There are ways to parse documents into n-grams or words, but we'll stick with words for now.
Tokenization
Tokenized tweet (1-gram): ["I", "don't", "think", "you're", "the", ...]
Tokenized tweet (2-gram): ["I don't", "don't think", "think you're", "you're the", ...]
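The 1-gram and 2-gram splits above can be sketched with only the standard library (real projects often use nltk or spaCy for tokenization); the tweet text is illustrative:

```python
# Simple whitespace tokenization into 1-grams and 2-grams.
tweet = "I don't think you're the only one"

unigrams = tweet.split()   # 1-grams: individual words
# 2-grams: each word paired with its successor
bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]

print(unigrams)
print(bigrams)
```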
Stop words
Stop words are simply words that are removed during text processing.
They tend to be very common words: "the", "and", "is", etc.
These common words can cause problems for machine learning algorithms and search engines because they add noise.
BEWARE: each package defines a different list of stop words, and removal can sometimes decrease the performance of supervised machine learning classifiers.
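A sketch of stop-word removal using a small hand-made stop list (packages such as nltk and scikit-learn each ship their own, different, lists):

```python
# Filtering stop words out of a token list with a hand-made stop list.
stop_words = {"the", "and", "is", "a", "of", "to"}

tokens = ["the", "cat", "is", "on", "the", "mat"]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
```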
Sentiment Analysis
- Sentiment analysis is a type of supervised machine learning that is used to
predict the sentiment of texts.
- Without going into too much detail, we will use what is known as a pretrained
sentiment analysis algorithm.
- This is basically how it works...
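To illustrate the idea behind lexicon-based approaches, here is a toy sentiment scorer with made-up word lists; in practice you would use a pretrained model such as VADER (available via nltk, which requires downloading its lexicon):

```python
# A toy lexicon-based sentiment scorer; the word lists are hypothetical.
positive = {"good", "great", "love", "excellent"}
negative = {"bad", "terrible", "hate", "awful"}

def sentiment(text):
    """Return (#positive - #negative) word counts: >0 positive, <0 negative."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

print(sentiment("I love this great seminar"))     # positive score
print(sentiment("what a terrible, awful idea"))   # negative score
```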
Sentiment Analysis