ASSIGNMENT
Course Name: Advanced Data Mining
Name: Umar Yameen
Father Name: Muhammad Yameen
Student ID: 20903
Date: 25th June, 2024
Question No:1
I will use BeautifulSoup, a Python library for parsing HTML and XML that is often combined with the requests library to build web crawlers, to collect the data of interest.
Topic: Latest Technology News
We'll scrape a popular technology news website to gather titles, publication dates, and brief descriptions
of the latest articles.
Steps:
1. Install required libraries.
2. Identify the website to scrape.
3. Write a script to collect the data.
4. Display the collected data.
Step 1: Install Required Libraries
Ensure you have BeautifulSoup and requests installed. If not, you can install them using pip:
pip install beautifulsoup4 requests
Step 2: Identify the Website
We'll use TechCrunch, a popular technology news site, and scrape its startups section at https://techcrunch.com/startups/.
Step 3: Write a Script
Below is a script to scrape the latest technology news from TechCrunch:
import requests
from bs4 import BeautifulSoup

# Define the URL of the TechCrunch startups news page
url = "https://techcrunch.com/startups/"

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all articles
    articles = soup.find_all('article')

    # List to store the scraped data
    news_data = []

    # Extract details from each article
    for article in articles:
        title_tag = article.find('h2')
        description_tag = article.find('p')
        time_tag = article.find('time')

        # Skip articles that are missing any of the expected elements
        if not (title_tag and description_tag and time_tag):
            continue

        # Append the data to the list
        news_data.append({
            'title': title_tag.text.strip(),
            'description': description_tag.text.strip(),
            'date': time_tag.get('datetime', '')
        })

    # Display the scraped data
    for news in news_data:
        print(f"Title: {news['title']}")
        print(f"Date: {news['date']}")
        print(f"Description: {news['description']}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Step 4: Display the Collected Data
Running the script will output the latest technology news articles from TechCrunch, including titles,
publication dates, and brief descriptions.
Sample Output
Title: TechCrunch Startup Battlefield
Date: 2024-06-18T12:00:00Z
Description: The best startups are competing for the coveted Disrupt Cup and a $50,000 prize.
Title: AI Startup Secures $10M Funding
Date: 2024-06-17T14:30:00Z
Description: An AI startup specializing in natural language processing has raised $10 million in a Series A
funding round.
Title: New Tech Hub in Silicon Valley
Date: 2024-06-16T09:00:00Z
Description: A new tech hub has been established in Silicon Valley, offering resources and support to
innovative startups.
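If we also want to keep the scraped articles for later analysis, a minimal sketch (reusing the news_data list built by the script above; the output filename is an assumption) could write them to a CSV file:

import csv

# Write the scraped articles to a CSV file for later analysis
# (news_data is the list of dictionaries built by the scraping script above)
with open('techcrunch_news.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'date', 'description'])
    writer.writeheader()
    writer.writerows(news_data)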
Question No:2
Let's use Twint, an advanced Twitter scraping tool written in Python, to extract some Twitter data. Twint
is powerful because it doesn't require access to the Twitter API, which can be restrictive due to rate
limits and other constraints.
Installation
First, let's install Twint. Twint requires Python 3.6 or higher. You can install Twint using pip:
pip install twint
Extracting Data with Twint
We'll extract tweets containing a specific hashtag, for example, #AI, and display the tweet text,
username, and date of each tweet.
Step-by-Step Script
1. Import the Required Library
2. Configure Twint
3. Run the Twint Search
4. Display the Results
Here's a Python script using Twint:
import twint
# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 10 # Limit to 10 tweets for demonstration purposes
c.Lang = "en" # Search for English tweets
c.Store_object = True # Store tweets in a Python object
c.Hide_output = True # Hide output in terminal
# Run the search
twint.run.Search(c)
# Retrieve tweets from Twint's internal storage
tweets = twint.output.tweets_list
# Display the collected tweets
for tweet in tweets:
    print(f"Username: {tweet.username}")
    print(f"Date: {tweet.datestamp} {tweet.timestamp}")
    print(f"Tweet: {tweet.tweet}\n")
Explanation
1. Import the Required Library: We import Twint to use its functionality.
2. Configure Twint: We set up the search parameters:
- c.Search specifies the search query, which is #AI in this case.
- c.Limit limits the number of tweets to retrieve.
- c.Lang filters tweets to a specific language.
- c.Store_object tells Twint to store the results in a Python object.
- c.Hide_output hides the output in the terminal.
3. Run the Twint Search: We execute the search using twint.run.Search(c).
4. Display the Results: We access the stored tweets and print the username, date, and tweet text.
Sample Output
The output will be similar to this:
Username: ai_expert
Date: 2024-06-19 12:34:56
Tweet: AI is transforming the world in unprecedented ways. #AI
Username: tech_guru
Date: 2024-06-19 12:30:22
Tweet: Exciting developments in AI technology! #AI #MachineLearning
Username: datascientist
Date: 2024-06-19 12:28:15
Tweet: How AI is revolutionizing healthcare. #AI #HealthTech
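If we prefer to save the tweets to a file instead of printing them, Twint can also write results directly to CSV. A minimal sketch of that variation (the output filename is an assumption for this example):

import twint

# Same search as above, but write the matching tweets straight to a CSV file
c = twint.Config()
c.Search = "#AI"
c.Limit = 10
c.Lang = "en"
c.Store_csv = True                 # enable CSV output
c.Output = "ai_tweets_sample.csv"  # output filename (assumed for this example)

twint.run.Search(c)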
Question No:3
To demonstrate the general steps involved in collecting data for analysis using crawlers, let's create a
practical example using Twint to scrape Twitter data related to the hashtag #AI.
Step 1: Identify the Data Source
We will use Twitter as our data source to collect tweets containing the hashtag #AI.
Step 2: Understand the Data Structure
We aim to collect the following elements from each tweet: the username, the date and time, the tweet text, the number of likes, and the number of retweets. For example:
Username: Umarchauhdry
Date: 2024-06-19 12:28:15
Likes: 89
Retweets: 112
Step 3: Choose a Crawler
We will use Twint, a Python library that allows scraping Twitter without API limitations.
Step 4: Build or Configure the Crawler
We will configure Twint to search for tweets containing #AI and set the necessary parameters.
Step 5: Implement Data Extraction Logic
We'll use Twint's configuration options to specify our data extraction needs.
Step 6: Handle Pagination
Twint automatically handles pagination, so we don't need to write extra code for this.
Step 7: Ensure Compliance and Respect
We'll include a delay between requests to avoid overloading Twitter's servers and ensure compliance
with Twitter's terms of service.
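Twint manages its own requests internally, but for a crawler we write ourselves, such as the requests-based TechCrunch scraper in Question 1, the delay has to be added explicitly. A minimal sketch, assuming a hypothetical /page/N/ URL pattern and a 2-second pause:

import time
import requests

# Politely fetch a few listing pages, pausing between requests
base_url = "https://techcrunch.com/startups/page/{}/"  # paged URL pattern (an assumption)
for page in range(1, 4):
    response = requests.get(base_url.format(page))
    print(f"Page {page}: status {response.status_code}")
    time.sleep(2)  # wait 2 seconds before the next request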
Step 8: Execute the Crawler
We will run the Twint script to start collecting data.
Step 9: Clean and Preprocess the Data
After collecting the data, we'll clean and preprocess it to remove any irrelevant information and format it
for analysis.
Step 10: Analyze the Data
We'll analyze the data using Python libraries like pandas and matplotlib to gain insights.
Python Script for Steps 4-8
Here's a complete Python script to collect and preprocess Twitter data using Twint:
import twint
import pandas as pd
# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 100 # Limit to 100 tweets for demonstration purposes
c.Lang = "en" # Search for English tweets
c.Store_object = True # Store tweets in a Python object
c.Hide_output = True # Hide output in terminal
c.Pandas = True # Enable saving to pandas DataFrame
# Run the search
twint.run.Search(c)
# Retrieve tweets from Twint's internal storage
tweets_df = twint.storage.panda.Tweets_df
# Display the first few rows of the collected data
print(tweets_df.head())
# Clean and preprocess the data
# Select relevant columns
# (note: depending on the Twint version, the engagement columns may be named
# 'nlikes' and 'nretweets'; adjust the list below if these names raise a KeyError)
tweets_cleaned = tweets_df[['date', 'username', 'tweet', 'likes_count', 'retweets_count']]
# Remove duplicates
tweets_cleaned = tweets_cleaned.drop_duplicates()
# Save to a CSV file for further analysis
tweets_cleaned.to_csv('ai_tweets.csv', index=False)
# Display the cleaned data
print(tweets_cleaned.head())
Explanation of the Script
1. Configure Twint: We set the search parameters, including the hashtag #AI, a limit of 100 tweets, English as the language, and storage of the results in a pandas DataFrame.
2. Run the Search: We execute the Twint search with the configured parameters.
3. Retrieve Data: We retrieve the data from Twint's internal storage and convert it to a pandas
DataFrame.
4. Clean and Preprocess Data: We select relevant columns, remove duplicates, and save the
cleaned data to a CSV file.
5. Display Data: We print the first few rows of the collected and cleaned data.
Step 9: Clean and Preprocess the Data
The script already includes data cleaning steps such as selecting relevant columns and removing
duplicates.
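Beyond selecting columns and dropping duplicates, we could also normalize the tweet text itself. A minimal sketch, assuming the ai_tweets.csv file saved above (the regular expressions are illustrative):

import pandas as pd

# Load the saved tweets and strip URLs and @mentions from the text
tweets_cleaned = pd.read_csv('ai_tweets.csv')
tweets_cleaned['tweet'] = (
    tweets_cleaned['tweet']
    .str.replace(r'http\S+', '', regex=True)  # remove URLs
    .str.replace(r'@\w+', '', regex=True)     # remove @mentions
    .str.strip()
)

# Parse the date column so time-based analysis is possible later
tweets_cleaned['date'] = pd.to_datetime(tweets_cleaned['date'])
print(tweets_cleaned.head())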
Step 10: Analyze the Data
Here's a simple analysis using pandas and matplotlib to visualize the number of likes and retweets:
import pandas as pd
import matplotlib.pyplot as plt
# Read the cleaned data from the CSV file
tweets_cleaned = pd.read_csv('ai_tweets.csv')
# Plot the number of likes and retweets
plt.figure(figsize=(10, 5))
# Number of likes
plt.subplot(1, 2, 1)
plt.hist(tweets_cleaned['likes_count'], bins=20, color='blue', edgecolor='black')
plt.title('Distribution of Likes')
plt.xlabel('Number of Likes')
plt.ylabel('Frequency')
# Number of retweets
plt.subplot(1, 2, 2)
plt.hist(tweets_cleaned['retweets_count'], bins=20, color='green', edgecolor='black')
plt.title('Distribution of Retweets')
plt.xlabel('Number of Retweets')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
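As a quick extension, we could also summarize engagement numerically with pandas, for example by listing the most-liked tweets in the sample (this reuses the tweets_cleaned DataFrame loaded above):

# Summary statistics for the engagement columns
print(tweets_cleaned[['likes_count', 'retweets_count']].describe())

# Five most-liked tweets in the collected sample
top_liked = tweets_cleaned.sort_values('likes_count', ascending=False).head(5)
print(top_liked[['username', 'likes_count', 'tweet']])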