ASSIGNMENT
Course Name: Advanced Data Mining
Name: Umar Yameen
Father Name: Muhammad Yameen
Student ID: 20903
Date: 25th June, 2024
Question No:1
I will use BeautifulSoup, a Python library for parsing HTML and XML that is often combined with the requests library to build web crawlers, to collect the data of interest.
Topic: Latest Technology News
We'll scrape a popular technology news website to gather titles, publication dates, and brief descriptions
of the latest articles.
Steps:
1. Install required libraries.
2. Identify the website to scrape.
3. Write a script to collect the data.
4. Display the collected data.
Step 1: Install Required Libraries
Ensure you have BeautifulSoup and requests installed. If not, you can install them using pip:
pip install beautifulsoup4 requests
Step 2: Identify the Website
We'll use TechCrunch, a popular technology news site, and scrape its startups section at https://techcrunch.com/startups/.
Step 3: Write a Script
Below is a script to scrape the latest technology news from TechCrunch:
import requests
from bs4 import BeautifulSoup

# Define the URL of the TechCrunch startups news page
url = "https://techcrunch.com/startups/"

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all articles
    articles = soup.find_all('article')

    # List to store the scraped data
    news_data = []

    # Extract details from each article
    for article in articles:
        title_tag = article.find('h2')
        description_tag = article.find('p')
        time_tag = article.find('time')

        # Skip articles that are missing any of the expected elements
        if not (title_tag and description_tag and time_tag):
            continue

        # Append the data to the list
        news_data.append({
            'title': title_tag.text.strip(),
            'description': description_tag.text.strip(),
            'date': time_tag.get('datetime', '')
        })

    # Display the scraped data
    for news in news_data:
        print(f"Title: {news['title']}")
        print(f"Date: {news['date']}")
        print(f"Description: {news['description']}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
Step 4: Display the Collected Data
Running the script will output the latest technology news articles from TechCrunch, including titles,
publication dates, and brief descriptions.
Sample Output
Title: TechCrunch Startup Battlefield
Date: 2024-06-18T12:00:00Z
Description: The best startups are competing for the coveted Disrupt Cup and a $50,000 prize.
Title: AI Startup Secures $10M Funding
Date: 2024-06-17T14:30:00Z
Description: An AI startup specializing in natural language processing has raised $10 million in a Series A
funding round.
Title: New Tech Hub in Silicon Valley
Date: 2024-06-16T09:00:00Z
Description: A new tech hub has been established in Silicon Valley, offering resources and support to
innovative startups.
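If we also want to keep the scraped articles for later analysis, a minimal sketch (reusing the news_data list built by the script above; the output filename is an assumption) could write them to a CSV file:

import csv

# Write the scraped articles to a CSV file for later analysis
# (news_data is the list of dictionaries built by the scraping script above)
with open('techcrunch_news.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'date', 'description'])
    writer.writeheader()
    writer.writerows(news_data)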
Question No:2
Let's use Twint, an advanced Twitter scraping tool written in Python, to extract some Twitter data. Twint
is powerful because it doesn't require access to the Twitter API, which can be restrictive due to rate
limits and other constraints.
Installation
First, let's install Twint. Twint requires Python 3.6 or higher. You can install Twint using pip:
pip install twint
Extracting Data with Twint
We'll extract tweets containing a specific hashtag, for example, #AI, and display the tweet text,
username, and date of each tweet.
Step-by-Step Script
1. Import the Required Library
2. Configure Twint
3. Run the Twint Search
4. Display the Results
Here's a Python script using Twint:
import twint
# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 10 # Limit to 10 tweets for demonstration purposes
c.Lang = "en" # Search for English tweets
c.Store_object = True # Store tweets in a Python object
c.Hide_output = True # Hide output in terminal
# Run the search
twint.run.Search(c)
# Retrieve tweets from Twint's internal storage
tweets = twint.output.tweets_list
# Display the collected tweets
for tweet in tweets:
    print(f"Username: {tweet.username}")
    print(f"Date: {tweet.datestamp} {tweet.timestamp}")
    print(f"Tweet: {tweet.tweet}\n")
Explanation
1. Import the Required Library: We import Twint to use its functionality.
2. Configure Twint: We set up the search parameters:
- c.Search specifies the search query, which is #AI in this case.
- c.Limit limits the number of tweets to retrieve.
- c.Lang filters tweets to a specific language.
- c.Store_object tells Twint to store the results in a Python object.
- c.Hide_output hides the output in the terminal.
3. Run the Twint Search: We execute the search using twint.run.Search(c).
4. Display the Results: We access the stored tweets and print the username, date, and tweet text.
Sample Output
The output will be similar to this:
Username: ai_expert
Date: 2024-06-19 12:34:56
Tweet: AI is transforming the world in unprecedented ways. #AI
Username: tech_guru
Date: 2024-06-19 12:30:22
Tweet: Exciting developments in AI technology! #AI #MachineLearning
Username: datascientist
Date: 2024-06-19 12:28:15
Tweet: How AI is revolutionizing healthcare. #AI #HealthTech
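If we prefer to save the tweets to a file instead of printing them, Twint can also write results directly to CSV. A minimal sketch of that variation (the output filename is an assumption for this example):

import twint

# Same search as above, but write the matching tweets straight to a CSV file
c = twint.Config()
c.Search = "#AI"
c.Limit = 10
c.Lang = "en"
c.Store_csv = True                 # enable CSV output
c.Output = "ai_tweets_sample.csv"  # output filename (assumed for this example)

twint.run.Search(c)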
Question No:3
To demonstrate the general steps involved in collecting data for analysis using crawlers, let's create a
practical example using Twint to scrape Twitter data related to the hashtag #AI.
Step 1: Identify the Data Source
We will use Twitter as our data source to collect tweets containing the hashtag #AI.
Step 2: Understand the Data Structure
We aim to collect the following elements from each tweet: the username, the date and time, the tweet text, the number of likes, and the number of retweets. For example:
Username: Umarchauhdry
Date: 2024-06-19 12:28:15
Likes: 89
Retweets: 112
Step 3: Choose a Crawler
We will use Twint, a Python library that allows scraping Twitter without API limitations.
Step 4: Build or Configure the Crawler
We will configure Twint to search for tweets containing #AI and set the necessary parameters.
Step 5: Implement Data Extraction Logic
We'll use Twint's configuration options to specify our data extraction needs.
Step 6: Handle Pagination
Twint automatically handles pagination, so we don't need to write extra code for this.
Step 7: Ensure Compliance and Respect
We'll include a delay between requests to avoid overloading Twitter's servers and ensure compliance
with Twitter's terms of service.
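Twint manages its own requests internally, but for a crawler we write ourselves, such as the requests-based TechCrunch scraper in Question 1, the delay has to be added explicitly. A minimal sketch, assuming a hypothetical /page/N/ URL pattern and a 2-second pause:

import time
import requests

# Politely fetch a few listing pages, pausing between requests
base_url = "https://techcrunch.com/startups/page/{}/"  # paged URL pattern (an assumption)
for page in range(1, 4):
    response = requests.get(base_url.format(page))
    print(f"Page {page}: status {response.status_code}")
    time.sleep(2)  # wait 2 seconds before the next request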
Step 8: Execute the Crawler
We will run the Twint script to start collecting data.
Step 9: Clean and Preprocess the Data
After collecting the data, we'll clean and preprocess it to remove any irrelevant information and format it
for analysis.
Step 10: Analyze the Data
We'll analyze the data using Python libraries like pandas and matplotlib to gain insights.
Python Script for Steps 4-8
Here's a complete Python script to collect and preprocess Twitter data using Twint:
import twint
import pandas as pd
# Configure Twint to search for tweets containing the hashtag #AI
c = twint.Config()
c.Search = "#AI"
c.Limit = 100 # Limit to 100 tweets for demonstration purposes
c.Lang = "en" # Search for English tweets
c.Store_object = True # Store tweets in a Python object
c.Hide_output = True # Hide output in terminal
c.Pandas = True # Enable saving to pandas DataFrame
# Run the search
twint.run.Search(c)
# Retrieve tweets from Twint's internal storage
tweets_df = twint.storage.panda.Tweets_df
# Display the first few rows of the collected data
print(tweets_df.head())
# Clean and preprocess the data
# Select relevant columns
# (note: depending on the Twint version, the engagement columns may be named
# 'nlikes' and 'nretweets'; adjust the list below if these names raise a KeyError)
tweets_cleaned = tweets_df[['date', 'username', 'tweet', 'likes_count', 'retweets_count']]
# Remove duplicates
tweets_cleaned = tweets_cleaned.drop_duplicates()
# Save to a CSV file for further analysis
tweets_cleaned.to_csv('ai_tweets.csv', index=False)
# Display the cleaned data
print(tweets_cleaned.head())
Explanation of the Script
1. Configure Twint: We set the search parameters, including the hashtag #AI, a limit of 100 tweets, English as the language, and storage of the results in a pandas DataFrame.
2. Run the Search: We execute the Twint search with the configured parameters.
3. Retrieve Data: We retrieve the data from Twint's internal storage and convert it to a pandas
DataFrame.
4. Clean and Preprocess Data: We select relevant columns, remove duplicates, and save the
cleaned data to a CSV file.
5. Display Data: We print the first few rows of the collected and cleaned data.
Step 9: Clean and Preprocess the Data
The script already includes data cleaning steps such as selecting relevant columns and removing
duplicates.
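Beyond selecting columns and dropping duplicates, we could also normalize the tweet text itself. A minimal sketch, assuming the ai_tweets.csv file saved above (the regular expressions are illustrative):

import pandas as pd

# Load the saved tweets and strip URLs and @mentions from the text
tweets_cleaned = pd.read_csv('ai_tweets.csv')
tweets_cleaned['tweet'] = (
    tweets_cleaned['tweet']
    .str.replace(r'http\S+', '', regex=True)  # remove URLs
    .str.replace(r'@\w+', '', regex=True)     # remove @mentions
    .str.strip()
)

# Parse the date column so time-based analysis is possible later
tweets_cleaned['date'] = pd.to_datetime(tweets_cleaned['date'])
print(tweets_cleaned.head())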
Step 10: Analyze the Data
Here's a simple analysis using pandas and matplotlib to visualize the number of likes and retweets:
import pandas as pd
import matplotlib.pyplot as plt
# Read the cleaned data from the CSV file
tweets_cleaned = pd.read_csv('ai_tweets.csv')
# Plot the number of likes and retweets
plt.figure(figsize=(10, 5))
# Number of likes
plt.subplot(1, 2, 1)
plt.hist(tweets_cleaned['likes_count'], bins=20, color='blue', edgecolor='black')
plt.title('Distribution of Likes')
plt.xlabel('Number of Likes')
plt.ylabel('Frequency')
# Number of retweets
plt.subplot(1, 2, 2)
plt.hist(tweets_cleaned['retweets_count'], bins=20, color='green', edgecolor='black')
plt.title('Distribution of Retweets')
plt.xlabel('Number of Retweets')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
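As a quick extension, we could also summarize engagement numerically with pandas, for example by listing the most-liked tweets in the sample (this reuses the tweets_cleaned DataFrame loaded above):

# Summary statistics for the engagement columns
print(tweets_cleaned[['likes_count', 'retweets_count']].describe())

# Five most-liked tweets in the collected sample
top_liked = tweets_cleaned.sort_values('likes_count', ascending=False).head(5)
print(top_liked[['username', 'likes_count', 'tweet']])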