WebpageSpider: A Concurrent Webpage Data Scraper

1. Introduction

WebpageSpider is a concurrent webpage data scraper based on Scrapy and Playwright, primarily used to fetch static webpage data for phishing detectors such as KnowPhish (see the Citation section below).

The data of a webpage $w$ comprises the following 4 files:

  • input_url.txt: The input URL of $w$
  • info.txt: The landing URL of $w$
  • html.txt: The HTML of $w$
  • shot.png: A screenshot of $w$ at 1280×720 resolution
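
For reference, here is a minimal Python sketch of loading these four files back for one crawled page; the per-page folder layout under ./output/ is an assumption, so adjust the paths to match the spider's actual output:

from pathlib import Path

def load_page_data(page_dir: str) -> dict:
    """Load the four files WebpageSpider writes for one crawled page."""
    d = Path(page_dir)
    return {
        "input_url": (d / "input_url.txt").read_text().strip(),  # input URL of w
        "landing_url": (d / "info.txt").read_text().strip(),     # landing URL of w
        "html": (d / "html.txt").read_text(),                    # HTML of w
        "screenshot": (d / "shot.png").read_bytes(),             # 1280x720 PNG
    }

# Hypothetical usage:
# data = load_page_data("./output/some_page")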

2. Installation

  1. Set up the conda environment on your Linux machine
conda create -n webpage_spider python=3.10
conda activate webpage_spider
  2. Install the required Python packages
bash ./install.sh

3. Start Crawling

Specify your input URL list within the __init__() method in ./mySpider/spiders/webpage_spider.py, and then run the following command:

scrapy crawl webpage_spider
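
For illustration, the URL list in __init__() might be set up as follows (a minimal sketch; the actual attribute name used by webpage_spider.py may differ):

import scrapy

class WebpageSpider(scrapy.Spider):
    name = "webpage_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Replace with your own input URLs (hypothetical examples).
        self.input_urls = [
            "https://example.com",
            "https://example.org",
        ]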

The scraper can process up to 16 URL requests concurrently. You can modify the maximum number of concurrent requests in ./mySpider/settings.py.
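
In Scrapy, this limit is controlled by the CONCURRENT_REQUESTS setting (whose default is 16), so the relevant line in ./mySpider/settings.py looks like this:

# ./mySpider/settings.py
# Maximum number of requests Scrapy processes concurrently.
CONCURRENT_REQUESTS = 16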

(Optional) By default, WebpageSpider reads the CSV file in ./input/ to obtain the list of input URLs and writes the crawled data to a folder in ./output/. We also provide two scripts that fetch a few example benign and phishing URLs:

python ./input/pull_tranco_urls.py
python ./input/pull_openphish_urls.py
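
As a rough idea of what such a script does, here is a minimal sketch that fetches phishing URLs from the public OpenPhish feed and writes them to a CSV in ./input/; the feed URL and output filename are assumptions, so refer to the actual scripts for details:

import csv
import urllib.request

FEED_URL = "https://openphish.com/feed.txt"  # public OpenPhish feed (assumed endpoint)

def pull_openphish_urls(n: int = 10, out_path: str = "./input/openphish_urls.csv") -> None:
    """Fetch the first n URLs from the feed and write them to a one-column CSV."""
    with urllib.request.urlopen(FEED_URL) as resp:
        urls = resp.read().decode("utf-8").splitlines()[:n]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        for url in urls:
            writer.writerow([url])

if __name__ == "__main__":
    pull_openphish_urls()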

Citation

If you find this project helpful, please consider citing our paper:

@inproceedings{li2024knowphish,
  author = {Yuexin Li and Chengyu Huang and Shumin Deng and Mei Lin Lock and Tri Cao and Nay Oo and Hoon Wei Lim and Bryan Hooi},
  title = {{KnowPhish}: Large Language Models Meet Multimodal Knowledge Graphs for Enhancing {Reference-Based} Phishing Detection},
  booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
  year = {2024},
  isbn = {978-1-939133-44-1},
  address = {Philadelphia, PA},
  pages = {793--810},
  url = {https://www.usenix.org/conference/usenixsecurity24/presentation/li-yuexin},
  publisher = {USENIX Association},
  month = aug
}

Acknowledgement

This project was developed with the assistance of @meilinnn and @lindsey98.
