uriscrape

Scrapes URIs from Telegram channel transcripts saved as PDF files. The URIs typically look like these examples:

https://t.me/joinchat/AAAAAEOs3wFD4Mv6SN4hlQ

(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)

https://drive.google.com/open?id=0B_3xyna6XV4GMHNPU0VVWHZKRXc

https://archive.org/details/Rumiyah13UR_201709
(https://archive.org/details/Rumiyah13UR_201709)

(tg://search_hashtag?hashtag=%D8%A6%DB%95%D9%84%DA%BE%D8%A7%D9%8A%D8%A7%D8%AA)
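As a rough illustration of how URIs of these shapes could be pulled out of transcript text, here is a minimal sketch using a regular expression. The pattern and the `find_uris` helper are assumptions made for this example, not necessarily what `uriscrape.py` does, and the text is assumed to have already been extracted from the PDF.

```python
import re

# Hypothetical pattern (not taken from uriscrape.py): matches http(s) and
# tg:// URIs and stops at whitespace, parentheses, or angle brackets, so a
# URI wrapped in parentheses is captured without the closing ")".
URI_PATTERN = re.compile(r"(?:https?|tg)://[^\s()<>]+")


def find_uris(text):
    """Return every URI-like substring in text, in order of appearance."""
    return URI_PATTERN.findall(text)


sample = (
    "https://t.me/joinchat/AAAAAEOs3wFD4Mv6SN4hlQ "
    "(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)"
)
for uri in find_uris(sample):
    print(uri)
```

A pattern this simple will also capture trailing punctuation such as a full stop directly after a URI, so a real scraper would likely need some additional cleanup.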

Running the program

usage: `python uriscrape.py transcript`

positional arguments:
  transcript         filepath to transcript pdf or directory

optional arguments:
  None yet...
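The command line described above could be handled along these lines. This is only a sketch: `build_parser` and `collect_pdfs` are hypothetical names, and treating a directory as "every `*.pdf` file directly inside it" is an assumption about how the directory case works.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the single positional argument from the usage text."""
    parser = argparse.ArgumentParser(prog="uriscrape.py")
    parser.add_argument(
        "transcript", help="filepath to transcript pdf or directory"
    )
    return parser


def collect_pdfs(transcript):
    """Return the PDF paths to process (assumed, non-recursive directory scan)."""
    path = Path(transcript)
    if path.is_dir():
        return sorted(path.glob("*.pdf"))
    return [path]


# Parse an explicit argument list here so the sketch runs without a real CLI call.
args = build_parser().parse_args(["transcript.pdf"])
print(collect_pdfs(args.transcript))
```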

Output file

urls.xlsx - All URIs found, with the following columns:

File: PDF file processed
Access_Date: Date and time the program was run. Useful for documenting when the program attempted to resolve the URIs
Post_Date: Date of the post, as derived from the date labels in the Telegram transcript
URL: URL as found
Site_Reached: True/False - whether the URI could be resolved
Unshortened URL: The URL after unshortening (e.g. https://youtu.be/lqXwyl89xU4 unshortens to https://www.youtube.com/watch?v=lqXwyl89xU4&feature=youtu.be)
Status: Error code, if an error was encountered while trying to access the URI
Type: Classification of the link
Hashtag: Hashtag, if the link is a Telegram hashtag link
Channel: Channel, if the link is a Telegram join link
Account: Account, if the link is a Telegram account link
Domain: Full server domain (e.g. www.youtube.com)
Primary_Secondary: Just the primary and secondary portions of the domain (e.g. youtube.com)
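To make a few of these columns concrete, the sketch below shows one way the Domain, Primary_Secondary, and Hashtag values could be derived from a URL with the standard library. The function names are hypothetical and the logic is an assumption for illustration, not the program's actual implementation.

```python
from urllib.parse import parse_qs, urlparse


def domain_columns(url):
    """Return (Domain, Primary_Secondary) for a URL.

    Assumption: Primary_Secondary is simply the last two dot-separated
    labels of the host name.
    """
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    primary_secondary = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return host, primary_secondary


def telegram_hashtag(url):
    """Return the decoded hashtag for a tg://search_hashtag link, else None."""
    parsed = urlparse(url)
    if parsed.scheme == "tg" and parsed.netloc == "search_hashtag":
        values = parse_qs(parsed.query).get("hashtag")
        return values[0] if values else None
    return None


print(domain_columns("https://www.youtube.com/watch?v=lqXwyl89xU4"))
print(telegram_hashtag("tg://search_hashtag?hashtag=news"))
```

Taking the last two labels is a simplification: it gives youtube.com for www.youtube.com, but would give co.uk for a host such as www.example.co.uk, so a production version would probably need a public-suffix list.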
