Archive
extract text from a PDF file
Problem
You have a PDF file and you want to extract text from it.
Solution
You can use the PyPDF2 module for this purpose.
import PyPDF2

def main():
    book = open('book.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(book)
    pages = pdfReader.numPages        # total number of pages
    page = pdfReader.getPage(0)       # 1st page
    text = page.extractText()
    print(text)

if __name__ == "__main__":
    main()
Note that indexing starts at 0: if you open your PDF with Adobe Reader, for instance, and locate page 20, then in the source code you must use getPage(19).
Links
- PyPDF2 on GitHub
Exercise
Write a program that extracts all pages of a PDF and saves the content of the pages to separate files, e.g. page0.txt, page1.txt, etc.
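A possible sketch for the exercise, sticking to the old PyPDF2 API used above (in newer releases of PyPDF2/pypdf the names are PdfReader, reader.pages and page.extract_text()):

import PyPDF2

def extract_all_pages(pdf_name='book.pdf'):
    with open(pdf_name, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        for i in range(reader.numPages):
            # save each page's text to page0.txt, page1.txt, etc.
            with open('page{0}.txt'.format(i), 'w') as out:
                out.write(reader.getPage(i).extractText())

if __name__ == "__main__":
    extract_all_pages()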
extract all images from a 4chan thread
If you are a 4chan user, then this little project of mine can be useful for you: https://github.com/jabbalaci/4chan-Thread-Images . It can extract all the images from a 4chan thread. It uses the official 4chan API; it doesn't do any web scraping.
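For reference, a minimal sketch of the idea (not the project's actual code): the 4chan API serves each thread as JSON, and posts with an attached image carry a tim (timestamp name) and ext (extension) field, from which the image URL can be built. The board name and thread number below are placeholders.

import json
import urllib.request

def list_thread_images(board, thread_no):
    # thread JSON from the official 4chan API
    url = "https://a.4cdn.org/{0}/thread/{1}.json".format(board, thread_no)
    with urllib.request.urlopen(url) as resp:
        thread = json.loads(resp.read().decode("utf-8"))
    for post in thread["posts"]:
        if "tim" in post:  # only posts that have an attached image
            yield "https://i.4cdn.org/{0}/{1}{2}".format(board, post["tim"], post["ext"])

if __name__ == "__main__":
    for link in list_thread_images("wg", 1234567):  # placeholder board/thread
        print(link)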
table2csv
Problem
I wanted to extract a table from an HTML page and import it into Excel, so I wanted it in CSV format, for instance.
Solution
table2csv can do exactly this. Visit the project’s page on GitHub for examples.
Note that I could only make it work under Python 2.7.
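If table2csv gives you trouble, here is a rough alternative sketch with pandas (not the table2csv project itself; it assumes pandas and lxml are installed and that the page contains a plain <table>). The URL is just a placeholder.

import pandas as pd

# read_html() returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html("https://example.com/page-with-a-table.html")  # placeholder URL
tables[0].to_csv("table.csv", index=False)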
extract e-mails from a file
Problem
You have a text file and you want to extract all the e-mail addresses from it. For research purposes, of course.
Solution
#!/usr/bin/env python3

import re
import sys

def extract_emails_from(fname):
    with open(fname, errors='replace') as f:
        for line in f:
            match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
            for e in match:
                if '?' not in e:
                    print(e)

def main():
    fname = sys.argv[1]
    extract_emails_from(fname)

##############################################################################

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("Error: provide a text file!", file=sys.stderr)
        exit(1)
    # else
    main()
I had character encoding problems with some lines, where the original program died with an exception. Opening the file with open(fname, errors='replace') replaces the problematic characters with a "?", hence the extra check before an e-mail is printed to the screen.
The core of the script is the regex to find e-mails. That tip is from here.
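For a quick sanity check of the regex on its own (the sample addresses below are made up):

import re

sample = "Write to info@example.com or to jane.doe@mail.example.org, please."
print(re.findall(r'[\w\.-]+@[\w\.-]+', sample))
# ['info@example.com', 'jane.doe@mail.example.org']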
extract all links from a file
Problem
You want to extract all links (URLs) from a text file.
Solution
import re

def extract_urls(fname):
    with open(fname) as f:
        return re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', f.read())
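A possible call, assuming the function above sits in the same file (the input file name is just a placeholder):

if __name__ == "__main__":
    for url in extract_urls('bookmarks.txt'):  # placeholder file name
        print(url)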
extract all links from a web page
Problem
You want to extract all the links from a web page. You need the links as absolute URLs, since you want to process the extracted links further.
Solution
Unix commands have a very nice philosophy: “do one thing and do it well”. Keeping that in mind, here is my link extractor:
#!/usr/bin/env python
# get_links.py

import re
import sys
import urllib
import urlparse

from BeautifulSoup import BeautifulSoup


class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'


def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)

    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        print tag['href']
# process(url)


def main():
    if len(sys.argv) == 1:
        print "Jabba's Link Extractor v0.1"
        print "Usage: %s URL [URL]..." % sys.argv[0]
        sys.exit(-1)
    # else, if at least one parameter was passed
    for url in sys.argv[1:]:
        process(url)
# main()


if __name__ == "__main__":
    main()
You can find the up-to-date version of the script here.
The script will print the links to the standard output. The output can be refined with grep for instance.
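The script above is Python 2 with the old BeautifulSoup 3 API. As a rough sketch only, a present-day equivalent could use requests and bs4 (both third-party packages, assumed to be installed); it does the same thing: fetch each page and print every link as an absolute URL.

#!/usr/bin/env python3
import sys
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def process(url):
    text = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
    soup = BeautifulSoup(text, 'html.parser')
    for tag in soup.find_all('a', href=True):
        # make every link absolute, relative to the page it was found on
        print(urljoin(url, tag['href']))

if __name__ == "__main__":
    for url in sys.argv[1:]:
        process(url)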
Troubleshooting
The HTML parsing is done with the BeautifulSoup (BS) library. If you get an error (i.e. BeautifulSoup cannot parse a tricky page), download the latest version of BS and put BeautifulSoup.py in the same directory where get_links.py is located. I had a problem with the version that came with Ubuntu 10.10, but upgrading to the latest version of BeautifulSoup solved it.
Update (20110414): To update BS, first remove the package python-beautifulsoup with Synaptic, then install the latest version from PyPI: sudo pip install beautifulsoup.
Examples
Basic usage: get all links on a given page.
./get_links.py http://www.reddit.com/r/Python
Basic usage: get all links from an HTML file. Yes, it also works on local files.
./get_links.py index.html
Number of links.
./get_links.py http://www.reddit.com/r/Python | wc -l
Filter result and keep only those links that you are interested in.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg
Eliminate duplicates.
./get_links.py http://www.beach-hotties.com/ | sort | uniq
Note: if the URL contains the special character "&", then put the URL between quotes.
./get_links.py "http://www.google.ca/search?hl=en&source=hp&q=python&aq=f&aqi=g10&aql=&oq="
Open (some) extracted links in your web browser. Here I use the script “open_in_tabs.py” that I introduced in this post. You can also download “open_in_tabs.py” here.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg | sort | uniq | ./open_in_tabs.py
Update (20110507)
You might be interested in another script called “get_images.py” that extracts all image links from a webpage. Available here.
