Archive
Static HTML file browser for Dropbox
Two of my students worked on a project that creates static HTML files for a public Dropbox folder (find it on GitHub). I use it in production; check it out here.
If you created your Dropbox account before October 4, 2012, then you are lucky and you have a Public folder. Accounts opened after this date have no Public folder :(
So, if you have a Public folder and you want to share the contents of a folder recursively, then you can try this script. It produces a file and directory listing similar to Apache's directory index output.
Screenshot
Authors and Contributors
- Kiss Sándor Ádám (main developer)
- Iváncza Csaba (junior developer)
- Jabba Laci (project idea)
Splinter: open Firefox in fullscreen mode
Problem
With Splinter you can automate a browser window (click on a button, type in some text, etc.). Besides Chrome and some other browsers, you can also drive a Firefox instance. But how do you open the Firefox instance in fullscreen (as if you had clicked on the “maximize” button)? Strangely, there is no command-line option for this :(
Solution
Well, under Linux there are some tools that allow you to interact with windows:
- xwininfo
- xdotool
- wmctrl
When the Firefox instance opens, it becomes the active window, so I query its window ID with “xdotool getactivewindow”. Then, with “wmctrl”, I can toggle this window to fullscreen.
Demonstration:
jabba@jabba-uplink:~$ xdotool getactivewindow
109051940
jabba@jabba-uplink:~$ python
Python 2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> hex(109051940)
'0x6800024'
jabba@jabba-uplink:~$ wmctrl -i -r 0x6800024 -b toggle,maximized_vert,maximized_horz
The same in Python is available in my jabbapylib library here.
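For reference, here is a minimal sketch of the same trick in Python, using the subprocess module (the function names are my own, not the ones used in jabbapylib):

#!/usr/bin/env python
# fullscreen.py -- a minimal sketch of the xdotool + wmctrl trick above
# (the function names are my own, not the ones used in jabbapylib)
import subprocess

def get_active_window_id():
    """Return the ID of the currently active window as a hex string."""
    out = subprocess.check_output(['xdotool', 'getactivewindow'])
    return hex(int(out.strip()))

def toggle_fullscreen(window_id):
    """Toggle the maximized (fullscreen-like) state of the given window with wmctrl."""
    subprocess.call(['wmctrl', '-i', '-r', window_id, '-b',
                     'toggle,maximized_vert,maximized_horz'])

if __name__ == "__main__":
    # maximize/restore whatever window is active (e.g. the freshly opened Firefox)
    toggle_fullscreen(get_active_window_id())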
Extract all links from a web page
Problem
You want to extract all the links from a web page. You need the links as absolute URLs, since you want to process the extracted links further.
Solution
Unix commands have a very nice philosophy: “do one thing and do it well”. Keeping that in mind, here is my link extractor:
#!/usr/bin/env python

# get_links.py

import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup


class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'


def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        print tag['href']
# process(url)


def main():
    if len(sys.argv) == 1:
        print "Jabba's Link Extractor v0.1"
        print "Usage: %s URL [URL]..." % sys.argv[0]
        sys.exit(-1)
    # else, if at least one parameter was passed
    for url in sys.argv[1:]:
        process(url)
# main()


if __name__ == "__main__":
    main()
You can find the up-to-date version of the script here.
The script will print the links to the standard output. The output can be refined with grep for instance.
Troubleshooting
The HTML parsing is done with the BeautifulSoup (BS) library. If you get an error (e.g. BeautifulSoup cannot parse a tricky page), download the latest version of BS and put BeautifulSoup.py in the same directory where get_links.py is located. I had a problem with the version that came with Ubuntu 10.10, but upgrading to the latest version of BeautifulSoup solved it.
Update (20110414): To update BS, first remove the package python-beautifulsoup with Synaptic, then install the latest version from PyPI: sudo pip install beautifulsoup.
Examples
Basic usage: get all links on a given page.
./get_links.py http://www.reddit.com/r/Python
Basic usage: get all links from an HTML file. Yes, it also works on local files.
./get_links.py index.html
Number of links.
./get_links.py http://www.reddit.com/r/Python | wc -l
Filter result and keep only those links that you are interested in.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg
Eliminate duplicates.
./get_links.py http://www.beach-hotties.com/ | sort | uniq
Note: if the URL contains the special character “&”, then put the URL between quotes.
./get_links.py "http://www.google.ca/search?hl=en&source=hp&q=python&aq=f&aqi=g10&aql=&oq="
Open (some) extracted links in your web browser. Here I use the script “open_in_tabs.py” that I introduced in this post. You can also download “open_in_tabs.py” here.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg | sort | uniq | ./open_in_tabs.py
Update (20110507)
You might be interested in another script called “get_images.py” that extracts all image links from a webpage. Available here.
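The approach is the same as in get_links.py, just with <img> tags. Here is a minimal sketch of the idea (this is NOT the actual get_images.py, only an illustration):

#!/usr/bin/env python
# a minimal sketch of an image-link extractor in the spirit of get_links.py
# (this is NOT the actual get_images.py, just an illustration of the idea)
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

def process(url):
    # note: a custom user agent (like MyOpener above) may be needed for some sites
    text = urllib.urlopen(url).read()
    soup = BeautifulSoup(text)
    for tag in soup.findAll('img', src=True):
        # print every image link as an absolute URL
        print urlparse.urljoin(url, tag['src'])

if __name__ == "__main__":
    for url in sys.argv[1:]:
        process(url)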
Check downloaded movies on imdb.com
Recently, I downloaded a nice pack of horror movies. The pack contained more than a hundred movies :) I wanted to see their IMDB ratings to decide which ones to watch, but typing their titles in the browser would be too much work. Could it be automated?
Solution
Each movie was located in a subdirectory. Here is an extract:
...
Subspecies.1991.DVDRip.XviD-NoGrp
Terror.Train.1980.DVDRIP.XVID-NoGrp
The.Changeling.1980.DVDRip-KooKoo
The.Creature.Walks.Among.Us.1956.DVDRip-KooKoo
The.Hills.Have.Eyes.1977.DVDRip-KooKoo
The.Howling.Special.Edition.1981.XviD.6ch-AC3-FTL
The.Monster.Club.1980.DVDRip.DivX-UTOPiA
...
Fortunately, the directories were named in a consistent way: title of the movie (words separated by dots), year, extra info. Thus, extracting the titles was very easy. The idea: collect the titles in a list and open them on imdb.com in Firefox, each in a new tab.
First, I redirected the directory list into a file. It was easier to work with a text file than to glob the directories:
ls >a.txt
And finally, here is the script:
#!/usr/bin/env python

import re
import urllib
import webbrowser

base = 'http://www.imdb.com/find?s=all'
firefox = webbrowser.get('firefox')

f1 = open('a.txt', 'r')
for line in f1:
    line = line.rstrip('\n')
    if line.startswith('#'):
        continue
    # else
    result = re.search(r'(.*)\.\d{4}\..*', line)
    if result:
        address = result.group(1).replace('.', ' ')
        url = "%s&q=%s" % (base, urllib.quote(address))
        print url
        firefox.open_new_tab(url)
        #webbrowser.open_new_tab(url)  # try this if the line above doesn't work
f1.close()
Achtung! Don’t try it with a huge list, otherwise your system will die :) Firefox won’t handle too many open tabs… Try to open around ten titles at a time. In the input file (a.txt) you can comment out lines by adding a leading ‘#’ sign; those lines will be skipped by the script.
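If you prefer not to comment out lines by hand, here is a hypothetical variation of the script above that simply stops after a fixed number of titles (the LIMIT constant is my own addition, not part of the original script):

#!/usr/bin/env python
# a hypothetical variation of the script above: open at most LIMIT titles per run
import re
import urllib
import webbrowser

base = 'http://www.imdb.com/find?s=all'
firefox = webbrowser.get('firefox')
LIMIT = 10    # roughly how many tabs Firefox handles comfortably

opened = 0
f1 = open('a.txt', 'r')
for line in f1:
    line = line.rstrip('\n')
    if line.startswith('#'):
        continue
    result = re.search(r'(.*)\.\d{4}\..*', line)
    if result:
        address = result.group(1).replace('.', ' ')
        firefox.open_new_tab("%s&q=%s" % (base, urllib.quote(address)))
        opened += 1
        if opened >= LIMIT:
            break
f1.close()

For the next batch you would still have to comment out (or remove) the lines that were already processed.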
