pdfmanip
Today I wrote a simple PDF manipulation CLI tool. You can find it here: https://github.com/jabbalaci/pdfmanip .
How old are you in days?
Problem
You want to calculate how old you are in days.
Solution
Let’s use a popular 3rd party date/time library for this purpose called pendulum.
As an example, let’s take Arnold Schwarzenegger, who was born on July 30, 1947. So let’s answer the following question: how old is Schwarzenegger today?
>>> import pendulum
>>>
>>> born = pendulum.parse("1947-07-30")
>>> born
DateTime(1947, 7, 30, 0, 0, 0, tzinfo=Timezone('UTC'))
>>> today = pendulum.now()
>>> today
DateTime(2018, 8, 14, 21, 32, 33, 248489, tzinfo=Timezone('Europe/Budapest'))
>>>
>>> diff = today - born
>>> diff
<Period [1947-07-30T00:00:00+00:00 -> 2018-08-14T21:32:33.248489+02:00]>
>>>
>>> diff.in_years()
71
>>> diff.in_words()
'71 years 2 weeks 1 day 19 hours 32 minutes 33 seconds'
>>> diff.in_days()
25948
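If you don’t want to install a third-party package, the day count can also be sketched with the standard library’s datetime module. The helper name age_in_days is hypothetical, and the "today" date is fixed to the one in the session above so that the result is reproducible.

```python
from datetime import date


def age_in_days(born: date, today: date) -> int:
    """Number of whole days between two calendar dates."""
    return (today - born).days


born = date(1947, 7, 30)
today = date(2018, 8, 14)  # fixed date taken from the session above
days = age_in_days(born, today)
print(days)
```

In real use you would pass date.today() as the second argument.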
table2csv
Problem
I wanted to extract a table from an HTML page and import it into Excel, so I needed it in CSV format.
Solution
table2csv can do exactly this. Visit the project’s page on GitHub for examples.
Note that I could only make it work under Python 2.7.
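For Python 3, the same idea can be sketched with the standard library only (html.parser and csv). This is not how table2csv works internally; the class TableParser and the function html_table_to_csv are hypothetical names, and the sketch assumes a simple table without nested tables, colspan or rowspan.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in the HTML text as a CSV string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


html = "<table><tr><th>name</th><th>age</th></tr><tr><td>Ann</td><td>30</td></tr></table>"
print(html_table_to_csv(html))
```

The resulting text can be saved to a .csv file and opened in Excel.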
MonkeyType: generate static type annotations automatically
In my previous post I wrote about mypy that can check type annotations.
MonkeyType (from Instagram) is a project that generates static type annotations by collecting runtime types. It can be a great help if you don’t want to do it manually.
I tried it and it works surprisingly well! However, the idea is to generate type annotations with MonkeyType and review the hints. “MonkeyType’s annotations are an informative first draft, to be checked and corrected by a developer.”
Read the README of the project for an example.
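To get a feeling for what “collecting runtime types” means, here is a toy sketch. It is not MonkeyType’s mechanism or API; the decorator record_types and the dictionary observed are hypothetical names, and only positional arguments are handled.

```python
import functools
from collections import defaultdict

# function name -> set of (argument type names, return type name) seen at runtime
observed = defaultdict(set)


def record_types(func):
    """Remember the types of the arguments and of the result on every call."""
    @functools.wraps(func)
    def wrapper(*args):
        result = func(*args)
        arg_types = tuple(type(a).__name__ for a in args)
        observed[func.__name__].add((arg_types, type(result).__name__))
        return result
    return wrapper


@record_types
def add(a, b):
    return a + b


add(1, 2)
add("x", "y")
print(observed["add"])
```

From such a log, a tool could propose a draft annotation for add, which a developer would then review.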
mypy — optional static typing
Problem
Python is awesome, it’s my favourite programming language (what a surprise :)). Dynamic typing allows one to code very quickly. However, if you have a large codebase (and I ran into this problem with my latest project JiVE), things start to get complicated. Let’s take an example:
def extract_images(html):
    lst = []
    ...
    return lst
If I look at this code, then what is “html”? Is it raw HTML (string), or is it a BeautifulSoup object? What is the return value? What is in the returned list? Does it return a list of URLs (a list of strings)? Or is it a list of custom objects that wrap the URLs?
If we used a statically-typed language (e.g. Java), these questions wouldn’t even arise since we could see the types of the parameters and the return value.
Solution
I’ve heard about mypy but I never cared to try it. Until now… “Mypy is an experimental optional static type checker for Python that aims to combine the benefits of dynamic (or “duck”) typing and static typing. Mypy combines the expressive power and convenience of Python with a powerful type system and compile-time type checking.” (source)
Thus, the example above could be written like this:
from typing import List

from bs4 import BeautifulSoup
from myimage import Image

def extract_images(html: BeautifulSoup) -> List[Image]:
    lst = []
    ...
    return lst
And now everything is clear and there is no need to carefully analyse the code to figure out what goes in and what comes out.
This syntax is not brand new: function annotations have existed since Python 3.0, the typing module arrived in Python 3.5, and variable annotations were added in Python 3.6. If you run such an annotated program, the Python interpreter does not enforce these hints, so Python won’t become a statically-typed language. If you want to make these hints work, you need to use “mypy”, which is a linter-like static analyzer. Example:
$ pip install mypy    # you can also install it globally with sudo
$ mypy program.py     # verify a file
$ mypy src/           # verify every file in a folder
With mypy you can analyse individual files and you can also analyse every file in a folder.
If you get warnings that mypy doesn’t find certain modules, then run mypy with these options:
$ mypy program.py --ignore-missing-imports --follow-imports=skip
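The difference between running a program and checking it can be seen on a tiny example. The function double is hypothetical; the point is that the interpreter happily executes a call that contradicts the hint, while a type checker such as mypy is expected to report the call as an error.

```python
def double(x: int) -> int:
    return x * 2


# The hint says int, but nothing stops us from passing a str at runtime:
# str * 2 is valid Python, so this call succeeds.
result = double("ab")
print(result)

# The hints are only stored as metadata on the function object.
print(double.__annotations__)
```

Save it as a file and run mypy on it to see the complaint about the argument type.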
IDE support
I highly recommend using PyCharm for large(r) projects. PyCharm has its own implementation of a static type checker. You can use type hints out of the box and PyCharm will tell you if there’s a problem. It’s a good idea to combine PyCharm with Mypy, i.e. when you edit the code in the IDE, run mypy from time to time in the terminal too.
Getting started
To get started, I suggest watching/reading these resources:
- Static types for Python, PyCon 2017
- Putting Type Hints to Work in PyCharm
- further YouTube videos
- Mypy HQ
- mypy cheat sheet
- Using mypy with an existing codebase
- Static types in Python, oh my(py)! (blog post about the Zulip project’s experiences with adopting mypy)
It’s really easy to get started with mypy. It took me only 2 days to fully annotate my project JiVE. Now the code is much easier to understand IMO.
Tips
If you have a large un-annotated codebase, proceed from bottom up. Start to annotate files that are leaf nodes in the dependency tree/graph. Start with modules that are most used by others (e.g. helper.py, utils.py, common.py). Then proceed upwards until you reach the main file (that I usually call main.py, which is the entry point of the whole project).
You don’t need to annotate everything. Type hints are optional. The more you add, the better, but if there’s a function that you find difficult to annotate, just skip it and come back to it later.
Annotate the function signatures (type of arguments, type of the return value). Inside a function I don’t annotate every variable. If mypy drops a warning and says a variable should be annotated, then I do it.
Sometimes mypy drops an error on a line but you don’t want to annotate it. In this case you can add a special comment to tell mypy to ignore this line:
... # type: ignore
If a function’s signature has no type hints at all, mypy will skip it. If you want mypy to check that function, then add at least one type hint to it. You can add for instance the return type. If the function is a procedure, i.e. it has no return value, then indicate None as the returned type:
def hello() -> None:
    print("hello")
You can add type hints later. That is, you can write your project first, test it, and when it works fine, you can add type hints at the end.
When to use mypy?
For a small script it may not be necessary but it could add a lot to a large(r) project.
Notes
If you read older blog posts, you may find that they mention the package “mypy-lang”. It’s old. Install the package “mypy” and forget “mypy-lang”. More info here.
fold / unfold URLs
Problem
When you visit a gallery, very often the URLs follow a pattern. For instance:
http://www.website.com/001.jpg, http://www.website.com/002.jpg, …, http://www.website.com/030.jpg. There is a sequence: [001-030]. Thus, these URLs can be represented in a compact way: http://www.website.com/ [001-030].jpg (without space). I call it a sequence URL.
There are two challenges here:
- Having a sequence URL, restore all the URLs. We can call it unpacking / unfolding.
- The opposite of the previous: having a list of URLs (that follow a pattern), compress them to a sequence URL. We can call it folding.
I met this challenge when I was working with URLs but it can be generalized to strings.
Unfolding
I wrote an algorithm for this (see below) but then I found a module that does it better. I posed my question on Reddit and got a very good answer (see here). It was suggested that I should use the ClusterShell project. This project was made for administering Linux clusters. We have nothing to do with Linux clusters, but it contains an implementation of string folding / unfolding that we can re-use here.
Installation is trivial: “pip install clustershell“.
Then, I made a wrapper function for unfolding:
from ClusterShell.NodeSet import NodeSet

def unfold_sequence_url(text):
    """
    From a sequence URL restore all the URLs (unpack, unfold).

    Input: "node[1-3]"
    Output: ["node1", "node2", "node3"]
    """
    # Create a new nodeset from string
    nodeset = NodeSet(text)
    res = [str(node) for node in nodeset]
    return res
Folding
Here is another wrapper function for folding:
from ClusterShell.NodeSet import NodeSet

def fold_urls(lst):
    """
    Now the input is a list of URLs
    that we want to compress (fold) to a sequence URL.

    Example:
    Input: ["node1", "node2", "node3"]
    Output: "node[1-3]"
    """
    res = NodeSet.fromlist(lst)  # it's a ClusterShell.NodeSet.NodeSet object
    return str(res)
My own implementation (old)
Naively, I implemented the unfolding since I didn’t know about ClusterShell. I put it here, but I suggest you should use ClusterShell (see above).
#!/usr/bin/env python3

"""
Unpack a sequence URL.

How it works:

First Gallery Image: http://www.website.com/001.jpg
Last Gallery Image: http://www.website.com/030.jpg
Sequence: [001-030]
Sequence URL: http://www.website.com/[001-030].jpg

From the sequence URL we restore the complete list of URLs.
"""

import re

from jive import mylogging as log


def is_valid_sequence_url(url, verbose=True):
    # raw string: \[ and \] are regex escapes, not string escapes
    lst = re.findall(r"\[(.+?)-(.+?)\]", url)
    # print(lst)
    if len(lst) == 0:
        if verbose: log.warning(f"no sequence was found in {url}")
        return False
    if len(lst) > 1:
        if verbose: log.warning(f"several sequences were found in {url} , which is not supported")
        return False
    # else, if len(lst) == 1
    return True


def get_urls_from_sequence_url(url, statusbar=None):
    res = []

    if not is_valid_sequence_url(url):
        return []

    m = re.search(r"\[(.+?)-(.+?)\]", url)
    if m:
        start = m.group(1)
        end = m.group(2)
        prefix = url[:url.find('[')]
        postfix = url[url.find(']')+1:]
        zfill = start.startswith('0') or end.startswith('0')
        # print(url)
        # print(prefix)
        # print(postfix)
        if zfill and (len(start) != len(end)):
            log.warning(f"start and end sequences in {url} must have the same lengths if they are zero-filled")
            return []
        # else
        length = len(start)
        if start.isdigit() and end.isdigit():
            start = int(start)
            end = int(end)
            for i in range(start, end+1):
                middle = i
                if zfill:
                    middle = str(i).zfill(length)
                curr = f"{prefix}{middle}{postfix}"
                res.append(curr)
            # endfor
        # endif
    # endif

    return res

##############################################################################

if __name__ == "__main__":
    url = "http://www.website.com/[001-030].jpg"    # for testing

    urls = get_urls_from_sequence_url(url)
    for url in urls:
        print(url)
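The opposite direction (folding) can be sketched by hand too, using only the standard library. This is neither ClusterShell’s algorithm nor part of JiVE; the function name fold_consecutive_urls is hypothetical. It assumes that the URLs differ only in their last run of digits and that the numbers are consecutive and in ascending order. In any other case it gives up and returns None.

```python
import re

# prefix, last run of digits, non-digit rest (e.g. ".jpg")
_LAST_NUMBER = re.compile(r"^(.*?)(\d+)(\D*)$")


def fold_consecutive_urls(urls):
    """Fold e.g. .../001.jpg ... .../030.jpg into .../[001-030].jpg, or return None."""
    matches = [_LAST_NUMBER.match(u) for u in urls]
    if not matches or any(m is None for m in matches):
        return None

    prefix, _, postfix = matches[0].groups()
    numbers = [m.group(2) for m in matches]

    # every URL must share the text before and after the number
    same_frame = all(m.group(1) == prefix and m.group(3) == postfix for m in matches)
    # the numbers must form an unbroken ascending run
    values = [int(n) for n in numbers]
    consecutive = values == list(range(values[0], values[0] + len(values)))
    # zero-padded numbers must all have the same width
    zero_padded = any(n.startswith("0") and len(n) > 1 for n in numbers)
    same_width = len({len(n) for n in numbers}) == 1

    if not same_frame or not consecutive or (zero_padded and not same_width):
        return None
    return f"{prefix}[{numbers[0]}-{numbers[-1]}]{postfix}"


gallery = [f"http://www.website.com/{i:03d}.jpg" for i in range(1, 31)]
print(fold_consecutive_urls(gallery))
```

ClusterShell handles far more cases (gaps, several ranges), so this sketch is only useful for the simple gallery pattern.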
Links
Update
It turned out that ClusterShell doesn’t install on Windows. However, I could extract that part of it which does the (un)folding. Read this ticket for more info. The extracted part works on Windows too.
pip install --user
Problem
When we install something with pip, usually we do a “sudo pip install pkg_name“. However, there are some problems with this approach. First, you need root privileges. Second, it installs the package globally, which can cause conflicts in the system. Is there a way to install something with pip locally?
Solution
The good news is that you can install a package with pip locally too. Under Linux the destination folder by default is ~/.local . Add the following line to the end of your ~/.bashrc :
export PATH=~/.local/bin:$PATH
Then install the package locally. For instance, let’s install pipenv:
$ pip install pipenv --user
Open a new terminal (thus ~/.bashrc is read), and launch pipenv. It should be available. Let’s check where it is:
$ which pipenv
/home/jabba/.local/bin/pipenv
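You can also ask Python itself where user installs go, using the standard site module. The sketch below assumes a Linux-like layout where scripts land in a bin folder under the user base; the function name user_install_info is hypothetical.

```python
import os
import site


def user_install_info():
    """Return the per-user install locations and whether the bin dir is on PATH."""
    user_base = site.getuserbase()          # ~/.local by default on Linux
    user_bin = os.path.join(user_base, "bin")
    path_dirs = os.environ.get("PATH", "").split(os.pathsep)
    return {
        "user_base": user_base,
        "user_site": site.getusersitepackages(),
        "user_bin": user_bin,
        "bin_on_path": user_bin in path_dirs,
    }


info = user_install_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

If bin_on_path is False, the export line above is probably missing from your ~/.bashrc, or the terminal was opened before you added it.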
pynt: a lightweight build tool, written in Python
Problem
I mainly work under Linux and when I write a Python program, I don’t care if it runs on other platforms or not. Does it work for me? Good :) So far I haven’t really used any build tools. If I needed something, I solved it with a Bash script.
However, a few weeks ago I started to work on a larger side project (JiVE Image Viewer) and I wanted to make it portable from the beginning. Beside Linux, it must also work on Windows (on Mac I couldn’t try it).
Now, if I want to automate some build task (e.g. creating a standalone executable from the project), a Bash script is not enough as it doesn’t run under Windows. Should I write the same thing in a .bat file? Hell, no! Should I install Cygwin on all my Windows machines? No way! It’s time to start using a build tool. The time has finally come.
Solution
There are tons of build tools. I wanted something very simple with which I can do some basic tasks: run an external command, create a directory, delete a directory, move a file, move a directory, etc. As I am most productive in Python, I wanted a build tool that I can program in pure Python. And I wanted something simple that I can start using right away without reading tons of docs.
And this is how I found pynt. Some of its features:
- “easy to learn
- build tasks are just python functions
- manages dependencies between tasks
- automatically generates a command line interface
- supports python 2.7 and python 3.x” (source)
Just create a file called build.py in your project’s root folder and invoke the build tool with the command “pynt“.
My project is in a virtual environment. First I installed pynt in the virt. env.:
$ pip install pynt
Here you can find an example that I wrote for JiVE.
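The basic operations mentioned above (run an external command, create / delete a directory, move a file) are all available in the standard library, and this is what makes a pure Python build script portable. Below is a sketch of such helpers that the task functions in a build.py could call. It is not pynt’s API, and the function names are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def make_dir(path):
    """Create a directory (and its parents); do nothing if it already exists."""
    Path(path).mkdir(parents=True, exist_ok=True)


def remove_dir(path):
    """Delete a directory tree; do nothing if it does not exist."""
    shutil.rmtree(path, ignore_errors=True)


def move(src, dst):
    """Move a file or a directory."""
    shutil.move(str(src), str(dst))


def run(cmd):
    """Run an external command given as a list of strings; raise if it fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    make_dir("dist")
    run([sys.executable, "--version"])   # works the same on Linux and Windows
    remove_dir("dist")
```

Passing the command as a list (and not as one shell string) avoids quoting differences between Bash and the Windows command prompt.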
Update (20180628)
I had a little contribution to the project: https://github.com/rags/pynt/pull/17. If the name of a task starts with an underscore, then it’s a hidden task, thus it won’t appear in the auto-generated docs. This way you can easily hide sub-tasks.
Convert a nested OrderedDict to normal dict
Problem
You have a nested OrderedDict object and you want to convert it to a normal dict.
Today I was playing with the configparser module. It reads an .ini file and builds a dict-like object. However, I prefer normal dict objects. With a configparser object’s “._sections” you can access the underlying dictionary object, but it’s a nested OrderedDict object.
Example:
; preferences.ini
[GENERAL]
onekey = "value in some words"
[SETTINGS]
resolution = '1024 x 768'
import configparser
from pprint import pprint
config = configparser.ConfigParser()
config.read("preferences.ini")
pprint(config._sections)
Sample output:
OrderedDict([('GENERAL', OrderedDict([('onekey', '"value in some words"')])),
('SETTINGS', OrderedDict([('resolution', "'1024 x 768'")]))])
Solution
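JSON to the rescue! Serialize the nested OrderedDict to a JSON string, then parse that string back. The parser builds plain dict objects at every level, so the OrderedDict type is gone. Voilà, you have a plain dict object. Note that this only works if every value is JSON-serializable, which is the case for the strings that configparser stores.
import json

def to_dict(config):
    """
    Nested OrderedDict to normal dict.
    """
    return json.loads(json.dumps(config))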
Output:
{'GENERAL': {'onekey': '"value in some words"'},
'SETTINGS': {'resolution': "'1024 x 768'"}}
As you can see, quotes around string values are kept by configparser. If you want to remove them, see my previous post.
I found this solution here @ SO.
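Two alternatives that avoid the JSON round trip can be sketched as follows. The first one converts any nested mapping recursively, so it also works with values that JSON cannot represent. The second one uses only the public configparser API instead of the private “._sections” attribute. The function names to_plain_dict and config_to_dict are hypothetical, and the .ini content is inlined with read_string so that the example is self-contained.

```python
import configparser
from collections import OrderedDict

INI_TEXT = """
[GENERAL]
onekey = "value in some words"

[SETTINGS]
resolution = '1024 x 768'
"""


def to_plain_dict(obj):
    """Recursively turn (Ordered)dicts into plain dicts; leave other values alone."""
    if isinstance(obj, dict):
        return {key: to_plain_dict(value) for key, value in obj.items()}
    return obj


def config_to_dict(config):
    """Build a plain nested dict from a ConfigParser via its public API."""
    return {section: dict(config[section]) for section in config.sections()}


config = configparser.ConfigParser()
config.read_string(INI_TEXT)

nested = OrderedDict([("a", OrderedDict([("b", 1)]))])
print(to_plain_dict(nested))
print(config_to_dict(config))
```

Keep in mind that accessing values through the public API applies configparser’s interpolation, so values containing a percent sign may behave differently than in the raw “._sections” data.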
