fold / unfold URLs
Problem
When you browse a gallery, the image URLs very often follow a pattern. For instance:
http://www.website.com/001.jpg, http://www.website.com/002.jpg, …, http://www.website.com/030.jpg. The varying part is a sequence: [001-030]. Thus, these URLs can be represented compactly as http://www.website.com/[001-030].jpg. I call this a sequence URL.
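To make the pattern concrete, here is a tiny sketch that generates the 30 URLs behind the sequence URL above. The names `base` and `urls` are just placeholders for this illustration; the key point is that the counter is zero-padded to three digits.

```python
# Example gallery from the text: 30 images named 001.jpg ... 030.jpg.
base = "http://www.website.com/"

# {i:03d} zero-pads the counter to a width of 3, matching [001-030].
urls = [f"{base}{i:03d}.jpg" for i in range(1, 31)]

print(urls[0])    # http://www.website.com/001.jpg
print(urls[-1])   # http://www.website.com/030.jpg
print(len(urls))  # 30
```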
There are two challenges here:
- Given a sequence URL, restore all the URLs. We can call this unpacking / unfolding.
- The opposite: given a list of URLs that follow a pattern, compress them into a sequence URL. We can call this folding.
I ran into this problem while working with URLs, but it generalizes to arbitrary strings.
Unfolding
I wrote my own algorithm for this (see below), but later I found a module that does it better. I posted my question on Reddit and got a very good answer: it was suggested that I use the ClusterShell project. ClusterShell was made for administering Linux clusters. We have nothing to do with Linux clusters here, but it contains an implementation of string folding / unfolding that we can reuse.
Installation is trivial: "pip install clustershell".
Then, I made a wrapper function for unfolding:
from ClusterShell.NodeSet import NodeSet

def unfold_sequence_url(text):
    """
    From a sequence URL restore all the URLs (unpack, unfold).
    Input: "node[1-3]"
    Output: ["node1", "node2", "node3"]
    """
    # Create a new nodeset from string
    nodeset = NodeSet(text)
    res = [str(node) for node in nodeset]
    return res
Folding
Here is another wrapper function for folding:
from ClusterShell.NodeSet import NodeSet

def fold_urls(lst):
    """
    Now the input is a list of URLs
    that we want to compress (fold) to a sequence URL.
    Example:
    Input: ["node1", "node2", "node3"]
    Output: "node[1-3]"
    """
    res = NodeSet.fromlist(lst)  # it's a ClusterShell.NodeSet.NodeSet object
    return str(res)
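For the simple gallery case, folding can also be sketched in pure Python, without ClusterShell. The following is only a minimal sketch under strong assumptions: the URLs differ in exactly one run of digits, the numbers are contiguous, and zero padding (if any) is consistent. The function name `fold_urls_simple` is hypothetical and is not part of ClusterShell; it returns None whenever these assumptions do not hold.

```python
import re


def fold_urls_simple(urls):
    """Fold URLs that differ in exactly one digit run into a sequence URL.

    Hypothetical helper (not ClusterShell). Returns None if the URLs
    do not form one contiguous, consistently padded sequence.
    """
    if not urls:
        return None
    # Split each URL into alternating text / digit tokens.
    # With a capture group, the digit runs land at the odd indexes.
    split = [re.split(r"(\d+)", u) for u in urls]
    first = split[0]
    if any(len(parts) != len(first) for parts in split):
        return None
    # Token positions at which the URLs differ from each other.
    varying = [i for i in range(len(first))
               if any(parts[i] != first[i] for parts in split)]
    if not varying:
        return urls[0]  # all URLs are identical, nothing to fold
    if len(varying) != 1 or varying[0] % 2 == 0:
        return None  # must differ in exactly one digit token
    pos = varying[0]
    tokens = [parts[pos] for parts in split]
    numbers = sorted(set(int(t) for t in tokens))
    lo, hi = numbers[0], numbers[-1]
    if numbers != list(range(lo, hi + 1)):
        return None  # there are gaps in the sequence
    width = len(tokens[0])
    padded = any(len(t) > 1 and t.startswith("0") for t in tokens)
    if padded and any(len(t) != width for t in tokens):
        return None  # inconsistent zero padding
    start = str(lo).zfill(width) if padded else str(lo)
    end = str(hi).zfill(width) if padded else str(hi)
    prefix = "".join(first[:pos])
    postfix = "".join(first[pos + 1:])
    return f"{prefix}[{start}-{end}]{postfix}"


gallery = [f"http://www.website.com/{i:03d}.jpg" for i in range(1, 31)]
print(fold_urls_simple(gallery))  # http://www.website.com/[001-030].jpg
```

Unlike this sketch, ClusterShell is a maintained library that handles far more cases, so I would still prefer it whenever it is available.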
My own implementation (old)
Before I knew about ClusterShell, I implemented the unfolding myself. I include it here for reference, but I suggest using ClusterShell instead (see above).
#!/usr/bin/env python3
"""
Unpack a sequence URL.
How it works:
First Gallery Image: http://www.website.com/001.jpg
Last Gallery Image: http://www.website.com/030.jpg
Sequence: [001-030]
Sequence URL: http://www.website.com/[001-030].jpg
From the sequence URL we restore the complete list of URLs.
"""
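import re

# mylogging is a project-specific logging helper; to run this script
# standalone, "import logging as log" provides the same log.warning() call.
from jive import mylogging as log


def is_valid_sequence_url(url, verbose=True):
    # raw string, so that "\[" is not treated as an invalid string escape
    lst = re.findall(r"\[(.+?)-(.+?)\]", url)
    if len(lst) == 0:
        if verbose:
            log.warning(f"no sequence was found in {url}")
        return False
    if len(lst) > 1:
        if verbose:
            log.warning(f"several sequences were found in {url}, which is not supported")
        return False
    # else, if len(lst) == 1
    return True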
def get_urls_from_sequence_url(url, statusbar=None):
    res = []
    if not is_valid_sequence_url(url):
        return []

    m = re.search(r"\[(.+?)-(.+?)\]", url)
    if m:
        start = m.group(1)
        end = m.group(2)
        prefix = url[:url.find('[')]
        postfix = url[url.find(']')+1:]
        # a leading zero on either bound means the numbers are zero-filled
        zfill = start.startswith('0') or end.startswith('0')
        if zfill and (len(start) != len(end)):
            log.warning(f"start and end sequences in {url} must have the same lengths if they are zero-filled")
            return []
        # else
        length = len(start)
        if start.isdigit() and end.isdigit():
            start = int(start)
            end = int(end)
            for i in range(start, end+1):
                middle = i
                if zfill:
                    middle = str(i).zfill(length)
                curr = f"{prefix}{middle}{postfix}"
                res.append(curr)
            # endfor
        # endif
    # endif
    return res
##############################################################################
if __name__ == "__main__":
    url = "http://www.website.com/[001-030].jpg"  # for testing
    urls = get_urls_from_sequence_url(url)
    for url in urls:
        print(url)
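My implementation above refuses URLs that contain several sequences. As a possible extension (not what the code above does), here is a minimal sketch that expands every [start-end] range in the URL by combining them with itertools.product. The name `unfold_multi` is hypothetical, and the sketch assumes purely numeric ranges; as above, zero-filled bounds must have the same length.

```python
import itertools
import re

# One numeric range such as [001-030]; the two groups capture the bounds.
RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def unfold_multi(url):
    """Expand all [start-end] ranges in url (hypothetical helper)."""
    # Splitting on a pattern with two groups yields:
    # text, start, end, text, start, end, ..., text
    parts = RANGE.split(url)
    literals = parts[0::3]
    ranges = []
    for start, end in zip(parts[1::3], parts[2::3]):
        zero_filled = start.startswith("0") or end.startswith("0")
        if zero_filled and len(start) != len(end):
            raise ValueError(f"zero-filled bounds differ in length: {url}")
        width = len(start) if zero_filled else 0  # zfill(0) changes nothing
        ranges.append([str(i).zfill(width)
                       for i in range(int(start), int(end) + 1)])
    result = []
    # product() iterates over every combination; the first range varies slowest.
    for combo in itertools.product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        result.append("".join(pieces))
    return result


for u in unfold_multi("http://www.website.com/[1-2]/[01-03].jpg"):
    print(u)
```

A URL without any range comes back unchanged as a one-element list, since the product of zero ranges contains exactly one empty combination.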
Update
It turned out that ClusterShell doesn't install on Windows. However, I was able to extract the part of it that does the (un)folding. Read this ticket for more info. The extracted part works on Windows too.
