
classification
Title: urllib.parse.urljoin is broken in python 3.5
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Pavel Ivanov, berker.peksag, ezio.melotti, iritkatriel, kilowu, martin.panter, orsenthil, xtreak
Priority: normal Keywords: 3.5regression

Created on 2015-10-14 11:49 by Pavel Ivanov, last changed 2022-04-11 14:58 by admin.

Messages (5)
msg252986 - (view) Author: Pavel Ivanov (Pavel Ivanov) Date: 2015-10-14 11:49
urllib.parse.urljoin no longer conforms to RFC 1808 when joining relative URLs that contain ‘..’ path components.

Examples:

Python 3.4: 
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/..'
Python 3.5:
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/'

Python 3.4: 
>>> urllib.parse.urljoin('a/', '..')
''
Python 3.5:
>>> urllib.parse.urljoin('a/', '..')
'/'

Python 3.4: 
>>> urllib.parse.urljoin('a/', '../..')
'..'
Python 3.5:
>>> urllib.parse.urljoin('a/', '../..')
'/'

Python 3.4 conforms to RFC 1808 in these scenarios, but Python 3.5 does not.
msg253031 - (view) Author: Wei Wu (kilowu) * Date: 2015-10-15 07:00
This is a deliberate change in 3.5: resolution of relative URLs now conforms to RFC 3986. See https://bugs.python.org/issue22118 for details.
msg253032 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-15 07:05
See also this change:

changeset:   95683:fc0e79387a3a
user:        Berker Peksag <[email protected]>
date:        Thu Apr 16 02:31:14 2015 +0300
files:       Lib/test/test_urlparse.py Lib/urllib/parse.py Misc/NEWS
description:
Issue #23703: Fix a regression in urljoin() introduced in 901e4e52b20a.

Patch by Demian Brecht.
msg253067 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-16 02:19
It is true that 3.5 is meant to follow RFC 3986, which obsoletes RFC 1808 and specifies slightly different behaviour for abnormal cases. This change is documented under urljoin(), and also in “What’s New in 3.5”. Pavel’s first case is one of these differences in the RFCs, and I don’t think it is a bug. According to <https://tools.ietf.org/html/rfc3986.html#section-5.2.4>,

“The remove_dot_segments algorithm respects [the base’s] hierarchy by removing extra dot-segments rather than treating them as an error or leaving them to be misinterpreted by dereference implementations.”

For Pavel’s second and third cases, RFC 3986 doesn’t cover them directly because the base URL is relative. The RFC only covers absolute base URLs, which start with a scheme like “http:”. The documentation doesn’t really bless these cases either: ‘Construct a full (“absolute”) URL’. However there is explicit support in the source code ("" in urllib.parse.uses_relative).
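A quick way to check the "explicit support" mentioned above is to look at the module-level list directly. This is only a sketch; the urljoin() result it prints is deliberately not stated here, because it differs between Python versions (as the examples in this report show).

```python
import urllib.parse

# The empty scheme is listed, so a scheme-less base such as "a/" is
# resolved against the reference instead of being passed over.
print("" in urllib.parse.uses_relative)

# The result depends on the Python version in use.
print(repr(urllib.parse.urljoin("a/", "..")))
```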

It looks like 3.5 is strict in following the RFC’s Remove Dot Segments algorithm. Step 2C says that for “/../” or “/..”, the parent segment is removed, but the input is always replaced with “/”:

“a/..” → “/”
“a/../..” → “/..” → “/”

I would prefer a less strict interpretation of the spirit of the algorithm. Do not introduce a slash in the input if you did not remove one from the output buffer:

“a/..” → empty URL
“a/../..” → “..” → empty URL
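To make the two readings of step 2C concrete, here is a minimal sketch of the Remove Dot Segments algorithm from RFC 3986 section 5.2.4. It is not the code in urllib.parse; the function name and the `lenient` flag are hypothetical, and the lenient branch is only one possible reading of the "do not introduce a slash" idea (it drops the slash only when the segment removed from the output buffer had no leading slash). The input is the already-merged path, e.g. "a/.." for base "a/" and reference "..".

```python
def remove_dot_segments(path, lenient=False):
    """Sketch of RFC 3986 section 5.2.4 (not urllib's implementation).

    lenient=True is an assumed reading of the proposal in this message.
    """
    inp, out = path, []
    while inp:
        if inp.startswith("../"):                      # step 2A
            inp = inp[3:]
        elif inp.startswith("./"):                     # step 2A
            inp = inp[2:]
        elif inp.startswith("/./"):                    # step 2B
            inp = inp[2:]
        elif inp == "/.":                              # step 2B
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":   # step 2C
            rest = inp[4:]  # whatever follows the dot-segment
            popped = out.pop() if out else None
            if lenient and popped is not None and not popped.startswith("/"):
                # No slash was removed from the output, so add none back.
                inp = rest
            else:
                # RFC behaviour: the prefix is always replaced with "/".
                inp = "/" + rest
        elif inp in (".", ".."):                       # step 2D
            inp = ""
        else:                                          # step 2E
            # Move the first segment (with its leading "/", if any).
            cut = inp.find("/", 1)
            if cut == -1:
                out.append(inp)
                inp = ""
            else:
                out.append(inp[:cut])
                inp = inp[cut:]
    return "".join(out)


for merged in ("a/..", "a/../..", "/a/b/c/./../../g"):
    strict = remove_dot_segments(merged)
    relaxed = remove_dot_segments(merged, lenient=True)
    print(repr(merged), repr(strict), repr(relaxed))
```

With this sketch the strict mode gives "/" for both relative inputs and the lenient mode gives the empty string, while paths that start with a slash are unaffected by the flag.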

Python 3.4 and earlier did not behave sensibly if you extend the relative URL:

>>> urljoin("a/", "..")
''
>>> urljoin("a/", "../..")
'..'
>>> urljoin("a/", "../../..")
''
>>> urljoin("a/", "../../../..")
'../'

Pavel, what behaviour would you expect in these cases? My empty URL interpretation, or perhaps a more sensible version of the Python 3.4 behaviour? What is your use case?

I also noticed a related and (IMO) more serious regression compared to 3.4, where part of the path becomes a host name:

>>> urljoin("file:///base", "/dummy/..//host/oops")
'file://host/oops'
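The reason this matters can be shown by parsing the reported result back. The sketch below uses only urlsplit(), and does not call urljoin(), since the joined result may differ between Python versions. The second URL is my own illustration of an unambiguous spelling, not something taken from the report.

```python
from urllib.parse import urlsplit

# The string reported above: "host" is read back as the authority.
reported = urlsplit("file://host/oops")
print(repr(reported.netloc), repr(reported.path))

# With an explicit empty authority, the double-slash path survives.
unambiguous = urlsplit("file:////host/oops")
print(repr(unambiguous.netloc), repr(unambiguous.path))
```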
msg407502 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-12-01 23:03
See also 37235, 40594.
History
Date                 User           Action  Args
2022-04-11 14:58:22  admin          set     github: 69589
2021-12-02 11:20:59  vstinner       set     nosy: - vstinner
2021-12-01 23:03:53  iritkatriel    set     nosy: + iritkatriel
                                            messages: + msg407502
2018-09-23 15:30:06  xtreak         set     nosy: + xtreak
2016-01-04 03:37:01  ezio.melotti   set     nosy: + ezio.melotti
2015-10-16 02:19:30  martin.panter  set     messages: + msg253067
                                            components: - Interpreter Core
2015-10-15 07:37:17  berker.peksag  set     keywords: + 3.5regression
2015-10-15 07:06:42  vstinner       set     nosy: + orsenthil, berker.peksag, martin.panter
2015-10-15 07:05:48  vstinner       set     nosy: + vstinner
                                            messages: + msg253032
2015-10-15 07:00:23  kilowu         set     nosy: + kilowu
                                            messages: + msg253031
2015-10-14 11:49:00  Pavel Ivanov   create