What happened?
Description
I originally created an issue about this on the SEOmatic plugin repo, where it was suggested that this should maybe be resolved upstream.
In short, I believe I've found evidence that Craft should be encoding every URL for publicly available elements.
This can cause problems especially where non-ASCII characters are used.
This isn't obviously a problem for browser users, as browsers do a good job of showing a decoded version in the address bar while actually making the request to the encoded version of the URL.
But for crawlers, bots, services and tools it will be a bigger deal.
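To illustrate the difference between the decoded form a browser displays and the encoded form it requests, here is a minimal Python sketch using the standard library. It is not how Craft or SEOmatic build URLs; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# The slug as an author would type it, containing a non-ASCII character.
raw_path = "/genêve"

# Percent-encode the UTF-8 bytes of the path; "/" is left untouched by default.
encoded_path = quote(raw_path)

# Decoding gives back the human-readable form shown in an address bar.
decoded_path = unquote(encoded_path)

print(encoded_path)  # /gen%C3%AAve
print(decoded_path)  # /genêve
```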
Slightly longer version:
The main issue I was highlighting in my SEOmatic issue is that the plugin likely calls a Craft url function, which is used to generate the canonical link element and a canonical header.
However, nginx does not support non-ASCII characters in response headers, so we actually get a different URL there, which results in a 404 if you were to visit it.
As an example, take an entry with SEOmatic enabled which has the URL /genêve:
- the link element - uses /genêve - which does not match t
- the link response header - uses /genêve which 404s
- the page itself - available on /gen%C3%AAve
I think this has probably been fine for the most part for the longest time, as most crawlers likely detect non-ASCII characters and parse them themselves. It was originally highlighted to me when a third-party SEO crawler tool noticed that the SEOmatic response header did not match the canonical link element value, as it thought the header was pointing to a different destination... and it was! Because the nginx header was invalid.
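As a rough sketch of the kind of check such a crawler might run (an assumption on my part, not the actual tool's logic), the Python below normalizes two canonical values to their encoded form before comparing them, and flags values that are not safe to emit in a response header as-is. The helper names `normalize`, `same_destination` and `header_safe` are hypothetical.

```python
from urllib.parse import quote, unquote


def normalize(path: str) -> str:
    # Decode first so an already-encoded path is not encoded twice,
    # then percent-encode everything except "/".
    return quote(unquote(path), safe="/")


def same_destination(a: str, b: str) -> bool:
    # Two canonical values point at the same path if their
    # normalized (encoded) forms are identical.
    return normalize(a) == normalize(b)


def header_safe(value: str) -> bool:
    # A conservative check: treat only pure-ASCII values as safe
    # to place in a response header.
    return value.isascii()


print(same_destination("/genêve", "/gen%C3%AAve"))  # True
print(header_safe("/genêve"))                       # False
print(header_safe("/gen%C3%AAve"))                  # True
```

If every canonical value were emitted in its encoded form, both checks would pass without any normalization on the consumer's side.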
Though the specs referenced in my other issue, which most systems use as a guide, do seem to suggest that all URLs should be encoded so other systems can pick them up.
I think this goes for the link element and response headers, but potentially any a element's href value too, given that a non-browser system (e.g. a crawler) could pick that up.
There is lots more data to digest in the linked issue above for your reading leisure!
relevant links:
Steps to reproduce
- install/enable SEOmatic
- create an entry with a non-ASCII character in its slug
- check out the canonical link element and the canonical response header
Expected behavior
When a non-ASCII character is used in an element's slug, Craft should return an encoded URL wherever the url function is used, e.g. in the examples above: SEOmatic canonical link elements, response headers, a element href values, etc.
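To make the expected behavior concrete, here is a hedged Python sketch of encoding a full URL so that repeated calls do not double-encode it. `encode_url` is a hypothetical helper and `example.com` is a placeholder host; this is not Craft's API. It assumes any `%` already in the URL is part of a valid escape sequence, and it does not handle non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    parts = urlsplit(url)
    # Keep "/" and existing "%XX" escapes intact; encode everything else
    # that is not an unreserved ASCII character.
    path = quote(parts.path, safe="/%")
    # Keep the query's structural characters as well.
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


once = encode_url("https://example.com/genêve")
twice = encode_url(once)

print(once)   # https://example.com/gen%C3%AAve
print(twice)  # https://example.com/gen%C3%AAve
```

The same encoded value could then be used for the link element, the response header and any href, so all references to the element agree.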
Actual behavior
A mixture of results, which can lead to invalid response headers and non-matching references to a single element's public URL.
Craft CMS version
5.3.6
PHP version
8.3
Operating system and version
No response
Database type and version
No response
Image driver and version
No response
Installed plugins and versions