[5.x]: It's possible all URLs which Craft generates should be URL-encoded. #15838

@joepagan

What happened?

Description

I originally raised this as an issue on the SEOmatic plugin repo, where it was suggested that it should perhaps be resolved upstream.

In short, I believe I've found evidence that Craft should be encoding every URL for publicly available elements.
Unencoded URLs can cause problems, especially where non-ASCII characters are used.

This isn't an obvious problem for browser users, because browsers show a decoded version in the address bar while actually making the request to the encoded version of the URL.
For crawlers, bots, services, and tools it is a bigger deal.
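To illustrate the difference between the displayed and the requested form, here is a minimal Python sketch using the standard library's `urllib.parse`. It is not Craft code; the path `/genêve` is simply the example used in this issue, and the sketch assumes UTF-8 percent-encoding.

```python
from urllib.parse import quote, unquote

# The human-readable path, as a browser would display it in the address bar.
display_path = "/genêve"

# Percent-encode the path as UTF-8; "/" is left untouched by default.
wire_path = quote(display_path)
print(wire_path)  # /gen%C3%AAve

# Decoding the encoded form gives back the original readable path.
print(unquote(wire_path) == display_path)  # True
```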

Slightly longer version:

The main problem I highlighted in my SEOmatic issue is that the plugin likely calls a Craft URL function to generate both the canonical link element and the canonical response header.
However, nginx does not support non-ASCII characters in response headers, so the header actually ends up containing a different URL, which results in a 404 if you visit it.

As an example, take an entry with SEOmatic enabled whose URL is /genêve:

  • the link element - uses /genêve - which does not match t
  • the link response header - uses /genêve which 404s
  • the page itself - available on /gen%C3%AAve
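One way to keep the header and the element consistent is to encode the URL before it is emitted anywhere. The following Python sketch is only an illustration of that idea, not how Craft or SEOmatic build their output. The helper names `encode_url` and `canonical_link_header` are hypothetical, `https://example.com` is a placeholder host, and the sketch assumes an ASCII hostname and that any `%` already present belongs to an existing percent-escape.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query (hypothetical helper)."""
    parts = urlsplit(url)
    # "%" is kept as safe so already-encoded URLs are not double-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def canonical_link_header(url: str) -> str:
    """Build a canonical Link header value that contains only ASCII (hypothetical helper)."""
    value = f'<{encode_url(url)}>; rel="canonical"'
    # Raises UnicodeEncodeError if a non-ASCII character slipped through.
    value.encode("ascii")
    return value


print(canonical_link_header("https://example.com/genêve"))
# <https://example.com/gen%C3%AAve>; rel="canonical"
```

The same encoded string could then be reused for the link element's href, so both references point at the URL the page is actually served from.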

I think this has probably been fine for the most part, as most crawlers likely detect non-ASCII characters and encode them themselves. It was originally brought to my attention when a third-party SEO crawler tool noticed that the SEOmatic response header did not match the canonical link element value and concluded they pointed to different destinations. They did, because the header sent by nginx was invalid.

That said, the specs referenced in my other issue, which most systems use as a guide, do seem to suggest that all URLs should be encoded so that other systems can pick them up.
I think this applies to the link element and response headers, and potentially to the href value of any a element too, given that a non-browser system (e.g. a crawler) could pick that up.

There is a lot more detail in the linked issue above to read at your leisure!

relevant links:

Steps to reproduce

  1. Install and enable SEOmatic.
  2. Create an entry with a non-ASCII character in its slug.
  3. Check the canonical link element and the canonical response header.

Expected behavior

When an element's slug contains a non-ASCII character, Craft should return an encoded URL wherever the URL function is used, e.g. in the examples above: SEOmatic canonical link elements, response headers, a element href values, etc.

Actual behavior

Inconsistent output, which can result in invalid response headers and non-matching references to a single element's public URL.
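A rough way to check whether two references point at the same path is to normalise both to their percent-encoded form before comparing. The Python sketch below assumes UTF-8 encoding throughout; `same_path` is a hypothetical helper, not part of Craft or SEOmatic.

```python
from urllib.parse import quote, unquote


def same_path(a: str, b: str) -> bool:
    """Compare two URL paths after normalising both to percent-encoded UTF-8."""
    # Decode first so that encoded and unencoded inputs end up in the same form.
    return quote(unquote(a)) == quote(unquote(b))


# The readable and the encoded form refer to the same path.
print(same_path("/genêve", "/gen%C3%AAve"))  # True

# A genuinely different path does not match.
print(same_path("/genêve", "/geneve"))  # False
```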

Craft CMS version

5.3.6

PHP version

8.3

Operating system and version

No response

Database type and version

No response

Image driver and version

No response

Installed plugins and versions

  • seomatic 5.1.2
