Skip to content

Conversation

@AhmarZaidi
Copy link
Contributor

@AhmarZaidi AhmarZaidi commented Jan 14, 2025

Summary

This PR addresses an issue where non-ASCII characters in URL filenames caused HTTP headers to break reverse proxies and violate standards. The solution encodes only the filename part of URLs, ensuring compliance with ISO-8859-1 character requirements. This change maintains URL integrity while preventing potential issues with reverse proxies.

Example URL : https://testsite.com/wp-content/uploads/2025/01/חנות-scaled.avif
Corrected URL: https://testsite.com/wp-content/uploads/2025/01/%D7%97%D7%A0%D7%95%D7%AA-scaled.avif

Fixes #1775

Relevant technical choices

  • Implemented a solution to address the issue where non-ASCII characters in URLs, such as Hebrew characters in filenames, were causing HTTP headers to break reverse proxies and violate HTTP standards.
  • Updated the URL encoding logic to use a regular expression that matches any character not allowed by RFC 3986, ensuring that all non-ASCII and disallowed ASCII characters are percent-encoded.
  • Implemented preg_replace_callback() with a static closure to dynamically encode characters using rawurlencode(). This ensures that only characters outside the allowed set are encoded.
  • Add test cases to include international domain names, non-ASCII paths, and URLs with special characters.

@AhmarZaidi AhmarZaidi changed the title Fix: Optimize URL encoding logic in get_response_header Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards Jan 14, 2025
@codecov
Copy link

codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.97%. Comparing base (e793289) to head (c3c2512).
Report is 12 commits behind head on trunk.

Additional details and impacted files
@@            Coverage Diff             @@
##            trunk    #1802      +/-   ##
==========================================
+ Coverage   65.92%   65.97%   +0.04%     
==========================================
  Files          88       88              
  Lines        6885     6895      +10     
==========================================
+ Hits         4539     4549      +10     
  Misses       2346     2346              
Flag Coverage Δ
multisite 65.97% <100.00%> (+0.04%) ⬆️
single 38.17% <0.00%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I left a couple alternative suggestions.

Please add some test coverage for the lines not covered be tests.

@westonruter
Copy link
Member

The solution encodes only the filename part of URLs

What about when an internationalized domain name is used? Couldn't this also cause problems with encoding?

@westonruter westonruter added [Type] Bug An existing feature is broken [Plugin] Optimization Detective Issues for the Optimization Detective plugin labels Jan 14, 2025
@westonruter
Copy link
Member

Additionally, on multisite subdirectory installs, in theory the path before wp-content could also include non-ASCII chars:

https://testsite.com/חנות/wp-content/uploads/2025/01/example-scaled.avif

@westonruter
Copy link
Member

@AhmarZaidi Hey, are you intending to pick this up again?

@AhmarZaidi
Copy link
Contributor Author

@westonruter Yes, I'll be working on the changes right away.

@AhmarZaidi
Copy link
Contributor Author

AhmarZaidi commented Jan 30, 2025

Instead of parsing the URL and then re-constructing it, what if you just check if the href has any non-ASCII chars and then encode the entire URL

@westonruter I've implemented the changes: If the path contains any non-ascii characters then we encode the whole URL.

However we can use preg_replace_callback to encode only the non-ascii characters in the URL. Let me know if that approach will be better.

@westonruter
Copy link
Member

What you did looks good. Could you add some test cases for international domain names and non-ASCII paths to ensure the expected output?

@AhmarZaidi
Copy link
Contributor Author

@westonruter The old solution was encoding the :// part also, so I've update the code slightly.
Current solution: Encode complete url after :// part.

Potential Issue: This solution converts the slashes (/) to %2F so the file path would be incorrect.
For example https://xn--fsq.com/תמונה.jpg will be encoded to https://xn--fsq.com%2F%D7%AA%D7%9E%D7%95%D7%A0%D7%94.jpg
Note: I've written the tests according to the above output.


If we use something like:

$decoded_url = urldecode( $link['href'] );
$encoded_url = preg_replace_callback(
        '/[^\x00-\x7F]/',
        fn( $matches ) => rawurlencode( $matches[0] ),
        $decoded_url
);

Then https://xn--fsq.com/תמונה.jpg will be encoded to https://xn--fsq.com/%D7%AA%D7%9E%D7%95%D7%A0%D7%94.jpg

Please feel free to let me know if we should do any further changes to the approach.

@westonruter
Copy link
Member

If we use something like:

That looks good!

@AhmarZaidi
Copy link
Contributor Author

Implemented the changes.

@westonruter
Copy link
Member

So this is no longer a draft and is ready for review, correct?

@AhmarZaidi AhmarZaidi marked this pull request as ready for review January 31, 2025 20:03
@github-actions
Copy link

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Unlinked Accounts

The following contributors have not linked their GitHub and WordPress.org accounts: @amitay-elementor.

Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Unlinked contributors: amitay-elementor.

Co-authored-by: AhmarZaidi <[email protected]>
Co-authored-by: westonruter <[email protected]>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@westonruter
Copy link
Member

Relevant technical choices

Could you update the description based on the latest revisions?

@westonruter westonruter changed the title Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards Prevent URL in Link header from including invalid characters Feb 4, 2025
Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@felixarntz Anything we missed here?

Copy link
Member

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @AhmarZaidi, LGTM!

@westonruter westonruter merged commit 40e8c31 into WordPress:trunk Feb 5, 2025
16 checks passed
@westonruter
Copy link
Member

We forgot about the URLs in the imagesrcset. I've opened #1866 to fix that.

@westonruter
Copy link
Member

Follow-up issue: #1906

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

[Plugin] Optimization Detective Issues for the Optimization Detective plugin [Type] Bug An existing feature is broken

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards.

3 participants