Ignoring Attributes / Incorrect rewriting of certain URLs

Resolved

Yan

(@donamyk)

Hello,

I have been troubleshooting a few issues with the URL rewriting. I have been stepping through the Staatic source code and now have a clearer picture, but I’m not sure how best to work around these issues without modifying the plugin source.

Here are some of the things that I noticed.

1. https:/ is sometimes prepended to root-relative URLs


// ### BEFORE

    @font-face {
        font-display: swap;
        font-family: AIBCase-Regular;
        font-style: normal;
        font-weight: 400;
        src: url(/wp-content/themes/aibms/src/dist/assets/AIBCase-Regular.woff2) format("woff2")
    }

    @font-face {
        font-display: swap;
        font-family: AIBCase-Medium;
        font-style: normal;
        font-weight: 500;
        src: url(/wp-content/themes/aibms/src/dist/assets/AIBCase-Medium.woff2) format("woff2")
    }

    @font-face {
        font-display: swap;
        font-family: IvyPrestoDisplay-Regular;
        font-style: normal;
        font-weight: 400;
        src: url(/wp-content/themes/aibms/src/dist/assets/IvyPrestoDisplay-Regular.otf) format("opentype")
    }

// ### AFTER

    @font-face {
        font-display: swap;
        font-family: AIBCase-Regular;
        font-style: normal;
        font-weight: 400;
        src: url(/wp-content/themes/aibms/src/dist/assets/AIBCase-Regular.woff2) format("woff2")
    }

    @font-face {
        font-display: swap;
        font-family: AIBCase-Medium;
        font-style: normal;
        font-weight: 500;
        src: url(https:/wp-content/themes/aibms/src/dist/assets/AIBCase-Medium.woff2) format("woff2")
    }

    @font-face {
        font-display: swap;
        font-family: IvyPrestoDisplay-Regular;
        font-style: normal;
        font-weight: 400;
        src: url(https:/wp-content/themes/aibms/src/dist/assets/IvyPrestoDisplay-Regular.otf) format("opentype")
    }

I think this might be related to when a resource/link is excluded / not found. (I am troubleshooting by generating a single page of my site so that I can more easily follow the logs)

2. SVG fill/stroke url() references are rewritten when the SVG is inside a CSS data URL

// ### BEFORE
url("data:image/svg+xml;charset=utf-8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='20' height='20' fill='none' viewBox='0 0 20 20'%3E%3Cpath fill='url(%23a)' stroke='url(%23b)'

// ### AFTER
fill='url(/%23a)' stroke='url(https://mysite.local/%23b)'

^^ As a consequence, /%23a gets added to the crawl queue which adds unnecessary looping.
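
For context, %23 is the percent-encoded form of #, so url(%23a) points at an element inside the same SVG rather than at a file on the site. A minimal sketch of a check that tells the two apart is below; isFragmentOnlyReference is a hypothetical helper written for illustration and is not part of Staatic.

```php
<?php
// Hypothetical helper (not a Staatic API): returns true when a url() value,
// once percent-decoded, is only a same-document fragment such as "#a".
function isFragmentOnlyReference(string $value): bool
{
    // Strip surrounding whitespace and quotes, then decode %23 to "#".
    $decoded = rawurldecode(trim($value, " \t\"'"));

    return $decoded !== '' && $decoded[0] === '#';
}

var_dump(isFragmentOnlyReference('%23a'));                         // fragment inside the SVG
var_dump(isFragmentOnlyReference('/wp-content/uploads/logo.svg')); // a real path
```

Values that pass this check have nothing to crawl or rewrite, which is why /%23a ending up in the crawl queue is wasted work.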

3. yoast-schema-graph attributes get rewritten

// ### BEFORE
 <script type="application/ld+json" class="yoast-schema-graph">{
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "WebPage",
                "@id": "https://mysite.local/",
                "url": "https://mysite.local/",
                "name": "My Site",
                "isPartOf": {
                    "@id": "https://mysite.local/#website"
                },
                "about": {
                    "@id": "https://mysite.local/#organization"
                },

// ### AFTER
 <script type="application/ld+json" class="yoast-schema-graph">{
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "WebPage",
                "@id": "/",
                "url": "/",
                "name": "AIB Merchant Services | Card Payment Solutions for Businesses",
                "isPartOf": {
                    "@id": "https:/#website"
                },
                "about": {
                    "@id": "https:/#organization"
                },

I am also getting inconsistent results between generations, and I’m not sure why.

Is it possible to add a data attribute or a comment block to indicate that a section should be ignored? I did not find a filter that would help with this.

For reference, I am using the PHP-DOM-Parser

Any help would be appreciated!

Great work on this plugin!

This topic was modified 3 months, 1 week ago by Yan.

Viewing 5 replies - 1 through 5 (of 5 total)

Plugin Author Team Staatic
(@staatic)

3 months, 1 week ago
Hi @donamyk ,

Thanks for the detailed bug report and test cases.

Good news on issue #2 (SVG paths): We’ve identified and fixed the SVG data URL issue where url(%23a) references were being incorrectly extracted and transformed. The fix is in the latest development version:
1. Go to the Staatic WordPress plugin page
2. Click “Advanced View” on the right
3. Under “Advanced Options” → “Previous versions”, download the latest development version

For issues #1 and #3: We haven’t been able to reproduce these yet. The https:/ pattern and https:/#website URLs suggest URL transformation issues specific to your configuration.

To help us reproduce and fix these, please provide:
1. Site Health Report: WP-Admin → Tools → Site Health → Info → “Copy site info to clipboard” (or email to [email protected])
2. Staatic Settings:
- Destination URL (Settings → Build tab)
- Any Advanced tab customizations
- Custom filters/transformations you’re using
- Which specific PHP-DOM-Parser option you selected

The inconsistent results between generations are particularly interesting – they might be related to processing order or caching.

Regarding exclusion by data attribute/comment: This feature doesn’t exist currently, though it would be useful for future versions.

We’re committed to resolving all three issues. Once we can reproduce them with your configuration details, we’ll provide fixes promptly.

Thanks for using Staatic!

Thread Starter Yan
(@donamyk)

3 months, 1 week ago
Thanks for the quick reply.

Items 1 & 3 seem to be related to each other, but it’s a bit of an edge case to reproduce:

For background, I am working on a pre-existing client WP installation built with a theme developed by somebody else, so I’m not super familiar with all of the nuances.

This website has hundreds of resources, so to troubleshoot the generation, I added the following to limit generation to just the index:
```
// URL crawling control
add_filter('staatic_should_crawl_url', function ($value, $url, $context): mixed {
    if (true) {
        // TODO: enable to just debug 1 page
        $sitePrefix = "https://mysite.local";
        $allowedUrls = array(
            "/"
//                    "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js"
        );
        $value = false;
        $path = $url->getPath();
        foreach ($allowedUrls as $allowedPath) {
            if ($path === $allowedPath || $sitePrefix . $path === $url) {
                $value = true;
                break;
            }
        }

    }
    return $value;
}, 5, 3);
```
I configured staatic_override_site_url to be "/".

If staatic_extended_url_context is enabled, then all links pointing to excluded resources inside the static HTML output get prefixed with https:/

e.g. script src="https:/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js"

However, if I uncomment "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js" in the filter above, then the output is correct:
```
src="/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js" 
```
NOTE: the same thing will happen to anchor href & image src attributes for excluded files
```
<img src="https:/wp-content/uploads/2025/...."
```
If staatic_override_site_url is set to something like "https://xyz.com/", then the output for excluded resources becomes like this:
```
<img src="https:https://xyz.com/wp-content/uploads/2025/06/...
```
Setting staatic_extended_url_context to false then fixes the output.

At this point, I am quite confused…

Can you help me understand whether I should enable or disable staatic_extended_url_context? I had originally set it to true after reading your guide here:

https://staatic.com/blog/tutorials/advanced-publication-process-customization/

Thread Starter Yan
(@donamyk)

3 months, 1 week ago
After searching through the plugin source, I guess the issue is happening in FallbackUrlTransformer / FallbackUrlExtractor?

I don’t understand the intended logic here:
```
 protected function getPatterns(): array
    {
        $formats = ['plain' => ['encode' => function (string $value) {
            return $value;
        }, 'decode' => function (string $value) {
            return $value;
        }], 'jsonEncoded' => ['encode' => function (string $value) {
            return str_replace('/', '\/', $value);
        }, 'decode' => function (string $value) {
            return str_replace('\/', '/', $value);
        }], 'urlEncoded' => ['encode' => function (string $value) {
            return rawurlencode($value);
        }, 'decode' => function (string $value) {
            return rawurldecode($value);
        }]];
        $patterns = [];
        foreach ($formats as $format => $options) {
            $slash = preg_quote($options['encode']('/'), '~');
            $doubleColon = preg_quote($options['encode'](':'), '~');
            $authority = preg_quote($options['encode']($this->baseUrl->getAuthority()), '~');
            $filterBasePath = $this->filterBasePath === null ? '' : preg_quote($options['encode'](trim($this->filterBasePath, '/')), '~');
            $patterns[] = ['pattern' => '~' . ($this->extendedUrlContext ? '(?P<before>.{0,100})' : '') . '(?P<url>
                    (?P<scheme>https?' . $doubleColon . ')?' . $slash . $slash . $authority . '
                    (?P<port>' . $doubleColon . '(?:80|443))?
                    (?P<path>' . (empty($filterBasePath) ? '' : $slash . $filterBasePath) . '

                        # Either the URL has an extra path or in the future it has a non-path char.
                        (' . $slash . '|(?![a-z0-9-._]))

                        # Rest of the path/query chars.
                        (?:' . $slash . '|[a-z0-9-._\~%])*
                    )

                )' . ($this->extendedUrlContext ? '(?P<after>.{0,100})' : '') . '~ix', 'encode' => $options['encode'], 'decode' => $options['decode']];
        }
```
I added a log statement to FallbackUrlTransformer and saw the following:
```
[05-Sep-2025 20:22:37 UTC] [FallbackUrlTransformer] - {
    "effectiveUrl": "https://xyz.com/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
    "$transformResult": "https://xyz.com/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
    "$url": "https://mysite.local/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
    "$foundOnUrl": "https://mysite.local/",
    "$context": {
        "before": "</style><link rel=\"modulepreload\" as=\"script\" crossorigin=\"\" href=\"https:",
        "scheme": "",
        "port": "",
        "path": "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
        "after": "\"><link rel=\"stylesheet\" href=\"https://mysite.local/wp-content/themes/aibms/src/dist/assets/maincss.c",
        "extractor": "Staatic\\Crawler\\UrlExtractor\\FallbackUrlExtractor"
    }
}
```
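
The log above shows "before" ending in https: while "scheme" is empty. A minimal sketch of how that split can arise is below. It uses a simplified stand-in for the quoted pattern (plain format only, no port or base-path handling), and buildPattern is a hypothetical function written for this illustration, not plugin code.

```php
<?php
// Simplified stand-in for the pattern quoted above, not the plugin's actual code.
function buildPattern(string $authority, bool $extendedContext): string
{
    $authority = preg_quote($authority, '~');

    return '~'
        . ($extendedContext ? '(?P<before>.{0,100})' : '')
        . '(?P<url>(?P<scheme>https?:)?//' . $authority
        . '(?P<path>(?:/|[a-z0-9._\~%-])*))'
        . ($extendedContext ? '(?P<after>.{0,100})' : '')
        . '~i';
}

$html = '<link rel="modulepreload" href="https://mysite.local/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js">';

// Without context capture, the optional scheme group gets the first chance at "https:".
preg_match(buildPattern('mysite.local', false), $html, $plain);

// With context capture, the greedy "before" group backtracks from the right and
// stops at the first split that works, which leaves "https:" inside "before".
preg_match(buildPattern('mysite.local', true), $html, $extended);

// Replacing only the matched URL with a root-relative path keeps that stray "https:".
$rewritten = $extended['before'] . $extended['path'] . $extended['after'];

echo $plain['scheme'], PHP_EOL;
echo $extended['before'], PHP_EOL;
echo $rewritten, PHP_EOL;
```

In this sketch the scheme group is optional, so the regex engine is free to hand https: to the context group instead. That would be consistent with both the https:/ output and the https:https://xyz.com/ output described earlier, though the real transformer may differ in its details.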

Thread Starter Yan
(@donamyk)

3 months, 1 week ago
On the subject, can you please elaborate on this note from the Advanced Publication blog post?

=================

Beyond controlling which URLs are crawled, this filter can also be used to manage how URLs are presented in the static version of your site. For example, when you configure Staatic to use relative URLs for portability, there might be cases where you need to maintain absolute URLs. This is common for canonical URL references or links in XML sitemaps, where the absolute URL is crucial for SEO purposes.
```
<?php

add_filter( 'staatic_should_crawl_url' , function ( $value, $url, $context ) {
    if ( ( $context[ 'htmlTagName' ] ?? '' ) === 'link' &&
        ( $context[ 'htmlAttributeName' ] ?? '' ) === 'href' &&
        ( str_contains( $context[ 'htmlElement' ] ?? '', 'canonical' ) ) ) {
        return false;
    }

    return $value;
}, 10, 3 );
```
=================

How would excluding a URL via staatic_should_crawl_url preserve the absolute URL if this filter would exclude the page entirely?

Maybe there should be a separate exclude_from_transformation hook or something?

Perhaps that documentation is outdated, but then how could I achieve the following?

- Inside pages, rewrite all paths to begin with / except for:
  - <meta property="og:url" content="https:https://xyz.com/">
  - the JSON inside yoast-schema-graph
- Preserve absolute URLs inside generated sitemap files.

So far, the only solution I can think of is to define a custom Transformer that finds and replaces a placeholder URL based on my own criteria.

Plugin Author Team Staatic
(@staatic)

3 months ago
Hi @donamyk ,

Great news! All the issues you reported have been addressed in Staatic 1.12.0 which was just released.

Issue #1 & #3 (malformed URLs with https:/ prefixes): This was indeed a bug with FallbackUrlExtractor when staatic_extended_url_context was enabled. URLs appearing after incomplete protocol prefixes were incorrectly matched, causing the malformed output you observed. We’ve implemented a two-pass extraction algorithm that properly handles these cases. You can now safely use staatic_extended_url_context.

Regarding preserving absolute URLs for canonical/Open Graph tags: You identified an important limitation in the documentation. The staatic_should_crawl_url filter controls crawling, not transformation; returning false prevents crawling entirely, which isn’t what you want.

For your specific requirements (canonical, Open Graph, Yoast Schema with absolute URLs pointing to your static site), you have two approaches:

Option 1: use an absolute Destination URL
Set your Destination URL in Staatic’s Build settings to your static site’s full URL (e.g., https://example.com). This will automatically keep all URLs absolute while pointing to the correct static domain.

Option 2: custom Transformer for selective absolute URLs
If you need relative URLs for most content but absolute for specific tags, you’ll need a custom Transformer during the publication process. This would rewrite specific URLs to be absolute with your static domain. See our transformer hooks documentation – while the docs are limited at the moment, reviewing the source code along with the staatic_transformers filter should give you a starting point.
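
As a rough illustration of the rewrite Option 2 describes, the sketch below turns root-relative og:url and canonical values back into absolute URLs on the static domain. It is plain string processing only: makeSelectedUrlsAbsolute is a hypothetical helper, https://example.com is a placeholder, the attribute order is assumed to match the markup shown in this thread, and how to register such logic through staatic_transformers is not shown because that interface is not described here.

```php
<?php
// Hypothetical helper (not a Staatic API): makes root-relative og:url and
// canonical values absolute again after the main rewrite produced relative paths.
function makeSelectedUrlsAbsolute(string $html, string $staticBaseUrl): string
{
    $base = rtrim($staticBaseUrl, '/');

    // Group 1: tag up to the opening quote. Group 2: a root-relative path
    // (a single leading slash, so protocol-relative "//host" is left alone).
    $pattern = '~(<(?:meta\s+property="og:url"\s+content|link\s+rel="canonical"\s+href)=")'
        . '(/[^"/][^"]*|/)(")~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($base): string {
        return $m[1] . $base . $m[2] . $m[3];
    }, $html);

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $html;
}

$html = '<meta property="og:url" content="/"><link rel="canonical" href="/pricing/"><a href="/about/">About</a>';
echo makeSelectedUrlsAbsolute($html, 'https://example.com/'), PHP_EOL;
```

Ordinary links such as the anchor in the sample stay relative, so only the tags that need an absolute URL for SEO purposes are touched.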

New in 1.12.0: We’ve added the staatic_should_transform_url filter which provides independent control over URL transformation. While this won’t solve the absolute URL requirement directly (as it would keep original WordPress URLs), it could be useful for preventing transformation in other scenarios:
```
// Example: Keep WordPress admin-ajax.php URLs unchanged
add_filter('staatic_should_transform_url', function ($transform, $url, $foundOnUrl, $context) {
    // Don't transform admin-ajax endpoints (for forms/dynamic features)
    if (str_contains($url->getPath(), '/wp-admin/admin-ajax.php')) {
        return false; // Keep original WordPress URL
    }

    return $transform;
}, 10, 4);
```
We’ll be updating the documentation and the Advanced Publication article shortly to clarify these distinctions.

Also, thanks for your excellent debugging work! Your logs showing the partial protocol in the “before” context were exactly what we needed to identify the root cause.
