[4.x]: Issue with Search Indexing and Invisible Characters in Multi-Site Craft CMS Project #16430

@romainpoirier

What happened?

Description

In a multi-site Craft CMS Pro project with the "craftcms/redactor": "3.1.0" plugin, I am encountering an issue with content indexing. The issue seems related to invisible characters that affect search results in the admin panel and frontend.

Example Content

If I click the Source button in the Redactor field, the content is displayed as follows:

<p>More infor­ma­tion on the projects sub­mis­sion can be found in the <strong>TOOL­BOX</strong>.</p>

However, when I search for the word toolbox in the admin, this page does not appear in the results. Conversely, if I search for TOOL%C2%ADBOX (the word with a URL-encoded soft hyphen, U+00AD, between TOOL and BOX), the page is found. The same behavior occurs in the frontend when using .search().
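To make the mismatch concrete, here is a minimal Python sketch (not Craft or Redactor code) that rebuilds the stored word with a soft hyphen and compares it to what a user types. The variable names are illustrative, and it assumes the invisible character is U+00AD, which is what the %C2%AD sequence decodes to in UTF-8.

```python
from urllib.parse import quote

SOFT_HYPHEN = "\u00ad"  # invisible character; its UTF-8 bytes are C2 AD

# Assumption: this is what the field holds after saving.
stored = "TOOL" + SOFT_HYPHEN + "BOX"
# What a user types into the search box.
typed = "TOOLBOX"

print(len(stored), len(typed))          # 8 7
print(stored == typed)                  # False
print(quote(stored))                    # TOOL%C2%ADBOX
print(typed.lower() in stored.lower())  # False
```

The two strings look identical on screen but differ by one code point, which would explain why a plain keyword match on toolbox fails while the percent-encoded form succeeds.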

Field Configuration

  • Clean up HTML: Remove inline styles, Remove empty tags, Replace non-breaking spaces with regular spaces
  • Purify HTML: Enabled
  • HTML Purifier Config: Default

It seems that some invisible characters are being introduced and retained after saving. These characters interfere with the indexing process.

Steps to Reproduce

  1. Create a field using the Redactor plugin with the configuration listed above.
  2. Add the following content in the field:
    <p>More infor­ma­tion on the projects sub­mis­sion can be found in the <strong>TOOL­BOX</strong>.</p>
  3. Save the entry and perform a search for the word toolbox in the admin or frontend.

Expected Behavior

The page containing the word TOOLBOX should appear in the search results without requiring the exact invisible character sequence (TOOL%C2%ADBOX).

Actual Behavior

The page only appears in the search results if the invisible character sequence is included in the search query. Regular searches for toolbox do not return the expected result.

Additional Questions

  1. Why are these invisible characters added and retained after saving the content?
  2. How can I prevent such characters from being saved in the first place?
  3. What is the recommended approach to clean up existing content before re-indexing with --update-search-index?
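For question 3, the kind of cleanup I have in mind looks roughly like the Python sketch below. It is only an illustration of the transformation, not a Craft API: strip_invisible and the character list are my own assumptions, and the same logic would have to be ported to wherever the content is saved or indexed.

```python
import re

# Assumption: characters to treat as invisible noise.
# U+00AD soft hyphen, U+200B zero-width space, U+FEFF byte order mark.
# Zero-width joiners (U+200C, U+200D) are left alone on purpose,
# because some scripts and emoji sequences depend on them.
INVISIBLE = re.compile("[\u00ad\u200b\ufeff]")


def strip_invisible(text: str) -> str:
    """Return text with the characters listed above removed."""
    return INVISIBLE.sub("", text)


html = "<p>See the <strong>TOOL\u00adBOX</strong>.</p>"
cleaned = strip_invisible(html)

print("toolbox" in html.lower())     # False
print("toolbox" in cleaned.lower())  # True
```

Running something equivalent over the stored field values before re-indexing should make the plain keyword searchable again, at the cost of losing the hyphenation hints in the rendered text.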

Craft CMS version

4.13.8

PHP version

No response

Operating system and version

No response

Database type and version

No response

Image driver and version

No response

Installed plugins and versions

  • "craftcms/redactor": "3.1.0"
