in_tail: encoding option does not work

### Describe the bug

When parsing with a regexp, I could not match the plus-minus Unicode character (±).
I made sure that my config file, my log file, and the `encoding UTF-8` setting in my source all use UTF-8.

None of these matched: `±`, `\uc2b1`, `\xc2\xb1`.
As a last resort, I am matching the Unicode replacement character (`\uFFFD`) for now.
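
For reference, here is a minimal plain-Ruby sketch (not Fluentd code) of how the spellings above differ. The plus-minus sign is code point U+00B1, its UTF-8 encoding is the two bytes C2 B1, and `\uc2b1` names code point U+C2B1, which is a different character. The constant and helper names are made up for illustration.

```ruby
# Standalone illustration in plain Ruby; not Fluentd code.
PLUS_MINUS = "\u00B1" # the plus-minus sign, code point U+00B1

# Hex view of the bytes a string occupies in its current encoding.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }
end

# Code point of the character, then its UTF-8 byte sequence.
puts PLUS_MINUS.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts hex_bytes(PLUS_MINUS).join(" ")

# \uC2B1 is the code point U+C2B1, not the byte pair C2 B1,
# so a pattern built from it cannot match the plus-minus sign.
puts(/\uC2B1/.match?(PLUS_MINUS))
puts(/\u00B1/.match?(PLUS_MINUS))
```

If an escape is needed at all, `\u00B1` is the one that names the plus-minus sign. Whether it matches inside Fluentd still depends on how the line being parsed is tagged.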

In the Fluentd Docker image, I enabled `-vv` to get more insight and found the following:

```
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: starting fluentd-1.18.0 pid=7 ruby="3.2.6"
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluentd.conf", "-vv", "--plugin", "/fluentd/plugins", "--under-supervisor"]
```

With fluent-package:
```
# pgrep -a ruby
772464 /opt/fluent/bin/ruby -Eascii-8bit:ascii-8bit /opt/fluent/bin/fluentd --log /var/log/fluent/fluentd.log --daemon /var/run/fluent/fluentd.pid --under-supervisor
```
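
For context, Ruby's `-E ex:in` option sets the default external and internal encodings, so with `-Eascii-8bit:ascii-8bit` text read without an explicit encoding is tagged as binary. The plain-Ruby sketch below reproduces that tagging by reading a temporary file as ASCII-8BIT. It then shows that a UTF-8 regexp containing ± cannot be matched against such a string until the string is relabelled as UTF-8. It is not a trace of what `in_tail` does internally; the sample line, pattern, and helper names are assumptions made for illustration.

```ruby
require "tempfile"

# Hypothetical sample line modelled on the log format in this report.
SAMPLE = "1739970007.466780\u00B1INFO\u00B1ABCD\n"
# A UTF-8 regexp, because it contains a non-ASCII \u escape.
PATTERN = /(?<ts>[^ ]*)\u00B1(?<level>[^ ]*)\u00B1/

# Read a file the way plain File.read would under
# -Eascii-8bit:ascii-8bit: the bytes come back tagged as binary.
def read_as_binary(path)
  File.read(path, encoding: "ASCII-8BIT")
end

# Returns the named captures, nil when nothing matches, or the exception
# class name when Ruby refuses to mix a UTF-8 regexp with a binary string.
def try_match(line)
  m = PATTERN.match(line)
  m ? m.named_captures : nil
rescue Encoding::CompatibilityError => e
  e.class.name
end

Tempfile.create("sample.log") do |f|
  f.binmode
  f.write(SAMPLE)
  f.flush

  raw = read_as_binary(f.path)
  p raw.encoding                                        # binary tag
  p try_match(raw)                                      # incompatible encodings
  p try_match(raw.dup.force_encoding(Encoding::UTF_8))  # same bytes, now UTF-8
end
```

The bytes on disk are identical in both attempts; only the encoding tag on the string differs.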

Should something like `-Eutf-8:utf-8` be used instead, or could there be an easy way to override it?

Thank you.

### To Reproduce

Run a regexp parse against a log line that contains a non-ASCII Unicode character such as ±.

### Expected behavior

Fluentd should handle UTF-8 by default, although this may be a matter of taste.
Otherwise, the behavior should be documented (have I missed it?) and there should be an easy way to override it.

### Your Environment

```markdown
- Fluentd version: fluent-package 5.0.6-1 arm64 (Docker image was latest end of Feb).
- Operating system: Ubuntu 22.04
- Kernel version: 6.5.0
```

### Your Configuration

```apache
# pos_file is intentionally not set here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>
```

### Your Error Log

```shell
2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"
```

### Additional context

Here's a log line:

```
Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780±INFO±ABCD±EFGH±MYTAG±HERE IS SOME MESSAGE±
```

Here's my source:

```
# pos_file is intentionally not set here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>
```

Here's the error I get:

```
2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"
```

If I replace the expression with this, it goes through and is properly split:

```
expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ \uFFFD]*)\uFFFD\uFFFD(?<loglevel>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabetone>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabettwo>[^ \uFFFD]*)\uFFFD\uFFFD(?<tag>[^\uFFFD]*)\uFFFD\uFFFD(?<message>[^\uFFFD]*)\uFFFD\uFFFD(?<severity>[^ ]*)$/
```

`\uFFFD` is the Unicode replacement character (U+FFFD), which stands in for data that could not be decoded.
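
The warning shows two `\uFFFD` for every ±, which is what one would expect if each of the two UTF-8 bytes (C2 B1) were replaced separately. The plain-Ruby sketch below shows one way that can happen: transcoding a binary-tagged string to UTF-8 with replacement, as opposed to relabelling the same bytes as UTF-8. This is only an assumption about the mechanism, not something confirmed from the Fluentd source. The constant and helper names are hypothetical.

```ruby
# Standalone illustration in plain Ruby; not taken from Fluentd.
# UTF-8 bytes of a short sample, but tagged as binary (ASCII-8BIT).
RAW = "466780\u00B1INFO".b

# Transcoding from binary to UTF-8: bytes >= 0x80 have no defined
# mapping, so each such byte is replaced individually.
def transcode_with_replacement(bytes)
  bytes.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Relabelling: the bytes are left untouched and reinterpreted as UTF-8.
def relabel_as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

p transcode_with_replacement(RAW).codepoints.map { |c| format("U+%04X", c) }
p relabel_as_utf8(RAW).codepoints.map { |c| format("U+%04X", c) }
p relabel_as_utf8(RAW).valid_encoding?
```

In the first result the single ± becomes two U+FFFD code points, while the relabelled string keeps one U+00B1.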

- My fluentd.conf is saved as UTF-8.
- My log file is UTF-8.
- My source config, as shown above, sets `encoding UTF-8`.

The only thing I can see that is not UTF-8 is the encoding argument passed to ruby (`-Eascii-8bit:ascii-8bit`).

So even after simplifying things a lot, I do not see Unicode being handled properly.

I tried `±`, `\±`, `\uc2b1`, and `\xc2\xb1`.
