
in_tail: encoding option does not work #4863

@pierreact


Describe the bug

When parsing with a regexp, I couldn't match the plus-minus Unicode character (±, U+00B1).
I made sure that my config file, my log file, and the in_tail encoding option (encoding UTF-8) all use UTF-8.

All of these failed to match: ±, \uc2b1, \xc2\xb1
As a workaround, I'm matching the Unicode replacement character (\uFFFD) for now.
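For reference, here is a small plain-Ruby sketch of how the escapes above relate to ±. It only illustrates Ruby string semantics; whether the fluentd config parser passes these escapes through to the regexp unchanged is not something this snippet shows. The helper name utf8_hex is made up for the illustration.

```ruby
# Hypothetical helper: hex dump of a string's UTF-8 bytes.
def utf8_hex(str)
  str.encode("UTF-8").bytes.map { |b| format("%02X", b) }.join(" ")
end

# U+00B1 PLUS-MINUS SIGN is encoded in UTF-8 as the two bytes C2 B1.
puts utf8_hex("\u00B1")

# The same two bytes, tagged as UTF-8, compare equal to "\u00B1".
from_bytes = [0xC2, 0xB1].pack("C*").force_encoding("UTF-8")
puts from_bytes == "\u00B1"

# "\uC2B1" is a code-point escape for U+C2B1, a different character
# that takes three bytes in UTF-8.
puts utf8_hex("\uC2B1")
```

So \xc2\xb1 spells the UTF-8 bytes of ±, while the code-point escape for ± would be \u00b1 rather than \uc2b1.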

In the fluentd Docker image, I enabled -vv to get more insight and found the following:

2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: starting fluentd-1.18.0 pid=7 ruby="3.2.6"
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluentd.conf", "-vv", "--plugin", "/fluentd/plugins", "--under-supervisor"]

With fluent-package:

# pgrep -a ruby
772464 /opt/fluent/bin/ruby -Eascii-8bit:ascii-8bit /opt/fluent/bin/fluentd --log /var/log/fluent/fluentd.log --daemon /var/run/fluent/fluentd.pid --under-supervisor

Should -Eutf-8:utf-8 be used instead, or could there be an easy way to override it?
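To illustrate why the ASCII-8BIT tagging matters for regexps, here is a minimal plain-Ruby sketch. It does not reproduce what in_tail does internally; it only shows how Ruby treats a UTF-8 regexp against a binary-tagged string. PLUS_MINUS_RE and try_match are names invented for this example.

```ruby
# A regexp containing a non-ASCII \u escape has a fixed UTF-8 encoding.
PLUS_MINUS_RE = /\u00B1/

# Hypothetical helper: report the outcome of matching instead of raising.
def try_match(str)
  PLUS_MINUS_RE.match?(str) ? :matched : :not_matched
rescue Encoding::CompatibilityError
  :incompatible_encoding
end

# The UTF-8 bytes of the plus-minus sign, tagged ASCII-8BIT (binary).
raw = [0xC2, 0xB1].pack("C*")

puts raw.encoding
puts try_match(raw)                              # binary-tagged bytes
puts try_match(raw.dup.force_encoding("UTF-8"))  # same bytes, UTF-8 tag
```

The bytes are identical in both calls; only the encoding tag on the string differs.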

Thank you.

To Reproduce

Run in_tail with a regexp parser whose expression contains a non-ASCII character such as ±.

Expected behavior

UTF-8 should be handled by default, though this may be a matter of taste.
Otherwise, please document the behavior (have I missed it?) and provide an easy way to override it.

Your Environment

- Fluentd version: fluent-package 5.0.6-1 arm64 (Docker image was latest end of Feb).
- Operating system: Ubuntu 22.04
- Kernel version: 6.5.0

Your Configuration

# posfile is disabled on purpose here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>

Your Error Log

2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"

Additional context

Here's a log line:

Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780±INFO±ABCD±EFGH±MYTAG±HERE IS SOME MESSAGE±
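As a sanity check of the expression itself, here is a sketch that runs it in plain Ruby, outside fluentd, against the log line above. ± is written as \u00B1 so the snippet does not depend on the source file encoding; EXPRESSION, LOG_LINE and parse_line are names chosen for the illustration.

```ruby
# The expression from the configuration, with the separator written as \u00B1.
EXPRESSION = /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)\u00B1(?<loglevel>[^ ]*)\u00B1(?<alphabetone>[^ ]*)\u00B1(?<alphabettwo>[^ ].*)\u00B1(?<tag>.*)\u00B1(?<message>.*)\u00B1(?<severity>[^ ]*)$/

# The sample log line as a UTF-8 string.
LOG_LINE = "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: " \
           "1739970007.466780\u00B1INFO\u00B1ABCD\u00B1EFGH\u00B1MYTAG" \
           "\u00B1HERE IS SOME MESSAGE\u00B1"

# Returns a hash of named captures, or nil when the line does not match.
def parse_line(line)
  m = EXPRESSION.match(line)
  m && m.named_captures
end

record = parse_line(LOG_LINE)
puts record.inspect
```

With a UTF-8 string the expression matches and splits the fields, which suggests the pattern is fine and the mismatch comes from how the line is decoded before parsing.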

My source block and the warning I get are the ones shown above under Your Configuration and Your Error Log.

If I replace the expression with this, it goes through and is properly split:

expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ \uFFFD]*)\uFFFD\uFFFD(?<loglevel>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabetone>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabettwo>[^ \uFFFD]*)\uFFFD\uFFFD(?<tag>[^\uFFFD]*)\uFFFD\uFFFD(?<message>[^\uFFFD]*)\uFFFD\uFFFD(?<severity>[^ ]*)$/

\uFFFD is U+FFFD, the Unicode replacement character, which stands in for bytes that could not be decoded.
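The two \uFFFD per ± in the warning would be consistent with each of the two UTF-8 bytes (C2 B1) being replaced separately. The sketch below shows one way this can happen in plain Ruby: transcoding binary-tagged bytes to UTF-8 with replacement enabled. That this is the code path fluentd takes is an assumption, not something confirmed here; transcode_with_replacement is a made-up name.

```ruby
# Hypothetical helper: convert to UTF-8, substituting U+FFFD for anything
# that has no defined conversion or is invalid in the source encoding.
def transcode_with_replacement(str)
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end

# The UTF-8 bytes of the plus-minus sign, tagged ASCII-8BIT (binary).
raw = [0xC2, 0xB1].pack("C*")

# Bytes >= 0x80 have no defined ASCII-8BIT -> UTF-8 conversion, so each
# byte is replaced on its own.
replaced = transcode_with_replacement(raw)
puts replaced.codepoints.map { |c| format("U+%04X", c) }.join(" ")

# Re-tagging the same bytes as UTF-8 keeps them intact instead.
retagged = raw.dup.force_encoding("UTF-8")
puts retagged.codepoints.map { |c| format("U+%04X", c) }.join(" ")
```

This would also explain why an expression written with \uFFFD\uFFFD as the separator matches: by the time the parser sees the line, the separator really is two replacement characters.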

My fluentd.conf is saved as UTF-8.
My log file is UTF-8.
My in_tail configuration, as you can see above, sets encoding UTF-8.
The only thing I see that is not UTF-8 is the Ruby command-line argument that sets the encoding (-Eascii-8bit:ascii-8bit).

So even after simplifying things a lot, Unicode does not seem to be handled properly.

In the expression I tried ±, \±, \uc2b1 and \xc2\xb1.
