

@repeatedly commented Feb 28, 2018

In a Kubernetes (k8s) environment, we received a report that "file chunks are unexpectedly broken, and this causes an error loop during resume."
This patch mitigates the problem by ignoring broken chunks during resume.
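Below is a minimal sketch of the idea. The `resume_chunks` helper and the caller-supplied loader block are hypothetical and are not fluentd's actual API. Each chunk is loaded independently. If loading a chunk fails, the failure goes to a handler, and the whole resume is not aborted.

```ruby
# Hypothetical helper, not fluentd's real API.
# Loads every chunk path with the given block. A chunk whose load raises is
# passed to `on_broken` (e.g. to log it and delete its files) and skipped,
# so one broken chunk no longer stops the remaining chunks from resuming.
def resume_chunks(paths, on_broken:)
  paths.each_with_object([]) do |path, resumed|
    begin
      resumed << yield(path)
    rescue StandardError => e
      on_broken.call(path, e)
    end
  end
end

# Usage: "bad.log" simulates a chunk whose meta file cannot be parsed.
broken = []
chunks = resume_chunks(%w[a.log bad.log b.log],
                       on_broken: ->(path, e) { broken << [path, e.message] }) do |path|
  raise ArgumentError, "corrupted meta" if path == "bad.log"
  "chunk:#{path}"
end
p chunks # => ["chunk:a.log", "chunk:b.log"]
p broken # => [["bad.log", "corrupted meta"]]
```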


```ruby
def handle_broken_files(path, mode, e)
  log.error "found broken chunk file during resume. Deleted corresponding files:", :path => path, :mode => mode, :err_msg => e.message
  # After the 'backup_dir' feature is supported, these files will be moved to backup_dir instead of being unlinked.
  # ... (remainder of the method not shown in this review excerpt)
end
```
@repeatedly (author) commented:

The issue for the `backup_dir` feature is #1856.

@mururu commented Mar 1, 2018

Looks good. If you push tests to this PR, I'll review them too.

@mururu commented Mar 1, 2018

Minor comment: it seems better to add a method that builds `path + '.meta'`.
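Below is a minimal sketch of the suggested helper. The names `meta_path_for` and `chunk_files_for` are hypothetical. The only assumption is the `path + '.meta'` convention mentioned in the comment.

```ruby
# Hypothetical helpers: each chunk file `path` keeps its metadata next to it
# in `path + '.meta'`. Centralizing the suffix avoids repeating the string
# concatenation wherever both files are handled.
META_SUFFIX = ".meta".freeze

def meta_path_for(chunk_path)
  chunk_path + META_SUFFIX
end

# Both files belonging to one chunk, e.g. for deleting a broken chunk.
def chunk_files_for(chunk_path)
  [chunk_path, meta_path_for(chunk_path)]
end

p chunk_files_for("/var/log/fluent/buffer.b5abc.log")
# => ["/var/log/fluent/buffer.b5abc.log", "/var/log/fluent/buffer.b5abc.log.meta"]
```

A caller that deletes both files could pass `*chunk_files_for(path)` to `File.unlink`, which accepts multiple paths.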

@repeatedly (author):

Add tests

@mururu left a comment:

Looks good.

@repeatedly merged commit 5c7c32a into master on Mar 2, 2018
@repeatedly deleted the `ignore-broken-file-chunks` branch on March 2, 2018 11:04
daipom pushed a commit that referenced this pull request Jun 27, 2025
**Which issue(s) this PR fixes**: 
* Related to #3970

**What this PR does / why we need it**: 
This PR improves meta file corruption checking.

The meta file contains at least the following fields:


https://github.com/fluent/fluentd/blob/fa2eb58922e1c36f83bf1d5243b325a860f72864/lib/fluent/plugin/buffer/file_chunk.rb#L249-L254

This PR reinforces #1874.


Without this change, the following error may occur every time fluentd is launched with a broken meta file:

```
2025-06-06 12:11:26 +0900 [error]: unexpected error while checking flushed chunks. ignored. error_class=NoMethodError error="undefined method '<' for nil"
  2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/output.rb:1479:in 'block in Fluent::Plugin::Output#enqueue_thread_run'
  2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/buffer.rb:548:in 'block in Fluent::Plugin::Buffer#enqueue_all'
  2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/buffer.rb:542:in 'Array#each'
  2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/buffer.rb:542:in 'Fluent::Plugin::Buffer#enqueue_all'
  2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/output.rb:1479:in 'Fluent::Plugin::Output#enqueue_thread_run'
  2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin_helper/thread.rb:78:in 'block in Fluent::PluginHelper::Thread#thread_create'
```

The above error occurs when the timekey value is corrupted.
There is no appropriate way to check timekey directly, so this PR checks the `id`,
`c`, and `m` fields instead. When timekey is broken, the other fields
may also be broken.
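The error message shows `<` being called on `nil`. The simplified stand-in below reproduces that error when a restored timekey is missing. It assumes that the flush check compares a chunk's timekey using `<`, as the message suggests. The method `flushable?` is hypothetical and is not the code at output.rb:1479.

```ruby
# Hypothetical, simplified flush condition: a chunk is flushable once its
# timekey lies before the current timekey.
def flushable?(timekey, current_timekey)
  timekey < current_timekey
end

current = 1_700_000_000
p flushable?(1_699_999_000, current) # => true

begin
  # timekey restored as nil because the meta file was broken
  flushable?(nil, current)
rescue NoMethodError => e
  # The exact message wording differs between Ruby versions.
  puts "NoMethodError: #{e.message}"
end
```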

`@size` may legitimately be 0, so it is not a reliable indicator of corruption.
By contrast, `@unique_id`, `@created_at`, and `@modified_at` are set when a FileChunk
is initialized, so they always have values.
I think these fields should therefore always be written to the meta file.

So, this PR adds a check for the `id`, `c`, and `m` fields.
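The sketch below shows one way to write such a check. It assumes the meta file has already been decoded into a Hash with symbol keys. The helper name `meta_fields_present?` and the sample values are hypothetical, and this is not the method this PR adds.

```ruby
# Fields that a FileChunk always sets at initialization, so a healthy meta
# file is expected to contain them. The chunk size is not part of this check
# because it may legitimately be 0.
REQUIRED_META_KEYS = %i[id c m].freeze

# Returns true only if `data` is a Hash and every required field is non-nil.
def meta_fields_present?(data)
  data.is_a?(Hash) && REQUIRED_META_KEYS.all? { |key| !data[key].nil? }
end

# Illustrative values only.
p meta_fields_present?({ id: "5abc", c: 1_700_000_000, m: 1_700_000_100, timekey: 1_699_999_200 }) # => true
p meta_fields_present?({ timekey: nil }) # => false
```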


Previously, buf_file fell back to default values when the metadata was broken.
However, this can miss the corruption and lead to unexpected errors.
So, this PR improves the detection of broken metadata files instead of
falling back to default values.
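The difference between the two behaviors can be sketched as follows. Both functions are hypothetical simplifications.

- **Lenient:** fills in defaults for missing fields. A meta file that decodes to nothing still "restores", but with a `nil` timekey.
- **Strict:** raises an error. The chunk can then be treated as broken during resume.

```ruby
class BrokenMetaError < StandardError; end

# Lenient style: missing fields silently get defaults, hiding corruption.
def restore_lenient(data, now)
  {
    created_at: data.fetch(:c, now),
    modified_at: data.fetch(:m, now),
    timekey: data[:timekey],
  }
end

# Strict style: refuse to restore when a field that is always written is missing.
def restore_strict(data)
  missing = %i[id c m].select { |key| data[key].nil? }
  raise BrokenMetaError, "missing meta fields: #{missing.join(', ')}" unless missing.empty?

  { created_at: data[:c], modified_at: data[:m], timekey: data[:timekey] }
end

broken = {} # e.g. a meta file whose contents could not be decoded
p restore_lenient(broken, 1_700_000_000) # restores, but timekey is nil
begin
  restore_strict(broken)
rescue BrokenMetaError => e
  puts e.message # => missing meta fields: id, c, m
end
```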

This change is backward compatible with the v0.14 behavior.


**Docs Changes**:
Not necessarily required.

**Release Note**: 
buf_file: reinforce buffer file corruption check

---------

Signed-off-by: Shizuo Fujita <[email protected]>

Labels: bug (Something isn't working), v1


3 participants