buf_file: Skip and delete broken file chunks during resume. fix #1760 #1874
def handle_broken_files(path, mode, e)
  log.error "found broken chunk file during resume. Deleted corresponding files:", :path => path, :mode => mode, :err_msg => e.message
  # After support 'backup_dir' feature, these files are moved to backup_dir instead of unlink.
backup_dir's issue is here: #1856
Looks good. If you push tests in this PR, I'll review them as well.
minor comment: It seems better to prepare a method for
Add tests
mururu
left a comment
Looks good.
**Which issue(s) this PR fixes**:
* Related to #3970

**What this PR does / why we need it**:
This PR improves meta file corruption checking. The meta file contains at least the following field values.
https://github.com/fluent/fluentd/blob/fa2eb58922e1c36f83bf1d5243b325a860f72864/lib/fluent/plugin/buffer/file_chunk.rb#L249-L254

This PR reinforces #1874. Without this change, the following error might occur every time fluentd is launched with a broken meta file:

```
2025-06-06 12:11:26 +0900 [error]: unexpected error while checking flushed chunks. ignored. error_class=NoMethodError error="undefined method '<' for nil"
2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/output.rb:1479:in 'block in Fluent::Plugin::Output#enqueue_thread_run'
2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/buffer.rb:548:in 'block in Fluent::Plugin::Buffer#enqueue_all'
2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/buffer.rb:542:in 'Array#each'
2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/buffer.rb:542:in 'Fluent::Plugin::Buffer#enqueue_all'
2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin/output.rb:1479:in 'Fluent::Plugin::Output#enqueue_thread_run'
2025-06-06 12:11:26 +0900 [error]: /Users/watson/src/fluentd/lib/fluent/plugin_helper/thread.rb:78:in 'block in Fluent::PluginHelper::Thread#thread_create'
```

The above error occurs when the timekey value is corrupted. Since there is no appropriate way to check timekey directly, this PR checks the `id`, `c`, and `m` fields instead, because when timekey is broken, other fields may also be broken.

It is possible for `@size` to be 0. `@unique_id`, `@created_at`, and `@modified_at` are set when FileChunk is initialized, so they always have some value. I think these fields should always be written to the meta file. So, this PR adds a check for the `id`, `c`, and `m` fields.

Previously, default values were used if the metadata was broken. However, that can miss the corruption and result in unexpected errors. So, this PR enhances the detection of broken metadata files instead of falling back to default values. This change is backward compatible with v0.14 behavior.

**Docs Changes**:
Not necessarily required.

**Release Note**:
buf_file: reinforce buffer file corruption check

---------

Signed-off-by: Shizuo Fujita <[email protected]>
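The `id`, `c`, and `m` check described above might look roughly like the following. This is a minimal sketch, not the code from the PR: `validate_meta!`, `BrokenMetaError`, and `REQUIRED_META_KEYS` are hypothetical names, and it assumes the unpacked meta data is a Hash with symbol keys.

```ruby
# Hypothetical sketch of a stricter meta check; not fluentd's actual code.
# Assumption: the unpacked meta file is a Hash with symbol keys.
REQUIRED_META_KEYS = %i[id c m].freeze

class BrokenMetaError < StandardError; end

# Returns the data unchanged when it looks sane, raises otherwise.
def validate_meta!(data)
  raise BrokenMetaError, "metadata is not a Hash" unless data.is_a?(Hash)

  # A field counts as missing when the key is absent or its value is nil.
  missing = REQUIRED_META_KEYS.select { |key| data[key].nil? }
  unless missing.empty?
    raise BrokenMetaError, "missing fields: #{missing.join(', ')}"
  end

  data
end
```

Raising instead of substituting defaults lets the caller treat the chunk as broken at resume time, rather than hitting a `NoMethodError` later in the enqueue thread.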
In a k8s environment, we received a report that file chunks are unexpectedly broken, which causes an error loop during resume.
This patch mitigates the problem by skipping and deleting broken chunks during resume.
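As an illustration of the skip-and-delete idea, here is a minimal, self-contained sketch. It is not the patch itself: `load_chunk`, `resume_chunks`, and `BrokenChunkError` are hypothetical names, the `*.log` glob and `.meta` suffix are assumed file naming, and "broken" is simplified to "the chunk file is empty".

```ruby
require "tmpdir"

# Hypothetical error type for this sketch.
class BrokenChunkError < StandardError; end

# Hypothetical loader. Assumption: an empty chunk file counts as broken.
def load_chunk(path)
  data = File.binread(path)
  raise BrokenChunkError, "chunk file is empty" if data.empty?
  data
end

# Loads every chunk in dir; broken chunks are logged, deleted, and skipped
# so that one bad file cannot block the resume of the others.
def resume_chunks(dir)
  loaded = {}
  Dir.glob(File.join(dir, "*.log")).sort.each do |path|
    begin
      loaded[File.basename(path)] = load_chunk(path)
    rescue BrokenChunkError => e
      warn "found broken chunk file during resume: path=#{path} err_msg=#{e.message}"
      # Remove the chunk and its meta file (assumed to be path + ".meta").
      [path, path + ".meta"].each { |f| File.unlink(f) if File.exist?(f) }
    end
  end
  loaded
end
```

Deleting is the simplest recovery; as the code comment in the diff notes, a `backup_dir` feature would move these files instead of unlinking them.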