Skip to content

Read repair empty file in chunks_head#8061

Merged
codesome merged 5 commits intoprometheus:masterfrom
codesome:segment-fsync
Oct 21, 2020
Merged

Read repair empty file in chunks_head#8061
codesome merged 5 commits intoprometheus:masterfrom
codesome:segment-fsync

Conversation

@codesome
Copy link
Copy Markdown
Member

@codesome codesome commented Oct 15, 2020

Even though we first write the segment file to a temporary file and then rename, the Close() method does not assure writes to be reflected to disk (TIL). We faced the mmap: invalid argument error again with the temp file protection.

We should be syncing the temporary file before closing, similar to what we do for compaction. While this is on the hotpath, the sync call won't be that often and should not be a problem in most cases.

Since we don't want stalls on hot path, this PR does a read repair - deletes the last file if it is empty. Any empty file in between the sequence will still lead to error.

@brian-brazil
Copy link
Copy Markdown
Contributor

While this is on the hotpath, the sync call won't be that often and should not be a problem in most cases.

Syncs can take seconds, so this could cause quite a stutter. What's the failure mode that this is meant to be covering? This should only matter if there's a machine crash within a minute of the write.

@codesome
Copy link
Copy Markdown
Member Author

codesome commented Oct 15, 2020

What's the failure mode that this is meant to be covering?

Prometheus/TSDB not able to start after a crash if it was unlucky to hit this case. Another option is to ignore an empty last file while reading the files.

@brian-brazil
Copy link
Copy Markdown
Contributor

A Prometheus crash or a machine crash? Sync should never make a difference for a Prometheus crash.

@codesome
Copy link
Copy Markdown
Member Author

Machine crash, hence a force shutdown on Prometheus

@brian-brazil
Copy link
Copy Markdown
Contributor

Could we do it async as we do elsewhere?

Being resilient in the read path is also a good idea.

@codesome
Copy link
Copy Markdown
Member Author

Could we do it async as we do elsewhere?

We want data to go to disk before rename, so async might not be helpful.

Being resilient in the read path is also a good idea.

I will do this rather.

@codesome codesome changed the title fsync head chunks file before closing temporary file Read repair empty file in chunks_head Oct 19, 2020
@codesome
Copy link
Copy Markdown
Member Author

I have updated the PR to do a read repair - deletes the last file if it is empty. Any empty file in between the sequence will still lead to error.

@brian-brazil would you like to take a look?

Copy link
Copy Markdown
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @codesome. LGTM. I just left a question and a couple of nits.

Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
Copy link
Copy Markdown
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just minor nit! thanks.

Signed-off-by: Ganesh Vernekar <[email protected]>
@codesome codesome merged commit 2624d82 into prometheus:master Oct 21, 2020
@codesome codesome mentioned this pull request Oct 27, 2020
@brancz brancz mentioned this pull request Nov 3, 2020
brancz pushed a commit to brancz/prometheus that referenced this pull request Nov 4, 2020
* Read repair empty file in chunks_head

Signed-off-by: Ganesh Vernekar <[email protected]>

* Refactor and introduce repairLastChunkFile

Signed-off-by: Ganesh Vernekar <[email protected]>

* Attempt windows test fix

Signed-off-by: Ganesh Vernekar <[email protected]>

* Fix review comments

Signed-off-by: Ganesh Vernekar <[email protected]>

* Fix review comments

Signed-off-by: Ganesh Vernekar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants