Read repair empty file in chunks_head#8061
Conversation
Syncs can take seconds, so this could cause quite a stutter. What's the failure mode that this is meant to be covering? This should only matter if there's a machine crash within a minute of the write. |
Prometheus/TSDB not able to start after a crash if it was unlucky to hit this case. Another option is to ignore an empty last file while reading the files. |
|
A Prometheus crash or a machine crash? Sync should never make a difference for a Prometheus crash. |
|
Machine crash, hence a force shutdown on Prometheus |
|
Could we do it async as we do elsewhere? Being resilient in the read path is also a good idea. |
We want data to go to disk before rename, so async might not be helpful.
I will do this rather. |
Signed-off-by: Ganesh Vernekar <[email protected]>
daecca3 to
4a4a603
Compare
|
I have updated the PR to do a read repair - deletes the last file if it is empty. Any empty file in between the sequence will still lead to error. @brian-brazil would you like to take a look? |
Signed-off-by: Ganesh Vernekar <[email protected]>
Signed-off-by: Ganesh Vernekar <[email protected]>
fe2627c to
c2cfddf
Compare
Signed-off-by: Ganesh Vernekar <[email protected]>
bwplotka
left a comment
There was a problem hiding this comment.
LGTM, just minor nit! thanks.
Signed-off-by: Ganesh Vernekar <[email protected]>
* Read repair empty file in chunks_head Signed-off-by: Ganesh Vernekar <[email protected]> * Refactor and introduce repairLastChunkFile Signed-off-by: Ganesh Vernekar <[email protected]> * Attempt windows test fix Signed-off-by: Ganesh Vernekar <[email protected]> * Fix review comments Signed-off-by: Ganesh Vernekar <[email protected]> * Fix review comments Signed-off-by: Ganesh Vernekar <[email protected]>
Even though we first write the segment file to a temporary file and then rename, the
Close()method does not assure writes to be reflected to disk (TIL). We faced themmap: invalid argumenterror again with the temp file protection.We should be syncing the temporary file before closing, similar to what we do for compaction. While this is on the hotpath, the sync call won't be that often and should not be a problem in most cases.Since we don't want stalls on hot path, this PR does a read repair - deletes the last file if it is empty. Any empty file in between the sequence will still lead to error.