@dnnr dnnr commented Jan 23, 2015

Instead of giving all files a fixed block count of 1, this assigns each
deduplicated chunk to a certain file. In effect, the cumulative file
size that is shown in the mountpoint accurately reflects the amount of
actual disk space needed for the repository (barring metadata overhead).

Although the block assignment is done arbitrarily, depending on the
user's access pattern, the sizes will be consistent within the entire
mount point. This facilitates the use of tools like du and ncdu for
inspecting the actual disk usage in a repository as opposed to just
looking at the original, uncompressed, non-deduplicated file sizes.
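
To illustrate the idea, here is a minimal sketch, not the code in this PR. It assumes each file is a list of `(chunk_id, stored_size)` pairs and that a chunk is charged to whichever file references it first; the name `assign_blocks` and the data layout are hypothetical.

```python
BLOCK_SIZE = 512  # st_blocks is conventionally counted in 512-byte units


def assign_blocks(files):
    """Charge every deduplicated chunk to exactly one file.

    `files` maps a path to a list of (chunk_id, stored_size) tuples
    (a hypothetical layout chosen for this sketch). Iteration order stands
    in for the user's access pattern: the first file that references a
    chunk becomes its owner.
    """
    owner = {}   # chunk_id -> path the chunk was charged to
    blocks = {}  # path -> block count to report as st_blocks
    for path, chunks in files.items():
        charged = 0
        for chunk_id, stored_size in chunks:
            if chunk_id not in owner:
                owner[chunk_id] = path
                charged += stored_size
        # Round up to whole blocks.
        blocks[path] = -(-charged // BLOCK_SIZE)
    return blocks


files = {
    "a": [("c1", 1024), ("c2", 512)],
    "b": [("c1", 1024), ("c3", 100)],  # c1 is shared with "a"
}
print(assign_blocks(files))
```

Which file ends up owning a shared chunk depends on the order of access, but each chunk is counted only once, so the total over the mount point stays close to the stored size (up to per-file rounding).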

@ThomasWaldmann
Contributor

can we have some opinions here about this PR?

is there a chance that this might confuse users, if the blocks are more or less random compared to the original filesize?

@dnnr
Author

dnnr commented Mar 6, 2015

On the one hand, yes. But on the other hand, those values are currently simply set to 1, i.e., they're mostly wrong and meaningless anyway. And more importantly: I'd say that the semantics of that field are actually correct this way. It's supposed to represent the "size used on disk" and is therefore allowed to differ arbitrarily from the nominal file size, exactly because of the effects of compression, deduplication, sparse files, or whatever else is going on in the underlying file system.

So of course someone might claim to be confused by those values, but I actually can't think of any better way of populating st_blocks that wouldn't be at least equally confusing. At least this way it's consistent.
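
For reference, the two sizes can be compared on any file like this. It is a small sketch that assumes a POSIX platform (`st_blocks` is not available on Windows) and the usual 512-byte unit; `apparent_and_disk_size` is a hypothetical helper name.

```python
import os


def apparent_and_disk_size(path):
    """Return (nominal size, size used on disk) in bytes for `path`."""
    st = os.stat(path)
    # st_size is the nominal length; st_blocks counts allocated 512-byte units.
    return st.st_size, st.st_blocks * 512
```

On a sparse or transparently compressed file the second value can be far smaller than the first, which is the same discrepancy `du` reports by default versus `du --apparent-size` in GNU coreutils.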
