Skip to content

Comments

refactored/cleaned up lib/private/files.php and switched get() to ZipStreamer#6893

Closed
McNetic wants to merge 3 commits intoowncloud:masterfrom
McNetic:zipstreamer
Closed

refactored/cleaned up lib/private/files.php and switched get() to ZipStreamer#6893
McNetic wants to merge 3 commits intoowncloud:masterfrom
McNetic:zipstreamer

Conversation

@McNetic
Copy link
Contributor

@McNetic McNetic commented Jan 22, 2014

This commit mainly adds the ability for owncloud to use ZipStreamer instead of the builtin ZipArchive. This reduces disk i/o greatly, as the current implementation creates one temp file for each file to be added to the zip, and then again a temp file of the complete zip, before starting to send the temp zip to the client.
The ZipStreamer creates the zip file on the fly in memory and directly sends the data to the client, no temp file needed at all. The ZipStreamer class was added in a second pull request to 3rdparty.
(this is a replacement for pull request #3439)

cleaned up get() logic
fixed get() to only send headers if requested (xsendfile could get in the way)
do no longer readfile() when already using mod_xsendfile or similar
@ghost
Copy link

ghost commented Jan 22, 2014

Thanks a lot for your contribution! Contributions to the core repo require a signed contributors agreement http://owncloud.org/about/contributor-agreement/ Alternatively you can add a comment here where you state that this contribution is MIT licensed. Some more details about out pull request workflow can be found here: http://owncloud.org/code-reviews-on-github/

@McNetic
Copy link
Contributor Author

McNetic commented Jan 22, 2014

Contributor agreement was sent by mail.

I'm not exactly sure what to do about the second pull request (the one for 3rdparty). Is that all necessary? If yes, it has to be updated, to, to incorporate witchis changes and some more from me, too.

I also want to add that currently (with the Zip64-support as far as I understand), this only works on a 64-bit machine. It effectively imposes this restriction on the whole owncloud, as long as zip download is used. I just started a correspondence with witchi to clear this up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about autoloading?

@DeepDiver1975 You know how to do this with external libraries, right? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to add the ZipStreamer to the 3rdparty repo - preferably we use composer to load it there.
Composer will take case of proper autoloading.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping me in case you need support to bring ZipStreamer to the 3rdparty repo

@bantu
Copy link

bantu commented Feb 16, 2014

I also want to add that currently (with the Zip64-support as far as I understand), this only works on a 64-bit machine. It effectively imposes this restriction on the whole owncloud, as long as zip download is used. I just started a correspondence with witchi to clear this up.

@McNetic Can you please explain the background in technical terms here? See report in #7138 for how floats can be used to represent filesize.

@McNetic
Copy link
Contributor Author

McNetic commented Feb 18, 2014

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

@McNetic https://github.com/McNetic Can you please explain the
background in technical terms here? See report in #7138
#7138 for how floats can
be used to represent filesize.

It was indeed about the file sizes. As the Zip64 extension is now
always active in ZipStreamer, it has to represent file sizes as 64 bit
(especially in the zip file), which works natively only on 64 bit
platforms.
Meanwhile, however, I changed the internal representation of file
sizes to a simple class that is able to represent a 64bit integer with
two 32bit integers on 32bit platforms. It does not do complete 64bit
arithmetic; just the single necessary + operation. I think it is much
easier to get the correct binary representation from this class than
it would be with floats.

As ZipStreamer does not read file sizes, and does process files
chunk-wise with the native fread() function (fopen() is done by
owncloud), there should not be any more problems here.

The only limitation in 32bit machines now should be the LFS support.
It should, however, be perfectly possible to stream zip files bigger
than 2 GB on 32bit machines even without lfs support. I said should,
because I tested the 64bit arithmetic I implemented on a 32bit
machine, but not owncloud itself.

NB: The content-length, of course, can not be advertised with streamed
zip files, as the final size of the file is unknown prior to being sent.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)

iEYEARECAAYFAlMDge0ACgkQYm+MkvsfJ5+UhgCg3PM99I/i8usAdwru9WylCk8w
wWYAnjtT8TpWDPo0xpqxwsqmXzuHg+vO
=x2zx
-----END PGP SIGNATURE-----

@McNetic
Copy link
Contributor Author

McNetic commented Feb 18, 2014

@DeepDiver1975: I just now understood you may have meant me with your offer to help bring ZipStreamer to 3rdparty. I would be glad if you could do that. I'm not really into owncloud very much, and it's just a side project for me. I'd like to concentrate on the technical side of the zip implementation, if possible.

@bantu
Copy link

bantu commented Feb 18, 2014

NB: The content-length, of course, can not be advertised with streamed zip files, as the final size of the file is unknown prior to being sent.

Are you sure? Even without compression?

@DeepDiver1975
Copy link
Member

@DeepDiver1975: I just now understood you may have meant me with your offer to help bring ZipStreamer to 3rdparty. I would be glad if you could do that. I'm not really into owncloud very much, and it's just a side project for me. I'd like to concentrate on the technical side of the zip implementation, if possible.

owncloud-archive/3rdparty#72

@bantu
Copy link

bantu commented Feb 19, 2014

@McNetic Since you're probably most familiar with the ZIP format:

Do you think it is impossible to pre-calculate the size of the generated zip or do you think it is just not worth it? I mean, because there is no compression involved, my naive approach would be to just use two passes. The first pass would only get all filesizes and calculate the size of the zip.
The size of the zip file is kind of important, especially for large files, because this is how the browser calculates progress.

From an architectural point, I would like to note that it is usually not a good idea to do too much work in the constructor of a class. Especially performing output generation, e.g. calling header(), will limit future extensibility of your class. The constructor should setup dependencies only.

Another probably complex question: Is the process of zip generation deterministic such that partial downloads could (in theory) be resumed?

Thanks a lot.

@McNetic
Copy link
Contributor Author

McNetic commented Feb 19, 2014

Do you think it is /impossible/ to pre-calculate the size of the
generated zip or do you think it is just not worth it? I mean, because
there is no compression involved, my naive approach would be to just use
two passes. The first pass would only get all filesizes and calculate
the size of the zip.
The size of the zip file is kind of important, especially for large
files, because this is how the browser calculates progress.
Yeah, sure, I did not rethink the size calculation thing, because
initially, my intention was to enable compression. With compression
enabled, the final size is obviously only known after processing.

So you're right, it should be possible to pre-calculate the file size.
Unfortunately, the lengths of the various headers differ based for
example on file names, currently not implemented comments, and usage of
different zip features (currently, we use always the same feature set,
but that may change in the future). A two-pass approach is certainly
feasible and will prevent some code duplication. However, it makes the
code for actual processing more complex (conditionals everywhere).
Writing a separate method of calculating the file size has the drawback
that every change to the actual logic has to be mirrored here, too. I'm
not sure which option to choose here, but I'd tend to the latter.
But there are two preconditions to this - first, obviously, we need a
method to determine file size correctly on 32bit lfs machines. With the
curl extension, we'd also need a way of converting the base-10 string to
the binary representation. Second, with the current implementation, the
user of ZipStreamer (owncloud in this case), would have to iterate
over the file structure twice. It would also be possible to build a
internal representation of the file structure on the first pass, and use
that for the second one - but that would actually mean to change the
current architecture that instantly sends a file when added.

From an architectural point, I would like to note that it is usually not
a good idea to do too much work in the constructor of a class.
Especially performing output generation, e.g. calling header(), will
limit future extensibility of your class. The constructor should setup
dependencies only.
That's why sending the headers is optional in the constructor. Actually,
I intended to make the usage of the class as simple as possible, with a
similar goal in mind. But I agree that putting that all in a separate
method is a better solution.

Another probably complex question: Is the process of zip generation
deterministic such that partial downloads could (in theory) be resumed?
That's strongly connected to the question of pre-calculation (and
compression, of course). As long as no compression is involved, the
answer is yes, provided that:

  • none of the files where changed between the downloads (obviously)
  • the files and directories are added to the ZipStreamer in the same
    order (I do not know if foreach() is deterministic, and if the order of
    files in the argument passed to get() in files.php is deterministic, and
    if OC_Files::getDirectoryContent() is deterministic)
    Using an internal representation of the file structure as proposed above
    could help here (by sorting before finally processing).

As you can see, both features (size calculation and resuming) are not
completely trivial. And both would rely on compression being disabled.

@bantu
Copy link

bantu commented Feb 19, 2014

I was thinking more about this in the meantime as well.

When the filesize is advertised problems may occur when files change between filesize determination and file reading. I think this is probably my biggest concern.

I don't think that getting the correct filesize and getting it into the correct format is a big problem.

@bantu
Copy link

bantu commented Feb 19, 2014

From a usage perspective, I had something like this in mind:

$streamer = new ZipStreamer('filename.zip'); // Possibly other options.
$streamer->addDirectory('foo/'); // registers directory
$streamer->addFile('bar.txt'); // registers file
$streamer->stream(); // Begins reading and delivering the files, sends headers if not already sent.

This way, a second pass could be added to stream() easily.

@McNetic
Copy link
Contributor Author

McNetic commented Feb 19, 2014

When the filesize is advertised problems may occur when files change
between filesize determination and file reading. I think this is
probably my biggest concern.
I'm not sure if it could in theory also happen that files are changed
during streaming. But that has nothing to do with zip streaming.

I don't think that getting the correct filesize and getting it into the
correct format is a bit problem.
Well, in the link you mentioned is described that filesize() does not
work correctly on 32bit lfs systems (as it can only return 32bit
integers). The mentioned workarounds return base-10 strings, which have
to be converted. The latter is not a bit problem, but it arises from it.

@bantu
Copy link

bantu commented Feb 19, 2014

I'm not sure if it could in theory also happen that files are changed during streaming. But that has nothing to do with zip streaming.

It does have to do with zip streaming. Say in the first pass you see a file as 1000 bytes, then you calculate the filesize of the zip. And then you actually read the file until eof, but now it has 2000 bytes. Thus your filesize of the zip is wrong, if you read the file 2000 bytes file completely.
The more interesting case is actually the other way around, when you see the file as 2000 bytes, but when you try to read it, it only has 1000 bytes.

The latter is not a bit problem, but it arises from it.

I meant "big problem", sorry. Corrected above.

@McNetic
Copy link
Contributor Author

McNetic commented Feb 19, 2014

It does. Say in the first pass you see a file as 1000 bytes, then you
calculate the filesize of the zip. And then you actually read the file
until eof, but now it has 2000 bytes. Thus your filesize of the zip is
wrong, if you read the file 2000 bytes file completely.
The more interesting case is actually the other way around, when you see
the file as 2000 bytes, but when you try to read it, it only has 1000 bytes.
Yes, I meant it's not only a zip problem. If you stream a single file
without zip creation and the size changes during streaming (or after
reading the file size but before streaming), it will also go wrong.

I meant "big problem", sorry. Corrected above.
Oh, ok. Yes, it's not a big problem, but an awful mess a sane
programming language should do for us :-).

@McNetic
Copy link
Contributor Author

McNetic commented Feb 19, 2014

Regarding your usage perspective:

Yes, that's what I meant by building an internal file structure
representation. Currently, ZipStreamer() does not track what files or
directories are added to the Zip. Content is send out instantly, and
meta-information is added to an opaque directory as part of the zip
format (which is not suited for latter processing).

As you can see, what you propose is a quite different approach. It has
it's drawbacks on the versatility of the ZipStreamer class (for
example, if you want to use it for proxying multiple files from
different server(s) into a zip file on the fly).

@DeepDiver1975
Copy link
Member

I'm hijacking this pull request

@MorrisJobke
Copy link
Contributor

@DeepDiver1975 Can you then mention the PR here as it is ready. Thanks :)

@McNetic McNetic deleted the zipstreamer branch April 3, 2014 13:38
@lock lock bot locked as resolved and limited conversation to collaborators Aug 25, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants