Detect incomplete or corrupted downloaded files
Quite often, a downloaded file ends up incomplete or corrupted, which can lead to unexpected problems downstream.
I propose to create one (or two) files beside the target file `target.ext`, named `target.ext.size` (and `target.ext.sha`).
When we need to parse a file, we currently check only its presence (e.g. in `ScopInstallation.ensureClaInstalled()` for the SCOP classification DB file). In my proposal, we check not only its presence, but also its size and/or its hash code.

- Hash code: some web folders publish the hash code along with the downloadable file itself. If not, we ignore the hash code.
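For the hash side, a digest of the downloaded file can be computed with `java.security.MessageDigest`, which every JVM ships with. The sketch below shows what a `checkHash()` helper might build on; the class and method names are illustrative, not existing BioJava API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative helper: compute a hex SHA-256 digest of a local file. */
public class FileHashUtil {

    public static String sha256Hex(Path file) throws IOException {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256"); // guaranteed on every Java platform
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            // stream the file in chunks so large DB files are never fully loaded into memory
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

The hex string it produces can be compared directly against a published `.sha` value (case-insensitively, since servers differ in casing).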
- Size: we can issue a `HEAD` request before the `GET` or `POST` call that downloads the file itself, extract the size, and save it.
The four functions to store/check the size/hash can be implemented centrally in a separate location (most probably in biojava-core) as public static methods and called only at the respective use sites. This way, we hopefully will not have to modify the current code a lot: just add one (or two) lines calling `storeSize()` and `storeHash()` at the file download location, and call `checkSize()` and/or `checkHash()` from within the `ensureXXXInstalled()` methods.
Sample code in case someone wants to work on it:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class ContentLengthDemo {
    public static void main(String[] args) {
        HttpURLConnection httpConnection = null;
        try {
            URL url = new URL("https://scop.berkeley.edu/downloads/parse/dir.des.scope.2.07-stable.txt");
            URLConnection connection = url.openConnection();
            if (connection instanceof HttpURLConnection) {
                httpConnection = (HttpURLConnection) connection;
            }
            // The Content-Length response header is the expected size of the download
            System.out.println("Content-Length: " + connection.getContentLengthLong());
            // if (httpConnection != null) {
            //     System.out.println(httpConnection.getResponseCode() + " " + httpConnection.getResponseMessage());
            //     httpConnection.disconnect();
            // }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
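The `storeSize()`/`checkSize()` pair proposed above could look roughly like the following. The method names come from the proposal; the plain-ASCII `.size` sidecar format is an assumption about how the validation file would be laid out:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch of the proposed size-validation pair; not existing BioJava API. */
public class SizeValidation {

    /** Write the expected size (e.g. from Content-Length) to target.ext.size as plain ASCII. */
    public static void storeSize(Path target, long expectedSize) throws IOException {
        Path sizeFile = Paths.get(target.toString() + ".size");
        Files.write(sizeFile, Long.toString(expectedSize).getBytes(StandardCharsets.US_ASCII));
    }

    /** True if target.ext.size is absent (assumed valid) or matches the actual file size. */
    public static boolean checkSize(Path target) throws IOException {
        Path sizeFile = Paths.get(target.toString() + ".size");
        if (!Files.exists(sizeFile)) {
            return true; // no validation file -> nothing to check against
        }
        long expected = Long.parseLong(
                new String(Files.readAllBytes(sizeFile), StandardCharsets.US_ASCII).trim());
        return Files.size(target) == expected;
    }
}
```

`storeSize()` would be called once at the download site, and `checkSize()` from the various `ensureXXXInstalled()` methods, matching the one-or-two-lines-per-call-site goal above.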
Hi, could you please give me a brief overview of where the change is required? I would like to contribute!
Hi @Sounak123, sorry for my late reply. We always welcome new contributors.
You will need to create 2 ~~abstract~~ (sorry, I mean static) methods somewhere in biojava-core, preferably in package `org.biojava.nbio.core.util`, maybe in `FileDownloadUtils`:

- `createValidationFiles(URLConnection resource, File localDestination, URLConnection sourceHash)`
- `boolean validateFile(File localFile)`
The sourceHash may be null if the server does not provide it.
When we want to download a file, we call `createValidationFiles` first. It should create `target.ext.size` and/or `target.ext.sha`.
This method may be overloaded (for example, to support different hash code types).
The contents of these files should be plain ASCII.
Before we open a file, we call `validateFile`.
It will check whether each validation file is present beside the local file, named with the full file name plus the validation extension. If a validation file is not present (say the `.sha` file), that check is assumed valid and the method proceeds to the other one. Better to start with the size, since it is the cheaper check.
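The flow just described (size first, then hash, missing sidecar files assumed valid) might be sketched like this. The SHA-256 choice and the layout are assumptions rather than a committed design, and the digest code is inlined so the sketch stands alone:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of the proposed validateFile(); not existing BioJava API. */
public class FileValidator {

    public static boolean validateFile(File localFile) throws IOException {
        // 1) cheap check first: compare against the .size sidecar if present
        File sizeFile = new File(localFile.getPath() + ".size");
        if (sizeFile.exists()) {
            long expected = Long.parseLong(
                    new String(Files.readAllBytes(sizeFile.toPath())).trim());
            if (localFile.length() != expected) {
                return false; // truncated (or over-long) download
            }
        }
        // 2) then the hash, only if the .sha sidecar exists
        File shaFile = new File(localFile.getPath() + ".sha");
        if (shaFile.exists()) {
            String expected = new String(Files.readAllBytes(shaFile.toPath())).trim();
            if (!sha256Hex(localFile).equalsIgnoreCase(expected)) {
                return false;
            }
        }
        return true; // missing sidecar files are treated as "assumed valid"
    }

    private static String sha256Hex(File file) throws IOException {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 exists on every JVM
        }
        try (InputStream in = Files.newInputStream(file.toPath())) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```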
Any other questions, please feel free to ask.
This is a good idea. I think there are a couple of places where we check if the file is empty, but without the hash it's tough to know whether the download completed. Using HEAD requests is a good idea. In addition to implementing this we should look for other places where files are downloaded and try to make sure they all use the new code.
One other strategy could be to download files to a temp location and then move them. I think that would fix the most common case of an interrupted connection.
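The temp-then-move strategy can be kept quite small with `java.nio.file`. A sketch, assuming the destination directory is writable (the `.part` suffix is just an illustrative convention):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch: download to a temp file, then move it into place once complete. */
public class SafeDownload {

    public static void downloadTo(URL source, Path destination) throws IOException {
        // Create the temp file next to the destination so the final move stays
        // on one filesystem and is effectively a cheap rename.
        Path tmp = Files.createTempFile(destination.getParent(),
                destination.getFileName().toString(), ".part");
        try (InputStream in = source.openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            // Only a fully downloaded file reaches this move; an interrupted
            // connection leaves behind just the .part file, never a truncated target.
            Files.move(tmp, destination, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // clean up the partial file if anything failed
        }
    }
}
```

This addresses the interrupted-connection case, while the size/hash sidecar files would still catch corruption the server itself delivered.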
This issue was inspired by #979.
Well, since Sounak has not shown up again, I'll do it myself. Hopefully I can get to it this coming weekend.
Downloading to a temp location and then moving the file should help minimize the causes of problems. However, it may introduce new problems related to write privileges and available disk space. Let's stick to validating the size and/or hash first, then list all places where files are downloaded; after that we can add downloading to a temporary location before moving the file to its final destination.
Hi @aalhossary, sorry for the delay; I was in the middle of a few things so I couldn't work on this. Could I please continue with this issue, if that is OK? I need a few days, 3 at most.
Sorry @Sounak123, for some unknown reason I didn't receive a notification about your comment. When I noticed it, it was already too late.
Anyway, you still can participate in the hash code part if you like.
Sure @aalhossary you can assign it to me