Detect incomplete or corrupted downloaded files
Quite often, a downloaded file ends up incomplete or corrupted, which can lead to unexpected problems downstream.
I propose to create one (or two) files beside the target file `target.ext`, named `target.ext.size` (and `target.ext.sha`).
When we need to parse a file, we currently check only its presence (e.g. in `ScopInstallation.ensureClaInstalled()` for the SCOP classification DB file). In my proposal, we check not only its presence, but also its size and/or its hash code.

- Hash code: some web folders publish the hash code along with the downloadable file itself. If not, we ignore the hash code.
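For the hash side, a digest of the downloaded file can be computed with `java.security.MessageDigest`, which every JVM ships with. The sketch below shows what a `checkHash()` helper might build on; the class and method names are illustrative, not existing BioJava API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative helper: compute a hex SHA-256 digest of a local file. */
public class FileHashUtil {

    public static String sha256Hex(Path file) throws IOException {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256"); // guaranteed on every Java platform
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            // stream the file in chunks so large DB files are never fully loaded into memory
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

The hex string it produces can be compared directly against a published `.sha` value (case-insensitively, since servers differ in casing).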
- Size: we can issue a `HEAD` request before the `GET` or `POST` call that downloads the file itself, extract the size, and save it.
The four functions to store/check the size/hash can be implemented centrally in a separate location (most probably in biojava-core) as public static methods and called only at the respective use sites. This way, we hopefully will not have to modify the current code a lot: just add one (or two) lines calling `storeSize()` and `storeHash()` at the file download location, and call `checkSize()` and/or `checkHash()` from within the `ensureXXXInstalled()` methods.
Sample code in case someone wants to work on it:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class ContentLengthDemo {
    public static void main(String[] args) {
        HttpURLConnection httpConnection = null;
        try {
            URL url = new URL("https://scop.berkeley.edu/downloads/parse/dir.des.scope.2.07-stable.txt");
            URLConnection connection = url.openConnection();
            if (connection instanceof HttpURLConnection) {
                httpConnection = (HttpURLConnection) connection;
            }
            // The Content-Length response header is the expected size of the download
            System.out.println("Content-Length: " + connection.getContentLengthLong());
            // if (httpConnection != null) {
            //     System.out.println(httpConnection.getResponseCode() + " " + httpConnection.getResponseMessage());
            //     httpConnection.disconnect();
            // }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
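The `storeSize()`/`checkSize()` pair proposed above could look roughly like the following. The method names come from the proposal; the plain-ASCII `.size` sidecar format is an assumption about how the validation file would be laid out:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch of the proposed size-validation pair; not existing BioJava API. */
public class SizeValidation {

    /** Write the expected size (e.g. from Content-Length) to target.ext.size as plain ASCII. */
    public static void storeSize(Path target, long expectedSize) throws IOException {
        Path sizeFile = Paths.get(target.toString() + ".size");
        Files.write(sizeFile, Long.toString(expectedSize).getBytes(StandardCharsets.US_ASCII));
    }

    /** True if target.ext.size is absent (assumed valid) or matches the actual file size. */
    public static boolean checkSize(Path target) throws IOException {
        Path sizeFile = Paths.get(target.toString() + ".size");
        if (!Files.exists(sizeFile)) {
            return true; // no validation file -> nothing to check against
        }
        long expected = Long.parseLong(
                new String(Files.readAllBytes(sizeFile), StandardCharsets.US_ASCII).trim());
        return Files.size(target) == expected;
    }
}
```

`storeSize()` would be called once at the download site, and `checkSize()` from the various `ensureXXXInstalled()` methods, matching the one-or-two-lines-per-call-site goal above.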
Hi, could you please give me a brief overview of where the change is required? I would like to contribute!
Hi @Sounak123, sorry for my late reply. We always welcome new contributors.
You will need to create 2 ~~abstract~~ (sorry, I mean static) methods somewhere in biojava-core, preferably in package `org.biojava.nbio.core.util`, maybe in `FileDownloadUtils`:

- `createValidationFiles(URLConnection resource, File localDestination, URLConnection sourceHash)`
- `boolean validateFile(File localFile)`
The sourceHash may be null if the server does not provide it.
When we want to download a file, we call `createValidationFiles` first. It should create `target.ext.size` and/or `target.ext.sha`.
This method may be overloaded (for example, to support different hash code types).
The contents of these files should be plain ASCII.
Before we open a file, we call `validateFile`.
It will check whether each validation file is present beside the local file, named with the full file name plus the validation extension. If a validation file is not present (say the `.sha` file), that check is assumed valid and the method proceeds to the other one. Better to start with the size, since it is the cheaper check.
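The flow just described (size first, then hash, missing sidecar files assumed valid) might be sketched like this. The SHA-256 choice and the layout are assumptions rather than a committed design, and the digest code is inlined so the sketch stands alone:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of the proposed validateFile(); not existing BioJava API. */
public class FileValidator {

    public static boolean validateFile(File localFile) throws IOException {
        // 1) cheap check first: compare against the .size sidecar if present
        File sizeFile = new File(localFile.getPath() + ".size");
        if (sizeFile.exists()) {
            long expected = Long.parseLong(
                    new String(Files.readAllBytes(sizeFile.toPath())).trim());
            if (localFile.length() != expected) {
                return false; // truncated (or over-long) download
            }
        }
        // 2) then the hash, only if the .sha sidecar exists
        File shaFile = new File(localFile.getPath() + ".sha");
        if (shaFile.exists()) {
            String expected = new String(Files.readAllBytes(shaFile.toPath())).trim();
            if (!sha256Hex(localFile).equalsIgnoreCase(expected)) {
                return false;
            }
        }
        return true; // missing sidecar files are treated as "assumed valid"
    }

    private static String sha256Hex(File file) throws IOException {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 exists on every JVM
        }
        try (InputStream in = Files.newInputStream(file.toPath())) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```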
Any other questions, please feel free to ask.
This is a good idea. I think there are a couple of places where we check if the file is empty, but without the hash it's tough to know whether the download completed. Using HEAD requests is a good idea. In addition to implementing this we should look for other places where files are downloaded and try to make sure they all use the new code.
One other strategy could be to download files to a temp location and then move them. I think that would fix the most common case of an interrupted connection.
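The temp-then-move strategy can be kept quite small with `java.nio.file`. A sketch, assuming the destination directory is writable (the `.part` suffix is just an illustrative convention):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch: download to a temp file, then move it into place once complete. */
public class SafeDownload {

    public static void downloadTo(URL source, Path destination) throws IOException {
        // Create the temp file next to the destination so the final move stays
        // on one filesystem and is effectively a cheap rename.
        Path tmp = Files.createTempFile(destination.getParent(),
                destination.getFileName().toString(), ".part");
        try (InputStream in = source.openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            // Only a fully downloaded file reaches this move; an interrupted
            // connection leaves behind just the .part file, never a truncated target.
            Files.move(tmp, destination, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // clean up the partial file if anything failed
        }
    }
}
```

This addresses the interrupted-connection case, while the size/hash sidecar files would still catch corruption the server itself delivered.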
This issue was inspired by #979.
Well, since Sounak has not shown up again, I'll do it myself. Hopefully I can get to it this coming weekend.
Downloading to a temp location and then moving the file should help minimize the causes of problems. However, it may introduce new problems related to write privileges and available disk space. Let's stick to validating the size and/or hash first, then list all places where files are downloaded; after that we can add downloading to a temporary location before moving the file to its final destination.
Hi @aalhossary, sorry for the delay; I was in the middle of a few things so I couldn't work on this. Could I please continue with this issue, if that is OK? I need a few days, 3 at most.
Sorry @Sounak123, for some unknown reason I didn't receive a notification about your comment. When I noticed it, it was already too late.
Anyway, you still can participate in the hash code part if you like.
Sure @aalhossary you can assign it to me