Incremental updates for OsvDownloadTask #5537
Conversation
Implemented incremental updates for the OsvDownloadTask Adapted and added tests for the OsvDownloadTask Signed-off-by: jonbally <[email protected]>
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferences |
9485073 to
8593bbe
Compare
Fixed the Codacy issue 'Avoid instantiating FileInputStream, FileOutputStream, FileReader, or FileWriter' in writeTimestampFile(). Removed 404 stub for any unspecified endpoint from test, as it is the default behaviour of wiremock. Removed ecosystems.txt and replaced the local file in the stub with a proxied response from the OSV cloud storage (previous test behavior). Signed-off-by: jonbally <[email protected]>
nscuro
left a comment
There was a problem hiding this comment.
Overall great work on this, thanks!
Especially appreciate you injecting a Clock for being able to test the time-sensitive logic.
Added an escape hatch, if incremental update would contain more than 1000 new or modified advisories, a full mirror is initiated. Removed the outer retry (download retry). Added consumeResultBeforeRetryAttempt() to the request retry. Added small helper method for URL encoding. Implemented smaller requested changes. Signed-off-by: jonbally <[email protected]>
|
I have implemented the requested changes, the failing tests seem to be unrelated. These |
Description
In it's current implementation the
OsvDownloadTaskwill download and process all existing advisory files for each selected ecosystem on every run. Depending on the selected ecosystems, this can lead to a big daily resource demand (e.g. mirroring the entire collection of OSV Ubuntu advisories takes more than one hour on our instance).This PR adds changes to the
OsvDownloadTaskto support incremental updates for the selected OSV ecosystems.I also adapted existing tests and included new ones for full and incremental updates.
A full mirror for the selected ecosystems will still be performed every 5 days (similar to the
NistMirrorTask) to stay in sync with the OSV database. In certain cases some changes might be missed in the incremental update, e.g. when the modified_id.csv file contains a lot of new or modified ids, sometimes more than 40k, the incremental update will cut off at 10k.Addressed Issue
Enhancement for
OsvDownloadTaskAdditional Details
The incremental updates are utilizing the
storage.googleapis.com/osv-vulnerabilities/<ecosystem>/modified_id.csvfile.It contains the IDs of recently added and modified advisories as well as the timestamp of the addition/modification.
The implementation is conceptually similar to the
NistMirrorTask:osvdirectory for the respective ecosystemall.zip(renamed),.zip.tstimestamp file for theall.zipfile, and.csv.tstimestamp file for themodified_id.csv(renamed) file.csv.tsfile for selecting relevant entries of the.csvI did not modify the parts of the
OsvDownloadTaskthat perform the database update.At the moment each advisory is still individually going through these steps, causing many DB reads and writes:
Full mirror:
Incremental mirror:
I think this could be optimized a lot by processing the advisories in batches.
An alternative to the current implementation, which I saw discussed somewhere recently, would be to implement an OSV analyzer using the OSV API instead of mirroring all of the advisories and performing the matching in DT.
Some points about this PR which might be worth discussing:
all.zipfilesChecklist