Architecture for non-deterministic mass data collection: part 2: dynamic data lake schemas

Note, this is the final part of a two part series about this project; article #1 is here. Continuing on from where we last left off, now that we had a functioning collection engine producing full graphs of crawled data all the way down to interrogable dataset_items, it was now time to get down to … Continue reading Architecture for non-deterministic mass data collection: part 2: dynamic data lake schemas

Architecture for non-deterministic mass data collection: part 1: collection engine

Note, this is part one of a two part series about this project; article #2 is here. One of my more recent projects was spawned from a pretty interesting idea. The team wanted to build a system that would permit them to scour the Internet for information regarding a particular set of targets; a "target" … Continue reading Architecture for non-deterministic mass data collection: part 1: collection engine

Serverless AWS Lambda architecture for large scale data ingestion

Recently was faced with a requirement to build out an extensible data import framework that would be able to consume various file formats provided by 3rd parties.... but make it faster than the current implementation. The current mechanism that was in place was using a proprietary packaged legacy file ETL product who's output was an … Continue reading Serverless AWS Lambda architecture for large scale data ingestion