Add base Fingerprint class #35

Merged
JacobHayes merged 2 commits into golden from fingerprint on Apr 25, 2021

@JacobHayes

Adds a base Fingerprint class that will be used to fingerprint Artifacts, Producer Versions, etc. Fingerprints are represented as an int64. When fingerprinting string content, we use the farmhash Fingerprint64 algorithm (via pyfarmhash), but convert the resulting uint64 to an int64 using two's complement.
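
For illustration, the uint64 to int64 conversion described above might look roughly like the sketch below. The helper names (`uint64_to_int64`, `int64_to_uint64`) and constants are hypothetical and are not taken from this PR's code; the sketch uses a literal value in place of a real hash so that it runs without pyfarmhash installed.

```python
# Hypothetical helpers for illustration only; not the implementation in this PR.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def uint64_to_int64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed one (two's complement)."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError(f"{value} is out of range for uint64")
    # Values with the top bit set wrap around into the negative range.
    return value - 2**64 if value > INT64_MAX else value


def int64_to_uint64(value: int) -> int:
    """Inverse of uint64_to_int64."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"{value} is out of range for int64")
    return value + 2**64 if value < 0 else value


# In practice `raw` would be the uint64 produced by the hash function.
raw = UINT64_MAX
signed = uint64_to_int64(raw)
print(signed)  # -1
print(int64_to_uint64(signed) == raw)  # True
```

The mapping is a bijection between the two 64-bit ranges, so no information is lost by storing the signed form.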

I also (went a little overboard and) added int64/uint64 types to enforce clear casting behavior and mypy static typing.

Relates to #30 - I'll be adding the concrete use in Storage and Views in a follow up PR.

--

int64:

  • Compact Storage: ✅
    • An int64 should be much more compact than a string, though a string can serve as a fallback (eg: JavaScript/JSON).
  • Simple Combining Semantics (XOR): ✅
    • XOR is associative and commutative
  • Datatype Portability: 📈
    • string > int64 > uint64
    • Databases: All databases support string, most support int64, but a few key ones don't support uint64 (eg: Postgres, Dgraph, etc).
    • Languages: Most support int64, but JavaScript/JSON are notable exceptions (use BigNumber/json-bignumber or treat as strings).
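
As a rough sketch of the XOR combining semantics listed above (the `combine` helper is hypothetical, not part of this PR), the snippet below shows that the combined value does not depend on order or grouping and stays within the int64 range:

```python
from functools import reduce
from operator import xor
from typing import Iterable

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def combine(fingerprints: Iterable[int]) -> int:
    """XOR-combine int64 fingerprints; 0 acts as the identity value."""
    return reduce(xor, fingerprints, 0)


parts = [1234567890123456789, -42, INT64_MIN, INT64_MAX]
combined = combine(parts)

# Commutative: reordering the inputs gives the same result.
print(combine(list(reversed(parts))) == combined)  # True
# Associative: combining partial results gives the same result.
print(combine([combine(parts[:2]), combine(parts[2:])]) == combined)  # True
# XOR of two in-range values never leaves the int64 range.
print(INT64_MIN <= combined <= INT64_MAX)  # True
```

One property worth keeping in mind with this approach: XOR is its own inverse, so combining the same fingerprint twice cancels it out (`combine([x, x]) == 0`).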

Farmhash Fingerprint64:

  • Speed to Compute: ✅
  • Entropy: ✅
    • Farmhash Fingerprint64 should have enough entropy; we're not aiming for cryptographic-strength hashing here.
  • Implementation Portability: 📈
    • Packages exist in a handful of languages (original C++ library, Python, Go, JS, etc) and databases (eg: BigQuery, ClickHouse).
    • Notably missing for:
      • Snowflake (they have a custom HASH/HASH_AGG, but perhaps we can request farmhash or add via a JS UDF)
      • Postgres

For systems that don't support farmhash, we can use an alternate mechanism (eg: Snowflake's HASH_AGG) with the caveat that content fingerprints won't be portable when/if changing storage/view.

@JacobHayes JacobHayes self-assigned this Apr 21, 2021
@codecov
codecov bot commented Apr 21, 2021

Codecov Report

Merging #35 (313d224) into golden (1cdc6e3) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 313d224 differs from pull request most recent head 9196625. Consider uploading reports for the commit 9196625 to get more accurate results

@@            Coverage Diff             @@
##            golden       #35    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           13        14     +1     
  Lines          378       496   +118     
  Branches        56        65     +9     
==========================================
+ Hits           378       496   +118     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/arti/fingerprints/core.py | 100.00% <100.00%> (ø) |
| src/arti/internal/utils.py | 100.00% <100.00%> (ø) |
| src/arti/producers/core.py | 100.00% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@JacobHayes JacobHayes force-pushed the fingerprint branch 3 times, most recently from 73be2e3 to 7f45735 Compare April 25, 2021 01:07
@JacobHayes JacobHayes merged commit 5a57411 into golden Apr 25, 2021
@JacobHayes JacobHayes deleted the fingerprint branch April 25, 2021 01:22