Skip to content

Conversation

@theStack
Copy link
Contributor

@theStack theStack commented May 27, 2025

This PR is a late follow-up to #27432, introducing an option for the utxo-to-sqlite script to store the txid/scriptPubKey columns as bytes (= BLOB storage class in sqlite, see e.g. https://www.sqlite.org/datatype3.html in sqlite) rather than hex strings. This was proposed in earlier reviews (#27432 (comment), #27432 (comment)) and has the obvious advantage of a significantly smaller size of the resulting database (and with that, faster conversion) and the avoidance of hex-to-bytes conversion for further processing of the data [1]. The rationale on why hex strings were chosen back then (and still stays the default, if only for compatibility reasons) is laid out in #27432 (comment) [2].

The approach taken is introducing new parameters --spk and --txid which can either have the values "hex", "raw" (for scriptpubkey) and "hex", "raw", "rawle" (for txid). Thanks to ajtowns for providing this suggestion. Happy to take further inputs on naming and thoughts on future extensibility etc.

[1] For a concrete example, I found that having these columns as bytes would be nice while working on a SwiftSync hints generator tool (https://github.com/theStack/swiftsync-hints-gen), which takes the result of the utxo-to-sqlite tool as input.
[2] note that in contrast what I wrote back then, I think there is no ambiguity on byte-string-serialization of txids; they are ultimately just hash results and hence, they should be stored as such, and adding a big/little endian knob wouldn't make much sense. The drawback of not being able to immediately show txid-strings (as one would need to do the bytes-reversal step first, which is not possible in sqlite, see e.g. #24952 (comment)) still remains though.

@DrahtBot
Copy link
Contributor

DrahtBot commented May 27, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32621.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK w0xlt
Approach ACK ajtowns

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32116 (contrib: refactor: dedup deserialization routines in utxo-to-sqlite script by theStack)
  • #31560 (rpc: allow writing UTXO set to a named pipe, introduce dump_to_sqlite.sh script by theStack)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

  • if --spk=raw, then scriptpubkey will be BLOB instead. -> If --spk=raw, then scriptpubkey will be BLOB instead. [sentence start should be capitalized]

No other typos were found.

drahtbot_id_4_m

@ajtowns
Copy link
Contributor

ajtowns commented May 27, 2025

The approach taken is introducing a -m/--mode parameter which can either have the values "hex" (default) or "bytes". Happy to take suggestions on naming and thoughts on future extensibility etc.

I wonder if separate options would be better, eg --spk=raw --txid=raw?

I think there is no ambiguity on byte-string-serialization of txids; they are ultimately just hash results and hence, they should be stored as such, and adding a big/little endian knob wouldn't make much sense.

I think it will be surprising if hex(txid) gives "reversed" values so that just copy and pasting into block explorers doesn't work. If you did separate flags, you could consider --txid=hex, --txid=raw (same byte order as hex), and --txid=rawhash (same byte order as sha256 output) as separate options, maybe?

Here's a patch: ajtowns@d1fc5df

Could consider allowing --spk=none --txid=none to drop both fields for further space saving if you're just analysing the utxo set, not doing things with it (or --txid=counter to replace the actual txid with a counter that saves space but still lets you notice when utxos are from the same tx).

The drawback of not being able to immediately show txid-strings (as one would need to do the bytes-reversal step first, which is not possible in sqlite, see e.g. #24952 (comment)) still remains though.

Seems like you can get addons for this, for whatever that's worth.

@theStack
Copy link
Contributor Author

@ajtowns: Thanks for your suggestions, I like the idea of having more flexibility and mostly agree with the naming. Took your patch (TIL about argparse choices) and adapted the functional test accordingly, to exercise all combinations of new options. As for txid format option naming, I think I'd intuitively prefer raw for storing them in the hash order, since that's AFAICT the most common txid byte-string serialization (I think I haven't encountered any counter-examples so far in the Bitcoin ecosystem) -- but then one would need to pick a good name for the "reverse-hash" byte-string serialization again. So maybe it's just fine 🤷‍♂️ Personally, with the current naming, I'd only either wanted to use the combinations --txid=hex --spk=hex (i.e. the default, the only possible on master), or --txid=rawhash --spk=raw.

Could consider allowing --spk=none --txid=none to drop both fields for further space saving if you're just analysing the utxo set, not doing things with it (or --txid=counter to replace the actual txid with a counter that saves space but still lets you notice when utxos are from the same tx).

Pretty neat ideas, and potentially useful for e.g. dust (and other "spam") or output script type analysis. Will consider for further follow-ups.

@theStack theStack force-pushed the 202505-utxo-to-sqlite_blobs branch 2 times, most recently from 7ad3829 to aaf1173 Compare May 27, 2025 14:06
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task lint: https://github.com/bitcoin/bitcoin/runs/42967430443
LLM reason (✨ experimental): The CI failure is caused by a linting error due to a Python f-string missing placeholders.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

Copy link
Contributor

@w0xlt w0xlt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK aaf1173

@ajtowns
Copy link
Contributor

ajtowns commented May 27, 2025

I think I'd intuitively prefer raw for storing them in the hash order, since that's AFAICT the most common txid byte-string serialization

Maybe raw and rawle then? For future consideration --spk=type so that it only records whether the spk was p2pk, p2pkh, p2sh, p2wpkh, p2wsh, p2tr, p2a, or other would probably be interesting.

@theStack theStack force-pushed the 202505-utxo-to-sqlite_blobs branch from aaf1173 to 1b10e88 Compare May 28, 2025 17:14
@theStack
Copy link
Contributor Author

I think I'd intuitively prefer raw for storing them in the hash order, since that's AFAICT the most common txid byte-string serialization

Maybe raw and rawle then?

raw/rawle are a decent alternative that I'd slightly prefer over rawhash/raw, yes! Force-pushed accordingly. I'm fine with both though, so if other reviewers / potential users of that script feel strongly on rawhash/raw, I'm happy to revert.

For future consideration --spk=type so that it only records whether the spk was p2pk, p2pkh, p2sh, p2wpkh, p2wsh, p2tr, p2a, or other would probably be interesting.

Concept ACK. I think there is room for some bike-shedding on details, like e.g. do we split up p2pk into uncompressed and compressed (or, god forbid, even hybrid?) or whether bare multisig should also be a category (if at all, then probably only with a very weak detection like "first byte is in range OP_1...OP_16, last byte is OP_CHECKMULTISIG", as implementing the full detection as used in Solver seems overblown for a conversion script). But, something to discuss in another PR anyways.

Copy link
Contributor

@w0xlt w0xlt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reACK 1b10e88

@theStack theStack force-pushed the 202505-utxo-to-sqlite_blobs branch from 1b10e88 to 7378f27 Compare May 29, 2025 00:56
Copy link
Contributor

@w0xlt w0xlt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reACK 7378f27

@ajtowns
Copy link
Contributor

ajtowns commented Oct 30, 2025

Conflicts with #32116 ; let's get that merged first?

Approach ACK

@ajtowns ajtowns removed their request for review October 30, 2025 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants