contrib: utxo_to_sqlite.py: add option to store txid/spk as BLOBs #32621

theStack · 2025-05-27T03:03:35Z

This PR is a late follow-up to #27432, introducing an option for the utxo-to-sqlite script to store the txid/scriptPubKey columns as bytes (= BLOB storage class in sqlite, see e.g. https://www.sqlite.org/datatype3.html in sqlite) rather than hex strings. This was proposed in earlier reviews (#27432 (comment), #27432 (comment)) and has the obvious advantage of a significantly smaller size of the resulting database (and with that, faster conversion) and the avoidance of hex-to-bytes conversion for further processing of the data [1]. The rationale on why hex strings were chosen back then (and still stays the default, if only for compatibility reasons) is laid out in #27432 (comment) [2].

The approach taken is introducing new parameters --spk and --txid which can either have the values "hex", "raw" (for scriptpubkey) and "hex", "raw", "rawle" (for txid). Thanks to ajtowns for providing this suggestion. Happy to take further inputs on naming and thoughts on future extensibility etc.

[1] For a concrete example, I found that having these columns as bytes would be nice while working on a SwiftSync hints generator tool (https://github.com/theStack/swiftsync-hints-gen), which takes the result of the utxo-to-sqlite tool as input.
[2] note that in contrast what I wrote back then, I think there is no ambiguity on byte-string-serialization of txids; they are ultimately just hash results and hence, they should be stored as such, and adding a big/little endian knob wouldn't make much sense. The drawback of not being able to immediately show txid-strings (as one would need to do the bytes-reversal step first, which is not possible in sqlite, see e.g. #24952 (comment)) still remains though.

DrahtBot · 2025-05-27T03:03:37Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32621.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	w0xlt
Approach ACK	ajtowns

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#32116 (contrib: refactor: dedup deserialization routines in utxo-to-sqlite script by theStack)
#31560 (rpc: allow writing UTXO set to a named pipe, introduce dump_to_sqlite.sh script by theStack)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

if --spk=raw, then scriptpubkey will be BLOB instead. -> If --spk=raw, then scriptpubkey will be BLOB instead. [sentence start should be capitalized]

No other typos were found.

^{drahtbot_id_4_m}

ajtowns · 2025-05-27T04:40:40Z

The approach taken is introducing a -m/--mode parameter which can either have the values "hex" (default) or "bytes". Happy to take suggestions on naming and thoughts on future extensibility etc.

I wonder if separate options would be better, eg --spk=raw --txid=raw?

I think there is no ambiguity on byte-string-serialization of txids; they are ultimately just hash results and hence, they should be stored as such, and adding a big/little endian knob wouldn't make much sense.

I think it will be surprising if hex(txid) gives "reversed" values so that just copy and pasting into block explorers doesn't work. If you did separate flags, you could consider --txid=hex, --txid=raw (same byte order as hex), and --txid=rawhash (same byte order as sha256 output) as separate options, maybe?

Here's a patch: ajtowns@d1fc5df

Could consider allowing --spk=none --txid=none to drop both fields for further space saving if you're just analysing the utxo set, not doing things with it (or --txid=counter to replace the actual txid with a counter that saves space but still lets you notice when utxos are from the same tx).

The drawback of not being able to immediately show txid-strings (as one would need to do the bytes-reversal step first, which is not possible in sqlite, see e.g. #24952 (comment)) still remains though.

Seems like you can get addons for this, for whatever that's worth.

theStack · 2025-05-27T14:00:31Z

@ajtowns: Thanks for your suggestions, I like the idea of having more flexibility and mostly agree with the naming. Took your patch (TIL about argparse choices) and adapted the functional test accordingly, to exercise all combinations of new options. As for txid format option naming, I think I'd intuitively prefer raw for storing them in the hash order, since that's AFAICT the most common txid byte-string serialization (I think I haven't encountered any counter-examples so far in the Bitcoin ecosystem) -- but then one would need to pick a good name for the "reverse-hash" byte-string serialization again. So maybe it's just fine 🤷‍♂️ Personally, with the current naming, I'd only either wanted to use the combinations --txid=hex --spk=hex (i.e. the default, the only possible on master), or --txid=rawhash --spk=raw.

Could consider allowing --spk=none --txid=none to drop both fields for further space saving if you're just analysing the utxo set, not doing things with it (or --txid=counter to replace the actual txid with a counter that saves space but still lets you notice when utxos are from the same tx).

Pretty neat ideas, and potentially useful for e.g. dust (and other "spam") or output script type analysis. Will consider for further follow-ups.

DrahtBot · 2025-05-27T14:06:55Z

🚧 At least one of the CI tasks failed.
_{Task lint: https://github.com/bitcoin/bitcoin/runs/42967430443}
_{LLM reason (✨ experimental): The CI failure is caused by a linting error due to a Python f-string missing placeholders.}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

w0xlt

ACK aaf1173

ajtowns · 2025-05-27T23:15:48Z

I think I'd intuitively prefer raw for storing them in the hash order, since that's AFAICT the most common txid byte-string serialization

Maybe raw and rawle then? For future consideration --spk=type so that it only records whether the spk was p2pk, p2pkh, p2sh, p2wpkh, p2wsh, p2tr, p2a, or other would probably be interesting.

theStack · 2025-05-28T17:14:35Z

I think I'd intuitively prefer raw for storing them in the hash order, since that's AFAICT the most common txid byte-string serialization

Maybe raw and rawle then?

raw/rawle are a decent alternative that I'd slightly prefer over rawhash/raw, yes! Force-pushed accordingly. I'm fine with both though, so if other reviewers / potential users of that script feel strongly on rawhash/raw, I'm happy to revert.

For future consideration --spk=type so that it only records whether the spk was p2pk, p2pkh, p2sh, p2wpkh, p2wsh, p2tr, p2a, or other would probably be interesting.

Concept ACK. I think there is room for some bike-shedding on details, like e.g. do we split up p2pk into uncompressed and compressed (or, god forbid, even hybrid?) or whether bare multisig should also be a category (if at all, then probably only with a very weak detection like "first byte is in range OP_1...OP_16, last byte is OP_CHECKMULTISIG", as implementing the full detection as used in Solver seems overblown for a conversion script). But, something to discuss in another PR anyways.

w0xlt

reACK 1b10e88

contrib/utxo-tools/utxo_to_sqlite.py

Co-authored-by: Anthony Towns <[email protected]>

…inations

w0xlt

reACK 7378f27

ajtowns · 2025-10-30T15:50:35Z

Conflicts with #32116 ; let's get that merged first?

Approach ACK

DrahtBot added the Scripts and tools label May 27, 2025

DrahtBot mentioned this pull request May 27, 2025

contrib: refactor: dedup deserialization routines in utxo-to-sqlite script #32116

Open

theStack force-pushed the 202505-utxo-to-sqlite_blobs branch from 77a79cf to d6fe040 Compare May 27, 2025 14:00

theStack force-pushed the 202505-utxo-to-sqlite_blobs branch 2 times, most recently from 7ad3829 to aaf1173 Compare May 27, 2025 14:06

DrahtBot added the CI failed label May 27, 2025

DrahtBot removed the CI failed label May 27, 2025

DrahtBot mentioned this pull request May 27, 2025

rpc: allow writing UTXO set to a named pipe, introduce dump_to_sqlite.sh script #31560

Open

w0xlt reviewed May 27, 2025

View reviewed changes

theStack force-pushed the 202505-utxo-to-sqlite_blobs branch from aaf1173 to 1b10e88 Compare May 28, 2025 17:14

w0xlt reviewed May 28, 2025

View reviewed changes

contrib/utxo-tools/utxo_to_sqlite.py Outdated Show resolved Hide resolved

theStack and others added 2 commits May 29, 2025 02:55

contrib: utxo_to_sqlite.py: add options to store txid/spk as BLOBs

b30fca7

Co-authored-by: Anthony Towns <[email protected]>

test: run utxo-to-sqlite script test with spk/txid format option comb…

7378f27

…inations

theStack force-pushed the 202505-utxo-to-sqlite_blobs branch from 1b10e88 to 7378f27 Compare May 29, 2025 00:56

w0xlt reviewed May 29, 2025

View reviewed changes

achow101 requested review from ajtowns and willcl-ark October 22, 2025 14:58

ajtowns removed their request for review October 30, 2025 16:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

contrib: utxo_to_sqlite.py: add option to store txid/spk as BLOBs #32621

contrib: utxo_to_sqlite.py: add option to store txid/spk as BLOBs #32621

theStack commented May 27, 2025 •

edited

Loading

Uh oh!

DrahtBot commented May 27, 2025 •

edited

Loading

Uh oh!

ajtowns commented May 27, 2025

Uh oh!

theStack commented May 27, 2025

Uh oh!

DrahtBot commented May 27, 2025

Uh oh!

w0xlt left a comment

Uh oh!

ajtowns commented May 27, 2025

Uh oh!

theStack commented May 28, 2025

Uh oh!

w0xlt left a comment

Uh oh!

Uh oh!

w0xlt left a comment

Uh oh!

ajtowns commented Oct 30, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

contrib: utxo_to_sqlite.py: add option to store txid/spk as BLOBs #32621

Are you sure you want to change the base?

contrib: utxo_to_sqlite.py: add option to store txid/spk as BLOBs #32621

Conversation

theStack commented May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrahtBot commented May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage & Benchmarks

Reviews

Conflicts

LLM Linter (✨ experimental)

Uh oh!

ajtowns commented May 27, 2025

Uh oh!

theStack commented May 27, 2025

Uh oh!

DrahtBot commented May 27, 2025

Uh oh!

w0xlt left a comment

Choose a reason for hiding this comment

Uh oh!

ajtowns commented May 27, 2025

Uh oh!

theStack commented May 28, 2025

Uh oh!

w0xlt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

w0xlt left a comment

Choose a reason for hiding this comment

Uh oh!

ajtowns commented Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

theStack commented May 27, 2025 •

edited

Loading

DrahtBot commented May 27, 2025 •

edited

Loading

ajtowns commented Oct 30, 2025 •

edited

Loading