Skip to content

Conversation

@nevans
Copy link
Collaborator

@nevans nevans commented Aug 7, 2025

When using SequenceSet to store mailbox state for mailboxes with many millions of messages, the last thing you want to do is accidentally generate a string and output that!

So this establishes a maximum number of entries to output as an inspect string, currently 512. Above that number, we output only a much smaller number of entries from the head and tail, currently 16 from each end.

Net::IMAP::SequenceSet(((1..5000) % 2).to_a).inspect
#=> #<Net::IMAP::SequenceSet 2500 entries "1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,...(2468 entries omitted)...,4969,4971,4973,4975,4977,4979,4981,4983,4985,4987,4989,4991,4993,4995,4997,4999">

A note on the implementation

When the sequence set has an internal denormalized string, we use regexps to grab the first and last groups of entries from that.

At first, I used a simple Regexp match anchored to \z to grab the tail entries from the string. But that was shockingly slow even when the string isn't extremely large.

Next I benchmarked replacing Regexp#match with String#rpartition and that performed fine, using the same naive regexps. String#rindex was also fine. They are both slower than Regexp#match anchored to \A, but not too bad and not dramatically slower based on string size. I found myself wishing for a String#rscan or String#rsplit. 😆

Then I benchmarked replacing \d+ with \d{0,10} or [1-9]\d{1,9}, and those both made the regexp fast... but Regexp.linear_time? doesn't like nested quantifiers. Fully expanding both the \d+ and the (entry){100} keeps it fast and satisfies Regexp.linear_time?.

But that regexp is UGLY! 🙁

@nevans nevans changed the title ⚡🔎 Abridged SequenceSet#inspect output ⚡🔎 Abridged SequenceSet#inspect output Aug 7, 2025
@nevans nevans force-pushed the sequence_set/abridged-inspect branch from a9f8422 to ef3a6e7 Compare August 7, 2025 22:01
When using SequenceSet to store mailbox state for mailboxes with many
millions of messages, the last thing you want to do is accidentally
generate a string and output that!

So this establishes a maximum number of entries to output as an inspect
string, currently 512.  Above that number, we output only a much smaller
number of entries from the head and tail, currently 16 from each end.
When the sequence set has an internal denormalized string, we use
regexps to grab the first and last groups of entries from that.

A note on the implementation:  At first, I used a simple Regexp match
anchored to `\z` to grab the tail entries from the string.  But that was
shockingly slow even when the string isn't extremely large.

Next I benchmarked replacing `Regexp#match` with `String#rpartition` and
that performed fine, using the same naive regexps.  `String#rindex` was
also fine.  They are both slower than `Regexp#match` anchored to `\A`,
but not too bad and not dramatically slower based on string size.

Then I benchmarked replacing `\d+` with `\d{0,10}` or `[1-9]\d{1,9}`,
and those both made the regexp fast... but `Regexp.linear_time?` doesn't
like nested quantifiers.  Fully expanding both the `\d+` and the
`(entry){100}` keeps it fast _and_ satisfies `Regexp.linear_time?`.

But that regexp is UGLY! 🙁
@nevans nevans force-pushed the sequence_set/abridged-inspect branch from ef3a6e7 to 3585984 Compare August 7, 2025 22:13
@nevans nevans merged commit bac67ca into master Aug 7, 2025
37 checks passed
@nevans nevans deleted the sequence_set/abridged-inspect branch August 7, 2025 22:39
#
# This unrolls all nested quantifiers. It's much harder to read, but it's
# also the fastest out of all the versions I tested.
nz_uint32 = /[1-9](?:\d(?:\d(?:\d(?:\d(?:\d(?:\d(?:\d(?:\d(?:\d)?)?)?)?)?)?)?)?)?/
Copy link
Collaborator Author

@nevans nevans Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n.b:

# max digits in nz-number
(2**32 - 1).digits.count => 10
# to unroll /[1-9]\d{0,9}/
/[1-9]#{9.times.inject(//) { /(?:\d#{_1.source})?/ }.source}/ ==
  /[1-9](?:\d(?:\d(?:\d(?:\d(?:\d(?:\d(?:\d(?:\d(?:\d)?)?)?)?)?)?)?)?)?/ =>
  true

@nevans nevans changed the title ⚡🔎 Abridged SequenceSet#inspect output ⚡🔎 Abridge SequenceSet#inspect output for more than 512 entries Aug 28, 2025
@nevans nevans added the enhancement New feature or request label Sep 30, 2025
@nevans nevans added the sequence-set Any code the IMAP `sequence-set` data type or grammar rule, especially the SequenceSet class. label Dec 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request sequence-set Any code the IMAP `sequence-set` data type or grammar rule, especially the SequenceSet class.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants