ARROW-15244: [Format] Clarify that offsets are monotonic for binary like arrays #12019

Closed
alamb wants to merge 4 commits into apache:master from alamb:alamb/clarify_offsets

alamb (Contributor) commented Dec 22, 2021

Rationale

The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs (apache/arrow-rs#1071), and the existing docs seem to be somewhat vague on this issue.

I looked at three implementations of Arrow, and they all seem to assume or validate that the offsets are monotonic.

Changes

Thus I propose updating the format docs to make the monotonic offsets explicit.
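
To make the proposed invariant concrete, here is a minimal sketch in plain Python of what "monotonic offsets" means for a binary-like array. It is not taken from any Arrow implementation: `validate_offsets` is a hypothetical helper, and the sketch assumes the offsets buffer holds at least one entry (n slots use n + 1 offsets).

```python
def validate_offsets(offsets, values_len):
    """Check that offsets are in bounds and monotonically non-decreasing.

    Hypothetical helper for illustration only; real implementations
    perform their own validation.
    """
    if len(offsets) == 0:
        return False  # assumption of this sketch: n slots need n + 1 offsets
    if offsets[0] < 0 or offsets[-1] > values_len:
        return False
    # Every slot must end at or after the point where it starts.
    return all(a <= b for a, b in zip(offsets, offsets[1:]))


# ["ab", null, "cde"]: in this example the null slot has zero length,
# so its offset simply repeats the previous one.
values = b"abcde"
good_offsets = [0, 2, 2, 5]
# An arbitrary offset in the null slot breaks monotonicity.
bad_offsets = [0, 2, 0, 5]

print(validate_offsets(good_offsets, len(values)))  # True
print(validate_offsets(bad_offsets, len(values)))   # False
```

With `bad_offsets`, the second slot would span from 2 to 0, which is the kind of value the clarified docs are meant to rule out even for null entries.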

Background

I think @jorgecarleitao's description in apache/arrow-rs#1071 (comment) explains why having monotonic offsets is a good idea:

I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies why the values buffer of a primitive array is always initialized, and why the offsets are valid and in bounds even for null entries.

The rationale for this is that it is sometimes faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and ANDs the bitmaps instead.
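
The pattern described in that quote can be sketched roughly as follows. This is plain Python for illustration, not the arrow-rs kernel: `slot` and `eq_ignoring_validity` are hypothetical names, lists of booleans stand in for validity bitmaps, and both arrays are assumed to have the same number of slots.

```python
def slot(values, offsets, i):
    """Return the bytes of slot i; safe for null slots if offsets are monotonic and in bounds."""
    return values[offsets[i]:offsets[i + 1]]


def eq_ignoring_validity(left, right):
    """Compare two binary-like arrays slot by slot without consulting validity.

    Each array is a (values, offsets, validity) tuple. The comparison reads
    every slot, null or not, and the output validity is the element-wise AND
    of the two input validities.
    """
    (lv, lo, lvalid), (rv, ro, rvalid) = left, right
    n = len(lo) - 1
    results = [slot(lv, lo, i) == slot(rv, ro, i) for i in range(n)]
    validity = [a and b for a, b in zip(lvalid, rvalid)]
    return results, validity


# left = ["ab", null, "cde"], right = ["ab", "x", "cde"]
left = (b"abcde", [0, 2, 2, 5], [True, False, True])
right = (b"abxcde", [0, 2, 3, 6], [True, True, True])

results, validity = eq_ignoring_validity(left, right)
print(results)   # [True, False, True]
print(validity)  # [True, False, True]
```

The value computed for the null slot is meaningless, but it is masked out by the combined validity. The loop is only safe because the offsets of the null slot are themselves valid and in bounds.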

github-actions bot commented

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

alamb (Contributor Author) commented Dec 22, 2021

pitrou changed the title from "(docs) Clarify that offsets are monotonic for binary like arrays" to "ARROW-15244: [Format] Clarify that offsets are monotonic for binary like arrays" on Jan 4, 2022
github-actions bot commented Jan 4, 2022

github-actions bot commented Jan 4, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

pitrou closed this in e7dc8f5 on Jan 4, 2022
alamb deleted the alamb/clarify_offsets branch on January 4, 2022 21:44
ursabot commented Jan 5, 2022

Benchmark runs are scheduled for baseline = 31a07be and contender = e7dc8f5. e7dc8f5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.79% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

5 participants