
Core, Spark: Use 'delete' if RowDelta only has delete files#10123

Merged
nastra merged 2 commits into apache:main from nastra:use-delete-op-with-delete-files
Apr 16, 2024

Conversation

@nastra (Contributor) commented Apr 11, 2024:

Resolves #10122

@nastra nastra marked this pull request as draft April 11, 2024 14:27
@nastra nastra force-pushed the use-delete-op-with-delete-files branch from 2209f87 to c5dcc3a Compare April 11, 2024 14:40

  @Override
  protected String operation() {
    if (!addsDataFiles() && addsDeleteFiles()) {
@amogh-jahagirdar (Contributor) commented Apr 11, 2024:

If I understand right, wouldn't this condition need to be

if (!addsDataFiles() && (addsDeleteFiles() || deletesDataFiles()))

Contributor:

Basically I think for a metadata only delete we'd want to make sure the operation is delete

@nastra (Contributor Author):

Yes, you're absolutely right; fixed it. This also needs some more tests, which is why it's still WIP.

@nastra (Contributor Author):

@amogh-jahagirdar actually I think if (!addsDataFiles() && addsDeleteFiles()) is correct as the RowDelta API only provides addRows(DataFile) and addDeletes(DeleteFile). deletesDataFiles() would apply for the RewriteFiles API.
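The decision being discussed can be sketched as a standalone model (hypothetical class and method names, not the actual Iceberg `BaseRowDelta` code): a RowDelta commit that adds delete files but no data files should be reported as a "delete" operation, and anything else stays an "overwrite".

```java
// Standalone sketch of the snapshot-operation choice this PR makes for RowDelta.
// Names are illustrative; the real logic lives in BaseRowDelta#operation().
public class RowDeltaOperationSketch {

  static String operation(boolean addsDataFiles, boolean addsDeleteFiles) {
    // Delete-file-only commit -> "delete"; any commit adding data files -> "overwrite".
    if (addsDeleteFiles && !addsDataFiles) {
      return "delete";
    }
    return "overwrite";
  }
}
```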

@amogh-jahagirdar (Contributor) commented Apr 12, 2024:

Ah true, but doesn't the same principle apply for RewriteFiles then? I think technically someone can use RewriteFiles to perform just a delete.

table.newRewrite().deleteFile(file).commit()

It's awkward since they could just do a deleteFile but I think it's still possible, and we'd still want to produce the DELETE operation instead of a REPLACE. It seems like we should have a check in RewriteFiles as well.

@amogh-jahagirdar (Contributor) commented Apr 15, 2024:

@aokolnychyi I agree that we should use replace when it's the same logical set of data, but I think the issue is what RewriteFiles actually enforces currently. It doesn't validate against commits that only have deletes, and I wouldn't expect a logical comparison since that would require comparing the data, which isn't feasible.

I wrote the following test in TestRewriteFiles:

  @TestTemplate
  public void testDelete() {
    commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B), branch);

    table.newRewrite().deleteFile(FILE_A).toBranch(branch).commit();

    table.currentSnapshot();
  }

and the operation commits successfully with a REPLACE operation. I know it's strange to use this API just to delete a file (and it goes against what's in the JavaDoc), but as of today it's technically possible, and it produces a snapshot with the incorrect REPLACE operation, which can lead to issues such as the incremental-processing case.

There are two ways I can think of to address this incorrectness:

1.) Fail at commit time if there are only deletes or only additions. In that case we know the data cannot possibly be logically the same, since there would be a net gain or net loss of data. I'm not sure if there are other cases, but this seems like a reasonable check that doesn't require actually comparing the data.
2.) Don't fail (to preserve the API behavior) and instead, at commit time, if it's only a delete or only an append, use the correct API to perform the operation with the right summary.

I'm leaning towards 1, since the API docs already make it clear that RewriteFiles should only be used when the data is logically the same. If a simple check tells us for a fact that it's not logically the same, we can enforce that.
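Option 1 could be sketched as a simple precondition (a hypothetical standalone class, not the actual `BaseRewriteFiles#validateReplacedAndAddedFiles`): reject a rewrite that only deletes or only adds data files, since the replaced and new sets cannot then be logically equivalent.

```java
// Hypothetical sketch of option 1: a rewrite with only deletions or only
// additions implies a net loss or net gain of data, so it is rejected.
public class RewriteValidationSketch {

  static void validate(int deletedDataFiles, int addedDataFiles) {
    boolean onlyDeletes = deletedDataFiles > 0 && addedDataFiles == 0;
    boolean onlyAdds = deletedDataFiles == 0 && addedDataFiles > 0;
    if (onlyDeletes || onlyAdds) {
      throw new IllegalArgumentException(
          "Files to delete and add must both be specified: deleted="
              + deletedDataFiles + ", added=" + addedDataFiles);
    }
  }
}
```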

@amogh-jahagirdar (Contributor) commented Apr 15, 2024:

I see we have some validation in BaseRewriteFiles#validateReplacedAndAddedFiles, I think we're just missing this case. I'll raise a PR to show what I mean

@aokolnychyi (Contributor) commented Apr 15, 2024:

The use case you highlight is a valid one if FILE_A has all records removed with deletes. Then removing FILE_A has no impact on the table state.

That being said, we should definitely check if the validation is missing any edge case. Thanks for looking into that!

Contributor:

Cool, thanks for explaining! This issue is ultimately orthogonal to the one that is being solved in this PR so we can take that separately.

Contributor:

Agreed.

@nastra nastra force-pushed the use-delete-op-with-delete-files branch 3 times, most recently from 3a32e54 to 5df00aa Compare April 12, 2024 10:06
@nastra nastra marked this pull request as ready for review April 12, 2024 10:11
@nastra nastra force-pushed the use-delete-op-with-delete-files branch from 5df00aa to bbc2212 Compare April 12, 2024 10:12
            deleteRowSchema);

    table.newRowDelta().addDeletes(eqDeletes).commit();
    DataFile dataFile =
@nastra (Contributor Author):

The test is specifically testing for OVERWRITE, hence I'm adding a data file here.

@nastra nastra requested a review from amogh-jahagirdar April 12, 2024 10:33
@nastra nastra force-pushed the use-delete-op-with-delete-files branch 4 times, most recently from 6ef3220 to b0c8b28 Compare April 12, 2024 14:07
  }

  @TestTemplate
  public void deleteSingleRecordProducesDeleteOperation() {
@nastra (Contributor Author):

Something seems to be off with subclasses of TestDelete. The tests produce different results on each run when executed via ./gradlew :iceberg-spark:iceberg-spark-extensions-3.5_2.12:test --tests "org.apache.iceberg.spark.extensions.TestCopyOnWriteDelete" --tests "org.apache.iceberg.spark.extensions.TestMergeOnReadDelete" vs when running TestCopyOnWriteDelete or TestMergeOnReadDelete within the IDE.

@nastra (Contributor Author) commented Apr 12, 2024:

turns out this was because of the randomness of configuring adaptive query execution:

SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), String.valueOf(RANDOM.nextBoolean()))
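As a general pattern, the flakiness comes from deriving a test configuration value from an unseeded random source, so each run exercises a different setting. A seeded source makes the choice reproducible; the sketch below is illustrative only, not Iceberg's test code.

```java
import java.util.Random;

// Sketch of why a randomized config (like ADAPTIVE_EXECUTION_ENABLED above)
// makes test runs non-reproducible, and how seeding restores determinism.
public class RandomConfigSketch {

  // Same seed -> same sequence -> the "random" config choice repeats across runs.
  static String adaptiveExecutionEnabled(long seed) {
    return String.valueOf(new Random(seed).nextBoolean());
  }
}
```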

@nastra (Contributor Author):

I'll port this test to Spark 3.3/3.4 once CI passes

Contributor:

I may have an explanation here.

    assertThat(actual)
        .as("Snapshot property " + property + " has unexpected value.")
        .isEqualTo(expectedValue);
    if (null == expectedValue) {
@nastra (Contributor Author):
These changes will show the content of the snapshot summary in case the assertion fails.
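The idea can be sketched in plain Java (a hypothetical helper, not the AssertJ form used in the PR): assert one snapshot-summary property, but include the whole summary map in the failure message so a mismatch is debuggable from the test output alone.

```java
import java.util.Map;

// Sketch: check a single summary property, but report the full summary on failure.
public class SummaryAssertSketch {

  static void assertSummaryProperty(
      Map<String, String> summary, String property, String expected) {
    String actual = summary.get(property);
    boolean matches = (expected == null) ? actual == null : expected.equals(actual);
    if (!matches) {
      throw new AssertionError(
          "Snapshot property " + property + " has unexpected value. Summary: " + summary);
    }
  }
}
```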

@nastra nastra force-pushed the use-delete-op-with-delete-files branch 4 times, most recently from a191e0e to 29f6dd8 Compare April 13, 2024 06:43
    String actual = snapshot.summary().get(property);
    Assert.assertEquals(
        "Snapshot property " + property + " has unexpected value.", expectedValue, actual);
    if (null == expectedValue) {
@nastra (Contributor Author):

These changes will show the content of the snapshot summary in case the assertion fails.


    Snapshot currentSnapshot = table.currentSnapshot();

    if (Boolean.parseBoolean(spark.conf().get(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key()))) {
@aokolnychyi (Contributor) commented Apr 15, 2024:

I recommend using the append methods in SparkRowLevelOperationsTestBase to make sure the 3 records are added as a single file (I mean the versions of append that accept JSON). Otherwise you have to check for AQE, and this logic is very fragile overall.

@nastra (Contributor Author):

good point, I've updated that to use append methods

    if (!addsDataFiles() && addsDeleteFiles()) {
      return DataOperations.DELETE;
    }

Contributor:

Optional: My personal preference would be to have the negation as the second argument (or even offer a new method in the parent with the negated meaning) and combine the two return statements into one. It is completely up to you, though.

if (addsDeleteFiles() && !addsDataFiles()) {
  return DataOperations.DELETE;
} else {
  return DataOperations.OVERWRITE;
}

@aokolnychyi (Contributor) left a review comment:

Looks good to me. I have been meaning to do this for quite some time. Thanks, @nastra!
We should make sure RewriteFiles stays as replace.



@nastra nastra force-pushed the use-delete-op-with-delete-files branch from 29f6dd8 to 1725387 Compare April 16, 2024 06:53
@nastra nastra force-pushed the use-delete-op-with-delete-files branch from 1725387 to 7e0237d Compare April 16, 2024 06:54
@nastra (Contributor Author) commented Apr 16, 2024:

thanks for the reviews @amogh-jahagirdar and @aokolnychyi



Development

Successfully merging this pull request may close these issues.

Delete using Merge-on-Read sets OVERWRITE while DELETE is expected

3 participants