[GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to ColumnarWriteFilesExec#6403

Merged
baibaichen merged 3 commits into apache:main from baibaichen:feature/native_write
Jul 19, 2024

@baibaichen baibaichen commented Jul 11, 2024

What changes were proposed in this pull request?

(Fixes: #6067)

This PR refactors the Velox-side code: it renames VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec and moves it to gluten-core, so that the ClickHouse backend can use the same SparkPlan in a follow-up PR.

With Spark 3.4 support, Velox gained a whole-stage native write pipeline, which is better than the old implementation. The ClickHouse backend also adopts this implementation.

Major change 1

The only major difference between Velox and ClickHouse is how native metrics are parsed. To handle this, I introduce a new trait called BackendWrite, which has only one member for now. Once the native write pipeline has completed, the BackendWrite instance is obtained through BackendsApiManager.getSparkPlanExecApiInstance.createBackendWrite. Please see VeloxBackendWrite for details.

trait BackendWrite {
  def collectNativeWriteFilesMetrics(batch: ColumnarBatch): Option[WriteTaskResult]
}
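
A minimal sketch of how a backend might implement this trait. The types ColumnarBatch, MetricsRow and WriteTaskResult below are simplified stand-ins invented for illustration (the real ones come from Spark), and ToyBackendWrite is hypothetical; it is not VeloxBackendWrite.

```scala
// Stand-in types so the sketch compiles without Spark on the classpath.
// They are assumptions for illustration, not the real Spark classes.
final case class MetricsRow(fileName: String, numRows: Long)
final case class ColumnarBatch(rows: Seq[MetricsRow])
final case class WriteTaskResult(fileName: String, numWrittenRows: Long)

trait BackendWrite {
  def collectNativeWriteFilesMetrics(batch: ColumnarBatch): Option[WriteTaskResult]
}

// Hypothetical backend: treats the batch returned by the native writer as a
// list of metrics rows and folds them into a single task result.
object ToyBackendWrite extends BackendWrite {
  override def collectNativeWriteFilesMetrics(batch: ColumnarBatch): Option[WriteTaskResult] =
    // An empty batch means nothing was written, so there is no result.
    batch.rows.headOption.map { first =>
      WriteTaskResult(first.fileName, batch.rows.map(_.numRows).sum)
    }
}
```

Each backend would supply its own parsing logic behind the same method, which is what lets the shared SparkPlan stay backend-agnostic.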

Minor change 2

The other, minor difference is that the ClickHouse backend doesn't generate file names. The file name is computed per task with HadoopMapReduceCommitProtocol::getFilename and then injected into the backend. This is OK because Velox doesn't support maxRecordsPerFile (see #4329) and the ClickHouse backend follows the same rule, which means one task produces only one file and no further injections are needed.
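
The one-file-per-task injection can be sketched as follows. Everything here is an assumption for illustration: ToyNativeWriter is a hypothetical stand-in for the native side, the two-argument injectWriteFilesTempPath signature is guessed from the commit list, and fileNameForTask uses a made-up naming scheme rather than the format Spark actually produces.

```scala
// Hypothetical stand-in for the native writer: it only remembers what was injected.
final class ToyNativeWriter {
  private var target: Option[(String, String)] = None

  // Method name taken from the commit list; this signature is an assumption.
  def injectWriteFilesTempPath(path: String, fileName: String): Unit =
    target = Some((path, fileName))

  // Full path of the single file this task will write, if one was injected.
  def outputFile: Option[String] = target.map { case (p, f) => s"$p/$f" }
}

// Illustrative naming only: zero-padded task id, then job id, then extension.
def fileNameForTask(taskId: Int, jobId: String, ext: String): String =
  f"part-$taskId%05d-$jobId$ext"

// One task computes its file name once and injects it once.
def demo(): Option[String] = {
  val writer = new ToyNativeWriter
  writer.injectWriteFilesTempPath("/tmp/_temporary/0", fileNameForTask(3, "job42", ".parquet"))
  writer.outputFile
}
```

Because a task never rolls over to a second file, a single injection before the write starts is enough.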

Improvement

I also pass the file format to the backend.

How was this patch tested?

Using existing UTs.

@github-actions

#6067

@github-actions

Run Gluten Clickhouse CI

@baibaichen baibaichen force-pushed the feature/native_write branch from 429bdb9 to 4b24d7f Compare July 11, 2024 03:15
@baibaichen baibaichen force-pushed the feature/native_write branch from 4b24d7f to ea41482 Compare July 11, 2024 07:57
@baibaichen baibaichen force-pushed the feature/native_write branch from 37ff1bf to 5ba29ac Compare July 11, 2024 12:02
@baibaichen baibaichen force-pushed the feature/native_write branch from 5ba29ac to f03e47e Compare July 15, 2024 10:04
@baibaichen baibaichen force-pushed the feature/native_write branch from f03e47e to 4837d46 Compare July 17, 2024 07:36
@baibaichen baibaichen force-pushed the feature/native_write branch from 4837d46 to 04ca20e Compare July 17, 2024 16:10
@baibaichen baibaichen force-pushed the feature/native_write branch from 04ca20e to 181d613 Compare July 18, 2024 02:29
@baibaichen baibaichen force-pushed the feature/native_write branch from 181d613 to 14cf511 Compare July 18, 2024 07:19
@baibaichen baibaichen changed the title from "[GLUTEN-6067][CH] [Part 3] [WIP]" to "[GLUTEN-6067][CH] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec" Jul 18, 2024
@baibaichen baibaichen changed the title from "[GLUTEN-6067][CH] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec" to "[GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec" Jul 18, 2024
@baibaichen baibaichen force-pushed the feature/native_write branch from 14cf511 to c475114 Compare July 18, 2024 13:56

   * plan, and support Spark file commit protocol.
   */
- class VeloxColumnarWriteFilesRDD(
+ class GlutenColumnarWriteFilesRDD(

After moving VeloxColumnarWriteFilesExec from backend-velox to gluten-core, can we update the class names by renaming GlutenColumnarWriteFilesExec to ColumnarWriteFilesExec and GlutenColumnarWriteFilesRDD to ColumnarWriteFilesRDD?

…nd move it to gluten-core

1. Return GlutenColumnarWriteFilesExec at SparkPlanExecApi
2. Move SparkWriteFilesCommitProtocol to gluten-core
3. SparkWriteFilesCommitProtocol supports getFilename from internal committer
4. Remove supportTransformWriteFiles from BackendSettingsApi
5. injectWriteFilesTempPath with fileName
…tenColumnarWriteFilesRDD to ColumnarWriteFilesRDD
@baibaichen baibaichen force-pushed the feature/native_write branch from c475114 to cd7cdd0 Compare July 19, 2024 04:12
@baibaichen baibaichen changed the title from "[GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec" to "[GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to ColumnarWriteFilesExec" Jul 19, 2024
@baibaichen baibaichen merged commit 206e4be into apache:main Jul 19, 2024
@baibaichen baibaichen deleted the feature/native_write branch July 19, 2024 08:46

Development

Successfully merging this pull request may close these issues.

[CH] Support CH backend with Spark 3.5.x

3 participants