[BUG]: Job linting for Spark Python tasks only works with notebooks #2213

@asnare

Description

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

The WorkflowLinter currently lints tasks (and their dependencies) for Spark Python tasks (Task.spark_python_task='/path/to/python_script.py'). However, it incorrectly assumes that these files are notebooks and fails when they are plain Python scripts instead, raising ValueError: Not a Databricks notebook source!

The backtrace leading to this is:

../src/databricks/labs/ucx/source_code/jobs.py:356: in lint_job
    problems = self._lint_job(job)
../src/databricks/labs/ucx/source_code/jobs.py:371: in _lint_job
    for path, advice in self._lint_task(task, job):
../src/databricks/labs/ucx/source_code/jobs.py:400: in _lint_task
    problems = container.build_dependency_graph(graph)
../src/databricks/labs/ucx/source_code/jobs.py:103: in build_dependency_graph
    return list(self._register_task_dependencies(parent))
../src/databricks/labs/ucx/source_code/jobs.py:110: in _register_task_dependencies
    yield from self._register_spark_python_task(graph)
../src/databricks/labs/ucx/source_code/jobs.py:200: in _register_spark_python_task
    return graph.register_notebook(path)
../src/databricks/labs/ucx/source_code/graph.py:54: in register_notebook
    maybe_graph = self.register_dependency(maybe.dependency)
../src/databricks/labs/ucx/source_code/graph.py:78: in register_dependency
    container = dependency.load(self.path_lookup)
../src/databricks/labs/ucx/source_code/graph.py:194: in load
    return self._loader.load_dependency(path_lookup, self)
../src/databricks/labs/ucx/source_code/notebooks/loaders.py:68: in load_dependency
    return Notebook.parse(absolute_path, content, language)
../src/databricks/labs/ucx/source_code/notebooks/sources.py:52: in parse
    cells = default_cell_language.extract_cells(source)

The start of the problem is here:

return graph.register_notebook(path)
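One plausible direction for a fix is to sniff the source for the Databricks notebook magic header and only take the notebook path when the header is present. This is only a sketch: register_file() and register_spark_python_task() as written here are hypothetical names, not the actual UCX API.

```python
# Sketch of a possible fix. The magic header is what Databricks puts at the
# top of notebooks exported as source files; register_file() is a
# hypothetical counterpart to the existing register_notebook().
NOTEBOOK_MAGIC = "# Databricks notebook source"


def is_databricks_notebook(source: str) -> bool:
    """Return True if the source carries the notebook export header."""
    return source.lstrip().startswith(NOTEBOOK_MAGIC)


def register_spark_python_task(graph, path, source: str):
    # Notebooks keep the existing code path; plain scripts take a new one.
    if is_databricks_notebook(source):
        return graph.register_notebook(path)
    return graph.register_file(path)  # hypothetical
```

With this branching in place, the ValueError above would no longer be reachable for plain scripts, since Notebook.parse would only ever see sources that start with the magic header.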

Expected Behavior

Jobs with Spark Python tasks that refer to ordinary Python files (rather than notebooks) should also lint cleanly, for files on both Workspace paths and DBFS.
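The DBFS case additionally implies resolving the dbfs: scheme to a readable local path before loading the file. A minimal sketch, assuming the standard /dbfs FUSE mount convention on Databricks clusters; the helper name is made up and not part of UCX:

```python
def resolve_dbfs_path(path: str) -> str:
    """Map a dbfs:/... URI onto the /dbfs FUSE mount; leave other paths alone."""
    prefix = "dbfs:"
    if path.startswith(prefix):
        # "dbfs:/tmp/spark_job.py" -> "/dbfs/tmp/spark_job.py"
        return "/dbfs" + path[len(prefix):]
    return path
```

In practice UCX's path lookup may handle this differently; the point is only that both "dbfs:/..." URIs and plain Workspace paths need to resolve to loadable files.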

Steps To Reproduce

The following integration tests demonstrate the failure and will pass when the problem is fixed:

from databricks.sdk.service import compute, jobs


def test_job_spark_python_task_workspace_linter_happy_path(
    simple_ctx,
    make_job,
    make_random,
    make_cluster,
    make_workspace_directory,
):
    pyspark_job_path = make_workspace_directory() / "spark_job.py"
    pyspark_job_path.write_text("import greenlet\n")

    new_cluster = make_cluster(single_node=True)
    task = jobs.Task(
        task_key=make_random(4),
        spark_python_task=jobs.SparkPythonTask(python_file=pyspark_job_path.as_posix()),
        existing_cluster_id=new_cluster.cluster_id,
        libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="greenlet"))],
    )
    j = make_job(tasks=[task])

    problems = simple_ctx.workflow_linter.lint_job(j.job_id)
    assert not [problem for problem in problems if problem.message == "Could not locate import: greenlet"]


def test_job_spark_python_task_dbfs_linter_happy_path(
    simple_ctx,
    make_job,
    make_random,
    make_cluster,
    make_dbfs_directory,
):
    pyspark_job_path = make_dbfs_directory() / "spark_job.py"
    pyspark_job_path.write_text("import greenlet\n")

    new_cluster = make_cluster(single_node=True)
    task = jobs.Task(
        task_key=make_random(4),
        spark_python_task=jobs.SparkPythonTask(python_file=f"dbfs:{pyspark_job_path.as_posix()}"),
        existing_cluster_id=new_cluster.cluster_id,
        libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="greenlet"))],
    )
    j = make_job(tasks=[task])

    problems = simple_ctx.workflow_linter.lint_job(j.job_id)
    assert not [problem for problem in problems if problem.message == "Could not locate import: greenlet"]

Note that the existing integration tests for Spark Python tasks (e.g. test_job_spark_python_task_linter_happy_path) create notebooks rather than plain Python files, which is why they do not catch this.

Cloud

Azure

Operating System

macOS

Version

latest via Databricks CLI

Relevant log output

No response

Labels

migrate/code (Abstract Syntax Trees and other dark magic), migrate/jobs (Step 5 - Upgrading Jobs for External Tables)
