[BUG]: Job linting for Spark Python tasks only works with notebooks #2213
Description
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
The WorkflowLinter currently lints tasks (and their dependencies) for Spark Python tasks (Task.spark_python_task='/path/to/python_script.py'). However, it incorrectly assumes that these files are notebooks and fails when they are plain Python scripts instead, raising ValueError: Not a Databricks notebook source!
The backtrace leading to this is:
../src/databricks/labs/ucx/source_code/jobs.py:356: in lint_job
problems = self._lint_job(job)
../src/databricks/labs/ucx/source_code/jobs.py:371: in _lint_job
for path, advice in self._lint_task(task, job):
../src/databricks/labs/ucx/source_code/jobs.py:400: in _lint_task
problems = container.build_dependency_graph(graph)
../src/databricks/labs/ucx/source_code/jobs.py:103: in build_dependency_graph
return list(self._register_task_dependencies(parent))
../src/databricks/labs/ucx/source_code/jobs.py:110: in _register_task_dependencies
yield from self._register_spark_python_task(graph)
../src/databricks/labs/ucx/source_code/jobs.py:200: in _register_spark_python_task
return graph.register_notebook(path)
../src/databricks/labs/ucx/source_code/graph.py:54: in register_notebook
maybe_graph = self.register_dependency(maybe.dependency)
../src/databricks/labs/ucx/source_code/graph.py:78: in register_dependency
container = dependency.load(self.path_lookup)
../src/databricks/labs/ucx/source_code/graph.py:194: in load
return self._loader.load_dependency(path_lookup, self)
../src/databricks/labs/ucx/source_code/notebooks/loaders.py:68: in load_dependency
return Notebook.parse(absolute_path, content, language)
../src/databricks/labs/ucx/source_code/notebooks/sources.py:52: in parse
cells = default_cell_language.extract_cells(source)
The start of the problem is here (jobs.py:200):
return graph.register_notebook(path)
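A possible direction for a fix, sketched below under assumptions: Databricks prepends a "# Databricks notebook source" header when exporting notebooks, so the registration step could dispatch on that header instead of unconditionally calling register_notebook. The helper names here (is_databricks_notebook, register_source) are hypothetical illustrations, not UCX API.

```python
# Hypothetical sketch: decide notebook vs. plain script by the export header
# that Databricks adds to notebook source files.
NOTEBOOK_HEADER = "# Databricks notebook source"


def is_databricks_notebook(source: str) -> bool:
    """Return True if the source text starts with the notebook export header."""
    return source.lstrip().startswith(NOTEBOOK_HEADER)


def register_source(source: str) -> str:
    # In _register_spark_python_task, a real fix would route to
    # graph.register_notebook only when the header is present, and to a
    # plain-file registration path otherwise.
    if is_databricks_notebook(source):
        return "notebook"
    return "file"
```

With this dispatch, a script containing only `import greenlet` would be registered as a plain file rather than failing notebook parsing.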
Expected Behavior
Jobs with Spark Python tasks that refer to ordinary Python files (and not notebooks) should also work, both on Workspace paths and DBFS.
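For the DBFS case, the linter would also need to resolve dbfs:-prefixed URIs before loading the file. A minimal sketch of such normalization, assuming the standard /dbfs FUSE mount (hypothetical helper, not part of UCX):

```python
def normalize_dbfs_path(path: str) -> str:
    """Map a 'dbfs:/...' URI onto its '/dbfs/...' FUSE-mount form.

    Illustrative only: shows the kind of normalization needed so that
    DBFS-hosted scripts resolve like workspace files; other paths pass
    through unchanged.
    """
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path
```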
Steps To Reproduce
The following integration tests demonstrate the failure and will pass when the problem is fixed:
from databricks.sdk.service import compute, jobs


def test_job_spark_python_task_workspace_linter_happy_path(
simple_ctx,
make_job,
make_random,
make_cluster,
make_workspace_directory,
):
pyspark_job_path = make_workspace_directory() / "spark_job.py"
pyspark_job_path.write_text("import greenlet\n")
new_cluster = make_cluster(single_node=True)
task = jobs.Task(
task_key=make_random(4),
spark_python_task=jobs.SparkPythonTask(python_file=pyspark_job_path.as_posix()),
existing_cluster_id=new_cluster.cluster_id,
libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="greenlet"))],
)
j = make_job(tasks=[task])
problems = simple_ctx.workflow_linter.lint_job(j.job_id)
assert not [problem for problem in problems if problem.message == "Could not locate import: greenlet"]
def test_job_spark_python_task_dbfs_linter_happy_path(
simple_ctx,
make_job,
make_random,
make_cluster,
make_dbfs_directory,
):
pyspark_job_path = make_dbfs_directory() / "spark_job.py"
pyspark_job_path.write_text("import greenlet\n")
new_cluster = make_cluster(single_node=True)
task = jobs.Task(
task_key=make_random(4),
spark_python_task=jobs.SparkPythonTask(python_file=f"dbfs:{pyspark_job_path.as_posix()}"),
existing_cluster_id=new_cluster.cluster_id,
libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="greenlet"))],
)
j = make_job(tasks=[task])
problems = simple_ctx.workflow_linter.lint_job(j.job_id)
    assert not [problem for problem in problems if problem.message == "Could not locate import: greenlet"]
Note that the existing integration tests for Spark Python tasks (e.g. test_job_spark_python_task_linter_happy_path) create notebooks rather than plain Python files.
Cloud
Azure
Operating System
macOS
Version
latest via Databricks CLI
Relevant log output
No response