[FEATURE]: Assess consumption of interactive shared clusters and look for usage patterns incompatible with shared clusters. #810
Description
Is there an existing issue for this?
- I have searched the existing issues
Problem statement
UCX Assessment scans compute and tables but does not scan interactive usage of shared clusters.
Understanding the usage patterns of individual clusters (and users) will lend insight into the related upgrade tasks and help plan for interactive clusters and interactive users.
There are scenarios where data scientists are using techniques not directly compatible with UC or scenarios where developers are testing code for frameworks that are not directly compatible with UC. This assessment may catch and summarize these cases.
The audit logs can provide a rich input of what is run on interactive clusters and by whom. Scan the audit log (in system table form) to find code patterns to assess potential issues with migrating interactive clusters to Shared UC enabled clusters.
Adding this capability requires that the Unity Catalog system.access.audit table is available, so this assessor will have to run as an optional task.
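As a rough sketch of what "optional task" could mean here (plain Python, not the actual UCX task API; all names below are assumptions for illustration):

```python
# Illustrative sketch only: gate the assessment on the audit system table
# being available, and skip gracefully otherwise. The function names and
# return strings are hypothetical, not UCX's real interfaces.

REQUIRED_TABLE = "system.access.audit"

def should_run_audit_assessment(available_tables):
    """Return True only when the required system table is present."""
    return REQUIRED_TABLE in available_tables

def run_optional_assessment(available_tables):
    """Run the interactive-cluster audit scan, or skip with a reason."""
    if not should_run_audit_assessment(available_tables):
        return "skipped: system.access.audit not available"
    return "ran interactive-cluster audit assessment"
```
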
Related issues and enhancements:
#498
Proposed Solution
Add an Interactive Cluster audit log (system table) Assessment Task (or dashboard).
The query runs fast enough to back an interactive dashboard. Creating an inventory table would also allow capturing a point-in-time snapshot of the assessment.
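The snapshot idea could be sketched in plain Python as follows; in practice it would likely be a `CREATE TABLE ... AS SELECT` over the query, but this shows the shape (all names here are hypothetical):

```python
# Illustrative sketch only (names are assumptions, not the UCX API):
# stamp each assessment result row with a capture timestamp so successive
# runs can be appended to an inventory table and compared over time.
from datetime import datetime, timezone

def snapshot_rows(rows, snapshot_ts=None):
    """Return copies of the result rows tagged with a point-in-time marker."""
    ts = (snapshot_ts or datetime.now(timezone.utc)).isoformat()
    return [{**row, "snapshot_ts": ts} for row in rows]
```
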
I just happen to have a prototype of the SQL:
-- Scan notebook command history for potential paper cut issues
-- https://docs.databricks.com/en/compute/access-mode-limitations.html#compute-access-mode-limitations
-- This query 'overcounts' the paper cuts that might be encountered. There are complex DBR interactions with DBR 11, 12, 13 and 14
with paper_cut_patterns as (
select col1 as pattern, col2 as issue from values
('hive_metastore.', 'I00'),
('spark.catalog.', 'I01.1'),
('spark._jsparkSession.catalog','I01.2'),
('spark._jspark', 'I02.1'),
('spark._jvm','I02.2'),
('._jdf', 'I02.3'),
('._jcol','I02.4'),
('spark.read.format("jdbc")', 'I04'),
('dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()','I05.1'),
('dbutils.notebook.entry_point.getDbutils().notebook().getContext()','I05.2'),
('spark.udf.registerJavaFunction','I06'),
('boto3', 'I07.1'),
('s3fs', 'I07.2'),
('from graphframes', 'I08'),
('pyspark.ml.', 'I09'),
('applyInPandas', 'I10.1'),
('mapInPandas', 'I10.2'),
('dbutils.fs.', 'I11'),
('dbutils.credentials.', 'I12') -- credential passthrough
),
sparkcontext as (
select explode(split("_jvm, _jvm.org.apache.log4j, emptyRDD, range, init_batched_serializer, parallelize, pickleFile, textFile, wholeTextFiles, binaryFiles, binaryRecords, sequenceFile, newAPIHadoopFile, newAPIHadoopRDD, hadoopFile, hadoopRDD, union, runJob, setSystemProperty, uiWebUrl, stop, setJobGroup, setLocalProperty, getConf",', ')) as pattern,
'I03.1' as issue
UNION ALL
select explode(split("from pyspark.sql import SQLContext, import org.apache.spark.sql.SQLContext, spark.sparkContext", ', ')) as pattern, 'I03.2' as issue
),
streaming as (
select explode(split('.trigger(continuous, kafka.sasl.client.callback.handler.class, kafka.sasl.login.callback.handler.class, kafka.sasl.login.class, kafka.partition.assignment.strategy, kafka.ssl.truststore.location, kafka.ssl.keystore.location, cleanSource, sourceArchiveDir, applyInPandasWithState, .format("socket"), StreamingQueryListener',', ')) pattern,
'I30' as issue
),
paper_cuts as (
select pattern, issue from paper_cut_patterns
UNION ALL
-- only the bare SparkContext method names (I03.1) take the 'sc.' prefix;
-- the full-statement I03.2 patterns must match as-is
select case when issue = 'I03.1' then concat('sc.', pattern) else pattern end as pattern, issue from sparkcontext
UNION ALL
select pattern, issue from streaming
),
interactive_cluster_commands as (
select
a.request_params.notebookId as notebook_id,
a.request_params.clusterId as cluster_id,
a.user_identity.email,
a.request_params.commandLanguage,
a.request_params.commandText
from system.access.audit a
join system.compute.clusters as c
ON c.cluster_source != 'JOB'
AND (c.tags.ResourceClass is null OR c.tags.ResourceClass != 'SingleNode')
AND a.action_name = 'runCommand'
AND a.request_params.clusterId = c.cluster_id
WHERE
a.event_date >= DATE_SUB(CURRENT_DATE(), 90)
),
python_matcher as (
select
p.issue,
a.notebook_id,
a.cluster_id,
a.email,
a.commandLanguage,
a.commandText
from interactive_cluster_commands a
join paper_cuts p
ON a.commandLanguage = 'python'
AND contains(a.commandText, p.pattern)
),
scala_matcher as (
select
'I20 - scala/R' as issue,
a.notebook_id,
a.cluster_id,
a.email,
a.commandText
from interactive_cluster_commands a
where a.commandLanguage in ('scala','r')
),
unions as (
SELECT issue, notebook_id, cluster_id, email, commandText from python_matcher
UNION ALL
SELECT issue, notebook_id, cluster_id, email, commandText from scala_matcher
)
SELECT issue,
count(distinct notebook_id) distinct_notebooks,
count(distinct cluster_id) distinct_clusters,
count(distinct email) distinct_users
FROM unions group by 1
order by 1
;
Additional Context
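For reference, the per-command matching in the query boils down to a substring scan plus a blanket finding for Scala/R. A minimal Python equivalent (pattern list abbreviated; names are hypothetical, not UCX code) might look like:

```python
# Minimal sketch of the query's matching rules, for illustration only:
# substring match against paper-cut patterns for Python commands,
# and a blanket 'I20 - scala/R' finding for Scala and R commands.
PAPER_CUT_PATTERNS = {
    "hive_metastore.": "I00",
    "spark._jvm": "I02.2",
    "dbutils.fs.": "I11",
}  # abbreviated subset of the patterns in the query above

def match_issues(command_language, command_text):
    """Return the issue codes a single notebook command would trigger."""
    if command_language in ("scala", "r"):
        return ["I20 - scala/R"]
    if command_language != "python":
        return []
    return [issue for pattern, issue in PAPER_CUT_PATTERNS.items()
            if pattern in command_text]
```
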
Summary view: the Issues (assessment findings) would be linked to the assessment.md document for more details. A short assessment-finding description could be provided.
