
[FEATURE]: Assess consumption of interactive shared clusters and look for usage patterns incompatible with shared clusters. #810

@dmoore247

Description

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

UCX Assessment scans compute and tables but does not scan interactive usage of shared clusters.
Understanding the usage patterns of individual clusters (and users) will lend insight into the related upgrade tasks and help plan for interactive clusters and interactive users.

There are scenarios where data scientists use techniques that are not directly compatible with UC, or where developers test code for frameworks that are not directly compatible with UC. This assessment could catch and summarize such cases.

The audit logs provide a rich record of what is run on interactive clusters and by whom. Scan the audit log (in system table form) for code patterns to assess potential issues with migrating interactive clusters to shared UC-enabled clusters.
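The core idea of the scan can be sketched in plain Python: classify a notebook command's text into "paper cut" issue codes by substring match, mirroring the SQL `contains()` logic below. The patterns and issue codes here are a small subset of the prototype's full list; the function name is illustrative only.

```python
# A few patterns from the prototype's paper-cut list, mapped to issue codes.
PAPER_CUT_PATTERNS = {
    "hive_metastore.": "I00",
    "spark.catalog.": "I01.1",
    "spark._jvm": "I02.2",
    "dbutils.fs.": "I11",
}

def classify_command(command_text: str) -> set[str]:
    """Return the issue codes whose pattern occurs in the command text."""
    return {issue for pattern, issue in PAPER_CUT_PATTERNS.items()
            if pattern in command_text}
```

As in the SQL prototype, this overcounts: a pattern appearing in a comment or string literal still matches.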

Adding this capability requires that the Unity Catalog system.access.audit table be available, so this assessor will have to run as an optional task.
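One way the optional task could guard itself is to probe for the table before running. A minimal sketch, assuming a hypothetical `run_query` hook that executes a SQL string and raises when the table is missing (the hook name and error handling are illustrative, not UCX's actual API):

```python
def audit_log_available(run_query) -> bool:
    """Probe for system.access.audit.

    run_query: any callable that executes a SQL string and raises if the
    table does not exist or is not accessible (hypothetical hook).
    """
    try:
        run_query("SELECT 1 FROM system.access.audit LIMIT 1")
        return True
    except Exception:
        return False
```

The optional assessor would skip (rather than fail) when this returns False.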

Related issues and enhancements:
#498

Proposed Solution

Add an Interactive Cluster audit log (system table) Assessment Task (or dashboard).
The query runs fast enough to serve as an interactive dashboard; creating an inventory table would allow capturing a point-in-time snapshot of the assessment.

I just happen to have a prototype of the SQL:

```sql
-- Scan notebook command history for potential paper cut issues
-- https://docs.databricks.com/en/compute/access-mode-limitations.html#compute-access-mode-limitations
-- This query 'overcounts' the paper cuts that might be encountered. There are complex DBR interactions with DBR 11, 12, 13 and 14.
with paper_cut_patterns as (
    select col1 as pattern, col2 as issue from values
        ('hive_metastore.', 'I00'),
        ('spark.catalog.', 'I01.1'),
        ('spark._jsparkSession.catalog', 'I01.2'),
        ('spark._jspark', 'I02.1'),
        ('spark._jvm', 'I02.2'),
        ('._jdf', 'I02.3'),
        ('._jcol', 'I02.4'),
        ('spark.read.format("jdbc")', 'I04'),
        ('dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()', 'I05.1'),
        ('dbutils.notebook.entry_point.getDbutils().notebook().getContext()', 'I05.2'),
        ('spark.udf.registerJavaFunction', 'I06'),
        ('boto3', 'I07.1'),
        ('s3fs', 'I07.2'),
        ('from graphframes', 'I08'),
        ('pyspark.ml.', 'I09'),
        ('applyInPandas', 'I10.1'),
        ('mapInPandas', 'I10.2'),
        ('dbutils.fs.', 'I11'),
        ('dbutils.credentials.', 'I12') -- credential passthrough
),
sparkcontext as (
    select explode(split("_jvm, _jvm.org.apache.log4j, emptyRDD, range, init_batched_serializer, parallelize, pickleFile, textFile, wholeTextFiles, binaryFiles, binaryRecords, sequenceFile, newAPIHadoopFile, newAPIHadoopRDD, hadoopFile, hadoopRDD, union, runJob, setSystemProperty, uiWebUrl, stop, setJobGroup, setLocalProperty, getConf", ', ')) as pattern,
        'I03.1' as issue
    UNION ALL
    select explode(split("from pyspark.sql import SQLContext, import org.apache.spark.sql.SQLContext, spark.sparkContext", ', ')) as pattern, 'I03.2' as issue
),
streaming as (
    select explode(split('.trigger(continuous, kafka.sasl.client.callback.handler.class, kafka.sasl.login.callback.handler.class, kafka.sasl.login.class, kafka.partition.assignment.strategy, kafka.ssl.truststore.location, kafka.ssl.keystore.location, cleanSource, sourceArchiveDir, applyInPandasWithState, .format("socket"), StreamingQueryListener', ', ')) as pattern,
        'I30' as issue
),
paper_cuts as (
    select pattern, issue from paper_cut_patterns
    UNION ALL
    select concat('sc.', pattern) as pattern, issue from sparkcontext
    UNION ALL
    select pattern, issue from streaming
),
interactive_cluster_commands as (
    select
        a.request_params.notebookId as notebook_id,
        a.request_params.clusterId as cluster_id,
        a.user_identity.email,
        a.request_params.commandLanguage,
        a.request_params.commandText
    from system.access.audit a
    join system.compute.clusters as c
        ON a.request_params.clusterId = c.cluster_id
    WHERE a.action_name = 'runCommand'
        AND c.cluster_source != 'JOB'
        AND (c.tags.ResourceClass is null OR c.tags.ResourceClass != 'SingleNode')
        AND a.event_date >= DATE_SUB(CURRENT_DATE(), 90)
),
python_matcher as (
    select
        p.issue,
        a.notebook_id,
        a.cluster_id,
        a.email,
        a.commandLanguage,
        a.commandText
    from interactive_cluster_commands a
    join paper_cuts p
        ON contains(a.commandText, p.pattern)
    where a.commandLanguage = 'python'
),
scala_matcher as (
    select
        'I20 - scala/R' as issue,
        a.notebook_id,
        a.cluster_id,
        a.email,
        a.commandText
    from interactive_cluster_commands a
    where a.commandLanguage in ('scala', 'r')
),
unions as (
    SELECT issue, notebook_id, cluster_id, email, commandText from python_matcher
    UNION ALL
    SELECT issue, notebook_id, cluster_id, email, commandText from scala_matcher
)
SELECT issue,
    count(distinct notebook_id) as distinct_notebooks,
    count(distinct cluster_id) as distinct_clusters,
    count(distinct email) as distinct_users
FROM unions
GROUP BY 1
ORDER BY 1;
```
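The query's final rollup (distinct notebooks, clusters, and users per issue) is the part that would feed an inventory table or dashboard tile. A pure-Python sketch of that same aggregation, over illustrative matcher-output tuples:

```python
from collections import defaultdict

def summarize(rows):
    """rows: (issue, notebook_id, cluster_id, email) tuples, like the rows
    produced by the matcher CTEs. Returns {issue: (distinct_notebooks,
    distinct_clusters, distinct_users)} - the query's final GROUP BY."""
    notebooks = defaultdict(set)
    clusters = defaultdict(set)
    users = defaultdict(set)
    for issue, notebook_id, cluster_id, email in rows:
        notebooks[issue].add(notebook_id)
        clusters[issue].add(cluster_id)
        users[issue].add(email)
    return {issue: (len(notebooks[issue]), len(clusters[issue]), len(users[issue]))
            for issue in sorted(notebooks)}
```

Counting distinct entities rather than raw command hits keeps the summary stable against users who rerun the same cell many times.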

Additional Context

In the summary view, the issues (assessment findings) would be linked to the assessment.md document for more details; a short description of each finding could be provided.


Metadata

Labels

enhancement (New feature or request), migrate/clusters, go/uc/upgrade (Upgrade Interactive Clusters), step/assessment (go/uc/upgrade - Assessment Step)
