
[FEATURE]: Assess consumption of interactive shared clusters and look for usage patterns incompatible with shared clusters. #810

@dmoore247

Description

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

UCX Assessment scans compute and tables but does not scan interactive usage of shared clusters.
Understanding the usage patterns of individual clusters (and users) will lend insight into the related upgrade tasks and help plan for interactive clusters and interactive users.

There are scenarios where data scientists use techniques that are not directly compatible with UC, or where developers test code for frameworks that are not directly compatible with UC. This assessment could catch and summarize such cases.

The audit logs provide a rich record of what is run on interactive clusters and by whom. Scan the audit log (in system table form) for code patterns to assess potential issues with migrating interactive clusters to shared UC-enabled clusters.
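The core idea of the scan can be sketched in plain Python: classify a notebook command's text into "paper cut" issue codes by substring match, mirroring the SQL `contains()` logic below. The patterns and issue codes here are a small subset of the prototype's full list; the function name is illustrative only.

```python
# A few patterns from the prototype's paper-cut list, mapped to issue codes.
PAPER_CUT_PATTERNS = {
    "hive_metastore.": "I00",
    "spark.catalog.": "I01.1",
    "spark._jvm": "I02.2",
    "dbutils.fs.": "I11",
}

def classify_command(command_text: str) -> set[str]:
    """Return the issue codes whose pattern occurs in the command text."""
    return {issue for pattern, issue in PAPER_CUT_PATTERNS.items()
            if pattern in command_text}
```

As in the SQL prototype, this overcounts: a pattern appearing in a comment or string literal still matches.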

Adding this capability requires that the Unity Catalog system.access.audit table be available, so this assessor will have to run as an optional task.
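One way the optional task could guard itself is to probe for the table before running. A minimal sketch, assuming a hypothetical `run_query` hook that executes a SQL string and raises when the table is missing (the hook name and error handling are illustrative, not UCX's actual API):

```python
def audit_log_available(run_query) -> bool:
    """Probe for system.access.audit.

    run_query: any callable that executes a SQL string and raises if the
    table does not exist or is not accessible (hypothetical hook).
    """
    try:
        run_query("SELECT 1 FROM system.access.audit LIMIT 1")
        return True
    except Exception:
        return False
```

The optional assessor would skip (rather than fail) when this returns False.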

Related issues and enhancements:
#498

Proposed Solution

Add an Interactive Cluster audit log (system table) Assessment Task (or dashboard).
The query runs fast enough to serve as an interactive dashboard; creating an inventory table would allow capturing a point-in-time snapshot of the assessment.

I just happen to have a prototype of the SQL:

```sql
-- Scan notebook command history for potential paper cut issues
-- https://docs.databricks.com/en/compute/access-mode-limitations.html#compute-access-mode-limitations
-- This query 'overcounts' the paper cuts that might be encountered. There are complex DBR interactions with DBR 11, 12, 13 and 14.
with paper_cut_patterns as (
    select col1 as pattern, col2 as issue from values
        ('hive_metastore.', 'I00'),
        ('spark.catalog.', 'I01.1'),
        ('spark._jsparkSession.catalog', 'I01.2'),
        ('spark._jspark', 'I02.1'),
        ('spark._jvm', 'I02.2'),
        ('._jdf', 'I02.3'),
        ('._jcol', 'I02.4'),
        ('spark.read.format("jdbc")', 'I04'),
        ('dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()', 'I05.1'),
        ('dbutils.notebook.entry_point.getDbutils().notebook().getContext()', 'I05.2'),
        ('spark.udf.registerJavaFunction', 'I06'),
        ('boto3', 'I07.1'),
        ('s3fs', 'I07.2'),
        ('from graphframes', 'I08'),
        ('pyspark.ml.', 'I09'),
        ('applyInPandas', 'I10.1'),
        ('mapInPandas', 'I10.2'),
        ('dbutils.fs.', 'I11'),
        ('dbutils.credentials.', 'I12') -- credential passthrough
),
sparkcontext as (
    select explode(split("_jvm, _jvm.org.apache.log4j, emptyRDD, range, init_batched_serializer, parallelize, pickleFile, textFile, wholeTextFiles, binaryFiles, binaryRecords, sequenceFile, newAPIHadoopFile, newAPIHadoopRDD, hadoopFile, hadoopRDD, union, runJob, setSystemProperty, uiWebUrl, stop, setJobGroup, setLocalProperty, getConf", ', ')) as pattern,
        'I03.1' as issue
    UNION ALL
    select explode(split("from pyspark.sql import SQLContext, import org.apache.spark.sql.SQLContext, spark.sparkContext", ', ')) as pattern, 'I03.2' as issue
),
streaming as (
    select explode(split('.trigger(continuous, kafka.sasl.client.callback.handler.class, kafka.sasl.login.callback.handler.class, kafka.sasl.login.class, kafka.partition.assignment.strategy, kafka.ssl.truststore.location, kafka.ssl.keystore.location, cleanSource, sourceArchiveDir, applyInPandasWithState, .format("socket"), StreamingQueryListener', ', ')) as pattern,
        'I30' as issue
),
paper_cuts as (
    select pattern, issue from paper_cut_patterns
    UNION ALL
    select concat('sc.', pattern) as pattern, issue from sparkcontext
    UNION ALL
    select pattern, issue from streaming
),
interactive_cluster_commands as (
    select
        a.request_params.notebookId as notebook_id,
        a.request_params.clusterId as cluster_id,
        a.user_identity.email,
        a.request_params.commandLanguage,
        a.request_params.commandText
    from system.access.audit a
    join system.compute.clusters as c
        ON a.request_params.clusterId = c.cluster_id
    WHERE a.action_name = 'runCommand'
        AND c.cluster_source != 'JOB'
        AND (c.tags.ResourceClass is null OR c.tags.ResourceClass != 'SingleNode')
        AND a.event_date >= DATE_SUB(CURRENT_DATE(), 90)
),
python_matcher as (
    select
        p.issue,
        a.notebook_id,
        a.cluster_id,
        a.email,
        a.commandLanguage,
        a.commandText
    from interactive_cluster_commands a
    join paper_cuts p
        ON contains(a.commandText, p.pattern)
    where a.commandLanguage = 'python'
),
scala_matcher as (
    select
        'I20 - scala/R' as issue,
        a.notebook_id,
        a.cluster_id,
        a.email,
        a.commandText
    from interactive_cluster_commands a
    where a.commandLanguage in ('scala', 'r')
),
unions as (
    SELECT issue, notebook_id, cluster_id, email, commandText from python_matcher
    UNION ALL
    SELECT issue, notebook_id, cluster_id, email, commandText from scala_matcher
)
SELECT issue,
    count(distinct notebook_id) as distinct_notebooks,
    count(distinct cluster_id) as distinct_clusters,
    count(distinct email) as distinct_users
FROM unions
GROUP BY 1
ORDER BY 1;
```
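The query's final rollup (distinct notebooks, clusters, and users per issue) is the part that would feed an inventory table or dashboard tile. A pure-Python sketch of that same aggregation, over illustrative matcher-output tuples:

```python
from collections import defaultdict

def summarize(rows):
    """rows: (issue, notebook_id, cluster_id, email) tuples, like the rows
    produced by the matcher CTEs. Returns {issue: (distinct_notebooks,
    distinct_clusters, distinct_users)} - the query's final GROUP BY."""
    notebooks = defaultdict(set)
    clusters = defaultdict(set)
    users = defaultdict(set)
    for issue, notebook_id, cluster_id, email in rows:
        notebooks[issue].add(notebook_id)
        clusters[issue].add(cluster_id)
        users[issue].add(email)
    return {issue: (len(notebooks[issue]), len(clusters[issue]), len(users[issue]))
            for issue in sorted(notebooks)}
```

Counting distinct entities rather than raw command hits keeps the summary stable against users who rerun the same cell many times.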

Additional Context

In the summary view, the issues (assessment findings) would be linked to the assessment.md document for more details; a short description of each finding could be provided.


Metadata

Labels

enhancement (New feature or request), migrate/clusters, go/uc/upgrade (Upgrade Interactive Clusters), step/assessment (go/uc/upgrade - Assessment Step)
