RFC: categorize codebase files to prioritize license-related risk analysis

A codebase can contain thousands of files of widely varying types, from those that are clearly code (`.py`, `.cpp`, `.so`) to those that clearly are not (`.txt`, `.xlsx`), with many somewhere in between (`.jsp`, `.html`, `.erb`). Our goal is to automate the identification of which files matter more or less to the risk analysis process, so that the more important files can be analyzed first and those that need no analysis can be identified and omitted.

To accomplish this goal, we intend to add a feature that will (1) apply a set of rules (definitions, really) to a subset of each file's attributes and (2) add three fields to the ScanCode Toolkit scan output that identify each file's category and sub-category and rank the file by its importance for license-related risk analysis, i.e., its analysis priority.

**Relevant file attributes**

Each rule is defined in terms of the contents of one or more of these attributes:

- `extension`
- `name`
- `mime_type`
- `file_type`
- `programming_language`

**Categorization fields**

- `analysis_priority` ==> a scale from 1 to 3, with 1 being the most relevant for license-risk-analysis purposes
- `file_category` ==> e.g., `archive`, `binary`, `source`, `manifest`, `doc`, `media`, `script`
- `file_subcategory` ==> e.g., `c++`, `python`, `make`, `json`, `license`, `audio`, `data`
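
To make the proposal concrete, here is a minimal sketch of how rules defined over the attributes above could produce the three fields. Everything in it is an assumption for discussion: the `Rule` class, the `RULES` list, the `categorize` function, the category/sub-category pairings, and the priority values are hypothetical and are not an existing ScanCode Toolkit API. Only three of the five attributes are used, to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: each non-empty tuple must contain the file's
    (lowercased) value for that attribute for the rule to match."""
    file_category: str
    file_subcategory: str
    analysis_priority: int  # 1 (most relevant) to 3 (least relevant)
    extension: tuple = ()
    name: tuple = ()
    programming_language: tuple = ()

    def matches(self, resource: dict) -> bool:
        for attribute in ("extension", "name", "programming_language"):
            allowed = getattr(self, attribute)
            if not allowed:
                continue  # this rule places no constraint on the attribute
            value = (resource.get(attribute) or "").lower()
            if value not in allowed:
                return False
        return True


# Illustrative rules only; the pairings and priorities are assumptions.
RULES = [
    Rule("doc", "license", 1, name=("license", "copying", "license.txt")),
    Rule("source", "python", 1, extension=(".py",)),
    Rule("source", "c++", 1, extension=(".cpp", ".hpp")),
    Rule("media", "audio", 3, extension=(".mp3", ".wav")),
]


def categorize(resource: dict) -> dict:
    """Return the three proposed fields for one scanned file.

    The first matching rule wins; unmatched files get None values."""
    for rule in RULES:
        if rule.matches(resource):
            return {
                "analysis_priority": rule.analysis_priority,
                "file_category": rule.file_category,
                "file_subcategory": rule.file_subcategory,
            }
    return {
        "analysis_priority": None,
        "file_category": None,
        "file_subcategory": None,
    }
```

A first-match-wins list keeps rule precedence explicit, which is why the license rule is placed ahead of the extension-based rules in this sketch. Whether unmatched files should get `None` or a default priority is an open design question.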

**Output format**

We anticipate that this file categorization data will be available as three new fields in all of ScanCode Toolkit's output formats, such as `.json` and `.xlsx`.
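
As a rough illustration of what the JSON output might look like, the sketch below merges the three proposed fields into a simplified per-file entry. The helper `add_categorization` is hypothetical, the attribute values are illustrative, and the entry is an abbreviated stand-in for a real scan result rather than the exact ScanCode Toolkit schema.

```python
import json


def add_categorization(file_entry: dict, categorization: dict) -> dict:
    """Return a copy of a per-file entry with the proposed fields added."""
    enriched = dict(file_entry)  # leave the original entry untouched
    enriched.update(categorization)
    return enriched


# Simplified, illustrative file entry (values are assumptions).
entry = {
    "path": "src/main.py",
    "name": "main.py",
    "extension": ".py",
    "programming_language": "Python",
}

enriched = add_categorization(
    entry,
    {
        "analysis_priority": 1,
        "file_category": "source",
        "file_subcategory": "python",
    },
)

print(json.dumps({"files": [enriched]}, indent=2))
```

In tabular formats such as `.xlsx`, the same three fields would presumably appear as three additional columns per file row.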

RFC: categorize codebase files to prioritize license-related risk analysis #2945
