[core] Refactor CPD #4397

Merged
merged 97 commits into from
Aug 29, 2023

Conversation

oowekyala
Member

@oowekyala oowekyala commented Feb 16, 2023

Describe the PR

Upgrade the CPD codebase for PMD 7, removing redundant API and the transitional APIs introduced along the way, such as the class CpdCompat. We haven't planned a clean transition from PMD 6 to PMD 7 for the CPD programmatic API, so I think it's fine to break some things, given that it's unlikely many external repos implement custom CPD tokenizers. For this reason I don't think the release notes need to go into much detail; I'll just link this PR.

Detailed changelog

  • Remove the CPD Language interface and replace it with the PMD interface. ([core] Merge CPD and PMD language #3919)
    • PMD-specific methods are pushed down from Language into a PmdCapableLanguage interface
    • CPD-specific methods are similarly on CpdCapableLanguage.
  • CPD system properties to configure languages are turned into language properties
    • Some of them have been renamed to have a more general name that can apply to several languages.
      • ignore_usings -> cpdIgnoreImports
      • ignore_annotations -> cpdIgnoreMetadata
      • ignore_identifiers -> cpdAnonymizeIdentifiers
      • ignore_literals -> cpdAnonymizeLiterals
      • ignore_literal_sequences -> cpdIgnoreLiteralSequences
    • Some of them have been removed from pmd-core and pushed down to the specific modules they concern.
      • skipBlocks (C++)
    • C++ now only has the property cpdSkipBlocksPattern (there is no separate skipBlocks anymore). Its default is the pattern that was previously skipped by default. Setting it to a blank string disables block skipping.
    • The scala language module now uses the languageVersion language property (common to all language modules) instead of the custom system property net.sourceforge.pmd.scala.version.
  • Remove static state from CPD (the static EOF token, and the static ThreadLocal maps in TokenEntry). This global state is now part of the Tokens class.
    • Since there is no longer a single EOF token shared by all files, we can have one EOF token per file, with accurate text coordinates. That means methods like getBeginColumn or getBeginLine on TokenEntry always return a useful value, instead of sometimes returning -1. Renderers no longer need to handle that special case.
  • Create CpdAnalysis and remove CPD ([core] Provide a CpdAnalysis class as a programmatic entry point into CPD #4204)
    • CpdAnalysis uses a FileCollector, like PmdAnalysis. That means language detection and file collection are done exactly like with PmdAnalysis, based on file extensions. CPD used to tokenize files as soon as they were added, while CpdAnalysis tokenizes files in bulk in performAnalysis.
  • Remove SourceCode and replace it with TextDocument ([core] Text documents epic #3784)
    • SourceCode's main difference was that it only loaded the code behind a SoftReference, to allow the GC to reclaim unused memory. To keep that behavior, I introduce SourceManager, which maintains a map of TextFile to SoftReference<TextDocument>. This allows the same kind of lazy loading and reuse. I think it could come in handy for PMD global rules too, if we make it thread-safe ([core] Support global rules that report at the end of analysis #3920).
    • Mark doesn't have direct access to the source code slice anymore; the SourceManager has it instead. The SourceManager is part of CPDReport, so renderers can output text as before. We only slice text for the marks that we are actually going to render.
  • Remove other deprecated things like AbstractTokenizer
  • Reorganize files of CPD modules:
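
For anyone migrating configuration, the property renames listed in the changelog above can be written down as a simple lookup table. This is only an illustrative sketch: the class CpdPropertyRenames and its migrate method are hypothetical names and not part of PMD. Only the old and new property names come from this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a PMD class): old CPD option name -> new language property name. */
final class CpdPropertyRenames {

    // The pairs below are copied from the changelog; insertion order is preserved.
    private static final Map<String, String> RENAMES = new LinkedHashMap<>();

    static {
        RENAMES.put("ignore_usings", "cpdIgnoreImports");
        RENAMES.put("ignore_annotations", "cpdIgnoreMetadata");
        RENAMES.put("ignore_identifiers", "cpdAnonymizeIdentifiers");
        RENAMES.put("ignore_literals", "cpdAnonymizeLiterals");
        RENAMES.put("ignore_literal_sequences", "cpdIgnoreLiteralSequences");
    }

    private CpdPropertyRenames() {
        // static utility, no instances
    }

    /** Returns the renamed property, or the input unchanged if it is not in the table. */
    static String migrate(String oldName) {
        return RENAMES.getOrDefault(oldName, oldName);
    }

    public static void main(String[] args) {
        // Print the whole table, one rename per line.
        for (Map.Entry<String, String> e : RENAMES.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Names that are not in the table (for example cpdSkipBlocksPattern, which is handled in the C++ module) are returned unchanged.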
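
To illustrate the SoftReference idea behind the SourceManager item above, here is a minimal, self-contained sketch of a cache that loads a value lazily and lets the GC reclaim it under memory pressure. It is not the actual SourceManager: the class SoftDocumentCache, its loader function, and the plain String keys and values (standing in for TextFile and TextDocument) are assumptions made for the example. Like the real thing as described above, this sketch is not thread-safe.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch (not PMD's SourceManager): values are cached behind SoftReferences. */
final class SoftDocumentCache<K, V> {

    private final Map<K, SoftReference<V>> cache = new HashMap<>();
    private final Function<K, V> loader; // must not return null
    private int loadCount = 0;

    SoftDocumentCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if it was never loaded or was reclaimed by the GC. */
    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            loadCount++;
            cache.put(key, new SoftReference<>(value));
        }
        return value;
    }

    /** Number of times the loader actually ran. */
    int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        SoftDocumentCache<String, String> docs =
                new SoftDocumentCache<>(name -> "contents of " + name);
        // Keeping a strong reference to the first result prevents it from being reclaimed.
        String first = docs.get("A.java");
        String second = docs.get("A.java");
        System.out.println("same instance: " + (first == second));
        System.out.println("loads: " + docs.loadCount());
    }
}
```

The point of the design is that callers never hold the text themselves: they ask the cache each time, so text that is only needed for the marks being rendered does not have to stay in memory for the whole run.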

Related issues

Ready?

  • Add names to the versions of CPD language modules, because they would be broken by [core] Explicitly name all language versions #4387
  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

@oowekyala oowekyala changed the title [cpd] [cpd] Refactor CPD Feb 16, 2023
@oowekyala oowekyala changed the base branch from master to pmd/7.0.x February 16, 2023 19:45
@oowekyala oowekyala added the in:cpd Affects the copy-paste detector label Feb 16, 2023
Use a consistent implementation of getInstance().
Now all modules resolve against the LanguageRegistry.
- Introduce ExitAction
- Sort languages by name
when SoftReferences have been freed.
@adangel adangel mentioned this pull request Aug 27, 2023
4 tasks
@adangel adangel changed the title [cpd] Refactor CPD [core] Refactor CPD Aug 27, 2023
@adangel adangel added this to the 7.0.0 milestone Aug 29, 2023
@adangel adangel merged commit 056b339 into pmd:master Aug 29, 2023
adangel added a commit to adangel/pmd that referenced this pull request Aug 31, 2023
@oowekyala oowekyala deleted the clem.pmd7-refactor-cpd branch May 13, 2024 21:52