Skip to content

refactor: Replace use of check-jsonschema with direct jsonschema calls#9726

Merged
edgarrmondragon merged 5 commits intomainfrom
refactor/check-jsonschema
Jan 28, 2026
Merged

refactor: Replace use of check-jsonschema with direct jsonschema calls#9726
edgarrmondragon merged 5 commits intomainfrom
refactor/check-jsonschema

Conversation

@edgarrmondragon
Copy link
Copy Markdown
Collaborator

@edgarrmondragon edgarrmondragon commented Jan 11, 2026

Description

Related Issues

Summary by Sourcery

Replace external check-jsonschema CLI usage with direct jsonschema-based manifest validation and extend tests to cover schema validation behavior and logging.

Bug Fixes:

  • Ensure manifest schema validation logs include nested JSON path information for deeply nested schema violations.

Enhancements:

  • Refactor manifest schema validation to use jsonschema.Draft7Validator directly instead of invoking the check-jsonschema subprocess.
  • Remove the unused check-jsonschema dependency from the project dependencies.

Build:

  • Drop check-jsonschema from runtime dependencies in pyproject.toml.

Tests:

  • Add tests verifying that valid manifests lint cleanly without schema warnings and that nested schema violations produce appropriate warning logs with nested paths.

Summary by Sourcery

Replace external manifest schema validation via the check-jsonschema CLI with direct jsonschema-based validation and improve how schema validation results are logged and tested.

Bug Fixes:

  • Ensure schema validation warnings include precise JSON paths for root-level and nested manifest schema violations.

Enhancements:

  • Introduce a cached jsonschema-based validator for the manifest schema instead of invoking the check-jsonschema subprocess.
  • Simplify manifest environment location discovery to use the loaded manifest schema data.
  • Improve schema validation warning formatting to aggregate and present all errors in a structured, readable message.

Build:

  • Remove the check-jsonschema dependency from the project runtime dependencies.

Tests:

  • Add tests to assert that valid manifests lint without schema warnings and that invalid manifests, including nested violations, produce appropriate warning logs with full JSON paths.

@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai bot commented Jan 11, 2026

Reviewer's Guide

Refactors manifest schema validation to use an in-process jsonschema validator instead of the external check-jsonschema CLI, improves formatting and path reporting of validation errors, extends compile CLI tests for valid and nested schema violations, and removes the check-jsonschema dependency from project dependencies and lockfile.

Sequence diagram for in-process manifest schema validation

sequenceDiagram
    participant Caller
    participant Manifest
    participant JsonschemaValidator
    participant Logger

    Caller->>Manifest: _validate_against_manifest_schema(instance_name, instance_path, instance_data)
    activate Manifest
    Manifest->>Manifest: _schema_validator
    Note right of Manifest: Cached validator instance
    Manifest->>JsonschemaValidator: iter_errors(instance_data)
    JsonschemaValidator-->>Manifest: errors
    alt errors_found
        Manifest->>Manifest: format error messages with JSON paths
        Manifest->>Logger: warning(formatted_errors)
    else no_errors
        Note right of Manifest: No logging occurs
    end
    deactivate Manifest
Loading

Class diagram for updated manifest schema validation

classDiagram
    class Manifest {
      - dict _manifest_schema
      - dict _env_locations
      + bool check_schema
      + bool redact_secrets
      + _schema_validator() jsonschema_protocols_Validator
      + _validate_against_manifest_schema(instance_name str, instance_path Path, instance_data dict)
    }

    class jsonschema_protocols_Validator {
      + iter_errors(instance_data dict) list
    }

    class jsonschema_validators {
      + validator_for(schema dict) jsonschema_protocols_Validator_class
      + check_schema(schema dict) void
    }

    class Logger {
      + warning(message str) void
    }

    Manifest ..> jsonschema_protocols_Validator : uses
    Manifest ..> jsonschema_validators : uses
    Manifest ..> Logger : logs_with
Loading

File-Level Changes

Change Details Files
Replace external check-jsonschema subprocess-based schema validation with a cached in-process jsonschema validator and improved error reporting.
  • Load the manifest JSON schema into an instance attribute rather than a local variable during Manifest initialization.
  • Introduce a cached_property that builds and returns a jsonschema validator class via jsonschema.validators.validator_for and validates the schema itself with check_schema.
  • Refactor _validate_against_manifest_schema to iterate over validator errors, sort them by absolute path, format each error with a JSON-pointer-like suffix (::$ or ::$.<nested.path>), and emit a single structured warning message when any errors are present.
  • Remove all usage of the check-jsonschema CLI and color handling utilities from manifest schema validation.
src/meltano/core/manifest/manifest.py
Tighten and extend compile CLI tests to cover both absence and presence of schema validation warnings, including nested-path violations.
  • Add a helper function that filters StructuredLogCapture events for schema validation warnings based on level and message substring.
  • Augment the existing test_warn_schema_violation to focus on root-level schema errors and assert the ::$ path marker behavior.
  • Add a test ensuring that a valid project compiled with --lint exits successfully, produces the expected manifest file, and logs no schema validation warnings.
  • Add a new test that injects a nested schema violation into the plugins.extractors configuration, runs compile --lint, and asserts that at least one warning is logged whose message includes a JSON path matching ::$.plugins.extractors..name.
tests/meltano/cli/test_compile.py
Remove the check-jsonschema runtime dependency from project configuration.
  • Drop check-jsonschema from the main dependency list in pyproject.toml.
  • Update the uv.lock file to reflect the removal of the check-jsonschema dependency and any associated resolved artifacts.
pyproject.toml
uv.lock

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@netlify
Copy link
Copy Markdown

netlify bot commented Jan 11, 2026

Deploy Preview for meltano canceled.

Name Link
🔨 Latest commit d903da0
🔍 Latest deploy log https://app.netlify.com/projects/meltano/deploys/697a59e105db8f0008665468

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.42%. Comparing base (16cc9bd) to head (d903da0).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9726   +/-   ##
=======================================
  Coverage   96.41%   96.42%           
=======================================
  Files         280      280           
  Lines       24130    24164   +34     
  Branches     1386     1390    +4     
=======================================
+ Hits        23264    23299   +35     
  Misses        690      690           
+ Partials      176      175    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@edgarrmondragon edgarrmondragon added this to the 4.1 milestone Jan 12, 2026
@edgarrmondragon edgarrmondragon self-assigned this Jan 12, 2026
@edgarrmondragon edgarrmondragon added kind/Tech Debt dependencies Pull requests that update a dependency file labels Jan 12, 2026
@edgarrmondragon edgarrmondragon force-pushed the refactor/check-jsonschema branch from c707207 to fc0b8b9 Compare January 19, 2026 17:34
@edgarrmondragon edgarrmondragon force-pushed the refactor/check-jsonschema branch from fc0b8b9 to 022a1af Compare January 20, 2026 00:23
@edgarrmondragon edgarrmondragon marked this pull request as ready for review January 20, 2026 00:27
@edgarrmondragon edgarrmondragon requested a review from a team as a code owner January 20, 2026 00:27
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • In _validate_against_manifest_schema, the schema file is re-opened and a new Draft7Validator is constructed on every call; consider loading the schema and instantiating the validator once (e.g., in __init__ or a cached property) to avoid repeated I/O and setup cost when validating multiple manifests.
  • When formatting validation errors, you currently only use error.absolute_path; if you want richer information for nested/compound schema constraints (e.g., oneOf, anyOf), you might also inspect error.context to provide more precise messages about which sub-constraint failed.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_validate_against_manifest_schema`, the schema file is re-opened and a new `Draft7Validator` is constructed on every call; consider loading the schema and instantiating the validator once (e.g., in `__init__` or a cached property) to avoid repeated I/O and setup cost when validating multiple manifests.
- When formatting validation errors, you currently only use `error.absolute_path`; if you want richer information for nested/compound schema constraints (e.g., `oneOf`, `anyOf`), you might also inspect `error.context` to provide more precise messages about which sub-constraint failed.

## Individual Comments

### Comment 1
<location> `tests/meltano/cli/test_compile.py:180-189` </location>
<code_context>
+    def test_warn_nested_schema_violation(
</code_context>

<issue_to_address>
**suggestion (testing):** Add a dedicated test for root-level schema errors to cover the `::$` (no path) branch

This test covers the nested-path case (`::$.<path>`), but there’s a separate code path when `error.absolute_path` is empty that yields a root-level path (`::$`). Please add a small test that triggers a schema violation at the top level (e.g., wrong type on the root key) and assert that a warning is emitted with the expected root-only path format (no nested segments). This will exercise the `else` branch and protect against regressions there.

Suggested implementation:

```python
    def test_warn_nested_schema_violation(
        self,
        cli_runner: CliRunner,
        monkeypatch: pytest.MonkeyPatch,
        tmp_path: Path,
        log: StructuredLogCapture,
    ) -> None:
        """Warn when a schema violation occurs on a nested path (::$.<path>)."""
        original_yaml_load = manifest.yaml.load

        def patch(*args, **kwargs):
            project_files = original_yaml_load(*args, **kwargs)
            # Introduce a nested schema violation – wrong type for a nested field.
            # This should yield a path including nested segments, e.g. `::$.plugins.extractors[0].name`.
            project_files["plugins"]["extractors"][0]["name"] = 123
            return project_files

        monkeypatch.setattr(manifest.yaml, "load", patch)

        # Run compile to trigger schema validation
        result = cli_runner.invoke(cli, ["compile"], cwd=tmp_path)
        assert result.exit_code == 0

        # Collect schema warnings from the structured log
        schema_warnings = [
            e
            for e in log.events
            if isinstance(e, dict)
            and isinstance(e.get("event"), str)
            and "schema" in e.get("event").lower()
        ]
        assert schema_warnings, "Expected at least one schema warning for nested violation"

        # Ensure the warning path includes nested segments (`::$.<path>`).
        message = " ".join(e.get("event", "") for e in schema_warnings)
        assert "::$." in message, "Expected nested path format `::$.<path>` in warning message"

    def test_warn_root_schema_violation(
        self,
        cli_runner: CliRunner,
        monkeypatch: pytest.MonkeyPatch,
        tmp_path: Path,
        log: StructuredLogCapture,
    ) -> None:
        """Warn when a schema violation occurs at the root level (::$)."""
        original_yaml_load = manifest.yaml.load

        def patch(*args, **kwargs):
            project_files = original_yaml_load(*args, **kwargs)
            # Introduce a root-level schema violation – wrong type for a top-level key.
            # This should exercise the branch where `error.absolute_path` is empty
            # and yield a root-only path (`::$`).
            project_files["version"] = 123
            return project_files

        monkeypatch.setattr(manifest.yaml, "load", patch)

        # Run compile to trigger schema validation
        result = cli_runner.invoke(cli, ["compile"], cwd=tmp_path)
        assert result.exit_code == 0

        # Collect schema warnings from the structured log
        schema_warnings = [
            e
            for e in log.events
            if isinstance(e, dict)
            and isinstance(e.get("event"), str)
            and "schema" in e.get("event").lower()
        ]
        assert schema_warnings, "Expected at least one schema warning for root violation"

        # Ensure the warning path is root-only (`::$`) with no nested segments appended.
        message = " ".join(e.get("event", "") for e in schema_warnings)
        assert "::$" in message, "Expected root path `::$` in warning message"
        assert "::$." not in message, "Root-level error should not include nested segments in the path"

```

To fully align this with your existing test suite you should:

1. Adjust the way `schema_warnings` are collected to match how other tests in this file filter for schema validation warnings (e.g., specific `event` name or logger).
2. Ensure `log.events` is the correct attribute for the `StructuredLogCapture` fixture; if the existing tests use a different attribute (e.g., `log.messages` or `log.records`), update both tests accordingly.
3. Confirm that modifying `project_files["plugins"]["extractors"][0]["name"]` and `project_files["version"]` actually violates the JSON schema in your project; if different keys are used in your manifest schema, change these assignments to target fields that are known to be type-validated.
4. Make sure the `cli_runner.invoke(...)` call matches how other tests invoke `meltano` in this file (correct `cli` object, arguments, and `cwd`/`obj` usage). If compile is invoked differently elsewhere, mirror that pattern in both tests.
5. If your warning message places the `::$` path in a structured field (e.g., `e["path"]`) instead of inside `event`, change the assertions to inspect that field rather than parsing the `event` string.
</issue_to_address>

### Comment 2
<location> `tests/meltano/cli/test_compile.py:203-212` </location>
<code_context>
+        with mock.patch.object(manifest.yaml, "load", side_effect=patch):
</code_context>

<issue_to_address>
**suggestion (testing):** Tighten the log assertions to avoid false positives from unrelated warnings

The test currently only checks that some warning contains both `"::$."` and `"plugins"`, which could match unrelated warnings.

To better target schema validation:
- First filter `warning_events` to those whose `event` includes `"Failed to validate"`, then assert on the nested path.
- Optionally also assert that the log includes the `instance_path` (e.g. `tmp_path` / manifest filename) to confirm the `{instance_path}{path}: {error.message}` format.

This will ensure you’re validating the schema error log structure rather than any warning content.
</issue_to_address>

### Comment 3
<location> `tests/meltano/cli/test_compile.py:201` </location>
<code_context>
+            project_files["plugins"]["extractors"].append({"name": 12345})
+            return project_files
+
+        monkeypatch.setenv("NO_COLOR", "1")
+
+        with mock.patch.object(manifest.yaml, "load", side_effect=patch):
</code_context>

<issue_to_address>
**question:** Consider whether `NO_COLOR` setup is still needed for this test

Since this path no longer shells out to `check-jsonschema` or uses `get_no_color_flag`, verify whether `NO_COLOR` still affects behavior here. If it doesn’t, consider dropping this env setup to avoid unnecessary test complexity. If it does, add a brief comment clarifying its role for future readers.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • Consider using jsonschema.validators.validator_for(self._manifest_schema) (and calling .check_schema(...)) instead of hard-coding Draft7Validator so the implementation automatically matches the $schema declared in the manifest schema file.
  • It may be helpful to sort iter_errors by error.path before logging so that the warning output and tests are deterministic and easier to compare or reason about over time.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider using `jsonschema.validators.validator_for(self._manifest_schema)` (and calling `.check_schema(...)`) instead of hard-coding `Draft7Validator` so the implementation automatically matches the `$schema` declared in the manifest schema file.
- It may be helpful to sort `iter_errors` by `error.path` before logging so that the warning output and tests are deterministic and easier to compare or reason about over time.

## Individual Comments

### Comment 1
<location> `tests/meltano/cli/test_compile.py:183-192` </location>
<code_context>
+    def test_warn_nested_schema_violation(
</code_context>

<issue_to_address>
**issue (testing):** Strengthen the nested-path assertion to check for the full JSON path string rather than just `"::$."` and `"plugins"`

Right now the test only checks that a warning contains both `"::$."` and `"plugins"`, which could still succeed if the JSON path format regresses (missing indices, wrong separators, truncated path, etc.). Since the expected path here is something like `"<instance_path>::$.plugins.extractors.0.name"`, please assert on the full path, or at least a more specific substring such as `"::$.plugins.extractors.0.name"`, so the test more accurately validates the nested JSON path bugfix.
</issue_to_address>

### Comment 2
<location> `tests/meltano/cli/test_compile.py:215-224` </location>
<code_context>
+                ("--environment=dev", "compile", "--lint", f"--directory={tmp_path}"),
+            )
+        assert result.exit_code == 0
+        # Filter to schema validation warnings specifically
+        schema_warnings = [
+            e
+            for e in log.events
+            if e.get("level") == "warning"
+            and "Failed to validate" in e.get("event", "")
+        ]
+        assert len(schema_warnings) >= 1, (
+            "Expected at least one schema validation warning"
+        )
+        # Check that the warning contains a nested path (with ::$.)
+        nested_path_found = any(
+            "::$." in event.get("event", "") and "plugins" in event.get("event", "")
+            for event in schema_warnings
+        )
+        assert nested_path_found, (
+            "Expected to find nested path in validation errors. "
+            f"Events: {[e.get('event')[:100] for e in schema_warnings]}"
</code_context>

<issue_to_address>
**suggestion:** Extract the repeated schema-warning filtering logic into a helper to improve readability and reuse

Both tests filter `log.events` the same way to find schema validation warnings. Please extract this into a small helper or fixture (e.g. `get_schema_warnings(log)`) so the intent is clearer and any future changes to log wording/structure are made in one place.

Suggested implementation:

```python
        # Filter to schema validation warnings specifically
        schema_warnings = get_schema_warnings(log)

```

To fully implement the refactor, you should also:

1. Define a shared helper at module scope (near the top of `tests/meltano/cli/test_compile.py`, alongside other helpers/fixtures):

```python
def get_schema_warnings(log):
    """Return schema validation warning log events."""
    return [
        e
        for e in log.events
        if e.get("level") == "warning"
        and "Failed to validate" in e.get("event", "")
    ]
```

2. Locate the other test in this file that performs the same filtering of `log.events` for schema validation warnings and replace its inline list comprehension with:

```python
schema_warnings = get_schema_warnings(log)
```

This will centralize the schema-warning filtering logic and improve readability and maintainability.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
@edgarrmondragon
Copy link
Copy Markdown
Collaborator Author

@sourcery-ai review

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • Since _validate_against_manifest_schema now depends on self._schema_validator and is no longer a @staticmethod, double-check all its call sites are using the instance (e.g., self._validate_against_manifest_schema(...)) rather than the class, otherwise this will fail at runtime.
  • The new logging filter in get_schema_warnings relies on the hard-coded substring "Failed to validate"; consider centralizing the log message prefix (e.g., a constant or helper) so future changes to the warning text don’t silently break these tests.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Since `_validate_against_manifest_schema` now depends on `self._schema_validator` and is no longer a `@staticmethod`, double-check all its call sites are using the instance (e.g., `self._validate_against_manifest_schema(...)`) rather than the class, otherwise this will fail at runtime.
- The new logging filter in `get_schema_warnings` relies on the hard-coded substring `"Failed to validate"`; consider centralizing the log message prefix (e.g., a constant or helper) so future changes to the warning text don’t silently break these tests.

## Individual Comments

### Comment 1
<location> `src/meltano/core/manifest/manifest.py:167-168` </location>
<code_context>
-                text=True,
-                stdout=subprocess.PIPE,
-                stderr=subprocess.STDOUT,
+        errors = sorted(
+            self._schema_validator.iter_errors(instance_data),
+            key=lambda e: list(e.absolute_path),
+        )
</code_context>

<issue_to_address>
**issue (bug_risk):** Sorting errors by `list(e.absolute_path)` can raise a `TypeError` when paths mix strings and integers.

Because `absolute_path` elements can be a mix of strings (object keys) and integers (array indices), sorting on `list(e.absolute_path)` risks `TypeError` in Python 3 when it tries to compare `str` and `int`. Consider normalizing the key, e.g.:

```python
errors = sorted(
    self._schema_validator.iter_errors(instance_data),
    key=lambda e: ".".join(str(p) for p in e.absolute_path),
)
```

or:

```python
errors = sorted(
    self._schema_validator.iter_errors(instance_data),
    key=lambda e: tuple(str(p) for p in e.absolute_path),
)
```

Both preserve deterministic ordering while avoiding cross-type comparisons.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
@edgarrmondragon edgarrmondragon added this pull request to the merge queue Jan 28, 2026
Merged via the queue into main with commit f485ce9 Jan 28, 2026
61 checks passed
@edgarrmondragon edgarrmondragon deleted the refactor/check-jsonschema branch January 28, 2026 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli dependencies Pull requests that update a dependency file kind/Tech Debt

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant