refactor: Replace use of check-jsonschema with direct jsonschema calls#9726
refactor: Replace use of check-jsonschema with direct jsonschema calls#9726edgarrmondragon merged 5 commits intomainfrom
check-jsonschema with direct jsonschema calls#9726Conversation
Reviewer's GuideRefactors manifest schema validation to use an in-process jsonschema validator instead of the external check-jsonschema CLI, improves formatting and path reporting of validation errors, extends compile CLI tests for valid and nested schema violations, and removes the check-jsonschema dependency from project dependencies and lockfile. Sequence diagram for in-process manifest schema validationsequenceDiagram
participant Caller
participant Manifest
participant JsonschemaValidator
participant Logger
Caller->>Manifest: _validate_against_manifest_schema(instance_name, instance_path, instance_data)
activate Manifest
Manifest->>Manifest: _schema_validator
Note right of Manifest: Cached validator instance
Manifest->>JsonschemaValidator: iter_errors(instance_data)
JsonschemaValidator-->>Manifest: errors
alt errors_found
Manifest->>Manifest: format error messages with JSON paths
Manifest->>Logger: warning(formatted_errors)
else no_errors
Note right of Manifest: No logging occurs
end
deactivate Manifest
Class diagram for updated manifest schema validationclassDiagram
class Manifest {
- dict _manifest_schema
- dict _env_locations
+ bool check_schema
+ bool redact_secrets
+ _schema_validator() jsonschema_protocols_Validator
+ _validate_against_manifest_schema(instance_name str, instance_path Path, instance_data dict)
}
class jsonschema_protocols_Validator {
+ iter_errors(instance_data dict) list
}
class jsonschema_validators {
+ validator_for(schema dict) jsonschema_protocols_Validator_class
+ check_schema(schema dict) void
}
class Logger {
+ warning(message str) void
}
Manifest ..> jsonschema_protocols_Validator : uses
Manifest ..> jsonschema_validators : uses
Manifest ..> Logger : logs_with
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
✅ Deploy Preview for meltano canceled.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #9726 +/- ##
=======================================
Coverage 96.41% 96.42%
=======================================
Files 280 280
Lines 24130 24164 +34
Branches 1386 1390 +4
=======================================
+ Hits 23264 23299 +35
Misses 690 690
+ Partials 176 175 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c707207 to
fc0b8b9
Compare
…calls Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
fc0b8b9 to
022a1af
Compare
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In
_validate_against_manifest_schema, the schema file is re-opened and a newDraft7Validatoris constructed on every call; consider loading the schema and instantiating the validator once (e.g., in__init__or a cached property) to avoid repeated I/O and setup cost when validating multiple manifests. - When formatting validation errors, you currently only use
error.absolute_path; if you want richer information for nested/compound schema constraints (e.g.,oneOf,anyOf), you might also inspecterror.contextto provide more precise messages about which sub-constraint failed.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_validate_against_manifest_schema`, the schema file is re-opened and a new `Draft7Validator` is constructed on every call; consider loading the schema and instantiating the validator once (e.g., in `__init__` or a cached property) to avoid repeated I/O and setup cost when validating multiple manifests.
- When formatting validation errors, you currently only use `error.absolute_path`; if you want richer information for nested/compound schema constraints (e.g., `oneOf`, `anyOf`), you might also inspect `error.context` to provide more precise messages about which sub-constraint failed.
## Individual Comments
### Comment 1
<location> `tests/meltano/cli/test_compile.py:180-189` </location>
<code_context>
+ def test_warn_nested_schema_violation(
</code_context>
<issue_to_address>
**suggestion (testing):** Add a dedicated test for root-level schema errors to cover the `::$` (no path) branch
This test covers the nested-path case (`::$.<path>`), but there’s a separate code path when `error.absolute_path` is empty that yields a root-level path (`::$`). Please add a small test that triggers a schema violation at the top level (e.g., wrong type on the root key) and assert that a warning is emitted with the expected root-only path format (no nested segments). This will exercise the `else` branch and protect against regressions there.
Suggested implementation:
```python
def test_warn_nested_schema_violation(
self,
cli_runner: CliRunner,
monkeypatch: pytest.MonkeyPatch,
tmp_path: Path,
log: StructuredLogCapture,
) -> None:
"""Warn when a schema violation occurs on a nested path (::$.<path>)."""
original_yaml_load = manifest.yaml.load
def patch(*args, **kwargs):
project_files = original_yaml_load(*args, **kwargs)
# Introduce a nested schema violation – wrong type for a nested field.
# This should yield a path including nested segments, e.g. `::$.plugins.extractors[0].name`.
project_files["plugins"]["extractors"][0]["name"] = 123
return project_files
monkeypatch.setattr(manifest.yaml, "load", patch)
# Run compile to trigger schema validation
result = cli_runner.invoke(cli, ["compile"], cwd=tmp_path)
assert result.exit_code == 0
# Collect schema warnings from the structured log
schema_warnings = [
e
for e in log.events
if isinstance(e, dict)
and isinstance(e.get("event"), str)
and "schema" in e.get("event").lower()
]
assert schema_warnings, "Expected at least one schema warning for nested violation"
# Ensure the warning path includes nested segments (`::$.<path>`).
message = " ".join(e.get("event", "") for e in schema_warnings)
assert "::$." in message, "Expected nested path format `::$.<path>` in warning message"
def test_warn_root_schema_violation(
self,
cli_runner: CliRunner,
monkeypatch: pytest.MonkeyPatch,
tmp_path: Path,
log: StructuredLogCapture,
) -> None:
"""Warn when a schema violation occurs at the root level (::$)."""
original_yaml_load = manifest.yaml.load
def patch(*args, **kwargs):
project_files = original_yaml_load(*args, **kwargs)
# Introduce a root-level schema violation – wrong type for a top-level key.
# This should exercise the branch where `error.absolute_path` is empty
# and yield a root-only path (`::$`).
project_files["version"] = 123
return project_files
monkeypatch.setattr(manifest.yaml, "load", patch)
# Run compile to trigger schema validation
result = cli_runner.invoke(cli, ["compile"], cwd=tmp_path)
assert result.exit_code == 0
# Collect schema warnings from the structured log
schema_warnings = [
e
for e in log.events
if isinstance(e, dict)
and isinstance(e.get("event"), str)
and "schema" in e.get("event").lower()
]
assert schema_warnings, "Expected at least one schema warning for root violation"
# Ensure the warning path is root-only (`::$`) with no nested segments appended.
message = " ".join(e.get("event", "") for e in schema_warnings)
assert "::$" in message, "Expected root path `::$` in warning message"
assert "::$." not in message, "Root-level error should not include nested segments in the path"
```
To fully align this with your existing test suite you should:
1. Adjust the way `schema_warnings` are collected to match how other tests in this file filter for schema validation warnings (e.g., specific `event` name or logger).
2. Ensure `log.events` is the correct attribute for the `StructuredLogCapture` fixture; if the existing tests use a different attribute (e.g., `log.messages` or `log.records`), update both tests accordingly.
3. Confirm that modifying `project_files["plugins"]["extractors"][0]["name"]` and `project_files["version"]` actually violates the JSON schema in your project; if different keys are used in your manifest schema, change these assignments to target fields that are known to be type-validated.
4. Make sure the `cli_runner.invoke(...)` call matches how other tests invoke `meltano` in this file (correct `cli` object, arguments, and `cwd`/`obj` usage). If compile is invoked differently elsewhere, mirror that pattern in both tests.
5. If your warning message places the `::$` path in a structured field (e.g., `e["path"]`) instead of inside `event`, change the assertions to inspect that field rather than parsing the `event` string.
</issue_to_address>
### Comment 2
<location> `tests/meltano/cli/test_compile.py:203-212` </location>
<code_context>
+ with mock.patch.object(manifest.yaml, "load", side_effect=patch):
</code_context>
<issue_to_address>
**suggestion (testing):** Tighten the log assertions to avoid false positives from unrelated warnings
The test currently only checks that some warning contains both `"::$."` and `"plugins"`, which could match unrelated warnings.
To better target schema validation:
- First filter `warning_events` to those whose `event` includes `"Failed to validate"`, then assert on the nested path.
- Optionally also assert that the log includes the `instance_path` (e.g. `tmp_path` / manifest filename) to confirm the `{instance_path}{path}: {error.message}` format.
This will ensure you’re validating the schema error log structure rather than any warning content.
</issue_to_address>
### Comment 3
<location> `tests/meltano/cli/test_compile.py:201` </location>
<code_context>
+ project_files["plugins"]["extractors"].append({"name": 12345})
+ return project_files
+
+ monkeypatch.setenv("NO_COLOR", "1")
+
+ with mock.patch.object(manifest.yaml, "load", side_effect=patch):
</code_context>
<issue_to_address>
**question:** Consider whether `NO_COLOR` setup is still needed for this test
Since this path no longer shells out to `check-jsonschema` or uses `get_no_color_flag`, verify whether `NO_COLOR` still affects behavior here. If it doesn’t, consider dropping this env setup to avoid unnecessary test complexity. If it does, add a brief comment clarifying its role for future readers.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- Consider using
jsonschema.validators.validator_for(self._manifest_schema)(and calling.check_schema(...)) instead of hard-codingDraft7Validatorso the implementation automatically matches the$schemadeclared in the manifest schema file. - It may be helpful to sort
iter_errorsbyerror.pathbefore logging so that the warning output and tests are deterministic and easier to compare or reason about over time.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider using `jsonschema.validators.validator_for(self._manifest_schema)` (and calling `.check_schema(...)`) instead of hard-coding `Draft7Validator` so the implementation automatically matches the `$schema` declared in the manifest schema file.
- It may be helpful to sort `iter_errors` by `error.path` before logging so that the warning output and tests are deterministic and easier to compare or reason about over time.
## Individual Comments
### Comment 1
<location> `tests/meltano/cli/test_compile.py:183-192` </location>
<code_context>
+ def test_warn_nested_schema_violation(
</code_context>
<issue_to_address>
**issue (testing):** Strengthen the nested-path assertion to check for the full JSON path string rather than just `"::$."` and `"plugins"`
Right now the test only checks that a warning contains both `"::$."` and `"plugins"`, which could still succeed if the JSON path format regresses (missing indices, wrong separators, truncated path, etc.). Since the expected path here is something like `"<instance_path>::$.plugins.extractors.0.name"`, please assert on the full path, or at least a more specific substring such as `"::$.plugins.extractors.0.name"`, so the test more accurately validates the nested JSON path bugfix.
</issue_to_address>
### Comment 2
<location> `tests/meltano/cli/test_compile.py:215-224` </location>
<code_context>
+ ("--environment=dev", "compile", "--lint", f"--directory={tmp_path}"),
+ )
+ assert result.exit_code == 0
+ # Filter to schema validation warnings specifically
+ schema_warnings = [
+ e
+ for e in log.events
+ if e.get("level") == "warning"
+ and "Failed to validate" in e.get("event", "")
+ ]
+ assert len(schema_warnings) >= 1, (
+ "Expected at least one schema validation warning"
+ )
+ # Check that the warning contains a nested path (with ::$.)
+ nested_path_found = any(
+ "::$." in event.get("event", "") and "plugins" in event.get("event", "")
+ for event in schema_warnings
+ )
+ assert nested_path_found, (
+ "Expected to find nested path in validation errors. "
+ f"Events: {[e.get('event')[:100] for e in schema_warnings]}"
</code_context>
<issue_to_address>
**suggestion:** Extract the repeated schema-warning filtering logic into a helper to improve readability and reuse
Both tests filter `log.events` the same way to find schema validation warnings. Please extract this into a small helper or fixture (e.g. `get_schema_warnings(log)`) so the intent is clearer and any future changes to log wording/structure are made in one place.
Suggested implementation:
```python
# Filter to schema validation warnings specifically
schema_warnings = get_schema_warnings(log)
```
To fully implement the refactor, you should also:
1. Define a shared helper at module scope (near the top of `tests/meltano/cli/test_compile.py`, alongside other helpers/fixtures):
```python
def get_schema_warnings(log):
"""Return schema validation warning log events."""
return [
e
for e in log.events
if e.get("level") == "warning"
and "Failed to validate" in e.get("event", "")
]
```
2. Locate the other test in this file that performs the same filtering of `log.events` for schema validation warnings and replace its inline list comprehension with:
```python
schema_warnings = get_schema_warnings(log)
```
This will centralize the schema-warning filtering logic and improve readability and maintainability.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- Since
_validate_against_manifest_schemanow depends onself._schema_validatorand is no longer a@staticmethod, double-check all its call sites are using the instance (e.g.,self._validate_against_manifest_schema(...)) rather than the class, otherwise this will fail at runtime. - The new logging filter in
get_schema_warningsrelies on the hard-coded substring"Failed to validate"; consider centralizing the log message prefix (e.g., a constant or helper) so future changes to the warning text don’t silently break these tests.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Since `_validate_against_manifest_schema` now depends on `self._schema_validator` and is no longer a `@staticmethod`, double-check all its call sites are using the instance (e.g., `self._validate_against_manifest_schema(...)`) rather than the class, otherwise this will fail at runtime.
- The new logging filter in `get_schema_warnings` relies on the hard-coded substring `"Failed to validate"`; consider centralizing the log message prefix (e.g., a constant or helper) so future changes to the warning text don’t silently break these tests.
## Individual Comments
### Comment 1
<location> `src/meltano/core/manifest/manifest.py:167-168` </location>
<code_context>
- text=True,
- stdout=subprocess.PIPE,
- stderr=subprocess.STDOUT,
+ errors = sorted(
+ self._schema_validator.iter_errors(instance_data),
+ key=lambda e: list(e.absolute_path),
+ )
</code_context>
<issue_to_address>
**issue (bug_risk):** Sorting errors by `list(e.absolute_path)` can raise a `TypeError` when paths mix strings and integers.
Because `absolute_path` elements can be a mix of strings (object keys) and integers (array indices), sorting on `list(e.absolute_path)` risks `TypeError` in Python 3 when it tries to compare `str` and `int`. Consider normalizing the key, e.g.:
```python
errors = sorted(
self._schema_validator.iter_errors(instance_data),
key=lambda e: ".".join(str(p) for p in e.absolute_path),
)
```
or:
```python
errors = sorted(
self._schema_validator.iter_errors(instance_data),
key=lambda e: tuple(str(p) for p in e.absolute_path),
)
```
Both preserve deterministic ordering while avoiding cross-type comparisons.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
Signed-off-by: Edgar Ramírez Mondragón <[email protected]>
Description
Related Issues
Summary by Sourcery
Replace external check-jsonschema CLI usage with direct jsonschema-based manifest validation and extend tests to cover schema validation behavior and logging.
Bug Fixes:
Enhancements:
Build:
Tests:
Summary by Sourcery
Replace external manifest schema validation via the check-jsonschema CLI with direct jsonschema-based validation and improve how schema validation results are logged and tested.
Bug Fixes:
Enhancements:
Build:
Tests: