netplan: Handle command exceptions#334
Conversation
|
I'm wary of this approach. While some errors should indeed be handled gracefully, especially those caused by user error, some exceptions could also be raised due to programmer error and are thus symptomatic of bugs. We have machinery in Ubuntu to detect uncaught exceptions and report them to Launchpad, this would bypass it completely :/ |
b0435ab to
a53fec7
Compare
|
Thank you, @schopin-pro! I introduced a few more specific exceptions to avoid the problem you mentioned. |
schopin-pro
left a comment
There was a problem hiding this comment.
LGTM.
I just have one suggestion that I don't consider blocking here, plus some idle musing.
Feel free to disregard the suggestion and merge :)
| dhcp4: nothanks''') | ||
|
|
||
| # The idea was to capture stderr here but for some reason | ||
| # my attempts to mock sys.stderr didn't work with pytest |
There was a problem hiding this comment.
note: for the record, many test harnesses mess with std{err,out} to be able to capture output for failing tests while suppressing it for the rest, in order to keep a clean run output. It's not surprising that it conflicts with standard mocking techniques.
There seem to be a way to access the captured output with pytest though: https://docs.pytest.org/en/7.1.x/how-to/capture-stdout-stderr.html#accessing-captured-output-from-a-test-function
I'm happy with the alternative you came up with, though :)
There was a problem hiding this comment.
There's also the capture_stderr() context manager, provided by tests/parser/base.py, which could/should be used here to keep a similar tooling between the different test cases.
slyon
left a comment
There was a problem hiding this comment.
Overall, I like the idea of this PR. But I feel it should be taken a bit further and merged with the suggestions made in #318 (especially parsing out the filename/row/col in addition to the exception type).
This would already be a good start. Longer term (maybe as a follow-up PR), we should actually differentiate the error types in libnetplan.so itself, to avoid parsing the error message strings in libnetplan.py.
| dhcp4: nothanks''') | ||
|
|
||
| # The idea was to capture stderr here but for some reason | ||
| # my attempts to mock sys.stderr didn't work with pytest |
There was a problem hiding this comment.
There's also the capture_stderr() context manager, provided by tests/parser/base.py, which could/should be used here to keep a similar tooling between the different test cases.
|
Thank you for all your comments. I agree we really need to incorporate the ideas from #318. I demoted the PR to draft and tried to introduce the ideas from #318. It still needs some work but I think this is good enough for an RFC. I introduced a bunch of enums with error codes so we can stop using the glib ones. It will be checked in the python code so we can raise specific exceptions for each error. One thing I still want to do is to auto-generate the python enums based on the C ones so we don't need to keep them in sync manually. Let me know what you think :) |
|
IMHO auto-generating the error bindings from the C code would be over-engineering at this point. It's an implementation detail that can be revisited later on anyway ;) |
6191895 to
d62db99
Compare
|
I cleaned up the history mess a little bit and rebased the branch. |
There was a problem hiding this comment.
Thank you, I like the direction this is going! And especially the merge with PR #318
Here's a summary of my inline comments:
- I agree with Simon that auto-sync between the C and Python enums is probably overkill. Those shouldn't change too frequently and as long as we have a solid testing story around it, we should be able to recognize and adopt for changes manually.
- I asked for some more generic (fallback) error classes inline
- I asked for some error cases to change their domain/code
- There might be a conflict in the NETPLAN_ERROR_FILE domain, when returning
errnoas code. - I asked for some cleanup of parent-/child error-classes, using inheritance and default/positinal parameters, for better clarity and ease of use.
d62db99 to
38c354c
Compare
|
I tried to address most of the comments. I also rebased this branch on top of main. |
blackboxsw
left a comment
There was a problem hiding this comment.
Nice improvement in specialized exceptions. Thanks for the ping on the review. Minor questions and or comments. Ignore anything that doesn't make sense for your goals. I think LibNetplanValidationException looks like it'll surface exactly what cloud-init needs if we are calling the python api directly.
| try: | ||
| self.run_command() | ||
| except LibNetplanValidationException as e: | ||
| message = f'{e.filename}:{e.line}:{e.column}: {e}' |
There was a problem hiding this comment.
What do you think about encoding a single character label in front of {e.line}:{e.column} to aid in discovery and processing for those that just see the exception string in logs.
| message = f'{e.filename}:{e.line}:{e.column}: {e}' | |
| message = f'{e.filename}:l{e.line}:c{e.column}: {e}' |
netplan/libnetplan.py
Outdated
|
|
||
| class LibNetplanException(Exception): | ||
| SCHEMA_ERROR_MSG_REGEX = ( | ||
| r'(?P<file_path>.*):(?P<error_line>\d+):(?P<error_col>\d+): (?P<message>.*)' |
There was a problem hiding this comment.
Again wondering about making a clearer encoded format for line and column positions to ensure it's easier to read on glance in logs, and less likely to mis-match with other errors which may have sprinklings of :\d+:\d+ in them in the future
| r'(?P<file_path>.*):(?P<error_line>\d+):(?P<error_col>\d+): (?P<message>.*)' | |
| r'(?P<file_path>.*):l(?P<error_line>\d+):c(?P<error_col>\d+): (?P<message>.*)' |
netplan/libnetplan.py
Outdated
| elif NETPLAN_ERROR_DOMAINS.is_parser_error(domain): | ||
| return LibNetplanParserException(message, domain=domain, error=error) | ||
|
|
||
| elif NETPLAN_ERROR_DOMAINS.is_file_error(domain) or NETPLAN_ERROR_DOMAINS.is_file_errno(domain): |
There was a problem hiding this comment.
Are there expected cases where we'd treat handling of these two "domains" as distinct?
If not, we coiuld just have
def is_file_error(cls, domain):
return domain in (cls.NETPLAN_FILE_ERROR.value, cls.NETPLAN_FILE_ERRNO.value)
There was a problem hiding this comment.
I completely refactored this part of the code. Hopefully it's better now.
netplan/libnetplan.py
Outdated
| def _get_exception_from_error(domain, error, message): | ||
|
|
||
| if NETPLAN_ERROR_DOMAINS.is_validation_error(domain): | ||
| if NETPLAN_VALIDATION_ERRORS.is_config_validation_error(error): |
There was a problem hiding this comment.
Do we need this check is_config_validation_error?
It seems we are NETPLAN_ERROR_CONFIG_GENERIC we'd get True on is_validation_error, fallthrough to the LibNetplanException at the bottom of the function.
We could just avoid fallingthrough and return the LibNetplanException for NETPLAN_ERROR_CONFIG_GENERIC up here. I guess it really doesn't matter as I think we raise the same LibNetplanException in either case.
if NETPLAN_ERROR_DOMAINS.is_validation_error(domain):
schema_error = re.match(SCHEMA_ERROR_MSG_REGEX, message)
if not schema_error:
return LibNetplanException(message, domain=domain, error=error) # pragma: nocover
file_path = schema_error["file_path"]
error_line = schema_error["error_line"]
error_col = schema_error["error_col"]
error_message = schema_error["message"]
# some error messages might have line breaks, let's ignore them
msg = error_message.split('\n')
msg = msg[0].strip()
return LibNetplanValidationException(msg, domain, error, file_path, error_line, error_col)
There was a problem hiding this comment.
I refactored all this code. Now ParserExceptions will have the file name, line and column numbers. As config validations don't know the YAML context, only the YAML file from where the config was loaded, ValidationExceptions will only have the file name.
I also replaced all this if-elif-else with a lookup table. It looks cleaner now.
aa11cc4 to
a95a3ea
Compare
a95a3ea to
6b78d38
Compare
slyon
left a comment
There was a problem hiding this comment.
Thank you for the great work!
I worked on the FO114 spec to get it in line with this PR and did another review pass with some more inline comments.
6b78d38 to
157ad58
Compare
slyon
left a comment
There was a problem hiding this comment.
LGTM!
I left two smaller comments in the inline discussion for lbnetplan.py and openvswitch.c
blackboxsw
left a comment
There was a problem hiding this comment.
Thanks @daniloegea for the updates. Just a couple of questions inline about what conditions we expect to lead to NETPLAN_PARSER_ERROR vs NETPLAN_VALIDATION_ERROR so we know how to handle those conditions better.
Also, not for this PR, but for discussion:
Have there been thoughts about whether it's useful to represent schema validation failures with a supplement xpath or jsonpath string to represent where in the merged network config a schema validation error originated from? A jsonpath or xpath may give us the ability with the full merged config (maybe NETPLAN_VALIDATION_ERROR even when no file path is present) to represent the offending object key or path within the network config that is causing problems. It also may aid in any UX or UI that attempts to represent full merged network config with annotated errors anchored at the specific keys or values within a given network config dictionary when filename, line and column values don't strictly make sense.
netplan/libnetplan.py
Outdated
| file_path = schema_error["file_path"] | ||
| error_line = schema_error["error_line"] | ||
| error_col = schema_error["error_col"] |
There was a problem hiding this comment.
Is there a reason we want these locals here instead of just assigning the instance attrs and drop the assignment we perform below the msg processing?
| file_path = schema_error["file_path"] | |
| error_line = schema_error["error_line"] | |
| error_col = schema_error["error_col"] | |
| self.filename = schema_error["file_path"] | |
| self.line = schema_error["error_line"] | |
| self.column = schema_error["error_col"] |
| # | ||
| # filename.yaml:4:14: Error in network definition: invalid boolean value 'falsea' | ||
| # | ||
| schema_error = re.match(self.SCHEMA_PARSER_ERROR_MSG_REGEX, message) |
There was a problem hiding this comment.
Do you want to raise a ValueError or RuntimeError with something more descriptive if message doesn't meet regex otherwise you'll get ugly tracebacks like:n
TypeError: 'NoneType' object is not subscriptable that may be hard to debug/understand at runtime
There was a problem hiding this comment.
Yes, that's a good point, I think I was doing that at some point but removed during the refactoring...
|
|
||
| def __init__(self, message=None, domain=None, error=None): | ||
| super().__init__(message, domain, error) | ||
| schema_error = re.match(self.SCHEMA_VALIDATION_ERROR_MSG_REGEX, message) |
There was a problem hiding this comment.
Do we want to raise ValueError or RuntimeErrors here with an informative message if message doesn't match SCHEMA_VALIDATION_ERROR_MSG_REGEX. This prevents the obscure traceback TypeError: 'NoneType' object is not subscriptable at runtime
| self.assertEqual(yaml.safe_load(output), None) | ||
|
|
||
| def test_validation_error_exception(self): | ||
| parser = libnetplan.Parser() |
There was a problem hiding this comment.
Could you please explain in this docstring why this test is supposed to fail? It's valid yaml.
There was a problem hiding this comment.
Yeah we probably should better document our tests... 😅
In this case it will raise ValidationException because "set-name" requires the use of "match".
| class _GError(ctypes.Structure): | ||
| class NetplanValidationException(NetplanException): | ||
| ''' | ||
| Netplan Validation errors are expected to contain the YAML file name |
There was a problem hiding this comment.
Can this docstr explain in what cases we expect this class to be used (and a docstr on NetplanParserException to better differentiate what symptom each exception indicates).
| * Errors for domain NETPLAN_VALIDATION_ERROR | ||
| * | ||
| * VALIDATION_ERRORS are expected to contain only the YAML file name | ||
| * where the error was found. |
There was a problem hiding this comment.
I don't understand this error path. In what scenarios do we expect to only have a filename available during error handling and not specific guidance about the lines, columns or config xpath object that is representing a problem?
There was a problem hiding this comment.
I will improve the docstring.
So, libnetplan walks through its internal representation of the network configuration to validate it. At this point we don't have the YAML context anymore, only the filename.
When we parse the files, we are walking through the YAML nodes so we have more information.
157ad58 to
8578f8f
Compare
|
Thanks @slyon and @blackboxsw for your comments. I believe I addressed most of them. I think building the jsonpath for ParseExceptions wouldn't be hard. Maybe we could consider doing that in the near future. It sounds useful. |
8578f8f to
e05f400
Compare
Define Netplan specific errors so we don't need to use generic glib error codes.
Now we can raise specific exceptions based on the error domain and code passed from libnetplan. The exceptions will contain details about the error, such as filename, line and column numbers, the error message, error code and error domain.
It's still a glib GError in libnetplan but we call it NetplanError.
e05f400 to
a78b001
Compare
|
Thank you all for the reviews, I'll go ahead and merge it then. |
libnetplan doesn't differentiate each type of error very well. It impacts mainly its frontend where we rely on error message interpretation to tell the user what happened.
As we are planning to offer the Python bindings as a module that will be consumed by others, such as cloud-init, it's important to improve the API error handling.
This RFC incorporates the ideas from #318 and defines specific libnetplan errors and exception for each type with some details.
Originally, this PR was intended to improve error handling to fix issues like the ones mentioned below (which is still addressed).
Old description (still valid):
When netplan get or set fails for some reason, due to bad inputs or file permissions for example, netplan will crash. This handles these situations gracefully.
Examples:
Same examples with this change:
Description
Checklist
make checksuccessfully.make check-coverage).