Skip to content

netplan: Handle command exceptions#334

Merged
daniloegea merged 5 commits intocanonical:mainfrom
daniloegea:error_handling
May 30, 2023
Merged

netplan: Handle command exceptions#334
daniloegea merged 5 commits intocanonical:mainfrom
daniloegea:error_handling

Conversation

@daniloegea
Copy link
Contributor

@daniloegea daniloegea commented Mar 9, 2023

libnetplan doesn't differentiate each type of error very well. It impacts mainly its frontend where we rely on error message interpretation to tell the user what happened.
As we are planning to offer the Python bindings as a module that will be consumed by others, such as cloud-init, it's important to improve the API error handling.

This RFC incorporates the ideas from #318 and defines specific libnetplan errors and exception for each type with some details.

Originally, this PR was intended to improve error handling to fix issues like the ones mentioned below (which is still addressed).


Old description (still valid):

When netplan get or set fails for some reason, due to bad inputs or file permissions for example, netplan will crash. This handles these situations gracefully.

Examples:

$ netplan get
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/get.py", line 43, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/get.py", line 61, in command_get
    parser.load_yaml_hierarchy(rootdir=self.root_dir)
  File "/usr/share/netplan/netplan/libnetplan.py", line 122, in load_yaml_hierarchy
    _checked_lib_call(lib.netplan_parser_load_yaml_hierarchy, self._ptr, rootdir.encode('utf-8'))
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: Cannot open /etc/netplan/90-test.yaml: Permission denied
$ netplan set bridges.eth0.dhcp4=false
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 50, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 81, in command_set
    parser.load_yaml_hierarchy(self.root_dir)
  File "/usr/share/netplan/netplan/libnetplan.py", line 122, in load_yaml_hierarchy
    _checked_lib_call(lib.netplan_parser_load_yaml_hierarchy, self._ptr, rootdir.encode('utf-8'))
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: Cannot open /etc/netplan/90-test.yaml: Permission denied
$ netplan set --root-dir /tmp/a bridges.eth0.dhcp4=falsea
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 50, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 85, in command_set
    parser.load_yaml(tmp)
  File "/usr/share/netplan/netplan/libnetplan.py", line 119, in load_yaml
    _checked_lib_call(lib.netplan_parser_load_yaml_from_fd, self._ptr, input_file.fileno())
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: (null):4:14: Error in network definition: invalid boolean value 'falsea'
(null)
             ^
$ netplan set --root-dir /tmp/a "bridges.eth0={dhcp4: false dhcp6: false}"
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 50, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 72, in command_set
    libnetplan.create_yaml_patch(yaml_path, value, tmp)
  File "/usr/share/netplan/netplan/libnetplan.py", line 501, in create_yaml_patch
    _checked_lib_call(lib.netplan_util_create_yaml_patch,
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: Error parsing YAML: did not find expected ',' or '}'

Same examples with this change:

$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script get
Command failed: Cannot open /etc/netplan/90-test.yaml: Permission denied
$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script set bridges.eth0.dhcp4=false
Command failed: Cannot open /etc/netplan/90-test.yaml: Permission denied
$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script set --root-dir /tmp/a bridges.eth0.dhcp4=falsea
Command failed: invalid boolean value 'falsea'
$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script set --root-dir /tmp/a "bridges.eth0={dhcp4: false dhcp6: false}"
Command failed: Error parsing YAML: did not find expected ',' or '}'

Description

Checklist

  • Runs make check successfully.
  • Retains 100% code coverage (make check-coverage).
  • New/changed keys in YAML format are documented.
  • (Optional) Adds example YAML for new feature.
  • (Optional) Closes an open bug in Launchpad.

@schopin-pro
Copy link
Contributor

I'm wary of this approach. While some errors should indeed be handled gracefully, especially those caused by user error, some exceptions could also be raised due to programmer error and are thus symptomatic of bugs.

We have machinery in Ubuntu to detect uncaught exceptions and report them to Launchpad, this would bypass it completely :/

@daniloegea daniloegea force-pushed the error_handling branch 2 times, most recently from b0435ab to a53fec7 Compare April 4, 2023 21:26
@daniloegea daniloegea requested review from schopin-pro and slyon April 4, 2023 23:18
@daniloegea
Copy link
Contributor Author

Thank you, @schopin-pro! I introduced a few more specific exceptions to avoid the problem you mentioned.

Copy link
Contributor

@schopin-pro schopin-pro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I just have one suggestion that I don't consider blocking here, plus some idle musing.

Feel free to disregard the suggestion and merge :)

dhcp4: nothanks''')

# The idea was to capture stderr here but for some reason
# my attempts to mock sys.stderr didn't work with pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: for the record, many test harnesses mess with std{err,out} to be able to capture output for failing tests while suppressing it for the rest, in order to keep a clean run output. It's not surprising that it conflicts with standard mocking techniques.

There seem to be a way to access the captured output with pytest though: https://docs.pytest.org/en/7.1.x/how-to/capture-stdout-stderr.html#accessing-captured-output-from-a-test-function

I'm happy with the alternative you came up with, though :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also the capture_stderr() context manager, provided by tests/parser/base.py, which could/should be used here to keep a similar tooling between the different test cases.

Copy link
Contributor

@slyon slyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I like the idea of this PR. But I feel it should be taken a bit further and merged with the suggestions made in #318 (especially parsing out the filename/row/col in addition to the exception type).

This would already be a good start. Longer term (maybe as a follow-up PR), we should actually differentiate the error types in libnetplan.so itself, to avoid parsing the error message strings in libnetplan.py.

dhcp4: nothanks''')

# The idea was to capture stderr here but for some reason
# my attempts to mock sys.stderr didn't work with pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also the capture_stderr() context manager, provided by tests/parser/base.py, which could/should be used here to keep a similar tooling between the different test cases.

@daniloegea daniloegea marked this pull request as draft April 18, 2023 16:27
@daniloegea
Copy link
Contributor Author

Thank you for all your comments. I agree we really need to incorporate the ideas from #318.

I demoted the PR to draft and tried to introduce the ideas from #318. It still needs some work but I think this is good enough for an RFC.

I introduced a bunch of enums with error codes so we can stop using the glib ones. It will be checked in the python code so we can raise specific exceptions for each error.

One thing I still want to do is to auto-generate the python enums based on the C ones so we don't need to keep them in sync manually.

Let me know what you think :)

@slyon slyon self-requested a review April 19, 2023 08:05
@schopin-pro
Copy link
Contributor

IMHO auto-generating the error bindings from the C code would be over-engineering at this point. It's an implementation detail that can be revisited later on anyway ;)

@daniloegea
Copy link
Contributor Author

I cleaned up the history mess a little bit and rebased the branch.

Copy link
Contributor

@slyon slyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I like the direction this is going! And especially the merge with PR #318

Here's a summary of my inline comments:

  • I agree with Simon that auto-sync between the C and Python enums is probably overkill. Those shouldn't change too frequently and as long as we have a solid testing story around it, we should be able to recognize and adopt for changes manually.
  • I asked for some more generic (fallback) error classes inline
  • I asked for some error cases to change their domain/code
  • There might be a conflict in the NETPLAN_ERROR_FILE domain, when returning errno as code.
  • I asked for some cleanup of parent-/child error-classes, using inheritance and default/positinal parameters, for better clarity and ease of use.

@daniloegea
Copy link
Contributor Author

I tried to address most of the comments. I also rebased this branch on top of main.

Copy link

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement in specialized exceptions. Thanks for the ping on the review. Minor questions and or comments. Ignore anything that doesn't make sense for your goals. I think LibNetplanValidationException looks like it'll surface exactly what cloud-init needs if we are calling the python api directly.

try:
self.run_command()
except LibNetplanValidationException as e:
message = f'{e.filename}:{e.line}:{e.column}: {e}'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about encoding a single character label in front of {e.line}:{e.column} to aid in discovery and processing for those that just see the exception string in logs.

Suggested change
message = f'{e.filename}:{e.line}:{e.column}: {e}'
message = f'{e.filename}:l{e.line}:c{e.column}: {e}'


class LibNetplanException(Exception):
SCHEMA_ERROR_MSG_REGEX = (
r'(?P<file_path>.*):(?P<error_line>\d+):(?P<error_col>\d+): (?P<message>.*)'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again wondering about making a clearer encoded format for line and column positions to ensure it's easier to read on glance in logs, and less likely to mis-match with other errors which may have sprinklings of :\d+:\d+ in them in the future

Suggested change
r'(?P<file_path>.*):(?P<error_line>\d+):(?P<error_col>\d+): (?P<message>.*)'
r'(?P<file_path>.*):l(?P<error_line>\d+):c(?P<error_col>\d+): (?P<message>.*)'

elif NETPLAN_ERROR_DOMAINS.is_parser_error(domain):
return LibNetplanParserException(message, domain=domain, error=error)

elif NETPLAN_ERROR_DOMAINS.is_file_error(domain) or NETPLAN_ERROR_DOMAINS.is_file_errno(domain):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there expected cases where we'd treat handling of these two "domains" as distinct?
If not, we coiuld just have

    def is_file_error(cls, domain):
        return domain in (cls.NETPLAN_FILE_ERROR.value, cls.NETPLAN_FILE_ERRNO.value)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely refactored this part of the code. Hopefully it's better now.

def _get_exception_from_error(domain, error, message):

if NETPLAN_ERROR_DOMAINS.is_validation_error(domain):
if NETPLAN_VALIDATION_ERRORS.is_config_validation_error(error):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check is_config_validation_error?
It seems we are NETPLAN_ERROR_CONFIG_GENERIC we'd get True on is_validation_error, fallthrough to the LibNetplanException at the bottom of the function.

We could just avoid fallingthrough and return the LibNetplanException for NETPLAN_ERROR_CONFIG_GENERIC up here. I guess it really doesn't matter as I think we raise the same LibNetplanException in either case.

if NETPLAN_ERROR_DOMAINS.is_validation_error(domain):
    schema_error = re.match(SCHEMA_ERROR_MSG_REGEX, message)
    if not schema_error:
        return LibNetplanException(message, domain=domain, error=error)  # pragma: nocover
    file_path = schema_error["file_path"]
    error_line = schema_error["error_line"]
    error_col = schema_error["error_col"]
    error_message = schema_error["message"]

    # some error messages might have line breaks, let's ignore them
    msg = error_message.split('\n')
    msg = msg[0].strip()

    return LibNetplanValidationException(msg, domain, error, file_path, error_line, error_col)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored all this code. Now ParserExceptions will have the file name, line and column numbers. As config validations don't know the YAML context, only the YAML file from where the config was loaded, ValidationExceptions will only have the file name.

I also replaced all this if-elif-else with a lookup table. It looks cleaner now.

@daniloegea daniloegea force-pushed the error_handling branch 3 times, most recently from aa11cc4 to a95a3ea Compare May 8, 2023 19:07
@daniloegea daniloegea marked this pull request as ready for review May 10, 2023 07:59
@daniloegea daniloegea requested a review from slyon May 10, 2023 07:59
@daniloegea daniloegea requested a review from blackboxsw May 11, 2023 19:05
Copy link
Contributor

@slyon slyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the great work!

I worked on the FO114 spec to get it in line with this PR and did another review pass with some more inline comments.

@daniloegea daniloegea requested a review from slyon May 17, 2023 13:11
Copy link
Contributor

@slyon slyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I left two smaller comments in the inline discussion for lbnetplan.py and openvswitch.c

Copy link

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @daniloegea for the updates. Just a couple of questions inline about what conditions we expect to lead to NETPLAN_PARSER_ERROR vs NETPLAN_VALIDATION_ERROR so we know how to handle those conditions better.

Also, not for this PR, but for discussion:

Have there been thoughts about whether it's useful to represent schema validation failures with a supplement xpath or jsonpath string to represent where in the merged network config a schema validation error originated from? A jsonpath or xpath may give us the ability with the full merged config (maybe NETPLAN_VALIDATION_ERROR even when no file path is present) to represent the offending object key or path within the network config that is causing problems. It also may aid in any UX or UI that attempts to represent full merged network config with annotated errors anchored at the specific keys or values within a given network config dictionary when filename, line and column values don't strictly make sense.

Comment on lines 118 to 120
file_path = schema_error["file_path"]
error_line = schema_error["error_line"]
error_col = schema_error["error_col"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we want these locals here instead of just assigning the instance attrs and drop the assignment we perform below the msg processing?

Suggested change
file_path = schema_error["file_path"]
error_line = schema_error["error_line"]
error_col = schema_error["error_col"]
self.filename = schema_error["file_path"]
self.line = schema_error["error_line"]
self.column = schema_error["error_col"]

#
# filename.yaml:4:14: Error in network definition: invalid boolean value 'falsea'
#
schema_error = re.match(self.SCHEMA_PARSER_ERROR_MSG_REGEX, message)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to raise a ValueError or RuntimeError with something more descriptive if message doesn't meet regex otherwise you'll get ugly tracebacks like:n
TypeError: 'NoneType' object is not subscriptable that may be hard to debug/understand at runtime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good point, I think I was doing that at some point but removed during the refactoring...


def __init__(self, message=None, domain=None, error=None):
super().__init__(message, domain, error)
schema_error = re.match(self.SCHEMA_VALIDATION_ERROR_MSG_REGEX, message)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to raise ValueError or RuntimeErrors here with an informative message if message doesn't match SCHEMA_VALIDATION_ERROR_MSG_REGEX. This prevents the obscure traceback TypeError: 'NoneType' object is not subscriptable at runtime

self.assertEqual(yaml.safe_load(output), None)

def test_validation_error_exception(self):
parser = libnetplan.Parser()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain in this docstring why this test is supposed to fail? It's valid yaml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we probably should better document our tests... 😅

In this case it will raise ValidationException because "set-name" requires the use of "match".

class _GError(ctypes.Structure):
class NetplanValidationException(NetplanException):
'''
Netplan Validation errors are expected to contain the YAML file name

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this docstr explain in what cases we expect this class to be used (and a docstr on NetplanParserException to better differentiate what symptom each exception indicates).

* Errors for domain NETPLAN_VALIDATION_ERROR
*
* VALIDATION_ERRORS are expected to contain only the YAML file name
* where the error was found.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this error path. In what scenarios do we expect to only have a filename available during error handling and not specific guidance about the lines, columns or config xpath object that is representing a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will improve the docstring.

So, libnetplan walks through its internal representation of the network configuration to validate it. At this point we don't have the YAML context anymore, only the filename.

When we parse the files, we are walking through the YAML nodes so we have more information.

@daniloegea
Copy link
Contributor Author

daniloegea commented May 22, 2023

Thanks @slyon and @blackboxsw for your comments. I believe I addressed most of them.

I think building the jsonpath for ParseExceptions wouldn't be hard. Maybe we could consider doing that in the near future. It sounds useful.

Danilo Egea Gondolfo added 5 commits May 26, 2023 13:57
Define Netplan specific errors so we don't need to use generic glib
error codes.
Now we can raise specific exceptions based on the error domain and code
passed from libnetplan.

The exceptions will contain details about the error, such as filename, line
and column numbers, the error message, error code and error domain.
It's still a glib GError in libnetplan but we call it NetplanError.
@daniloegea
Copy link
Contributor Author

Thank you all for the reviews, I'll go ahead and merge it then.

@daniloegea daniloegea merged commit 1e5c699 into canonical:main May 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants