netplan: Handle command exceptions by daniloegea · Pull Request #334 · canonical/netplan

daniloegea · 2023-03-09T17:44:04Z

libnetplan doesn't differentiate each type of error very well. It impacts mainly its frontend where we rely on error message interpretation to tell the user what happened.
As we are planning to offer the Python bindings as a module that will be consumed by others, such as cloud-init, it's important to improve the API error handling.

This RFC incorporates the ideas from #318 and defines specific libnetplan errors and exception for each type with some details.

Originally, this PR was intended to improve error handling to fix issues like the ones mentioned below (which is still addressed).

Old description (still valid):

When netplan get or set fails for some reason, due to bad inputs or file permissions for example, netplan will crash. This handles these situations gracefully.

Examples:

$ netplan get
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/get.py", line 43, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/get.py", line 61, in command_get
    parser.load_yaml_hierarchy(rootdir=self.root_dir)
  File "/usr/share/netplan/netplan/libnetplan.py", line 122, in load_yaml_hierarchy
    _checked_lib_call(lib.netplan_parser_load_yaml_hierarchy, self._ptr, rootdir.encode('utf-8'))
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: Cannot open /etc/netplan/90-test.yaml: Permission denied

$ netplan set bridges.eth0.dhcp4=false
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 50, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 81, in command_set
    parser.load_yaml_hierarchy(self.root_dir)
  File "/usr/share/netplan/netplan/libnetplan.py", line 122, in load_yaml_hierarchy
    _checked_lib_call(lib.netplan_parser_load_yaml_hierarchy, self._ptr, rootdir.encode('utf-8'))
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: Cannot open /etc/netplan/90-test.yaml: Permission denied

$ netplan set --root-dir /tmp/a bridges.eth0.dhcp4=falsea
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 50, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 85, in command_set
    parser.load_yaml(tmp)
  File "/usr/share/netplan/netplan/libnetplan.py", line 119, in load_yaml
    _checked_lib_call(lib.netplan_parser_load_yaml_from_fd, self._ptr, input_file.fileno())
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: (null):4:14: Error in network definition: invalid boolean value 'falsea'
(null)
             ^

$ netplan set --root-dir /tmp/a "bridges.eth0={dhcp4: false dhcp6: false}"
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 50, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 241, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/set.py", line 72, in command_set
    libnetplan.create_yaml_patch(yaml_path, value, tmp)
  File "/usr/share/netplan/netplan/libnetplan.py", line 501, in create_yaml_patch
    _checked_lib_call(lib.netplan_util_create_yaml_patch,
  File "/usr/share/netplan/netplan/libnetplan.py", line 79, in _checked_lib_call
    raise LibNetplanException(err.contents.message.decode('utf-8'))
netplan.libnetplan.LibNetplanException: Error parsing YAML: did not find expected ',' or '}'

Same examples with this change:

$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script get
Command failed: Cannot open /etc/netplan/90-test.yaml: Permission denied

$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script set bridges.eth0.dhcp4=false
Command failed: Cannot open /etc/netplan/90-test.yaml: Permission denied

$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script set --root-dir /tmp/a bridges.eth0.dhcp4=falsea
Command failed: invalid boolean value 'falsea'

$ NETPLAN_GENERATE_PATH=build/src/generate LD_LIBRARY_PATH=build/src/ PYTHONPATH=. ./src/netplan.script set --root-dir /tmp/a "bridges.eth0={dhcp4: false dhcp6: false}"
Command failed: Error parsing YAML: did not find expected ',' or '}'

Description

Checklist

Runs make check successfully.
Retains 100% code coverage (make check-coverage).
New/changed keys in YAML format are documented.
(Optional) Adds example YAML for new feature.
(Optional) Closes an open bug in Launchpad.

schopin-pro · 2023-03-17T13:30:40Z

I'm wary of this approach. While some errors should indeed be handled gracefully, especially those caused by user error, some exceptions could also be raised due to programmer error and are thus symptomatic of bugs.

We have machinery in Ubuntu to detect uncaught exceptions and report them to Launchpad, this would bypass it completely :/

daniloegea · 2023-04-04T23:20:33Z

Thank you, @schopin-pro! I introduced a few more specific exceptions to avoid the problem you mentioned.

schopin-pro

LGTM.

I just have one suggestion that I don't consider blocking here, plus some idle musing.

Feel free to disregard the suggestion and merge :)

netplan/libnetplan.py

netplan/configmanager.py

schopin-pro · 2023-04-06T09:19:48Z

tests/cli/test_units.py

+                  dhcp4: nothanks''')
+
+        # The idea was to capture stderr here but for some reason
+        # my attempts to mock sys.stderr didn't work with pytest


note: for the record, many test harnesses mess with std{err,out} to be able to capture output for failing tests while suppressing it for the rest, in order to keep a clean run output. It's not surprising that it conflicts with standard mocking techniques.

There seem to be a way to access the captured output with pytest though: https://docs.pytest.org/en/7.1.x/how-to/capture-stdout-stderr.html#accessing-captured-output-from-a-test-function

I'm happy with the alternative you came up with, though :)

There's also the capture_stderr() context manager, provided by tests/parser/base.py, which could/should be used here to keep a similar tooling between the different test cases.

slyon

Overall, I like the idea of this PR. But I feel it should be taken a bit further and merged with the suggestions made in #318 (especially parsing out the filename/row/col in addition to the exception type).

This would already be a good start. Longer term (maybe as a follow-up PR), we should actually differentiate the error types in libnetplan.so itself, to avoid parsing the error message strings in libnetplan.py.

netplan/configmanager.py

netplan/cli/commands/set.py

netplan/libnetplan.py

slyon · 2023-04-12T12:43:58Z

tests/cli/test_units.py

+                  dhcp4: nothanks''')
+
+        # The idea was to capture stderr here but for some reason
+        # my attempts to mock sys.stderr didn't work with pytest


There's also the capture_stderr() context manager, provided by tests/parser/base.py, which could/should be used here to keep a similar tooling between the different test cases.

daniloegea · 2023-04-18T16:55:26Z

Thank you for all your comments. I agree we really need to incorporate the ideas from #318.

I demoted the PR to draft and tried to introduce the ideas from #318. It still needs some work but I think this is good enough for an RFC.

I introduced a bunch of enums with error codes so we can stop using the glib ones. It will be checked in the python code so we can raise specific exceptions for each error.

One thing I still want to do is to auto-generate the python enums based on the C ones so we don't need to keep them in sync manually.

Let me know what you think :)

schopin-pro · 2023-04-20T09:12:32Z

IMHO auto-generating the error bindings from the C code would be over-engineering at this point. It's an implementation detail that can be revisited later on anyway ;)

daniloegea · 2023-04-21T16:10:59Z

I cleaned up the history mess a little bit and rebased the branch.

slyon

Thank you, I like the direction this is going! And especially the merge with PR #318

Here's a summary of my inline comments:

I agree with Simon that auto-sync between the C and Python enums is probably overkill. Those shouldn't change too frequently and as long as we have a solid testing story around it, we should be able to recognize and adopt for changes manually.
I asked for some more generic (fallback) error classes inline
I asked for some error cases to change their domain/code
There might be a conflict in the NETPLAN_ERROR_FILE domain, when returning errno as code.
I asked for some cleanup of parent-/child error-classes, using inheritance and default/positinal parameters, for better clarity and ease of use.

include/types.h

netplan/cli/core.py

netplan/cli/sriov.py

netplan/libnetplan.py

src/parse.c

src/util.c

src/validation.c

daniloegea · 2023-04-28T17:19:17Z

I tried to address most of the comments. I also rebased this branch on top of main.

blackboxsw

Nice improvement in specialized exceptions. Thanks for the ping on the review. Minor questions and or comments. Ignore anything that doesn't make sense for your goals. I think LibNetplanValidationException looks like it'll surface exactly what cloud-init needs if we are calling the python api directly.

blackboxsw · 2023-04-30T12:36:52Z

netplan/cli/core.py

+        try:
+            self.run_command()
+        except LibNetplanValidationException as e:
+            message = f'{e.filename}:{e.line}:{e.column}: {e}'


What do you think about encoding a single character label in front of {e.line}:{e.column} to aid in discovery and processing for those that just see the exception string in logs.

Suggested change

message = f'{e.filename}:{e.line}:{e.column}: {e}'

message = f'{e.filename}:l{e.line}:c{e.column}: {e}'

blackboxsw · 2023-04-30T12:38:56Z

netplan/libnetplan.py


-class LibNetplanException(Exception):
+SCHEMA_ERROR_MSG_REGEX = (
+    r'(?P<file_path>.*):(?P<error_line>\d+):(?P<error_col>\d+): (?P<message>.*)'


Again wondering about making a clearer encoded format for line and column positions to ensure it's easier to read on glance in logs, and less likely to mis-match with other errors which may have sprinklings of :\d+:\d+ in them in the future

Suggested change

r'(?P<file_path>.*):(?P<error_line>\d+):(?P<error_col>\d+): (?P<message>.*)'

r'(?P<file_path>.*):l(?P<error_line>\d+):c(?P<error_col>\d+): (?P<message>.*)'

blackboxsw · 2023-04-30T12:42:50Z

netplan/libnetplan.py

+    elif NETPLAN_ERROR_DOMAINS.is_parser_error(domain):
+        return LibNetplanParserException(message, domain=domain, error=error)
+
+    elif NETPLAN_ERROR_DOMAINS.is_file_error(domain) or NETPLAN_ERROR_DOMAINS.is_file_errno(domain):


Are there expected cases where we'd treat handling of these two "domains" as distinct?
If not, we coiuld just have

def is_file_error(cls, domain): return domain in (cls.NETPLAN_FILE_ERROR.value, cls.NETPLAN_FILE_ERRNO.value)

I completely refactored this part of the code. Hopefully it's better now.

blackboxsw · 2023-04-30T12:51:10Z

netplan/libnetplan.py

+def _get_exception_from_error(domain, error, message):
+
+    if NETPLAN_ERROR_DOMAINS.is_validation_error(domain):
+        if NETPLAN_VALIDATION_ERRORS.is_config_validation_error(error):


Do we need this check is_config_validation_error?
It seems we are NETPLAN_ERROR_CONFIG_GENERIC we'd get True on is_validation_error, fallthrough to the LibNetplanException at the bottom of the function.

We could just avoid fallingthrough and return the LibNetplanException for NETPLAN_ERROR_CONFIG_GENERIC up here. I guess it really doesn't matter as I think we raise the same LibNetplanException in either case.

if NETPLAN_ERROR_DOMAINS.is_validation_error(domain): schema_error = re.match(SCHEMA_ERROR_MSG_REGEX, message) if not schema_error: return LibNetplanException(message, domain=domain, error=error) # pragma: nocover file_path = schema_error["file_path"] error_line = schema_error["error_line"] error_col = schema_error["error_col"] error_message = schema_error["message"] # some error messages might have line breaks, let's ignore them msg = error_message.split('\n') msg = msg[0].strip() return LibNetplanValidationException(msg, domain, error, file_path, error_line, error_col)

I refactored all this code. Now ParserExceptions will have the file name, line and column numbers. As config validations don't know the YAML context, only the YAML file from where the config was loaded, ValidationExceptions will only have the file name.

I also replaced all this if-elif-else with a lookup table. It looks cleaner now.

slyon

Thank you for the great work!

I worked on the FO114 spec to get it in line with this PR and did another review pass with some more inline comments.

netplan/libnetplan.py

tests/cli/test_get_set.py

tests/parser/base.py

src/openvswitch.c

src/sriov.c

src/validation.c

slyon

LGTM!

I left two smaller comments in the inline discussion for lbnetplan.py and openvswitch.c

blackboxsw

Thanks @daniloegea for the updates. Just a couple of questions inline about what conditions we expect to lead to NETPLAN_PARSER_ERROR vs NETPLAN_VALIDATION_ERROR so we know how to handle those conditions better.

Also, not for this PR, but for discussion:

Have there been thoughts about whether it's useful to represent schema validation failures with a supplement xpath or jsonpath string to represent where in the merged network config a schema validation error originated from? A jsonpath or xpath may give us the ability with the full merged config (maybe NETPLAN_VALIDATION_ERROR even when no file path is present) to represent the offending object key or path within the network config that is causing problems. It also may aid in any UX or UI that attempts to represent full merged network config with annotated errors anchored at the specific keys or values within a given network config dictionary when filename, line and column values don't strictly make sense.

blackboxsw · 2023-05-17T16:25:33Z

netplan/libnetplan.py

+        file_path = schema_error["file_path"]
+        error_line = schema_error["error_line"]
+        error_col = schema_error["error_col"]


Is there a reason we want these locals here instead of just assigning the instance attrs and drop the assignment we perform below the msg processing?

Suggested change

file_path = schema_error["file_path"]

error_line = schema_error["error_line"]

error_col = schema_error["error_col"]

self.filename = schema_error["file_path"]

self.line = schema_error["error_line"]

self.column = schema_error["error_col"]

netplan/libnetplan.py

blackboxsw · 2023-05-17T16:50:04Z

netplan/libnetplan.py

+        #
+        # filename.yaml:4:14: Error in network definition: invalid boolean value 'falsea'
+        #
+        schema_error = re.match(self.SCHEMA_PARSER_ERROR_MSG_REGEX, message)


Do you want to raise a ValueError or RuntimeError with something more descriptive if message doesn't meet regex otherwise you'll get ugly tracebacks like:n
TypeError: 'NoneType' object is not subscriptable that may be hard to debug/understand at runtime

Yes, that's a good point, I think I was doing that at some point but removed during the refactoring...

blackboxsw · 2023-05-17T16:55:58Z

netplan/libnetplan.py

+
+    def __init__(self, message=None, domain=None, error=None):
+        super().__init__(message, domain, error)
+        schema_error = re.match(self.SCHEMA_VALIDATION_ERROR_MSG_REGEX, message)


Do we want to raise ValueError or RuntimeErrors here with an informative message if message doesn't match SCHEMA_VALIDATION_ERROR_MSG_REGEX. This prevents the obscure traceback TypeError: 'NoneType' object is not subscriptable at runtime

blackboxsw · 2023-05-17T16:59:13Z

tests/test_libnetplan.py

            self.assertEqual(yaml.safe_load(output), None)
+
+    def test_validation_error_exception(self):
+        parser = libnetplan.Parser()


Could you please explain in this docstring why this test is supposed to fail? It's valid yaml.

Yeah we probably should better document our tests... 😅

In this case it will raise ValidationException because "set-name" requires the use of "match".

blackboxsw · 2023-05-17T17:01:54Z

netplan/libnetplan.py

-class _GError(ctypes.Structure):
+class NetplanValidationException(NetplanException):
+    '''
+    Netplan Validation errors are expected to contain the YAML file name


Can this docstr explain in what cases we expect this class to be used (and a docstr on NetplanParserException to better differentiate what symptom each exception indicates).

blackboxsw · 2023-05-17T17:11:01Z

include/types.h

+ * Errors for domain NETPLAN_VALIDATION_ERROR
+ *
+ * VALIDATION_ERRORS are expected to contain only the YAML file name
+ * where the error was found.


I don't understand this error path. In what scenarios do we expect to only have a filename available during error handling and not specific guidance about the lines, columns or config xpath object that is representing a problem?

I will improve the docstring.

So, libnetplan walks through its internal representation of the network configuration to validate it. At this point we don't have the YAML context anymore, only the filename.

When we parse the files, we are walking through the YAML nodes so we have more information.

daniloegea · 2023-05-22T11:03:50Z

Thanks @slyon and @blackboxsw for your comments. I believe I addressed most of them.

I think building the jsonpath for ParseExceptions wouldn't be hard. Maybe we could consider doing that in the near future. It sounds useful.

Define Netplan specific errors so we don't need to use generic glib error codes.

Now we can raise specific exceptions based on the error domain and code passed from libnetplan. The exceptions will contain details about the error, such as filename, line and column numbers, the error message, error code and error domain.

It's still a glib GError in libnetplan but we call it NetplanError.

daniloegea · 2023-05-30T08:13:18Z

Thank you all for the reviews, I'll go ahead and merge it then.

daniloegea force-pushed the error_handling branch 2 times, most recently from b0435ab to a53fec7 Compare April 4, 2023 21:26

daniloegea requested review from schopin-pro and slyon April 4, 2023 23:18

schopin-pro approved these changes Apr 6, 2023

View reviewed changes

slyon reviewed Apr 12, 2023

View reviewed changes

daniloegea marked this pull request as draft April 18, 2023 16:27

slyon self-requested a review April 19, 2023 08:05

daniloegea force-pushed the error_handling branch from 6191895 to d62db99 Compare April 21, 2023 15:57

slyon reviewed Apr 24, 2023

View reviewed changes

slyon mentioned this pull request Apr 24, 2023

request for comment: specialized schema parse errors #318

Closed

5 tasks

slyon requested a review from blackboxsw April 24, 2023 10:37

slyon mentioned this pull request Apr 24, 2023

libnetplan: don't try to read from a NULL file #342

Merged

5 tasks

daniloegea force-pushed the error_handling branch from d62db99 to 38c354c Compare April 28, 2023 15:52

blackboxsw reviewed Apr 30, 2023

View reviewed changes

daniloegea force-pushed the error_handling branch 3 times, most recently from aa11cc4 to a95a3ea Compare May 8, 2023 19:07

daniloegea marked this pull request as ready for review May 10, 2023 07:59

daniloegea requested a review from slyon May 10, 2023 07:59

daniloegea force-pushed the error_handling branch from a95a3ea to 6b78d38 Compare May 10, 2023 08:42

daniloegea requested a review from blackboxsw May 11, 2023 19:05

slyon reviewed May 15, 2023

View reviewed changes

daniloegea force-pushed the error_handling branch from 6b78d38 to 157ad58 Compare May 17, 2023 10:24

daniloegea requested a review from slyon May 17, 2023 13:11

slyon approved these changes May 17, 2023

View reviewed changes

blackboxsw reviewed May 17, 2023

View reviewed changes

daniloegea force-pushed the error_handling branch from 157ad58 to 8578f8f Compare May 22, 2023 10:49

daniloegea force-pushed the error_handling branch from 8578f8f to e05f400 Compare May 25, 2023 17:02

Danilo Egea Gondolfo added 5 commits May 26, 2023 13:57

netplan: introduce some classes of errors

9496d84

Define Netplan specific errors so we don't need to use generic glib error codes.

netplan: replace glib errors with the new ones

70c5247

netplan: create exceptions for each error code

6fb5afa

Now we can raise specific exceptions based on the error domain and code passed from libnetplan. The exceptions will contain details about the error, such as filename, line and column numbers, the error message, error code and error domain.

tests: adjust the tests for the new error handling

13f6d6c

errors: adopt NetplanError in favor of GError

a78b001

It's still a glib GError in libnetplan but we call it NetplanError.

daniloegea force-pushed the error_handling branch from e05f400 to a78b001 Compare May 26, 2023 12:58

daniloegea merged commit 1e5c699 into canonical:main May 30, 2023

	message = f'{e.filename}:{e.line}:{e.column}: {e}'
	message = f'{e.filename}:l{e.line}:c{e.column}: {e}'

	r'(?P<file_path>.):(?P<error_line>\d+):(?P<error_col>\d+): (?P<message>.)'
	r'(?P<file_path>.):l(?P<error_line>\d+):c(?P<error_col>\d+): (?P<message>.)'

Conversation

daniloegea commented Mar 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

schopin-pro commented Mar 17, 2023

Uh oh!

daniloegea commented Apr 4, 2023

Uh oh!

schopin-pro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

slyon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

daniloegea commented Apr 18, 2023

Uh oh!

schopin-pro commented Apr 20, 2023

Uh oh!

daniloegea commented Apr 21, 2023

Uh oh!

slyon left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

daniloegea commented Apr 28, 2023

Uh oh!

blackboxsw left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

slyon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

slyon left a comment

Choose a reason for hiding this comment

daniloegea commented Mar 9, 2023 •

edited

Loading

slyon left a comment •

edited

Loading

daniloegea commented May 22, 2023 •

edited

Loading