
Conversation

@pcahyna (Member) commented Apr 28, 2021

Pull Request Details:
  • Type: Bug Fix

  • Impact: High

  • Reference to related issue (URL): Can't restore EFI system with Software Raid #2595

  • How was this pull request tested?
    Backup & restore on a RHEL 8 host with UEFI but without RAID and on a CentOS 7 host with UEFI and RAID (ESP on RAID). Examination of logs to check that correct efibootmgr commands were executed.

  • Brief description of the changes in this pull request:

Use dependencies when determining partitions for UEFI

The ESP may be located on a RAID device. In this case, we need to determine the physical RAID components and call efibootmgr on them (after restoring - during the finalize step), because efibootmgr does not know how to tell the firmware to boot from RAID. Moreover, the current code tries to parse a device like /dev/md125 into a disk called /dev/md and partition 125.

Fix by using storage dependencies: RAID components (the actual ESP copies) can be found as dependencies of the ESP block device. For consistency, use dependencies also to locate the ESP block device itself (the old method is preserved as a fallback). A similar strategy is used by the GRUB2 installation code (finalize/Linux-i386/660_install_grub2.sh), which is able to install GRUB2 on the RAID components correctly.
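For illustration only (this is not the actual ReaR code, and the /boot/efi mountpoint is just an example), the RAID components of an ESP could be discovered with standard tools roughly like this:

# Find the block device providing the ESP mountpoint, e.g. /dev/md125:
esp_device=$( findmnt -no SOURCE /boot/efi )

# When the ESP sits on an MD RAID device, its physical components
# (the actual ESP copies) appear as "slaves" of the device in sysfs:
if [[ $esp_device == /dev/md* ]] ; then
    for slave in /sys/class/block/${esp_device#/dev/}/slaves/* ; do
        echo "/dev/${slave##*/}"    # e.g. /dev/sda1 and /dev/sdb1
    done
else
    echo "$esp_device"
fi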

In addition, clean up usr/share/rear/finalize/Linux-i386/670_run_efibootmgr.sh a bit and add more logging to ease analysis when something goes wrong in this area.

TODO: finalize/Linux-i386/610_EFISTUB_run_efibootmgr.sh needs a similar fix.

Fixes #2595

@gdha gdha requested review from gozora and jsmeix April 29, 2021 08:22
@gdha gdha added the "enhancement" (Adaptions and new features) label Apr 29, 2021
(Diff excerpt: the end of get_device_from_partition, which strips the partition
number and, for partition names like /dev/nvme0n1p3, a trailing "p" separator:)

device=${device%p}
fi

echo $device
@jsmeix (Member) commented Apr 29, 2021

What if $device is somehow invalid or empty?
Could some checking be implemented?
Currently get_device_from_partition is called only once,
so the caller code could be enhanced to do some error handling
if get_device_from_partition returns a non-zero exit value,
or if its output is empty (if that indicates an error).
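A hypothetical sketch of such caller-side error handling (the variable names here are illustrative, not taken from the actual caller code):

disk=$( get_device_from_partition "$partition_block_device" )
if test $? -ne 0 -o -z "$disk" ; then
    LogPrintError "Cannot determine the disk for partition '$partition_block_device'"
    return 1
fi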

@pcahyna (Member, Author)

@jsmeix

$device can't be empty here, it has been created in this function, unless the caller has provided an empty $1. Should we protect against that?

Concerning validity, we could check if the device actually exists and is a block device.

@jsmeix (Member)

You may catch empty $1 after partition_block_device=$1 with something like

contains_visible_char "$partition_block_device" || BugError "Empty string passed to get_device_from_partition"

or probably even better

test -b "$partition_block_device" || BugError "get_device_from_partition called with '$partition_block_device' that is no block device"

which also catches when $1 is empty or only blanks and it shows that in the error message like

get_device_from_partition called with '' that is no block device

or

get_device_from_partition called with ' ' that is no block device

I tested that test -b behaves well also for symlinks
(at least on my openSUSE Leap 15.2 system):

# ls -l /dev/disk/by-id/wwn-0x5000039462b83c55-part1 | grep sda1
lrwxrwxrwx 1 root root 10 Apr 30 07:36 /dev/disk/by-id/wwn-0x5000039462b83c55-part1 -> ../../sda1

# test -b /dev/disk/by-id/wwn-0x5000039462b83c55-part1 && echo y || echo n
y

@jsmeix (Member) commented Apr 30, 2021

Additionally, the current "string munging" implementation
may result in a $device string that is not a block device,
so something like

test -b "$device" && echo $device

(test -b "$device" returns 1 when it is not a block device)
as the last command in the function
makes the function return 1 in case of errors,
so that its caller can do whatever is best in the caller's environment.

Note that quoting is crucial here because

# test -b qqq && echo y || echo n
n

# test -b     && echo y || echo n
y

# test -b " " && echo y || echo n
n

i.e. test -b returns 0 when its unquoted argument is empty or only blanks.
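Putting both suggested checks together, the function could end up roughly like this sketch (the existing "string munging" itself is elided):

function get_device_from_partition () {
    local partition_block_device=$1
    # Fail loudly when the caller passed something that is no block device
    # (this also catches an empty or blanks-only $1):
    test -b "$partition_block_device" || BugError "get_device_from_partition called with '$partition_block_device' that is no block device"
    local device
    # ... the existing "string munging" that derives $device ...
    # Return non-zero when the munging did not result in a block device:
    test -b "$device" && echo $device
}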

@pcahyna (Member, Author)

Concerning invalid block devices, how does disk mapping work when block devices in the original system are mapped to different names in the rescue system? Are the layout and mapping regenerated to refer to the mapped names, or do they contain the original names, so that one has to map them after resolving dependencies?
If the latter, it will have an impact on the desirability of validation in this function, as it will be legitimate to handle device names that do not correspond to actually existing devices.

@jsmeix (Member)

Weekend is near so only a quick reply off the top of my head:

When things fail with errors because of disk mappings,
that is still better than silently proceeding with wrong data.
I prefer user reports where things fail with clear error messages
over user reports when things fail in weird inexplicable ways,
cf. "Try hard to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style

In migration mode "all needed files" (whatever that exactly means in practice ;-)
should be adapted during "rear recover".

The starting point is layout/prepare/default/300_map_disks.sh and
the actual mapping happens via layout/prepare/default/320_apply_mappings.sh
which contains

for file_to_migrate in "$LAYOUT_FILE" "$original_disk_space_usage_file" "$rescue_config_file" ; do
...
    ... apply_layout_mappings "$file_to_migrate" ...

so as far as it looks, the mappings are only applied to
disklayout.conf layout/config/df.txt /etc/rear/rescue.conf

I assume those three files are usually sufficient
but I don't know for sure, in particular not for special cases.

@jsmeix (Member)

Searching the ReaR scripts for apply_layout_mappings
(that is the function that does the actual work)
finds that apply_layout_mappings is also called during "rear recover" in
finalize/GNU/Linux/250_migrate_disk_devices_layout.sh
and
finalize/GNU/Linux/260_rename_diskbyid.sh
where a lot more files (in particular restored bootloader config files)
may also get adapted.

I am not a sufficient UEFI booting expert to know if there are missing
efibootmgr and/or EFI bootloader related files that would need to be adapted.

As far as I understand the code in finalize/Linux-i386/670_run_efibootmgr.sh,
it seems disk mapping cannot affect how efibootmgr is called there,
because that script works directly on the actual state in the
currently running recovery system: everything is based on where the ESP
is mounted in the currently running recovery system,
so nothing possibly outdated from how things were on the original system is used.
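For illustration, the "actual state" approach amounts to something like this (a paraphrase, not the verbatim script code):

# Determine where the filesystem that contains the restored UEFI bootloader
# file is mounted in the currently running recovery system, i.e. the ESP:
esp_mountpoint=$( df -P "$TARGET_FS_ROOT/$UEFI_BOOTLOADER" | awk 'END { print $6 }' )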

@pcahyna (Member, Author)

> As far as I understand the code in finalize/Linux-i386/670_run_efibootmgr.sh,
> it seems disk mapping cannot affect how efibootmgr is called there,
> because that script works directly on the actual state in the
> currently running recovery system: everything is based on where the ESP
> is mounted in the currently running recovery system

After this proposed change, the code will use disk mapping (and not just the actual state). But, thanks to apply_layout_mappings being called earlier, the disk mappings should be in sync with the current disk names, so there should be no problem.

I am a bit surprised, though, that apply_layout_mappings gets called multiple times.

@jsmeix (Member)

Regarding "something is done multiple times" and similar things in ReaR:
Welcome to surprising adventures with grown code over many years by various contributors ;-)

I think ReaR is a perfect bazaar as in
https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar

@pcahyna (Member, Author)

I added both checks that you suggested (for get_device_from_partition input and result).

@jsmeix (Member) left a comment

From plain looking at the changes they look good to me
so I can approve it "bona fide" - in particular because things
are sufficiently commented so I think I even understand it a bit ;-)

@jsmeix jsmeix self-assigned this Apr 29, 2021
@jsmeix jsmeix added this to the ReaR v2.7 milestone Apr 29, 2021
@jsmeix jsmeix requested a review from gdha April 29, 2021 11:56
@pcahyna (Member, Author) commented Apr 29, 2021

@gdha hello, should it have the enhancement label, when it Fixes #2595, which has the bug label?

@pcahyna (Member, Author) commented Apr 29, 2021

@rmetrich hello, please review, as you reported the problem.

@rmetrich (Contributor)

Hi @pcahyna, I'm all good with your code, thanks a lot for the effort.

@jsmeix jsmeix added the "bug" (The code does not do what it is meant to do) label Apr 30, 2021
@jsmeix (Member) commented Apr 30, 2021

I agree that ReaR is expected to restore a software RAID also on UEFI systems
so the issue is a missing expected functionality a.k.a. "bug".

On the other hand there are zillions of other "reasonably expected" functionalities
that are currently missing in ReaR, so such issues are also enhancements,
in particular when "reasonably expected" code for that is just not there.

For me a plain "bug" is only when code that is there does not work as intended.

I.e. it is also not a plain "bug" when code that is there does work as intended
but the current intention is not what others "reasonably expect" so such code
is "buggy" and needs some enhancement to behave as "reasonably expected".

@jsmeix jsmeix added the cleanup label Apr 30, 2021
@jsmeix (Member) commented Apr 30, 2021

@pcahyna
thank you so much for your additional cleanup work like your recent
8228246

I do very much appreciate it!
It helps so much to get ReaR's code step by step into a better state
where others can easily understand WHY and WHAT and HOW things are done in ReaR
so that others can easily adapt and enhance ReaR in the future.

@pcahyna (Member, Author) commented Apr 30, 2021

@jsmeix, glad you like it! You know,
https://despair.com/products/quality

I can also replace the non-standard case identifiers like Dev and ParNr by something that conforms to the style guide, if you think that it makes sense to do it in the same PR.

@jsmeix (Member) commented Apr 30, 2021

@pcahyna
yes, feel free to mercilessly replace old non-standard identifiers with better names,
in particular, if needed, with more explanatory names,
cf. "Code must be easy to read" in
https://github.com/rear/rear/wiki/Coding-Style

I appreciate it!

@jsmeix (Member) commented Apr 30, 2021

I wish you all a relaxed and recovering weekend!

@jsmeix (Member) commented Apr 30, 2021

Some "famous words" before weekend regarding
https://despair.com/products/quality

The race for quality has no finish line - so technically,
it's more like a death march.

Yes, we all will be dead at some time.

But if we made code as intended by
https://github.com/rear/rear/wiki/Coding-Style
our code can evolve (it may not evolve, but it can evolve),
so from our code future code can be derived (this may not happen, but it can happen),
in contrast to code that (almost) nobody understands, which will be completely removed
and replaced by other completely new code from scratch:
code that is hard to understand dies out,
while code that is easy to understand can have descendants.

@gozora (Member) left a comment

From plain looking at the code, it looks good to me.

V.

@jsmeix (Member) commented May 3, 2021

It seems there is a dilemma about where in the code
the "Creating EFI Boot Manager entries..." info message
should be shown.

Current finalize/Linux-i386/670_run_efibootmgr.sh code in this pull request is (excerpts)

LogPrint "Creating EFI Boot Manager entries..."
...
if [ "$esp_mountpoint" = "/" ] ; then
    esp_mountpoint="$TARGET_FS_ROOT/boot/efi"
    Log "Mountpoint of $TARGET_FS_ROOT/$UEFI_BOOTLOADER not found, trying $esp_mountpoint"
fi
...
test -d "$esp_mountpoint" || return 0

so it could tell the user "Creating EFI Boot Manager entries..."
but then exit silently and do nothing.

I think when test -d "$esp_mountpoint" fails it must no longer exit silently,
so a matching message should be added, like

if ! test -d "$esp_mountpoint" ; then
    LogPrint "Skip creating EFI Boot Manager entries (no ESP mountpoint found)"
fi

Or my
https://github.com/rear/rear/pull/2608/files#r622957244
was wrong and nothing should be shown to the user about
Mountpoint of /mnt/local/UEFI_BOOTLOADER not found, trying /mnt/local/boot/efi
and the LogPrint "Creating EFI Boot Manager entries..."
should be moved after the test -d "$esp_mountpoint" || return 0.

@pcahyna (Member, Author) commented May 3, 2021

> I think when test -d "$esp_mountpoint" fails it must no longer exit silently

This part caught my attention as well. Is it even wise to return 0 in this case? At this point we know that a UEFI bootloader should be used, so skipping the rest of the file is almost certainly a serious error.
(Not so much related to the subject of the PR anymore though. For now, I can add the other LogPrint as you suggest.)

@pcahyna (Member, Author) commented May 3, 2021

> For now, I can add the other LogPrint as you suggest

Done (I changed the message a bit.)

@jsmeix (Member) commented May 3, 2021

@pcahyna
regarding your #2608 (comment)

> At this point we know that a UEFI bootloader should be used,
> so skipping the rest of the file is almost certainly a serious error.

Then show it as an error to the user, so he is informed that something went wrong
(e.g. use wording like "failed to ..." or "cannot ..." that tells something went wrong)
and knows he may have to fix something manually after "rear recover".
Leave the script with return 1 (the Source function doesn't abort),
but do not error out of "rear recover", cf. finalize/Linux-i386/660_install_grub2.sh
https://github.com/rear/rear/blob/master/usr/share/rear/finalize/Linux-i386/660_install_grub2.sh
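A minimal sketch of that pattern, reusing the "Failed to create ..." wording from this PR's commit messages (the exact text is illustrative):

if ! test -d "$esp_mountpoint" ; then
    LogPrintError "Failed to create EFI Boot Manager entries (no ESP mountpoint found)"
    return 1    # the script returns non-zero but "rear recover" continues
fi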

pcahyna and others added 7 commits May 5, 2021 13:00
  • The main part of the script gets skipped in this case, which is likely
    a problem and should be reported. Also, we have said before that we are
    "Creating EFI Boot Manager entries...", so we have to report when it
    actually does not happen.

  • get_device_from_partition() now validates the input partition and the
    converted output (disk/device name). 670_run_efibootmgr.sh now checks
    for possible errors from get_device_from_partition() and logs them.

  • Have the LogPrint "Creating EFI Boot Manager entries..." user message
    before the first LogPrintError "Failed to create EFI Boot Manager entries ..."
    cf. rear#2608 (comment)

  • Explain why having no "$TARGET_FS_ROOT/$UEFI_BOOTLOADER"
    conflicts with USING_UEFI_BOOTLOADER being true,
    i.e. we should create EFI Boot Manager entries but we cannot,
    so we report it as LogPrintError to the user.

  • Use LogPrintError "... UEFI bootloader '$UEFI_BOOTLOADER' not found..."
    (i.e. with single quotes) because UEFI_BOOTLOADER might be empty and
    then this is visible in the message as "... UEFI bootloader '' not found..."
@pcahyna pcahyna force-pushed the fix-uefi-on-raid branch from 1138b53 to 3472863 on May 5, 2021 11:03
@pcahyna (Member, Author) commented May 5, 2021

@jsmeix rebase done.

@jsmeix (Member) commented May 5, 2021

@pcahyna
thank you.

I will merge it soon today.

Leave 940_grub2_rescue.sh at the end with explicit "return 1" in case of
LogPrintError "efibootmgr failed to create EFI Boot Manager entry ..."
cf. same kind of code in finalize/Linux-i386/660_install_grub2.sh
@pcahyna (Member, Author) commented May 5, 2021

Oops, I accidentally made the file executable in 27de4cd. @jsmeix, can you please revert the mode change, or should I submit a new PR reverting it?

@jsmeix (Member) commented May 7, 2021

@pcahyna regarding the executable finalize/Linux-i386/670_run_efibootmgr.sh:
you could simply submit it with mode -rw-r--r-- as a "by the way" addon
via your next PR, without mentioning it in a commit message
(but you can mention it in a separate PR comment).

jsmeix added a commit that referenced this pull request May 10, 2021
Added TODO comment:
When the ESP is located on MD RAID we need to determine the physical RAID components
and call efibootmgr on each of them, cf. #2608
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna pushed a commit to pcahyna/rear that referenced this pull request Feb 17, 2023
Have the LogPrint "Creating EFI Boot Manager entries..." user message
before the first LogPrintError "Failed to create EFI Boot Manager entries ..."
cf. rear#2608 (comment)
Labels: bug, cleanup, enhancement, fixed / solved / done