
Conversation

@pcahyna (Member) commented Apr 28, 2021

Pull Request Details:
  • Type: Bug Fix

  • Impact: High

  • Reference to related issue (URL): Can't restore EFI system with Software Raid #2595

  • How was this pull request tested?
    Backup & restore on a RHEL 8 host with UEFI but without RAID and on a CentOS 7 host with UEFI and RAID (ESP on RAID). Examination of logs to check that correct efibootmgr commands were executed.

  • Brief description of the changes in this pull request:

Use dependencies when determining partitions for UEFI

The ESP may be located on a RAID device. In this case, we need to determine the physical RAID components and call efibootmgr on them (after restoring - during the finalize step), because efibootmgr does not know how to tell the firmware to boot from RAID. Moreover, the current code tries to parse a device like /dev/md125 into a disk called /dev/md and partition 125.

Fix by using storage dependencies: RAID components (the actual ESP copies) can be found as dependencies of the ESP block device. For consistency, use dependencies also to locate the ESP block device itself (the old method is preserved as a fallback). A similar strategy is used by the GRUB2 installation code (finalize/Linux-i386/660_install_grub2.sh), which is able to install GRUB2 on the RAID components correctly.
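For illustration only (this is not the actual ReaR code, and the /boot/efi mountpoint is just an example), the RAID components of an ESP could be discovered with standard tools roughly like this:

# Find the block device providing the ESP mountpoint, e.g. /dev/md125:
esp_device=$( findmnt -no SOURCE /boot/efi )

# When the ESP sits on an MD RAID device, its physical components
# (the actual ESP copies) appear as "slaves" of the device in sysfs:
if [[ $esp_device == /dev/md* ]] ; then
    for slave in /sys/class/block/${esp_device#/dev/}/slaves/* ; do
        echo "/dev/${slave##*/}"    # e.g. /dev/sda1 and /dev/sdb1
    done
else
    echo "$esp_device"
fi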

In addition, clean up usr/share/rear/finalize/Linux-i386/670_run_efibootmgr.sh a bit and add more logging to ease analysis when something goes wrong in this area.

TODO: finalize/Linux-i386/610_EFISTUB_run_efibootmgr.sh needs a similar fix.

Fixes #2595

@gdha gdha requested review from gozora and jsmeix April 29, 2021 08:22
@gdha gdha added the "enhancement" (Adaptions and new features) label Apr 29, 2021
(Diff excerpt: the end of get_device_from_partition, which strips the partition
number and, for partition names like /dev/nvme0n1p3, a trailing "p" separator:)

device=${device%p}
fi

echo $device
@jsmeix (Member) commented Apr 29, 2021

What if $device is somehow invalid or empty?
Could some checking be implemented?
Currently get_device_from_partition is called only once,
so the caller code could be enhanced to do some error handling
if get_device_from_partition returns a non-zero exit value,
or if its output is empty (if that indicates an error).
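A hypothetical sketch of such caller-side error handling (the variable names here are illustrative, not taken from the actual caller code):

disk=$( get_device_from_partition "$partition_block_device" )
if test $? -ne 0 -o -z "$disk" ; then
    LogPrintError "Cannot determine the disk for partition '$partition_block_device'"
    return 1
fi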

@pcahyna (Member, Author)

@jsmeix

$device can't be empty here, it has been created in this function, unless the caller has provided an empty $1. Should we protect against that?

Concerning validity, we could check if the device actually exists and is a block device.

@jsmeix (Member)

You may catch empty $1 after partition_block_device=$1 with something like

contains_visible_char "$partition_block_device" || BugError "Empty string passed to get_device_from_partition"

or probably even better

test -b "$partition_block_device" || BugError "get_device_from_partition called with '$partition_block_device' that is no block device"

which also catches when $1 is empty or only blanks and it shows that in the error message like

get_device_from_partition called with '' that is no block device

or

get_device_from_partition called with ' ' that is no block device

I tested that test -b behaves well also for symlinks
(at least on my openSUSE Leap 15.2 system):

# ls -l /dev/disk/by-id/wwn-0x5000039462b83c55-part1 | grep sda1
lrwxrwxrwx 1 root root 10 Apr 30 07:36 /dev/disk/by-id/wwn-0x5000039462b83c55-part1 -> ../../sda1

# test -b /dev/disk/by-id/wwn-0x5000039462b83c55-part1 && echo y || echo n
y

@jsmeix (Member) commented Apr 30, 2021

Additionally, the current "string munging" implementation
may result in a $device string that is not a block device,
so something like

test -b "$device" && echo $device

(test -b "$device" returns 1 when it is not a block device)
as the last command in the function
makes the function return 1 in case of errors,
so that its caller can do whatever is best in the caller's environment.

Note that quoting is crucial here because

# test -b qqq && echo y || echo n
n

# test -b     && echo y || echo n
y

# test -b " " && echo y || echo n
n

i.e. test -b returns 0 when its unquoted argument is empty or only blanks.
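Putting both suggested checks together, the function could end up roughly like this sketch (the existing "string munging" itself is elided):

function get_device_from_partition () {
    local partition_block_device=$1
    # Fail loudly when the caller passed something that is no block device
    # (this also catches an empty or blanks-only $1):
    test -b "$partition_block_device" || BugError "get_device_from_partition called with '$partition_block_device' that is no block device"
    local device
    # ... the existing "string munging" that derives $device ...
    # Return non-zero when the munging did not result in a block device:
    test -b "$device" && echo $device
}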

@pcahyna (Member, Author)

Concerning invalid block devices, how does disk mapping work when block devices in the original system are mapped to different names in the rescue system? Are the layout and mapping regenerated to refer to the mapped names, or do they contain the original names, so that one has to map them after resolving dependencies?
If the latter, it will have an impact on the desirability of validation in this function, as it will be legitimate to handle device names that do not correspond to actually existing devices.

@jsmeix (Member)

Weekend is near so only a quick reply off the top of my head:

When things fail with errors because of disk mappings,
that is still better than silently proceeding with wrong data.
I prefer user reports where things fail with clear error messages
over user reports when things fail in weird inexplicable ways,
cf. "Try hard to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style

In migration mode "all needed files" (whatever that exactly means in practice ;-)
should be adapted during "rear recover".

The starting point is layout/prepare/default/300_map_disks.sh and
the actual mapping happens via layout/prepare/default/320_apply_mappings.sh
which contains

for file_to_migrate in "$LAYOUT_FILE" "$original_disk_space_usage_file" "$rescue_config_file" ; do
...
    ... apply_layout_mappings "$file_to_migrate" ...

so as far as it looks, the mappings are only applied to
disklayout.conf layout/config/df.txt /etc/rear/rescue.conf

I assume those three files are usually sufficient
but I don't know for sure, in particular not for special cases.

@jsmeix (Member)

Searching the ReaR scripts for apply_layout_mappings
(that is the function that does the actual work)
finds that apply_layout_mappings is also called during "rear recover" in
finalize/GNU/Linux/250_migrate_disk_devices_layout.sh
and
finalize/GNU/Linux/260_rename_diskbyid.sh
where a lot more files (in particular restored bootloader config files)
may also get adapted.

I am not a sufficient UEFI booting expert to know if there are missing
efibootmgr and/or EFI bootloader related files that would need to be adapted.

As far as I understand the code in finalize/Linux-i386/670_run_efibootmgr.sh,
it seems disk mapping cannot affect how efibootmgr is called there,
because that script works directly on the actual state in the
currently running recovery system: everything is based on where the ESP
is mounted in the currently running recovery system,
so nothing possibly outdated from how things were on the original system is used.
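For illustration, the "actual state" approach amounts to something like this (a paraphrase, not the verbatim script code):

# Determine where the filesystem that contains the restored UEFI bootloader
# file is mounted in the currently running recovery system, i.e. the ESP:
esp_mountpoint=$( df -P "$TARGET_FS_ROOT/$UEFI_BOOTLOADER" | awk 'END { print $6 }' )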

@pcahyna (Member, Author)

> As far as I understand the code in finalize/Linux-i386/670_run_efibootmgr.sh,
> it seems disk mapping cannot affect how efibootmgr is called there,
> because that script works directly on the actual state in the
> currently running recovery system: everything is based on where the ESP
> is mounted in the currently running recovery system

After this proposed change, the code will use disk mapping (and not just the actual state). But, thanks to apply_layout_mappings being called earlier, the disk mappings should be in sync with the current disk names, so there should be no problem.

I am a bit surprised, though, that apply_layout_mappings gets called multiple times.

@jsmeix (Member)

Regarding "something is done multiple times" and similar things in ReaR:
Welcome to surprising adventures with grown code over many years by various contributors ;-)

I think ReaR is a perfect bazaar as in
https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar

@pcahyna (Member, Author)

I added both checks that you suggested (for get_device_from_partition input and result).

@jsmeix (Member) left a comment

From plain looking at the changes they look good to me
so I can approve it "bona fide" - in particular because things
are sufficiently commented so I think I even understand it a bit ;-)

@jsmeix jsmeix self-assigned this Apr 29, 2021
@jsmeix jsmeix added this to the ReaR v2.7 milestone Apr 29, 2021
@jsmeix jsmeix requested a review from gdha April 29, 2021 11:56
@pcahyna (Member, Author) commented Apr 29, 2021

@gdha hello, should it have the enhancement label, when it Fixes #2595, which has the bug label?

@pcahyna (Member, Author) commented Apr 29, 2021

@rmetrich hello, please review, as you reported the problem.

@rmetrich (Contributor)

Hi @pcahyna, I'm all good with your code, thanks a lot for the effort.

@jsmeix jsmeix added the "bug" (The code does not do what it is meant to do) label Apr 30, 2021
@jsmeix (Member) commented Apr 30, 2021

I agree that ReaR is expected to restore a software RAID also on UEFI systems
so the issue is a missing expected functionality a.k.a. "bug".

On the other hand there are zillions of other "reasonably expected" functionalities
that are currently missing in ReaR, so such issues are also enhancements,
in particular when "reasonably expected" code for that is just not there.

For me a plain "bug" is only when code that is there does not work as intended.

I.e. it is also not a plain "bug" when code that is there does work as intended
but the current intention is not what others "reasonably expect" so such code
is "buggy" and needs some enhancement to behave as "reasonably expected".

@jsmeix jsmeix added the cleanup label Apr 30, 2021
@jsmeix (Member) commented Apr 30, 2021

@pcahyna
thank you so much for your additional cleanup work like your recent
8228246

I do very much appreciate it!
It helps so much to get ReaR's code step by step into a better state
where others can easily understand WHY and WHAT and HOW things are done in ReaR
so that others can easily adapt and enhance ReaR in the future.

@pcahyna (Member, Author) commented Apr 30, 2021

@jsmeix, glad you like it! You know,
https://despair.com/products/quality

I can also replace the non-standard case identifiers like Dev and ParNr by something that conforms to the style guide, if you think that it makes sense to do it in the same PR.

@jsmeix (Member) commented Apr 30, 2021

@pcahyna
yes, feel free to mercilessly replace old non-standard identifiers with better names,
in particular, if needed, with more explanatory names,
cf. "Code must be easy to read" in
https://github.com/rear/rear/wiki/Coding-Style

I appreciate it!

@jsmeix (Member) commented Apr 30, 2021

I wish you all a relaxed and recovering weekend!

@jsmeix (Member) commented Apr 30, 2021

Some "famous words" before weekend regarding
https://despair.com/products/quality

The race for quality has no finish line - so technically,
it's more like a death march.

Yes, we all will be dead at some time.

But if we made code as intended by
https://github.com/rear/rear/wiki/Coding-Style
our code can evolve (it may not evolve, but it can evolve),
so from our code future code can be derived (this may not happen, but it can happen),
in contrast to code that (almost) nobody understands, which will be completely removed
and replaced by other completely new code from scratch:
code that is hard to understand dies out,
while code that is easy to understand can have descendants.

@gozora (Member) left a comment

From plain looking at the code, it looks good to me.

V.

@jsmeix (Member) commented May 3, 2021

It seems there is a dilemma about where in the code
the "Creating EFI Boot Manager entries..." info message
should be shown.

Current finalize/Linux-i386/670_run_efibootmgr.sh code in this pull request is (excerpts)

LogPrint "Creating EFI Boot Manager entries..."
...
if [ "$esp_mountpoint" = "/" ] ; then
    esp_mountpoint="$TARGET_FS_ROOT/boot/efi"
    Log "Mountpoint of $TARGET_FS_ROOT/$UEFI_BOOTLOADER not found, trying $esp_mountpoint"
fi
...
test -d "$esp_mountpoint" || return 0

so it could tell the user "Creating EFI Boot Manager entries..."
but then exit silently and do nothing.

I think when test -d "$esp_mountpoint" fails it must no longer exit silently,
so a matching message should be added, like

if ! test -d "$esp_mountpoint" ; then
    LogPrint "Skip creating EFI Boot Manager entries (no ESP mountpoint found)"
fi

Or my
https://github.com/rear/rear/pull/2608/files#r622957244
was wrong and nothing should be shown to the user about
Mountpoint of /mnt/local/UEFI_BOOTLOADER not found, trying /mnt/local/boot/efi
and the LogPrint "Creating EFI Boot Manager entries..."
should be moved after the test -d "$esp_mountpoint" || return 0.

@pcahyna (Member, Author) commented May 3, 2021

> I think when test -d "$esp_mountpoint" fails it must no longer exit silently

This part caught my attention as well. Is it even wise to return 0 in this case? At this point we know that a UEFI bootloader should be used, so skipping the rest of the file is almost certainly a serious error.
(Not so much related to the subject of the PR anymore though. For now, I can add the other LogPrint as you suggest.)

@pcahyna (Member, Author) commented May 3, 2021

> For now, I can add the other LogPrint as you suggest

Done (I changed the message a bit.)

@jsmeix (Member) commented May 3, 2021

@pcahyna
regarding your #2608 (comment)

> At this point we know that a UEFI bootloader should be used,
> so skipping the rest of the file is almost certainly a serious error.

Then show it as an error to the user, so he is informed that something went wrong
(e.g. use wording like "failed to ..." or "cannot ..." that tells something went wrong)
and knows he may have to fix something manually after "rear recover".
Leave the script with return 1 (the Source function doesn't abort),
but do not error out of "rear recover", cf. finalize/Linux-i386/660_install_grub2.sh
https://github.com/rear/rear/blob/master/usr/share/rear/finalize/Linux-i386/660_install_grub2.sh
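A minimal sketch of that pattern, reusing the "Failed to create ..." wording from this PR's commit messages (the exact text is illustrative):

if ! test -d "$esp_mountpoint" ; then
    LogPrintError "Failed to create EFI Boot Manager entries (no ESP mountpoint found)"
    return 1    # the script returns non-zero but "rear recover" continues
fi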

pcahyna and others added 7 commits May 5, 2021 13:00
  • The main part of the script gets skipped in this case, which is likely
    a problem and should be reported. Also, we have said before that we are
    "Creating EFI Boot Manager entries...", so we have to report when it
    actually does not happen.

  • get_device_from_partition() now validates the input partition and the
    converted output (disk/device name). 670_run_efibootmgr.sh now checks
    for possible errors from get_device_from_partition() and logs them.

  • Have the LogPrint "Creating EFI Boot Manager entries..." user message
    before the first LogPrintError "Failed to create EFI Boot Manager entries ..."
    cf. rear#2608 (comment)

  • Explain why having no "$TARGET_FS_ROOT/$UEFI_BOOTLOADER"
    conflicts with USING_UEFI_BOOTLOADER being true,
    i.e. we should create EFI Boot Manager entries but we cannot,
    so we report it as LogPrintError to the user.

  • Use LogPrintError "... UEFI bootloader '$UEFI_BOOTLOADER' not found..."
    (i.e. with single quotes) because UEFI_BOOTLOADER might be empty and
    then this is visible in the message as "... UEFI bootloader '' not found..."
@pcahyna pcahyna force-pushed the fix-uefi-on-raid branch from 1138b53 to 3472863 on May 5, 2021 11:03
@pcahyna (Member, Author) commented May 5, 2021

@jsmeix rebase done.

@jsmeix (Member) commented May 5, 2021

@pcahyna
thank you.

I will merge it soon today.

Leave 940_grub2_rescue.sh at the end with explicit "return 1" in case of
LogPrintError "efibootmgr failed to create EFI Boot Manager entry ..."
cf. same kind of code in finalize/Linux-i386/660_install_grub2.sh
@pcahyna (Member, Author) commented May 5, 2021

Oops, I accidentally made the file executable in 27de4cd. @jsmeix, can you please revert the mode change, or should I submit a new PR reverting it?

@jsmeix (Member) commented May 7, 2021

@pcahyna regarding the executable finalize/Linux-i386/670_run_efibootmgr.sh:
you could simply submit it with mode -rw-r--r-- as a "by the way" addon
via your next PR, without mentioning it in a commit message
(but you can mention it in a separate PR comment).

jsmeix added a commit that referenced this pull request May 10, 2021
Added TODO comment:
When the ESP is located on MD RAID we need to determine the physical RAID components
and call efibootmgr on each of them, cf. #2608
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna added a commit to pcahyna/rear that referenced this pull request Jun 18, 2021
pcahyna pushed a commit to pcahyna/rear that referenced this pull request Feb 17, 2023
Have the LogPrint "Creating EFI Boot Manager entries..." user message
before the first LogPrintError "Failed to create EFI Boot Manager entries ..."
cf. rear#2608 (comment)
Labels: bug, cleanup, enhancement, fixed / solved / done