Skip to content

Conversation

@maflcko
Copy link
Member

@maflcko maflcko commented Jul 17, 2023

Currently the CI containers always run on the host architecture, and only wrap bitcoind into qemu-user when needed. This has many issues:

Fix all issues by:

  • removing HOST from ci/test/00_setup_env.sh.
  • removing QEMU_USER_CMD and ci/test/wrap-qemu.sh.
  • removing DPKG_ADD_ARCH where possible, and pruning PACKAGES where possible.
  • specifying the architecture in CI_IMAGE_NAME_TAG to be used by the container engine.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jul 17, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK fanquake
Concept ACK willcl-ark

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #28210 (build: Bump minimum supported Clang to clang-13 by MarcoFalke)
  • #28185 (ci: Use hard-coded root path for CI containers (bugfix) by MarcoFalke)
  • #27976 (ci: Start with clean env by MarcoFalke)
  • #21652 (ci: Switch more tasks to self-hosted by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@maflcko
Copy link
Member Author

maflcko commented Jul 17, 2023

If it doesn't work locally, you can check the enabled architectures via ls /proc/sys/fs/binfmt_misc/

@maflcko
Copy link
Member Author

maflcko commented Jul 17, 2023

Looks like it is not set up on Cirrus CI: https://cirrus-ci.com/task/6369879847075840?logs=build#L42 ?

@maflcko
Copy link
Member Author

maflcko commented Jul 17, 2023

I may try using a full CI VM next.

@maflcko maflcko marked this pull request as draft July 19, 2023 11:51
@maflcko maflcko marked this pull request as ready for review July 24, 2023 07:56
@maflcko
Copy link
Member Author

maflcko commented Jul 24, 2023

The arm task is now run by @DrahtBot, which should make it green

@maflcko
Copy link
Member Author

maflcko commented Jul 25, 2023

Looks like calling the armhf (cross) compiler (which happens often) via qemu-aarch64 on amd64 is too much overhead and times out? I'll try to run on aarch64 metal directly, which should avoid the extra hop through qemu in the CI.

@maflcko maflcko force-pushed the 2307-ci-qemu- branch 3 times, most recently from e098fe0 to fa4096c Compare July 25, 2023 09:29
@maflcko
Copy link
Member Author

maflcko commented Jul 25, 2023

Looks like CI is now green :)

Also the test failure from #27529 (comment) can now be tested on this pull with just:

MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_s390x.sh" ./ci/test_run_all.sh

Output:

 test  2023-07-25T15:37:42.430000Z TestFramework (ERROR): Unexpected exception caught during testing 
                                   Traceback (most recent call last):
                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/test_framework/test_framework.py", line 131, in main
                                       self.run_test()
                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/feature_addrman.py", line 68, in run_test
                                       self.start_node(0, extra_args=["-checkaddrman=1"])
                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/test_framework/test_framework.py", line 537, in start_node
                                       node.wait_for_rpc_connection()
                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/test_framework/test_node.py", line 235, in wait_for_rpc_connection
                                       raise FailedToStartError(self._node_msg(
                                   test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization
 test  2023-07-25T15:37:42.466000Z TestFramework (DEBUG): Closing down network thread 
 test  2023-07-25T15:37:42.518000Z TestFramework (INFO): Stopping nodes 
 test  2023-07-25T15:37:42.519000Z TestFramework.node0 (DEBUG): Stopping node 

 node0 stderr Error: Invalid or corrupt peers.dat (AutoFile::read: end of file: iostream error). If you believe this is a bug, please report it to https://github.com/bitcoin/bitcoin/issues. As a workaround, you can move the file ("/root/b-c-ci/ci/scratch/test_runner/test_runner_₿_🏃_20230725_152500/feature_addrman_40/node0/regtest/peers.dat") out of the way (rename, move, or delete) to have a new one created on the next start. 

@willcl-ark
Copy link
Member

I don't think I can give this much competent review I'm afraid. Partly because I can't get it working locally, and partly because I don't feel like I understand the changes fully yet.

I did try to run MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_s390x.sh" ./ci/test_run_all.sh, but it takes more than 2 hours to run (amd64 local arch), and I keep getting (possibly) unrelated errors like this: 390x_build_error.txt

I do have some questions for my own understanding though...

Am I right in understanding that in the new setup, docker is using qemu and binfmt-misc from the container engines' kernel to run (emulate) e.g. the "docker.io/s390x/debian:bookworm" image, and that this is roughly the same (but simpler to configure from our side) than running an x86 image, manually installing qemu, and then wrapping our various binaries in it. But it also achieves "The python tests are run on the host architecture, making it harder to find architecture specific bugs"?

@maflcko
Copy link
Member Author

maflcko commented Jul 27, 2023

but it takes more than 2 hours to run

Yes, this is expected. Instead of just running bitcoind through qemu, now the whole container is run through qemu.

unrelated errors

Jup, this should be unrelated, I presume it also happens on current master. You may be able to fix it by upgrading your qemu-s390x --version. IIRC qemu 7.2 or later is required.

this is roughly the same (but simpler to configure from our side) than running an x86 image, manually installing qemu, and then wrapping our various binaries in it. But it also achieves "The python tests are run on the host architecture, making it harder to find architecture specific bugs"?

Yeah, it requires a minimal setup procedure on your container engine and it may run slower at runtime. However, the result is otherwise,

  • simpler, less code;
  • more flexible, supporting a guest and host of any architecture combination; (:rocket:)
  • more efficient in the happy case by default, because the container engine will detect if the guest and host architecture matches, and sidestep the expensive qemu.
  • more complete, detecting more bugs, as all executables in the container are run in qemu. (:rocket:)

(I've added rocket emojis to the main features/bugfixes I am interested in)

@willcl-ark
Copy link
Member

Thanks Marco, thats a very helpful explanation (and I nice changes for us!).

I will give it another go this afternoon after trying to update qemu.

@willcl-ark
Copy link
Member

@MarcoFalke

Yes, this is expected. Instead of just running bitcoind through qemu, now the whole container is run through qemu.

Ok running this again... to setup and build Depends still takes more than an hour for me (with it taking 56 mins to progress past building qt) which seems relatively annoying. Am I correct that this build will use a depends cache on actual CI though, which should speed it up there significantly?

@maflcko
Copy link
Member Author

maflcko commented Jul 31, 2023

Depends still takes more than an hour for me

Yes, there seems to be quite an overhead in calling qemu for each configure check, and each build file. This is really the cost that is paid in this pull, but I think it is worth it the benefits.

Am I correct that this build will use a depends cache on actual CI though, which should speed it up there significantly?

CI will run on aarch64 metal, so qemu will never be called. (depends is also cached, but this is happening regardless).

@willcl-ark
Copy link
Member

Unfortunately that build broke again for me for another seemingly unrealted reason. I have the correct version of qemu this time for sure:

will@ubuntu in ~
$ qemu-s390x --version
qemu-s390x version 7.2.0 (Debian 1:7.2+dfsg-5ubuntu2.2)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers

I don't think I can test this on my side, but I am a concept ACK on these changes, and they look correct to me.

@maflcko
Copy link
Member Author

maflcko commented Jul 31, 2023

Feel free to create an issue with the unrelated stuff you ran into. Happy to take a look. Personally, I spin up a fresh VM (in a cloud or locally) to do testing, so I likely miss most CI UX issues.

@maflcko
Copy link
Member Author

maflcko commented Aug 4, 2023

Rebased, and added one more line of documentation

@maflcko
Copy link
Member Author

maflcko commented Aug 7, 2023

rebased 🥲

Copy link
Member

@fanquake fanquake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK fad0b67 - this seems ok to me, and removes complexity from our CI system.

I've locally tested the ENVs changed here, and seen the associated slowdown / resource consumption. Had GCC segfault more than once when cross-compiling s390x.

@fanquake fanquake merged commit 4922570 into bitcoin:master Aug 9, 2023
@maflcko maflcko deleted the 2307-ci-qemu- branch August 9, 2023 10:28
@maflcko
Copy link
Member Author

maflcko commented Aug 9, 2023

Had GCC segfault more than once when cross-compiling s390x.

I think this can be fixed by setting MAKEJOBS="-j1" FILE_ENV=... (or a higher value, depending on your memory size)

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Aug 9, 2023
@maflcko
Copy link
Member Author

maflcko commented Aug 17, 2023

Had GCC segfault more than once when cross-compiling s390x.

I also ran into this on aarch64, and I presume it is just the CPU overheating (or some unrelated bug).

@maflcko
Copy link
Member Author

maflcko commented Aug 27, 2023

Had GCC segfault more than once when cross-compiling s390x.

I also ran into this on aarch64, and I presume it is just the CPU overheating (or some unrelated bug).

It seems it only happens with qemu-s390x on aarch64, and not with e.g. qemu-riscv64 on the same aarch64. So it seems a bug with qemu-s390x.

@bitcoin bitcoin locked and limited conversation to collaborators Dec 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants