
fe310: Fix potential deadlock in thread_yield_higher#15277

Closed
nmeum wants to merge 2 commits into RIOT-OS:master from nmeum:pr/fe310_sched_deadlock

Conversation

nmeum (Member) commented Oct 22, 2020

Contribution description

The software interrupt may be raised before the WFI instruction is reached. If it is, and no other interrupt is raised afterwards, the thread deadlocks because it never wakes from the WFI.

I noticed this bug with a custom RISC-V simulator; I cannot reproduce it on real hardware, though. However, even on real hardware the WFI instruction may be implemented as a NOP, so the current code is (in my opinion) suboptimal either way. The change proposed here simply busy-waits until the interrupt handler for the software interrupt is invoked (which clears CLINT_MSIP); since this should only take a few cycles, it shouldn't have a big impact on power consumption.

Testing procedure

I encountered this bug with the following application. However, since this is a concurrency issue, which I personally cannot reproduce on real hardware (the HiFive1), it's probably difficult to test this PR using the application. I would instead suggest reasoning based on my observations and the suggested changes.

#include <stdio.h>
#include <stdlib.h>
#include "xtimer.h"
#include "timex.h"

int main(void)
{
    puts("Sleeping for 5 seconds...");
    xtimer_sleep(5);
    puts("Sleep done!");
    return 0;
}

Due to the bug in thread_yield_higher, the mutex (used internally by xtimer) would deadlock here:

RIOT/core/mutex.c

Lines 61 to 65 in 57264c5

irq_restore(irqstate);
thread_yield_higher();
/* We were woken up by scheduler. Waker removed us from queue.
* We have the mutex now. */
return 1;

I think this example also illustrates why the current implementation of thread_yield_higher might be a bad idea, given that WFI might be implemented as a NOP.

Issues/PRs references

I have only tested this on the HiFive1, not the HiFive1b. Maybe this could explain some of the test failures on the HiFive1b (xtimer failures in #13086), where timing might differ from the HiFive1 in terms of interrupt delivery? Just a wild guess...

nmeum added 2 commits October 22, 2020 09:10
Makes the code a bit more readable.
The software interrupt may be raised before the WFI instruction is
reached. If it is, and no other interrupt is raised afterwards, the
thread will deadlock as it will never awake from the WFI.

I noticed this bug with a custom RISC-V simulator and cannot reproduce
it on real hardware though. However, even on real hardware the WFI
instruction may be implemented as a NOP, thus the current code is
suboptimal either way. The change proposed here simply busy waits until
the interrupt handler for the software interrupt is invoked (which sets
CLINT_MSIP), since this should only take a few cycles it shouldn't have
a big impact on power consumption.
@benpicco benpicco added Platform: RISC-V Platform: This PR/issue effects RISC-V-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Oct 22, 2020
@benpicco benpicco added this to the Release 2021.01 milestone Jan 7, 2021
@benpicco benpicco requested a review from maribu January 7, 2021 11:17
aabadie (Contributor) commented Jan 7, 2021

Just tested this PR on hifive1b and it seems to work.

To give some numbers, I ran tests/bench_thread_flags_pingpong and tests/bench_thread_yield_pingpong:

test                        | master                               | PR
bench_thread_yield_pingpong | { "result" : 665232, "ticks" : 481 } | { "result" : 618917, "ticks" : 517 }
bench_thread_flags_pingpong | { "result" : 551203, "ticks" : 580 } | { "result" : 516082, "ticks" : 620 }

So apparently we have a ~6-7% performance drop.

maribu (Member) commented Jan 7, 2021

Note that the implementation right now is not a generic RISC-V implementation, but specific to the fe310. So far we can rightfully assume the exact behavior of this MCU (including the timing). Hence, WFI cannot be a NOP and will be reached before the soft IRQ is triggered. So IMO there is no bug in the code.

That said, obviously we want to support more RISC-V MCUs in the future. Getting rid of code relying on the implementation details of a specific MCU would certainly be a step in the right direction.

Let me do an elf_diff to see if we can do something about the performance impact.

maribu (Member) commented Jan 7, 2021

The machine code generated by this PR looks like this:

Disassembly of section .text.thread_yield_higher:

00000000 <thread_yield_higher>:
   0:   020007b7                lui     a5,0x2000
   4:   4705                    li      a4,1
   6:   c398                    sw      a4,0(a5)
   8:   020006b7                lui     a3,0x2000
   c:   4785                    li      a5,1

0000000e <.L12>:
   e:   4298                    lw      a4,0(a3)
  10:   fef70fe3                beq     a4,a5,e <.L12>
  14:   8082                    ret

That is suboptimal because the address of CLINT_MSIP is first loaded into a5 and later again into a3 - why GCC doesn't reuse a5 for the subsequent load is a mystery to me. Similarly, the immediate 1 is stored in a4 and later again in a5 - again, the register could just be reused. Dropping the duplicated code saves 6 B of ROM:

Disassembly of section .text.thread_yield_higher:

00000000 <thread_yield_higher>:
   0:   4785                    li      a5,1
   2:   02000737                lui     a4,0x2000
   6:   c31c                    sw      a5,0(a4)

00000008 <loop_until_softirq_handled>:
   8:   4314                    lw      a3,0(a4)
   a:   fef68fe3                beq     a3,a5,8 <loop_until_softirq_handled>
   e:   8082                    ret
patch of thread_yield_higher() to save 6 B of ROM
commit 40117a082b354961ed9088e477963e8fce35ed74 (HEAD -> fe310)
Author: Marian Buschsieweke <[email protected]>
Date:   Thu Jan 7 17:10:31 2021 +0100

    cpu/fe310: optimize thread_yield_higher()

diff --git a/cpu/fe310/thread_arch.c b/cpu/fe310/thread_arch.c
index ed5687a49f..e12fe76efb 100644
--- a/cpu/fe310/thread_arch.c
+++ b/cpu/fe310/thread_arch.c
@@ -180,13 +180,32 @@ void cpu_switch_context_exit(void)
 
 void thread_yield_higher(void)
 {
-    /* Use SW intr to schedule context switch */
-    CLINT_REG(CLINT_MSIP) = 1;
-
-    /* Latency of SW intr can be a few cycles; wait for the SW intr.
+    /* Let the compiler do register allocation via a fake output */
+    uint32_t tmp;
+    uint32_t one = 1;
+    uintptr_t clint_msip = CLINT_CTRL_ADDR + CLINT_MSIP;
+    /* The assembly below implements:
+     *
+     *     *((volatile uint32_t *)clint_msip) = 1;
+     *     while (*((volatile uint32_t *)clint_msip) == 1) { }
+     *
+     * One would assume that the compiler is able to implement this efficiently,
+     * but alas, this is not the case. The inline assembly version saves 6 B.
+     *
+     * Latency of SW intr can be a few cycles; wait for the SW intr.
      * We cannot use WFI here as the SW intr may be delivered before we
-     * reach the WFI instruction, thereby causing a thread deadlock. */
-    while (CLINT_REG(CLINT_MSIP) == 1) {};
+     * reach the WFI instruction, thereby causing a thread deadlock.
+     */
+    __asm__ volatile (
+        "sw %[one], 0(%[clint_msip])"                                       "\n"
+        "loop_until_softirq_handled:"                                       "\n"
+        "lw %[tmp], 0(%[clint_msip])"                                       "\n"
+        "beq %[tmp], %[one], loop_until_softirq_handled"                    "\n"
+        : [tmp]         "=&r"(tmp)
+        : [one]         "r"(one),
+          [clint_msip]  "r"(clint_msip)
+        : "memory"
+    );
 }
 
 /**

But I would be surprised if two CPU instructions less would do much about the 6-7% performance regression.

maribu (Member) commented Jan 7, 2021

Note: Please don't apply the patch above. My local GCC generates proper code for the C code. So I guess we should just get rid of the legacy riscv-none-embed toolchain in the Docker container and enjoy better code generation.

Edit: My local toolchain also generates inefficient machine code after all; I had checked out the wrong branch by accident.

nmeum (Member, Author) commented Jan 7, 2021

Hence, WFI cannot be a nop and will be reached before the soft IRQ is triggered. So IMO there is no bug in the code.

You are still making assumptions about how many cycles it takes until the WFI instruction is reached, and about whether the interrupt is raised before then - which also depends on the code generated by the compiler, et cetera.

maribu (Member) commented Jan 7, 2021

You are still making assumptions about how many cycles it takes until the WFI instruction is reached, and about whether the interrupt is raised before then - which also depends on the code generated by the compiler, et cetera.

The function right now contains a plain volatile store followed by the WFI instruction. It is perfectly valid to make some reasonable assumptions about this.

@maribu maribu removed the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label Jan 7, 2021
nmeum (Member, Author) commented Jan 8, 2021

I don't think this is actually that easy to reason about. For one thing, what makes you certain that an external interrupt is never raised between the store and the wfi?

Regardless of whether this is classified as a "bug" or not: I would personally be interested in getting this fixed, as it prevents vanilla RIOT from running on any HiFive1 simulator that doesn't have a cycle-accurate timing model.

If this is just about "saving ticks": you could also rewrite the RISC-V context-switch implementation to use a synchronous ECALL instead of an asynchronous software interrupt, thereby eliminating the need for the busy wait. This is the technique employed by Zephyr for context switching on RISC-V. I don't really see the point in optimizing the code executed before the busy wait, as you suggested before.

maribu (Member) commented Jan 8, 2021

I opened a bug report in GCC regarding the additional 6 bytes of ROM usage due to not reusing the registers. I think this should eventually be fixed.

maribu (Member) left a review comment

ACK. It is IMO a good idea not to rely on implementation details of a specific CPU: in the future, we might want to support hardware that behaves differently. Additionally, we have users who would benefit right now from being able to run this code in simulators that are not cycle-accurate in their IRQ handling.

@aabadie, @bergzand: Any second thoughts? I agree with @nmeum that we should consider using ECALL rather than soft IRQs for context switches.

bergzand (Member) commented

@bergzand: Any second thoughts?

Nope.

I agree with @nmeum that we should consider using ECALL rather than soft IRQs for context switches.

I gave this a shot this weekend, see #15736. I don't have hardware available (tested on QEMU), so I don't know the actual performance impact.

@maribu maribu added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Jan 11, 2021
bergzand (Member) commented

@nmeum I'm closing this one as #15736 is now merged. Feel free to reopen if necessary of course.

@bergzand bergzand closed this Jan 15, 2021
