
fe310: Fix potential deadlock in thread_yield_higher#15277

Closed
nmeum wants to merge 2 commits into RIOT-OS:master from nmeum:pr/fe310_sched_deadlock

Conversation

nmeum (Member) commented Oct 22, 2020

Contribution description

The software interrupt may be raised before the WFI instruction is reached. If it is, and no other interrupt is raised afterwards, the thread deadlocks because it never wakes from the WFI.

I noticed this bug with a custom RISC-V simulator; I cannot reproduce it on real hardware, though. However, even on real hardware the WFI instruction may be implemented as a NOP, so the current code is (in my opinion) suboptimal either way. The change proposed here simply busy-waits until the interrupt handler for the software interrupt is invoked (which clears CLINT_MSIP); since this should only take a few cycles, it shouldn't have a big impact on power consumption.

Testing procedure

I encountered this bug with the following application. However, since this is a concurrency issue, which I personally cannot reproduce on real hardware (the HiFive1), it's probably difficult to test this PR using the application. I would instead suggest reasoning based on my observations and the suggested changes.

#include <stdio.h>
#include <stdlib.h>
#include "xtimer.h"
#include "timex.h"

int main(void)
{
    puts("Sleeping for 5 seconds...");
    xtimer_sleep(5);
    puts("Sleep done!");
    return 0;
}

Due to the bug in thread_yield_higher, the mutex (used internally by xtimer) would deadlock here:

RIOT/core/mutex.c

Lines 61 to 65 in 57264c5

irq_restore(irqstate);
thread_yield_higher();
/* We were woken up by scheduler. Waker removed us from queue.
* We have the mutex now. */
return 1;

I think this example also illustrates why the current implementation of thread_yield_higher might be a bad idea, given that WFI might be implemented as a NOP.

Issues/PRs references

I have only tested this on the HiFive1, not the HiFive1b. Maybe this could explain some of the test failures on the HiFive1b (xtimer failures in #13086), where timing might differ from the HiFive1 in terms of interrupt delivery? Just a wild guess...

nmeum added 2 commits October 22, 2020 09:10
Makes the code a bit more readable.
The software interrupt may be raised before the WFI instruction is
reached. If it is, and no other interrupt is raised afterwards, the
thread will deadlock as it will never awake from the WFI.

I noticed this bug with a custom RISC-V simulator and cannot reproduce
it on real hardware though. However, even on real hardware the WFI
instruction may be implemented as a NOP, thus the current code is
suboptimal either way. The change proposed here simply busy waits until
the interrupt handler for the software interrupt is invoked (which sets
CLINT_MSIP), since this should only take a few cycles it shouldn't have
a big impact on power consumption.
@benpicco benpicco added Platform: RISC-V Platform: This PR/issue effects RISC-V-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Oct 22, 2020
@benpicco benpicco added this to the Release 2021.01 milestone Jan 7, 2021
@benpicco benpicco requested a review from maribu January 7, 2021 11:17
aabadie (Contributor) commented Jan 7, 2021

Just tested this PR on hifive1b and it seems to work.

To give some numbers, I ran tests/bench_thread_flags_pingpong and tests/bench_thread_yield_pingpong:

test                        | master                               | PR
bench_thread_yield_pingpong | { "result" : 665232, "ticks" : 481 } | { "result" : 618917, "ticks" : 517 }
bench_thread_flags_pingpong | { "result" : 551203, "ticks" : 580 } | { "result" : 516082, "ticks" : 620 }

So apparently we have a ~6-7% performance drop.

maribu (Member) commented Jan 7, 2021

Note that the implementation right now is not a generic RISC-V implementation, but specific to the fe310. So far we can rightfully assume the exact behavior of this MCU (including the timing). Hence, WFI cannot be a NOP and will be reached before the soft IRQ is triggered. So IMO there is no bug in the code.

That said, obviously we want to support more RISC-V MCUs in the future. Getting rid of code relying on the implementation details of a specific MCU would certainly be a step in the right direction.

Let me do an elf_diff to see if we can do something about the performance impact.

maribu (Member) commented Jan 7, 2021

The machine code generated by this PR looks like this:

Disassembly of section .text.thread_yield_higher:

00000000 <thread_yield_higher>:
   0:   020007b7                lui     a5,0x2000
   4:   4705                    li      a4,1
   6:   c398                    sw      a4,0(a5)
   8:   020006b7                lui     a3,0x2000
   c:   4785                    li      a5,1

0000000e <.L12>:
   e:   4298                    lw      a4,0(a3)
  10:   fef70fe3                beq     a4,a5,e <.L12>
  14:   8082                    ret

That is suboptimal because the address of CLINT_MSIP is first loaded into a5 and later again into a3 - why GCC doesn't reuse a5 for the subsequent load is a mystery to me. Similarly, the immediate 1 is stored in a4 and later again in a5 - again, the register could just be reused. Dropping the duplicated code saves 6 B of ROM:

Disassembly of section .text.thread_yield_higher:

00000000 <thread_yield_higher>:
   0:   4785                    li      a5,1
   2:   02000737                lui     a4,0x2000
   6:   c31c                    sw      a5,0(a4)

00000008 <loop_until_softirq_handled>:
   8:   4314                    lw      a3,0(a4)
   a:   fef68fe3                beq     a3,a5,8 <loop_until_softirq_handled>
   e:   8082                    ret
patch of thread_yield_higher() to save 6 B of ROM
commit 40117a082b354961ed9088e477963e8fce35ed74 (HEAD -> fe310)
Author: Marian Buschsieweke <[email protected]>
Date:   Thu Jan 7 17:10:31 2021 +0100

    cpu/fe310: optimize thread_yield_higher()

diff --git a/cpu/fe310/thread_arch.c b/cpu/fe310/thread_arch.c
index ed5687a49f..e12fe76efb 100644
--- a/cpu/fe310/thread_arch.c
+++ b/cpu/fe310/thread_arch.c
@@ -180,13 +180,32 @@ void cpu_switch_context_exit(void)
 
 void thread_yield_higher(void)
 {
-    /* Use SW intr to schedule context switch */
-    CLINT_REG(CLINT_MSIP) = 1;
-
-    /* Latency of SW intr can be a few cycles; wait for the SW intr.
+    /* Let the compiler do register allocation via a fake output */
+    uint32_t tmp;
+    uint32_t one = 1;
+    uintptr_t clint_msip = CLINT_CTRL_ADDR + CLINT_MSIP;
+    /* The assembly below implements:
+     *
+     *     *((volatile uint32_t *)clint_msip) = 1;
+     *     while (*((volatile uint32_t *)clint_msip) == 1) { }
+     *
+     * One would assume that the compiler is able to implement this efficiently,
+     * but alas, this is not the case. The inline assembly version saves 6 B.
+     *
+     * Latency of SW intr can be a few cycles; wait for the SW intr.
      * We cannot use WFI here as the SW intr may be delivered before we
-     * reach the WFI instruction, thereby causing a thread deadlock. */
-    while (CLINT_REG(CLINT_MSIP) == 1) {};
+     * reach the WFI instruction, thereby causing a thread deadlock.
+     */
+    __asm__ volatile (
+        "sw %[one], 0(%[clint_msip])"                                       "\n"
+        "loop_until_softirq_handled:"                                       "\n"
+        "lw %[tmp], 0(%[clint_msip])"                                       "\n"
+        "beq %[tmp], %[one], loop_until_softirq_handled"                    "\n"
+        : [tmp]         "=&r"(tmp)
+        : [one]         "r"(one),
+          [clint_msip]  "r"(clint_msip)
+        : "memory"
+    );
 }
 
 /**

But I would be surprised if two CPU instructions less would do much about the 6-7% performance regression.

maribu (Member) commented Jan 7, 2021

Note: Please don't apply the patch above. My local GCC generates proper code for the C code. So I guess we should just get rid of the legacy riscv-none-embed toolchain in the Docker container and enjoy better code generation.

Edit: My local toolchain also generates inefficient machine code after all; I had checked out the wrong branch by accident.

nmeum (Member, Author) commented Jan 7, 2021

Hence, WFI cannot be a nop and will be reached before the soft IRQ is triggered. So IMO there is no bug in the code.

You are still making assumptions about how many cycles it takes until the WFI instruction is reached, and about whether the interrupt is raised before then - which also depends on the code generated by the compiler, et cetera.

maribu (Member) commented Jan 7, 2021

You are still making assumptions about how many cycles it takes until the WFI instruction is reached, and about whether the interrupt is raised before then - which also depends on the code generated by the compiler, et cetera.

The function right now contains a plain volatile store followed by the WFI instruction. It is perfectly valid to make some reasonable assumptions about this.

@maribu maribu removed the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label Jan 7, 2021
nmeum (Member, Author) commented Jan 8, 2021

I don't think this is actually that easy to reason about. For one thing, what makes you certain that an external interrupt is never raised between the store and the wfi?

Regardless of whether this is classified as a "bug" or not: I would personally be interested in getting this fixed, as it prevents vanilla RIOT from running on any HiFive1 simulator that doesn't have a cycle-accurate timing model.

If this is just about "saving ticks": you could also rewrite the RISC-V context-switch implementation to use a synchronous ECALL instead of an asynchronous software interrupt, thereby eliminating the need for the busy wait. This is the technique employed by Zephyr for context switching on RISC-V. I don't really see the point in optimizing the code executed before the busy wait, as you suggested before.

maribu (Member) commented Jan 8, 2021

I opened a bug report in GCC regarding the additional 6 bytes of ROM usage due to not reusing the registers. I think this should eventually be fixed.

maribu (Member) left a review comment

ACK. It is IMO a good idea not to rely on implementation details of a specific CPU: in the future, we might want to support hardware that behaves differently. Additionally, we have users who would benefit right now from being able to run this code in simulators that are not cycle-accurate in their IRQ handling.

@aabadie, @bergzand: Any second thoughts? I agree with @nmeum that we should consider using ECALL rather than soft IRQs for context switches.

bergzand (Member) commented

@bergzand: Any second thoughts?

Nope.

I agree with @nmeum that we should consider using ECALL rather than soft IRQs for context switches.

I gave this a shot this weekend, see #15736. I don't have hardware available (tested on QEMU), so I don't know the actual performance impact.

@maribu maribu added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Jan 11, 2021
bergzand (Member) commented

@nmeum I'm closing this one as #15736 is now merged. Feel free to reopen if necessary of course.

@bergzand bergzand closed this Jan 15, 2021
