sys: Added simple memory barrier API#11438

Closed
maribu wants to merge 1 commit into RIOT-OS:master from maribu:barriers

Conversation

@maribu
Member

@maribu maribu commented Apr 24, 2019

Contribution description

  • Added two and a half primitives to enforce memory ordering
    • compiler_barrier() (and compiler_barrier_data()) prevent the compiler from reordering memory accesses across them
    • memory_barrier() prevents the compiler from reordering memory accesses and - on out-of-order platforms only - prevents the CPU from doing so as well
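A minimal sketch of what these primitives could look like on GCC/Clang. The names come from this PR, but the bodies shown here are the common GCC-style idioms, not necessarily the PR's exact implementation:

```c
#include <stdint.h>

/* Prevent the compiler from reordering memory accesses across this point */
#define compiler_barrier()  __asm__ volatile ("" : : : "memory")

/* Like compiler_barrier(), but additionally tells the compiler that the
 * object behind `ptr` may be accessed, so it cannot be optimized away */
#define compiler_barrier_data(ptr) \
    __asm__ volatile ("" : : "r" (ptr) : "memory")

/* Also orders accesses at CPU level on out-of-order platforms */
#define memory_barrier()  __sync_synchronize()

uint32_t payload;
uint32_t ready;

/* Example: make sure `ready` only becomes visible after `payload` */
void publish(uint32_t value)
{
    payload = value;
    memory_barrier();  /* payload is globally visible before ready is set */
    ready = 1;
}
```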

Testing procedure

#include <stdint.h>
#include "barriers.h"

typedef union {
    uint32_t u32;
    uint8_t u8[4];
} foo_t;

foo_t a, b;

void bar(void) {
    a.u8[0] = 0xa0;
    a.u8[1] = 0xa1;
    b.u8[0] = 0xb0;
    a.u8[2] = 0xa2;
    a.u8[3] = 0xa3;
    /* Add memory_barrier() here */
    b.u8[1] = 0xb1;
    b.u8[2] = 0xb2;
    b.u8[3] = 0xb3;
}

Issues/PRs references

Potential users of this API: #11073, #8842

@maribu maribu added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Discussion: RFC The issue/PR is used as a discussion starting point about the item of the issue/PR labels Apr 24, 2019
@miri64 miri64 added Type: new feature The issue requests / The PR implements a new feature for RIOT and removed Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation labels Apr 24, 2019
* memory accesses. If the implementation of `do_computing()` is visible to the
* compiler, the compiler may conclude that do_computing() could be moved
* above the first or below the second access to `TIMER`, which will render the
* benchmark meaningless. (**Beware:** Even if `do_something()` is an external
Contributor

Is this accurate? AFAIK, function calls are sequence points WRT the abstract machine. LTO cannot change that.

Member Author

@maribu maribu Apr 24, 2019

The compiler can do any transformation that will not yield different observable behavior compared to the abstract machine. If I remember correctly, observable behavior is defined as (likely incomplete):

  • Side effects caused by function calls
  • Accesses to memory marked as volatile
  • The return value of main() (at least on non-embedded systems)

If the C compiler can prove that a function does not yield observable behavior, the compiler thus can get as creative as it wants.


Edit: Fixed use of language

Contributor

@kaspar030 library function calls. I made the experiment a while ago with some functions defined in a single C file (no LTO), mixing accesses to globals and calls. GCC is smart enough to realize that some functions do not touch global memory. I'm not sure if LTO is also as smart.

Your point still applies to

void __sync_synchronize(void) {

Contributor

I made the experiment a while ago with some functions defined in a single C file (no LTO), mixing accesses to globals and calls. GCC is smart enough to realize that some functions do not touch global memory. I'm not sure if LTO is also as smart.

This sounds like anecdotal evidence.

Member Author

@kaspar030: You can observe that compiler barriers do make a difference in #11440, where those are added to irq_disable(), irq_restore() for the ARMv7 platform etc. Those barriers do not make any difference when compiling without LTO, as the function calls are treated as barriers then. (Unless I missed something in the doc, the ARMv7 platform is in-order, so no hardware memory barriers are used there.)

Contributor

read_fence_test.c
extern void f(void);

int x;

int *p;

int test(void)
{
    int y = x;
    f();
    return y+x;
}

int z(int x)
{
    return x*x;
}

int test2(void)
{
    int y = x;
    int k = z(x);
    return y+x+k;
}


int test3(void)
{
    int y = x;
    //__sync_synchronize();
    __asm__ __volatile__("": : :"memory");
    return y + x;
}

Compile with arm-none-eabi-gcc -mcpu=cortex-m3 -std=c99 -Og -dA -ggdb -c read_fence_test.c

In test(), the compiler is forced to emit a re-read of x because it does not know if f() modified it.

Member Author

In test(), the compiler is forced to emit a re-read of x because it does not know if f() modified it.

The call to f() behaves like a compiler barrier as long as the compiler cannot access the implementation of f(). With LTO this might no longer be the case (depending on where f() is implemented). But even if f() is implemented in a system library, we should not take it for granted to be a compiler barrier for all eternity: When LTO becomes more and more popular, maybe we get LTO-enabled static linking against system libraries (that is, system libraries are not stored in machine code but rather in the internal representation of the compiler to allow optimization over the whole program).

Therefore, I think we should not rely on calls to external functions to behave like compiler barriers.

@jcarrano
Contributor

I agree with the general idea of this PR (having both compiler and cpu barriers). A couple of issues:

The implementation of barrier() is OK, except that a function marked inline may not actually be inlined by the compiler, so I believe it is better to have a macro. That's what Linux does:

https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/include/linux/compiler-gcc.h#L16

As for memory_barrier, it is just another name for __sync_synchronize, so I don't know which should be the "real" one and which one the "alias". The biggest problem, however, is that RIOT's __sync_synchronize does not do what it is supposed to:

RIOT/core/atomic_c11.c

Lines 354 to 361 in 11da5e8

void __sync_synchronize(void) {
/* ARMv4, ARMv5 do not have any hardware support for memory barriers,
* This is a software only barrier and a no-op, and will likely break on SMP
* systems, but we don't support any multi-CPU ARMv5 or ARMv4 boards in RIOT
* so I don't care. /JN
*/
__asm__ volatile ("" : : : "memory");
}

See, it is essentially a compiler barrier.

I think we should fix this before introducing a function that claims to do something that is not true. Of course, if the target does not have barrier instructions, it is because memory access reordering by the CPU is not possible, so that is not a problem.

Another issue with that implementation is that it is a function defined in a C file. It is pointless to put a (compiler) barrier in a function: a call to code in another translation unit is already a barrier, since the compiler does not know what the external function does and must assume anything may have changed (this may not be true with LTO).

Now, as far as I understand, __sync_synchronize is a builtin, so it is probably already defined by the compiler for many platforms.

* prevent correct execution.
*
* ## Why can't I use `volatile` instead
* Memory accesses to `volatile` can be viewed as an alternative to
Contributor

There's one important aspect of volatile: the compiler is forced to use the best-fitting memory access instruction it has available. E.g., on 32-bit ARM, writing through a `volatile uint16_t *` will generate a 16-bit store instruction.

Member Author

What the Linux guys do is write a helper function-like macro that casts the pointer to a pointer to volatile memory of the same type and performs the memory access. This prevents optimization only for that very access, which then has to be done with a properly sized store.
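A minimal sketch of that helper, assuming the older Linux naming (`ACCESS_ONCE`) and GCC's `__typeof__` extension; neither the name nor the register variable is part of this PR:

```c
#include <stdint.h>

/* Sketch of the Linux-style helper described above: the cast to
 * pointer-to-volatile forces exactly one properly sized load or store for
 * this particular access, without declaring the variable itself volatile. */
#define ACCESS_ONCE(x)  (*(volatile __typeof__(x) *)&(x))

uint16_t dev_reg;  /* illustrative stand-in for a device register */

void set_reg(uint16_t val)
{
    /* Emitted as a single 16-bit store even at high optimization levels */
    ACCESS_ONCE(dev_reg) = val;
}

uint16_t get_reg(void)
{
    /* Emitted as a single 16-bit load */
    return ACCESS_ONCE(dev_reg);
}
```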

* @ref compiler_barrier, but only in regard to other accesses to `volatile`
* variables. (Some compilers do not reorder memory accesses to non-`volatile`s
* across accesses to `volatile`s as well, but not all compilers do so.)
* Resorting to mark all variables as `volatile` to enforce the correct order
Contributor

Having one variable volatile enforces loads/stores to that one variable to be issued (WRT the volatile rules). A memory barrier forces the compiler to forget everything it knows about all variables in the current scope. Doesn't that possibly lead to more reloads than necessary?

Member Author

At least in practical code, using barriers has been observed to perform better (see the Linux reference regarding volatile at the bottom). We need to keep in mind that the typical use case is something like:

barrier()
access_to_critical_data()
barrier()

It is very unlikely that memory considered to require being put behind a barrier is accessed unprotected just after the barrier.

Member Author

I totally forgot the obvious argument here:

A memory barrier forces the compiler to forget everything it knows about all variables in the current scope.

This is an incorrect generalization. The compiler will assume that global memory is expected to be accessed during the barrier. In general local variables are still fully optimized. (When the address of the stack variables has been passed to an external function, the compiler has to treat those as potentially accessed as well.) So the compiler will only reload globals after the barrier (and only if they are used on both sides of the barrier). Thus, the odds for the barrier to be cheaper than volatile are very good. (And e.g. for the Linux community this was also true in practice.)

Let me prove that GCC is able to optimize access to stack variables (unless their address is passed to external code) across barriers:

#include <stdint.h>

#define memory_barrier() __sync_synchronize()

typedef struct {
	uint8_t left : 4;
	uint8_t right : 4;
} foo_t;

foo_t foo(foo_t a)
{
	a.left = 0xf;
	memory_barrier();
	a.right = 0xf;
	return a;
}

Output of avr-gcc -O3 -S:

	.file	"foo_local.c"
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__SREG__ = 0x3f
__tmp_reg__ = 0
__zero_reg__ = 1
	.text
.global	foo
	.type	foo, @function
foo:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	ldi r24,lo8(-1)
/* epilogue start */
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 8.3.0"

@maribu
Member Author

maribu commented Apr 24, 2019

@jcarrano: Thanks for the quick comments :-) I'll change compiler_barrier() to be a macro and I'll do the same for __sync_synchronize.

The biggest problem, however, is that RIOT's __sync_synchronize does not do what it is supposed to[...]

This is true, but only if __sync_synchronize() is not a compiler built-in function. Recent versions of both GCC and clang will never generate a call to that function and instead directly emit the machine code issuing the memory barrier.


@maribu
Member Author

maribu commented Apr 24, 2019

I'll resume work on this tomorrow. My brain is melting right now anyway - at least in part because of the hot weather :-)

@maribu
Member Author

maribu commented Apr 25, 2019

@jcarrano: __attribute__((always_inline)) could be used to force inlining of compiler_barrier() and would be nicer to read in GDB when debugging. It has been supported by both GCC and clang for quite some time. Would that be an alternative?

(I believe that the use of inline assembly there is less widely supported than __attribute__((always_inline)). If this is correct, this will not reduce compatibility compared to the use of a macro.)
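A sketch of the always_inline variant under discussion; `shared_flag` and `read_flag` are made-up names for illustration:

```c
/* Compared to a plain macro, the function form keeps type checking and a
 * symbol name for the debugger, while the attribute guarantees it never
 * becomes an out-of-line call. */
static inline __attribute__((always_inline)) void compiler_barrier(void)
{
    __asm__ volatile ("" : : : "memory");
}

int shared_flag;

int read_flag(void)
{
    int val = shared_flag;
    compiler_barrier();  /* the load above may not be moved below this point */
    return val;
}
```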

@maribu
Member Author

maribu commented Apr 25, 2019

I guess it would be nice to have something like READ_ONCE(x) and WRITE_ONCE(x) as well.

E.g. consider this code:

int irq_is_in(void)
{
return __in_isr;
}

I think it would need to be written something like this:

int irq_is_in(void)
{
    int retval;
    retval = __in_isr;
    compiler_barrier();
    return retval;
}

@jcarrano
Contributor

@maribu In that case irq_is_in would work fine even if __in_isr is not volatile: if __in_isr gets set at ISR entry and unset at exit, then from the perspective of any code (in ISR or in normal mode) __in_isr never changes value.

A similar issue was raised a while ago with scheduler variables.

Another issue to watch out for is slight variation in LLVM and GCC when it comes to write barriers, as pointed out in the linux source: https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/include/linux/compiler-gcc.h#L30

Also, I just checked and, at least on ARM, the builtin __sync_synchronize() is OK (it generates a dmb instruction). It looks like we are breaking it by redefining it, but it turns out the redefinition is not effective, so it does not break.

So, in conclusion, my position is:

  • let's have barrier (and barrier_data() if we need it)
  • I'd rather go with the proven, traditional definition using a macro.
  • memory_barrier() is not adding anything special on top of __sync_synchronize().
  • __sync_synchronize() should be defined ONLY on those architectures that don't have the builtin (in the cpu and not in core)

@maribu
Member Author

maribu commented Apr 25, 2019

  • memory_barrier() is not adding anything special on top of __sync_synchronize().

That is true, but it allows future modification. (E.g., assume a specific toolchain has a broken __sync_synchronize(); we could simply use inline assembly to work around that issue. And for in-order platforms with no support for __sync_synchronize() we could just define memory_barrier() to use compiler_barrier().)
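The layering argued for here could be sketched as follows. The feature macro `CPU_HAS_NO_BARRIER_INSN` is purely illustrative, not an existing RIOT define; it stands for "in-order platform without barrier instructions", where a compiler barrier alone suffices:

```c
#define compiler_barrier()  __asm__ volatile ("" : : : "memory")

#ifdef CPU_HAS_NO_BARRIER_INSN
/* In-order CPU without barrier instructions: compiler ordering suffices */
#define memory_barrier()    compiler_barrier()
#else
#define memory_barrier()    __sync_synchronize()
#endif

unsigned counter;

void bump(void)
{
    counter++;
    memory_barrier();  /* the increment is visible before anything after it */
}
```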

Also, I personally think memory_barrier() is a descriptive name, while __sync_synchronize() is not.

I'd also prefer compiler_barrier() over barrier(), as the name compiler_barrier() makes it obvious that CPU reordering is not prevented.

  • I'd rather go with the proven, traditional definition using a macro.

Done

  • __sync_synchronize() should be defined ONLY on those architectures that don't have the builtin (in the cpu and not in core)

I agree, but I'd think that should be a separate PR.

@maribu
Member Author

maribu commented Apr 25, 2019

  • let's have barrier (and barrier_data() if we need it)

Added compiler_barrier_data() and updated the documentation accordingly. (I also fixed some typos: "read and store" instead of "write and store".)

@maribu
Member Author

maribu commented Apr 26, 2019

@kaspar030, @jcarrano: I'd like to share a little experiment:

foo.c:

#include <stdint.h>
#include <stdio.h>

extern int bar(void);
typedef union {
    uint32_t u32;
    uint8_t u8[4];
} foo_t;

foo_t a = { .u32 = 0 };

int main(void)
{
    a.u8[0] = 0xa0;
    int i = bar();
    a.u8[1] = 0xa1;
    a.u8[2] = 0xa2;
    a.u8[3] = 0xa3;

    printf("a = %x, i = %d\n", (unsigned)a.u32, i);
    return 0;
}

bar.c:

int bar(void) {
    return 0x1337;
}

Compilation with:

gcc -c -O3 -o foo.o foo.c
gcc -c -O3 -o bar.o bar.c
gcc -o no_lto foo.o bar.o

and with:

gcc -flto -c -O3 -o foo.o foo.c
gcc -flto -c -O3 -o bar.o bar.c
gcc -flto -O3 -o lto foo.o bar.o

Partial output of objdump -d no_lto (everything but main and bar omitted):

0000000000001060 <main>:
    1060:	48 83 ec 08          	sub    $0x8,%rsp
    1064:	c6 05 05 30 00 00 a0 	movb   $0xa0,0x3005(%rip)        # 4070 <a>
    106b:	e8 80 01 00 00       	callq  11f0 <bar>
    1070:	ba a2 a3 ff ff       	mov    $0xffffa3a2,%edx
    1075:	c6 05 f5 2f 00 00 a1 	movb   $0xa1,0x2ff5(%rip)        # 4071 <a+0x1>
    107c:	48 8d 3d 7d 0f 00 00 	lea    0xf7d(%rip),%rdi        # 2000 <_fini+0xde8>
    1083:	66 89 15 e8 2f 00 00 	mov    %dx,0x2fe8(%rip)        # 4072 <a+0x2>
    108a:	8b 35 e0 2f 00 00    	mov    0x2fe0(%rip),%esi        # 4070 <a>
    1090:	89 c2                	mov    %eax,%edx
    1092:	31 c0                	xor    %eax,%eax
    1094:	e8 87 ff ff ff       	callq  1020 <printf@plt>
    1099:	31 c0                	xor    %eax,%eax
    109b:	48 83 c4 08          	add    $0x8,%rsp
    109f:	c3                   	retq

00000000000011f0 <bar>:
    11f0:	b8 37 13 00 00       	mov    $0x1337,%eax
    11f5:	c3                   	retq

Partial output of objdump -d lto (everything but main omitted, bar no longer present):

0000000000001060 <main>:
    1060:	48 83 ec 08          	sub    $0x8,%rsp
    1064:	ba 37 13 00 00       	mov    $0x1337,%edx
    1069:	be a0 a1 a2 a3       	mov    $0xa3a2a1a0,%esi
    106e:	31 c0                	xor    %eax,%eax
    1070:	48 8d 3d 89 0f 00 00 	lea    0xf89(%rip),%rdi        # 2000 <_fini+0xe0f>
    1077:	c7 05 ef 2f 00 00 a0 	movl   $0xa3a2a1a0,0x2fef(%rip)        # 4070 <a>
    107e:	a1 a2 a3 
    1081:	e8 9a ff ff ff       	callq  1020 <printf@plt>
    1086:	31 c0                	xor    %eax,%eax
    1088:	48 83 c4 08          	add    $0x8,%rsp
    108c:	c3                   	retq

Without LTO the call to bar() is effectively a compiler barrier. With LTO bar() gets inlined and the accesses to a get reordered across what has been the call to bar() and combined to a single 32 bit store.

@kaspar030
Contributor

Without LTO the call to bar() is effectively a compiler barrier. With LTO bar() gets inlined and the accesses to a get reordered across what has been the call to bar() and combined to a single 32 bit store.

... and thus a function call cannot be used as compiler barrier.

I have tried the same using a global variable, a volatile variable and a volatile asm access:

bar.c:

#ifdef MEM_VOLATILE
volatile int _bar = 0x1337;
#endif

#ifdef MEM
int _bar = 0x1337;
#endif

int bar(void) {
#ifdef ASM_VOLATILE
    __asm__ volatile ("mov    %eax,%eax;");
#endif
#if defined(MEM_VOLATILE) || defined(MEM)
    return _bar;
#else
    return 0x1337;
#endif
}

When compiled with -DMEM (returning a global variable instead of literal):

gcc -flto -DMEM -c -O3 -o foo.o foo.c
gcc -flto -DMEM -c -O3 -o bar.o bar.c
gcc -flto -DMEM -O3 -o lto foo.o bar.o
 31 0000000000001040 <main>:
 32     1040:›  48 83 ec 08          ›  sub    $0x8,%rsp
 33     1044:›  ba 37 13 00 00       ›  mov    $0x1337,%edx
 34     1049:›  be a0 a1 a2 a3       ›  mov    $0xa3a2a1a0,%esi
 35     104e:›  31 c0                ›  xor    %eax,%eax
 36     1050:›  48 8d 3d ad 0f 00 00 ›  lea    0xfad(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
 37     1057:›  c7 05 d3 2f 00 00 a0 ›  movl   $0xa3a2a1a0,0x2fd3(%rip)        # 4034 <a>
 38     105e:›  a1 a2 a3·
 39     1061:›  e8 ca ff ff ff       ›  callq  1030 <printf@plt>
 40     1066:›  31 c0                ›  xor    %eax,%eax
 41     1068:›  48 83 c4 08          ›  add    $0x8,%rsp
 42     106c:›  c3                   ›  retq···
 43     106d:›  0f 1f 00             ›  nopl   (%rax)

This is identical to the original LTO version, thus it makes no difference whether the compiler can tell that the function is "simple" (just returning a constant) or whether it returns a global variable.

When compiled with -DMEM_VOLATILE:

gcc -flto -DMEM_VOLATILE -c -O3 -o foo.o foo.c
gcc -flto -DMEM_VOLATILE -c -O3 -o bar.o bar.c
gcc -flto -DMEM_VOLATILE -O3 -o lto foo.o bar.o
 31 0000000000001040 <main>:
 32     1040:›  b8 a2 a3 ff ff       ›  mov    $0xffffa3a2,%eax
 33     1045:›  48 83 ec 08          ›  sub    $0x8,%rsp
 34     1049:›  c6 05 e8 2f 00 00 a0 ›  movb   $0xa0,0x2fe8(%rip)        # 4038 <__TMC_END__>
 35     1050:›  8b 15 da 2f 00 00    ›  mov    0x2fda(%rip),%edx        # 4030 <_bar>
 36     1056:›  66 89 05 dd 2f 00 00 ›  mov    %ax,0x2fdd(%rip)        # 403a <__TMC_END__+0x2>
 37     105d:›  48 8d 3d a0 0f 00 00 ›  lea    0xfa0(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
 38     1064:›  31 c0                ›  xor    %eax,%eax
 39     1066:›  c6 05 cc 2f 00 00 a1 ›  movb   $0xa1,0x2fcc(%rip)        # 4039 <__TMC_END__+0x1>
 40     106d:›  8b 35 c5 2f 00 00    ›  mov    0x2fc5(%rip),%esi        # 4038 <__TMC_END__>
 41     1073:›  e8 b8 ff ff ff       ›  callq  1030 <printf@plt>
 42     1078:›  31 c0                ›  xor    %eax,%eax
 43     107a:›  48 83 c4 08          ›  add    $0x8,%rsp
 44     107e:›  c3                   ›  retq···
 45     107f:›  90                   ›  nop

Here the volatile read is inserted after the write of "0xa0" (line nr 35), the memory access to 'a' is not reordered across the inlined function call containing a volatile memory access.
I modified foo.c to define _bar itself (and dropped bar.c). Interestingly, the resulting code is 100% identical to the inlined version. gcc does not re-order the writes to 'a' around the volatile read.

Here's the effect of an __asm__ volatile block (compile with -DASM_VOLATILE):

gcc -flto -DASM_VOLATILE -c -O3 -o foo.o foo.c                                                                                              
gcc -flto -DASM_VOLATILE -c -O3 -o bar.o bar.c                                                                                              
gcc -flto -DASM_VOLATILE -O3 -o lto foo.o bar.o                      
 31 0000000000001040 <main>:
 32     1040:›  48 83 ec 08          ›  sub    $0x8,%rsp
 33     1044:›  c6 05 e9 2f 00 00 a0 ›  movb   $0xa0,0x2fe9(%rip)        # 4034 <a>
 34     104b:›  89 c0                ›  mov    %eax,%eax
 35     104d:›  b8 a2 a3 ff ff       ›  mov    $0xffffa3a2,%eax
 36     1052:›  c6 05 dc 2f 00 00 a1 ›  movb   $0xa1,0x2fdc(%rip)        # 4035 <a+0x1>
 37     1059:›  ba 37 13 00 00       ›  mov    $0x1337,%edx
 38     105e:›  48 8d 3d 9f 0f 00 00 ›  lea    0xf9f(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
 39     1065:›  66 89 05 ca 2f 00 00 ›  mov    %ax,0x2fca(%rip)        # 4036 <a+0x2>
 40     106c:›  8b 35 c2 2f 00 00    ›  mov    0x2fc2(%rip),%esi        # 4034 <a>
 41     1072:›  31 c0                ›  xor    %eax,%eax
 42     1074:›  e8 b7 ff ff ff       ›  callq  1030 <printf@plt>
 43     1079:›  31 c0                ›  xor    %eax,%eax
 44     107b:›  48 83 c4 08          ›  add    $0x8,%rsp
 45     107f:›  c3                   ›  retq···

Here, gcc does not reorder the writes to 'a' around the asm statement (line nr 34).

So, while the function call itself pretty much vanishes (it doesn't have any effect) when using LTO, gcc, to me, doesn't show any unexpected behavior.

@maribu
Member Author

maribu commented Apr 26, 2019

@kaspar030: The C standard is sadly not that clear on whether memory accesses that are not marked as volatile can be moved across volatile memory accesses. Citing from John Regehr - Nine ways to break your systems code using volatile:

One school of thought is that compilers may not move accesses to global variables around accesses to volatile variables. There seems to be a consistent reading of the standard that backs this up. The problem with this reading is that important compilers are based on a different interpretation, which says that accesses to non-volatile objects can be arbitrarily moved around volatile accesses.

So while GCC does treat accesses to volatile memory as a compiler barrier (or at least both the version you are running and the version I am using), there is no way to rely on that.

But we also need to keep in mind that this actually doesn't matter: even if every compiler supported by RIOT does treat accesses to volatile as a compiler barrier, hardware memory barriers might still be needed. Thus, using volatile as a memory barrier is not possible.

@kaspar030
Contributor

Thus, using volatile as memory barrier is not possible.

I think we need to put this in context.

Using volatile as a compiler barrier e.g., in AVR periph code, is perfectly possible.
It is probably not in high-level code that might break on native, ESP32 or other more sophisticated architectures.

IMO we should come up with clear and simple guidelines for how to do accesses to variables that are shared between threads and/or ISRs.

Providing a memory barrier API is probably not gonna help, or even hurt, as people will use it at random places without deeper understanding.

What we need are instructions like "if a variable is shared between an ISR and thread(s), guard its reads with irq_disable()/irq_restore() or use atomics". And of course we need to make sure that those functions are rock-solid, as you've already started.

If you guys think that just using __asm__ volatile ("" : : : "memory"); in the rare cases it is needed is not enough, I propose adding a proper macro to core/include/kernel_defines.h. Same for __sync_synchronize().

I'd rather not have the examples and explanations that this PR adds. It feels opinionated and alarmist and not considering context (sorry @maribu).

Also, I'd like to reserve "barrier.h" for IPC barriers, for which we'll eventually have an implementation.
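The guarded-access guideline stated above could be sketched as follows. `irq_disable()`/`irq_restore()` are RIOT's real API, but `shared_count` and `get_shared_count` are made-up names, and minimal host stubs stand in for RIOT's `irq.h` so the sketch is self-contained:

```c
#include <stdint.h>

/* Host stubs standing in for RIOT's irq.h; on a real target these disable
 * and re-enable interrupts, and the calls act as compiler barriers. */
static unsigned irq_disable(void)
{
    __asm__ volatile ("" : : : "memory");
    return 0;
}

static void irq_restore(unsigned state)
{
    (void)state;
    __asm__ volatile ("" : : : "memory");
}

static uint32_t shared_count;  /* written from ISR context on a real target */

uint32_t get_shared_count(void)
{
    unsigned state = irq_disable();  /* enter critical section */
    uint32_t val = shared_count;
    irq_restore(state);              /* leave critical section */
    return val;
}
```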

@maribu
Member Author

maribu commented Apr 26, 2019

Using volatile as a compiler barrier e.g., in AVR periph code, is perfectly possible.

Let me prove this wrong.

foo.c:

#include <stdint.h>

#ifdef VOL
extern volatile uint8_t __barrier;
#define memory_barrier() __barrier = 0
#else
#define memory_barrier() __sync_synchronize()
#endif

typedef struct {
	uint8_t left : 4;
	uint8_t right : 4;
} foo_t;

extern foo_t a;

void foo(void)
{
	a.left = 0xf;
	memory_barrier();
	a.right = 0xf;
}

Output of avr-gcc -O3 -S foo.c -o sync.s with some comments added:

	.file	"foo.c"
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__SREG__ = 0x3f
__tmp_reg__ = 0
__zero_reg__ = 1
	.text
.global	foo
	.type	foo, @function
foo:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	lds r24,a          /* <-- Content of a.right unknown, thus load */
	ori r24,lo8(15)    /* <-- Set a.left to 0xf, keep a.right unmodified */
	sts a,r24          /* <-- Store a */
/* memory_barrier() */
	lds r24,a          /* <-- Load a again */
	ori r24,lo8(-16)   /* <-- Set a.right to 0xf, keep a.left unmodified */
	sts a,r24          /* <-- Store a */
/* epilogue start */
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 8.3.0"

Output of avr-gcc -DVOL -O3 -S foo.c -o vol.s with some comments added:

	.file	"foo.c"
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__SREG__ = 0x3f
__tmp_reg__ = 0
__zero_reg__ = 1
	.text
.global	foo
	.type	foo, @function
foo:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	sts __barrier,__zero_reg__  /* <-- memory_barrier() */
	ldi r24,lo8(-1)             /* <-- r24 = 0xff */
	sts a,r24                   /* <-- a = r24 (== 0xff) */
/* epilogue start */
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 8.3.0"

So when using an access to volatile memory as a barrier, memory accesses can be moved across this "barrier": here one 4-bit access was moved across the barrier and combined with another 4-bit access. At the time of the barrier the data in a is thus corrupted.

(Note: When using avr-gcc 5.4.0 from the docker image riot/riotbuild I get the same output except for the different version number in .ident "GCC: (GNU) 8.3.0")

@maribu
Member Author

maribu commented Apr 29, 2019

Also, I'd like to reserve "barrier.h" for IPC barriers, for which we'll eventually have an implementation.

+1

If you guys think that just using __asm__ volatile ("" : : : "memory"); in the rare cases it is needed is not enough, I propose adding a proper macro to core/include/kernel_defines.h.

I'd say the main benefit is portability, as __asm__ volatile ("" : : : "memory"); is more of a compiler-specific hack. Using a macro instead allows adding support for other compilers later on.

I'm also not sure if core is the right place to put it. I'd expect compiler/memory barriers to only be used directly in platform-dependent code. Platform-independent code should just use mutex, irq_disable(), or C11 atomics instead of dealing directly with barriers, in my opinion.

Providing a memory barrier API is probably not gonna help, or even hurt, as people will use it at random places without deeper understanding.

People using APIs incorrectly or in the wrong situations is a general software development issue, not specific to this problem. Providing easy-to-use APIs with proper documentation is clearly better than withholding APIs that could potentially be misused.

Also, so far all implementations of irq_disable()/irq_restore() I checked either use externally supplied APIs (e.g. cli()/sei() on AVR, __disable_irq() on Cortex-M), or contained bugs regarding the use of barriers. I'd say that this is anecdotal evidence that such APIs can help to prevent hard-to-trace data corruption bugs.

I'd rather not have the examples and explanations that this PR adds. It feels opinionated and alarmist and not considering context (sorry @maribu).

I can split it off into another PR, so that the discussion about this does not stall the barrier API. But a barrier API without the knowledge of when and how to use it is of little benefit. Examples and explanations are usually considered a good way to spread knowledge... Any help improving those is greatly appreciated.

@maribu
Member Author

maribu commented May 15, 2019

Backup of old state discussed here: https://github.com/maribu/RIOT/commits/barriers-bak

I stripped off everything but the most basic documentation and added an @warning saying that, in general, high-level APIs like the mutex should be used instead.

@maribu
Member Author

maribu commented Sep 4, 2019

OK, I indeed cannot come up with an example that is not falling into the following cases:

  1. A high level synchronization primitive can be used
  2. The low level code should be wrapped into irq_disable() and irq_restore() anyway
  3. Inline assembly is used, so adding appropriate clobbers and (on out-of-order CPUs only) synchronization instructions would be equally fine.

Unless a practical example not falling into these categories pops up, this PR can be closed IMHO.

@maribu maribu closed this Sep 4, 2019
@maribu maribu deleted the barriers branch August 23, 2020 19:33

Labels

Discussion: RFC The issue/PR is used as a discussion starting point about the item of the issue/PR Type: new feature The issue requests / The PR implements a new feature for RIOT
