sys: Added simple memory barrier API#11438
Conversation
sys/include/barriers.h
Outdated
| * memory accesses. If the implementation of `do_computing()` is visible to the | ||
| * compiler, the compiler may conclude that do_computing() could be moved | ||
| * above the first or below the second access to `TIMER`, which will render the | ||
| * benchmark meaningless. (**Beware:** Even if `do_something()` is an external |
There was a problem hiding this comment.
Is this accurate? AFAIK, function calls are sequence points WRT the abstract machine. LTO cannot change that.
There was a problem hiding this comment.
The compiler can do any transformation that will not yield different observable behavior compared to the abstract machine. If I remember correctly observable behavior is defined as (likely incomplete):
- Side-effects cause by function calls
- Access to memory marked as
volatile - The return value of
main()(at least on non-embedded systems)
If the C compiler can prove that a function does not yield observable behavior, the compiler thus can get as creative as it wants.
Edit: Fixed use of language
There was a problem hiding this comment.
@kaspar030 library function calls. I made the experiment a while ago with a some functions defined in a single C file (no LTO), mixing accesses to globals and calls. GCC is smart enough to realize that some function do not touch global memory. I'm not sure if LTO is also as smart.
Your point still applies to
Line 354 in 11da5e8
There was a problem hiding this comment.
I made the experiment a while ago with a some functions defined in a single C file (no LTO), mixing accesses to globals and calls. GCC is smart enough to realize that some function do not touch global memory. I'm not sure if LTO is also as smart.
This sounds like anecdotal evidence.
There was a problem hiding this comment.
@kaspar030: You can observe that compiler barriers do make a difference in #11440, where those are added to irq_disable(), irq_restore() for the ARMv7 platform etc. Those barriers do not make any difference when compiling without LTO, as the function calls are treated as barriers then. (Unless I missed something in the doc, the ARMv7 platform is in-order, so no hardware memory barriers are used there.)
There was a problem hiding this comment.
read_fence_test.c
extern void f(void);
int x;
int *p;
int test(void)
{
int y = x;
f();
return y+x;
}
int z(int x)
{
return x*x;
}
int test2(void)
{
int y = x;
int k = z(x);
return y+x+k;
}
int test3(void)
{
int y = x;
//__sync_synchronize();
__asm__ __volatile__("": : :"memory");
return y + x;
}Compile with arm-none-eabi-gcc -mcpu=cortex-m3 -std=c99 -Og -dA -ggdb -c read_fence_test.c
In test(), the compiler is forced to emit a re-read of x because it does not know if f() modified it.
There was a problem hiding this comment.
In
test(), the compiler is forced to emit a re-read ofxbecause it does not know iff()modified it.
The call to f() behaves like a compiler barrier as long as the compiler cannot access the implementation of f(). With LTO this might no longer be the case (depending on where f() is implemented). But even if f() is implemented in a system library, we should not take it for granted to be a compiler barrier for all eternity: When LTO becomes more and more popular, maybe we get LTO-enabled static linking against system libraries (that is, system libraries are not stored in machine code but rather in the internal representation of the compiler to allow optimization over the whole program).
Therefore, I think we should not rely on calls to external function to behave like compiler barriers.
|
I agree with the general idea of this PR (having both compiler and cpu barriers). A couple of issues: Implementation of barrier() is OK, except that a function marked inline may not be inlined by the compiler, so I believe it is better to have a macro. That's what linux does: As for memory_barrier, it is just another name for Lines 354 to 361 in 11da5e8 See, it is essentially a compiler barrier. I think we should fix this before introducing a function that claims to do something that is not true. Of course, if the target does not have barrier instructions it is because memory access reordering by the CPU is not possible so that is not a problem. Another issue with that implementation is that it is a function defined in a C file. It is pointless to put a (compiler) barrier in a function since a call to code in another translation unit is already a barrier: since the compiler does not know what the external function does, it must assume anything may have changed (this may not be true with LTO). Now, as far as I understand, __sync_synchronize is a builtin, so it is probably already defined by the compiler for many platforms. |
sys/include/barriers.h
Outdated
| * prevent correct execution. | ||
| * | ||
| * ## Why can't I use `volatile` instead | ||
| * Memory accesses to `volatile` can be viewed as an alternative to |
There was a problem hiding this comment.
There's one important aspect of "volatile": the compiler is forced to use the best-fitting memory write instruction it has available. E.g., on 32bit ARM, writing to an uint16_t foo *volatile will generate a 16bit store instruction.
There was a problem hiding this comment.
What the Linux guys do is writing a helper function-like macro that casts the pointer to a pointer to volatile memory of the same type and performs the memory access. This will prevent optimization only for that very access that needs to be done in a properly sized store
sys/include/barriers.h
Outdated
| * @ref compiler_barrier, but only in regard to other accesses to `volatile` | ||
| * variables. (Some compilers do not reorder memory accesses to non-`volatile`s | ||
| * across accesses to `volatile`s as well, but not all compilers do so.) | ||
| * Resorting to mark all variables as `volatile` to enforce the correct order |
There was a problem hiding this comment.
having one variable volatile enforces load/store to that one variable to be issued (WRT volatile rules). A memory barrier forces the compiler to forget everything it knows about all variables in the current scope. Doesn't that possibly lead to more reloads than necessary?
There was a problem hiding this comment.
At least in practical code it is observed to perform better to use barriers (see the Linux reference regarding volatile at the bottom). We need to keep in mind that the typical use case is something like:
barrier()
access_to_critical_data()
barrier()It is very unlikely that memory considered to require being put behind a barrier is accessed unprotected just after the barrier.
There was a problem hiding this comment.
I totally forgot the obvious argument here:
A memory barrier forces the compiler to forget everything it knows about all variables in the current scope.
This is an incorrect generalization. The compiler will assume that global memory is expected to be accessed during the barrier. In general local variables are still fully optimized. (When the address of the stack variables has been passed to an external function, the compiler has to treat those as potentially accessed as well.) So the compiler will only reload globals after the barrier (and only if they are used on both sides of the barrier). Thus, the odds for the barrier to be cheaper than volatile are very good. (And e.g. for the Linux community this was also true in practice.)
Let me prove that GCC is able to optimize access to stack variables (unless their address is passed to external code) across barriers:
#include <stdint.h>
#define memory_barrier() __sync_synchronize()
typedef struct {
uint8_t left : 4;
uint8_t right : 4;
} foo_t;
foo_t foo(foo_t a)
{
a.left = 0xf;
memory_barrier();
a.right = 0xf;
return a;
}Output of avr-gcc -O3 -S:
.file "foo_local.c"
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__SREG__ = 0x3f
__tmp_reg__ = 0
__zero_reg__ = 1
.text
.global foo
.type foo, @function
foo:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
ldi r24,lo8(-1)
/* epilogue start */
ret
.size foo, .-foo
.ident "GCC: (GNU) 8.3.0"
|
@jcarrano: Thanks for the quick comments :-) I'll change
This is true, but only if |
sys/include/barriers.h
Outdated
| * } | ||
| * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| * | ||
| * In the above example from the compilers perspective the value in `TIMER` |
There was a problem hiding this comment.
I did. This is also the example presented here https://blog.regehr.org/archives/28, or similar here https://hackaday.com/2015/08/18/embed-with-elliot-the-volatile-keyword/
|
I'll resume work on this tomorrow. My brain is melting right now anyway - at least in part because of the hot weather :-) |
|
@jcarrano: (I believe that the use of inline assembly there is less widely supported than |
|
I guess it would be nice to have something like E.g. consider this code: RIOT/cpu/atmega_common/irq_arch.c Lines 88 to 91 in 215d5cd I think it would need to be written something like this: int irq_is_in(void)
{
int retval;
retval = __in_isr;
compiler_barrier();
return retval;
} |
|
@maribu In that case A similar issue was raised a while ago with scheduler variables. Another issue to watch out for is slight variation in LLVM and GCC when it comes to write barriers, as pointed out in the linux source: https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/include/linux/compiler-gcc.h#L30 Also, I just checked and, at least in ARM, the builtin __sync_synchronize() is OK (generates a So, in conclusion, my position is:
|
That is true, but it allows future modification. (E.g. assume a specific toolchain has a broken Also, I personally think I'd also prefer
Done
I agree, but I'd think that should be a separate PR. |
Added |
|
@kaspar030, @jcarrano: I'd like to share a little experiment:
#include <stdint.h>
#include <stdio.h>
extern int bar(void);
typedef union {
uint32_t u32;
uint8_t u8[4];
} foo_t;
foo_t a = { .u32 = 0 };
int main(void)
{
a.u8[0] = 0xa0;
int i = bar();
a.u8[1] = 0xa1;
a.u8[2] = 0xa2;
a.u8[3] = 0xa3;
printf("a = %x, i = %d\n", (unsigned)a.u32, i);
return 0;
}
int bar(void) {
return 0x1337;
}Compilation with: and with: Partial output of Partial output of Without LTO the call to |
... and thus a function call cannot be used as compiler barrier. I have tried the same using a global variable, a volatile variable and a volatile asm access: bar.c: When compiled with This is identical to the original LTO version, thus it makes no difference whether the compiler can reason the function is "simple" (just returning a constant) or if it returns a global variable. When compiled with Here the volatile read is inserted after the write of "0xa0" (line nr 35), the memory access to 'a' is not reordered across the inlined function call containing a volatile memory access. Here's the effect of an Here, gcc does not reorder the writes to 'a' around the asm statement (line nr 34). So, while the function call itself pretty much vanishes (it doesn't have any effect) when using LTO, gcc, to me, doesn't show any unexpected behavior. |
|
@kaspar030: The C standard is sadly not that clear on whether memory accesses that are not marked as
So while GCC does treat accesses to But we also need to keep in mind that this actually doesn't matter: Even if every compiler supported by RIOT does tread accesses to |
I think we need to put this in context. Using volatile as a compiler barrier e.g., in AVR periph code, is perfectly possible. IMO we should come up with clear and simple guidelines for how to do accesses to variables that are shared between threads and/or ISRs. Providing a memory barrier API is probably not gonna help, or even hurt, as people will use it at random places without deeper understanding. What we need are instructions like "if a variable is shared between ISR and thread(s), guard it's reads with irq_disable/restore() or use atomics". And of course we need to make sure that those functions are rock-solid, as you've already started. If you guys think that just using I'd rather not have the examples and explanations that this PR adds. It feels opinionated and alarmist and not considering context (sorry @maribu). Also, I'd like to reserve "barrier.h" for IPC barriers, for which we'll eventually have an implementation. |
Let me prove this wrong.
#include <stdint.h>
#ifdef VOL
extern volatile uint8_t __barrier;
#define memory_barrier() __barrier = 0
#else
#define memory_barrier() __sync_synchronize()
#endif
typedef struct {
uint8_t left : 4;
uint8_t right : 4;
} foo_t;
extern foo_t a;
void foo(void)
{
a.left = 0xf;
memory_barrier();
a.right = 0xf;
}Output of Output of So when using an access to (Note: When using avr-gcc 5.4.0 from the docker image |
+1
I'd say the main benefit is being portable, as I'm also not sure if
People using APIs incorrectly or in incorrect situations is a general software development issue not specific to this problem. Providing easy to use APIs and proper documentation is clearly the better way instead of not providing APIs that could potentially be misused. Also so far all implementations of
I can split it off into another PR, so that the discussion about this does not stall the barrier API. But an barrier API without the knowledge when and how to use it is of little benefit. Examples and explanations are usually considered to be a good way to spread knowledge... Any help improving those is greatly appreciated. |
|
Backup of old state discussed here: https://github.com/maribu/RIOT/commits/barriers-bak I stripped off anything but the most basic documentation and added an |
|
OK, I indeed cannot come up with an example that is not falling into the following cases:
Unless a practical example not falling in these categories pops up, this PR can be closed IMHO |
Contribution description
compiler_barrier()(andcompiler_barrier_data()) prevent the compiler from reordering memory accesses across themmemory_barrier()prevents the compiler from reordering memory accesses and - on out-of-order platforms only - prevents the CPU from doing so as wellTesting procedure
compiler_barrier()check https://en.wikipedia.org/wiki/Memory_ordering#Compiler_memory_barriermemory_barrier()check https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html and https://llvm.org/docs/Atomics.htmlexamples/hello-world/main.c)aandbwill be reordered and combined to two 32 bit writes on 32-bit platforms.acan still be merged and reordered in regard to the write tob.u8[0], but not in regard to any other write tob- which take place after the barrier.)nucleo-f767zi) actual memory barrier instructions are emitted by the compiler (e.g.dmb(data memory barrier)); On in-order CPUs those instructions are neither supported nor neededIssues/PRs references
Potential users of this API: #11073, #8842