"BLAS: Bad memory unallocation!" errors on Power #2760
Description
We recently encountered some failures in some of our tests on Power, which a colleague of mine traced to an apparent issue in OpenBLAS. For reference, consider the attached test program, test.f. (Remove the .txt extension - GitHub made me add it.)
The symptom when you run this program on Power with recent OpenBLAS versions up to and including OpenBLAS 0.3.10 is that OpenBLAS will print the following messages while the program runs:
BLAS : Bad memory unallocation! : 1024 0x7ffe39b40000
BLAS : Bad memory unallocation! : 1024 0x7ffdadb40000
BLAS : Bad memory unallocation! : 1024 0x7ffe1db40000
BLAS : Bad memory unallocation! : 1024 0x7ffd0db40000
We have observed this behavior when compiling OpenBLAS both with gfortran/gcc and with the hybrid nvc/nvfortran/gcc build referenced in my comment on issue #2718. Something like the following should be sufficient to expose the bug:
make CC=gcc FC=gfortran USE_OPENMP=1 DYNAMIC_ARCH=1 NUM_THREADS=512 all
My colleague did some analysis, and here is what he found:
On POWER8 and POWER9 systems, there is a race condition in the memory management logic in routines blas_memory_alloc() and blas_memory_free() (driver/others/memory.c) when the library is compiled for use with OpenMP.
At runtime we are seeing the following errors from blas_memory_free():
BLAS : Bad memory unallocation! : 1024 0x2001b9440000
The message comes from line 2906, if "free_area" is not found in table "memory".
[cparrott73 note - I'm not sure if these line numbers are from the 0.3.10 or the 0.3.7 version, but this should get you in the general ballpark of the problem.]
2871
2872 int position;
2873
2874 #ifdef DEBUG
2875 printf("Unmapped Start : %p ...\n", free_area);
2876 #endif
2877
2878 position = 0;
2879 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2880 LOCK_COMMAND(&alloc_lock);
2881 #endif
2882 while ((position < NUM_BUFFERS) && (memory[position].addr != free_area))
2883 position++;
2884
2885 if (memory[position].addr != free_area) goto error;
2886
2887 #ifdef DEBUG
2888 printf(" Position : %d\n", position);
2889 #endif
2890
2891 // arm: ensure all writes are finished before other thread takes this memory
2892 WMB;
2893
2894 memory[position].used = 0;
2895 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2896 UNLOCK_COMMAND(&alloc_lock);
2897 #endif
2898
2899 #ifdef DEBUG
2900 printf("Unmap Succeeded.\n\n");
2901 #endif
2902
2903 return;
2904
2905 error:
2906 printf("BLAS : Bad memory unallocation! : %4d %p\n", position, free_area);
2907
2908 #ifdef DEBUG
2909 for (position = 0; position < NUM_BUFFERS; position++)
2910 printf("%4ld %p : %d\n", position, memory[position].addr, memory[position].used);
2911 #endif
2912 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2913 UNLOCK_COMMAND(&alloc_lock);
2914 #endif
2915 return;
2916 }
After much tracing, the problem with the free turns out to be an issue with allocation (blas_memory_alloc()). There are various paths in that routine depending upon which environment the build targets. For our OpenMP build, the region of code that is of interest is:
2739
2740 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2741 LOCK_COMMAND(&alloc_lock);
2742 #endif
2743 do {
2744 RMB;
2745 #if defined(USE_OPENMP)
2746 if (!memory[position].used) {
2747 blas_lock(&memory[position].lock);
2748 #endif
2749 if (!memory[position].used) goto allocation;
2750
2751 #if defined(USE_OPENMP)
2752 blas_unlock(&memory[position].lock);
2753 }
2754 #endif
2755 position ++;
2756
2757 } while (position < NUM_BUFFERS);
2758 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2759 UNLOCK_COMMAND(&alloc_lock);
2760 #endif
2761 goto error;
2762
2763 allocation :
2764
2765 #ifdef DEBUG
2766 printf(" Position -> %d\n", position);
2767 #endif
2768
2769 memory[position].used = 1;
2770 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2771 UNLOCK_COMMAND(&alloc_lock);
2772 #else
2773 blas_unlock(&memory[position].lock);
2774 #endif
The code is looping through the table "memory" looking to see if an entry is "used" (line 2746). If the entry is free, a lock is acquired (line 2747), and then the "used" flag is again checked to see whether the entry is still free. If the entry is still free, control transfers to "allocation:", the entry is marked as being "used" (line 2769), and the lock is subsequently released (line 2773).
This is a good implementation: it minimizes the number of locks that have to be acquired and released, and thus the amount of logic that has to be serialized.
The problem is with the implementation of the "blas_lock" routine which is defined in header file common_power.h (top level directory). The lock is coded with inline assembly as:
102 __asm__ __volatile__ (
103 "0: lwarx %0, 0, %1\n"
104 " cmpwi %0, 0\n"
105 " bne- 1f\n"
106 " stwcx. %2,0, %1\n"
107 " bne- 0b\n"
108 "1: "
109 : "=&r"(ret)
110 : "r"(address), "r" (val)
111 : "cr0", "memory");
112 #else
What is wrong is that an "isync" import memory barrier is missing after the lock is acquired. (Incidentally, this locking does not guard against the same lock being acquired multiple times - I am not sure whether that is intended.)
This is what the assembly for blas_lock should be:
102 __asm__ __volatile__ (
103 "0: lwarx %0, 0, %1\n"
104 " cmpwi %0, 0\n"
105 " bne- 1f\n"
106 " stwcx. %2,0, %1\n"
107 " bne- 0b\n"
108 " isync\n"
109 "1: "
110 : "=&r"(ret)
111 : "r"(address), "r" (val)
112 : "cr0", "memory");
(Note the isync instruction added at line 108 in this version.)
Without the import barrier, the read of "used" (line 2749) can be speculatively executed before it is known whether the lock has been acquired. Here is the supporting documentation from IBM's "Power ISA Version 2.07" manual:
B.2.1.1 Acquire Lock and Import Shared Storage
If lwarx and stwcx. instructions are used to obtain the lock, an import barrier can be constructed by placing an isync instruction immediately following the loop containing the lwarx and stwcx.. The following example uses the "Compare and Swap" primitive to acquire the lock.
In this example it is assumed that the address of the lock is in GPR 3, the value indicating that the lock is free is in GPR 4, the value to which the lock should be set is in GPR 5, the old value of the lock is returned in GPR 6, and the address of the shared data structure is in GPR 9.
loop: lwarx  r6,0,r3,1    #load lock and reserve
      cmpw   r4,r6        #skip ahead if
      bne-   wait         # lock not free
      stwcx. r5,0,r3      #try to set lock
      bne-   loop         #loop if lost reservation
      isync               #import barrier
      lwz    r7,data1(r9) #load shared data
      .
      .
wait: ...                 #wait for lock to free
The hint provided with lwarx indicates that after the program acquires the lock variable (i.e. stwcx. is successful), it will release it (i.e. store to it) prior to another program attempting to modify it.
The second bne- does not complete until CR0 has been set by the stwcx.. The stwcx. does not set CR0 until it has completed (successfully or unsuccessfully). The lock is acquired when the stwcx. completes successfully. Together, the second bne- and the subsequent isync create an import barrier that prevents the load from "data1" from being performed until the branch has been resolved not to be taken.
I implemented an alternative version of the alloc and free logic for OpenMP builds using the legacy GNU atomic builtin __sync_val_compare_and_swap; it references only the "used" field of the structure, which now serves as both the lock and the flag (original code disabled using "//"):
2739
2740 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2741 LOCK_COMMAND(&alloc_lock);
2742 #endif
2743 do {
2744 RMB;
2745 #if defined(USE_OPENMP)
2746 // if (!memory[position].used) {
2747 // blas_lock(&memory[position].lock);
2748 #endif
2749 // if (!memory[position].used) goto allocation;
2750 if (memory[position].used == 0 && __sync_val_compare_and_swap(&memory[position].used, 0, 1) == 0) goto allocation;
2751
2752 #if defined(USE_OPENMP)
2753 // blas_unlock(&memory[position].lock);
2754 // }
2755 #endif
2756 position ++;
2757
2758 } while (position < NUM_BUFFERS);
2759 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2760 UNLOCK_COMMAND(&alloc_lock);
2761 #endif
2762 goto error;
2763
2764 allocation :
2765
2766 #ifdef DEBUG
2767 printf(" Position -> %d\n", position);
2768 #endif
2769
2770 // memory[position].used = 1;
2771 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2772 UNLOCK_COMMAND(&alloc_lock);
2773 #else
2774 // blas_unlock(&memory[position].lock);
2775 #endif
and blas_memory_free():
2880 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2881 LOCK_COMMAND(&alloc_lock);
2882 #endif
2883 while ((position < NUM_BUFFERS) && (memory[position].addr != free_area))
2884 position++;
2885
2886 if (memory[position].addr != free_area) goto error;
2887
2888 #ifdef DEBUG
2889 printf(" Position : %d\n", position);
2890 #endif
2891
2892 // arm: ensure all writes are finished before other thread takes this memory
2893 WMB;
2894
2895 // memory[position].used = 0;
2896 assert(__sync_val_compare_and_swap(&memory[position].used, 1, 0) == 1);
2897 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2898 UNLOCK_COMMAND(&alloc_lock);
2899 #endif
Line 2896 might be considered "overkill", but it does guarantee table coherency (and aborts if the table is inconsistent).
One last issue - the padding in the "memory" table entries is incorrect, even though the compiler will naturally align the structure itself:
2604 BLASULONG lock;
2605 void *addr;
2606 #if defined(WHEREAMI) && !defined(USE_OPENMP)
2607 int pos;
2608 #endif
2609 int used;
2610 #ifndef __64BIT__
2611 char dummy[48];
2612 #else
2613 char dummy[40];
2614 #endif
2615
2616 } memory[NUM_BUFFERS];
The "dummy" padding (line 2613) is incorrect for OpenMP builds when BLASULONG is 64 bits and the target is 64-bit: sizeof(BLASULONG) + sizeof(void *) + sizeof(int) = 8 + 8 + 4 = 20 bytes, so the padding should be declared with 44 elements to make each entry exactly 64 bytes.
But this brings up a different issue for POWER8 and POWER9 systems. IBM recommends that a lock be the only element in an aligned 128-byte (reservation granule) cache line. Here is the programming note from section 1.7.3.1 of IBM's Power ISA Version 2.07 reference manual:
Because the reservation is lost if another processor stores anywhere in the reservation granule, lock words (or bytes, halfwords, or doublewords) should be allocated such that few such stores occur, other than perhaps to the lock word itself. (Stores by other processors to the lock word result from contention for the lock, and are an expected consequence of using locks to control access to shared storage; stores to other locations in the reservation granule can cause needless reservation loss.) Such allocation can most easily be accomplished by allocating an entire reservation granule for the lock and wasting all but one word. Because reservation granule size is implementation-dependent, portable code must do such allocation dynamically.
Similar considerations apply to other data that are shared directly using lwarx and stwcx. (e.g., pointers in certain linked lists; see Section B.3, "List Insertion" on page 833).
[cparrott73 note - I have not yet implemented my colleague's second recommendation regarding alignment, but the first recommendation regarding locking appears to be sufficient to make the "BLAS : Bad memory unallocation!" messages go away. I will include these changes in the patches I will submit against OpenBLAS 0.3.10 to work around compilation issues with the new NVIDIA HPC SDK 20.7 compilers, as promised in issue #2718.]