Skip to content

"BLAS: Bad memory unallocation!" errors on Power #2760

@cparrott73

Description

@cparrott73

We recently encountered some failures in some of our tests on Power, which a colleague of mine traced to an apparent issue in OpenBLAS. For reference, consider the attached test program, test.f. (Remove the .txt extension below - github made me add that:)

test.f.txt

The symptom when you run this program on Power with recent OpenBLAS versions up to and including OpenBLAS 0.3.10 is that OpenBLAS will print the following messages while the program runs:

BLAS : Bad memory unallocation! : 1024  0x7ffe39b40000
BLAS : Bad memory unallocation! : 1024  0x7ffdadb40000
BLAS : Bad memory unallocation! : 1024  0x7ffe1db40000
BLAS : Bad memory unallocation! : 1024  0x7ffd0db40000

We have observed this behavior when compiling OpenBLAS with both gfortran/gcc, as well as the hybrid nvc/nvfortran/gcc build referenced in my comment in issue #2718. Something like the following should be sufficient to expose the bug:

make CC=gcc FC=gfortran USE_OPENMP=1 DYNAMIC_ARCH=1 NUM_THREADS=512 all

My colleague did some analysis, and here is what he found:

On POWER8 and POWER9 systems, there is a race condition in the memory management logic in routines blas_memory_alloc() and blas_memory_free() (driver/others/memory.c) when the library is compiled for use with OpenMP.

At runtime we are seeing the following errors from blas_memory_free():

BLAS : Bad memory unallocation! : 1024 0x2001b9440000

The message comes from line 2906, if "free_area" is not found in table "memory".

[cparrott73 note - I'm not sure if these line numbers are from the 0.3.10 or the 0.3.7 version, but this should get you in the general ballpark of the problem.]

2871
2872   int position;
2873
2874 #ifdef DEBUG
2875   printf("Unmapped Start : %p ...\n", free_area);
2876 #endif
2877
2878   position = 0;
2879 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2880   LOCK_COMMAND(&alloc_lock);
2881 #endif
2882   while ((position < NUM_BUFFERS) && (memory[position].addr != free_area))
2883     position++;
2884
2885   if (memory[position].addr != free_area) goto error;
2886
2887 #ifdef DEBUG
2888   printf("  Position : %d\n", position);
2889 #endif
2890
2891   // arm: ensure all writes are finished before other thread takes this memory
2892   WMB;
2893
2894   memory[position].used = 0;
2895 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2896   UNLOCK_COMMAND(&alloc_lock);
2897 #endif
2898
2899 #ifdef DEBUG
2900   printf("Unmap Succeeded.\n\n");
2901 #endif
2902
2903   return;
2904
2905  error:
2906   printf("BLAS : Bad memory unallocation! : %4d  %p\n", position,  free_area);
2907
2908 #ifdef DEBUG
2909   for (position = 0; position < NUM_BUFFERS; position++)
2910     printf("%4ld  %p : %d\n", position, memory[position].addr, memory[position].used);
2911 #endif
2912 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2913   UNLOCK_COMMAND(&alloc_lock);
2914 #endif
2915   return;
2916 }

After much tracing, the problem with the free turns out to be an issue with allocation (blas_memory_alloc()). There are various paths in that routine depending upon which environment the build targets. For our OpenMP build, the region of code that is of interest is:

2739
2740 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2741   LOCK_COMMAND(&alloc_lock);
2742 #endif
2743   do {
2744       RMB;
2745 #if defined(USE_OPENMP)  
2746     if (!memory[position].used) {
2747       blas_lock(&memory[position].lock);
2748 #endif
2749       if (!memory[position].used) goto allocation;
2750
2751 #if defined(USE_OPENMP)
2752       blas_unlock(&memory[position].lock);
2753     }
2754 #endif
2755     position ++;
2756
2757   } while (position < NUM_BUFFERS);
2758 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2759   UNLOCK_COMMAND(&alloc_lock);
2760 #endif
2761   goto error;
2762
2763   allocation :
2764
2765 #ifdef DEBUG
2766   printf("  Position -> %d\n", position);
2767 #endif
2768
2769   memory[position].used = 1;
2770 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2771   UNLOCK_COMMAND(&alloc_lock);
2772 #else
2773   blas_unlock(&memory[position].lock);
2774 #endif

The code is looping through the table "memory" looking to see if an entry is "used" (line 2746). If the entry is free, a lock is acquired (line 2747), and then the "used" flag is again checked to see whether the entry is still free. If the entry is still free, control transfers to "allocation:", the entry is marked as being "used" (line 2769), and the lock is subsequently released (line 2773).

This is a good implementation which minimizes what has to be serialized the number of locks that have to be acquired/released, thus minimizing the amount of logic that has to be serialized.

The problem is with the implementation of the "blas_lock" routine which is defined in header file common_power.h (top level directory). The lock is coded with inline assembly as:

102   __asm__ __volatile__ (
103            "0:  lwarx %0, 0, %1\n"
104            "    cmpwi %0, 0\n"
105            "    bne- 1f\n"
106            "    stwcx. %2,0, %1\n"
107            "    bne- 0b\n"
108            "1:    "
109         : "=&r"(ret)
110         : "r"(address), "r" (val)
111         : "cr0", "memory");
112 #else  

What is wrong, is that there is a missing "isync" import memory barrier after after acquiring the lock (BTW, this locking does not guard against a lock being acquired multiple times - not sure if that's intended or not).

This is what the assembly for blas_lock should be:

102   __asm__ __volatile__ (
103        "0:  lwarx %0, 0, %1\n"
104        "    cmpwi %0, 0\n"
105        "    bne- 1f\n"
106        "    stwcx. %2,0, %1\n"
107        "    bne- 0b\n"
108        "    isync\n"
109        "1:    "
110     : "=&r"(ret)
111     : "r"(address), "r" (val)
112     : "cr0", "memory");

(Note the isync instruction added at line 108 in this version.)

Without the import barrier, the read of "used" (line 2749) is being speculatively executed, prior to knowing whether the lock has been acquired or not. Here's the supporting documentation from IBM's "Power ISA Version 2.07" manual:

B.2.1.1 Acquire Lock and Import
Shared Storage
If lwarx and stwcx. instructions are used to obtain the
lock, an import barrier can be constructed by placing an
isync instruction immediately following the loop containing
the lwarx and stwcx.. The following example
uses the “Compare and Swap” primitive to acquire the
lock.
In this example it is assumed that the address of the
lock is in GPR 3, the value indicating that the lock is
free is in GPR 4, the value to which the lock should be
set is in GPR 5, the old value of the lock is returned in
GPR 6, and the address of the shared data structure is
in GPR 9.
loop:
lwarx r6,0,r3,1 #load lock and reserve
cmpw r4,r6 #skip ahead if
bne- wait # lock not free
stwcx. r5,0,r3 #try to set lock
bne- loop #loop if lost reservation
isync #import barrier
lwz r7,data1(r9)#load shared data
.
.
wait... #wait for lock to free
The hint provided with lwarx indicates that after the
program acquires the lock variable (i.e. stwcx. is successful),
it will release it (i.e. store to it) prior to another
program attempting to modify it.
The second bne- does not complete until CR0 has
been set by the stwcx.. The stwcx. does not set CR0
until it has completed (successfully or unsuccessfully).
The lock is acquired when the stwcx. completes successfully.
Together, the second bne- and the subsequent
isync create an import barrier that prevents the
load from “data1” from being performed until the branch
has been resolved not to be taken.

I implemented an alternative version of the alloc and free logic for OpenMP builds using the GNU legacy atomic operation __sync_val_compare_and_swap, which now only references the "used" field of the structure. This implementation makes the "used" field both a lock and a flag (original code disabled using "//"):

2739
2740 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2741   LOCK_COMMAND(&alloc_lock);
2742 #endif
2743   do {
2744       RMB;
2745 #if defined(USE_OPENMP)  
2746 //    if (!memory[position].used) {
2747 //      blas_lock(&memory[position].lock);
2748 #endif
2749 //      if (!memory[position].used) goto allocation;
2750       if (memory[position].used == 0 && __sync_val_compare_and_swap(&memory[position].used, 0, 1) == 0) goto allocation;
2751
2752 #if defined(USE_OPENMP)
2753 //      blas_unlock(&memory[position].lock);      
2754 //    }
2755 #endif
2756     position ++;
2757
2758   } while (position < NUM_BUFFERS);
2759 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2760   UNLOCK_COMMAND(&alloc_lock);
2761 #endif
2762   goto error;
2763
2764   allocation :
2765
2766 #ifdef DEBUG
2767   printf("  Position -> %d\n", position);
2768 #endif
2769
2770 //  memory[position].used = 1;
2771 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2772   UNLOCK_COMMAND(&alloc_lock);
2773 #else
2774 //  blas_unlock(&memory[position].lock);    
2775 #endif

and blas_free():

2880 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2881   LOCK_COMMAND(&alloc_lock);
2882 #endif
2883   while ((position < NUM_BUFFERS) && (memory[position].addr != free_area))
2884     position++;
2885
2886   if (memory[position].addr != free_area) goto error;
2887
2888 #ifdef DEBUG
2889   printf("  Position : %d\n", position);
2890 #endif
2891
2892   // arm: ensure all writes are finished before other thread takes this memory
2893   WMB;
2894
2895 //  memory[position].used = 0;
2896   assert(__sync_val_compare_and_swap(&memory[position].used, 1, 0) == 1);
2897 #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP)
2898   UNLOCK_COMMAND(&alloc_lock);
2899 #endif

Line 2896 might be considered "overkill", but it does guarantee table coherency (and abort if it is not).

One last issue - the "memory" table is not properly aligned, though compilers will properly align the structure:

2604   BLASULONG lock;
2605   void *addr;
2606 #if defined(WHEREAMI) && !defined(USE_OPENMP)
2607   int   pos;
2608 #endif
2609   int used;
2610 #ifndef __64BIT__
2611   char dummy[48];
2612 #else
2613   char dummy[40];
2614 #endif
2615
2616 } memory[NUM_BUFFERS];

The "dummy" padding (line 2613) is incorrect for OpenMP builds when BLASULONG is 64 bits and compiling for 64-bits. sizeof(BLASULONG) + sizeof(*addr) + sizeof(int) = 8+8+4 = 20. It should be declared as 44 elements for 64-byte alignment.

But, this brings up a different issue for POWER8 and POWER9 systems. IBM recommends that locks be the only element in an aligned 128-byte (granule) cache line. Here's the programming note from IBM's POWER ISA Version 2.07 reference manual from section 1.7.3.1:

Because the reservation is lost if another processor
stores anywhere in the reservation granule, lock
words (or bytes, halfwords, or doublewords) should
be allocated such that few such stores occur, other
than perhaps to the lock word itself. (Stores by
other processors to the lock word result from contention
for the lock, and are an expected consequence
of using locks to control access to shared
storage; stores to other locations in the reservation
granule can cause needless reservation loss.)
Such allocation can most easily be accomplished
by allocating an entire reservation granule for the
lock and wasting all but one word. Because reservation
granule size is implementation-dependent,
portable code must do such allocation dynamically.
Similar considerations apply to other data that are
shared directly using lwarx and stwcx. (e.g., pointers
in certain linked lists; see Section B.3, “List
Insertion” on page 833).

[cparrott73 note - I have yet not implemented the second recommendation regarding alignment from my colleague, but the first recommendation with regard to locking appears to be sufficient to make the BLAS: Bad memory unallocation! messages go away. I will include these changes in the patches I will submit against OpenBLAS 0.3.10 to work around compilation issues with the new NVIDIA HPC SDK 20.7 compilers, as promised in issue #2718.]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions