[SDK] BatchSpanProcessor::ForceFlush appears to be hanging forever #2574

Description

@lidavidm

Describe your environment
Platform: Linux/x86_64
Build system: CMake
Package manager: conda/conda-forge
OpenTelemetry version: 1.14.1

Steps to reproduce
Unfortunately, the reproduction is entangled in a large proprietary application and I have not been able to extract a standalone case. We are using the OTLP exporter with a BatchSpanProcessor, and we call ForceFlush periodically with a timeout, roughly as sketched below.
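
For context, a minimal sketch of the pipeline in question. The gRPC exporter variant, the factory-based wiring, and the one-second timeout here are illustrative assumptions, not our exact configuration:

#include <chrono>
#include <memory>
#include <utility>

#include "opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h"
#include "opentelemetry/sdk/trace/batch_span_processor_factory.h"
#include "opentelemetry/sdk/trace/batch_span_processor_options.h"
#include "opentelemetry/sdk/trace/tracer_provider_factory.h"

namespace trace_sdk = opentelemetry::sdk::trace;
namespace otlp      = opentelemetry::exporter::otlp;

int main()
{
  // OTLP exporter behind a BatchSpanProcessor, wired into a TracerProvider.
  auto exporter = otlp::OtlpGrpcExporterFactory::Create();
  trace_sdk::BatchSpanProcessorOptions options{};
  auto processor =
      trace_sdk::BatchSpanProcessorFactory::Create(std::move(exporter), options);
  auto provider = trace_sdk::TracerProviderFactory::Create(std::move(processor));

  // ... spans are created elsewhere via provider->GetTracer(...) ...

  // Periodic flush with a timeout; this is the call that gets stuck.
  provider->ForceFlush(std::chrono::microseconds{1000000});
  return 0;
}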

What is the expected behavior?
ForceFlush with a timeout should reliably return within (roughly) the specified timeout.

What is the actual behavior?
ForceFlush appears to get stuck inside the spinlock:

#0  0x00007f52a6c49f17 in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007f52a84ce4ee in opentelemetry::v1::sdk::trace::BatchSpanProcessor::ForceFlush(std::chrono::duration<long, std::ratio<1l, 1000000l> >) () from /home/lidavidm/miniforge3/envs/dev/bin/../lib/libopentelemetry_trace.so

or

   0x00007f52a84ce49e <+334>:	jne    0x7f52a84ce4bd <_ZN13opentelemetry2v13sdk5trace18BatchSpanProcessor10ForceFlushENSt6chrono8durationIlSt5ratioILl1ELl1000000EEEE+365>
   0x00007f52a84ce4a0 <+336>:	pause
=> 0x00007f52a84ce4a2 <+338>:	mov    %ebx,%eax
   0x00007f52a84ce4a4 <+340>:	and    $0x7f,%eax
   0x00007f52a84ce4a7 <+343>:	cmp    $0x7f,%eax

These traces appear to correspond to the spin-wait portion of ForceFlush. That loop does not appear to respect the timeout:

// If it is already signaled, we must wait until notified.
// We use a spin lock here
if (false ==
    synchronization_data_->is_force_flush_pending.exchange(false, std::memory_order_acq_rel))
{
  for (int retry_waiting_times = 0;
       false == synchronization_data_->is_force_flush_notified.load(std::memory_order_acquire);
       ++retry_waiting_times)
  {
    opentelemetry::common::SpinLockMutex::fast_yield();
    if ((retry_waiting_times & 127) == 127)
    {
      std::this_thread::yield();
    }
  }
}
synchronization_data_->is_force_flush_notified.store(false, std::memory_order_release);
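
For comparison, a sketch (mine, not the SDK's) of how that wait could honor the caller's timeout. `timeout` stands for the microseconds argument ForceFlush receives, and the early `return false` is illustrative only; it ignores whatever flag cleanup the real code would need:

// Illustrative only: bound the spin-wait by a deadline derived from the timeout.
auto deadline = std::chrono::steady_clock::now() + timeout;
if (false ==
    synchronization_data_->is_force_flush_pending.exchange(false, std::memory_order_acq_rel))
{
  for (int retry_waiting_times = 0;
       false == synchronization_data_->is_force_flush_notified.load(std::memory_order_acquire);
       ++retry_waiting_times)
  {
    if (std::chrono::steady_clock::now() >= deadline)
    {
      return false;  // Give up instead of spinning forever.
    }
    opentelemetry::common::SpinLockMutex::fast_yield();
    if ((retry_waiting_times & 127) == 127)
    {
      std::this_thread::yield();
    }
  }
}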

Additional context
We think this only started once we upgraded from 1.13.0 to 1.14.1. Note that we did not previously pass a timeout, since we had not felt the need. After upgrading we noticed the processor was getting stuck exporting, and we added the timeout to try to avoid this (and as general good practice), but there seems to be some deeper issue causing the export to never complete.

I did set a breakpoint on the actual Export method, and it does appear to be called regularly, but the main thread blocking in ForceFlush never gets unblocked.
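
For reference, this is roughly how I verified it (the symbol name is shown for the gRPC exporter as an example; adjust for the exporter actually in use):

(gdb) break 'opentelemetry::v1::exporter::otlp::OtlpGrpcExporter::Export'
(gdb) continue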

Metadata

Labels: bug (Something isn't working), triage/accepted (Indicates an issue or PR is ready to be actively worked on)
