Conversation


@goldsborough goldsborough commented Jun 19, 2018

Creates a mechanism for globally defaulting TensorOptions properties.

The mechanism works primarily via two classes:

  1. DefaultTensorOptions, which provides thread-safe, mutually exclusive access to a single, global TensorOptions instance,
  2. OptionsGuard, which will set and reset this global TensorOptions instance.

Finally, the default constructor of TensorOptions copies its properties from the DefaultTensorOptions instance.

Usage is e.g.

{
  at::OptionsGuard guard(at::dtype(at::kInt));
  // in this scope, all tensors have int dtype
  auto tensor = at::empty({3, 4}); // = empty({3, 4}, at::kInt)
}

{
  at::OptionsGuard guard(at::layout(at::kSparse));
  // in this scope, all tensors will be sparse
  auto tensor = at::empty({3, 4}); // = empty({3, 4}, at::kSparse)
}

{
  at::OptionsGuard guard(at::requires_grad(true));
  // in this scope, all variables will have requires_grad = true
  auto tensor = torch::empty({3, 4}); // = empty({3, 4}, at::requires_grad())
}

Please scrutinize for thread safety and performance.

@colesbury @ezyang @zdevito

CC @ebetica

@goldsborough goldsborough force-pushed the default-tensor-options branch 3 times, most recently from 019b889 to 048e25d Compare June 20, 2018 02:54



ezyang commented Jun 20, 2018

If you want a design that handles both global and thread local state, I suggest taking a look at https://fb.quip.com/w9G9AlEXbPlq (copy pasted below), which is a proposal of mine on how to put it together:

c10: The global context

This proposal is not slated for immediate implementation (instead, we will simply achieve API parity)

Motivation

In most real-world programs, there is a need for some sort of global state which can be easily accessed anywhere in the program. Common examples of such state include:

  • Choice of CPU/CUDA memory allocator, and flags for said memory allocator
  • Random number generator (PyTorch only; Caffe2 has per-operator RNGs)
  • Cached information about the system (e.g., cudaDeviceProperties, number of available devices, etc)
  • The dispatch table (projected in c10)

Global state makes it difficult to have two copies of a library in memory, or to run nearby computations with different settings for the context. Thus, there is a desire to encapsulate this state in some sort of context object.

Prior Art

PyTorch/ATen. ATen has a Context object, which at present is a single static global variable that user code frequently accesses directly. Context objects are propagated because every Tensor stores a pointer to the Context object responsible for creating it; “proper” use says that you should use this particular Context object, although a lot of code subsequently written for ATen has not followed this constraint.

We (Edward, Sam) don't think this model (context passing via Tensor objects) makes semantic sense for what a context object should “be”. Additionally, it seems in practice very difficult to educate developers how to pass around context objects directly, when there is a very attractively named “getGlobalContext()” function which they can use to synthesize a context from scratch.

Caffe2. Caffe2 simply uses global variables as necessary. For example, the CPUAllocator is a static unique pointer defined in caffe2/core/allocator.cc; additionally, Caffe2 makes use of gflags (or Caffe2's portable copy thereof), which programs the Caffe2 static registry, which is itself a static library. No context is propagated with actual tensors; instead, per-tensor configuration like the data deleter is stored via the shared_ptr.

Design Space

To determine how we handle the global context, we have to make a few decisions:

  • Do we want to support having multiple copies of c10 in the same address space?
  • Do we want to allow changes to the context in the middle of program execution (rather than solely at static initialization time)?
  • Do we want to allow multiple changes to the context atomically?
  • Should it be possible to change the context for all threads of execution?

Proposal

We offer the following proposal, given that we answered YES to each of the previous questions. Simpler designs are possible if you decide you don't care about some of these properties.

The context is mediated by two levels of indirection: a thread-local context pointer, which points to an atomic memory cell containing a pointer to the actual struct. The Context struct itself is immutable.

#include <memory>
#include <atomic>
#include <mutex>
#include <cstdlib>

// immutable class which contains all of the context data
class ContextImpl {};
// Unfortunately, we can't use std::shared_ptr<ContextImpl>, as atomic accesses
// to this pointer are implemented with a mutex. If that were possible, we could
// also keep accurate reference counts on ContextImpl, allowing us to deallocate
// it when it became unreferenced.
using Context = std::atomic<ContextImpl*>;
using ContextPtr = std::shared_ptr<Context>;

static thread_local ContextPtr thread_context = nullptr;
static ContextImpl global_context;
static ContextPtr global_context_ptr = nullptr;
static std::mutex global_context_mutex;

// THREAD STATE                         GLOBAL CONTEXT POINTER
// TTTTTTTT (thread local state)        MMMMMMMM (mutex protected global state)
//    \ shared_ptr                      /
//     \---> CONTEXT (MEMORY CELL) <---/
//            AAAAAAAA (atomically accessed word)
//                  \ pointer
//                   \--------> CONTEXT IMPL
//                              IIIIIIII (immutable memory)
//                              IIIIIIII
//                              IIIIIIII
//                               ...


ContextPtr getGlobalContextSlow() {
  std::lock_guard<std::mutex> guard(global_context_mutex);
  if (!global_context_ptr) {
    // Allocate JUST the mutable context memory cell.  This cell can
    // be garbage collected when all references to it go away.
    global_context_ptr = std::make_shared<Context>(&global_context);
  }
  return global_context_ptr;
};

ContextImpl* getContext() {
  if (!thread_context) thread_context = getGlobalContextSlow();
  return thread_context->load();
}

/* Fast path assembly looks like this:

getContext():
  push rbp
  push rbx
  cmp BYTE PTR fs:__tls_guard@tpoff, 0
  jne .L74
  sub rsp, 24
  ...
L74:
  mov rax, QWORD PTR fs:thread_context@tpoff
  test rax, rax
  je .L106
  mov rax, QWORD PTR [rax]
  add rsp, 24
  pop rbx
  pop rbp
  ret
*/

https://godbolt.org/g/GPZPf6

There are a few key characteristics of the implementation above:

  • The most difficult thing to account for is how to atomically handle multiple changes to the Context in a thread safe manner. If you don't care about handling multiple changes, there is an obvious alternative: have the Context object itself provide atomic operations for setting/getting individual properties. We think that it may be necessary to apply multiple changes in practice; e.g., the allocator is typically two fields: a malloc() and a free() implementation. To make this possible, the design makes the ContextImpl immutable, and introduces another level of indirection (now called Context) which is an atomic pointer which can be atomically changed when required. Now a swap of the context implementation can apply arbitrary changes without affecting existing readers.
  • The thread local state is required to allow individual threads to locally change the context under use to something else. All threads default to the global state otherwise.
  • As written, the above code requires the lifetime of ContextImpl objects to be managed externally to this mechanism. With atomic_shared_ptr (Concurrency TS, now std::atomic<std::shared_ptr<T>> in C++20), we could also efficiently reference count ContextImpl objects, allowing them to be promptly destructed.

Open questions

  • The design of the context interacts closely with the design of the dispatch table, which lives in the context. In particular, an immutable context implies that the dispatch table must also be immutable. The alternative choice is to design a hash table which supports unsynchronized reads and rare, expensive writes. The primary hazard in this case is handling when the hash table needs to be resized; this implies you need some sort of pointer which can be swapped.
    • Zach has suggested that a good simplifying assumption is that, if an operation is added to the dispatch table at time X, any contexts from before X simply do not see the method. This allows for the implementation strategy where you have a direct, unsynchronized pointer to the dispatch table for method invocation on a Tensor, and a slower, global indirection to the dispatch table which can be updated atomically when changes are made.
  • Sebastian Messmer suggests that we should not have a global state pointer with which threads lacking a thread_context setting are initialized, and should instead ask users to always reinitialize the state when they spawn new threads.
    • Ed's concern: don't want to have to call at::init() initially.
  • With the dispatch table, we would like operators to be registered immediately upon dlopen(). However, this poses a problem for thread-local dispatch tables: if dlopen() is called multiple times, the library is only loaded the first time. This means you can't “reload” a library to get it into a second dispatch table; you would have to do this manually. So it may make the most sense to give the dispatch table a visibility equivalent to the dynamic linker's, i.e., make it global state.

@goldsborough goldsborough force-pushed the default-tensor-options branch from 048e25d to 06b665f Compare June 20, 2018 21:20
@goldsborough

@ezyang I made it thread local, please take another look :)


ezyang commented Jun 20, 2018

@pytorchbot retest this please

@goldsborough goldsborough force-pushed the default-tensor-options branch 2 times, most recently from a4167cf to e2e65a4 Compare June 21, 2018 19:21
@goldsborough goldsborough force-pushed the default-tensor-options branch from e2e65a4 to 82cb0fe Compare June 21, 2018 20:42
/// - layout: kStrided,
/// - requires_grad: false
TensorOptions() = default;
explicit TensorOptions(bool use_thread_local_default_options);



ezyang commented Jun 22, 2018

Much better! But ask around about the "early" versus "late" binding of defaults.

@goldsborough

This is ready to be merged I think. @ezyang

@goldsborough goldsborough merged commit a5df8ec into pytorch:master Jun 25, 2018
@goldsborough goldsborough deleted the default-tensor-options branch June 25, 2018 04:28