Skip to content

Proper support for distributed computing, parallelism and concurrency #1868

@dumblob

Description

@dumblob

There is one thing in V I do care a lot about - parallelism & concurrency for seamless local and distributed programming.

Note, I didn't want to write this to #561 as the scope I'm describing is broader.

I'm confident every new language must provide first-class support for both parallelism and concurrency. Concurrency to allow one physical CPU thread do cooperative scheduling and parallelism to allow spawning the same go/v routine on several physical threads. This is actually what go lang does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads) except for always spawning just one go routine disregarding the number of physical CPU cores (which is where V should improve on - see below).

Why both concurrency and parallelism and not either of them?

In the world of big.LITTLE, GPU/FPGA offloading, CPUs with tens/hundreds of physical cores (and hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, mobile & close vicinity networks versus optical fiber networks etc. it's inevitable to dynamically schedule computation directly in runtime of each application (not just operating system wide which is too coarse and thus inefficient and not just cooperatively which would use just one physical CPU core per application which makes sense only for power saving scenarios, but nowhere else).

We have instruction-level paralellism covered (it's not yet present in V - see e.g. my rant about restrict - but it's a "solved problem" in todays compilers). The rest of the parallelism is just about "how close can we get to pure dataflow programming". Which appears to be kind of difficult.

Because instruction-level paralelism is solved, the best approach nowadays seems to be to:

  1. write as much serial code which allows being highly optimized by instruction-level parallelism at once in one go/v routine (typically a tight loop or one tree of nested tight loops or a slow I/O or offloading to GPU/FPGA or similar) - the outermost construct of this go routine will be always an endless loop yielding (calling the cooperative/parallel scheduler) every N iterations

  2. then schedule these go/v routines in cooperative manner on one physical CPU core with fixed-size queue (single-producer-single-consumer "SPSC" - for performant implementations see discussion below) in between these go/v routines (this queue is called a channel in go lang); the queue size might be between the size of L1 and 2*L2 cache as suggested in paragraph 3.3 in Analyzing Efficient Stream Processing on Modern Hardware.

  3. and if any go/v routine shall become a bottleneck during computation (i.e. consumer is slower than producer for some time - this can be easily monitored by the cooperative/parallel scheduler as metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core) including all the plumbing work (e.g. spawning an additional go routine acting as multiplexer getting the producers outgoing data and multiplexing it to the original as well as the new instance of the consumer; and spawning a demultiplexer go routine to join the outgoing data from both consumers) - this everything would the cooperative/parallel scheduler do.

  4. and if the queues shall become almost empty for some time, remove the multiplexers and demultiplexers.

The cooperative/parallel scheduler as referenced above shall be a user-definable routine (if not present, a default built-in scheduler working similarly like described above will be used). This scheduler shall run preemptively (in its own thread?) and get as input arguments at least the following: thread handles, pointers to all go/v routines and corresponding queues between them, pointer to the built-in scheduler (in case the user just wanted to wrap it), and pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being shortly connected to a cluster of other computers or smartphones and offload some slow algorithm there etc.). See e.g. MultiQueues.

This way one can write application once and deploy it to your grandmas wrist watches as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant system (imagine Erlang which supports even an online update of the whole application in memory while serving all clients under full load without interruption!). Also as you may have noticed, go lang does not scale first because there is no spawning of redundant go routines (those which will be bottle necks) to other physical CPU cores and second because go doesn't have a dynamic user-influencable scheduler (taking into account advanced statistics, power consumption, etc.).

There is one catch though. If implemented naively, then such elasticity doesn't guarantee global ordering among channels/queues (ordering works only inside of one channel) and real life scenarios in IT are unfortunately more often than not relying on ordering. This has to be accounted for and will require user intervention (i.e. be syntactically explicit). Either in the form of knowledge where to (a) insert "tagging" of items before they'll be spilled among different channels and (b) where the "tagging" will be removed and used to assemble the original ordering. Or in the form of assigning a strict priority to each channel. Or using other scheme.

See also lock-free and wait-free datastructures (state of the art as of now) which might be handy for an efficient implementation of parallelism.

IMHO we could actually postpone the implementation of concurrency and stick with just parallelism (as it's easier to implement than the interleaving between concurrency and parallelism as outlined above), implement the queue and a basic scheduler and first then (maybe even after V 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we actually already have a good starting point.

Table of parallelism possibilities we have nowadays (just for reference):

parallelism kind suited for execution duration latency overhead suited for how frequent execution startup internode bandwidth quality has influence special parallel SW architecture required requirements for execution & deployment
CPU non-JIT vectorization on a CPU core very short to long ~none frequent none (it's just 1 node) no binary for the particular CPU
CPU JIT vectorization on a CPU core short to long low moderate to frequent none (it's just 1 node) no app source code + VM binary for the particular CPU
CPU cores utilization (thread, process, ...) long low to moderate moderate lower to none (it's just 1 node) yes (except: pure objects as actors or alike) binary for the particular CPU
accelerators (GPU, FPGA, ...) long low to moderate moderate lower to none (it's just 1 node) yes (except: array/matrix/tensor/...-based languages) binary for the particular CPU and accelerator (GPU, FPGA, ...)
supercomputer with NUMA topology long moderate sporadic to moderate moderate (hypercube/... is really fast) yes (except: pure objects as actors, etc., array/matrix/tensor/...-based languages) binary for the particular CPU and accelerator (GPU, FPGA, ...)
LAN cluster long moderate to high sporadic to moderate moderate to high (can be fast, but not always) yes binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)
WAN cluster long high sporadic high (basically can't be that fast) yes binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)

(in practice these might overlap and/or be used simultaneously)

P.S. The dynamic scaling over the different kinds of parallelism as outlined in the above table is called "fine grained distributed computing". And if anything from the above proposal sounds crazy to you, then I can assure you, that the world doesn't sleep and there is at least one seamless (fully dynamic) solution offering first class fine grained distributed computing - Unison.


Other things to consider and/or track:

  1. As of March 2021 the fastest publicly known (and most comprehensive & general) parallelism & concurrency backend/library is Weave
  2. use PRNG (pseudo-random number generator) which copes well with dynamic threading (i.e. doesn't suffer from "same seed leads to same random numbers in all threads") - see e.g. splittable PRNGs such as JAX
  3. maybe infinite loop in rand a7c8483#r39632493
  4. g_str_buf with parallel access:
  5. FPU stuff needs special attention from the threading (work stealing) scheduler - see [Request] Optional per-thread startup procedure mratsim/weave#163
  6. under the hood, we might utilize some wait-free and lock-free algorithms (see e.g. https://github.com/pramalhe/ConcurrencyFreaks )
  7. reconsider deadlock/livelock/no_thread_running/... detection - see dead lock error mesage like in Go #10340
  8. incorporate a "sharded" RwMutex into the standard library and use it also for V's built-in "sharded" data structures (map etc.)
  9. memory models of multicore architectures (x86, ARM, POWER, ...) are still sometimes not fully formally defined but SW memory models are even worse at that 😮 https://research.swtch.com/mm

Metadata

Metadata

Assignees

Labels

Feature/Enhancement RequestThis issue is made to request a feature or an enhancement to an existing one.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions