Proper support for distributed computing, parallelism and concurrency

There is one thing in *V* I do care a lot about - parallelism & concurrency for seamless local and distributed programming.

Note, I didn't want to write this to https://github.com/vlang/v/issues/561 as the scope I'm describing is broader.

I'm confident every new language must provide first-class support for both parallelism and concurrency. Concurrency to allow one physical CPU thread do cooperative scheduling and parallelism to allow spawning the same go/v routine on several physical threads. This is actually what go lang does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads) except for always spawning just one go routine disregarding the number of physical CPU cores (which is where *V* should improve on - see below).

Why both concurrency and parallelism and not either of them?

In the world of big.LITTLE, GPU/FPGA offloading, CPUs with tens/hundreds of physical cores (and hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, mobile & close vicinity networks versus optical fiber networks etc. it's inevitable to dynamically schedule computation directly in runtime of each application (not just operating system wide which is too coarse and thus inefficient and not just cooperatively which would use just one physical CPU core per application which makes sense only for power saving scenarios, but nowhere else).

We have instruction-level paralellism covered (it's not yet present in *V* - see e.g. [my rant about `restrict`](https://github.com/vlang/v/issues/1839 ) - but it's a "solved problem" in todays compilers). The rest of the parallelism is just about "how close can we get to pure [dataflow programming](https://en.wikipedia.org/wiki/Dataflow_programming )". Which appears to be kind of difficult.

Because instruction-level paralelism is solved, the best approach nowadays seems to be to:

1. write as much serial code which allows being highly optimized by instruction-level parallelism at once in one go/v routine (typically a [tight loop](https://en.wiktionary.org/wiki/tight_loop ) or one tree of nested tight loops or a slow I/O or offloading to GPU/FPGA or similar) - the outermost construct of this go routine will be always an endless loop yielding (calling the cooperative/parallel scheduler) every N iterations

2. then schedule these go/v routines in cooperative manner on **one physical CPU core** with fixed-size queue ([single-producer-single-consumer "SPSC"](http://www.vitorian.com/x1/archives/370 ) - for performant implementations see discussion below) in between these go/v routines (this queue is called a *channel* in go lang); the queue size might be between the size of L1 and 2*L2 cache as suggested in [paragraph 3.3 in Analyzing Efficient Stream Processing on Modern Hardware](http://www.vldb.org/pvldb/vol12/p516-zeuch.pdf ).

3. and if any go/v routine shall become a bottleneck during computation (i.e. consumer is slower than producer for some time - this can be easily monitored by the cooperative/parallel scheduler as metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core) including all the plumbing work (e.g. spawning an additional go routine acting as multiplexer getting the producers outgoing data and multiplexing it to the original as well as the new instance of the consumer; and spawning a demultiplexer go routine to join the outgoing data from both consumers) - this everything would the cooperative/parallel scheduler do.

4. and if the queues shall become almost empty for some time, remove the multiplexers and demultiplexers.

The *cooperative/parallel scheduler* as referenced above shall be a **user-definable** routine (if not present, a default built-in scheduler working similarly like described above will be used). This scheduler shall run preemptively (in its own thread?) and get as input arguments at least the following: thread handles, pointers to all go/v routines and corresponding queues between them, pointer to the built-in scheduler (in case the user just wanted to wrap it), and pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being shortly connected to a cluster of other computers or smartphones and offload some slow algorithm there etc.). See e.g. [MultiQueues](https://arxiv.org/abs/1411.1209 ).

This way one can write application once and deploy it to your grandmas wrist watches as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant system (imagine Erlang which supports even an online update of the whole application in memory while serving all clients under full load without interruption!). Also as you may have noticed, go lang does not scale first because there is no spawning of redundant go routines (those which will be bottle necks) to other physical CPU cores and second because go doesn't have a dynamic user-influencable scheduler (taking into account advanced statistics, power consumption, etc.).

There is one catch though. If implemented naively, then such elasticity doesn't guarantee global ordering among channels/queues (ordering works only inside of one channel) and real life scenarios in IT are unfortunately more often than not relying on ordering. This has to be accounted for and will require user intervention (i.e. be syntactically **explicit**). Either in the form of knowledge where to (a) insert "tagging" of items before they'll be spilled among different channels and (b) where the "tagging" will be removed and used to assemble the original ordering. Or in the form of assigning a strict priority to each channel. Or using other scheme.

See also [lock-free and wait-free](https://github.com/pramalhe/ConcurrencyFreaks ) datastructures (state of the art as of now) which might be handy for an efficient implementation of parallelism.

IMHO we could actually postpone the implementation of *concurrency* and stick with just *parallelism* (as it's easier to implement than the interleaving between concurrency and parallelism as outlined above), implement the queue and a basic scheduler and first then (maybe even after *V* 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we actually already have a good starting point.

Table of parallelism possibilities we have nowadays (just for reference):

parallelism kind | suited for execution duration | latency overhead | suited for how frequent execution startup | internode bandwidth quality has influence | special parallel SW architecture required | requirements for execution & deployment
---| ---| ---| ---| ---| ---| ---
CPU non-JIT vectorization on a CPU core      | very short to long    | ~none            | frequent             | none (it's just 1 node)             | no                                 | binary for the particular CPU
CPU JIT vectorization on a CPU core          | short to long | low              | moderate to frequent | none (it's just 1 node)             | no                                 | app source code + VM binary for the particular CPU
CPU cores utilization (thread, process, ...) | long             | low to moderate  | moderate             | lower to none (it's just 1 node)      | yes (except: pure objects as actors or alike) | binary for the particular CPU
accelerators (GPU, FPGA, ...) | long             | low to moderate  | moderate             | lower to none (it's just 1 node)      | yes (except: array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...)
supercomputer with NUMA topology | long             | moderate         | sporadic to moderate | moderate (hypercube/... is really fast) | yes (except: pure objects as actors, etc., array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...)
LAN cluster                            | long             | moderate to high | sporadic to moderate | moderate to high (can be fast, but not always) | yes                                | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)
WAN cluster                            | long             | high             | sporadic             | high (basically can't be that fast)        | yes                                | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)

(in practice these might overlap and/or be used simultaneously)

P.S. The dynamic scaling over the different kinds of parallelism as outlined in the above table is called "fine grained distributed computing". And if anything from the above proposal sounds crazy to you, then I can assure you, that the world doesn't sleep and there is at least one seamless (fully dynamic) solution offering first class fine grained distributed computing - [Unison](http://unisonweb.org/posts/ ).

----
Other things to consider and/or track:
1. As of March 2021 the **fastest** publicly known (and most comprehensive & general) parallelism & concurrency backend/library is [Weave](https://github.com/mratsim/weave/commit/b092f7c7526b00ac3db74afe7311ae5b2e0031c7 )
2. use PRNG (pseudo-random number generator) which copes well with dynamic threading (i.e. doesn't suffer from "same seed leads to same random numbers in all threads") - see e.g. splittable PRNGs such as [JAX](https://github.com/google/jax/blob/master/design_notes/prng.md )
3. maybe infinite loop in `rand` https://github.com/vlang/v/commit/a7c84834f4ba4948b8102a05b603bf8d51484760#r39632493
4. `g_str_buf` with parallel access:
   - https://github.com/vlang/v/commit/778a1cc34ae5674c345d73e2a507bce9f77dea1b#commitcomment-36848204
   - https://github.com/vlang/v/commit/0f92800921c706de0bd872f7eff30047a99c734d
5. FPU stuff needs special attention from the threading (work stealing) scheduler - see https://github.com/mratsim/weave/issues/163
6. under the hood, we might utilize some wait-free and lock-free algorithms (see e.g. https://github.com/pramalhe/ConcurrencyFreaks )
7. reconsider deadlock/livelock/no_thread_running/... detection - see https://github.com/vlang/v/issues/10340
8. incorporate a ["sharded" RwMutex](https://github.com/kelindar/smutex ) into the standard library and use it also for V's built-in "sharded" data structures (`map` etc.)
9. memory models of multicore architectures (x86, ARM, POWER, ...) are still sometimes **not** fully formally defined but SW memory models are even worse at that :open_mouth: https://research.swtch.com/mm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Proper support for distributed computing, parallelism and concurrency #1868

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

parallelism kind	suited for execution duration	latency overhead	suited for how frequent execution startup	internode bandwidth quality has influence	special parallel SW architecture required	requirements for execution & deployment
CPU non-JIT vectorization on a CPU core	very short to long	~none	frequent	none (it's just 1 node)	no	binary for the particular CPU
CPU JIT vectorization on a CPU core	short to long	low	moderate to frequent	none (it's just 1 node)	no	app source code + VM binary for the particular CPU
CPU cores utilization (thread, process, ...)	long	low to moderate	moderate	lower to none (it's just 1 node)	yes (except: pure objects as actors or alike)	binary for the particular CPU
accelerators (GPU, FPGA, ...)	long	low to moderate	moderate	lower to none (it's just 1 node)	yes (except: array/matrix/tensor/...-based languages)	binary for the particular CPU and accelerator (GPU, FPGA, ...)
supercomputer with NUMA topology	long	moderate	sporadic to moderate	moderate (hypercube/... is really fast)	yes (except: pure objects as actors, etc., array/matrix/tensor/...-based languages)	binary for the particular CPU and accelerator (GPU, FPGA, ...)
LAN cluster	long	moderate to high	sporadic to moderate	moderate to high (can be fast, but not always)	yes	binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)
WAN cluster	long	high	sporadic	high (basically can't be that fast)	yes	binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)

Uh oh!

Proper support for distributed computing, parallelism and concurrency #1868

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions