You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
There is one thing in V I do care a lot about - parallelism & concurrency for seamless local and distributed programming.
Note, I didn't want to write this to #561 as the scope I'm describing is broader.
I'm confident every new language must provide first-class support for both parallelism and concurrency. Concurrency to allow one physical CPU thread do cooperative scheduling and parallelism to allow spawning the same go/v routine on several physical threads. This is actually what go lang does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads) except for always spawning just one go routine disregarding the number of physical CPU cores (which is where V should improve on - see below).
Why both concurrency and parallelism and not either of them?
In the world of big.LITTLE, GPU/FPGA offloading, CPUs with tens/hundreds of physical cores (and hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, mobile & close vicinity networks versus optical fiber networks etc. it's inevitable to dynamically schedule computation directly in runtime of each application (not just operating system wide which is too coarse and thus inefficient and not just cooperatively which would use just one physical CPU core per application which makes sense only for power saving scenarios, but nowhere else).
We have instruction-level paralellism covered (it's not yet present in V - see e.g. my rant about restrict - but it's a "solved problem" in todays compilers). The rest of the parallelism is just about "how close can we get to pure dataflow programming". Which appears to be kind of difficult.
Because instruction-level paralelism is solved, the best approach nowadays seems to be to:
write as much serial code which allows being highly optimized by instruction-level parallelism at once in one go/v routine (typically a tight loop or one tree of nested tight loops or a slow I/O or offloading to GPU/FPGA or similar) - the outermost construct of this go routine will be always an endless loop yielding (calling the cooperative/parallel scheduler) every N iterations
and if any go/v routine shall become a bottleneck during computation (i.e. consumer is slower than producer for some time - this can be easily monitored by the cooperative/parallel scheduler as metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core) including all the plumbing work (e.g. spawning an additional go routine acting as multiplexer getting the producers outgoing data and multiplexing it to the original as well as the new instance of the consumer; and spawning a demultiplexer go routine to join the outgoing data from both consumers) - this everything would the cooperative/parallel scheduler do.
and if the queues shall become almost empty for some time, remove the multiplexers and demultiplexers.
The cooperative/parallel scheduler as referenced above shall be a user-definable routine (if not present, a default built-in scheduler working similarly like described above will be used). This scheduler shall run preemptively (in its own thread?) and get as input arguments at least the following: thread handles, pointers to all go/v routines and corresponding queues between them, pointer to the built-in scheduler (in case the user just wanted to wrap it), and pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being shortly connected to a cluster of other computers or smartphones and offload some slow algorithm there etc.). See e.g. MultiQueues.
This way one can write application once and deploy it to your grandmas wrist watches as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant system (imagine Erlang which supports even an online update of the whole application in memory while serving all clients under full load without interruption!). Also as you may have noticed, go lang does not scale first because there is no spawning of redundant go routines (those which will be bottle necks) to other physical CPU cores and second because go doesn't have a dynamic user-influencable scheduler (taking into account advanced statistics, power consumption, etc.).
There is one catch though. If implemented naively, then such elasticity doesn't guarantee global ordering among channels/queues (ordering works only inside of one channel) and real life scenarios in IT are unfortunately more often than not relying on ordering. This has to be accounted for and will require user intervention (i.e. be syntactically explicit). Either in the form of knowledge where to (a) insert "tagging" of items before they'll be spilled among different channels and (b) where the "tagging" will be removed and used to assemble the original ordering. Or in the form of assigning a strict priority to each channel. Or using other scheme.
See also lock-free and wait-free datastructures (state of the art as of now) which might be handy for an efficient implementation of parallelism.
IMHO we could actually postpone the implementation of concurrency and stick with just parallelism (as it's easier to implement than the interleaving between concurrency and parallelism as outlined above), implement the queue and a basic scheduler and first then (maybe even after V 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we actually already have a good starting point.
Table of parallelism possibilities we have nowadays (just for reference):
parallelism kind
suited for execution duration
latency overhead
suited for how frequent execution startup
internode bandwidth quality has influence
special parallel SW architecture required
requirements for execution & deployment
CPU non-JIT vectorization on a CPU core
very short to long
~none
frequent
none (it's just 1 node)
no
binary for the particular CPU
CPU JIT vectorization on a CPU core
short to long
low
moderate to frequent
none (it's just 1 node)
no
app source code + VM binary for the particular CPU
binary for the particular CPU and accelerator (GPU, FPGA, ...)
supercomputer with NUMA topology
long
moderate
sporadic to moderate
moderate (hypercube/... is really fast)
yes (except: pure objects as actors, etc., array/matrix/tensor/...-based languages)
binary for the particular CPU and accelerator (GPU, FPGA, ...)
LAN cluster
long
moderate to high
sporadic to moderate
moderate to high (can be fast, but not always)
yes
binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)
WAN cluster
long
high
sporadic
high (basically can't be that fast)
yes
binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...)
(in practice these might overlap and/or be used simultaneously)
P.S. The dynamic scaling over the different kinds of parallelism as outlined in the above table is called "fine grained distributed computing". And if anything from the above proposal sounds crazy to you, then I can assure you, that the world doesn't sleep and there is at least one seamless (fully dynamic) solution offering first class fine grained distributed computing - Unison.
Other things to consider and/or track:
As of March 2021 the fastest publicly known (and most comprehensive & general) parallelism & concurrency backend/library is Weave
use PRNG (pseudo-random number generator) which copes well with dynamic threading (i.e. doesn't suffer from "same seed leads to same random numbers in all threads") - see e.g. splittable PRNGs such as JAX
There is one thing in V I do care a lot about - parallelism & concurrency for seamless local and distributed programming.
Note, I didn't want to write this to #561 as the scope I'm describing is broader.
I'm confident every new language must provide first-class support for both parallelism and concurrency. Concurrency to allow one physical CPU thread do cooperative scheduling and parallelism to allow spawning the same go/v routine on several physical threads. This is actually what go lang does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads) except for always spawning just one go routine disregarding the number of physical CPU cores (which is where V should improve on - see below).
Why both concurrency and parallelism and not either of them?
In the world of big.LITTLE, GPU/FPGA offloading, CPUs with tens/hundreds of physical cores (and hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, mobile & close vicinity networks versus optical fiber networks etc. it's inevitable to dynamically schedule computation directly in runtime of each application (not just operating system wide which is too coarse and thus inefficient and not just cooperatively which would use just one physical CPU core per application which makes sense only for power saving scenarios, but nowhere else).
We have instruction-level paralellism covered (it's not yet present in V - see e.g. my rant about
restrict- but it's a "solved problem" in todays compilers). The rest of the parallelism is just about "how close can we get to pure dataflow programming". Which appears to be kind of difficult.Because instruction-level paralelism is solved, the best approach nowadays seems to be to:
write as much serial code which allows being highly optimized by instruction-level parallelism at once in one go/v routine (typically a tight loop or one tree of nested tight loops or a slow I/O or offloading to GPU/FPGA or similar) - the outermost construct of this go routine will be always an endless loop yielding (calling the cooperative/parallel scheduler) every N iterations
then schedule these go/v routines in cooperative manner on one physical CPU core with fixed-size queue (single-producer-single-consumer "SPSC" - for performant implementations see discussion below) in between these go/v routines (this queue is called a channel in go lang); the queue size might be between the size of L1 and 2*L2 cache as suggested in paragraph 3.3 in Analyzing Efficient Stream Processing on Modern Hardware.
and if any go/v routine shall become a bottleneck during computation (i.e. consumer is slower than producer for some time - this can be easily monitored by the cooperative/parallel scheduler as metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core) including all the plumbing work (e.g. spawning an additional go routine acting as multiplexer getting the producers outgoing data and multiplexing it to the original as well as the new instance of the consumer; and spawning a demultiplexer go routine to join the outgoing data from both consumers) - this everything would the cooperative/parallel scheduler do.
and if the queues shall become almost empty for some time, remove the multiplexers and demultiplexers.
The cooperative/parallel scheduler as referenced above shall be a user-definable routine (if not present, a default built-in scheduler working similarly like described above will be used). This scheduler shall run preemptively (in its own thread?) and get as input arguments at least the following: thread handles, pointers to all go/v routines and corresponding queues between them, pointer to the built-in scheduler (in case the user just wanted to wrap it), and pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being shortly connected to a cluster of other computers or smartphones and offload some slow algorithm there etc.). See e.g. MultiQueues.
This way one can write application once and deploy it to your grandmas wrist watches as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant system (imagine Erlang which supports even an online update of the whole application in memory while serving all clients under full load without interruption!). Also as you may have noticed, go lang does not scale first because there is no spawning of redundant go routines (those which will be bottle necks) to other physical CPU cores and second because go doesn't have a dynamic user-influencable scheduler (taking into account advanced statistics, power consumption, etc.).
There is one catch though. If implemented naively, then such elasticity doesn't guarantee global ordering among channels/queues (ordering works only inside of one channel) and real life scenarios in IT are unfortunately more often than not relying on ordering. This has to be accounted for and will require user intervention (i.e. be syntactically explicit). Either in the form of knowledge where to (a) insert "tagging" of items before they'll be spilled among different channels and (b) where the "tagging" will be removed and used to assemble the original ordering. Or in the form of assigning a strict priority to each channel. Or using other scheme.
See also lock-free and wait-free datastructures (state of the art as of now) which might be handy for an efficient implementation of parallelism.
IMHO we could actually postpone the implementation of concurrency and stick with just parallelism (as it's easier to implement than the interleaving between concurrency and parallelism as outlined above), implement the queue and a basic scheduler and first then (maybe even after V 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we actually already have a good starting point.
Table of parallelism possibilities we have nowadays (just for reference):
(in practice these might overlap and/or be used simultaneously)
P.S. The dynamic scaling over the different kinds of parallelism as outlined in the above table is called "fine grained distributed computing". And if anything from the above proposal sounds crazy to you, then I can assure you, that the world doesn't sleep and there is at least one seamless (fully dynamic) solution offering first class fine grained distributed computing - Unison.
Other things to consider and/or track:
randa7c8483#r39632493g_str_bufwith parallel access:mapetc.)