Description
Opening this issue to follow up on the operation-specific APIs discussion of the 3/18 WebML CG call. @pyu10055 @wchao1115 @anssiko @jbingham, please take a look.
Use case
This is one scenario of the framework op-level execution use case (more details can be found in the operation-specific API proposal). A JavaScript ML framework executes ops on the CPU device with WebAssembly. For compute-intensive ops, such as conv2d or matmul, the framework also wants to use the WebNN API to execute the op (as a single-op MLGraph) with ML-specific instructions, such as Vector Neural Network Instructions (VNNI), on the same CPU device.
Requirements
WebNN should allow frameworks to create an MLContext for the CPU device. This would avoid unnecessary data copying across devices when frameworks use WebAssembly on the CPU to execute other ops.
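For illustration, requesting a CPU-bound context might look like the following sketch. It assumes the `navigator.ml.createContext({ deviceType })` shape from the WebNN spec draft; exact option names may differ across spec versions:

```javascript
// Sketch: request an MLContext bound to the CPU device so that buffers can
// be shared with WebAssembly-executed ops without cross-device copies.
// Assumes the navigator.ml.createContext({ deviceType }) shape from the
// WebNN spec draft; names may differ in other spec versions.
function createCpuContext() {
  if (typeof navigator === 'undefined' || !('ml' in navigator)) {
    return null; // WebNN not available in this environment
  }
  return navigator.ml.createContext({ deviceType: 'cpu' });
}
```

A framework would create this context once and build all of its single-op graphs against it, so tensors stay on the CPU device alongside the WebAssembly heap.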
WebNN should allow frameworks to control when the output data is available for access. This would avoid unnecessary tensor layout conversions between the native ML API and WebNN. Some background:
- Some native ML APIs use hardware dependent memory layout for acceleration, for example oneDNN uses different blocked memory layouts for better vectorization and cache reuse on different platforms.
- The memory layout conversions are expensive.
- Frameworks may use the WebNN API to execute multiple ops (via multiple single-op MLGraphs) without accessing the intermediate results between them.
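To make the first bullet concrete, here is a sketch of converting a plain NCHW tensor into a oneDNN-style blocked layout (nChw8c: channels grouped into blocks of 8 so that one SIMD instruction can load 8 channel values for the same spatial position). The block size and naming follow oneDNN's convention, but the code itself is only illustrative:

```javascript
// Illustrative sketch of an NCHW -> blocked (nChw8c-style) conversion.
// In the blocked layout, `block` channel values for the same (h, w)
// position sit contiguously, which is what enables vectorization.
// This per-element copy is exactly the kind of cost frameworks want
// to avoid paying around every op.
function nchwToBlocked(src, N, C, H, W, block = 8) {
  const cBlocks = Math.ceil(C / block);
  // Padded channel slots (when C % block != 0) stay zero.
  const dst = new Float32Array(N * cBlocks * H * W * block);
  for (let n = 0; n < N; n++) {
    for (let c = 0; c < C; c++) {
      const cb = (c / block) | 0; // which channel block
      const ci = c % block;       // position inside the block
      for (let h = 0; h < H; h++) {
        for (let w = 0; w < W; w++) {
          const srcIdx = ((n * C + c) * H + h) * W + w;
          const dstIdx = (((n * cBlocks + cb) * H + h) * W + w) * block + ci;
          dst[dstIdx] = src[srcIdx];
        }
      }
    }
  }
  return dst;
}
```

Every element is touched once on the way in and once on the way out, so doing this round trip per op roughly doubles the memory traffic of a memory-bound op.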
For example, a user of TensorFlow.js may execute three conv2d ops but only access the output of the last one:

```js
c = tf.conv2d(a, b);
e = tf.conv2d(c, d);
h = tf.conv2d(f, g);
output = await h.data();
```

A potential WebNN implementation would only need to do the memory layout conversion and put the data into an ArrayBufferView when `h.data()` is invoked.
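This deferred materialization can be sketched as follows. `LazyTensor` and `runOp` are illustrative stand-ins, not WebNN or TensorFlow.js API, and the conversion counter exists only to make the deferral observable:

```javascript
// Sketch: each op returns a handle to a (simulated) device-resident result;
// the expensive layout conversion into an ArrayBufferView happens only when
// data() is called, never for intermediate results.
// LazyTensor and runOp are illustrative, not part of WebNN or TensorFlow.js.
let conversions = 0; // counts device -> ArrayBufferView conversions

class LazyTensor {
  constructor(deviceResult) {
    this.deviceResult = deviceResult; // stays in device layout until read
  }
  data() { // synchronous here for simplicity; tf.js's data() is async
    conversions++; // the layout conversion would happen here
    return Float32Array.from(this.deviceResult);
  }
}

// Stand-in for dispatching one single-op graph to the device.
function runOp(a, b) {
  const aVals = a instanceof LazyTensor ? a.deviceResult : a;
  const bVals = b instanceof LazyTensor ? b.deviceResult : b;
  return new LazyTensor(aVals.map((x, i) => x * bVals[i]));
}

// Two chained ops, but only the final output is materialized:
const c = runOp([1, 2], [3, 4]);
const e = runOp(c, [5, 6]);
const out = e.data();
// conversions === 1: the intermediate c never paid the conversion cost
```

The key design point is that the handle returned by each op keeps the data in whatever layout the device prefers, and the framework decides when (and whether) to pay for the conversion back.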