Winch: Add splat instructions to x64 using AVX2 #10028
Conversation
winch/codegen/src/isa/x64/asm.rs (outdated)
@@ -168,6 +168,18 @@ impl From<OperandSize> for Option<ExtMode> {
    }
}

impl SplatKind {
    /// Get the operand size for `vpbroadcast`.
    pub(super) fn vpbroadcast_operand_size(&self) -> OperandSize {
Instead of making this specific to x64, can we instead call this `lane_size` or something along those lines and move the implementation to masm.rs? I don't think we'd only need this for x64.
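A minimal sketch of what that rename could look like, assuming `SplatKind` lives in (or moves to) masm.rs; the variant names here are assumptions based on this discussion, not the PR's final definitions:

```rust
// Hypothetical sketch of the suggested `lane_size` helper in masm.rs;
// variant names are assumptions, not the merged code.
impl SplatKind {
    /// Operand size of a single lane of the splatted vector.
    pub fn lane_size(&self) -> OperandSize {
        match self {
            SplatKind::I8x16 => OperandSize::S8,
            SplatKind::I16x8 => OperandSize::S16,
            SplatKind::I32x4 | SplatKind::F32x4 => OperandSize::S32,
            SplatKind::I64x2 | SplatKind::F64x2 => OperandSize::S64,
        }
    }
}
```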
winch/codegen/src/isa/x64/masm.rs (outdated)
// Results in the first 4 bytes and second 4 bytes being
// swapped and then the swapped bytes being copied.
// [d0, d1, d2, d3, d4, d5, d6, d7, ...] yields
// [d4, d5, d6, d7, d0, d1, d2, d3, d4, d5, d6, d7, d0, d1, d2, d3].
let mask = 0b0100_0100;
If I'm not wrong, this is the exact same comment/mask value used in `load_splat`. Could we extract this as a helper?
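For context on what the shared mask encodes, here is a small standalone model (not Winch code) of pshufd/vpshufd lane selection: each 2-bit field of the immediate picks a 32-bit source lane for the corresponding destination lane. How the bytes end up arranged also depends on where the preceding load left the value in the source register.

```rust
// Standalone illustration of pshufd-style 32-bit lane selection, per the
// x86 definition; this is a model for discussion, not part of the PR.
fn pshufd(src: [u32; 4], imm8: u8) -> [u32; 4] {
    let mut dst = [0u32; 4];
    for lane in 0..4 {
        // Bits [2*lane + 1 : 2*lane] of the immediate select a source lane.
        let sel = ((imm8 >> (2 * lane)) & 0b11) as usize;
        dst[lane] = src[sel];
    }
    dst
}

fn main() {
    // 0b0100_0100 selects source lanes [0, 1, 0, 1], replicating the low
    // 64 bits (two 32-bit lanes) across the whole vector.
    assert_eq!(pshufd([10, 11, 12, 13], 0b0100_0100), [10, 11, 10, 11]);
}
```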
winch/codegen/src/masm.rs (outdated)
/// Takes the int in a `src` operand and replicates it across lanes of
/// `size` in a destination result.
fn splat_int(&mut self, context: &mut CodeGenContext<Emission>, size: SplatKind) -> Result<()>;

/// Takes the `src` operand and replicates it across lanes of `size` in
/// `dst`.
fn splat(&mut self, dst: WritableReg, src: RegImm, size: SplatKind) -> Result<()>;
Taking a look at the callsites, and trying to think about future backends, I'm tempted to suggest adding a single entry here, `splat`, defined as

fn splat(&mut self, context: &mut CodeGenContext, kind: SplatKind) {
    ...
}

That's potentially the most flexible approach, and it would make the definition more uniform across backends: as far as I can tell, the only reason we have a special `splat_int` is because of the requirements in x64.

For that suggestion to work, though, we'd probably need to extend `SplatKind` to encode the source type (e.g., `SplatKind::I8x16`) -- I don't think this changes anything fundamental, since we'd still be able to derive the `lane_size`.
The one thing with changing `SplatKind` to encode an `I` or `F` is that the `v128.load*_splat` instructions aren't necessarily operating on integers or floats; they operate on whatever was stored in memory. That said, it should still be fine to have the visitor specify a `SplatKind::I32x4` for `v128.load32_splat` even if it's operating on 4 floats; the result is the same either way.
To be explicit, we could rename the current `SplatKind` to `SplatLoadKind` and introduce the type-aware version as `SplatKind`. That would tighten the implementation and keep the semantics as close as possible to the Wasm semantics.
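A rough sketch of how that split might look; the names and variants here are assumptions drawn from this thread, not the merged code:

```rust
// Hypothetical shapes for the suggested split; not the PR's final definitions.

/// Lane width for `v128.load*_splat`, which only cares about how many bytes
/// are read from memory (the role of the current `SplatKind`).
pub enum SplatLoadKind {
    S8,
    S16,
    S32,
    S64,
}

/// Type-aware kind consumed by a single `splat` masm entry point; the lane
/// size can still be derived from the variant.
pub enum SplatKind {
    I8x16,
    I16x8,
    I32x4,
    I64x2,
    F32x4,
    F64x2,
}
```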
match self {
    Imm::I32(n) => n.to_le_bytes().to_vec(),
    Imm::I64(n) => n.to_le_bytes().to_vec(),
    Imm::F32(n) => n.to_le_bytes().to_vec(),
    Imm::F64(n) => n.to_le_bytes().to_vec(),
    Imm::V128(n) => n.to_le_bytes().to_vec(),
Could you return a slice here instead? The constant pool will copy the bytes into a `Vec<u8>`, but returning a slice will at least prevent heap allocations in case this gets used for anything else that's not constant-pool related.
To be more specific, using `as_slice()` instead of `to_vec()`, similar to the other uses of `add_constant`.
If I use `as_slice()` and return a `&[u8]`, I get this message from the compiler: `cannot return value referencing temporary value`. That's likely because the array created by `to_le_bytes()` gets dropped when the method returns. The arrays are also of different lengths between the variants. We could maybe use some unsafe code to create a reference to the underlying bytes, but otherwise I'm not sure how to avoid the heap allocation.
Given that this is not used for anything other than the constant pool, I don't think we need to introduce any unsafe code. Perhaps add a comment stating that this method heap-allocates and is intended to be used mostly with the constant pool?
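A small standalone sketch of the constraint discussed above (the `Imm` enum here is a simplified stand-in, not Winch's actual definition): `to_le_bytes()` returns a fixed-size array owned by the method's stack frame, and the variants produce arrays of different lengths, so a borrowed `&[u8]` can't be returned without extra machinery; the heap-allocating `Vec<u8>` plus a clarifying comment is the pragmatic choice.

```rust
// Simplified stand-in for the discussion above; not Winch's actual `Imm`.
enum Imm {
    I32(u32),
    I64(u64),
}

impl Imm {
    /// Note: heap-allocates; intended mostly for constant-pool use.
    fn to_bytes(&self) -> Vec<u8> {
        match self {
            // `to_le_bytes()` yields a `[u8; 4]` / `[u8; 8]` temporary, so a
            // `&[u8]` borrowed from it would not outlive this method.
            Imm::I32(n) => n.to_le_bytes().to_vec(),
            Imm::I64(n) => n.to_le_bytes().to_vec(),
        }
    }
}

fn main() {
    assert_eq!(Imm::I32(1).to_bytes(), vec![1, 0, 0, 0]);
}
```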
Part of #8093. Implements the following instructions on x64 with AVX2 support:
i8x16.splat
i16x8.splat
i32x4.splat
i64x2.splat
f32x4.splat
f64x2.splat