[HLSL][DXIL][SPIRV] QuadReadAcrossY intrinsic support #187440

Merged
bob80905 merged 2 commits into llvm:main from kcloudy0717:kcloudy0717/QuadReadAcrossY
Mar 25, 2026
@kcloudy0717
Contributor

This PR adds QuadReadAcrossY intrinsic support in HLSL with codegen for both DirectX and SPIRV backends. Resolves #99176.

  • Implement the QuadReadAcrossY clang builtin (__builtin_hlsl_quad_read_across_y) in Builtins.td
  • Link the QuadReadAcrossY clang builtin in hlsl_alias_intrinsics.h
  • Add sema checks for QuadReadAcrossY to SemaHLSL::CheckBuiltinFunctionCall in SemaHLSL.cpp
  • Add codegen for QuadReadAcrossY to EmitHLSLBuiltinExpr in CGHLSLBuiltins.cpp
  • Add codegen tests to clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
  • Add sema tests to clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
  • Create the int_dx_quad_read_across_y intrinsic in IntrinsicsDirectX.td
  • Map int_dx_quad_read_across_y to the QuadOp DXIL op (opcode 123) with QuadOpKind_ReadAcrossY in DXIL.td
  • Create the QuadReadAcrossY.ll test in llvm/test/CodeGen/DirectX/ and extend ShaderFlags/wave-ops.ll
  • Create the int_spv_quad_read_across_y intrinsic in IntrinsicsSPIRV.td
  • In SPIRVInstructionSelector.cpp, lower int_spv_quad_read_across_y through selectQuadSwap (direction 1) in SPIRVInstructionSelector::selectIntrinsic
  • Create SPIR-V backend test case in llvm/test/CodeGen/SPIRV/hlsl-intrinsics/QuadReadAcrossY.ll
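
For readers unfamiliar with quad operations, here is a minimal host-side sketch of what the intrinsic is expected to compute. It assumes the usual 2x2 quad layout (lanes 0 and 1 on the top row, lanes 2 and 3 on the bottom row), so a read across Y pairs lane 0 with 2 and lane 1 with 3. The names `QuadDirection`, `quadPartner`, and `quadReadAcross` are hypothetical and exist only for this illustration; they are not part of the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical enum for this sketch. The values mirror the direction/kind
// operand used in this PR: 0 for across X, 1 for across Y.
enum class QuadDirection : unsigned { AcrossX = 0, AcrossY = 1 };

// Quad-local index of the lane that `Lane` reads from, assuming the layout
//   0 1
//   2 3
// so that X flips bit 0 and Y flips bit 1 of the index.
inline std::size_t quadPartner(std::size_t Lane, QuadDirection Dir) {
  assert(Lane < 4 && "a quad has exactly four lanes");
  return Lane ^ (Dir == QuadDirection::AcrossX ? 1u : 2u);
}

// Simulates one quad: every lane receives the value held by its partner.
template <typename T>
std::array<T, 4> quadReadAcross(const std::array<T, 4> &Values,
                                QuadDirection Dir) {
  std::array<T, 4> Result{};
  for (std::size_t Lane = 0; Lane < Values.size(); ++Lane)
    Result[Lane] = Values[quadPartner(Lane, Dir)];
  return Result;
}
```

With lane values {10, 20, 30, 40}, `quadReadAcross(v, QuadDirection::AcrossY)` swaps the two rows, while `AcrossX` swaps the two columns.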

@llvmbot added the backend:X86, clang:frontend, clang:headers, clang:codegen, backend:DirectX, HLSL, backend:SPIR-V, and llvm:ir labels Mar 19, 2026
@llvmbot
Member

llvmbot commented Mar 19, 2026

@llvm/pr-subscribers-backend-spir-v
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-ir

Author: Kai (kcloudy0717)

Changes

Patch is 22.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/187440.diff

16 Files Affected:

  • (modified) clang/include/clang/Basic/Builtins.td (+6)
  • (modified) clang/lib/CodeGen/CGHLSLBuiltins.cpp (+7)
  • (modified) clang/lib/CodeGen/CGHLSLRuntime.h (+1)
  • (modified) clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h (+99)
  • (modified) clang/lib/Sema/SemaHLSL.cpp (+2-1)
  • (added) clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl (+46)
  • (added) clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl (+28)
  • (modified) llvm/include/llvm/IR/IntrinsicsDirectX.td (+1)
  • (modified) llvm/include/llvm/IR/IntrinsicsSPIRV.td (+1)
  • (modified) llvm/lib/Target/DirectX/DXIL.td (+4)
  • (modified) llvm/lib/Target/DirectX/DXILShaderFlags.cpp (+1)
  • (modified) llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp (+1)
  • (modified) llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp (+3)
  • (added) llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll (+87)
  • (modified) llvm/test/CodeGen/DirectX/ShaderFlags/wave-ops.ll (+7)
  • (added) llvm/test/CodeGen/SPIRV/hlsl-intrinsics/QuadReadAcrossY.ll (+44)
diff --git a/clang/include/clang/Basic/Builtins.td b/clang/include/clang/Basic/Builtins.td
index 8e002c5b900aa..5a1ab7197b4a8 100644
--- a/clang/include/clang/Basic/Builtins.td
+++ b/clang/include/clang/Basic/Builtins.td
@@ -5276,6 +5276,12 @@ def HLSLQuadReadAcrossX : LangBuiltin<"HLSL_LANG"> {
   let Prototype = "void(...)";
 }
 
+def HLSLQuadReadAcrossY : LangBuiltin<"HLSL_LANG"> {
+  let Spellings = ["__builtin_hlsl_quad_read_across_y"];
+  let Attributes = [NoThrow, Const];
+  let Prototype = "void(...)";
+}
+
 def HLSLClamp : LangBuiltin<"HLSL_LANG"> {
   let Spellings = ["__builtin_hlsl_elementwise_clamp"];
   let Attributes = [NoThrow, Const, CustomTypeChecking];
diff --git a/clang/lib/CodeGen/CGHLSLBuiltins.cpp b/clang/lib/CodeGen/CGHLSLBuiltins.cpp
index 22293c5983d89..98edd479c02a5 100644
--- a/clang/lib/CodeGen/CGHLSLBuiltins.cpp
+++ b/clang/lib/CodeGen/CGHLSLBuiltins.cpp
@@ -1402,6 +1402,13 @@ Value *CodeGenFunction::EmitHLSLBuiltinExpr(unsigned BuiltinID,
                                &CGM.getModule(), ID, {OpExpr->getType()}),
                            ArrayRef{OpExpr}, "hlsl.quad.read.across.x");
   }
+  case Builtin::BI__builtin_hlsl_quad_read_across_y: {
+    Value *OpExpr = EmitScalarExpr(E->getArg(0));
+    Intrinsic::ID ID = CGM.getHLSLRuntime().getQuadReadAcrossYIntrinsic();
+    return EmitRuntimeCall(Intrinsic::getOrInsertDeclaration(
+                               &CGM.getModule(), ID, {OpExpr->getType()}),
+                           ArrayRef{OpExpr}, "hlsl.quad.read.across.y");
+  }
   case Builtin::BI__builtin_hlsl_elementwise_sign: {
     auto *Arg0 = E->getArg(0);
     Value *Op0 = EmitScalarExpr(Arg0);
diff --git a/clang/lib/CodeGen/CGHLSLRuntime.h b/clang/lib/CodeGen/CGHLSLRuntime.h
index d76cc18c9a259..f93ed28b5c787 100644
--- a/clang/lib/CodeGen/CGHLSLRuntime.h
+++ b/clang/lib/CodeGen/CGHLSLRuntime.h
@@ -158,6 +158,7 @@ class CGHLSLRuntime {
   GENERATE_HLSL_INTRINSIC_FUNCTION(WaveGetLaneCount, wave_get_lane_count)
   GENERATE_HLSL_INTRINSIC_FUNCTION(WaveReadLaneAt, wave_readlane)
   GENERATE_HLSL_INTRINSIC_FUNCTION(QuadReadAcrossX, quad_read_across_x)
+  GENERATE_HLSL_INTRINSIC_FUNCTION(QuadReadAcrossY, quad_read_across_y)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitUHigh, firstbituhigh)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitSHigh, firstbitshigh)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitLow, firstbitlow)
diff --git a/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h b/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
index c2e4d74af6873..a1048d21b20e6 100644
--- a/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
+++ b/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
@@ -3604,6 +3604,105 @@ __attribute__((convergent)) double3 QuadReadAcrossX(double3);
 _HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_x)
 __attribute__((convergent)) double4 QuadReadAcrossX(double4);
 
+//===----------------------------------------------------------------------===//
+// QuadReadAcrossY builtins
+//===----------------------------------------------------------------------===//
+
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half QuadReadAcrossY(half);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half2 QuadReadAcrossY(half2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half3 QuadReadAcrossY(half3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half4 QuadReadAcrossY(half4);
+
+#ifdef __HLSL_ENABLE_16_BIT
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t QuadReadAcrossY(int16_t);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t2 QuadReadAcrossY(int16_t2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t3 QuadReadAcrossY(int16_t3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t4 QuadReadAcrossY(int16_t4);
+
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t QuadReadAcrossY(uint16_t);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t2 QuadReadAcrossY(uint16_t2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t3 QuadReadAcrossY(uint16_t3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t4 QuadReadAcrossY(uint16_t4);
+#endif
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int QuadReadAcrossY(int);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int2 QuadReadAcrossY(int2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int3 QuadReadAcrossY(int3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int4 QuadReadAcrossY(int4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint QuadReadAcrossY(uint);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint2 QuadReadAcrossY(uint2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint3 QuadReadAcrossY(uint3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint4 QuadReadAcrossY(uint4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t QuadReadAcrossY(int64_t);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t2 QuadReadAcrossY(int64_t2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t3 QuadReadAcrossY(int64_t3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t4 QuadReadAcrossY(int64_t4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t QuadReadAcrossY(uint64_t);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t2 QuadReadAcrossY(uint64_t2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t3 QuadReadAcrossY(uint64_t3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t4 QuadReadAcrossY(uint64_t4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float QuadReadAcrossY(float);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float2 QuadReadAcrossY(float2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float3 QuadReadAcrossY(float3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float4 QuadReadAcrossY(float4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double QuadReadAcrossY(double);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double2 QuadReadAcrossY(double2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double3 QuadReadAcrossY(double3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double4 QuadReadAcrossY(double4);
+
 //===----------------------------------------------------------------------===//
 // sign builtins
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/Sema/SemaHLSL.cpp b/clang/lib/Sema/SemaHLSL.cpp
index 0619295cd2fbb..badbf9416d44f 100644
--- a/clang/lib/Sema/SemaHLSL.cpp
+++ b/clang/lib/Sema/SemaHLSL.cpp
@@ -4236,7 +4236,8 @@ bool SemaHLSL::CheckBuiltinFunctionCall(unsigned BuiltinID, CallExpr *TheCall) {
     TheCall->setType(ArgTyExpr);
     break;
   }
-  case Builtin::BI__builtin_hlsl_quad_read_across_x: {
+  case Builtin::BI__builtin_hlsl_quad_read_across_x:
+  case Builtin::BI__builtin_hlsl_quad_read_across_y: {
     if (SemaRef.checkArgCount(TheCall, 1))
       return true;
 
diff --git a/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
new file mode 100644
index 0000000000000..a050aa9cf1722
--- /dev/null
+++ b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
@@ -0,0 +1,46 @@
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   dxil-pc-shadermodel6.3-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-DXIL
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   spirv-pc-vulkan-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-SPIRV
+
+// Test basic lowering to runtime function call.
+
+// CHECK-LABEL: test_int
+int test_int(int expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+
+// CHECK-LABEL: test_uint64_t
+uint64_t test_uint64_t(uint64_t expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+
+// Test basic lowering to runtime function call with array and float value.
+
+// CHECK-LABEL: test_floatv4
+float4 test_floatv4(float4 expr) {
+  // CHECK-SPIRV:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn spir_func [[TY1:.*]] @llvm.spv.quad.read.across.y.v4f32([[TY1]] %[[#]]
+  // CHECK-DXIL:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn [[TY1:.*]] @llvm.dx.quad.read.across.y.v4f32([[TY1]] %[[#]])
+  // CHECK:  ret [[TY1]] %[[RET1]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY1]] @llvm.dx.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+// CHECK-SPIRV: declare [[TY1]] @llvm.spv.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+
+// CHECK: attributes #[[#attr]] = {{{.*}} convergent {{.*}}}
+
diff --git a/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
new file mode 100644
index 0000000000000..2995895ff4c3d
--- /dev/null
+++ b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
@@ -0,0 +1,28 @@
+// RUN: %clang_cc1 -finclude-default-header -triple dxil-pc-shadermodel6.6-library %s -emit-llvm-only -disable-llvm-passes -verify
+
+int test_too_few_arg() {
+  return __builtin_hlsl_quad_read_across_y();
+  // expected-error@-1 {{too few arguments to function call, expected 1, have 0}}
+}
+
+float2 test_too_many_arg(float2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0, p0);
+  // expected-error@-1 {{too many arguments to function call, expected 1, have 2}}
+}
+
+bool test_expr_bool_type_check(bool p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool'}}
+}
+
+bool2 test_expr_bool_vec_type_check(bool2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool2' (aka 'vector<bool, 2>')}}
+}
+
+struct S { float f; };
+
+S test_expr_struct_type_check(S p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'S' where a scalar or vector is required}}
+}
diff --git a/llvm/include/llvm/IR/IntrinsicsDirectX.td b/llvm/include/llvm/IR/IntrinsicsDirectX.td
index 3f7922382f090..8ee07b32c1eca 100644
--- a/llvm/include/llvm/IR/IntrinsicsDirectX.td
+++ b/llvm/include/llvm/IR/IntrinsicsDirectX.td
@@ -255,6 +255,7 @@ def int_dx_wave_prefix_usum : DefaultAttrsIntrinsic<[llvm_anyint_ty], [LLVMMatch
 def int_dx_wave_prefix_product : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_wave_prefix_uproduct : DefaultAttrsIntrinsic<[llvm_anyint_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_quad_read_across_x : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
+def int_dx_quad_read_across_y : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_sign : DefaultAttrsIntrinsic<[LLVMScalarOrSameVectorWidth<0, llvm_i32_ty>], [llvm_any_ty], [IntrNoMem]>;
 def int_dx_step : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty, LLVMMatchType<0>], [IntrNoMem]>;
 def int_dx_splitdouble : DefaultAttrsIntrinsic<[llvm_anyint_ty, LLVMMatchType<0>],
diff --git a/llvm/include/llvm/IR/IntrinsicsSPIRV.td b/llvm/include/llvm/IR/IntrinsicsSPIRV.td
index 5d467adb08c3d..6f030b7f2509f 100644
--- a/llvm/include/llvm/IR/IntrinsicsSPIRV.td
+++ b/llvm/include/llvm/IR/IntrinsicsSPIRV.td
@@ -148,6 +148,7 @@ def int_spv_rsqrt : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty]
   def int_spv_wave_prefix_sum : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_wave_prefix_product : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_quad_read_across_x : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
+  def int_spv_quad_read_across_y : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_sign : DefaultAttrsIntrinsic<[LLVMScalarOrSameVectorWidth<0, llvm_i32_ty>], [llvm_any_ty], [IntrNoMem]>;
   def int_spv_radians : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty], [IntrNoMem]>;
   def int_spv_group_memory_barrier_with_group_sync : ClangBuiltin<"__builtin_spirv_group_barrier">,
diff --git a/llvm/lib/Target/DirectX/DXIL.td b/llvm/lib/Target/DirectX/DXIL.td
index 20de198ef96d2..13fc147e60216 100644
--- a/llvm/lib/Target/DirectX/DXIL.td
+++ b/llvm/lib/Target/DirectX/DXIL.td
@@ -1214,6 +1214,10 @@ def QuadOp : DXILOp<123, quadOp> {
                  [
                    IntrinArgIndex<0>, IntrinArgI8<QuadOpKind_ReadAcrossX>
                  ]>,
+    IntrinSelect<int_dx_quad_read_across_y,
+                 [
+                   IntrinArgIndex<0>, IntrinArgI8<QuadOpKind_ReadAcrossY>
+                 ]>,
   ];
 
   let arguments = [OverloadTy, Int8Ty];
diff --git a/llvm/lib/Target/DirectX/DXILShaderFlags.cpp b/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
index fab8bea379f74..69b942a171992 100644
--- a/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
+++ b/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
@@ -107,6 +107,7 @@ static bool checkWaveOps(Intrinsic::ID IID) {
   case Intrinsic::dx_wave_prefix_uproduct:
     // Quad Op Variants
   case Intrinsic::dx_quad_read_across_x:
+  case Intrinsic::dx_quad_read_across_y:
     return true;
   }
 }
diff --git a/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp b/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
index 0c652b3bb29c0..830fcc9da862e 100644
--- a/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
+++ b/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
@@ -77,6 +77,7 @@ bool DirectXTTIImpl::isTargetIntrinsicTriviallyScalarizable(
   case Intrinsic::dx_wave_prefix_usum:
   case Intrinsic::dx_wave_prefix_uproduct:
   case Intrinsic::dx_quad_read_across_x:
+  case Intrinsic::dx_quad_read_across_y:
   case Intrinsic::dx_imad:
   case Intrinsic::dx_umad:
   case Intrinsic::dx_ddx_coarse:
diff --git a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
index eb30002d2e1a5..abad90891b4f5 100644
--- a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
@@ -4457,6 +4457,9 @@ bool SPIRVInstructionSelector::selectIntrinsic(Register ResVReg,
   case Intrinsic::spv_quad_read_across_x: {
     return selectQuadSwap(ResVReg, ResType, I, /*Direction*/ 0);
   }
+  case Intrinsic::spv_quad_read_across_y: {
+    return selectQuadSwap(ResVReg, ResType, I, /*Direction*/ 1);
+  }
   case Intrinsic::spv_step:
     return selectExtInst(ResVReg, ResType, I, CL::step, GL::Step);
   case Intrinsic::spv_radians:
diff --git a/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll b/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll
new file mode 100644
index 0000000000000..d90ac759aa23d
--- /dev/null
+++ b/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll
@@ -0,0 +1,87 @@
+; RUN: opt -S -scalarizer -dxil-op-lower -mtriple=dxil-pc-shadermodel6.3-library < %s | FileCheck %s
+
+; Test that for scalar values, QuadReadAcrossY maps down to the DirectX op
+
+define noundef half @quad_read_across_y_half(half noundef %expr) {
+entry:
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr, i8 1)
+  %ret = call half @llvm.dx.quad.read.across.y.f16(half %expr)
+  ret half %ret
+}
+
+define noundef float @quad_read_across_y_float(float noundef %expr) {
+entry:
+; CHECK: call float @dx.op.quadOp.f32(i32 123, float %expr, i8 1)
+  %ret = call float @llvm.dx.quad.read.across.y.f32(float %expr)
+  ret float %ret
+}
+
+define noundef double @quad_read_across_y_double(double noundef %expr) {
+entry:
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr, i8 1)
+  %ret = call double @llvm.dx.quad.read.across.y.f64(double %expr)
+  ret double %ret
+}
+
+define noundef i16 @quad_read_across_y_i16(i16 noundef %expr) {
+entry:
+; CHECK: call i16 @dx.op.quadOp.i16(i32 123, i16 %expr, i8 1)
+  %ret = call i16 @llvm.dx.quad.read.across.y.i16(i16 %expr)
+  ret i16 %ret
+}
+
+define noundef i32 @quad_read_across_y_i32(i32 noundef %expr) {
+entry:
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr, i8 1)
+  %ret = call i32 @llvm.dx.quad.read.across.y.i32(i32 %expr)
+  ret i32 %ret
+}
+
+define noundef i64 @quad_read_across_y_i64(i64 noundef %expr) {
+entry:
+; CHECK: call i64 @dx.op.quadOp.i64(i32 123, i64 %expr, i8 1)
+  %ret = call i64 @llvm.dx.quad.read.across.y.i64(i64 %expr)
+  ret i64 %ret
+}
+
+declare half @llvm.dx.quad.read.across.y.f16(half)
+declare float @llvm.dx.quad.read.across.y.f32(float)
+declare double @llvm.dx.quad.read.across.y.f64(double)
+
+declare i16 @llvm.dx.quad.read.across.y.i16(i16)
+declare i32 @llvm.dx.quad.read.across.y.i32(i32)
+declare i64 @llvm.dx.quad.read.across.y.i64(i64)
+
+; Test that for vector values, QuadReadAcrossY scalarizes and maps down to the
+; DirectX op
+
+define noundef <2 x half> @quad_read_across_y_v2half(<2 x half> noundef %expr) {
+entry:
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr.i0, i8 1)
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr.i1, i8 1)
+  %ret = call <2 x half> @llvm.dx.quad.read.across.y.v2f16(<2 x half> %expr)
+  ret <2 x half> %ret
+}
+
+define noundef <3 x i32> @quad_read_across_y_v3i32(<3 x i32> noundef %expr) {
+entry:
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i0, i8 1)
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i1, i8 1)
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i2, i8 1)
+  %ret = call <3 x i32> @llvm.dx.quad.read.across.y.v3i32(<3 x i32> %expr)
+  ret <3 x i32> %ret
+}
+
+define noundef <4 x double> @quad_read_across_y_v4f64(<4 x double> noundef %expr) {
+entry:
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i0, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i1, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i2, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i3, i8 1)
+  ...
[truncated]
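
The DirectX test above relies on two steps: the scalarizer splits a vector call into one call per element, and the op lowering turns each into `dx.op.quadOp` with opcode 123 and kind 1. As a rough sketch of that expectation, the hypothetical helper below (not part of the patch) rebuilds the call strings that the CHECK lines match for an operand named `%expr`. The `.i0`, `.i1` element naming is taken from the CHECK lines shown here and is an assumption about scalarizer output in general.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds the dx.op.quadOp calls expected after
// scalarization and DXIL op lowering of a QuadReadAcrossY call on %expr.
std::vector<std::string> expectedQuadOpCalls(const std::string &IRType,
                                             const std::string &Overload,
                                             unsigned NumElements) {
  assert(NumElements >= 1 && "need at least one element");
  const unsigned QuadOpcode = 123;    // DXIL QuadOp opcode, per DXIL.td
  const unsigned ReadAcrossYKind = 1; // i8 kind operand seen in the tests
  std::vector<std::string> Calls;
  for (unsigned I = 0; I < NumElements; ++I) {
    // Scalars keep the original name; vector elements become .i0, .i1, ...
    std::string Operand =
        NumElements == 1 ? "%expr" : "%expr.i" + std::to_string(I);
    Calls.push_back("call " + IRType + " @dx.op.quadOp." + Overload +
                    "(i32 " + std::to_string(QuadOpcode) + ", " + IRType +
                    " " + Operand + ", i8 " +
                    std::to_string(ReadAcrossYKind) + ")");
  }
  return Calls;
}
```

For example, `expectedQuadOpCalls("i32", "i32", 3)` yields the three lines checked in `quad_read_across_y_v3i32`, and `expectedQuadOpCalls("half", "f16", 1)` yields the single scalar call.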

@llvmbot
Member

llvmbot commented Mar 19, 2026

@llvm/pr-subscribers-backend-directx

+__attribute__((convergent)) uint64_t3 QuadReadAcrossY(uint64_t3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t4 QuadReadAcrossY(uint64_t4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float QuadReadAcrossY(float);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float2 QuadReadAcrossY(float2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float3 QuadReadAcrossY(float3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float4 QuadReadAcrossY(float4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double QuadReadAcrossY(double);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double2 QuadReadAcrossY(double2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double3 QuadReadAcrossY(double3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double4 QuadReadAcrossY(double4);
+
 //===----------------------------------------------------------------------===//
 // sign builtins
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/Sema/SemaHLSL.cpp b/clang/lib/Sema/SemaHLSL.cpp
index 0619295cd2fbb..badbf9416d44f 100644
--- a/clang/lib/Sema/SemaHLSL.cpp
+++ b/clang/lib/Sema/SemaHLSL.cpp
@@ -4236,7 +4236,8 @@ bool SemaHLSL::CheckBuiltinFunctionCall(unsigned BuiltinID, CallExpr *TheCall) {
     TheCall->setType(ArgTyExpr);
     break;
   }
-  case Builtin::BI__builtin_hlsl_quad_read_across_x: {
+  case Builtin::BI__builtin_hlsl_quad_read_across_x:
+  case Builtin::BI__builtin_hlsl_quad_read_across_y: {
     if (SemaRef.checkArgCount(TheCall, 1))
       return true;
 
diff --git a/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
new file mode 100644
index 0000000000000..a050aa9cf1722
--- /dev/null
+++ b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
@@ -0,0 +1,46 @@
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   dxil-pc-shadermodel6.3-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-DXIL
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   spirv-pc-vulkan-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-SPIRV
+
+// Test basic lowering to runtime function call.
+
+// CHECK-LABEL: test_int
+int test_int(int expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+
+// CHECK-LABEL: test_uint64_t
+uint64_t test_uint64_t(uint64_t expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+
+// Test basic lowering to runtime function call with array and float value.
+
+// CHECK-LABEL: test_floatv4
+float4 test_floatv4(float4 expr) {
+  // CHECK-SPIRV:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn spir_func [[TY1:.*]] @llvm.spv.quad.read.across.y.v4f32([[TY1]] %[[#]]
+  // CHECK-DXIL:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn [[TY1:.*]] @llvm.dx.quad.read.across.y.v4f32([[TY1]] %[[#]])
+  // CHECK:  ret [[TY1]] %[[RET1]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY1]] @llvm.dx.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+// CHECK-SPIRV: declare [[TY1]] @llvm.spv.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+
+// CHECK: attributes #[[#attr]] = {{{.*}} convergent {{.*}}}
+
diff --git a/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
new file mode 100644
index 0000000000000..2995895ff4c3d
--- /dev/null
+++ b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
@@ -0,0 +1,28 @@
+// RUN: %clang_cc1 -finclude-default-header -triple dxil-pc-shadermodel6.6-library %s -emit-llvm-only -disable-llvm-passes -verify
+
+int test_too_few_arg() {
+  return __builtin_hlsl_quad_read_across_y();
+  // expected-error@-1 {{too few arguments to function call, expected 1, have 0}}
+}
+
+float2 test_too_many_arg(float2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0, p0);
+  // expected-error@-1 {{too many arguments to function call, expected 1, have 2}}
+}
+
+bool test_expr_bool_type_check(bool p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool'}}
+}
+
+bool2 test_expr_bool_vec_type_check(bool2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool2' (aka 'vector<bool, 2>')}}
+}
+
+struct S { float f; };
+
+S test_expr_struct_type_check(S p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'S' where a scalar or vector is required}}
+}
diff --git a/llvm/include/llvm/IR/IntrinsicsDirectX.td b/llvm/include/llvm/IR/IntrinsicsDirectX.td
index 3f7922382f090..8ee07b32c1eca 100644
--- a/llvm/include/llvm/IR/IntrinsicsDirectX.td
+++ b/llvm/include/llvm/IR/IntrinsicsDirectX.td
@@ -255,6 +255,7 @@ def int_dx_wave_prefix_usum : DefaultAttrsIntrinsic<[llvm_anyint_ty], [LLVMMatch
 def int_dx_wave_prefix_product : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_wave_prefix_uproduct : DefaultAttrsIntrinsic<[llvm_anyint_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_quad_read_across_x : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
+def int_dx_quad_read_across_y : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_sign : DefaultAttrsIntrinsic<[LLVMScalarOrSameVectorWidth<0, llvm_i32_ty>], [llvm_any_ty], [IntrNoMem]>;
 def int_dx_step : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty, LLVMMatchType<0>], [IntrNoMem]>;
 def int_dx_splitdouble : DefaultAttrsIntrinsic<[llvm_anyint_ty, LLVMMatchType<0>],
diff --git a/llvm/include/llvm/IR/IntrinsicsSPIRV.td b/llvm/include/llvm/IR/IntrinsicsSPIRV.td
index 5d467adb08c3d..6f030b7f2509f 100644
--- a/llvm/include/llvm/IR/IntrinsicsSPIRV.td
+++ b/llvm/include/llvm/IR/IntrinsicsSPIRV.td
@@ -148,6 +148,7 @@ def int_spv_rsqrt : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty]
   def int_spv_wave_prefix_sum : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_wave_prefix_product : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_quad_read_across_x : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
+  def int_spv_quad_read_across_y : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_sign : DefaultAttrsIntrinsic<[LLVMScalarOrSameVectorWidth<0, llvm_i32_ty>], [llvm_any_ty], [IntrNoMem]>;
   def int_spv_radians : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty], [IntrNoMem]>;
   def int_spv_group_memory_barrier_with_group_sync : ClangBuiltin<"__builtin_spirv_group_barrier">,
diff --git a/llvm/lib/Target/DirectX/DXIL.td b/llvm/lib/Target/DirectX/DXIL.td
index 20de198ef96d2..13fc147e60216 100644
--- a/llvm/lib/Target/DirectX/DXIL.td
+++ b/llvm/lib/Target/DirectX/DXIL.td
@@ -1214,6 +1214,10 @@ def QuadOp : DXILOp<123, quadOp> {
                  [
                    IntrinArgIndex<0>, IntrinArgI8<QuadOpKind_ReadAcrossX>
                  ]>,
+    IntrinSelect<int_dx_quad_read_across_y,
+                 [
+                   IntrinArgIndex<0>, IntrinArgI8<QuadOpKind_ReadAcrossY>
+                 ]>,
   ];
 
   let arguments = [OverloadTy, Int8Ty];
diff --git a/llvm/lib/Target/DirectX/DXILShaderFlags.cpp b/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
index fab8bea379f74..69b942a171992 100644
--- a/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
+++ b/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
@@ -107,6 +107,7 @@ static bool checkWaveOps(Intrinsic::ID IID) {
   case Intrinsic::dx_wave_prefix_uproduct:
     // Quad Op Variants
   case Intrinsic::dx_quad_read_across_x:
+  case Intrinsic::dx_quad_read_across_y:
     return true;
   }
 }
diff --git a/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp b/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
index 0c652b3bb29c0..830fcc9da862e 100644
--- a/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
+++ b/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
@@ -77,6 +77,7 @@ bool DirectXTTIImpl::isTargetIntrinsicTriviallyScalarizable(
   case Intrinsic::dx_wave_prefix_usum:
   case Intrinsic::dx_wave_prefix_uproduct:
   case Intrinsic::dx_quad_read_across_x:
+  case Intrinsic::dx_quad_read_across_y:
   case Intrinsic::dx_imad:
   case Intrinsic::dx_umad:
   case Intrinsic::dx_ddx_coarse:
diff --git a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
index eb30002d2e1a5..abad90891b4f5 100644
--- a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
@@ -4457,6 +4457,9 @@ bool SPIRVInstructionSelector::selectIntrinsic(Register ResVReg,
   case Intrinsic::spv_quad_read_across_x: {
     return selectQuadSwap(ResVReg, ResType, I, /*Direction*/ 0);
   }
+  case Intrinsic::spv_quad_read_across_y: {
+    return selectQuadSwap(ResVReg, ResType, I, /*Direction*/ 1);
+  }
   case Intrinsic::spv_step:
     return selectExtInst(ResVReg, ResType, I, CL::step, GL::Step);
   case Intrinsic::spv_radians:
diff --git a/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll b/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll
new file mode 100644
index 0000000000000..d90ac759aa23d
--- /dev/null
+++ b/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll
@@ -0,0 +1,87 @@
+; RUN: opt -S -scalarizer -dxil-op-lower -mtriple=dxil-pc-shadermodel6.3-library < %s | FileCheck %s
+
+; Test that for scalar values, QuadReadAcrossY maps down to the DirectX op
+
+define noundef half @quad_read_across_y_half(half noundef %expr) {
+entry:
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr, i8 1)
+  %ret = call half @llvm.dx.quad.read.across.y.f16(half %expr)
+  ret half %ret
+}
+
+define noundef float @quad_read_across_y_float(float noundef %expr) {
+entry:
+; CHECK: call float @dx.op.quadOp.f32(i32 123, float %expr, i8 1)
+  %ret = call float @llvm.dx.quad.read.across.y.f32(float %expr)
+  ret float %ret
+}
+
+define noundef double @quad_read_across_y_double(double noundef %expr) {
+entry:
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr, i8 1)
+  %ret = call double @llvm.dx.quad.read.across.y.f64(double %expr)
+  ret double %ret
+}
+
+define noundef i16 @quad_read_across_y_i16(i16 noundef %expr) {
+entry:
+; CHECK: call i16 @dx.op.quadOp.i16(i32 123, i16 %expr, i8 1)
+  %ret = call i16 @llvm.dx.quad.read.across.y.i16(i16 %expr)
+  ret i16 %ret
+}
+
+define noundef i32 @quad_read_across_y_i32(i32 noundef %expr) {
+entry:
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr, i8 1)
+  %ret = call i32 @llvm.dx.quad.read.across.y.i32(i32 %expr)
+  ret i32 %ret
+}
+
+define noundef i64 @quad_read_across_y_i64(i64 noundef %expr) {
+entry:
+; CHECK: call i64 @dx.op.quadOp.i64(i32 123, i64 %expr, i8 1)
+  %ret = call i64 @llvm.dx.quad.read.across.y.i64(i64 %expr)
+  ret i64 %ret
+}
+
+declare half @llvm.dx.quad.read.across.y.f16(half)
+declare float @llvm.dx.quad.read.across.y.f32(float)
+declare double @llvm.dx.quad.read.across.y.f64(double)
+
+declare i16 @llvm.dx.quad.read.across.y.i16(i16)
+declare i32 @llvm.dx.quad.read.across.y.i32(i32)
+declare i64 @llvm.dx.quad.read.across.y.i64(i64)
+
+; Test that for vector values, QuadReadAcrossY scalarizes and maps down to the
+; DirectX op
+
+define noundef <2 x half> @quad_read_across_y_v2half(<2 x half> noundef %expr) {
+entry:
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr.i0, i8 1)
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr.i1, i8 1)
+  %ret = call <2 x half> @llvm.dx.quad.read.across.y.v2f16(<2 x half> %expr)
+  ret <2 x half> %ret
+}
+
+define noundef <3 x i32> @quad_read_across_y_v3i32(<3 x i32> noundef %expr) {
+entry:
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i0, i8 1)
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i1, i8 1)
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i2, i8 1)
+  %ret = call <3 x i32> @llvm.dx.quad.read.across.y.v3i32(<3 x i32> %expr)
+  ret <3 x i32> %ret
+}
+
+define noundef <4 x double> @quad_read_across_y_v4f64(<4 x double> noundef %expr) {
+entry:
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i0, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i1, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i2, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i3, i8 1)
+  ...
[truncated]

@llvmbot
Member

llvmbot commented Mar 19, 2026

@llvm/pr-subscribers-hlsl

Author: Kai (kcloudy0717)

Changes

This PR adds QuadReadAcrossY intrinsic support in HLSL with codegen for both DirectX and SPIRV backends. Resolves #99176.

  • Implement QuadReadAcrossY clang builtin
  • Link QuadReadAcrossY clang builtin with hlsl_alias_intrinsics.h
  • Add sema checks for QuadReadAcrossY to SemaHLSL::CheckBuiltinFunctionCall in SemaHLSL.cpp
  • Add codegen for QuadReadAcrossY to EmitHLSLBuiltinExpr in CGHLSLBuiltins.cpp
  • Add codegen tests to clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
  • Add sema tests to clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
  • Create the int_dx_quad_read_across_y intrinsic in IntrinsicsDirectX.td
  • Map int_dx_quad_read_across_y to DXIL op 123 (QuadOp) with QuadOpKind_ReadAcrossY in DXIL.td
  • Create the QuadReadAcrossY.ll test in llvm/test/CodeGen/DirectX/ and extend ShaderFlags/wave-ops.ll
  • Create the int_spv_quad_read_across_y intrinsic in IntrinsicsSPIRV.td
  • In SPIRVInstructionSelector::selectIntrinsic (SPIRVInstructionSelector.cpp), lower int_spv_quad_read_across_y through selectQuadSwap with direction 1.
  • Create SPIR-V backend test case in llvm/test/CodeGen/SPIRV/hlsl-intrinsics/QuadReadAcrossY.ll
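
For readers unfamiliar with quad operations, here is a rough CPU-side sketch of what the two backends are expected to compute. It is not part of the patch. It assumes the usual 2x2 quad layout, where the lane index is y * 2 + x, and it reuses the direction encoding visible in the diff (0 for X, 1 for Y). The names QuadDirection, quadSourceLane and quadReadAcross are hypothetical and exist only for illustration.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of one 2x2 quad. Assumed lane layout: index = y * 2 + x,
// so lanes 0/1 form the top row and lanes 2/3 form the bottom row.
// The enum values mirror the direction constants used in the patch
// (0 = across X, 1 = across Y); the type itself is illustrative only.
enum class QuadDirection : unsigned { AcrossX = 0, AcrossY = 1 };

// Lane that `lane` reads from: flipping bit 0 toggles x, flipping bit 1
// toggles y.
constexpr std::size_t quadSourceLane(std::size_t lane, QuadDirection dir) {
  return lane ^ (std::size_t{1} << static_cast<unsigned>(dir));
}

// Apply the read to all four lanes of a quad at once.
template <typename T>
std::array<T, 4> quadReadAcross(const std::array<T, 4> &values,
                                QuadDirection dir) {
  std::array<T, 4> result{};
  for (std::size_t lane = 0; lane < 4; ++lane)
    result[lane] = values[quadSourceLane(lane, dir)];
  return result;
}
```

Under these assumptions, calling quadReadAcross on a quad holding {10, 11, 12, 13} with QuadDirection::AcrossY exchanges the top and bottom rows, while QuadDirection::AcrossX exchanges the left and right columns. Real hardware only defines the result when all four lanes of the quad are active, which this model does not capture.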

Patch is 22.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/187440.diff

16 Files Affected:

  • (modified) clang/include/clang/Basic/Builtins.td (+6)
  • (modified) clang/lib/CodeGen/CGHLSLBuiltins.cpp (+7)
  • (modified) clang/lib/CodeGen/CGHLSLRuntime.h (+1)
  • (modified) clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h (+99)
  • (modified) clang/lib/Sema/SemaHLSL.cpp (+2-1)
  • (added) clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl (+46)
  • (added) clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl (+28)
  • (modified) llvm/include/llvm/IR/IntrinsicsDirectX.td (+1)
  • (modified) llvm/include/llvm/IR/IntrinsicsSPIRV.td (+1)
  • (modified) llvm/lib/Target/DirectX/DXIL.td (+4)
  • (modified) llvm/lib/Target/DirectX/DXILShaderFlags.cpp (+1)
  • (modified) llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp (+1)
  • (modified) llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp (+3)
  • (added) llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll (+87)
  • (modified) llvm/test/CodeGen/DirectX/ShaderFlags/wave-ops.ll (+7)
  • (added) llvm/test/CodeGen/SPIRV/hlsl-intrinsics/QuadReadAcrossY.ll (+44)
diff --git a/clang/include/clang/Basic/Builtins.td b/clang/include/clang/Basic/Builtins.td
index 8e002c5b900aa..5a1ab7197b4a8 100644
--- a/clang/include/clang/Basic/Builtins.td
+++ b/clang/include/clang/Basic/Builtins.td
@@ -5276,6 +5276,12 @@ def HLSLQuadReadAcrossX : LangBuiltin<"HLSL_LANG"> {
   let Prototype = "void(...)";
 }
 
+def HLSLQuadReadAcrossY : LangBuiltin<"HLSL_LANG"> {
+  let Spellings = ["__builtin_hlsl_quad_read_across_y"];
+  let Attributes = [NoThrow, Const];
+  let Prototype = "void(...)";
+}
+
 def HLSLClamp : LangBuiltin<"HLSL_LANG"> {
   let Spellings = ["__builtin_hlsl_elementwise_clamp"];
   let Attributes = [NoThrow, Const, CustomTypeChecking];
diff --git a/clang/lib/CodeGen/CGHLSLBuiltins.cpp b/clang/lib/CodeGen/CGHLSLBuiltins.cpp
index 22293c5983d89..98edd479c02a5 100644
--- a/clang/lib/CodeGen/CGHLSLBuiltins.cpp
+++ b/clang/lib/CodeGen/CGHLSLBuiltins.cpp
@@ -1402,6 +1402,13 @@ Value *CodeGenFunction::EmitHLSLBuiltinExpr(unsigned BuiltinID,
                                &CGM.getModule(), ID, {OpExpr->getType()}),
                            ArrayRef{OpExpr}, "hlsl.quad.read.across.x");
   }
+  case Builtin::BI__builtin_hlsl_quad_read_across_y: {
+    Value *OpExpr = EmitScalarExpr(E->getArg(0));
+    Intrinsic::ID ID = CGM.getHLSLRuntime().getQuadReadAcrossYIntrinsic();
+    return EmitRuntimeCall(Intrinsic::getOrInsertDeclaration(
+                               &CGM.getModule(), ID, {OpExpr->getType()}),
+                           ArrayRef{OpExpr}, "hlsl.quad.read.across.y");
+  }
   case Builtin::BI__builtin_hlsl_elementwise_sign: {
     auto *Arg0 = E->getArg(0);
     Value *Op0 = EmitScalarExpr(Arg0);
diff --git a/clang/lib/CodeGen/CGHLSLRuntime.h b/clang/lib/CodeGen/CGHLSLRuntime.h
index d76cc18c9a259..f93ed28b5c787 100644
--- a/clang/lib/CodeGen/CGHLSLRuntime.h
+++ b/clang/lib/CodeGen/CGHLSLRuntime.h
@@ -158,6 +158,7 @@ class CGHLSLRuntime {
   GENERATE_HLSL_INTRINSIC_FUNCTION(WaveGetLaneCount, wave_get_lane_count)
   GENERATE_HLSL_INTRINSIC_FUNCTION(WaveReadLaneAt, wave_readlane)
   GENERATE_HLSL_INTRINSIC_FUNCTION(QuadReadAcrossX, quad_read_across_x)
+  GENERATE_HLSL_INTRINSIC_FUNCTION(QuadReadAcrossY, quad_read_across_y)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitUHigh, firstbituhigh)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitSHigh, firstbitshigh)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitLow, firstbitlow)
diff --git a/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h b/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
index c2e4d74af6873..a1048d21b20e6 100644
--- a/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
+++ b/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
@@ -3604,6 +3604,105 @@ __attribute__((convergent)) double3 QuadReadAcrossX(double3);
 _HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_x)
 __attribute__((convergent)) double4 QuadReadAcrossX(double4);
 
+//===----------------------------------------------------------------------===//
+// QuadReadAcrossY builtins
+//===----------------------------------------------------------------------===//
+
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half QuadReadAcrossY(half);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half2 QuadReadAcrossY(half2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half3 QuadReadAcrossY(half3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half4 QuadReadAcrossY(half4);
+
+#ifdef __HLSL_ENABLE_16_BIT
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t QuadReadAcrossY(int16_t);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t2 QuadReadAcrossY(int16_t2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t3 QuadReadAcrossY(int16_t3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t4 QuadReadAcrossY(int16_t4);
+
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t QuadReadAcrossY(uint16_t);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t2 QuadReadAcrossY(uint16_t2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t3 QuadReadAcrossY(uint16_t3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t4 QuadReadAcrossY(uint16_t4);
+#endif
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int QuadReadAcrossY(int);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int2 QuadReadAcrossY(int2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int3 QuadReadAcrossY(int3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int4 QuadReadAcrossY(int4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint QuadReadAcrossY(uint);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint2 QuadReadAcrossY(uint2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint3 QuadReadAcrossY(uint3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint4 QuadReadAcrossY(uint4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t QuadReadAcrossY(int64_t);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t2 QuadReadAcrossY(int64_t2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t3 QuadReadAcrossY(int64_t3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t4 QuadReadAcrossY(int64_t4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t QuadReadAcrossY(uint64_t);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t2 QuadReadAcrossY(uint64_t2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t3 QuadReadAcrossY(uint64_t3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t4 QuadReadAcrossY(uint64_t4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float QuadReadAcrossY(float);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float2 QuadReadAcrossY(float2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float3 QuadReadAcrossY(float3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float4 QuadReadAcrossY(float4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double QuadReadAcrossY(double);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double2 QuadReadAcrossY(double2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double3 QuadReadAcrossY(double3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double4 QuadReadAcrossY(double4);
+
 //===----------------------------------------------------------------------===//
 // sign builtins
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/Sema/SemaHLSL.cpp b/clang/lib/Sema/SemaHLSL.cpp
index 0619295cd2fbb..badbf9416d44f 100644
--- a/clang/lib/Sema/SemaHLSL.cpp
+++ b/clang/lib/Sema/SemaHLSL.cpp
@@ -4236,7 +4236,8 @@ bool SemaHLSL::CheckBuiltinFunctionCall(unsigned BuiltinID, CallExpr *TheCall) {
     TheCall->setType(ArgTyExpr);
     break;
   }
-  case Builtin::BI__builtin_hlsl_quad_read_across_x: {
+  case Builtin::BI__builtin_hlsl_quad_read_across_x:
+  case Builtin::BI__builtin_hlsl_quad_read_across_y: {
     if (SemaRef.checkArgCount(TheCall, 1))
       return true;
 
diff --git a/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
new file mode 100644
index 0000000000000..a050aa9cf1722
--- /dev/null
+++ b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
@@ -0,0 +1,46 @@
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   dxil-pc-shadermodel6.3-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-DXIL
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   spirv-pc-vulkan-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-SPIRV
+
+// Test basic lowering to runtime function call.
+
+// CHECK-LABEL: test_int
+int test_int(int expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+
+// CHECK-LABEL: test_uint64_t
+uint64_t test_uint64_t(uint64_t expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+
+// Test basic lowering to runtime function call with array and float value.
+
+// CHECK-LABEL: test_floatv4
+float4 test_floatv4(float4 expr) {
+  // CHECK-SPIRV:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn spir_func [[TY1:.*]] @llvm.spv.quad.read.across.y.v4f32([[TY1]] %[[#]]
+  // CHECK-DXIL:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn [[TY1:.*]] @llvm.dx.quad.read.across.y.v4f32([[TY1]] %[[#]])
+  // CHECK:  ret [[TY1]] %[[RET1]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY1]] @llvm.dx.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+// CHECK-SPIRV: declare [[TY1]] @llvm.spv.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+
+// CHECK: attributes #[[#attr]] = {{{.*}} convergent {{.*}}}
+
diff --git a/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
new file mode 100644
index 0000000000000..2995895ff4c3d
--- /dev/null
+++ b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
@@ -0,0 +1,28 @@
+// RUN: %clang_cc1 -finclude-default-header -triple dxil-pc-shadermodel6.6-library %s -emit-llvm-only -disable-llvm-passes -verify
+
+int test_too_few_arg() {
+  return __builtin_hlsl_quad_read_across_y();
+  // expected-error@-1 {{too few arguments to function call, expected 1, have 0}}
+}
+
+float2 test_too_many_arg(float2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0, p0);
+  // expected-error@-1 {{too many arguments to function call, expected 1, have 2}}
+}
+
+bool test_expr_bool_type_check(bool p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool'}}
+}
+
+bool2 test_expr_bool_vec_type_check(bool2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool2' (aka 'vector<bool, 2>')}}
+}
+
+struct S { float f; };
+
+S test_expr_struct_type_check(S p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'S' where a scalar or vector is required}}
+}
diff --git a/llvm/include/llvm/IR/IntrinsicsDirectX.td b/llvm/include/llvm/IR/IntrinsicsDirectX.td
index 3f7922382f090..8ee07b32c1eca 100644
--- a/llvm/include/llvm/IR/IntrinsicsDirectX.td
+++ b/llvm/include/llvm/IR/IntrinsicsDirectX.td
@@ -255,6 +255,7 @@ def int_dx_wave_prefix_usum : DefaultAttrsIntrinsic<[llvm_anyint_ty], [LLVMMatch
 def int_dx_wave_prefix_product : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_wave_prefix_uproduct : DefaultAttrsIntrinsic<[llvm_anyint_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_quad_read_across_x : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
+def int_dx_quad_read_across_y : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
 def int_dx_sign : DefaultAttrsIntrinsic<[LLVMScalarOrSameVectorWidth<0, llvm_i32_ty>], [llvm_any_ty], [IntrNoMem]>;
 def int_dx_step : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty, LLVMMatchType<0>], [IntrNoMem]>;
 def int_dx_splitdouble : DefaultAttrsIntrinsic<[llvm_anyint_ty, LLVMMatchType<0>],
diff --git a/llvm/include/llvm/IR/IntrinsicsSPIRV.td b/llvm/include/llvm/IR/IntrinsicsSPIRV.td
index 5d467adb08c3d..6f030b7f2509f 100644
--- a/llvm/include/llvm/IR/IntrinsicsSPIRV.td
+++ b/llvm/include/llvm/IR/IntrinsicsSPIRV.td
@@ -148,6 +148,7 @@ def int_spv_rsqrt : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty]
   def int_spv_wave_prefix_sum : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_wave_prefix_product : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_quad_read_across_x : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
+  def int_spv_quad_read_across_y : DefaultAttrsIntrinsic<[llvm_any_ty], [LLVMMatchType<0>], [IntrConvergent, IntrNoMem]>;
   def int_spv_sign : DefaultAttrsIntrinsic<[LLVMScalarOrSameVectorWidth<0, llvm_i32_ty>], [llvm_any_ty], [IntrNoMem]>;
   def int_spv_radians : DefaultAttrsIntrinsic<[LLVMMatchType<0>], [llvm_anyfloat_ty], [IntrNoMem]>;
   def int_spv_group_memory_barrier_with_group_sync : ClangBuiltin<"__builtin_spirv_group_barrier">,
diff --git a/llvm/lib/Target/DirectX/DXIL.td b/llvm/lib/Target/DirectX/DXIL.td
index 20de198ef96d2..13fc147e60216 100644
--- a/llvm/lib/Target/DirectX/DXIL.td
+++ b/llvm/lib/Target/DirectX/DXIL.td
@@ -1214,6 +1214,10 @@ def QuadOp : DXILOp<123, quadOp> {
                  [
                    IntrinArgIndex<0>, IntrinArgI8<QuadOpKind_ReadAcrossX>
                  ]>,
+    IntrinSelect<int_dx_quad_read_across_y,
+                 [
+                   IntrinArgIndex<0>, IntrinArgI8<QuadOpKind_ReadAcrossY>
+                 ]>,
   ];
 
   let arguments = [OverloadTy, Int8Ty];
diff --git a/llvm/lib/Target/DirectX/DXILShaderFlags.cpp b/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
index fab8bea379f74..69b942a171992 100644
--- a/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
+++ b/llvm/lib/Target/DirectX/DXILShaderFlags.cpp
@@ -107,6 +107,7 @@ static bool checkWaveOps(Intrinsic::ID IID) {
   case Intrinsic::dx_wave_prefix_uproduct:
     // Quad Op Variants
   case Intrinsic::dx_quad_read_across_x:
+  case Intrinsic::dx_quad_read_across_y:
     return true;
   }
 }
diff --git a/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp b/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
index 0c652b3bb29c0..830fcc9da862e 100644
--- a/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
+++ b/llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp
@@ -77,6 +77,7 @@ bool DirectXTTIImpl::isTargetIntrinsicTriviallyScalarizable(
   case Intrinsic::dx_wave_prefix_usum:
   case Intrinsic::dx_wave_prefix_uproduct:
   case Intrinsic::dx_quad_read_across_x:
+  case Intrinsic::dx_quad_read_across_y:
   case Intrinsic::dx_imad:
   case Intrinsic::dx_umad:
   case Intrinsic::dx_ddx_coarse:
diff --git a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
index eb30002d2e1a5..abad90891b4f5 100644
--- a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
@@ -4457,6 +4457,9 @@ bool SPIRVInstructionSelector::selectIntrinsic(Register ResVReg,
   case Intrinsic::spv_quad_read_across_x: {
     return selectQuadSwap(ResVReg, ResType, I, /*Direction*/ 0);
   }
+  case Intrinsic::spv_quad_read_across_y: {
+    return selectQuadSwap(ResVReg, ResType, I, /*Direction*/ 1);
+  }
   case Intrinsic::spv_step:
     return selectExtInst(ResVReg, ResType, I, CL::step, GL::Step);
   case Intrinsic::spv_radians:
diff --git a/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll b/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll
new file mode 100644
index 0000000000000..d90ac759aa23d
--- /dev/null
+++ b/llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll
@@ -0,0 +1,87 @@
+; RUN: opt -S -scalarizer -dxil-op-lower -mtriple=dxil-pc-shadermodel6.3-library < %s | FileCheck %s
+
+; Test that for scalar values, QuadReadAcrossY maps down to the DirectX op
+
+define noundef half @quad_read_across_y_half(half noundef %expr) {
+entry:
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr, i8 1)
+  %ret = call half @llvm.dx.quad.read.across.y.f16(half %expr)
+  ret half %ret
+}
+
+define noundef float @quad_read_across_y_float(float noundef %expr) {
+entry:
+; CHECK: call float @dx.op.quadOp.f32(i32 123, float %expr, i8 1)
+  %ret = call float @llvm.dx.quad.read.across.y.f32(float %expr)
+  ret float %ret
+}
+
+define noundef double @quad_read_across_y_double(double noundef %expr) {
+entry:
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr, i8 1)
+  %ret = call double @llvm.dx.quad.read.across.y.f64(double %expr)
+  ret double %ret
+}
+
+define noundef i16 @quad_read_across_y_i16(i16 noundef %expr) {
+entry:
+; CHECK: call i16 @dx.op.quadOp.i16(i32 123, i16 %expr, i8 1)
+  %ret = call i16 @llvm.dx.quad.read.across.y.i16(i16 %expr)
+  ret i16 %ret
+}
+
+define noundef i32 @quad_read_across_y_i32(i32 noundef %expr) {
+entry:
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr, i8 1)
+  %ret = call i32 @llvm.dx.quad.read.across.y.i32(i32 %expr)
+  ret i32 %ret
+}
+
+define noundef i64 @quad_read_across_y_i64(i64 noundef %expr) {
+entry:
+; CHECK: call i64 @dx.op.quadOp.i64(i32 123, i64 %expr, i8 1)
+  %ret = call i64 @llvm.dx.quad.read.across.y.i64(i64 %expr)
+  ret i64 %ret
+}
+
+declare half @llvm.dx.quad.read.across.y.f16(half)
+declare float @llvm.dx.quad.read.across.y.f32(float)
+declare double @llvm.dx.quad.read.across.y.f64(double)
+
+declare i16 @llvm.dx.quad.read.across.y.i16(i16)
+declare i32 @llvm.dx.quad.read.across.y.i32(i32)
+declare i64 @llvm.dx.quad.read.across.y.i64(i64)
+
+; Test that for vector values, QuadReadAcrossY scalarizes and maps down to the
+; DirectX op
+
+define noundef <2 x half> @quad_read_across_y_v2half(<2 x half> noundef %expr) {
+entry:
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr.i0, i8 1)
+; CHECK: call half @dx.op.quadOp.f16(i32 123, half %expr.i1, i8 1)
+  %ret = call <2 x half> @llvm.dx.quad.read.across.y.v2f16(<2 x half> %expr)
+  ret <2 x half> %ret
+}
+
+define noundef <3 x i32> @quad_read_across_y_v3i32(<3 x i32> noundef %expr) {
+entry:
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i0, i8 1)
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i1, i8 1)
+; CHECK: call i32 @dx.op.quadOp.i32(i32 123, i32 %expr.i2, i8 1)
+  %ret = call <3 x i32> @llvm.dx.quad.read.across.y.v3i32(<3 x i32> %expr)
+  ret <3 x i32> %ret
+}
+
+define noundef <4 x double> @quad_read_across_y_v4f64(<4 x double> noundef %expr) {
+entry:
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i0, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i1, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i2, i8 1)
+; CHECK: call double @dx.op.quadOp.f64(i32 123, double %expr.i3, i8 1)
+  ...
[truncated]
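
For readers unfamiliar with quad operations, the `i8 1` operand in the DXIL checks and the `/*Direction*/ 1` argument to `selectQuadSwap` both select the vertical neighbour within a 2x2 quad. Below is a minimal host-side sketch of that behaviour. It assumes the usual quad layout (lane 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right); `QuadDirection`, `sourceLane`, and `quadRead` are hypothetical names for illustration and are not part of this patch or of LLVM.

```cpp
#include <array>
#include <cstddef>

// Assumed quad layout (lane indices):
//   0 1
//   2 3
// Bit 0 of the lane index is the x coordinate, bit 1 is the y coordinate.
enum class QuadDirection : unsigned {
  AcrossX = 0,        // horizontal swap, Direction 0 in the SPIR-V selector
  AcrossY = 1,        // vertical swap, Direction 1 / `i8 1` in this PR's tests
  AcrossDiagonal = 2, // diagonal swap, not touched by this PR
};

// Returns the index of the lane whose value `lane` reads.
constexpr std::size_t sourceLane(std::size_t lane, QuadDirection dir) {
  switch (dir) {
  case QuadDirection::AcrossX:
    return lane ^ 1u; // flip x
  case QuadDirection::AcrossY:
    return lane ^ 2u; // flip y
  case QuadDirection::AcrossDiagonal:
    return lane ^ 3u; // flip both
  }
  return lane;
}

// Simulates the operation for all four lanes of one quad at once.
template <typename T>
std::array<T, 4> quadRead(const std::array<T, 4> &values, QuadDirection dir) {
  std::array<T, 4> result{};
  for (std::size_t lane = 0; lane < 4; ++lane)
    result[lane] = values[sourceLane(lane, dir)];
  return result;
}
```

Under this model, reading across Y exchanges the top and bottom rows of the quad while leaving each lane in its own column.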

@llvmbot
Member

llvmbot commented Mar 19, 2026

@llvm/pr-subscribers-clang-codegen

Author: Kai (kcloudy0717)

Changes

This PR adds QuadReadAcrossY intrinsic support in HLSL with codegen for both DirectX and SPIRV backends. Resolves #99176.

  • Implement QuadReadAcrossY clang builtin,
  • Link QuadReadAcrossY clang builtin with hlsl_intrinsics.h
  • Add sema checks for QuadReadAcrossY to CheckHLSLBuiltinFunctionCall in SemaChecking.cpp
  • Add codegen for QuadReadAcrossY to EmitHLSLBuiltinExpr in CGBuiltin.cpp
  • Add codegen tests to clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
  • Add sema tests to clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
  • Create the int_dx_QuadReadAcrossY intrinsic in IntrinsicsDirectX.td
  • Create the DXILOpMapping of int_dx_QuadReadAcrossY to 123 in DXIL.td
  • Create the QuadReadAcrossY.ll and QuadReadAcrossY_errors.ll tests in llvm/test/CodeGen/DirectX/
  • Create the int_spv_QuadReadAcrossY intrinsic in IntrinsicsSPIRV.td
  • In SPIRVInstructionSelector.cpp create the QuadReadAcrossY lowering and map it to int_spv_QuadReadAcrossY in SPIRVInstructionSelector::selectIntrinsic.
  • Create SPIR-V backend test case in llvm/test/CodeGen/SPIRV/hlsl-intrinsics/QuadReadAcrossY.ll
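
As a rough illustration of what the DirectX side of the list above amounts to, the sketch below models the mapping from the intrinsic to the `dx.op.quadOp` call that the `.ll` tests check for. It is plain string formatting, not the real `DXILOpLowering` code; `formatQuadOpCall` and `scalarizedQuadOpCalls` are hypothetical helpers. `ReadAcrossY = 1` and opcode 123 come from the tests in this PR, while the other two enum values are an assumption based on the DXIL `QuadOpKind` numbering.

```cpp
#include <string>
#include <vector>

// ReadAcrossY = 1 matches the `i8 1` operand in this PR's tests.
// The other values are assumed from the DXIL QuadOpKind enumeration.
enum class QuadOpKind : int {
  ReadAcrossX = 0,
  ReadAcrossY = 1,
  ReadAcrossDiagonal = 2,
};

// Hypothetical helper: formats the lowered call for one scalar operand.
std::string formatQuadOpCall(const std::string &irType,
                             const std::string &overloadSuffix,
                             const std::string &operand, QuadOpKind kind) {
  constexpr int QuadOpOpcode = 123; // DXILOp<123, quadOp> in DXIL.td
  return "call " + irType + " @dx.op.quadOp." + overloadSuffix + "(i32 " +
         std::to_string(QuadOpOpcode) + ", " + irType + " " + operand +
         ", i8 " + std::to_string(static_cast<int>(kind)) + ")";
}

// Hypothetical helper: a vector operand is scalarized first, so the lowering
// emits one call per element, named <vector>.i0, <vector>.i1, ...
std::vector<std::string>
scalarizedQuadOpCalls(const std::string &elemType,
                      const std::string &overloadSuffix,
                      const std::string &vectorName, unsigned numElements,
                      QuadOpKind kind) {
  std::vector<std::string> calls;
  for (unsigned i = 0; i < numElements; ++i)
    calls.push_back(formatQuadOpCall(elemType, overloadSuffix,
                                     vectorName + ".i" + std::to_string(i),
                                     kind));
  return calls;
}
```

The per-element form mirrors why the intrinsic is also added to `isTargetIntrinsicTriviallyScalarizable`: the DXIL op only has scalar overloads, so vectors have to be split before lowering.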

Patch is 22.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/187440.diff

16 Files Affected:

  • (modified) clang/include/clang/Basic/Builtins.td (+6)
  • (modified) clang/lib/CodeGen/CGHLSLBuiltins.cpp (+7)
  • (modified) clang/lib/CodeGen/CGHLSLRuntime.h (+1)
  • (modified) clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h (+99)
  • (modified) clang/lib/Sema/SemaHLSL.cpp (+2-1)
  • (added) clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl (+46)
  • (added) clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl (+28)
  • (modified) llvm/include/llvm/IR/IntrinsicsDirectX.td (+1)
  • (modified) llvm/include/llvm/IR/IntrinsicsSPIRV.td (+1)
  • (modified) llvm/lib/Target/DirectX/DXIL.td (+4)
  • (modified) llvm/lib/Target/DirectX/DXILShaderFlags.cpp (+1)
  • (modified) llvm/lib/Target/DirectX/DirectXTargetTransformInfo.cpp (+1)
  • (modified) llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp (+3)
  • (added) llvm/test/CodeGen/DirectX/QuadReadAcrossY.ll (+87)
  • (modified) llvm/test/CodeGen/DirectX/ShaderFlags/wave-ops.ll (+7)
  • (added) llvm/test/CodeGen/SPIRV/hlsl-intrinsics/QuadReadAcrossY.ll (+44)
diff --git a/clang/include/clang/Basic/Builtins.td b/clang/include/clang/Basic/Builtins.td
index 8e002c5b900aa..5a1ab7197b4a8 100644
--- a/clang/include/clang/Basic/Builtins.td
+++ b/clang/include/clang/Basic/Builtins.td
@@ -5276,6 +5276,12 @@ def HLSLQuadReadAcrossX : LangBuiltin<"HLSL_LANG"> {
   let Prototype = "void(...)";
 }
 
+def HLSLQuadReadAcrossY : LangBuiltin<"HLSL_LANG"> {
+  let Spellings = ["__builtin_hlsl_quad_read_across_y"];
+  let Attributes = [NoThrow, Const];
+  let Prototype = "void(...)";
+}
+
 def HLSLClamp : LangBuiltin<"HLSL_LANG"> {
   let Spellings = ["__builtin_hlsl_elementwise_clamp"];
   let Attributes = [NoThrow, Const, CustomTypeChecking];
diff --git a/clang/lib/CodeGen/CGHLSLBuiltins.cpp b/clang/lib/CodeGen/CGHLSLBuiltins.cpp
index 22293c5983d89..98edd479c02a5 100644
--- a/clang/lib/CodeGen/CGHLSLBuiltins.cpp
+++ b/clang/lib/CodeGen/CGHLSLBuiltins.cpp
@@ -1402,6 +1402,13 @@ Value *CodeGenFunction::EmitHLSLBuiltinExpr(unsigned BuiltinID,
                                &CGM.getModule(), ID, {OpExpr->getType()}),
                            ArrayRef{OpExpr}, "hlsl.quad.read.across.x");
   }
+  case Builtin::BI__builtin_hlsl_quad_read_across_y: {
+    Value *OpExpr = EmitScalarExpr(E->getArg(0));
+    Intrinsic::ID ID = CGM.getHLSLRuntime().getQuadReadAcrossYIntrinsic();
+    return EmitRuntimeCall(Intrinsic::getOrInsertDeclaration(
+                               &CGM.getModule(), ID, {OpExpr->getType()}),
+                           ArrayRef{OpExpr}, "hlsl.quad.read.across.y");
+  }
   case Builtin::BI__builtin_hlsl_elementwise_sign: {
     auto *Arg0 = E->getArg(0);
     Value *Op0 = EmitScalarExpr(Arg0);
diff --git a/clang/lib/CodeGen/CGHLSLRuntime.h b/clang/lib/CodeGen/CGHLSLRuntime.h
index d76cc18c9a259..f93ed28b5c787 100644
--- a/clang/lib/CodeGen/CGHLSLRuntime.h
+++ b/clang/lib/CodeGen/CGHLSLRuntime.h
@@ -158,6 +158,7 @@ class CGHLSLRuntime {
   GENERATE_HLSL_INTRINSIC_FUNCTION(WaveGetLaneCount, wave_get_lane_count)
   GENERATE_HLSL_INTRINSIC_FUNCTION(WaveReadLaneAt, wave_readlane)
   GENERATE_HLSL_INTRINSIC_FUNCTION(QuadReadAcrossX, quad_read_across_x)
+  GENERATE_HLSL_INTRINSIC_FUNCTION(QuadReadAcrossY, quad_read_across_y)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitUHigh, firstbituhigh)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitSHigh, firstbitshigh)
   GENERATE_HLSL_INTRINSIC_FUNCTION(FirstBitLow, firstbitlow)
diff --git a/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h b/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
index c2e4d74af6873..a1048d21b20e6 100644
--- a/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
+++ b/clang/lib/Headers/hlsl/hlsl_alias_intrinsics.h
@@ -3604,6 +3604,105 @@ __attribute__((convergent)) double3 QuadReadAcrossX(double3);
 _HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_x)
 __attribute__((convergent)) double4 QuadReadAcrossX(double4);
 
+//===----------------------------------------------------------------------===//
+// QuadReadAcrossY builtins
+//===----------------------------------------------------------------------===//
+
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half QuadReadAcrossY(half);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half2 QuadReadAcrossY(half2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half3 QuadReadAcrossY(half3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) half4 QuadReadAcrossY(half4);
+
+#ifdef __HLSL_ENABLE_16_BIT
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t QuadReadAcrossY(int16_t);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t2 QuadReadAcrossY(int16_t2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t3 QuadReadAcrossY(int16_t3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int16_t4 QuadReadAcrossY(int16_t4);
+
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t QuadReadAcrossY(uint16_t);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t2 QuadReadAcrossY(uint16_t2);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t3 QuadReadAcrossY(uint16_t3);
+_HLSL_16BIT_AVAILABILITY_SHADERMODEL_DEFAULT()
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint16_t4 QuadReadAcrossY(uint16_t4);
+#endif
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int QuadReadAcrossY(int);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int2 QuadReadAcrossY(int2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int3 QuadReadAcrossY(int3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int4 QuadReadAcrossY(int4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint QuadReadAcrossY(uint);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint2 QuadReadAcrossY(uint2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint3 QuadReadAcrossY(uint3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint4 QuadReadAcrossY(uint4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t QuadReadAcrossY(int64_t);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t2 QuadReadAcrossY(int64_t2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t3 QuadReadAcrossY(int64_t3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) int64_t4 QuadReadAcrossY(int64_t4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t QuadReadAcrossY(uint64_t);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t2 QuadReadAcrossY(uint64_t2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t3 QuadReadAcrossY(uint64_t3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) uint64_t4 QuadReadAcrossY(uint64_t4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float QuadReadAcrossY(float);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float2 QuadReadAcrossY(float2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float3 QuadReadAcrossY(float3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) float4 QuadReadAcrossY(float4);
+
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double QuadReadAcrossY(double);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double2 QuadReadAcrossY(double2);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double3 QuadReadAcrossY(double3);
+_HLSL_BUILTIN_ALIAS(__builtin_hlsl_quad_read_across_y)
+__attribute__((convergent)) double4 QuadReadAcrossY(double4);
+
 //===----------------------------------------------------------------------===//
 // sign builtins
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/Sema/SemaHLSL.cpp b/clang/lib/Sema/SemaHLSL.cpp
index 0619295cd2fbb..badbf9416d44f 100644
--- a/clang/lib/Sema/SemaHLSL.cpp
+++ b/clang/lib/Sema/SemaHLSL.cpp
@@ -4236,7 +4236,8 @@ bool SemaHLSL::CheckBuiltinFunctionCall(unsigned BuiltinID, CallExpr *TheCall) {
     TheCall->setType(ArgTyExpr);
     break;
   }
-  case Builtin::BI__builtin_hlsl_quad_read_across_x: {
+  case Builtin::BI__builtin_hlsl_quad_read_across_x:
+  case Builtin::BI__builtin_hlsl_quad_read_across_y: {
     if (SemaRef.checkArgCount(TheCall, 1))
       return true;
 
diff --git a/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
new file mode 100644
index 0000000000000..a050aa9cf1722
--- /dev/null
+++ b/clang/test/CodeGenHLSL/builtins/QuadReadAcrossY.hlsl
@@ -0,0 +1,46 @@
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   dxil-pc-shadermodel6.3-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-DXIL
+// RUN: %clang_cc1 -std=hlsl2021 -finclude-default-header -triple \
+// RUN:   spirv-pc-vulkan-compute %s -emit-llvm -disable-llvm-passes -o - | \
+// RUN:   FileCheck %s --check-prefixes=CHECK,CHECK-SPIRV
+
+// Test basic lowering to runtime function call.
+
+// CHECK-LABEL: test_int
+int test_int(int expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i32([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i32([[TY]]) #[[#attr:]]
+
+// CHECK-LABEL: test_uint64_t
+uint64_t test_uint64_t(uint64_t expr) {
+  // CHECK-SPIRV:  %[[RET:.*]] = call spir_func [[TY:.*]] @llvm.spv.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK-DXIL:  %[[RET:.*]] = call [[TY:.*]] @llvm.dx.quad.read.across.y.i64([[TY]] %[[#]])
+  // CHECK:  ret [[TY]] %[[RET]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i64([[TY]]) #[[#attr:]]
+
+// Test basic lowering to runtime function call with array and float value.
+
+// CHECK-LABEL: test_floatv4
+float4 test_floatv4(float4 expr) {
+  // CHECK-SPIRV:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn spir_func [[TY1:.*]] @llvm.spv.quad.read.across.y.v4f32([[TY1]] %[[#]]
+  // CHECK-DXIL:  %[[RET1:.*]] = call reassoc nnan ninf nsz arcp afn [[TY1:.*]] @llvm.dx.quad.read.across.y.v4f32([[TY1]] %[[#]])
+  // CHECK:  ret [[TY1]] %[[RET1]]
+  return QuadReadAcrossY(expr);
+}
+
+// CHECK-DXIL: declare [[TY1]] @llvm.dx.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+// CHECK-SPIRV: declare [[TY1]] @llvm.spv.quad.read.across.y.v4f32([[TY1]]) #[[#attr]]
+
+// CHECK: attributes #[[#attr]] = {{{.*}} convergent {{.*}}}
+
diff --git a/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
new file mode 100644
index 0000000000000..2995895ff4c3d
--- /dev/null
+++ b/clang/test/SemaHLSL/BuiltIns/QuadReadAcrossY-errors.hlsl
@@ -0,0 +1,28 @@
+// RUN: %clang_cc1 -finclude-default-header -triple dxil-pc-shadermodel6.6-library %s -emit-llvm-only -disable-llvm-passes -verify
+
+int test_too_few_arg() {
+  return __builtin_hlsl_quad_read_across_y();
+  // expected-error@-1 {{too few arguments to function call, expected 1, have 0}}
+}
+
+float2 test_too_many_arg(float2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0, p0);
+  // expected-error@-1 {{too many arguments to function call, expected 1, have 2}}
+}
+
+bool test_expr_bool_type_check(bool p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool'}}
+}
+
+bool2 test_expr_bool_vec_type_check(bool2 p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'bool2' (aka 'vector<bool, 2>')}}
+}
+
+struct S { float f; };
+
+S test_expr_struct_type_check(S p0) {
+  return __builtin_hlsl_quad_read_across_y(p0);
+  // expected-error@-1 {{invalid operand of type 'S' where a scalar or vector is required}}
+}

@kcloudy0717
Contributor Author

Pinging @bob80905 and @farzonl for review. The respective test PR can be found in llvm/offload-test-suite#993.

Contributor

@bob80905 bob80905 left a comment


LGTM

@kcloudy0717 kcloudy0717 force-pushed the kcloudy0717/QuadReadAcrossY branch from 31ae7cf to 1687ad7 Compare March 23, 2026 16:23
@kcloudy0717
Contributor Author

@bob80905 This PR is good to go too.

Comment on lines +18 to +19
// CHECK-DXIL: declare [[TY]] @llvm.dx.quad.read.across.y.i32([[TY]]) #[[#attr:]]
// CHECK-SPIRV: declare [[TY]] @llvm.spv.quad.read.across.y.i32([[TY]]) #[[#attr:]]
Member


This is fine, but we do have a way to reduce the number of target-specific check lines. See clang/test/CodeGenHLSL/builtins/dot.hlsl for an example:

// DXCHECK: %hlsl.dot = call i32 @llvm.[[ICF:dx]].sdot.v2i32(<2 x i32>
// SPVCHECK: %hlsl.dot = call i32 @llvm.[[ICF:spv]].sdot.v2i32(<2 x i32>

// CHECK: %hlsl.dot = call i32 @llvm.[[ICF]].sdot.v3i32(<3 x i32>

See how all successive checks just use ICF after the first one?
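
To make the capture-and-reuse idea concrete, here is a toy model of it, written as a sketch rather than a description of how FileCheck is implemented. It only understands `[[NAME:regex]]` (define) and `[[NAME]]` (use), treats everything else as literal text, ignores check prefixes, and does not support capture groups inside the defining regex. `Bindings`, `escapeLiteral`, and `matchCheck` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

using Bindings = std::map<std::string, std::string>;

// Escapes regex metacharacters so literal check text matches itself.
std::string escapeLiteral(const std::string &text) {
  static const std::string meta = R"(\^$.|?*+()[]{})";
  std::string out;
  for (char c : text) {
    if (meta.find(c) != std::string::npos)
      out += '\\';
    out += c;
  }
  return out;
}

// Matches one check pattern against one line of output.
// [[NAME:regex]] binds NAME to the matched text; [[NAME]] must match the
// previously bound text exactly. Returns false on mismatch or bad input.
bool matchCheck(const std::string &pattern, const std::string &line,
                Bindings &vars) {
  std::string regexText;
  std::vector<std::string> defined; // names in capture-group order
  std::size_t pos = 0;
  while (pos < pattern.size()) {
    std::size_t open = pattern.find("[[", pos);
    if (open == std::string::npos) {
      regexText += escapeLiteral(pattern.substr(pos));
      break;
    }
    std::size_t close = pattern.find("]]", open);
    if (close == std::string::npos)
      return false; // unterminated variable reference
    regexText += escapeLiteral(pattern.substr(pos, open - pos));
    std::string body = pattern.substr(open + 2, close - open - 2);
    std::size_t colon = body.find(':');
    if (colon == std::string::npos) {
      auto it = vars.find(body);
      if (it == vars.end())
        return false; // use before definition
      regexText += escapeLiteral(it->second);
    } else {
      defined.push_back(body.substr(0, colon));
      regexText += "(" + body.substr(colon + 1) + ")";
    }
    pos = close + 2;
  }
  std::smatch m;
  if (!std::regex_search(line, m, std::regex(regexText)))
    return false;
  for (std::size_t i = 0; i < defined.size(); ++i)
    vars[defined[i]] = m[i + 1].str();
  return true;
}
```

In this model, the first target-specific line binds `ICF` to `dx` or `spv`, and every later shared `CHECK` line that uses `[[ICF]]` only matches output for that same target.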

Contributor Author


That's a neat trick! I didn't know about this, thanks for letting me know. I updated the codegen tests for the intrinsic: they now use the ICF capture (plus a CC capture for the spir_func calling convention) and cover all supported types, including 16-bit ones. Could you give it a read?

Contributor Author


I'll apply the same feedback to QuadReadAcrossX's tests, since those are not comprehensive either. I'll open a separate PR for this and tag you for review.

// Test basic lowering to runtime function call with array and float value.

// CHECK-LABEL: test_floatv4
float4 test_floatv4(float4 expr) {
Member


We probably want some 16-bit testing as well.

@@ -0,0 +1,28 @@
// RUN: %clang_cc1 -finclude-default-header -triple dxil-pc-shadermodel6.6-library %s -emit-llvm-only -disable-llvm-passes -verify
Member

@farzonl farzonl Mar 24, 2026

Do we need -emit-llvm-only -disable-llvm-passes? These should just be sema checks; we should never get to codegen. All that is needed is -verify.

Contributor Author

No, this is probably unnecessary. It's here because it was in the majority of the other error tests and I simply copied it over. It should be removed now.

Member

@farzonl farzonl left a comment

Code LGTM. Tests need to be updated.

@kcloudy0717 kcloudy0717 requested a review from farzonl March 25, 2026 08:49
Contributor

@bob80905 bob80905 left a comment

LGTM. The unexpected passes are expected now that we are adding support in clang.

@bob80905 bob80905 merged commit a546c77 into llvm:main Mar 25, 2026
11 of 12 checks passed
@vzakhari
Contributor

My make check-flang returns the following errors now:

../../../../bin/clang-tblgen: error opening .../build/tools/clang/lib/Headers/hlsl/hlsl_inline_intrinsics_gen.inc: No such file or directory
make[3]: *** [tools/clang/lib/Headers/CMakeFiles/hlsl-resource-headers.dir/build.make:79: tools/clang/lib/Headers/hlsl/hlsl_inline_intrinsics_gen.inc] Error 1
make[3]: *** Waiting for unfinished jobs....

This PR is the latest in HLSL area. Can I have any clue what might be wrong?

@farzonl
Member

farzonl commented Mar 25, 2026

> My make check-flang returns the following errors now:
>
> ../../../../bin/clang-tblgen: error opening .../build/tools/clang/lib/Headers/hlsl/hlsl_inline_intrinsics_gen.inc: No such file or directory
> make[3]: *** [tools/clang/lib/Headers/CMakeFiles/hlsl-resource-headers.dir/build.make:79: tools/clang/lib/Headers/hlsl/hlsl_inline_intrinsics_gen.inc] Error 1
> make[3]: *** Waiting for unfinished jobs....
>
> This PR is the latest in HLSL area. Can I have any clue what might be wrong?

This isn’t the PR that changed things.
This PR did: #187610
@Icohedron can you take a look?

@vzakhari
Contributor

The issue resolves if I mkdir tools/clang/lib/Headers/hlsl in the build directory, i.e. these commands fail when there is no hlsl subdir:

# Generate HLSL intrinsic overloads
clang_generate_header(-gen-hlsl-alias-intrinsics HLSLIntrinsics.td
                      hlsl/hlsl_alias_intrinsics_gen.inc)
clang_generate_header(-gen-hlsl-inline-intrinsics HLSLIntrinsics.td
                      hlsl/hlsl_inline_intrinsics_gen.inc)

@Icohedron
Contributor

> > My make check-flang returns the following errors now:
> >
> > ../../../../bin/clang-tblgen: error opening .../build/tools/clang/lib/Headers/hlsl/hlsl_inline_intrinsics_gen.inc: No such file or directory
> > make[3]: *** [tools/clang/lib/Headers/CMakeFiles/hlsl-resource-headers.dir/build.make:79: tools/clang/lib/Headers/hlsl/hlsl_inline_intrinsics_gen.inc] Error 1
> > make[3]: *** Waiting for unfinished jobs....
> >
> > This PR is the latest in HLSL area. Can I have any clue what might be wrong?
>
> This isn’t the PR that changed things. This PR did: #187610 @Icohedron can you take a look?

I'm not sure why the hlsl_inline_intrinsics_gen.inc file isn't being generated. It should be generated unconditionally since 7619b80 removed the condition on HLSL being enabled for generating the files.

I will investigate further.

@vzakhari
Contributor

@Icohedron I think the destination directory must exist for clang-tblgen to work
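
Editorial sketch of the failure mode discussed above: a tool that opens its output file directly fails when the parent directory (here, the `hlsl` subdirectory of the build tree) does not exist yet. The C++ below is only an illustration of the general "create the destination directory first" pattern. It is not clang-tblgen code and not the change that landed for this issue; the helper name `writeGeneratedFile` is hypothetical.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not part of clang-tblgen): writes `contents` to
// `path`, creating any missing parent directories beforehand.
// Returns true on success, false if a directory or the file could not be
// created or written.
bool writeGeneratedFile(const fs::path &path, const std::string &contents) {
  std::error_code ec;
  if (path.has_parent_path()) {
    // Succeeds without error if the directory already exists.
    fs::create_directories(path.parent_path(), ec);
    if (ec)
      return false;
  }

  // Opening the stream alone would fail here if the parent were missing.
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  if (!out)
    return false;

  out << contents;
  return static_cast<bool>(out.flush());
}
```

Calling `writeGeneratedFile("hlsl/hlsl_inline_intrinsics_gen.inc", text)` from an empty build directory would then create `hlsl/` on demand instead of failing with "No such file or directory". In a CMake-driven build the equivalent step would more naturally live in the build scripts than in the generator itself.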

Icohedron added a commit that referenced this pull request Mar 25, 2026
This PR should fix an issue reported by
#187440 (comment)
and
#187610 (comment)
where using `clang_generate_header` to generate a header into a
directory that did not exist caused an error.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 25, 2026
…(#188618)

bogner pushed a commit to llvm/offload-test-suite that referenced this pull request Mar 26, 2026
We can remove the XFAIL directives for Clang config for QuadReadAcrossY
tests now that QuadReadAcrossY intrinsic has landed in LLVM:
llvm/llvm-project#187440.

I also appended a `convergence` suffix to the name of the QuadReadAcrossY
control flow test file.

Co-authored-by: Finn Plummer <[email protected]>
ambergorzynski pushed a commit to ambergorzynski/llvm-project that referenced this pull request Mar 27, 2026
ambergorzynski pushed a commit to ambergorzynski/llvm-project that referenced this pull request Mar 27, 2026
Aadarsh-Keshri pushed a commit to Aadarsh-Keshri/llvm-project that referenced this pull request Mar 28, 2026
Aadarsh-Keshri pushed a commit to Aadarsh-Keshri/llvm-project that referenced this pull request Mar 28, 2026
Development

Successfully merging this pull request may close these issues.

Implement the QuadReadAcrossY HLSL Function

6 participants