Exploring CPU microkernels on a matmul examplelink
Basic setup, command lineslink
Source file: matmul.mlir
:
func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
%result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>
return %result: tensor<?x?xf32>
}
Basic compilation command line:
$ iree-compile matmul.mlir -o /tmp/matmul.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-cpu=znver4 \
--iree-llvmcpu-enable-ukernels=all
This creates a IREE bytecode module:
$ ls -l /tmp/matmul.vmfb
-rw-rw-r-- 1 2884 Jan 22 10:37 /tmp/matmul.vmfb
The above .vmfb
is the only thing that's needed to run this matmul on the
target device. But to understand microkernels, we are now going to generate
additional intermediate files.
Additional iree-compile
flags to save intermediate files (IR, assembly, object
code):
--iree-hal-dump-executable-intermediates-to=/tmp/matmul --x86-asm-syntax=intel
This saves LLVM IR in binary serialization ("bitcode", filename extension
.bc
). To read it, we need to "disassemble" it using llvm-dis
to obtain
textual IR (filename extension .ll
).
llvm-dis /tmp/matmul/*.bc
Intermediate files:
35196 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.codegen.bc
251597 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.codegen.ll
181740 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.linked.bc
1396190 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.linked.ll
32096 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.o
34504 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.optimized.bc
184981 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll
82016 /tmp/matmul/module_matmul_linked_llvm_cpu_embedded_elf_x86_64.s
Another important iree-compile
flag: --mlir-print-ir-after-all
records the
IR after each pass. We save that (stderr) output to a file, ir.log
by
appending to the iree-compile
command line:
--mlir-print-ir-after-all 2>/tmp/matmul/ir.log
Overview of the compilation and linking flowlink
This graph shows the transformations from the source matmul.mlir
to the final
matmul.vmfb
with the various intermediates met in the previous section:
graph TD;
matmulontensors-- CPUMaterializeEncoding -->mmt4dontensors;
mmt4dontensors-- CPULowerToUKernels -->ukernelontensors;
ukernelontensors-- IREEComprehensiveBufferize -->ukernelonmemref;
ukernelonmemref-- LowerUKernelOpsToCalls -->ukernelcall;
ukernelcall-- ConvertToLLVM -->codegenll;
codegenll-->bitcodelinking;
genericsource-- clang -emit-llvm --> genericbitcode -- llvm-link --> ukernelbitcode;
archsource -- clang -emit-llvm --> archbitcode -- llvm-link --> ukernelbitcode;
ukernelbitcode-->ukernelbitcodeembedded;
ukernelbitcodeembedded-->bitcodelinking;
bitcodelinking-->linkedll;
linkedll -- IR optimization --> optimizedll;
optimizedll -- LLVM x86 backend --> asm -- LLVM assembler --> object -- iree-compile output --> vmfb;
matmulontensors["linalg.matmul on tensors"];
mmt4dontensors["linalg.mmt4d on tensors"];
ukernelontensors["ukernel.generic on tensors"];
ukernelonmemref["ukernel.generic on memrefs"];
ukernelcall["call to ukernel entry point"];
codegenll["module_matmul_...codegen.ll"];
linkedll["module_matmul_...linked.ll"];
optimizedll["module_matmul_...optimized.ll"];
genericsource["generic source code
mmt4d.c"]
archsource["architecture-specific source code
mmt4d_x86_64_avx512_base.c"]
genericbitcode["generic code as bitcode
ukernel_bitcode_generic_x86_64.bc"]
archbitcode["architecture-specific code as bitcode
ukernel_bitcode_arch_x86_64_avx512_base.bc"]
ukernelbitcode["linked bitcode
ukernel_bitcode_x86_64.bc"];
ukernelbitcodeembedded["microkernel bitcode embedded as
static data in iree-compile"];
bitcodelinking["llvm::Linker::LinkInModule"];
asm["x86 asm, module_matmul_...s"];
object["x86 ELF, module_matmul_...o"];
vmfb["matmul.vmfb"];
subgraph Part1["Part 1: MLIR code generation"]
matmulontensors
mmt4dontensors
ukernelontensors
ukernelonmemref
ukernelcall
codegenll
end
subgraph Part2["Part 2: Microkernels compilation (part of the IREE build)"]
genericsource
archsource
genericbitcode
archbitcode
ukernelbitcode
ukernelbitcodeembedded
end
subgraph Part3["Part 3: Linking with microkernels, optimizing, producing object code"]
bitcodelinking
linkedll
optimizedll
asm
object
vmfb
end
style Part1 stroke:#FDD835,stroke-width:2px
style Part2 stroke:#039BE5,stroke-width:2px
style Part3 stroke:#43A047,stroke-width:2px
🟨 Part 1: MLIR code generationlink
Some initial boilerplate happens around our linalg.matmul
before anything
interesting happens to it.:
➤ Appendix: IR dump after WrapEntryPointsPass
Next, the first interesting thing is the CPUMaterializeEncoding
pass, where
the linalg.matmul
gets rewritten into a linalg.mmt4d
which is a matmul with
a tiled data layout. This is where we start specializing to the target ISA
feature set, AVX-512, favoring a 16x16 tile size for this float32 matmul.
➤ Appendix: IR Dump After CPUMaterializeEncoding
The idea is that linalg.mmt4d
is what we will have a microkernel for, below.
There is no need to have microkernels for anything but the target-optimal tiled
layout, so we don't bother carrying a microkernel for linalg.matmul
itself.
The matrix layout transformation, bringing matrix data into this tiled layout,
is also out of the scope of this linalg.mmt4d
and hence of the mmt4d
microkernel: we can rely on generic code-generation to take care of these
byte-permutations, which is our preference as we aim to let that fuse into
producers/consumers.
Next comes the rewrite of linalg.mmt4d
into a microkernel op, done by the
CPULowerToUKernels
pass. Here is the TableGen definition of the generic
microkernel op we're going to generate:
TableGen definition of ukernel.generic
C++ compiler code for CPULowerToUKernels
➤ Appendix: IR Dump After CPULowerToUKernels
Notice that this IR is still working on tensor
values, not on memref
values.
- Rewrites are much nicer to perform on tensors than on memrefs.
ukernel.generic
works with both tensors and memrefs.- Allows performing the rewrite to
ukernel.generic
while still on tensors, then just ride bufferization.
Next, bufferization takes place. tensor
values become memref
.
➤ Appendix: IR Dump After IREEComprehensiveBufferize
Next, the LowerUKernelOpsToCalls
runs, rewriting ukernel.generic
ops into
function calls.
- Made possible by bufferization: there now are buffer pointers and strides to pass to the target function.
➤ Appendix: IR Dump After LowerUKernelOpsToCalls
Finally, this gets lowered to the MLIR LLVM dialect, in preparation for outputting plain LLVM IR.
➤ Appendix: IR Dump After ConvertToLLVM
The above gets converted to plain LLVM IR and that's our first intermediate
file, module_matmul_linked_llvm_cpu_embedded_elf_x86_64.codegen.bc
, which
llvm-dis
helps disassemble into a textual IR file (.ll
).
➤ Appendix: Intermediate file: ...codegen.bc
, disassembled to ...codegen.ll
The above IR references an external symbol iree_uk_mmt4d
for the microkernel
that it calls, so it now needs to be linked against the ukernels bitcode.
🟦 Part 2: Microkernels compilation (part of the IREE build)link
Microkernels are:
- Compiled to self-contained bitcode, once for each target architecture.
- That puts requirement on the source languages that they can be defined in.
- Can use C via
clang -emit-llvm
plus extra flags like-ffreestanding
.- The source must not
#include
standard library headers or do anything OS-specific.
- The source must not
- Can use inline assembly but not out-of-line assembly.
- Can use C via
- That puts requirement on the source languages that they can be defined in.
- Taking scalar parameters, including buffer pointers and strides.
- Array-processing microkernels have a memory-to-memory interface.
- No vector-to-vector microkernels.
- Store-to-load-forwarding can still happen post-linking, effectively achieving the same.
- Microkernel ops avoid MLIR vector dialect altogether.
C source code for the
iree_uk_mmt4d
microkernel entry point
This calls an architecture-specific function to return a function pointer to the optimized inner-loop implementation to use for given data types and SIMD ISA features, and then uses that in a generic outer-loop implementation.
So the really interesting part is the implementation of the inner-loop function
that we got a function pointer to. For example,
here
is the one used in our example where the element type is f32
and the target
has AVX-512.
A custom CMake function,
iree_bitcode_library
,
wraps clang
to compile these C source files with special flags to obtain
freestanding bitcode.
Likewise, a custom CMake function,
iree_link_bitcode
,
wraps llvm-link
to link bitcode files.
These are used during the IREE compiler build (as a dependency of
iree-compile
) to build microkernels as bitcode for all supported target
architectures, generating one bitcode file for each architecture in the build
directory:
~/iree-build$ ls ./runtime/src/iree/builtins/ukernel/ukernel_bitcode_*.bc | grep -v generic
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_arm_32.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_arm_64.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_riscv_32.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_riscv_64.bc
./runtime/src/iree/builtins/ukernel/ukernel_bitcode_x86_64.bc
These files are then embedded as static data within iree-compile
, so that
iree-compile
stays self-contained.
Here are some samples of ukernel bitcode if you are curious what it looks like:
➤ Appendix: embedded microkernel bitcode: iree_uk_mmt4d
ukernel entry point
➤ Appendix: embedded microkernel bitcode: inner-loop tile function
🟩 Part 3: Linking with microkernels, optimizing, producing object codelink
The previous two sections covered respectively the compilation of the MLIR module, and the compilation of microkernels, as two separate bitcode modules. Now we turn to how these two bitcode modules are linked together.
After code generation, iree-compile
loads microkernel bitcode:
https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L490
It is worth zooming into that loadUKernelBitcode
function as, in addition to
just loading the bitcode, it does one important thing: it adds the
alwaysinline
attribute on every function. As we will see just below, always
inlining microkernels is key to achieving perfect results with no downsides
compared to a pure code-generation approach.
https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/Builtins/UKernel.cpp#L36-L62
And links it into the current module: https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L499
The linked IR so far is not very interesting, as it is still essentially just
the concatenation of the above-discussed codegen and microkernel bitcode
(except now with alwaysinline
attributes). If you are curious, it is dumped as
the ...linked.bc
file.
Where it gets interesting is that immediately after that, we run LLVM IR optimization passes, which can be thought of as a form of link-time optimization (LTO): https://github.com/iree-org/iree/blob/c437add6a3b1e3e873cec95505d37c4938fee74f/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L527
At this point, all the microkernel code gets inlined into the dispatch function, the correct AVX-512 optimized tile function is selected and inlined, and everything else is DCE'd. That's how the user pays no cost for what they don't use --- not only for the microkernel entry points that they don't call, but also for all the unused code paths within each microkernel.
➤ Appendix: Intermediate file: ...optimized.bc
, disassembled to ...optimized.ll
This then goes to the LLVM x86 backend, which produces x86 assembly.
Appendixlink
IR dump after WrapEntryPointsPasslink
// -----// IR Dump After mlir::iree_compiler::IREE::ABI::WrapEntryPointsPass (iree-abi-wrap-entry-points) //----- //
[...]
// -----// IR Dump After Inliner (inline) //----- //
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "znver4", cpu_features = "+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+sse4a,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vbmi,+avx512ifma,+avx512vpopcntdq,+avx512vbmi2,+gfni,+vpclmulqdq,+avx512vnni,+avx512bitalg,+avx512bf16,+adx,+clflushopt,+clwb,+clzero,+cx16,+cx8,+crc32,+f16c,+fsgsbase,+fxsr,+invpcid,+lzcnt,+movbe,+mwaitx,+pku,+prfchw,+rdpid,+rdpru,+rdrnd,+rdseed,+sahf,+sha,+shstk,+vaes,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves,+evex512", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 64 : index, target_triple = "x86_64-unknown-unknown-eabi-elf", ukernels = "all"}>
#device_target_llvm_cpu = #hal.device.target<"llvm-cpu", {executable_targets = [#executable_target_embedded_elf_x86_64_]}>
module attributes {hal.device.targets = [#device_target_llvm_cpu]} {
func.func @matmul_dynamic(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @matmul_dynamic(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>, %input2: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
%0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
%1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
%2 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<?x?xf32>{%0, %1}
%3 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
%4 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
%5 = hal.tensor.import %arg1 "input1" : !hal.buffer_view -> tensor<?x?xf32>{%3, %4}
%6 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[0] : index
%7 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[1] : index
%8 = hal.tensor.import %arg2 "input2" : !hal.buffer_view -> tensor<?x?xf32>{%6, %7}
%9 = linalg.matmul ins(%2, %5 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%8 : tensor<?x?xf32>) -> tensor<?x?xf32>
%10 = hal.tensor.export %9 "output0" : tensor<?x?xf32>{%6, %7} -> !hal.buffer_view
return %10 : !hal.buffer_view
}
}
IR Dump After CPUMaterializeEncodinglink
// -----// IR Dump After CPUMaterializeEncoding (iree-codegen-cpu-materialize-encoding) //----- //
[...]
// -----// IR Dump After Canonicalizer (canonicalize) //----- //
[...]
// -----// IR Dump After CSE (cse) //----- //
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "znver4", cpu_features = "+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+sse4a,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vbmi,+avx512ifma,+avx512vpopcntdq,+avx512vbmi2,+gfni,+vpclmulqdq,+avx512vnni,+avx512bitalg,+avx512bf16,+adx,+clflushopt,+clwb,+clzero,+cx16,+cx8,+crc32,+f16c,+fsgsbase,+fxsr,+invpcid,+lzcnt,+movbe,+mwaitx,+pku,+prfchw,+rdpid,+rdpru,+rdrnd,+rdseed,+sahf,+sha,+shstk,+vaes,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves,+evex512", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 64 : index, target_triple = "x86_64-unknown-unknown-eabi-elf", ukernels = "all"}>
#map = affine_map<()[s0] -> (s0 ceildiv 16)>
#device_target_llvm_cpu = #hal.device.target<"llvm-cpu", {executable_targets = [#executable_target_embedded_elf_x86_64_]}>
module attributes {hal.device.targets = [#device_target_llvm_cpu]} {
func.func @matmul_dynamic(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @matmul_dynamic(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>, %input2: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
%cst = arith.constant 0.000000e+00 : f32
%0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
%1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
%2 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<?x?xf32>{%0, %1}
%3 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
%4 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
%5 = hal.tensor.import %arg1 "input1" : !hal.buffer_view -> tensor<?x?xf32>{%3, %4}
%6 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[0] : index
%7 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[1] : index
%8 = hal.tensor.import %arg2 "input2" : !hal.buffer_view -> tensor<?x?xf32>{%6, %7}
%9 = affine.apply #map()[%0]
%10 = tensor.empty(%9, %1) : tensor<?x?x16x1xf32>
%pack = tensor.pack %2 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 1] into %10 : tensor<?x?xf32> -> tensor<?x?x16x1xf32>
%11 = affine.apply #map()[%4]
%12 = tensor.empty(%11, %3) : tensor<?x?x16x1xf32>
%pack_0 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [16, 1] into %12 : tensor<?x?xf32> -> tensor<?x?x16x1xf32>
%13 = affine.apply #map()[%6]
%14 = affine.apply #map()[%7]
%15 = tensor.empty(%13, %14) : tensor<?x?x16x16xf32>
%pack_1 = tensor.pack %8 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %15 : tensor<?x?xf32> -> tensor<?x?x16x16xf32>
%16 = linalg.mmt4d ins(%pack, %pack_0 : tensor<?x?x16x1xf32>, tensor<?x?x16x1xf32>) outs(%pack_1 : tensor<?x?x16x16xf32>) -> tensor<?x?x16x16xf32>
%17 = tensor.empty(%6, %7) : tensor<?x?xf32>
%unpack = tensor.unpack %16 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [16, 16] into %17 : tensor<?x?x16x16xf32> -> tensor<?x?xf32>
%18 = hal.tensor.export %unpack "output0" : tensor<?x?xf32>{%6, %7} -> !hal.buffer_view
return %18 : !hal.buffer_view
}
}
IR Dump After CPULowerToUKernelslink
// -----// IR Dump After CPULowerToUKernels (iree-codegen-cpu-lower-to-ukernels) //----- //
module {
func.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32() {
%c1281_i32 = arith.constant 1281 : i32
%c1_i32 = arith.constant 1 : i32
%c16_i32 = arith.constant 16 : i32
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%c32_i64 = arith.constant 32 : i64
%0 = hal.interface.constant.load[0] : i32
%1 = hal.interface.constant.load[1] : i32
%2 = hal.interface.constant.load[2] : i32
%3 = hal.interface.constant.load[3] : i32
%4 = hal.interface.constant.load[4] : i32
%5 = hal.interface.constant.load[5] : i32
%6 = hal.interface.constant.load[6] : i32
%7 = hal.interface.constant.load[7] : i32
%8 = hal.interface.constant.load[8] : i32
%9 = hal.interface.constant.load[9] : i32
%10 = hal.interface.constant.load[10] : i32
%11 = hal.interface.constant.load[11] : i32
%12 = hal.interface.constant.load[12] : i32
%13 = hal.interface.constant.load[13] : i32
%14 = hal.interface.constant.load[14] : i32
%15 = hal.interface.constant.load[15] : i32
%16 = arith.extui %0 : i32 to i64
%17 = arith.extui %1 : i32 to i64
%18 = arith.shli %17, %c32_i64 : i64
%19 = arith.ori %16, %18 : i64
%20 = arith.index_castui %19 : i64 to index
%21 = arith.extui %2 : i32 to i64
%22 = arith.extui %3 : i32 to i64
%23 = arith.shli %22, %c32_i64 : i64
%24 = arith.ori %21, %23 : i64
%25 = arith.index_castui %24 : i64 to index
%26 = arith.extui %4 : i32 to i64
%27 = arith.extui %5 : i32 to i64
%28 = arith.shli %27, %c32_i64 : i64
%29 = arith.ori %26, %28 : i64
%30 = arith.index_castui %29 : i64 to index
%31 = arith.extui %6 : i32 to i64
%32 = arith.extui %7 : i32 to i64
%33 = arith.shli %32, %c32_i64 : i64
%34 = arith.ori %31, %33 : i64
%35 = arith.index_castui %34 : i64 to index
%36 = arith.extui %8 : i32 to i64
%37 = arith.extui %9 : i32 to i64
%38 = arith.shli %37, %c32_i64 : i64
%39 = arith.ori %36, %38 : i64
%40 = arith.index_castui %39 : i64 to index
%41 = arith.extui %10 : i32 to i64
%42 = arith.extui %11 : i32 to i64
%43 = arith.shli %42, %c32_i64 : i64
%44 = arith.ori %41, %43 : i64
%45 = arith.index_castui %44 : i64 to index
%46 = arith.extui %12 : i32 to i64
%47 = arith.extui %13 : i32 to i64
%48 = arith.shli %47, %c32_i64 : i64
%49 = arith.ori %46, %48 : i64
%50 = arith.index_castui %49 : i64 to index
%51 = arith.extui %14 : i32 to i64
%52 = arith.extui %15 : i32 to i64
%53 = arith.shli %52, %c32_i64 : i64
%54 = arith.ori %51, %53 : i64
%55 = arith.index_castui %54 : i64 to index
%56 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%30, %35}
%57 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%20) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%40, %45}
%58 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%25) : !flow.dispatch.tensor<readwrite:tensor<?x?x16x16xf32>>{%50, %55}
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
scf.for %arg0 = %workgroup_id_y to %30 step %workgroup_count_y {
scf.for %arg1 = %workgroup_id_x to %40 step %workgroup_count_x {
%59 = flow.dispatch.tensor.load %56, offsets = [%arg0, 0, 0, 0], sizes = [1, %35, 16, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%30, %35} -> tensor<1x?x16x1xf32>
%60 = flow.dispatch.tensor.load %57, offsets = [%arg1, 0, 0, 0], sizes = [1, %35, 16, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x16x1xf32>>{%40, %45} -> tensor<1x?x16x1xf32>
%61 = flow.dispatch.tensor.load %58, offsets = [%arg0, %arg1, 0, 0], sizes = [1, 1, 16, 16], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readwrite:tensor<?x?x16x16xf32>>{%50, %55} -> tensor<1x1x16x16xf32>
%dim = tensor.dim %60, %c1 : tensor<1x?x16x1xf32>
%62 = iree_codegen.ukernel.generic "iree_uk_mmt4d" ins(%59, %60 : tensor<1x?x16x1xf32>, tensor<1x?x16x1xf32>) outs(%61 : tensor<1x1x16x16xf32>) (%c1, %c1, %dim, %c16_i32, %c16_i32, %c1_i32, %c1281_i32 : index, index, index, i32, i32, i32, i32) fn_def_attrs {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = ["processor_data"]} strided_outer_dims(1) -> tensor<1x1x16x16xf32>
flow.dispatch.tensor.store %62, %58, offsets = [%arg0, %arg1, 0, 0], sizes = [1, 1, 16, 16], strides = [1, 1, 1, 1] : tensor<1x1x16x16xf32> -> !flow.dispatch.tensor<readwrite:tensor<?x?x16x16xf32>>{%50, %55}
}
}
return
}
}
IR Dump After IREEComprehensiveBufferizelink
// -----// IR Dump After IREEComprehensiveBufferize (iree-codegen-iree-comprehensive-bufferize) //----- //
[...]
// -----// IR Dump After EmptyTensorToAllocTensor (empty-tensor-to-alloc-tensor) //----- //
[...]
// -----// IR Dump After ResolveShapedTypeResultDims (resolve-shaped-type-result-dims) //----- //
[...]
// -----// IR Dump After Canonicalizer (canonicalize) //----- //
[...]
// -----// IR Dump After CSE (cse) //----- //
[...]
// -----// IR Dump After CleanupBufferAllocView (iree-codegen-cleanup-buffer-alloc-view) //----- //
func.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32() {
%c1281_i32 = arith.constant 1281 : i32
%c1_i32 = arith.constant 1 : i32
%c16_i32 = arith.constant 16 : i32
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%c32_i64 = arith.constant 32 : i64
%0 = hal.interface.constant.load[0] : i32
%1 = hal.interface.constant.load[1] : i32
%2 = hal.interface.constant.load[2] : i32
%3 = hal.interface.constant.load[3] : i32
%4 = hal.interface.constant.load[4] : i32
%5 = hal.interface.constant.load[5] : i32
%6 = hal.interface.constant.load[6] : i32
%7 = hal.interface.constant.load[7] : i32
%8 = hal.interface.constant.load[8] : i32
%9 = hal.interface.constant.load[9] : i32
%10 = hal.interface.constant.load[10] : i32
%11 = hal.interface.constant.load[11] : i32
%12 = hal.interface.constant.load[12] : i32
%13 = hal.interface.constant.load[13] : i32
%14 = hal.interface.constant.load[14] : i32
%15 = hal.interface.constant.load[15] : i32
%16 = arith.extui %0 : i32 to i64
%17 = arith.extui %1 : i32 to i64
%18 = arith.shli %17, %c32_i64 : i64
%19 = arith.ori %16, %18 : i64
%20 = arith.index_castui %19 : i64 to index
%21 = arith.extui %2 : i32 to i64
%22 = arith.extui %3 : i32 to i64
%23 = arith.shli %22, %c32_i64 : i64
%24 = arith.ori %21, %23 : i64
%25 = arith.index_castui %24 : i64 to index
%26 = arith.extui %4 : i32 to i64
%27 = arith.extui %5 : i32 to i64
%28 = arith.shli %27, %c32_i64 : i64
%29 = arith.ori %26, %28 : i64
%30 = arith.index_castui %29 : i64 to index
%31 = arith.extui %6 : i32 to i64
%32 = arith.extui %7 : i32 to i64
%33 = arith.shli %32, %c32_i64 : i64
%34 = arith.ori %31, %33 : i64
%35 = arith.index_castui %34 : i64 to index
%36 = arith.extui %8 : i32 to i64
%37 = arith.extui %9 : i32 to i64
%38 = arith.shli %37, %c32_i64 : i64
%39 = arith.ori %36, %38 : i64
%40 = arith.index_castui %39 : i64 to index
%41 = arith.extui %10 : i32 to i64
%42 = arith.extui %11 : i32 to i64
%43 = arith.shli %42, %c32_i64 : i64
%44 = arith.ori %41, %43 : i64
%45 = arith.index_castui %44 : i64 to index
%46 = arith.extui %12 : i32 to i64
%47 = arith.extui %13 : i32 to i64
%48 = arith.shli %47, %c32_i64 : i64
%49 = arith.ori %46, %48 : i64
%50 = arith.index_castui %49 : i64 to index
%51 = arith.extui %14 : i32 to i64
%52 = arith.extui %15 : i32 to i64
%53 = arith.shli %52, %c32_i64 : i64
%54 = arith.ori %51, %53 : i64
%55 = arith.index_castui %54 : i64 to index
%56 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<?x?x16x1xf32, #hal.descriptor_type<storage_buffer>>{%30, %35}
memref.assume_alignment %56, 64 : memref<?x?x16x1xf32, #hal.descriptor_type<storage_buffer>>
%57 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%20) flags(ReadOnly) : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>{%40, %45}
memref.assume_alignment %57, 1 : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%58 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%25) : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>{%50, %55}
memref.assume_alignment %58, 1 : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
scf.for %arg0 = %workgroup_id_y to %30 step %workgroup_count_y {
%subview = memref.subview %56[%arg0, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32, #hal.descriptor_type<storage_buffer>> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
scf.for %arg1 = %workgroup_id_x to %40 step %workgroup_count_x {
%subview_0 = memref.subview %57[%arg1, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_1 = memref.subview %58[%arg0, %arg1, 0, 0] [1, 1, 16, 16] [1, 1, 1, 1] : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
iree_codegen.ukernel.generic "iree_uk_mmt4d" ins(%subview, %subview_0 : memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) outs(%subview_1 : memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) (%c1, %c1, %35, %c16_i32, %c16_i32, %c1_i32, %c1281_i32 : index, index, index, i32, i32, i32, i32) fn_def_attrs {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = ["processor_data"]} strided_outer_dims(1)
}
}
return
}
IR Dump After LowerUKernelOpsToCallslink
// -----// IR Dump After LowerUKernelOpsToCalls (iree-codegen-lower-ukernel-ops-to-calls) //----- //
module {
func.func private @iree_uk_mmt4d(memref<f32>, index, index, memref<f32>, index, index, memref<f32>, index, index, index, index, index, i32, i32, i32, i32) attributes {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = ["processor_data"], llvm.bareptr = true}
func.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32() {
%c1281_i32 = arith.constant 1281 : i32
%c1_i32 = arith.constant 1 : i32
%c16_i32 = arith.constant 16 : i32
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%c32_i64 = arith.constant 32 : i64
%0 = hal.interface.constant.load[0] : i32
%1 = hal.interface.constant.load[1] : i32
%2 = hal.interface.constant.load[2] : i32
%3 = hal.interface.constant.load[3] : i32
%4 = hal.interface.constant.load[4] : i32
%5 = hal.interface.constant.load[5] : i32
%6 = hal.interface.constant.load[6] : i32
%7 = hal.interface.constant.load[7] : i32
%8 = hal.interface.constant.load[8] : i32
%9 = hal.interface.constant.load[9] : i32
%10 = hal.interface.constant.load[10] : i32
%11 = hal.interface.constant.load[11] : i32
%12 = hal.interface.constant.load[12] : i32
%13 = hal.interface.constant.load[13] : i32
%14 = hal.interface.constant.load[14] : i32
%15 = hal.interface.constant.load[15] : i32
%16 = arith.extui %0 : i32 to i64
%17 = arith.extui %1 : i32 to i64
%18 = arith.shli %17, %c32_i64 : i64
%19 = arith.ori %16, %18 : i64
%20 = arith.index_castui %19 : i64 to index
%21 = arith.extui %2 : i32 to i64
%22 = arith.extui %3 : i32 to i64
%23 = arith.shli %22, %c32_i64 : i64
%24 = arith.ori %21, %23 : i64
%25 = arith.index_castui %24 : i64 to index
%26 = arith.extui %4 : i32 to i64
%27 = arith.extui %5 : i32 to i64
%28 = arith.shli %27, %c32_i64 : i64
%29 = arith.ori %26, %28 : i64
%30 = arith.index_castui %29 : i64 to index
%31 = arith.extui %6 : i32 to i64
%32 = arith.extui %7 : i32 to i64
%33 = arith.shli %32, %c32_i64 : i64
%34 = arith.ori %31, %33 : i64
%35 = arith.index_castui %34 : i64 to index
%36 = arith.extui %8 : i32 to i64
%37 = arith.extui %9 : i32 to i64
%38 = arith.shli %37, %c32_i64 : i64
%39 = arith.ori %36, %38 : i64
%40 = arith.index_castui %39 : i64 to index
%41 = arith.extui %10 : i32 to i64
%42 = arith.extui %11 : i32 to i64
%43 = arith.shli %42, %c32_i64 : i64
%44 = arith.ori %41, %43 : i64
%45 = arith.index_castui %44 : i64 to index
%46 = arith.extui %12 : i32 to i64
%47 = arith.extui %13 : i32 to i64
%48 = arith.shli %47, %c32_i64 : i64
%49 = arith.ori %46, %48 : i64
%50 = arith.index_castui %49 : i64 to index
%51 = arith.extui %14 : i32 to i64
%52 = arith.extui %15 : i32 to i64
%53 = arith.shli %52, %c32_i64 : i64
%54 = arith.ori %51, %53 : i64
%55 = arith.index_castui %54 : i64 to index
%56 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<?x?x16x1xf32>{%30, %35}
memref.assume_alignment %56, 64 : memref<?x?x16x1xf32>
%57 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%20) flags(ReadOnly) : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>{%40, %45}
memref.assume_alignment %57, 1 : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>
%58 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%25) : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>>{%50, %55}
memref.assume_alignment %58, 1 : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
scf.for %arg0 = %workgroup_id_y to %30 step %workgroup_count_y {
%subview = memref.subview %56[%arg0, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>
scf.for %arg1 = %workgroup_id_x to %40 step %workgroup_count_x {
%subview_0 = memref.subview %57[%arg1, 0, 0, 0] [1, %35, 16, 1] [1, 1, 1, 1] : memref<?x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>> to memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>>
%subview_1 = memref.subview %58[%arg0, %arg1, 0, 0] [1, 1, 16, 16] [1, 1, 1, 1] : memref<?x?x16x16xf32, strided<[?, 256, 16, 1], offset: ?>> to memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>>
%base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %subview : memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>> -> memref<f32>, index, index, index, index, index, index, index, index, index
%base_buffer_2, %offset_3, %sizes_4:4, %strides_5:4 = memref.extract_strided_metadata %subview_0 : memref<1x?x16x1xf32, strided<[?, 16, 1, 1], offset: ?>> -> memref<f32>, index, index, index, index, index, index, index, index, index
%base_buffer_6, %offset_7, %sizes_8:4, %strides_9:4 = memref.extract_strided_metadata %subview_1 : memref<1x1x16x16xf32, strided<[?, 256, 16, 1], offset: ?>> -> memref<f32>, index, index, index, index, index, index, index, index, index
func.call @iree_uk_mmt4d(%base_buffer, %offset, %strides#0, %base_buffer_2, %offset_3, %strides_5#0, %base_buffer_6, %offset_7, %strides_9#0, %c1, %c1, %35, %c16_i32, %c16_i32, %c1_i32, %c1281_i32) : (memref<f32>, index, index, memref<f32>, index, index, memref<f32>, index, index, index, index, index, i32, i32, i32, i32) -> ()
}
}
return
}
}
IR Dump After ConvertToLLVMlink
// -----// IR Dump After ConvertToLLVM (iree-convert-to-llvm) //----- //
module attributes {llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-unknown-eabi-elf"} {
llvm.func @iree_uk_mmt4d(!llvm.ptr) attributes {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = ["processor_data"], llvm.bareptr = true}
llvm.func @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32(%arg0: !llvm.ptr {llvm.align = 16 : i64, llvm.noalias}, %arg1: !llvm.ptr {llvm.align = 16 : i64, llvm.noalias}, %arg2: !llvm.ptr {llvm.align = 16 : i64, llvm.noalias}) -> i32 {
%0 = llvm.mlir.constant(4293970975 : i64) : i64
%1 = llvm.mlir.constant(8 : i64) : i64
%2 = llvm.mlir.constant(0 : i32) : i32
%3 = llvm.mlir.constant(256 : index) : i64
%4 = llvm.mlir.constant(-1 : index) : i64
%5 = llvm.mlir.constant(4 : index) : i64
%6 = llvm.mlir.constant(16 : index) : i64
%7 = llvm.mlir.constant(0 : index) : i64
%8 = llvm.mlir.constant(1281 : i32) : i32
%9 = llvm.mlir.constant(1 : i32) : i32
%10 = llvm.mlir.constant(16 : i32) : i32
%11 = llvm.mlir.constant(1 : index) : i64
%12 = llvm.mlir.constant(32 : i64) : i64
%13 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%14 = llvm.extractvalue %13[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%15 = llvm.load %14 : !llvm.ptr -> i32
%16 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%17 = llvm.extractvalue %16[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%18 = llvm.getelementptr %17[1] : (!llvm.ptr) -> !llvm.ptr, i32
%19 = llvm.load %18 : !llvm.ptr -> i32
%20 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%21 = llvm.extractvalue %20[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%22 = llvm.getelementptr %21[2] : (!llvm.ptr) -> !llvm.ptr, i32
%23 = llvm.load %22 : !llvm.ptr -> i32
%24 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%25 = llvm.extractvalue %24[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%26 = llvm.getelementptr %25[3] : (!llvm.ptr) -> !llvm.ptr, i32
%27 = llvm.load %26 : !llvm.ptr -> i32
%28 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%29 = llvm.extractvalue %28[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%30 = llvm.getelementptr %29[4] : (!llvm.ptr) -> !llvm.ptr, i32
%31 = llvm.load %30 : !llvm.ptr -> i32
%32 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%33 = llvm.extractvalue %32[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%34 = llvm.getelementptr %33[5] : (!llvm.ptr) -> !llvm.ptr, i32
%35 = llvm.load %34 : !llvm.ptr -> i32
%36 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%37 = llvm.extractvalue %36[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%38 = llvm.getelementptr %37[6] : (!llvm.ptr) -> !llvm.ptr, i32
%39 = llvm.load %38 : !llvm.ptr -> i32
%40 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%41 = llvm.extractvalue %40[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%42 = llvm.getelementptr %41[7] : (!llvm.ptr) -> !llvm.ptr, i32
%43 = llvm.load %42 : !llvm.ptr -> i32
%44 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%45 = llvm.extractvalue %44[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%46 = llvm.getelementptr %45[8] : (!llvm.ptr) -> !llvm.ptr, i32
%47 = llvm.load %46 : !llvm.ptr -> i32
%48 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%49 = llvm.extractvalue %48[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%50 = llvm.getelementptr %49[9] : (!llvm.ptr) -> !llvm.ptr, i32
%51 = llvm.load %50 : !llvm.ptr -> i32
%52 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%53 = llvm.extractvalue %52[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%54 = llvm.getelementptr %53[10] : (!llvm.ptr) -> !llvm.ptr, i32
%55 = llvm.load %54 : !llvm.ptr -> i32
%56 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%57 = llvm.extractvalue %56[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%58 = llvm.getelementptr %57[11] : (!llvm.ptr) -> !llvm.ptr, i32
%59 = llvm.load %58 : !llvm.ptr -> i32
%60 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%61 = llvm.extractvalue %60[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%62 = llvm.getelementptr %61[14] : (!llvm.ptr) -> !llvm.ptr, i32
%63 = llvm.load %62 : !llvm.ptr -> i32
%64 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%65 = llvm.extractvalue %64[9] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%66 = llvm.getelementptr %65[15] : (!llvm.ptr) -> !llvm.ptr, i32
%67 = llvm.load %66 : !llvm.ptr -> i32
%68 = llvm.zext %15 : i32 to i64
%69 = llvm.zext %19 : i32 to i64
%70 = llvm.shl %69, %12 : i64
%71 = llvm.or %68, %70 : i64
%72 = llvm.zext %23 : i32 to i64
%73 = llvm.zext %27 : i32 to i64
%74 = llvm.shl %73, %12 : i64
%75 = llvm.or %72, %74 : i64
%76 = llvm.zext %31 : i32 to i64
%77 = llvm.zext %35 : i32 to i64
%78 = llvm.shl %77, %12 : i64
%79 = llvm.or %76, %78 : i64
%80 = llvm.zext %39 : i32 to i64
%81 = llvm.zext %43 : i32 to i64
%82 = llvm.shl %81, %12 : i64
%83 = llvm.or %80, %82 : i64
%84 = llvm.zext %47 : i32 to i64
%85 = llvm.zext %51 : i32 to i64
%86 = llvm.shl %85, %12 : i64
%87 = llvm.or %84, %86 : i64
%88 = llvm.zext %55 : i32 to i64
%89 = llvm.zext %59 : i32 to i64
%90 = llvm.shl %89, %12 : i64
%91 = llvm.or %88, %90 : i64
%92 = llvm.zext %63 : i32 to i64
%93 = llvm.zext %67 : i32 to i64
%94 = llvm.shl %93, %12 : i64
%95 = llvm.or %92, %94 : i64
%96 = llvm.mul %83, %6 : i64
%97 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%98 = llvm.extractvalue %97[10] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%99 = llvm.load %98 : !llvm.ptr -> !llvm.ptr
%100 = llvm.mul %91, %6 : i64
%101 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%102 = llvm.extractvalue %101[10] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%103 = llvm.load %102 : !llvm.ptr -> !llvm.ptr
%104 = llvm.mul %95, %3 : i64
%105 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%106 = llvm.extractvalue %105[10] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%107 = llvm.getelementptr %106[1] : (!llvm.ptr) -> !llvm.ptr, !llvm.ptr
%108 = llvm.load %107 : !llvm.ptr -> !llvm.ptr
%109 = llvm.load %arg2 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr, i32)>
%110 = llvm.extractvalue %109[0] : !llvm.struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr, i32)>
%111 = llvm.zext %110 : i32 to i64
%112 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%113 = llvm.extractvalue %112[4] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%114 = llvm.zext %113 : i32 to i64
%115 = llvm.load %arg2 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr, i32)>
%116 = llvm.extractvalue %115[1] : !llvm.struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr, i32)>
%117 = llvm.zext %116 : i32 to i64
%118 = llvm.load %arg1 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%119 = llvm.extractvalue %118[5] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr, ptr, ptr)>
%120 = llvm.zext %119 : i32 to i64
llvm.br ^bb1(%117 : i64)
^bb1(%121: i64): // 2 preds: ^bb0, ^bb4
%122 = llvm.icmp "slt" %121, %79 : i64
llvm.cond_br %122, ^bb2(%111 : i64), ^bb5
^bb2(%123: i64): // 2 preds: ^bb1, ^bb3
%124 = llvm.icmp "slt" %123, %87 : i64
llvm.cond_br %124, ^bb3, ^bb4
^bb3: // pred: ^bb2
%125 = llvm.mul %83, %6 : i64
%126 = llvm.mul %121, %125 : i64
%127 = llvm.icmp "slt" %71, %7 : i64
%128 = llvm.sub %4, %71 : i64
%129 = llvm.select %127, %128, %71 : i1, i64
%130 = llvm.sdiv %129, %5 : i64
%131 = llvm.sub %4, %130 : i64
%132 = llvm.select %127, %131, %130 : i1, i64
%133 = llvm.mul %91, %6 : i64
%134 = llvm.mul %123, %133 : i64
%135 = llvm.add %132, %134 : i64
%136 = llvm.mul %123, %3 : i64
%137 = llvm.icmp "slt" %75, %7 : i64
%138 = llvm.sub %4, %75 : i64
%139 = llvm.select %137, %138, %75 : i1, i64
%140 = llvm.sdiv %139, %5 : i64
%141 = llvm.sub %4, %140 : i64
%142 = llvm.select %137, %141, %140 : i1, i64
%143 = llvm.add %136, %142 : i64
%144 = llvm.mul %95, %3 : i64
%145 = llvm.mul %121, %144 : i64
%146 = llvm.add %143, %145 : i64
%147 = llvm.getelementptr inbounds %arg0[4] : (!llvm.ptr) -> !llvm.ptr, !llvm.ptr
%148 = llvm.alloca %1 x i64 {alignment = 8 : i64} : (i64) -> !llvm.ptr
%149 = llvm.load %147 : !llvm.ptr -> i64
%150 = llvm.or %149, %0 : i64
llvm.store %150, %148 : i64, !llvm.ptr
%151 = llvm.getelementptr inbounds %147[1] : (!llvm.ptr) -> !llvm.ptr, i64
%152 = llvm.load %151 : !llvm.ptr -> i64
%153 = llvm.getelementptr inbounds %148[1] : (!llvm.ptr) -> !llvm.ptr, i64
llvm.store %152, %153 : i64, !llvm.ptr
%154 = llvm.getelementptr inbounds %147[2] : (!llvm.ptr) -> !llvm.ptr, i64
%155 = llvm.load %154 : !llvm.ptr -> i64
%156 = llvm.getelementptr inbounds %148[2] : (!llvm.ptr) -> !llvm.ptr, i64
llvm.store %155, %156 : i64, !llvm.ptr
%157 = llvm.getelementptr inbounds %147[3] : (!llvm.ptr) -> !llvm.ptr, i64
%158 = llvm.load %157 : !llvm.ptr -> i64
%159 = llvm.getelementptr inbounds %148[3] : (!llvm.ptr) -> !llvm.ptr, i64
llvm.store %158, %159 : i64, !llvm.ptr
%160 = llvm.getelementptr inbounds %147[4] : (!llvm.ptr) -> !llvm.ptr, i64
%161 = llvm.load %160 : !llvm.ptr -> i64
%162 = llvm.getelementptr inbounds %148[4] : (!llvm.ptr) -> !llvm.ptr, i64
llvm.store %161, %162 : i64, !llvm.ptr
%163 = llvm.getelementptr inbounds %147[5] : (!llvm.ptr) -> !llvm.ptr, i64
%164 = llvm.load %163 : !llvm.ptr -> i64
%165 = llvm.getelementptr inbounds %148[5] : (!llvm.ptr) -> !llvm.ptr, i64
llvm.store %164, %165 : i64, !llvm.ptr
%166 = llvm.getelementptr inbounds %147[6] : (!llvm.ptr) -> !llvm.ptr, i64
%167 = llvm.load %166 : !llvm.ptr -> i64
%168 = llvm.getelementptr inbounds %148[6] : (!llvm.ptr) -> !llvm.ptr, i64
llvm.store %167, %168 : i64, !llvm.ptr
%169 = llvm.getelementptr inbounds %147[7] : (!llvm.ptr) -> !llvm.ptr, i64
%170 = llvm.load %169 : !llvm.ptr -> i64
%171 = llvm.getelementptr inbounds %148[7] : (!llvm.ptr) -> !llvm.ptr, i64
llvm.store %170, %171 : i64, !llvm.ptr
%172 = llvm.alloca %11 x !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)> : (i64) -> !llvm.ptr
%173 = llvm.mlir.undef : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%174 = llvm.insertvalue %99, %173[0] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%175 = llvm.insertvalue %126, %174[1] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%176 = llvm.insertvalue %96, %175[2] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%177 = llvm.insertvalue %103, %176[3] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%178 = llvm.insertvalue %135, %177[4] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%179 = llvm.insertvalue %100, %178[5] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%180 = llvm.insertvalue %108, %179[6] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%181 = llvm.insertvalue %146, %180[7] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%182 = llvm.insertvalue %104, %181[8] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%183 = llvm.insertvalue %11, %182[9] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%184 = llvm.insertvalue %11, %183[10] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%185 = llvm.insertvalue %83, %184[11] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%186 = llvm.insertvalue %10, %185[12] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%187 = llvm.insertvalue %10, %186[13] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%188 = llvm.insertvalue %9, %187[14] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%189 = llvm.insertvalue %8, %188[15] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
%190 = llvm.insertvalue %148, %189[16] : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>
llvm.store %190, %172 : !llvm.struct<(ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr)>, !llvm.ptr
llvm.call @iree_uk_mmt4d(%172) : (!llvm.ptr) -> ()
%191 = llvm.add %123, %114 : i64
llvm.br ^bb2(%191 : i64)
^bb4: // pred: ^bb2
%192 = llvm.add %121, %120 : i64
llvm.br ^bb1(%192 : i64)
^bb5: // pred: ^bb1
llvm.return %2 : i32
}
}
Intermediate file: ...codegen.bc
, disassembled to ...codegen.ll
link
define internal i32 @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32(ptr noalias nonnull align 16 %0, ptr noalias nonnull align 16 %1, ptr noalias nonnull align 16 %2) #0 !dbg !90 {
%4 = load %iree_hal_executable_dispatch_state_v0_t.7, ptr %1, align 8, !dbg !91
%5 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 9, !dbg !91
%6 = load i32, ptr %5, align 4, !dbg !91
%7 = getelementptr i32, ptr %5, i32 1, !dbg !91
%8 = load i32, ptr %7, align 4, !dbg !91
%9 = getelementptr i32, ptr %5, i32 2, !dbg !91
%10 = load i32, ptr %9, align 4, !dbg !91
%11 = getelementptr i32, ptr %5, i32 3, !dbg !91
%12 = load i32, ptr %11, align 4, !dbg !91
%13 = getelementptr i32, ptr %5, i32 4, !dbg !91
%14 = load i32, ptr %13, align 4, !dbg !91
%15 = getelementptr i32, ptr %5, i32 5, !dbg !91
%16 = load i32, ptr %15, align 4, !dbg !91
%17 = getelementptr i32, ptr %5, i32 6, !dbg !91
%18 = load i32, ptr %17, align 4, !dbg !91
%19 = getelementptr i32, ptr %5, i32 7, !dbg !91
%20 = load i32, ptr %19, align 4, !dbg !91
%21 = getelementptr i32, ptr %5, i32 8, !dbg !91
%22 = load i32, ptr %21, align 4, !dbg !91
%23 = getelementptr i32, ptr %5, i32 9, !dbg !91
%24 = load i32, ptr %23, align 4, !dbg !91
%25 = getelementptr i32, ptr %5, i32 10, !dbg !91
%26 = load i32, ptr %25, align 4, !dbg !91
%27 = getelementptr i32, ptr %5, i32 11, !dbg !91
%28 = load i32, ptr %27, align 4, !dbg !91
%29 = getelementptr i32, ptr %5, i32 14, !dbg !91
%30 = load i32, ptr %29, align 4, !dbg !91
%31 = getelementptr i32, ptr %5, i32 15, !dbg !91
%32 = load i32, ptr %31, align 4, !dbg !91
%33 = zext i32 %6 to i64, !dbg !91
%34 = zext i32 %8 to i64, !dbg !91
%35 = shl i64 %34, 32, !dbg !91
%36 = or i64 %33, %35, !dbg !91
%37 = zext i32 %10 to i64, !dbg !91
%38 = zext i32 %12 to i64, !dbg !91
%39 = shl i64 %38, 32, !dbg !91
%40 = or i64 %37, %39, !dbg !91
%41 = zext i32 %14 to i64, !dbg !91
%42 = zext i32 %16 to i64, !dbg !91
%43 = shl i64 %42, 32, !dbg !91
%44 = or i64 %41, %43, !dbg !91
%45 = zext i32 %18 to i64, !dbg !91
%46 = zext i32 %20 to i64, !dbg !91
%47 = shl i64 %46, 32, !dbg !91
%48 = or i64 %45, %47, !dbg !91
%49 = zext i32 %22 to i64, !dbg !91
%50 = zext i32 %24 to i64, !dbg !91
%51 = shl i64 %50, 32, !dbg !91
%52 = or i64 %49, %51, !dbg !91
%53 = zext i32 %26 to i64, !dbg !91
%54 = zext i32 %28 to i64, !dbg !91
%55 = shl i64 %54, 32, !dbg !91
%56 = or i64 %53, %55, !dbg !91
%57 = zext i32 %30 to i64, !dbg !91
%58 = zext i32 %32 to i64, !dbg !91
%59 = shl i64 %58, 32, !dbg !91
%60 = or i64 %57, %59, !dbg !91
%61 = mul i64 %48, 16, !dbg !91
%62 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 10, !dbg !91
%63 = load ptr, ptr %62, align 8, !dbg !91
%64 = mul i64 %56, 16, !dbg !91
%65 = mul i64 %60, 256, !dbg !91
%66 = getelementptr ptr, ptr %62, i32 1, !dbg !91
%67 = load ptr, ptr %66, align 8, !dbg !91
%68 = load %iree_hal_executable_workgroup_state_v0_t.8, ptr %2, align 8, !dbg !91
%69 = extractvalue %iree_hal_executable_workgroup_state_v0_t.8 %68, 0, !dbg !91
%70 = zext i32 %69 to i64, !dbg !91
%71 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 4, !dbg !91
%72 = zext i32 %71 to i64, !dbg !91
%73 = extractvalue %iree_hal_executable_workgroup_state_v0_t.8 %68, 1, !dbg !91
%74 = zext i32 %73 to i64, !dbg !91
%75 = extractvalue %iree_hal_executable_dispatch_state_v0_t.7 %4, 5, !dbg !91
%76 = zext i32 %75 to i64, !dbg !91
br label %77, !dbg !91
77: ; preds = %147, %3
%78 = phi i64 [ %148, %147 ], [ %74, %3 ]
%79 = icmp slt i64 %78, %44, !dbg !91
br i1 %79, label %80, label %149, !dbg !91
80: ; preds = %83, %77
%81 = phi i64 [ %146, %83 ], [ %70, %77 ]
%82 = icmp slt i64 %81, %52, !dbg !91
br i1 %82, label %83, label %147, !dbg !91
83: ; preds = %80
%84 = mul i64 %78, %61, !dbg !91
%85 = icmp slt i64 %36, 0, !dbg !91
%86 = sub i64 -1, %36, !dbg !91
%87 = select i1 %85, i64 %86, i64 %36, !dbg !91
%88 = sdiv i64 %87, 4, !dbg !91
%89 = sub i64 -1, %88, !dbg !91
%90 = select i1 %85, i64 %89, i64 %88, !dbg !91
%91 = mul i64 %81, %64, !dbg !91
%92 = add i64 %90, %91, !dbg !91
%93 = mul i64 %81, 256, !dbg !91
%94 = icmp slt i64 %40, 0, !dbg !91
%95 = sub i64 -1, %40, !dbg !91
%96 = select i1 %94, i64 %95, i64 %40, !dbg !91
%97 = sdiv i64 %96, 4, !dbg !91
%98 = sub i64 -1, %97, !dbg !91
%99 = select i1 %94, i64 %98, i64 %97, !dbg !91
%100 = add i64 %93, %99, !dbg !91
%101 = mul i64 %78, %65, !dbg !91
%102 = add i64 %100, %101, !dbg !91
%103 = getelementptr inbounds ptr, ptr %0, i32 4, !dbg !91
%104 = alloca i64, i64 8, align 8, !dbg !91
%105 = load i64, ptr %103, align 4, !dbg !91
%106 = or i64 %105, 4293970975, !dbg !91
store i64 %106, ptr %104, align 4, !dbg !91
%107 = getelementptr inbounds i64, ptr %103, i32 1, !dbg !91
%108 = load i64, ptr %107, align 4, !dbg !91
%109 = getelementptr inbounds i64, ptr %104, i32 1, !dbg !91
store i64 %108, ptr %109, align 4, !dbg !91
%110 = getelementptr inbounds i64, ptr %103, i32 2, !dbg !91
%111 = load i64, ptr %110, align 4, !dbg !91
%112 = getelementptr inbounds i64, ptr %104, i32 2, !dbg !91
store i64 %111, ptr %112, align 4, !dbg !91
%113 = getelementptr inbounds i64, ptr %103, i32 3, !dbg !91
%114 = load i64, ptr %113, align 4, !dbg !91
%115 = getelementptr inbounds i64, ptr %104, i32 3, !dbg !91
store i64 %114, ptr %115, align 4, !dbg !91
%116 = getelementptr inbounds i64, ptr %103, i32 4, !dbg !91
%117 = load i64, ptr %116, align 4, !dbg !91
%118 = getelementptr inbounds i64, ptr %104, i32 4, !dbg !91
store i64 %117, ptr %118, align 4, !dbg !91
%119 = getelementptr inbounds i64, ptr %103, i32 5, !dbg !91
%120 = load i64, ptr %119, align 4, !dbg !91
%121 = getelementptr inbounds i64, ptr %104, i32 5, !dbg !91
store i64 %120, ptr %121, align 4, !dbg !91
%122 = getelementptr inbounds i64, ptr %103, i32 6, !dbg !91
%123 = load i64, ptr %122, align 4, !dbg !91
%124 = getelementptr inbounds i64, ptr %104, i32 6, !dbg !91
store i64 %123, ptr %124, align 4, !dbg !91
%125 = getelementptr inbounds i64, ptr %103, i32 7, !dbg !91
%126 = load i64, ptr %125, align 4, !dbg !91
%127 = getelementptr inbounds i64, ptr %104, i32 7, !dbg !91
store i64 %126, ptr %127, align 4, !dbg !91
%128 = alloca { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr }, i64 1, align 8, !dbg !91
%129 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } undef, ptr %63, 0, !dbg !91
%130 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %129, i64 %84, 1, !dbg !91
%131 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %130, i64 %61, 2, !dbg !91
%132 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %131, ptr %63, 3, !dbg !91
%133 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %132, i64 %92, 4, !dbg !91
%134 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %133, i64 %64, 5, !dbg !91
%135 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %134, ptr %67, 6, !dbg !91
%136 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %135, i64 %102, 7, !dbg !91
%137 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %136, i64 %65, 8, !dbg !91
%138 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %137, i64 1, 9, !dbg !91
%139 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %138, i64 1, 10, !dbg !91
%140 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %139, i64 %48, 11, !dbg !91
%141 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %140, i32 16, 12, !dbg !91
%142 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %141, i32 16, 13, !dbg !91
%143 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %142, i32 1, 14, !dbg !91
%144 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %143, i32 1281, 15, !dbg !91
%145 = insertvalue { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %144, ptr %104, 16, !dbg !91
store { ptr, i64, i64, ptr, i64, i64, ptr, i64, i64, i64, i64, i64, i32, i32, i32, i32, ptr } %145, ptr %128, align 8, !dbg !91
call void @iree_uk_mmt4d(ptr %128), !dbg !91
%146 = add i64 %81, %72, !dbg !91
br label %80, !dbg !91
147: ; preds = %80
%148 = add i64 %78, %76, !dbg !91
br label %77, !dbg !91
149: ; preds = %77
ret i32 0, !dbg !91
}
Ukernel bitcode: entry pointlink
; Function Attrs: nounwind
define dso_local noundef i32 @iree_uk_mmt4d(ptr noundef %0) local_unnamed_addr #10 {
%2 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 9
%3 = load i64, ptr %2, align 8, !tbaa !1001
%4 = icmp eq i64 %3, 0
br i1 %4, label %133, label %5
5: ; preds = %1
%6 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 10
%7 = load i64, ptr %6, align 8, !tbaa !1002
%8 = icmp eq i64 %7, 0
br i1 %8, label %133, label %9
9: ; preds = %5
%10 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 11
%11 = load i64, ptr %10, align 8, !tbaa !19
%12 = icmp eq i64 %11, 0
br i1 %12, label %13, label %18
13: ; preds = %9
%14 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 15
%15 = load i32, ptr %14, align 4, !tbaa !9
%16 = and i32 %15, 256
%17 = icmp eq i32 %16, 0
br i1 %17, label %18, label %133
18: ; preds = %13, %9
%19 = tail call ptr @iree_uk_mmt4d_select_tile_func(ptr noundef nonnull %0) #14
%20 = load i64, ptr %2, align 8, !tbaa !1001
%21 = trunc i64 %20 to i32
%22 = load i64, ptr %6, align 8, !tbaa !1002
%23 = trunc i64 %22 to i32
%24 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 15
%25 = load i32, ptr %24, align 4, !tbaa !9
%26 = zext i32 %25 to i64
%27 = shl i64 %26, 56
%28 = add i64 %27, -72057594037927936
%29 = ashr exact i64 %28, 56
%30 = getelementptr inbounds [9 x i32], ptr @switch.table.iree_uk_mmt4d, i64 0, i64 %29
%31 = load i32, ptr %30, align 4
%32 = lshr i32 %31, 8
%33 = and i32 %31, 7
%34 = and i32 %32, 7
%35 = and i32 %31, 327680
%36 = add nsw i32 %35, -196608
%37 = lshr exact i32 %36, 16
%38 = zext nneg i32 %37 to i64
%39 = zext nneg i32 %33 to i64
%40 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 3
%41 = load ptr, ptr %40, align 8, !tbaa !1003
%42 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 4
%43 = load i64, ptr %42, align 8, !tbaa !1004
%44 = zext nneg i32 %34 to i64
%45 = shl i64 %43, %44
%46 = sdiv i64 %45, 8
%47 = getelementptr inbounds i8, ptr %41, i64 %46
%48 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 2
%49 = load i64, ptr %48, align 8, !tbaa !1005
%50 = shl i64 %49, %39
%51 = sdiv i64 %50, 8
%52 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 5
%53 = load i64, ptr %52, align 8, !tbaa !1006
%54 = shl i64 %53, %44
%55 = sdiv i64 %54, 8
%56 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 8
%57 = load i64, ptr %56, align 8, !tbaa !1007
%58 = shl i64 %57, %38
%59 = icmp sgt i32 %21, 0
br i1 %59, label %60, label %133
60: ; preds = %18
%61 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 13
%62 = load i32, ptr %61, align 4, !tbaa !996
%63 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 12
%64 = load i32, ptr %63, align 8, !tbaa !1000
%65 = shl i32 %62, 16
%66 = ashr exact i32 %65, 16
%67 = shl i32 %64, 16
%68 = ashr exact i32 %67, 16
%69 = mul nsw i32 %66, %68
%70 = shl i32 %69, %37
%71 = load ptr, ptr %0, align 8, !tbaa !1008
%72 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 1
%73 = load i64, ptr %72, align 8, !tbaa !1009
%74 = shl i64 %73, %39
%75 = sdiv i64 %74, 8
%76 = getelementptr inbounds i8, ptr %71, i64 %75
%77 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 6
%78 = load ptr, ptr %77, align 8, !tbaa !1010
%79 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %0, i64 0, i32 7
%80 = load i64, ptr %79, align 8, !tbaa !1011
%81 = shl i64 %80, %38
%82 = getelementptr inbounds i8, ptr %78, i64 %81
%83 = icmp sgt i32 %23, 0
%84 = sext i32 %70 to i64
br i1 %83, label %90, label %85
85: ; preds = %60
%86 = and i32 %21, 3
%87 = icmp ult i32 %21, 4
br i1 %87, label %121, label %88
88: ; preds = %85
%89 = and i32 %21, 2147483644
br label %107
90: ; preds = %60, %102
%91 = phi i32 [ %105, %102 ], [ 0, %60 ]
%92 = phi ptr [ %103, %102 ], [ %82, %60 ]
%93 = phi ptr [ %104, %102 ], [ %76, %60 ]
tail call void @llvm.prefetch.p0(ptr %92, i32 1, i32 1, i32 1)
tail call void @llvm.prefetch.p0(ptr %93, i32 0, i32 3, i32 1)
tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)
br label %94
94: ; preds = %94, %90
%95 = phi i32 [ 0, %90 ], [ %100, %94 ]
%96 = phi ptr [ %47, %90 ], [ %99, %94 ]
%97 = phi ptr [ %92, %90 ], [ %98, %94 ]
tail call void %19(ptr noundef %97, ptr noundef %93, ptr noundef %96, ptr noundef nonnull %0) #14
%98 = getelementptr inbounds i8, ptr %97, i64 %84
%99 = getelementptr inbounds i8, ptr %96, i64 %55
%100 = add nuw nsw i32 %95, 1
%101 = icmp eq i32 %100, %23
br i1 %101, label %102, label %94, !llvm.loop !1012
102: ; preds = %94
%103 = getelementptr inbounds i8, ptr %92, i64 %58
%104 = getelementptr inbounds i8, ptr %93, i64 %51
%105 = add nuw nsw i32 %91, 1
%106 = icmp eq i32 %105, %21
br i1 %106, label %133, label %90, !llvm.loop !1013
107: ; preds = %107, %88
%108 = phi ptr [ %82, %88 ], [ %117, %107 ]
%109 = phi ptr [ %76, %88 ], [ %118, %107 ]
%110 = phi i32 [ 0, %88 ], [ %119, %107 ]
tail call void @llvm.prefetch.p0(ptr %108, i32 1, i32 1, i32 1)
tail call void @llvm.prefetch.p0(ptr %109, i32 0, i32 3, i32 1)
tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)
%111 = getelementptr inbounds i8, ptr %108, i64 %58
%112 = getelementptr inbounds i8, ptr %109, i64 %51
tail call void @llvm.prefetch.p0(ptr %111, i32 1, i32 1, i32 1)
tail call void @llvm.prefetch.p0(ptr %112, i32 0, i32 3, i32 1)
tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)
%113 = getelementptr inbounds i8, ptr %111, i64 %58
%114 = getelementptr inbounds i8, ptr %112, i64 %51
tail call void @llvm.prefetch.p0(ptr %113, i32 1, i32 1, i32 1)
tail call void @llvm.prefetch.p0(ptr %114, i32 0, i32 3, i32 1)
tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)
%115 = getelementptr inbounds i8, ptr %113, i64 %58
%116 = getelementptr inbounds i8, ptr %114, i64 %51
tail call void @llvm.prefetch.p0(ptr %115, i32 1, i32 1, i32 1)
tail call void @llvm.prefetch.p0(ptr %116, i32 0, i32 3, i32 1)
tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)
%117 = getelementptr inbounds i8, ptr %115, i64 %58
%118 = getelementptr inbounds i8, ptr %116, i64 %51
%119 = add i32 %110, 4
%120 = icmp eq i32 %119, %89
br i1 %120, label %121, label %107, !llvm.loop !1013
121: ; preds = %107, %85
%122 = phi ptr [ %82, %85 ], [ %117, %107 ]
%123 = phi ptr [ %76, %85 ], [ %118, %107 ]
%124 = icmp eq i32 %86, 0
br i1 %124, label %133, label %125
125: ; preds = %121, %125
%126 = phi ptr [ %129, %125 ], [ %122, %121 ]
%127 = phi ptr [ %130, %125 ], [ %123, %121 ]
%128 = phi i32 [ %131, %125 ], [ 0, %121 ]
tail call void @llvm.prefetch.p0(ptr %126, i32 1, i32 1, i32 1)
tail call void @llvm.prefetch.p0(ptr %127, i32 0, i32 3, i32 1)
tail call void @llvm.prefetch.p0(ptr %47, i32 0, i32 3, i32 1)
%129 = getelementptr inbounds i8, ptr %126, i64 %58
%130 = getelementptr inbounds i8, ptr %127, i64 %51
%131 = add i32 %128, 1
%132 = icmp eq i32 %131, %86
br i1 %132, label %133, label %125, !llvm.loop !1014
133: ; preds = %121, %125, %102, %1, %5, %13, %18
ret i32 0
}
Ukernel bitcode: tile functionlink
; Function Attrs: nofree norecurse nosync nounwind memory(read, argmem: readwrite, inaccessiblemem: readwrite)
define dso_local void @iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base(ptr noalias nocapture noundef %0, ptr noalias nocapture noundef readonly %1, ptr noalias nocapture noundef readonly %2, ptr nocapture noundef readonly %3) #4 {
tail call void @llvm.experimental.noalias.scope.decl(metadata !367)
tail call void @llvm.experimental.noalias.scope.decl(metadata !370)
tail call void @llvm.experimental.noalias.scope.decl(metadata !372)
tail call void @llvm.prefetch.p0(ptr %1, i32 0, i32 3, i32 1), !noalias !374
tail call void @llvm.prefetch.p0(ptr %2, i32 0, i32 3, i32 1), !noalias !375
%5 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %3, i64 0, i32 15
%6 = load i32, ptr %5, align 4, !tbaa !9, !noalias !376
%7 = and i32 %6, 256
%8 = icmp eq i32 %7, 0
br i1 %8, label %41, label %9
9: ; preds = %4
%10 = load <16 x float>, ptr %0, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%11 = getelementptr inbounds float, ptr %0, i64 16
%12 = load <16 x float>, ptr %11, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%13 = getelementptr inbounds float, ptr %0, i64 32
%14 = load <16 x float>, ptr %13, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%15 = getelementptr inbounds float, ptr %0, i64 48
%16 = load <16 x float>, ptr %15, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%17 = getelementptr inbounds float, ptr %0, i64 64
%18 = load <16 x float>, ptr %17, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%19 = getelementptr inbounds float, ptr %0, i64 80
%20 = load <16 x float>, ptr %19, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%21 = getelementptr inbounds float, ptr %0, i64 96
%22 = load <16 x float>, ptr %21, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%23 = getelementptr inbounds float, ptr %0, i64 112
%24 = load <16 x float>, ptr %23, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%25 = getelementptr inbounds float, ptr %0, i64 128
%26 = load <16 x float>, ptr %25, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%27 = getelementptr inbounds float, ptr %0, i64 144
%28 = load <16 x float>, ptr %27, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%29 = getelementptr inbounds float, ptr %0, i64 160
%30 = load <16 x float>, ptr %29, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%31 = getelementptr inbounds float, ptr %0, i64 176
%32 = load <16 x float>, ptr %31, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%33 = getelementptr inbounds float, ptr %0, i64 192
%34 = load <16 x float>, ptr %33, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%35 = getelementptr inbounds float, ptr %0, i64 208
%36 = load <16 x float>, ptr %35, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%37 = getelementptr inbounds float, ptr %0, i64 224
%38 = load <16 x float>, ptr %37, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%39 = getelementptr inbounds float, ptr %0, i64 240
%40 = load <16 x float>, ptr %39, align 1, !tbaa !17, !alias.scope !367, !noalias !377
br label %41
41: ; preds = %4, %9
%42 = phi <16 x float> [ %40, %9 ], [ zeroinitializer, %4 ]
%43 = phi <16 x float> [ %38, %9 ], [ zeroinitializer, %4 ]
%44 = phi <16 x float> [ %36, %9 ], [ zeroinitializer, %4 ]
%45 = phi <16 x float> [ %34, %9 ], [ zeroinitializer, %4 ]
%46 = phi <16 x float> [ %32, %9 ], [ zeroinitializer, %4 ]
%47 = phi <16 x float> [ %30, %9 ], [ zeroinitializer, %4 ]
%48 = phi <16 x float> [ %28, %9 ], [ zeroinitializer, %4 ]
%49 = phi <16 x float> [ %26, %9 ], [ zeroinitializer, %4 ]
%50 = phi <16 x float> [ %24, %9 ], [ zeroinitializer, %4 ]
%51 = phi <16 x float> [ %22, %9 ], [ zeroinitializer, %4 ]
%52 = phi <16 x float> [ %20, %9 ], [ zeroinitializer, %4 ]
%53 = phi <16 x float> [ %18, %9 ], [ zeroinitializer, %4 ]
%54 = phi <16 x float> [ %16, %9 ], [ zeroinitializer, %4 ]
%55 = phi <16 x float> [ %14, %9 ], [ zeroinitializer, %4 ]
%56 = phi <16 x float> [ %12, %9 ], [ zeroinitializer, %4 ]
%57 = phi <16 x float> [ %10, %9 ], [ zeroinitializer, %4 ]
%58 = getelementptr inbounds %struct.iree_uk_mmt4d_params_t, ptr %3, i64 0, i32 11
%59 = load i64, ptr %58, align 8, !tbaa !19, !noalias !376
%60 = icmp sgt i64 %59, 0
br i1 %60, label %61, label %167
61: ; preds = %41, %61
%62 = phi <16 x float> [ %161, %61 ], [ %42, %41 ]
%63 = phi <16 x float> [ %156, %61 ], [ %43, %41 ]
%64 = phi <16 x float> [ %151, %61 ], [ %44, %41 ]
%65 = phi <16 x float> [ %146, %61 ], [ %45, %41 ]
%66 = phi <16 x float> [ %141, %61 ], [ %46, %41 ]
%67 = phi <16 x float> [ %136, %61 ], [ %47, %41 ]
%68 = phi <16 x float> [ %131, %61 ], [ %48, %41 ]
%69 = phi <16 x float> [ %126, %61 ], [ %49, %41 ]
%70 = phi <16 x float> [ %121, %61 ], [ %50, %41 ]
%71 = phi <16 x float> [ %116, %61 ], [ %51, %41 ]
%72 = phi <16 x float> [ %111, %61 ], [ %52, %41 ]
%73 = phi <16 x float> [ %106, %61 ], [ %53, %41 ]
%74 = phi <16 x float> [ %101, %61 ], [ %54, %41 ]
%75 = phi <16 x float> [ %96, %61 ], [ %55, %41 ]
%76 = phi <16 x float> [ %91, %61 ], [ %56, %41 ]
%77 = phi <16 x float> [ %86, %61 ], [ %57, %41 ]
%78 = phi i64 [ %165, %61 ], [ 0, %41 ]
%79 = phi ptr [ %164, %61 ], [ %1, %41 ]
%80 = phi ptr [ %162, %61 ], [ %2, %41 ]
%81 = load <16 x float>, ptr %80, align 1, !tbaa !17, !alias.scope !372, !noalias !375
%82 = getelementptr inbounds float, ptr %80, i64 128
tail call void @llvm.prefetch.p0(ptr nonnull %82, i32 0, i32 3, i32 1), !noalias !375
%83 = load float, ptr %79, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%84 = insertelement <16 x float> poison, float %83, i64 0
%85 = shufflevector <16 x float> %84, <16 x float> poison, <16 x i32> zeroinitializer
%86 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %85, <16 x float> %81, <16 x float> %77)
%87 = getelementptr inbounds float, ptr %79, i64 1
%88 = load float, ptr %87, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%89 = insertelement <16 x float> poison, float %88, i64 0
%90 = shufflevector <16 x float> %89, <16 x float> poison, <16 x i32> zeroinitializer
%91 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %90, <16 x float> %81, <16 x float> %76)
%92 = getelementptr inbounds float, ptr %79, i64 2
%93 = load float, ptr %92, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%94 = insertelement <16 x float> poison, float %93, i64 0
%95 = shufflevector <16 x float> %94, <16 x float> poison, <16 x i32> zeroinitializer
%96 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %95, <16 x float> %81, <16 x float> %75)
%97 = getelementptr inbounds float, ptr %79, i64 3
%98 = load float, ptr %97, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%99 = insertelement <16 x float> poison, float %98, i64 0
%100 = shufflevector <16 x float> %99, <16 x float> poison, <16 x i32> zeroinitializer
%101 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %100, <16 x float> %81, <16 x float> %74)
%102 = getelementptr inbounds float, ptr %79, i64 4
%103 = load float, ptr %102, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%104 = insertelement <16 x float> poison, float %103, i64 0
%105 = shufflevector <16 x float> %104, <16 x float> poison, <16 x i32> zeroinitializer
%106 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %105, <16 x float> %81, <16 x float> %73)
%107 = getelementptr inbounds float, ptr %79, i64 5
%108 = load float, ptr %107, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%109 = insertelement <16 x float> poison, float %108, i64 0
%110 = shufflevector <16 x float> %109, <16 x float> poison, <16 x i32> zeroinitializer
%111 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %110, <16 x float> %81, <16 x float> %72)
%112 = getelementptr inbounds float, ptr %79, i64 6
%113 = load float, ptr %112, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%114 = insertelement <16 x float> poison, float %113, i64 0
%115 = shufflevector <16 x float> %114, <16 x float> poison, <16 x i32> zeroinitializer
%116 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %115, <16 x float> %81, <16 x float> %71)
%117 = getelementptr inbounds float, ptr %79, i64 7
%118 = load float, ptr %117, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%119 = insertelement <16 x float> poison, float %118, i64 0
%120 = shufflevector <16 x float> %119, <16 x float> poison, <16 x i32> zeroinitializer
%121 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %120, <16 x float> %81, <16 x float> %70)
%122 = getelementptr inbounds float, ptr %79, i64 8
%123 = load float, ptr %122, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%124 = insertelement <16 x float> poison, float %123, i64 0
%125 = shufflevector <16 x float> %124, <16 x float> poison, <16 x i32> zeroinitializer
%126 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %125, <16 x float> %81, <16 x float> %69)
%127 = getelementptr inbounds float, ptr %79, i64 9
%128 = load float, ptr %127, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%129 = insertelement <16 x float> poison, float %128, i64 0
%130 = shufflevector <16 x float> %129, <16 x float> poison, <16 x i32> zeroinitializer
%131 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %130, <16 x float> %81, <16 x float> %68)
%132 = getelementptr inbounds float, ptr %79, i64 10
%133 = load float, ptr %132, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%134 = insertelement <16 x float> poison, float %133, i64 0
%135 = shufflevector <16 x float> %134, <16 x float> poison, <16 x i32> zeroinitializer
%136 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %135, <16 x float> %81, <16 x float> %67)
%137 = getelementptr inbounds float, ptr %79, i64 11
%138 = load float, ptr %137, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%139 = insertelement <16 x float> poison, float %138, i64 0
%140 = shufflevector <16 x float> %139, <16 x float> poison, <16 x i32> zeroinitializer
%141 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %140, <16 x float> %81, <16 x float> %66)
%142 = getelementptr inbounds float, ptr %79, i64 12
%143 = load float, ptr %142, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%144 = insertelement <16 x float> poison, float %143, i64 0
%145 = shufflevector <16 x float> %144, <16 x float> poison, <16 x i32> zeroinitializer
%146 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %145, <16 x float> %81, <16 x float> %65)
%147 = getelementptr inbounds float, ptr %79, i64 13
%148 = load float, ptr %147, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%149 = insertelement <16 x float> poison, float %148, i64 0
%150 = shufflevector <16 x float> %149, <16 x float> poison, <16 x i32> zeroinitializer
%151 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %150, <16 x float> %81, <16 x float> %64)
%152 = getelementptr inbounds float, ptr %79, i64 14
%153 = load float, ptr %152, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%154 = insertelement <16 x float> poison, float %153, i64 0
%155 = shufflevector <16 x float> %154, <16 x float> poison, <16 x i32> zeroinitializer
%156 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %155, <16 x float> %81, <16 x float> %63)
%157 = getelementptr inbounds float, ptr %79, i64 15
%158 = load float, ptr %157, align 4, !tbaa !331, !alias.scope !370, !noalias !374
%159 = insertelement <16 x float> poison, float %158, i64 0
%160 = shufflevector <16 x float> %159, <16 x float> poison, <16 x i32> zeroinitializer
%161 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %160, <16 x float> %81, <16 x float> %62)
%162 = getelementptr inbounds float, ptr %80, i64 16
%163 = getelementptr inbounds float, ptr %79, i64 128
tail call void @llvm.prefetch.p0(ptr nonnull %163, i32 0, i32 3, i32 1), !noalias !374
%164 = getelementptr inbounds float, ptr %79, i64 16
%165 = add nuw nsw i64 %78, 1
%166 = icmp eq i64 %165, %59
br i1 %166, label %167, label %61, !llvm.loop !333
167: ; preds = %61, %41
%168 = phi <16 x float> [ %42, %41 ], [ %161, %61 ]
%169 = phi <16 x float> [ %43, %41 ], [ %156, %61 ]
%170 = phi <16 x float> [ %44, %41 ], [ %151, %61 ]
%171 = phi <16 x float> [ %45, %41 ], [ %146, %61 ]
%172 = phi <16 x float> [ %46, %41 ], [ %141, %61 ]
%173 = phi <16 x float> [ %47, %41 ], [ %136, %61 ]
%174 = phi <16 x float> [ %48, %41 ], [ %131, %61 ]
%175 = phi <16 x float> [ %49, %41 ], [ %126, %61 ]
%176 = phi <16 x float> [ %50, %41 ], [ %121, %61 ]
%177 = phi <16 x float> [ %51, %41 ], [ %116, %61 ]
%178 = phi <16 x float> [ %52, %41 ], [ %111, %61 ]
%179 = phi <16 x float> [ %53, %41 ], [ %106, %61 ]
%180 = phi <16 x float> [ %54, %41 ], [ %101, %61 ]
%181 = phi <16 x float> [ %55, %41 ], [ %96, %61 ]
%182 = phi <16 x float> [ %56, %41 ], [ %91, %61 ]
%183 = phi <16 x float> [ %57, %41 ], [ %86, %61 ]
store <16 x float> %183, ptr %0, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%184 = getelementptr inbounds float, ptr %0, i64 16
store <16 x float> %182, ptr %184, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%185 = getelementptr inbounds float, ptr %0, i64 32
store <16 x float> %181, ptr %185, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%186 = getelementptr inbounds float, ptr %0, i64 48
store <16 x float> %180, ptr %186, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%187 = getelementptr inbounds float, ptr %0, i64 64
store <16 x float> %179, ptr %187, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%188 = getelementptr inbounds float, ptr %0, i64 80
store <16 x float> %178, ptr %188, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%189 = getelementptr inbounds float, ptr %0, i64 96
store <16 x float> %177, ptr %189, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%190 = getelementptr inbounds float, ptr %0, i64 112
store <16 x float> %176, ptr %190, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%191 = getelementptr inbounds float, ptr %0, i64 128
store <16 x float> %175, ptr %191, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%192 = getelementptr inbounds float, ptr %0, i64 144
store <16 x float> %174, ptr %192, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%193 = getelementptr inbounds float, ptr %0, i64 160
store <16 x float> %173, ptr %193, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%194 = getelementptr inbounds float, ptr %0, i64 176
store <16 x float> %172, ptr %194, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%195 = getelementptr inbounds float, ptr %0, i64 192
store <16 x float> %171, ptr %195, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%196 = getelementptr inbounds float, ptr %0, i64 208
store <16 x float> %170, ptr %196, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%197 = getelementptr inbounds float, ptr %0, i64 224
store <16 x float> %169, ptr %197, align 1, !tbaa !17, !alias.scope !367, !noalias !377
%198 = getelementptr inbounds float, ptr %0, i64 240
store <16 x float> %168, ptr %198, align 1, !tbaa !17, !alias.scope !367, !noalias !377
ret void
}
Intermediate file: ...optimized.bc
, disassembled to ...optimized.ll
link
; Function Attrs: nofree norecurse nosync nounwind
define internal noundef i32 @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32(ptr noalias nocapture nonnull readonly align 16 %0, ptr noalias nocapture nonnull readonly align 16 %1, ptr noalias nocapture nonnull readonly align 16 %2) #1 !dbg !90 {
%.elt7 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 4, !dbg !91
%.unpack8 = load i32, ptr %.elt7, align 4, !dbg !91
%.elt9 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 5, !dbg !91
%.unpack10 = load i32, ptr %.elt9, align 16, !dbg !91
%.elt17 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 9, !dbg !91
%.unpack18 = load ptr, ptr %.elt17, align 8, !dbg !91
%.elt19 = getelementptr inbounds %iree_hal_executable_dispatch_state_v0_t.19, ptr %1, i64 0, i32 10, !dbg !91
%.unpack20 = load ptr, ptr %.elt19, align 16, !dbg !91
%4 = getelementptr i32, ptr %.unpack18, i64 4, !dbg !91
%5 = load i64, ptr %4, align 4, !dbg !91
%6 = getelementptr i32, ptr %.unpack18, i64 6, !dbg !91
%7 = load i32, ptr %6, align 4, !dbg !91
%8 = getelementptr i32, ptr %.unpack18, i64 7, !dbg !91
%9 = load i32, ptr %8, align 4, !dbg !91
%10 = getelementptr i32, ptr %.unpack18, i64 8, !dbg !91
%11 = load i64, ptr %10, align 4, !dbg !91
%12 = getelementptr i32, ptr %.unpack18, i64 10, !dbg !91
%13 = load i64, ptr %12, align 4, !dbg !91
%14 = shl i64 %13, 4, !dbg !91
%15 = getelementptr i32, ptr %.unpack18, i64 14, !dbg !91
%16 = load i64, ptr %15, align 4, !dbg !91
%17 = shl i64 %16, 8, !dbg !91
%18 = zext i32 %7 to i64, !dbg !91
%19 = zext i32 %9 to i64, !dbg !91
%20 = shl nuw i64 %19, 32, !dbg !91
%21 = or disjoint i64 %20, %18, !dbg !91
%22 = load ptr, ptr %.unpack20, align 8, !dbg !91
%23 = getelementptr ptr, ptr %.unpack20, i64 1, !dbg !91
%24 = load ptr, ptr %23, align 8, !dbg !91
%25 = load %iree_hal_executable_workgroup_state_v0_t.20, ptr %2, align 16, !dbg !91
%26 = extractvalue %iree_hal_executable_workgroup_state_v0_t.20 %25, 0, !dbg !91
%27 = zext i32 %26 to i64, !dbg !91
%28 = zext i32 %.unpack8 to i64, !dbg !91
%29 = extractvalue %iree_hal_executable_workgroup_state_v0_t.20 %25, 1, !dbg !91
%30 = zext i32 %29 to i64, !dbg !91
%31 = zext i32 %.unpack10 to i64, !dbg !91
%32 = icmp sgt i64 %5, %30, !dbg !91
br i1 %32, label %.preheader.lr.ph, label %._crit_edge58, !dbg !91
.preheader.lr.ph: ; preds = %3
%33 = getelementptr i32, ptr %.unpack18, i64 3, !dbg !91
%34 = load i32, ptr %33, align 4, !dbg !91
%35 = zext i32 %34 to i64, !dbg !91
%36 = shl nuw i64 %35, 32, !dbg !91
%37 = getelementptr i32, ptr %.unpack18, i64 2, !dbg !91
%38 = load i32, ptr %37, align 4, !dbg !91
%39 = zext i32 %38 to i64, !dbg !91
%40 = or disjoint i64 %36, %39, !dbg !91
%41 = getelementptr i32, ptr %.unpack18, i64 1, !dbg !91
%42 = load i32, ptr %41, align 4, !dbg !91
%43 = zext i32 %42 to i64, !dbg !91
%44 = shl nuw i64 %43, 32, !dbg !91
%45 = load i32, ptr %.unpack18, align 4, !dbg !91
%46 = zext i32 %45 to i64, !dbg !91
%47 = or disjoint i64 %44, %46, !dbg !91
%48 = icmp sgt i64 %11, %27
%.lobit = ashr i64 %44, 63
%49 = xor i64 %47, %.lobit
%50 = sdiv i64 %49, 4
%51 = xor i64 %50, %.lobit
%.lobit24 = ashr i64 %36, 63
%52 = xor i64 %40, %.lobit24
%53 = sdiv i64 %52, 4
%54 = xor i64 %53, %.lobit24
%55 = icmp eq i64 %21, 0
%56 = shl i64 %21, 9
%57 = icmp sgt i64 %21, 0
br label %.preheader, !dbg !91
.preheader: ; preds = %._crit_edge, %.preheader.lr.ph
%58 = phi i64 [ %30, %.preheader.lr.ph ], [ %228, %._crit_edge ]
br i1 %48, label %.lr.ph, label %._crit_edge, !dbg !91
.lr.ph: ; preds = %.preheader
%59 = mul i64 %17, %58
%60 = add i64 %59, %54
%61 = mul i64 %56, %58
%62 = ashr exact i64 %61, 3
%63 = getelementptr inbounds i8, ptr %22, i64 %62
%64 = shl i64 %60, 2
%invariant.gep = getelementptr i8, ptr %24, i64 %64, !dbg !91
br label %65, !dbg !91
65: ; preds = %iree_uk_mmt4d.exit, %.lr.ph
%66 = phi i64 [ %27, %.lr.ph ], [ %226, %iree_uk_mmt4d.exit ]
br i1 %55, label %iree_uk_mmt4d.exit, label %67, !dbg !91
67: ; preds = %65
%68 = mul i64 %14, %66, !dbg !91
%69 = add i64 %68, %51, !dbg !91
%70 = shl i64 %69, 5, !dbg !91
%71 = ashr exact i64 %70, 3, !dbg !91
%72 = getelementptr inbounds i8, ptr %22, i64 %71, !dbg !91
%73 = shl i64 %66, 10, !dbg !91
%gep = getelementptr i8, ptr %invariant.gep, i64 %73, !dbg !91
tail call void @llvm.prefetch.p0(ptr %gep, i32 1, i32 1, i32 1), !dbg !91
tail call void @llvm.prefetch.p0(ptr %63, i32 0, i32 3, i32 1), !dbg !91
tail call void @llvm.prefetch.p0(ptr %72, i32 0, i32 3, i32 1), !dbg !91
tail call void @llvm.experimental.noalias.scope.decl(metadata !92), !dbg !91
tail call void @llvm.experimental.noalias.scope.decl(metadata !95), !dbg !91
tail call void @llvm.experimental.noalias.scope.decl(metadata !97), !dbg !91
tail call void @llvm.experimental.noalias.scope.decl(metadata !99), !dbg !91
tail call void @llvm.experimental.noalias.scope.decl(metadata !102), !dbg !91
tail call void @llvm.experimental.noalias.scope.decl(metadata !104), !dbg !91
tail call void @llvm.prefetch.p0(ptr %63, i32 0, i32 3, i32 1), !dbg !91, !noalias !106
tail call void @llvm.prefetch.p0(ptr %72, i32 0, i32 3, i32 1), !dbg !91, !noalias !107
%74 = load <16 x float>, ptr %gep, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%75 = getelementptr inbounds float, ptr %gep, i64 16, !dbg !91
%76 = load <16 x float>, ptr %75, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%77 = getelementptr inbounds float, ptr %gep, i64 32, !dbg !91
%78 = load <16 x float>, ptr %77, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%79 = getelementptr inbounds float, ptr %gep, i64 48, !dbg !91
%80 = load <16 x float>, ptr %79, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%81 = getelementptr inbounds float, ptr %gep, i64 64, !dbg !91
%82 = load <16 x float>, ptr %81, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%83 = getelementptr inbounds float, ptr %gep, i64 80, !dbg !91
%84 = load <16 x float>, ptr %83, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%85 = getelementptr inbounds float, ptr %gep, i64 96, !dbg !91
%86 = load <16 x float>, ptr %85, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%87 = getelementptr inbounds float, ptr %gep, i64 112, !dbg !91
%88 = load <16 x float>, ptr %87, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%89 = getelementptr inbounds float, ptr %gep, i64 128, !dbg !91
%90 = load <16 x float>, ptr %89, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%91 = getelementptr inbounds float, ptr %gep, i64 144, !dbg !91
%92 = load <16 x float>, ptr %91, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%93 = getelementptr inbounds float, ptr %gep, i64 160, !dbg !91
%94 = load <16 x float>, ptr %93, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%95 = getelementptr inbounds float, ptr %gep, i64 176, !dbg !91
%96 = load <16 x float>, ptr %95, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%97 = getelementptr inbounds float, ptr %gep, i64 192, !dbg !91
%98 = load <16 x float>, ptr %97, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%99 = getelementptr inbounds float, ptr %gep, i64 208, !dbg !91
%100 = load <16 x float>, ptr %99, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%101 = getelementptr inbounds float, ptr %gep, i64 224, !dbg !91
%102 = load <16 x float>, ptr %101, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
%103 = getelementptr inbounds float, ptr %gep, i64 240, !dbg !91
%104 = load <16 x float>, ptr %103, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
br i1 %57, label %.preheader.i, label %iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit, !dbg !91
.preheader.i: ; preds = %.preheader.i, %67
%105 = phi <16 x float> [ %204, %.preheader.i ], [ %104, %67 ], !dbg !91
%106 = phi <16 x float> [ %199, %.preheader.i ], [ %102, %67 ], !dbg !91
%107 = phi <16 x float> [ %194, %.preheader.i ], [ %100, %67 ], !dbg !91
%108 = phi <16 x float> [ %189, %.preheader.i ], [ %98, %67 ], !dbg !91
%109 = phi <16 x float> [ %184, %.preheader.i ], [ %96, %67 ], !dbg !91
%110 = phi <16 x float> [ %179, %.preheader.i ], [ %94, %67 ], !dbg !91
%111 = phi <16 x float> [ %174, %.preheader.i ], [ %92, %67 ], !dbg !91
%112 = phi <16 x float> [ %169, %.preheader.i ], [ %90, %67 ], !dbg !91
%113 = phi <16 x float> [ %164, %.preheader.i ], [ %88, %67 ], !dbg !91
%114 = phi <16 x float> [ %159, %.preheader.i ], [ %86, %67 ], !dbg !91
%115 = phi <16 x float> [ %154, %.preheader.i ], [ %84, %67 ], !dbg !91
%116 = phi <16 x float> [ %149, %.preheader.i ], [ %82, %67 ], !dbg !91
%117 = phi <16 x float> [ %144, %.preheader.i ], [ %80, %67 ], !dbg !91
%118 = phi <16 x float> [ %139, %.preheader.i ], [ %78, %67 ], !dbg !91
%119 = phi <16 x float> [ %134, %.preheader.i ], [ %76, %67 ], !dbg !91
%120 = phi <16 x float> [ %129, %.preheader.i ], [ %74, %67 ], !dbg !91
%121 = phi i64 [ %208, %.preheader.i ], [ 0, %67 ], !dbg !91
%122 = phi ptr [ %207, %.preheader.i ], [ %63, %67 ], !dbg !91
%123 = phi ptr [ %205, %.preheader.i ], [ %72, %67 ], !dbg !91
%124 = load <16 x float>, ptr %123, align 1, !dbg !91, !tbaa !108, !alias.scope !113, !noalias !107
%125 = getelementptr inbounds float, ptr %123, i64 128, !dbg !91
tail call void @llvm.prefetch.p0(ptr nonnull %125, i32 0, i32 3, i32 1), !dbg !91, !noalias !107
%126 = load float, ptr %122, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%127 = insertelement <16 x float> poison, float %126, i64 0, !dbg !91
%128 = shufflevector <16 x float> %127, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%129 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %128, <16 x float> %124, <16 x float> %120), !dbg !91
%130 = getelementptr inbounds float, ptr %122, i64 1, !dbg !91
%131 = load float, ptr %130, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%132 = insertelement <16 x float> poison, float %131, i64 0, !dbg !91
%133 = shufflevector <16 x float> %132, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%134 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %133, <16 x float> %124, <16 x float> %119), !dbg !91
%135 = getelementptr inbounds float, ptr %122, i64 2, !dbg !91
%136 = load float, ptr %135, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%137 = insertelement <16 x float> poison, float %136, i64 0, !dbg !91
%138 = shufflevector <16 x float> %137, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%139 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %138, <16 x float> %124, <16 x float> %118), !dbg !91
%140 = getelementptr inbounds float, ptr %122, i64 3, !dbg !91
%141 = load float, ptr %140, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%142 = insertelement <16 x float> poison, float %141, i64 0, !dbg !91
%143 = shufflevector <16 x float> %142, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%144 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %143, <16 x float> %124, <16 x float> %117), !dbg !91
%145 = getelementptr inbounds float, ptr %122, i64 4, !dbg !91
%146 = load float, ptr %145, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%147 = insertelement <16 x float> poison, float %146, i64 0, !dbg !91
%148 = shufflevector <16 x float> %147, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%149 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %148, <16 x float> %124, <16 x float> %116), !dbg !91
%150 = getelementptr inbounds float, ptr %122, i64 5, !dbg !91
%151 = load float, ptr %150, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%152 = insertelement <16 x float> poison, float %151, i64 0, !dbg !91
%153 = shufflevector <16 x float> %152, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%154 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %153, <16 x float> %124, <16 x float> %115), !dbg !91
%155 = getelementptr inbounds float, ptr %122, i64 6, !dbg !91
%156 = load float, ptr %155, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%157 = insertelement <16 x float> poison, float %156, i64 0, !dbg !91
%158 = shufflevector <16 x float> %157, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%159 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %158, <16 x float> %124, <16 x float> %114), !dbg !91
%160 = getelementptr inbounds float, ptr %122, i64 7, !dbg !91
%161 = load float, ptr %160, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%162 = insertelement <16 x float> poison, float %161, i64 0, !dbg !91
%163 = shufflevector <16 x float> %162, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%164 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %163, <16 x float> %124, <16 x float> %113), !dbg !91
%165 = getelementptr inbounds float, ptr %122, i64 8, !dbg !91
%166 = load float, ptr %165, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%167 = insertelement <16 x float> poison, float %166, i64 0, !dbg !91
%168 = shufflevector <16 x float> %167, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%169 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %168, <16 x float> %124, <16 x float> %112), !dbg !91
%170 = getelementptr inbounds float, ptr %122, i64 9, !dbg !91
%171 = load float, ptr %170, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%172 = insertelement <16 x float> poison, float %171, i64 0, !dbg !91
%173 = shufflevector <16 x float> %172, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%174 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %173, <16 x float> %124, <16 x float> %111), !dbg !91
%175 = getelementptr inbounds float, ptr %122, i64 10, !dbg !91
%176 = load float, ptr %175, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%177 = insertelement <16 x float> poison, float %176, i64 0, !dbg !91
%178 = shufflevector <16 x float> %177, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%179 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %178, <16 x float> %124, <16 x float> %110), !dbg !91
%180 = getelementptr inbounds float, ptr %122, i64 11, !dbg !91
%181 = load float, ptr %180, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%182 = insertelement <16 x float> poison, float %181, i64 0, !dbg !91
%183 = shufflevector <16 x float> %182, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%184 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %183, <16 x float> %124, <16 x float> %109), !dbg !91
%185 = getelementptr inbounds float, ptr %122, i64 12, !dbg !91
%186 = load float, ptr %185, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%187 = insertelement <16 x float> poison, float %186, i64 0, !dbg !91
%188 = shufflevector <16 x float> %187, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%189 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %188, <16 x float> %124, <16 x float> %108), !dbg !91
%190 = getelementptr inbounds float, ptr %122, i64 13, !dbg !91
%191 = load float, ptr %190, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%192 = insertelement <16 x float> poison, float %191, i64 0, !dbg !91
%193 = shufflevector <16 x float> %192, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%194 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %193, <16 x float> %124, <16 x float> %107), !dbg !91
%195 = getelementptr inbounds float, ptr %122, i64 14, !dbg !91
%196 = load float, ptr %195, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%197 = insertelement <16 x float> poison, float %196, i64 0, !dbg !91
%198 = shufflevector <16 x float> %197, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%199 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %198, <16 x float> %124, <16 x float> %106), !dbg !91
%200 = getelementptr inbounds float, ptr %122, i64 15, !dbg !91
%201 = load float, ptr %200, align 4, !dbg !91, !tbaa !114, !alias.scope !116, !noalias !106
%202 = insertelement <16 x float> poison, float %201, i64 0, !dbg !91
%203 = shufflevector <16 x float> %202, <16 x float> poison, <16 x i32> zeroinitializer, !dbg !91
%204 = tail call <16 x float> @llvm.fma.v16f32(<16 x float> %203, <16 x float> %124, <16 x float> %105), !dbg !91
%205 = getelementptr inbounds float, ptr %123, i64 16, !dbg !91
%206 = getelementptr inbounds float, ptr %122, i64 128, !dbg !91
tail call void @llvm.prefetch.p0(ptr nonnull %206, i32 0, i32 3, i32 1), !dbg !91, !noalias !106
%207 = getelementptr inbounds float, ptr %122, i64 16, !dbg !91
%208 = add nuw nsw i64 %121, 1, !dbg !91
%209 = icmp eq i64 %208, %21, !dbg !91
br i1 %209, label %iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit, label %.preheader.i, !dbg !91, !llvm.loop !117
iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit: ; preds = %.preheader.i, %67
%210 = phi <16 x float> [ %104, %67 ], [ %204, %.preheader.i ], !dbg !91
%211 = phi <16 x float> [ %102, %67 ], [ %199, %.preheader.i ], !dbg !91
%212 = phi <16 x float> [ %100, %67 ], [ %194, %.preheader.i ], !dbg !91
%213 = phi <16 x float> [ %98, %67 ], [ %189, %.preheader.i ], !dbg !91
%214 = phi <16 x float> [ %96, %67 ], [ %184, %.preheader.i ], !dbg !91
%215 = phi <16 x float> [ %94, %67 ], [ %179, %.preheader.i ], !dbg !91
%216 = phi <16 x float> [ %92, %67 ], [ %174, %.preheader.i ], !dbg !91
%217 = phi <16 x float> [ %90, %67 ], [ %169, %.preheader.i ], !dbg !91
%218 = phi <16 x float> [ %88, %67 ], [ %164, %.preheader.i ], !dbg !91
%219 = phi <16 x float> [ %86, %67 ], [ %159, %.preheader.i ], !dbg !91
%220 = phi <16 x float> [ %84, %67 ], [ %154, %.preheader.i ], !dbg !91
%221 = phi <16 x float> [ %82, %67 ], [ %149, %.preheader.i ], !dbg !91
%222 = phi <16 x float> [ %80, %67 ], [ %144, %.preheader.i ], !dbg !91
%223 = phi <16 x float> [ %78, %67 ], [ %139, %.preheader.i ], !dbg !91
%224 = phi <16 x float> [ %76, %67 ], [ %134, %.preheader.i ], !dbg !91
%225 = phi <16 x float> [ %74, %67 ], [ %129, %.preheader.i ], !dbg !91
store <16 x float> %225, ptr %gep, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %224, ptr %75, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %223, ptr %77, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %222, ptr %79, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %221, ptr %81, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %220, ptr %83, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %219, ptr %85, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %218, ptr %87, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %217, ptr %89, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %216, ptr %91, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %215, ptr %93, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %214, ptr %95, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %213, ptr %97, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %212, ptr %99, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %211, ptr %101, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
store <16 x float> %210, ptr %103, align 1, !dbg !91, !tbaa !108, !alias.scope !111, !noalias !112
br label %iree_uk_mmt4d.exit, !dbg !91
iree_uk_mmt4d.exit: ; preds = %iree_uk_mmt4d_tile_f32f32f32_16x16x1_x86_64_avx512_base.exit, %65
%226 = add i64 %66, %28, !dbg !91
%227 = icmp slt i64 %226, %11, !dbg !91
br i1 %227, label %65, label %._crit_edge, !dbg !91
._crit_edge: ; preds = %iree_uk_mmt4d.exit, %.preheader
%228 = add i64 %58, %31, !dbg !91
%229 = icmp slt i64 %228, %5, !dbg !91
br i1 %229, label %.preheader, label %._crit_edge58, !dbg !91
._crit_edge58: ; preds = %._crit_edge, %3
ret i32 0, !dbg !91
}
x86 assemblylink
.section .text.matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32,"ax",@progbits
.p2align 4, 0x90
.type matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32,@function
matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32:
.Lfunc_begin3:
.loc 1 1 0 is_stmt 1
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
.Ltmp6:
push r15
push r14
push r13
push r12
push rbx
.cfi_offset rbx, -56
.cfi_offset r12, -48
.cfi_offset r13, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
.loc 1 1 1 prologue_end
mov rcx, qword ptr [rsi + 24]
mov edi, dword ptr [rdx + 4]
mov rax, qword ptr [rcx + 16]
mov qword ptr [rbp - 48], rdi
mov qword ptr [rbp - 112], rax
cmp rax, rdi
jle .LBB3_11
mov eax, dword ptr [rsi + 16]
mov edi, dword ptr [rsi + 12]
mov r12, qword ptr [rsi + 32]
mov rsi, qword ptr [rcx + 40]
mov r9, qword ptr [rcx + 56]
mov ebx, dword ptr [rcx + 4]
mov r10d, dword ptr [rcx]
mov r11, qword ptr [rcx + 24]
mov r14, qword ptr [rcx + 32]
mov r8, rsi
shl r8, 4
mov qword ptr [rbp - 104], rax
shl r9, 8
mov rax, qword ptr [r12 + 8]
shl rbx, 32
mov qword ptr [rbp - 128], r8
mov r8d, dword ptr [rcx + 12]
mov qword ptr [rbp - 96], r9
mov r9d, dword ptr [rcx + 8]
or r10, rbx
sar rbx, 63
xor r10, rbx
lea r15, [r10 + 3]
mov qword ptr [rbp - 80], rax
mov eax, dword ptr [rdx]
shl r8, 32
or r9, r8
test r10, r10
cmovns r15, r10
sar r8, 63
sar r15, 2
xor r9, r8
xor r15, rbx
lea rcx, [r9 + 3]
test r9, r9
mov qword ptr [rbp - 56], rax
cmovns rcx, r9
imul rax, rsi
mov r9, qword ptr [r12]
imul rsi, rdi
mov qword ptr [rbp - 120], r15
sar rcx, 2
xor rcx, r8
shl rax, 6
mov qword ptr [rbp - 88], rcx
mov rcx, r11
shl rcx, 9
shl rsi, 6
lea rax, [rax + 4*r15]
mov qword ptr [rbp - 72], rcx
mov qword ptr [rbp - 64], rax
jmp .LBB3_2
.p2align 4, 0x90
.LBB3_10:
.loc 1 0 1 is_stmt 0
mov rax, qword ptr [rbp - 48]
.loc 1 1 1
add rax, qword ptr [rbp - 104]
mov qword ptr [rbp - 48], rax
cmp rax, qword ptr [rbp - 112]
jge .LBB3_11
.LBB3_2:
.loc 1 0 1
cmp r14, qword ptr [rbp - 56]
.loc 1 1 1
jle .LBB3_10
.loc 1 0 1
mov rax, qword ptr [rbp - 96]
mov rcx, qword ptr [rbp - 48]
mov r10, qword ptr [rbp - 72]
mov rdx, qword ptr [rbp - 80]
mov r8, qword ptr [rbp - 64]
imul rax, rcx
add rax, qword ptr [rbp - 88]
imul r10, rcx
sar r10, 3
lea r13, [r9 + r10]
.loc 1 1 1
lea r15, [rdx + 4*rax]
mov rax, qword ptr [rbp - 56]
jmp .LBB3_4
.p2align 4, 0x90
.LBB3_8:
add rdx, r15
vmovups zmmword ptr [rdx], zmm15
vmovups zmmword ptr [rdx + 64], zmm14
vmovups zmmword ptr [rdx + 128], zmm13
vmovups zmmword ptr [rdx + 192], zmm12
vmovups zmmword ptr [rdx + 256], zmm11
vmovups zmmword ptr [rdx + 320], zmm10
vmovups zmmword ptr [rdx + 384], zmm9
vmovups zmmword ptr [rdx + 448], zmm8
vmovups zmmword ptr [rdx + 512], zmm7
vmovups zmmword ptr [rdx + 576], zmm6
vmovups zmmword ptr [rdx + 640], zmm5
vmovups zmmword ptr [rdx + 704], zmm4
vmovups zmmword ptr [rdx + 768], zmm3
vmovups zmmword ptr [rdx + 832], zmm2
vmovups zmmword ptr [rdx + 896], zmm1
vmovups zmmword ptr [rdx + 960], zmm0
.LBB3_9:
add rax, rdi
add r8, rsi
cmp rax, r14
jge .LBB3_10
.LBB3_4:
.loc 1 0 1
test r11, r11
.loc 1 1 1
je .LBB3_9
.loc 1 0 1
mov rcx, qword ptr [rbp - 128]
.loc 1 1 1
mov rdx, rax
shl rdx, 10
prefetchw byte ptr [r15 + rdx]
prefetcht0 byte ptr [r13]
imul rcx, rax
add rcx, qword ptr [rbp - 120]
shl rcx, 5
sar rcx, 3
prefetcht0 byte ptr [r9 + rcx]
prefetcht0 byte ptr [r13]
prefetcht0 byte ptr [r9 + rcx]
vmovups zmm15, zmmword ptr [r15 + rdx]
vmovups zmm14, zmmword ptr [r15 + rdx + 64]
vmovups zmm13, zmmword ptr [r15 + rdx + 128]
vmovups zmm12, zmmword ptr [r15 + rdx + 192]
vmovups zmm11, zmmword ptr [r15 + rdx + 256]
vmovups zmm10, zmmword ptr [r15 + rdx + 320]
vmovups zmm9, zmmword ptr [r15 + rdx + 384]
vmovups zmm8, zmmword ptr [r15 + rdx + 448]
vmovups zmm7, zmmword ptr [r15 + rdx + 512]
vmovups zmm6, zmmword ptr [r15 + rdx + 576]
vmovups zmm5, zmmword ptr [r15 + rdx + 640]
vmovups zmm4, zmmword ptr [r15 + rdx + 704]
vmovups zmm3, zmmword ptr [r15 + rdx + 768]
vmovups zmm2, zmmword ptr [r15 + rdx + 832]
vmovups zmm1, zmmword ptr [r15 + rdx + 896]
vmovups zmm0, zmmword ptr [r15 + rdx + 960]
test r11, r11
jle .LBB3_8
.loc 1 0 1
lea rcx, [8*r8]
mov r12, r9
mov rbx, r11
sar rcx, 3
add rcx, 512
.p2align 4, 0x90
.LBB3_7:
.loc 1 1 1
vmovups zmm16, zmmword ptr [r12 + rcx - 512]
prefetcht0 byte ptr [r12 + rcx]
vfmadd231ps zmm15, zmm16, dword ptr [r12 + r10]{1to16}
vfmadd231ps zmm14, zmm16, dword ptr [r12 + r10 + 4]{1to16}
vfmadd231ps zmm13, zmm16, dword ptr [r12 + r10 + 8]{1to16}
vfmadd231ps zmm12, zmm16, dword ptr [r12 + r10 + 12]{1to16}
vfmadd231ps zmm11, zmm16, dword ptr [r12 + r10 + 16]{1to16}
vfmadd231ps zmm10, zmm16, dword ptr [r12 + r10 + 20]{1to16}
vfmadd231ps zmm9, zmm16, dword ptr [r12 + r10 + 24]{1to16}
vfmadd231ps zmm8, zmm16, dword ptr [r12 + r10 + 28]{1to16}
vfmadd231ps zmm7, zmm16, dword ptr [r12 + r10 + 32]{1to16}
vfmadd231ps zmm6, zmm16, dword ptr [r12 + r10 + 36]{1to16}
vfmadd231ps zmm5, zmm16, dword ptr [r12 + r10 + 40]{1to16}
vfmadd231ps zmm4, zmm16, dword ptr [r12 + r10 + 44]{1to16}
vfmadd231ps zmm3, zmm16, dword ptr [r12 + r10 + 48]{1to16}
vfmadd231ps zmm2, zmm16, dword ptr [r12 + r10 + 52]{1to16}
vfmadd231ps zmm1, zmm16, dword ptr [r12 + r10 + 56]{1to16}
vfmadd231ps zmm0, zmm16, dword ptr [r12 + r10 + 60]{1to16}
prefetcht0 byte ptr [r12 + r10 + 512]
add r12, 64
dec rbx
jne .LBB3_7
jmp .LBB3_8
.LBB3_11:
xor eax, eax
.loc 1 1 1 epilogue_begin
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
.cfi_def_cfa rsp, 8
vzeroupper
ret
.Ltmp7:
.Lfunc_end3:
.size matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32, .Lfunc_end3-matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x1_f32
.cfi_endproc