Common/GPU

`-iree-codegen-expand-gpu-ops`

Expands high-level GPU ops, such as clustered gpu.subgroup_reduce.

`-iree-codegen-gpu-apply-tiling-level`

Pass to tile tensor ops based on tiling configs

Options

-tiling-level      : Tiling level to tile. Supported levels are 'reduction' and 'thread'
-allow-zero-slices : Allow pad fusion to generate zero size slices

`-iree-codegen-gpu-bubble-resource-casts`

Bubbles iree_gpu.buffer_resource_cast ops upwards.

`-iree-codegen-gpu-check-resource-usage`

Checks GPU specific resource usage constraints like shared memory limits

`-iree-codegen-gpu-combine-value-barriers`

Combines iree_gpu.value_barrier ops

`-iree-codegen-gpu-create-fast-slow-path`

Create separate fast and slow paths to handle padding

`-iree-codegen-gpu-decompose-horizontally-fused-gemms`

Decomposes a horizontally fused GEMM back into its constituent GEMMs

`-iree-codegen-gpu-distribute`

Pass to distribute scf.forall ops using upstream patterns.

`-iree-codegen-gpu-distribute-copy-using-forall`

Pass to distribute copies to threads.

`-iree-codegen-gpu-distribute-forall`

Pass to distribute scf.forall ops.

`-iree-codegen-gpu-distribute-scf-for`

Distribute tiled loop nests to invocations

Options

-use-block-dims : Use gpu.block_dim ops to query distribution sizes.

`-iree-codegen-gpu-distribute-shared-memory-copy`

Pass to distribute shared memory copies to threads.

`-iree-codegen-gpu-fuse-and-hoist-parallel-loops`

Greedily fuses and hoists parallel loops.

`-iree-codegen-gpu-generalize-named-ops`

Convert named Linalg ops to linalg.generic ops

`-iree-codegen-gpu-greedily-distribute-to-threads`

Greedily distributes all remaining tilable ops to threads

`-iree-codegen-gpu-infer-memory-space`

Pass to infer and set the memory space for all alloc_tensor ops.

`-iree-codegen-gpu-lower-to-ukernels`

Lower suitable ops to previously-selected microkernels

`-iree-codegen-gpu-multi-buffering`

Pass to do multi buffering.

Options

-num-buffers : Number of buffers to use.
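
The general idea behind multi-buffering can be illustrated outside the compiler. The sketch below is a minimal Python model, not the pass's implementation: it assumes a loop whose temporary buffer is replaced by several copies that are rotated round-robin, so that work on one iteration's buffer can overlap with filling the next one. The function names `buffer_slot` and `slots_for_loop` are hypothetical; only `num_buffers` mirrors the option above.

```python
def buffer_slot(iteration: int, num_buffers: int) -> int:
    """Return which buffer copy a loop iteration uses under round-robin rotation.

    Hypothetical helper for illustration; not an IREE API.
    """
    if num_buffers < 1:
        raise ValueError("num_buffers must be at least 1")
    return iteration % num_buffers


def slots_for_loop(trip_count: int, num_buffers: int) -> list[int]:
    """List the buffer copy used by each iteration of a loop with `trip_count` iterations."""
    return [buffer_slot(i, num_buffers) for i in range(trip_count)]
```

With two buffers, consecutive iterations alternate between copy 0 and copy 1, so iteration `i + 1` never writes the copy that iteration `i` is still reading.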

`-iree-codegen-gpu-pack-to-intrinsics`

Packs matmul like operations and converts to iree_gpu.multi_mma

`-iree-codegen-gpu-pad-operands`

Pass to pad operands of ops with padding configuration provided.

`-iree-codegen-gpu-pipelining`

Pass to do software pipelining.

Options

-epilogue-peeling    : Try to use an unpeeled epilogue when false, a peeled epilogue otherwise.
-pipeline-depth      : Number of stages
-schedule-index      : Allows picking different schedule for the pipelining transformation.
-transform-file-name : Optional filename containing a transform dialect specification to apply. If left empty, the IR is assumed to contain one top-level transform dialect operation somewhere in the module.
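
As a rough illustration of what software pipelining does to a loop, the following Python sketch builds the issue order for a loop whose body is a load followed by a compute. It is a conceptual model under stated assumptions, not the pass's scheduling algorithm: `pipelined_schedule` and its `lookahead` parameter are hypothetical names, and `lookahead` (how many loads are issued ahead of the matching compute) is not the same thing as the `pipeline-depth` option.

```python
def pipelined_schedule(trip_count: int, lookahead: int) -> list[list[str]]:
    """Return the ops issued at each step of a load/compute loop pipelined by `lookahead`.

    Hypothetical helper for illustration; not an IREE API.
    """
    if trip_count < 0 or lookahead < 0:
        raise ValueError("trip_count and lookahead must be non-negative")
    # A loop shorter than the lookahead distance can only prefetch its own iterations.
    lookahead = min(lookahead, trip_count)

    # Prologue: issue the first `lookahead` loads before any compute runs.
    schedule = [[f"load {i}"] for i in range(lookahead)]

    # Steady state: each step loads a future iteration and computes the current one.
    for i in range(trip_count - lookahead):
        schedule.append([f"load {i + lookahead}", f"compute {i}"])

    # Epilogue: drain the iterations whose loads were already issued.
    for i in range(trip_count - lookahead, trip_count):
        schedule.append([f"compute {i}"])
    return schedule
```

In this model the final loop corresponds to a peeled epilogue: the drain steps are emitted as separate steps after the steady-state loop rather than being folded into it.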

`-iree-codegen-gpu-promote-matmul-operands`

Pass to insert copies with a different thread configuration on matmul operands

`-iree-codegen-gpu-reduce-bank-conflicts`

Pass to try to reduce the number of bank conflicts by padding memref.alloc ops.

Options

-padding-bits : Padding size (in bits) to introduce between rows.
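
Why padding between rows helps can be shown with a small Python model. The sketch assumes shared memory organized as 32 banks of 32-bit words, which is common on GPUs but is an assumption here, and it assumes each thread `r` reads the first element of row `r`. The names `bank_of` and `max_conflict_degree` are hypothetical, and the model says nothing about how the pass itself chooses where to pad.

```python
from collections import Counter

NUM_BANKS = 32        # assumption: 32 shared-memory banks
BANK_WIDTH_BITS = 32  # assumption: each bank serves one 32-bit word per access


def bank_of(row: int, col: int, row_bits: int, padding_bits: int, elem_bits: int = 32) -> int:
    """Return the bank holding element (row, col) of a row-major 2-D buffer."""
    stride_bits = row_bits + padding_bits  # padding widens the distance between rows
    bit_offset = row * stride_bits + col * elem_bits
    return (bit_offset // BANK_WIDTH_BITS) % NUM_BANKS


def max_conflict_degree(num_rows: int, row_bits: int, padding_bits: int) -> int:
    """Worst-case number of threads hitting one bank when thread r reads element (r, 0)."""
    counts = Counter(bank_of(r, 0, row_bits, padding_bits) for r in range(num_rows))
    return max(counts.values())
```

For rows of 1024 bits, `max_conflict_degree(32, 1024, 0)` is 32 because every row starts in bank 0, while adding 32 bits of padding makes the row stride 33 words and spreads the 32 accesses over 32 distinct banks.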

`-iree-codegen-gpu-reuse-shared-memory-allocs`

Pass to reuse shared memory allocations with no overlapping liveness.
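
Reuse based on non-overlapping liveness can be sketched as interval-based slot assignment. The Python below is a simplified greedy model, not the pass's algorithm: it assumes each allocation has a known size and a live range expressed as first and last use positions, and the `Alloc` and `assign_slots` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alloc:
    name: str
    size: int       # bytes
    first_use: int  # position of the first use in program order
    last_use: int   # position of the last use in program order


def assign_slots(allocs: list[Alloc]) -> tuple[dict[str, int], int]:
    """Greedily share slots between allocations whose live ranges do not overlap.

    Returns the slot index chosen for each allocation and the total bytes needed.
    """
    slots: list[list[int]] = []  # each slot is [size, busy_until]
    assignment: dict[str, int] = {}
    for alloc in sorted(allocs, key=lambda a: a.first_use):
        for index, slot in enumerate(slots):
            # A slot is reusable once its previous occupant is dead.
            if slot[1] < alloc.first_use:
                slot[0] = max(slot[0], alloc.size)
                slot[1] = alloc.last_use
                assignment[alloc.name] = index
                break
        else:
            slots.append([alloc.size, alloc.last_use])
            assignment[alloc.name] = len(slots) - 1
    return assignment, sum(slot[0] for slot in slots)
```

For example, allocations `A` (1024 bytes, live 0 to 3), `B` (2048 bytes, live 1 to 5), and `C` (512 bytes, live 4 to 8) need 3584 bytes when kept separate, but `C` can take over the slot of `A`, which brings the total down to 3072 bytes.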

`-iree-codegen-gpu-tensor-alloc`

Pass to create allocations for some tensor values to use GPU shared memory

`-iree-codegen-gpu-tensor-tile`

Pass to tile tensor (linalg) ops within a GPU workgroup

Options

-distribute-to-subgroup : Distribute the workloads to subgroup if true, otherwise distribute to threads.

`-iree-codegen-gpu-tensor-tile-to-serial-loops`

Pass to tile reduction dimensions for certain GPU ops

Options

-coalesce-loops : Collapse the generated loops into a single loop

`-iree-codegen-gpu-tile`

Tile Linalg ops with tensor semantics to invocations

`-iree-codegen-gpu-tile-reduction`

Pass to tile linalg reduction dimensions.

`-iree-codegen-gpu-vector-alloc`

Pass to create allocations for contraction inputs to copy to GPU shared memory

`-iree-codegen-gpu-verify-distribution`

Pass to verify writes before resolving distributed contexts.

`-iree-codegen-reorder-workgroups`

Reorder workgroup ids for better cache reuse

Options

-strategy : Workgroup reordering strategy, one of: '' (none), 'transpose'
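
One way to picture a transposing reorder is as a remapping of 2-D workgroup ids. The Python sketch below linearizes an id in row-major order and reinterprets it in column-major order, so that consecutively launched workgroups walk down a column of the grid instead of along a row. This is an assumption about what a 'transpose' remap can look like, not a description taken from the pass, and `transpose_workgroup_id` is a hypothetical name.

```python
def transpose_workgroup_id(x: int, y: int, count_x: int, count_y: int) -> tuple[int, int]:
    """Remap workgroup id (x, y) in a count_x by count_y grid to a transposed traversal."""
    if not (0 <= x < count_x and 0 <= y < count_y):
        raise ValueError("workgroup id is outside the grid")
    linear = y * count_x + x  # row-major launch order
    # Reinterpret the launch order so that the y coordinate varies fastest.
    return linear // count_y, linear % count_y
```

The remap is a bijection on the grid, so every tile is still processed exactly once; only the order in which tiles are visited changes.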

`-iree-codegen-vector-reduction-to-gpu`

Convert vector reduction to GPU ops.
