# Tuning
This page documents support for IREE dispatch tuning. The compiler supports both default and user-provided tuning specs (specifications) that override the compiler heuristics guiding dispatch code generation. In our experience, tuning specs can provide meaningful speedups in model execution; for example, we achieved a ~10% improvement on the Stable Diffusion XL (SDXL) model on the MI300X GPU.
## Tuning specs
The default specs are shipped with the IREE compiler and are target-specific. We aim to provide default tuning specs that cover the most in-demand hardware and dispatches from the most popular ML models, although we do not guarantee completeness.
User-provided tuning specs are a mechanism that allows users to get the best performance on custom models and hardware targets without having to modify the compiler source code or needlessly special-case compiler heuristics.
The dispatch tuner that generates tuning specs is currently experimental and hosted in an external repo. This document describes how to work with tuning specs generated by the SHARK Tuner or produced manually, but it does not go into detail on how to generate these specs.
## Flags
The use of tuning specs in `iree-compile` is controlled with the following flags:

- `--iree-codegen-enable-default-tuning-specs`: enables or disables the default tuning specs shipped with the compiler.
- `--iree-codegen-tuning-spec-path`: loads a user-specified tuning spec.
- `--iree-codegen-dump-tuning-specs-to`: dumps final tuning specs to a directory or standard output.
Note that both default and user-provided specs can be enabled at the same time. The compiler will link them together and invoke the user-provided spec before attempting the default one.
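For example, a combined invocation might look like the sketch below, where the input, output, and spec file names and the dump directory are placeholders, and any target-selection flags your build requires are omitted:

```shell
# Enable the default specs, load a user-provided spec, and dump the
# final (linked) tuning specs for inspection.
iree-compile model.mlir -o model.vmfb \
  --iree-codegen-enable-default-tuning-specs=true \
  --iree-codegen-tuning-spec-path=my_tuning_spec.mlir \
  --iree-codegen-dump-tuning-specs-to=/tmp/tuning-specs
```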
## Anatomy of a tuning spec
### Example
```mlir
module @my_spec attributes { transform.with_named_sequence, iree_codegen.tuning_spec_with_default_entrypoint } {
transform.named_sequence @apply_op_config(%op: !transform.any_op {transform.readonly},
                                          %config: !transform.any_param {transform.readonly}) {
  transform.annotate %op "compilation_info" = %config : !transform.any_op, !transform.any_param
  transform.yield
}
transform.named_sequence
@match_mmt_f16_f16_f32(%root: !transform.any_op {transform.readonly}) -> !transform.any_op {
  transform.match.operation_name %root ["linalg.generic"] : !transform.any_op
  %ins, %outs = transform.iree.match.cast_compatible_dag_from_root %root {
    ^bb0(%lhs: tensor<?x?xf16>, %rhs: tensor<?x?xf16>, %out: tensor<?x?xf32>):
    %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                                          affine_map<(d0, d1, d2) -> (d1, d2)>,
                                          affine_map<(d0, d1, d2) -> (d0, d1)>],
                          iterator_types = ["parallel", "parallel", "reduction"]}
        ins(%lhs, %rhs : tensor<?x?xf16>, tensor<?x?xf16>) outs(%out : tensor<?x?xf32>) {
      ^bb0(%in: f16, %in_0: f16, %acc: f32):
        %8 = arith.extf %in : f16 to f32
        %9 = arith.extf %in_0 : f16 to f32
        %10 = arith.mulf %8, %9 : f32
        %11 = arith.addf %acc, %10 : f32
        linalg.yield %11 : f32
      } -> tensor<?x?xf32>
  } : (!transform.any_op) -> (!transform.any_value, !transform.any_value)
  transform.yield %root : !transform.any_op
}
transform.named_sequence
@match_mmt_2048x1280x5120_f16_f16_f32(%matmul: !transform.any_op {transform.readonly})
  -> (!transform.any_op, !transform.any_param) {
  %mmt = transform.include @match_mmt_f16_f16_f32 failures(propagate) (%matmul)
    : (!transform.any_op) -> !transform.any_op
  %lhs = transform.get_operand %matmul[0] : (!transform.any_op) -> !transform.any_value
  %rhs = transform.get_operand %matmul[1] : (!transform.any_op) -> !transform.any_value
  transform.iree.match.cast_compatible_type %lhs = tensor<2048x5120xf16> : !transform.any_value
  transform.iree.match.cast_compatible_type %rhs = tensor<1280x5120xf16> : !transform.any_value
  %config = transform.param.constant #iree_codegen.compilation_info<
    lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1],
                                                 mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
                                                 subgroup_m_count = 2, subgroup_n_count = 2,
                                                 reduction = [0, 0, 64],
                                                 workgroup = [64, 128, 0]}>,
    translation_info = #iree_codegen.translation_info<pipeline = LLVMGPUVectorDistribute
      workgroup_size = [256, 1, 1] subgroup_size = 64,
      {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_shared_memory = true>}>
  > -> !transform.any_param
  transform.yield %matmul, %config : !transform.any_op, !transform.any_param
}
transform.named_sequence
@__kernel_config(%variant_op: !transform.any_op {transform.consumed}) -> !transform.any_op
  attributes { iree_codegen.tuning_spec_entrypoint } {
  %res = transform.foreach_match in %variant_op
    @match_mmt_2048x1280x5120_f16_f16_f32 -> @apply_op_config
    : (!transform.any_op) -> !transform.any_op
  transform.yield %res : !transform.any_op
}
}
```
### Explanation
Tuning specs are transform dialect libraries that conform to the following format:
- All tuning spec entry points (named sequence ops) are marked with the
  `iree_codegen.tuning_spec_entrypoint` attribute. They have a single argument
  of type `!transform.any_op` and return a single value of type
  `!transform.any_op`.
- All entry points in the final tuning specs must either read
  (`transform.readonly`) or consume (`transform.consumed`) the argument.
- The `iree_codegen.tuning_spec_with_default_entrypoint` attribute ensures
  that the tuning spec includes a named sequence op named `__kernel_config`,
  which must contain exactly one `foreach_match` op. That `foreach_match` op
  must have exactly one argument and one result of type `!transform.any_op`.
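Putting these requirements together, the minimal structural sketch below is one way to satisfy them. The matcher `@match_example`, the `linalg.matmul` op name it checks, and the `unit` placeholder parameter are illustrative assumptions; a real spec would match the target dispatch and yield a `#iree_codegen.compilation_info` value, as in the example above.

```mlir
module @minimal_spec attributes { transform.with_named_sequence,
                                  iree_codegen.tuning_spec_with_default_entrypoint } {
  // Shared action: attach the given configuration to the matched op.
  transform.named_sequence @apply_op_config(%op: !transform.any_op {transform.readonly},
                                            %config: !transform.any_param {transform.readonly}) {
    transform.annotate %op "compilation_info" = %config : !transform.any_op, !transform.any_param
    transform.yield
  }

  // Placeholder matcher: a real matcher would also verify shapes and element
  // types, and would yield a #iree_codegen.compilation_info parameter
  // instead of unit.
  transform.named_sequence @match_example(%root: !transform.any_op {transform.readonly})
      -> (!transform.any_op, !transform.any_param) {
    transform.match.operation_name %root ["linalg.matmul"] : !transform.any_op
    %config = transform.param.constant unit -> !transform.any_param
    transform.yield %root, %config : !transform.any_op, !transform.any_param
  }

  // Required entry point: exactly one foreach_match op, with one
  // !transform.any_op argument and one !transform.any_op result.
  transform.named_sequence @__kernel_config(%variant_op: !transform.any_op {transform.consumed})
      -> !transform.any_op attributes { iree_codegen.tuning_spec_entrypoint } {
    %res = transform.foreach_match in %variant_op
        @match_example -> @apply_op_config
      : (!transform.any_op) -> !transform.any_op
    transform.yield %res : !transform.any_op
  }
}
```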
The tuning spec above attempts to match `linalg.generic` ops that correspond to
a matmul with a transposed RHS operand (a.k.a. mmt) of shape `2048x1280x5120`,
with `f16` operand element types and an `f32` result element type.
If the match succeeds, the tuning spec applies the `compilation_info` attribute
that drives code generation. This attribute is considered a compiler
implementation detail; in general, each codegen pipeline has its own
requirements as to what constitutes valid compilation info and how it is
interpreted.
Tuning specs are executed by the `Materialize User Configs` pass.