# Tuning
This page documents support for IREE dispatch tuning. The compiler supports both default and user-provided tuning specs (specifications) that override the compiler heuristics guiding dispatch code generation. In our experience, tuning specs can provide meaningful speedups in model execution; for example, we achieved a ~10% improvement on the Stable Diffusion XL (SDXL) model on the MI300X GPU.
## Tuning specs
The default specs are shipped with the IREE compiler and are target-specific. We aim to provide default tuning specs that cover the most in-demand hardware and dispatches from the most popular ML models, although we do not guarantee completeness.
User-provided tuning specs are a mechanism that allows users to get the best performance on custom models and hardware targets without having to modify the compiler source code or needlessly special-case compiler heuristics.
The dispatch tuner that generates tuning specs is currently experimental and hosted in an external repo. This document describes how to work with tuning specs generated by the SHARK Tuner or produced manually, but it does not go into detail on how to generate these specs.
## Flags
The use of tuning specs in `iree-compile` is controlled with the following flags:

- `--iree-codegen-enable-default-tuning-specs`: enables or disables the default tuning specs shipped with the compiler.
- `--iree-codegen-tuning-spec-path`: loads a user-specified tuning spec.
- `--iree-codegen-dump-tuning-specs-to`: dumps final tuning specs to a directory or standard output.
Note that both default and user-provided specs can be enabled at the same time. The compiler will link them together and invoke the user-provided spec before attempting the default one.
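For example, a combined invocation might look like the sketch below, where the input, output, and spec file names and the dump directory are placeholders, and any target-selection flags your build requires are omitted:

```shell
# Enable the default specs, load a user-provided spec, and dump the
# final (linked) tuning specs for inspection.
iree-compile model.mlir -o model.vmfb \
  --iree-codegen-enable-default-tuning-specs=true \
  --iree-codegen-tuning-spec-path=my_tuning_spec.mlir \
  --iree-codegen-dump-tuning-specs-to=/tmp/tuning-specs
```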
## Anatomy of a tuning spec
### Example
```mlir
module @my_spec attributes { transform.with_named_sequence, iree_codegen.tuning_spec_with_default_entrypoint } {
transform.named_sequence @apply_op_config(%op: !transform.any_op {transform.readonly},
                                          %config: !transform.any_param {transform.readonly}) {
  transform.annotate %op "compilation_info" = %config : !transform.any_op, !transform.any_param
  transform.yield
}
transform.named_sequence
@match_mmt_f16_f16_f32(%root: !transform.any_op {transform.readonly}) -> !transform.any_op {
  transform.match.operation_name %root ["linalg.generic"] : !transform.any_op
  %ins, %outs = transform.iree.match.cast_compatible_dag_from_root %root {
    ^bb0(%lhs: tensor<?x?xf16>, %rhs: tensor<?x?xf16>, %out: tensor<?x?xf32>):
    %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                                          affine_map<(d0, d1, d2) -> (d1, d2)>,
                                          affine_map<(d0, d1, d2) -> (d0, d1)>],
                          iterator_types = ["parallel", "parallel", "reduction"]}
        ins(%lhs, %rhs : tensor<?x?xf16>, tensor<?x?xf16>) outs(%out : tensor<?x?xf32>) {
      ^bb0(%in: f16, %in_0: f16, %acc: f32):
        %8 = arith.extf %in : f16 to f32
        %9 = arith.extf %in_0 : f16 to f32
        %10 = arith.mulf %8, %9 : f32
        %11 = arith.addf %acc, %10 : f32
        linalg.yield %11 : f32
      } -> tensor<?x?xf32>
  } : (!transform.any_op) -> (!transform.any_value, !transform.any_value)
  transform.yield %root : !transform.any_op
}
transform.named_sequence
@match_mmt_2048x1280x5120_f16_f16_f32(%matmul: !transform.any_op {transform.readonly})
  -> (!transform.any_op, !transform.any_param) {
  %mmt = transform.include @match_mmt_f16_f16_f32 failures(propagate) (%matmul)
    : (!transform.any_op) -> !transform.any_op
  %lhs = transform.get_operand %matmul[0] : (!transform.any_op) -> !transform.any_value
  %rhs = transform.get_operand %matmul[1] : (!transform.any_op) -> !transform.any_value
  transform.iree.match.cast_compatible_type %lhs = tensor<2048x5120xf16> : !transform.any_value
  transform.iree.match.cast_compatible_type %rhs = tensor<1280x5120xf16> : !transform.any_value
  %config = transform.param.constant #iree_codegen.compilation_info<
    lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1],
                                                 mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
                                                 subgroup_m_count = 2, subgroup_n_count = 2,
                                                 reduction = [0, 0, 64],
                                                 workgroup = [64, 128, 0]}>,
    translation_info = #iree_codegen.translation_info<pipeline = LLVMGPUVectorDistribute
      workgroup_size = [256, 1, 1] subgroup_size = 64,
      {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_shared_memory = true>}>
  > -> !transform.any_param
  transform.yield %matmul, %config : !transform.any_op, !transform.any_param
}
transform.named_sequence
@__kernel_config(%variant_op: !transform.any_op {transform.consumed}) -> !transform.any_op
  attributes { iree_codegen.tuning_spec_entrypoint } {
  %res = transform.foreach_match in %variant_op
    @match_mmt_2048x1280x5120_f16_f16_f32 -> @apply_op_config
    : (!transform.any_op) -> !transform.any_op
  transform.yield %res : !transform.any_op
}
}
```
### Explanation
Tuning specs are transform dialect libraries that conform to the following format:
- All tuning spec entry points (named sequence ops) are marked with the
  `iree_codegen.tuning_spec_entrypoint` attribute. They have a single argument
  of type `!transform.any_op` and return a single value of type
  `!transform.any_op`.
- All entry points in the final tuning specs must either read
  (`transform.readonly`) or consume (`transform.consumed`) the argument.
- The `iree_codegen.tuning_spec_with_default_entrypoint` attribute ensures
  that the tuning spec includes a named sequence op named `__kernel_config`,
  which must contain exactly one `foreach_match` op. That `foreach_match` op
  must have exactly one argument and one result of type `!transform.any_op`.
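Putting these requirements together, the minimal structural sketch below is one way to satisfy them. The matcher `@match_example`, the `linalg.matmul` op name it checks, and the `unit` placeholder parameter are illustrative assumptions; a real spec would match the target dispatch and yield a `#iree_codegen.compilation_info` value, as in the example above.

```mlir
module @minimal_spec attributes { transform.with_named_sequence,
                                  iree_codegen.tuning_spec_with_default_entrypoint } {
  // Shared action: attach the given configuration to the matched op.
  transform.named_sequence @apply_op_config(%op: !transform.any_op {transform.readonly},
                                            %config: !transform.any_param {transform.readonly}) {
    transform.annotate %op "compilation_info" = %config : !transform.any_op, !transform.any_param
    transform.yield
  }

  // Placeholder matcher: a real matcher would also verify shapes and element
  // types, and would yield a #iree_codegen.compilation_info parameter
  // instead of unit.
  transform.named_sequence @match_example(%root: !transform.any_op {transform.readonly})
      -> (!transform.any_op, !transform.any_param) {
    transform.match.operation_name %root ["linalg.matmul"] : !transform.any_op
    %config = transform.param.constant unit -> !transform.any_param
    transform.yield %root, %config : !transform.any_op, !transform.any_param
  }

  // Required entry point: exactly one foreach_match op, with one
  // !transform.any_op argument and one !transform.any_op result.
  transform.named_sequence @__kernel_config(%variant_op: !transform.any_op {transform.consumed})
      -> !transform.any_op attributes { iree_codegen.tuning_spec_entrypoint } {
    %res = transform.foreach_match in %variant_op
        @match_example -> @apply_op_config
      : (!transform.any_op) -> !transform.any_op
    transform.yield %res : !transform.any_op
  }
}
```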
The tuning spec above attempts to match `linalg.generic` ops that correspond to
a matmul with a transposed RHS operand (a.k.a. mmt) of shape `2048x1280x5120`,
with `f16` operand element types and an `f32` result element type.
If the match succeeds, the tuning spec applies the `compilation_info` attribute
that drives code generation. This attribute is considered a compiler
implementation detail; in general, each codegen pipeline has its own
requirements as to what constitutes valid compilation info and how it is
interpreted.
Tuning specs are executed by the `Materialize User Configs` pass.