Stream

`-iree-stream-annotate-affinities`

Annotates affinities on all ops for debugging.

`-iree-stream-annotate-dispatch-arguments`

Annotates dispatch arguments with potential values derived from dispatch sites.

Uses data flow analysis to identify potential value sets and alignments (or divisibility) for dispatch operands and bindings. Upon successful analysis the dispatch executables are annotated such that further lowering in codegen has the analysis results locally without needing to inspect the entire program.

Operands are annotated with stream.values and/or stream.alignment attributes indicating all known constant values at all dispatch sites and/or their divisibility. stream.values is only added when only statically-known values are passed and stream.alignment is added in cases where some minimum divisibility is identified even if the values are dynamic (such as all values passed in going through util.align or arith.muli prior).
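The rule above can be illustrated with a toy model. This is a minimal sketch, not the pass's implementation: the function name `annotate_operand`, the `("dynamic", divisor)` encoding of dynamic values, and the dictionary output are all hypothetical, and the real pass derives this information through data flow analysis over the whole program.

```python
from functools import reduce
from math import gcd

def annotate_operand(site_values):
    """Model the annotations for one dispatch operand.

    site_values has one entry per dispatch site (hypothetical encoding):
      - an int for a statically known constant, or
      - ("dynamic", divisor) for a dynamic value known to be a multiple of
        divisor (use 1 when nothing is known).
    """
    all_static = all(isinstance(v, int) for v in site_values)
    divisors = [abs(v) if isinstance(v, int) else v[1] for v in site_values]
    # The common divisibility is the gcd across every dispatch site.
    alignment = reduce(gcd, divisors, 0)
    attrs = {}
    if all_static:
        # Only added when every site passes a statically known value.
        attrs["stream.values"] = sorted(set(site_values))
    if alignment > 1:
        # Added even for dynamic values when a minimum divisibility is known.
        attrs["stream.alignment"] = alignment
    return attrs

print(annotate_operand([16, 32, 48]))
print(annotate_operand([64, ("dynamic", 16)]))
```

In this model, three constant sites yield both attributes, while mixing a constant with a dynamic value that is a known multiple of 16 yields only the alignment.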

Bindings are annotated with stream.alignment attributes indicating their base alignment prior to the offset specified on the binding op itself. Note that just because the base alignment is some value does not mean the offset is always known to be aligned in the same way.

`-iree-stream-annotate-dispatch-assumptions`

Adds util.assume.* ops to executables from all dispatch sites.

Uses dataflow analysis to determine integer range and divisibility, propagating that as util.assume.int ops within the executable with an assumption row for each dispatch site. This effectively transports the per-dispatch level analyses to the executable so that the backend can act on it as it sees fit.
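The idea of one assumption row per dispatch site can be sketched as follows. This is an illustrative model only: `AssumptionRow`, its field names, and the input shape (a list of possible non-negative values per site) are assumptions made for the example and are not taken from the pass.

```python
from functools import reduce
from math import gcd
from typing import NamedTuple

class AssumptionRow(NamedTuple):
    """Hypothetical record of what is known about one operand at one site."""
    umin: int  # smallest value observed at the site
    umax: int  # largest value observed at the site
    udiv: int  # largest divisor common to all values at the site

def assumption_rows(values_per_site):
    """Build one row per dispatch site from its possible operand values."""
    rows = []
    for values in values_per_site:
        # gcd of all values; fall back to 1 when every value is zero.
        divisor = reduce(gcd, values, 0) or 1
        rows.append(AssumptionRow(min(values), max(values), divisor))
    return rows

for row in assumption_rows([[128, 256], [48]]):
    print(row)
```

Unlike a single merged annotation, keeping a row per site preserves the distinction between sites, so a backend could choose to specialize on it.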

Note that this pass largely replaces the AnnotateDispatchArgumentsPass above and can eventually subsume it entirely. However, as the mechanism is new and needs to be phased in, both exist in parallel for the moment.

`-iree-stream-clone-to-consumers`

Clones operations that opt in to consumer affinities.

Performs whole-program analysis to identify operations that are used on multiple affinities and can be cloned per affinity. The StreamableOp interface's preferCloneToConsumers query is used, and any op implementing the interface may opt in to the cloning.

`-iree-stream-conversion`

Converts from flow and other input dialects into the stream dialect.

Converts supported input dialects (flow, tensor, util, and various upstream dialects like cf/scf) into the stream dialect and adds additional metadata. After conversion all supported operations will act on !stream.resource<*> types and track resource storage sizes symbolically.

Though the conversion requires that the program be in an implicitly synchronized form (SSA use-def chains on immutable tensor-like objects), limited support is available for a subset of the hal dialect ops that are used on the program ABI boundary for interoperating with external buffers and fences. These ops, such as hal.tensor.import and hal.tensor.barrier, will be converted to their stream dialect form and preserve the implicit synchronization guarantees required for proper analysis.

Dispatched executables are allowed to be in one of the supported input dialects (like flow.executable), already be lowered into stream.executable ops, or be the final hal.executable ops. The amount of analysis and optimization that can be performed on hal.executable ops is limited and no retargetability is available when directly providing them.

`-iree-stream-dump-statistics`

Dumps stream dialect usage information to a file.

Options

-output-format : Specifies the output format to produce.
-output-file   : File path to write to, or an empty string (``) for stderr or `-` for stdout.

`-iree-stream-elide-async-copies`

Elides copies when they are not performing meaningful work.

Performs whole-program analysis to identify copies that are not required for program correctness or enabling concurrency, such as clones of the last use of a value. This eliminates copies both from input programs and those materialized by the iree-stream-materialize-copy-on-write pass.

`-iree-stream-elide-async-transfers`

Elides transfers when they are not performing meaningful work.

Performs whole-program analysis to identify transfers that are not required for program correctness (transfers to/from the same device, etc).

`-iree-stream-elide-timepoints`

Elides timepoints that are known to be covered by dependent timepoints.

Elides waits on timepoints that are known to be reached by a dependent timepoint. Errs on the side of preserving timepoints if analysis can't guarantee that a particular wait is covered.

Example:

  %timepoint0 = ...
  %timepoint1 = ... await(%timepoint0)
  %timepoint2 = stream.timepoint.join max(%timepoint0, %timepoint1)
->
  %timepoint0 = ...
  %timepoint1 = ... await(%timepoint0)
  %timepoint2 = stream.timepoint.join max(%timepoint1)
-> (canonicalization) ->
  %timepoint0 = ...
  %timepoint1 = ... await(%timepoint0)
  %timepoint2 = %timepoint1
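
The coverage check in the example above can be modeled in a few lines. This is a simplified sketch under stated assumptions: the `awaits` mapping (timepoint name to the timepoints it waits on) and both function names are hypothetical, and the real pass performs whole-program analysis and stays conservative when it cannot prove coverage.

```python
def covered_by(timepoint, awaits):
    """Return every timepoint transitively awaited by `timepoint`."""
    seen = set()
    stack = list(awaits.get(timepoint, ()))
    while stack:
        tp = stack.pop()
        if tp not in seen:
            seen.add(tp)
            stack.extend(awaits.get(tp, ()))
    return seen

def elide_join_operands(operands, awaits):
    """Drop join operands already covered by another operand of the join."""
    kept = []
    for tp in operands:
        others = [o for o in operands if o != tp]
        if any(tp in covered_by(o, awaits) for o in others):
            continue  # reaching the other operand implies this one was reached
        kept.append(tp)
    return kept

# Mirrors the example: timepoint1 awaits timepoint0.
awaits = {"timepoint1": ["timepoint0"]}
print(elide_join_operands(["timepoint0", "timepoint1"], awaits))  # ['timepoint1']
```

When only one operand remains, the join is trivially that operand, which corresponds to the canonicalization step shown in the example.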

`-iree-stream-emplace-allocations`

Emplaces transient tensor allocations to remove copies.

Identifies opportunities for placing operation results directly into existing resources when analysis determines it is safe to do so. This is intended to run after copy-on-write materialization when such analysis can be performed local to the operations. The common case this helps with is insertions of produced results into larger resources such as performed by tensor concatenation.

`-iree-stream-encode-device-tensors`

Encodes tensors into binary formats based on affinity and target support.

Encodes stream.binding.* ops on tensor-like objects while handling packing and encoding as with the iree-stream-encode-host-tensors pass but within executables.

`-iree-stream-encode-host-tensors`

Encodes tensors into storage formats based on affinity and target support.

Encodes stream.tensor.* ops on tensor-like objects into encoding-erased asynchronous stream.async.* ops and resolves (if possible) symbolic encoding ops such as stream.tensor.sizeof into their final values.

Dense tensors are trivially lowerable but other encodings may require additional transfer and dispatch operations. For example, computing the minimal fixed storage size of an unblocked sparse tensor may require the pass to insert a dispatch that traverses the index tables to discover how many elements are present while a blocked sparse tensor may be able to resolve to a simpler calculation based solely on the number of fixed-size blocks.

Sub-byte tensor types or those with non-trivial packing/encoding are also resolved here such as by calculating that a tensor<Nxi4> requires N*4/8 bytes of storage. Some operations like slicing subranges of elements without known alignment may also require additional transfer and dispatch operations to preserve behavior while lowering into the type-erased forms.
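
The sub-byte size calculation can be written out as a short worked example. This sketch assumes tightly packed elements with no per-row or per-element padding and rounds up to a whole byte; both are assumptions of the example, and the function name `storage_size_bytes` is hypothetical.

```python
def storage_size_bytes(num_elements, bit_width):
    """Bytes needed to store num_elements tightly packed bit_width-bit values."""
    total_bits = num_elements * bit_width
    # Round up so a partially filled trailing byte is still counted.
    return (total_bits + 7) // 8

print(storage_size_bytes(8, 4))  # 8 * 4 / 8 = 4
print(storage_size_bytes(7, 4))  # 28 bits round up to 4 bytes
```

For element counts where N*4 is a multiple of 8 this matches the N*4/8 formula in the text exactly; the rounding only matters for odd element counts.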

`-iree-stream-fold-uniform-operands`

Folds redundant and uniformly constant dispatch operands.

Performs whole-program analysis to find all dispatch sites to each dispatch and fold or inline operands that are uniformly passed. For example if multiple dispatch sites pass the same SSA value for two operands (even if dynamically computed) they will be folded into a single value, and if multiple dispatch sites pass the same constant value for the same operand the constant value will be inlined and the operand removed.
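
Both kinds of folding described above can be sketched on a table of operands, one row per dispatch site. This is a toy model: the `("const", value)` / `("ssa", name)` operand encoding and the function name are hypothetical and only stand in for the analysis the pass performs on real IR.

```python
def fold_uniform_operands(sites):
    """Find operands that can be inlined or deduplicated.

    sites: list of operand tuples, one tuple per dispatch site. Each operand is
    ("const", value) or ("ssa", name).
    Returns (kept indices, {index: inlined constant}, {index: duplicate of}).
    """
    num_operands = len(sites[0])
    inlined = {}
    duplicate_of = {}
    for i in range(num_operands):
        column = [site[i] for site in sites]
        # Same constant at every site: inline it and drop the operand.
        if column[0][0] == "const" and all(op == column[0] for op in column):
            inlined[i] = column[0][1]
            continue
        # Same value as an earlier kept operand at every site: fold into it.
        for j in range(i):
            if j in inlined or j in duplicate_of:
                continue
            if all(site[i] == site[j] for site in sites):
                duplicate_of[i] = j
                break
    kept = [i for i in range(num_operands)
            if i not in inlined and i not in duplicate_of]
    return kept, inlined, duplicate_of

sites = [
    (("ssa", "%a"), ("ssa", "%a"), ("const", 4)),
    (("ssa", "%b"), ("ssa", "%b"), ("const", 4)),
]
print(fold_uniform_operands(sites))  # ([0], {2: 4}, {1: 0})
```

Here operand 1 always equals operand 0 at each site, even though the value differs between sites, and operand 2 is the constant 4 everywhere, so only operand 0 remains.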

`-iree-stream-fuse-dispatch-bindings`

Fuses bindings to the same underlying storage to reduce binding count.

Erases dispatch binding subranges and attempts to fuse bindings that originate from the same resources across all dispatch sites.

`-iree-stream-layout-slices`

Lays out packed slices and produces arithmetic required for all offsets.

Performs target-aware layout of packed slices in stream.resource.pack ops. Alignment, padding, and static/dynamic offset calculation of the slices within larger allocated resources happens with awareness of both the resource slices being packed and where they will be consumed.
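
The offset arithmetic involved in laying out slices can be illustrated with a deliberately simple model. This sketch places slices back to back at a single fixed alignment; it ignores everything else the pass takes into account (such as where slices are consumed and dynamic sizes), and the function names are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def layout_slices(slice_sizes, alignment):
    """Assign an aligned offset to each slice and return the total size."""
    offsets = []
    offset = 0
    for size in slice_sizes:
        offset = align_up(offset, alignment)  # pad up to the alignment
        offsets.append(offset)
        offset += size
    return offsets, align_up(offset, alignment)

print(layout_slices([100, 30, 64], 64))  # ([0, 128, 192], 256)
```

With dynamic sizes the same additions and round-ups would have to be emitted as arithmetic in the program rather than computed ahead of time, which is the "arithmetic required for all offsets" that the summary refers to.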

`-iree-stream-materialize-builtins`

Materialize dispatches to builtin executables where required.

Materializes dispatches to builtin executables when operations are not supported by lower layers of the stack. For example, a stream.async.fill op with an i64 pattern will be converted to a stream.async.dispatch of __builtin_fill_i64 and the stream.executable will be merged into the module.

In many cases this kind of emulation happens more naturally during the global optimization phase of the compiler, where it is also more efficient because there is opportunity for fusion into existing dispatches. Sometimes, however, it is not possible to know statically at the time such phases run whether the operations are required, and this pass catches those cases.

Since it is often less efficient to materialize a builtin dispatch than to have fused the operation with others or to have used a pure transfer operation, the materialization is seen as a pessimization that should be avoided. Generally builtins are only added to ensure correct execution and are not used to try to optimize the program.

`-iree-stream-materialize-copy-on-write`

Materializes copy-on-write (🐄) behavior as explicit ops.

Materializes copy-on-write behavior in the program by analyzing usage of !stream.resource<*> types by stream ops. Prior to this pass resources are implicitly immutable and follow SSA semantics, while after the pass any cases where such implicit behavior is assumed have been expanded into appropriate clones of the resources or rematerialization of source values.

As an example, attempting to update the same immutable tensor twice will result in the original tensor being cloned such that each update sees a unique copy:

%init = stream.async.splat %c0
%fill0 = stream.async.fill %c123, %init[...] -> %init
%fill1 = stream.async.fill %c456, %init[...] -> %init
->
%init = stream.async.splat %c0
%clone0 = stream.async.clone %init
%fill0 = stream.async.fill %c123, %clone0[...] -> %clone0
%clone1 = stream.async.clone %init
%fill1 = stream.async.fill %c456, %clone1[...] -> %clone1

A subsequently run iree-stream-elide-async-copies pass can often elide or simplify some of the copies such as above where splatting and then cloning the splat twice is not required. The passes are split to allow for simple local analysis here and for the elision pass to catch input that may already have contained unneeded copies.

`-iree-stream-materialize-encodings`

Materialize stream.tensor.encode ops to dispatches and executables.

Materializes uniqued executables for stream.tensor.encode ops and replaces them with dispatches to those executables.

`-iree-stream-pack-constants`

Packs and allocates backing storage for fused constant resources.

Packs slices of stream.resource.constants ops and materializes operations to initialize them based on their contents. Embedded constants are turned into inline host buffers with operations that try to map them into device memory or perform device-accelerated file I/O asynchronously with other initialization code. Parameters are expanded based on the device memory model to be loads (which may allow mapping memory on devices with unified memory) or gathers (that require allocation and staging on devices with discrete memory).

`-iree-stream-pack-dispatch-operands`

Packs stream dispatch operands into i32 push constants.

Packs dispatch operands (such as i2, i64, complex<f32>, etc) into the required i32 values on the dispatch ABI. May optimize multiple wider bit-width operands with known ranges or alignments into or across fewer operands to reduce the total operand count.
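
Splitting a wide operand into i32 values can be sketched as below. The low-word-first ordering and the function names are assumptions made for this example; the source does not state how the pass orders or combines the pieces.

```python
def pack_i64(value):
    """Split a signed 64-bit integer into (low, high) unsigned 32-bit words."""
    bits = value & 0xFFFFFFFFFFFFFFFF  # two's complement bit pattern
    return bits & 0xFFFFFFFF, bits >> 32

def unpack_i64(lo, hi):
    """Reassemble the signed 64-bit integer from its two 32-bit words."""
    bits = (hi << 32) | lo
    return bits - (1 << 64) if bits >= (1 << 63) else bits

lo, hi = pack_i64(0x100000002)
print(lo, hi)              # 2 1
print(unpack_i64(lo, hi))  # 4294967298
```

The range and alignment optimization mentioned above builds on the same idea: if an i64 operand were known to fit in 32 bits, for instance, the high word would always be zero and would not need to be passed.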

`-iree-stream-propagate-timepoints`

Materializes timepoints and sinks them to consumers throughout the whole program.

Propagates !stream.timepoint values across the whole program in order to avoid host-device and device-device waits where possible without changing correct execution ordering. For example, a host wait on a timepoint via a stream.timepoint.await op guarding a resource passed to a function call will be changed to pass the timepoint to the callee and have the wait occur there, allowing it to be chained with subsequent device operations that may consume the resource. Such propagation happens across global stores and loads, function calls, and control flow.

`-iree-stream-refine-usage`

Refines resource usage bits and inserts transfers where appropriate.

Performs whole-program analysis to assign lifetime and usage attributes to !stream.resource<*> types that have not yet been fixed. Resources are tracked across global loads/stores, function calls, control flow, and operations acting on them to determine how they are used (transfers, host staging, constants, etc). Upon completion all resources have a fixed lifetime and any new resources introduced into the program with an unspecified lifetime (!stream.resource<*>) will require the pass to be run again prior to continued lowering.

`-iree-stream-schedule-allocation`

Allocates resources and converts to explicit stream commands.

Schedules allocation of resources and converts the program from the implicit resource management scheme of the stream.async.* ops into the explicit resource management scheme of the stream.cmd.* ops. After conversion the program cannot be raised as aliasing is introduced and local liveness ranges are erased.

Allocations are performed by asynchronous operations like stream.resource.alloca (and the matching stream.resource.dealloca) and sequenced in the device timeline by !stream.timepoint values.

`-iree-stream-schedule-concurrency`

Identifies asynchronous operations within executable regions that can run concurrently and groups them into streams.

Partitions operations that can execute concurrently within stream.async.execute regions into a tree with stream.async.concurrent ops indicating two or more operations that are allowed to execute concurrently even if resources may alias.
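
One simple way to find groups of operations that may run concurrently is to level a dependency graph into waves. This is only an illustration of the general idea, not the pass's partitioning algorithm, which the text above does not describe; the `deps` mapping and the function name are hypothetical.

```python
def partition_into_waves(ops, deps):
    """Group ops so that every op in a wave depends only on earlier waves.

    ops: operation names in program order.
    deps: maps an op to the earlier ops whose results it consumes.
    """
    wave_of = {}
    waves = []
    for op in ops:
        # An op runs one wave after its latest-scheduled dependency.
        wave = max((wave_of[d] + 1 for d in deps.get(op, ())), default=0)
        wave_of[op] = wave
        if wave == len(waves):
            waves.append([])
        waves[wave].append(op)
    return waves

deps = {"c": ["a", "b"], "d": ["c"]}
print(partition_into_waves(["a", "b", "c", "d"], deps))  # [['a', 'b'], ['c'], ['d']]
```

In this model a wave with two or more operations corresponds to a group that could be wrapped in a concurrent region, while single-operation waves would stay sequential.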

`-iree-stream-schedule-execution`

Identifies and groups asynchronous operations into executable regions within function-like regions.

Partitions stream.async.* operations into execution regions that are executed atomically on a single device. The partitioning algorithm uses the operations being performed and the affinity assigned to them (if any) to determine which are allowed to execute together and is allowed to produce any number of partitions to cover the workload. The original execution order is preserved by the resulting stream.async.execute operations, which use !stream.timepoint values to maintain explicit SSA use-def-based wait-on and signal-to behavior. Scheduling may insert host waits on device work that can later be avoided by timepoint propagation and elision.

`-iree-stream-specialize-dispatches`

Specializes executables by inlining/fusing operands based on dispatch sites.

Reduces the number of operands passed to dispatches by identifying common patterns at dispatch sites across the program that can be compressed into unique dispatch site identifiers. For example, if a dispatch takes several operands that are [0, 1, ...] at one dispatch site and [10, 11, ...] at another the dispatch will be changed to take a single value indicating which set of operands to use and the operands themselves will be placed into a lookup table within the dispatch.
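
The lookup-table transformation can be sketched as follows. This is a toy model with hypothetical names: each dispatch site's operand list becomes a row in a table, and the site passes only the row index.

```python
def specialize(sites):
    """Replace per-site operand lists with indices into a shared table.

    sites: list of operand tuples, one per dispatch site.
    Returns (table of unique operand rows, site identifier for each site).
    """
    table = []
    site_ids = []
    for operands in sites:
        row = tuple(operands)
        if row not in table:
            table.append(row)  # first time this combination is seen
        site_ids.append(table.index(row))
    return table, site_ids

table, site_ids = specialize([(0, 1, 2), (10, 11, 12), (0, 1, 2)])
print(table)     # [(0, 1, 2), (10, 11, 12)]
print(site_ids)  # [0, 1, 0]
```

Each site now passes a single identifier instead of three operands, and the dispatch would recover the original values by indexing the table embedded in the executable.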

`-iree-stream-specialize-encodings`

Specializes serializable encodings based on layout analysis.

Attaches layouts to encodings and duplicates executables based on the encoding layout analysis.

Some executables can be launched by different devices. This can produce wrong codegen artifacts when binding types are encoded (i.e., the tensor type has an encoding attribute), because the encodings can resolve to different layouts, especially when multiple devices are involved. For example, suppose that device_a and device_b interpret a tensor type with encodings in different layouts, and there is an executable that can be launched with resources from either device_a or device_b. The input layouts for the executable are then ambiguous because there are two possibilities. In this case, the executable has to be duplicated with updated encodings, and the dispatch has to be modified to launch the proper executable based on device analysis.

The pass resolves the layouts based on Stream affinity analysis. It updates the encodings of all the Stream tensor ops with resolved layouts, duplicates executables based on the set of incoming layouts and result layouts, and updates bindings with resolved layouts.

Requirements:

- At least one of the dialects implements the AffinityAnalysisDialectInterface dialect interface, because Stream does not need to know any dialect other than itself.
- The binding types have to implement IREE::Encoding::EncodingTypeInterface so that the pass can update the types without accessing any other dialects.
- All the encodings attached to the types have to implement SerializableEncodingAttrInterface, because the pass updates the encodings using interfaces.

`-iree-stream-sync-initializers`

Makes all initializer-produced timepoints synchronously wait before proceeding.

Gathers all global timepoint stores within each initializer and converts them to a single synchronous host wait.

NOTE: this does not currently find timepoints in called functions. To handle that we would need to analyze the call graph to find functions called only from initializers and duplicate any function that is called from both initializers and non-initializer roots. At the point in the pipeline where this pass runs most internal function calls return timepoints and the initializer is the place where they are stored into globals so it happens to work out.

`-iree-stream-verify-affinities`

Verifies that all operations have affinities assigned (directly or indirectly).

`-iree-stream-verify-async-access-ranges`

Verifies that stream.async.* access ranges are in bounds where possible.

`-iree-stream-verify-input`

Verifies that input dialects are supported by the stream dialect.

`-iree-stream-verify-lowering-to-async`

Verifies that all stream.tensor.* ops and types are fully lowered to stream.async.* ops and all resources have an assigned lifetime.

`-iree-stream-verify-lowering-to-async-resources`

Verifies that all stream.tensor.* ops and types are fully lowered to stream.async.* resource ops.

`-iree-stream-verify-lowering-to-cmd`

Verifies that all stream.async.* ops and types are fully lowered to stream.cmd.* ops.

`-iree-stream-verify-lowering-to-tensors`

Verifies that input dialects are converted to stream.tensor.* ops.
