Metal HAL driverlink

This document lists technical details regarding the Metal implemenation of IREE's Hardware Abstraction Layer, called a Metal HAL driver.

IREE provides a Hardware Abstraction Layer (HAL) as a common interface to different compute accelerators. IREE HAL's design draws inspiration from modern GPU architecture and APIs; so implementing a HAL driver using modern GPU APIs is generally straightforward. This applies to the Metal HAL driver.

Overall Design Choiceslink

Metal Versionslink

Currently the Metal HAL driver expects Metal 3 capabilities. Metal 3 was released late 2022 and are supported since macOS Ventura and iOS 16. It covers recent Apple silicon GPUs including A13+ and M1+ chips and others.

In the future, we expect to increase the support to cover Metal 2 capabilities. Metal 2 introduces useful features like argument buffer and others that are necessary for performance and make IREE HAL implementation simpler. Metal 2 was released late 2017 and are supported since macOS High Sierra and iOS 11. It is already dominant (macOS, iOS).

Programming Languages and Librarieslink

The Metal framework only exposes Objective-C or Swift programming language APIs. Metal HAL driver needs to inherit from common HAL abstraction definitions, which are in C. To minimize dependency and binary size and increase performance, we use Metal's Objective-C API for implementing the Metal HAL driver. Header (.h) and implementation (.m) files are put adjacent to each other.

Object Lifetime Managementlink

Objective-C uses refcount for tracking object lifetime and managing memory. This is traditionally done manually by sending retain and release messages to Objective-C objects. Modern Objective-C allows developers to opt in to use Automatic Reference Counting to let the compiler to automatically deduce and insert retain/release where possible to simplify the burdern of manual management.

We don't use ARC in the Metal HAL driver given that IREE has its own object refcount and lifetime management mechanism. Metal HAL GPU objects are tracked with that to be consistent with others. Each Metal HAL GPU object retains the underlying Metal id<MTL*> object on construction and releases on destruction.

GPU Objectslink

Metal is one of the main modern GPU APIs that provide more explicit control over the hardware. The mapping between IREE HAL classes and Metal protocols are relatively straightforward:

IREE HAL Class	Metal Protocol
`iree_hal_driver_t`	N/A
`iree_hal_device_t`	`MTLDevice`
`iree_hal_command_buffer_t`	`MTLCommandBuffer`
`iree_hal_semaphore_t`	`MTLSharedEvent`
`iree_hal_allocator_t`	N/A
`iree_hal_buffer_t`	`MTLBuffer`
`iree_hal_executable_t`	`MTLLibrary`
`iree_hal_executable_cache_t`	N/A

In the following subsections, we go over each pair to provide more details.

Driverlink

There is no native driver abstraction in Metal. IREE's Metal HAL driver still provides a iree_hal_metal_driver_t struct to implement the common iree_hal_driver_t struct. iree_hal_metal_driver_t just retains all available Metal devices in the system during its lifetime, to guarantee that we have the same id<MTLDevice> for device querying and creation.

Devicelink

iree_hal_metal_device_t implements iree_hal_device_t to provide the interface to Metal GPU device by wrapping a id<MTLDevice>. Upon construction, iree_hal_metal_device_t creates and retains one queue for both dispatch and transfer during its lifetime. In the future we expect to spport multiple queues for better concurrency.

Command buffer submissionlink

In IREE HAL, command buffers are directly created from the iree_hal_device_t. It's also directly submitted there via iree_hal_device_queue_execute(). Each execution takes a batch of command buffers, together with a list of waiting iree_hal_semaphore_ts and a list signaling iree_hal_semaphore_ts. There is no direct mapping of such structure in Metal; so we performs the submission in three steps:

Create a new MTLCommandBuffer to encodeWaitForEvent:value for all waiting iree_hal_semaphore_ts and commit this command buffer.
Commit all command buffers in the submmision batch.
Create a new MTLCommandBuffer to encodeSignalEvent:value for all signaling iree_hal_semaphore_ts and commit this command buffer.

Such submission enables asynchronous execution of the workload on the GPU.

Queue-ordered allocationlink

Queue-ordered asynchronous allocations via iree_hal_device_queue_alloc is not fully supported yet; it just translates to blocking wait and allocation.

Collectiveslink

Collectives suppport is not yet implemented.

Profilinglink

The Metal HAL driver supports profiling via MTLCaptureManager. We can either capture to a trace file or XCode.

To perform profiling in the command line, attach --device_profiling_mode=queue --device_profiling_file=/path/to/metal.gputrace to IREE binaries.

Command bufferlink

Command buffers are where IREE HAL and Metal API have a major difference.

IREE HAL command buffers follow the flat Vulkan recording model, where all memory or dispatch commands are recorded into a command buffer directly. Unlike Vulkan, Metal adopts a multi-level command recording model--memory/dispatch commands are not directly recorded into a command buffer; rather, they must go through the additional level of blit/compute encoders. Implementing IREE's HAL using Metal would require switching encoders for interleaved memory and dispatch commands. Additionally, certain IREE HAL API features do not have direct mapping in Metal APIs, e.g., various forms of IREE HAL execution/memory barriers. Translating them would require looking at both previous and next commands to decide the proper mapping.

Due to these reasons, it's beneficial to have a complete view of the full command buffer and extra flexibility during recording, in order to fixup past commands, or inspect future commands.

Therefore, to implement IREE HAL command buffers using Metal, we perform two steps using a linked list of command segments: First we create segments to keep track of all IREE HAL commands and the associated data. And then, when finalizing the command buffer, we iterate through all the segments and record their contents into a proper MTLCommandBuffer. A linked list gives us the flexibility to organize command sequence in low overhead; and a deferred recording gives us the complete picture of the command buffer when really started recording.

The Metal HAL driver right now only support one-shot command buffers, by mapping to MTLCommandBuffers.

Fill/copy/update bufferlink

Metal APIs for fill and copy buffers have alignment restrictions on the offset and length. iree_hal_command_buffer_{fill|copy|update}_buffer() is more flexible regarding that. So for cases aren't directly supported by Metal APIs, we use polyfill compute kernels to perform the memory operation using GPU threads.

Semaphorelink

iree_hal_semaphore_t allows host->device, device->host, host->host, and device->device synchronization. It maps to Vulkan timeline semaphore. In Metal world, the counterpart would be MTLSharedEvent. Most of the iree_hal_semaphore_t APIs are simple to implement in MetalSharedEvent, with iree_hal_semaphore_wait() as an exception. A listener is registered on the MTLSharedEvent with notifyListener:atValue:block: to singal a semaphore to wake the current thread, which is put into sleep by waiting on the semaphore.

Allocatorlink

At the moment the Metal HAL driver just has a very simple iree_hal_allocator_t implementation. It just wraps a MTLDevice and redirects all allocation requests to the MTLDevice. No page/pool/slab or whatever. This is meant to be used together with common allocator layers like the caching allocator.

Bufferlink

IREE iree_hal_buffer_t maps Metal MTLBuffer. See Object Lifetime Management for more details.

Executablelink

IREE iree_hal_executable_t represents a GPU program archive with a driver-defined format. It maps naturally to Metal MTLLibrary. An entry point in a MTLLibrary is a MTLFunction. We define iree_hal_metal_executable_t to wrap around a MTLLibrary, its MTLFunctions, and also MTLComputePipelineState objects constructed from MTLFunctions.

Executable cachelink

IREE iree_hal_executable_cache_t is modeling a cache of preprared GPU executables for a particular device. At the moment the Metal HAL driver does not peforming any caching on GPU programs; it simply reads the program from the FlatBuffer and hands it over to Metal driver.

Descriptor set / pipeline layoutlink

See Resource descriptors for more details.

Compute Pipelinelink

Shader/kernel compilationlink

Metal has Metal Shading Language (MSL) for authoring graphics shaders and compute kernels. MSL source code can be directly consumed by the Metal framework at run-time; it can also be compiled first into an opaque library using command-line tools at build-time.

IREE uses compilers to compile ML models expressed with high-level op semantics down to GPU native source format. This is also the case for the Metal HAL driver. Metal does not provide an open intermediate language; we reuse the SPIR-V code generation pipeline and then cross compile the generated SPIR-V into MSL source with SPIRV-Cross. This is actually a fair common practice for targeting multiple GPU APIs in graphics programming world. For example, the Vulkan implmenation in macOS/iOS, MoltenVK, is also doing the same for shaders/kernels. The path is quite robust, as demonstrated by various games on top of MoltenVK.

Therefore, in IREE, we have a MetalSPIRVTargetBackend, which pulls in the common SPIR-V passes to form the compilation pipeline. The difference would be to provide a suitable SPIR-V target environment to drive the compilation, which one can derive from the Metal GPU families to target. The serialization step differs from VulkanSPIRVTargetBackend too: following the normal SPIR-V serialization step, we additionally need to invoke SPRIV-Cross to cross compile the generated SPIR-V into MSL, and then compile and/or serialize the MSL source/library.

IREE uses FlatBuffer to encode the whole workload module, including both GPU shader/kernel (called executable in IREE terminology) and CPU scheduling logic. The GPU executables are embedded as part of the module's FlatBuffer, which are mmapped when IREE runs.

For the Metal HAL driver, it means we need to embed the MSL kernels inside the module FlatBuffer. Right now we can either encode the MSL source strings and compile them at Metal run-time, or directly encoding the library instead.

Workgroup/threadgroup sizelink

When dispatching a compute kernel in Metal, we need to specify the number of thread groups in grid and the number of threads in thread group. Both are 3-D vectors. IREE HAL, which follows Vulkan, calls them workgroup count and workgroup size, respectively.

In Vulkan programming model, workgroup count and workgroup size are specified at different places: the former is given when invoking vkCmdDispatch(), while the later is encoded in the dispatched SPIR-V code. This split does not match the Metal model, where we specify both in the API with dispatchThreads:threadsPerThreadgroup:.

As said in shader/kernel compilation, MSL kernels are cross compiled from SPIR-V code and then embeded in the module FlatBuffer. The module FlatBuffer provides us a way to convey the threadgroup/workgroup size information extracted from the SPIR-V code. We encode an additional 3-D vector for each entry point and use it as the threadgroup size when later dispatching the MTLFunction corresponding to the entry point.

Command buffer dispatchlink

Metal HAL driver command buffer dispatch recording performs the following steps with the current active MTLComputeCommandEncoder:

Bind the MTLComputePipelineState for the current entry function.
Encode the push constants using setBytes:length:atIndex.
For each bound descriptor set at set #S:
Create a MTLArgumentEncoder for encoding an associated argument MTLBuffer.
For each bound resource buffer at binding #B in this descriptor set, encode it to the argument buffer index #B with setBuffer::offset::atIndex: and inform the MTLComputeCommandEncoder that the dispatch will use this resource with useResource:usage:.
Set the argument MTLBuffer to buffer index #S.
Dispatch with dispatchThreadgroups:threadsPerThreadgroup:.