The Metal framework on Apple devices provides a way of using the GPU for running complex computing tasks much faster than on the CPU. In this blog post I give a quick overview of how to set up a Metal compute pipeline and process data on the GPU.
Metal compute kernel
As a first step we need to create a Metal compute kernel which will run on the GPU. In this example project it is going to be very simple and just multiplies the input data by a factor of 2.
The compute kernel is written in the Metal shading language. In the current example we first give a name to the function and then specify the arguments: the first argument is a constant float vector, and the second argument is the output float vector we are going to mutate. The third argument is the thread's position in the input vector. When running a compute kernel there are multiple threads processing the input data, and this position tells us which element we should be modifying.
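The kernel itself could look something like the following sketch, built from the function name processData and the buffer signatures referenced later in this post (the thread position argument name id is my choice):

#include <metal_stdlib>
using namespace metal;

// Multiplies every element of the input vector by 2.
kernel void processData(const device float *inVector [[ buffer(0) ]],
                        device float *outVector [[ buffer(1) ]],
                        uint id [[ thread_position_in_grid ]])
{
    outVector[id] = inVector[id] * 2.0;
}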
Setting up Metal compute pipeline
To run the created compute kernel on the GPU we need to create a compute pipeline that uses it.
init()
{
    guard let device = MTLCreateSystemDefaultDevice() else { fatalError("Metal device is not available.") }
    self.device = device
    guard let commandQueue = device.makeCommandQueue() else { fatalError("Failed creating Metal command queue.") }
    self.commandQueue = commandQueue
    guard let library = device.makeDefaultLibrary() else { fatalError("Failed creating Metal library.") }
    guard let function = library.makeFunction(name: "processData") else { fatalError("Failed creating Metal function.") }
    do
    {
        computePipelineState = try device.makeComputePipelineState(function: function)
    }
    catch
    {
        fatalError("Failed preparing compute pipeline.")
    }
}
Note that Apple recommends creating and reusing Metal objects where possible. With that in mind, we first create a MTLDevice, which represents a single GPU, followed by a MTLCommandQueue, which is a serial queue handling the command buffers the GPU executes (more about that later). The third step is to allocate a MTLLibrary, find the MTLFunction representing the created compute kernel, and finally initialize a MTLComputePipelineState with that function. Now we have everything set up for using the created compute kernel.
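For context, here is a minimal sketch of the type the initializer above could live in (the class name MetalCompute is my assumption; the stored properties follow from what init() assigns):

import Metal

final class MetalCompute
{
    let device: MTLDevice
    let commandQueue: MTLCommandQueue
    let computePipelineState: MTLComputePipelineState

    // The init() shown above assigns all three of these properties.
}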
Running Metal compute pipeline
func process(data: ContiguousArray<Float>) -> ContiguousArray<Float>
{
    // Copy the input floats into a MTLBuffer the GPU can read.
    let dataBuffer = data.withUnsafeBytes { (bufferPointer) -> MTLBuffer? in
        guard let baseAddress = bufferPointer.baseAddress else { return nil }
        return device.makeBuffer(bytes: baseAddress, length: bufferPointer.count, options: .storageModeShared)
    }
    guard let inputBuffer = dataBuffer else { return [] }
    // A second buffer of the same length receives the results.
    guard let outputBuffer = device.makeBuffer(length: inputBuffer.length, options: .storageModeShared) else { return [] }
    guard let commandBuffer = commandQueue.makeCommandBuffer() else { return [] }
    guard let commandEncoder = commandBuffer.makeComputeCommandEncoder() else { return [] }
    // Encode the compute kernel together with its input and output buffers.
    commandEncoder.setComputePipelineState(computePipelineState)
    commandEncoder.setBuffer(inputBuffer, offset: 0, index: 0)
    commandEncoder.setBuffer(outputBuffer, offset: 0, index: 1)
    // Note: this sizing assumes data.count is a multiple of the threadgroup width.
    let threadsPerThreadgroup = MTLSize(width: 10, height: 1, depth: 1)
    let threadgroupsPerGrid = MTLSize(width: data.count / threadsPerThreadgroup.width, height: threadsPerThreadgroup.height, depth: threadsPerThreadgroup.depth)
    commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder.endEncoding()
    // Submit the work to the GPU and block until it has finished.
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    // Read the processed values back out of the output buffer.
    let outputPointer = outputBuffer.contents().assumingMemoryBound(to: Float.self)
    let outputDataBufferPointer = UnsafeBufferPointer<Float>(start: outputPointer, count: data.count)
    return ContiguousArray<Float>(outputDataBufferPointer)
}
Let's now take a look at the process function, which takes in a contiguous float array and returns an array with the values processed by the GPU. Exposing data to the GPU is managed by MTLBuffer. ContiguousArray stores its elements in a contiguous region of memory, therefore we can access the contents of the memory directly and create an instance of MTLBuffer containing a copy of the float array.
Another instance of MTLBuffer is needed for storing the output values.
MTLCommandBuffer is a buffer containing encoded commands which are in turn executed by the GPU. Finally, we can create a MTLComputeCommandEncoder object referencing the input and output data buffers and the compute kernel. This is the object actually defining the work we want to run on the GPU. For that we first set the compute pipeline state, which stores the information about our compute kernel, followed by setting the data buffers. Note that index 0 refers to the first buffer in the kernel's implementation, const device float *inVector [[ buffer(0) ]], which defines the input, and index 1 refers to the second buffer, device float *outVector [[ buffer(1) ]], for the output. Apple's Calculating Threadgroup and Grid Sizes guide contains detailed information on how to manage the number of threads processing the data. When this is set, we mark the command encoder ready, commit the buffer for the GPU to execute, and then wait for it to finish. When the command buffer has finished, we can access the data in the output buffer.
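Putting it all together, usage could look something like this (here I assume the class created earlier is named MetalCompute; the input values are arbitrary, with a count that is a multiple of the threadgroup width of 10 used above):

let compute = MetalCompute()
let input = ContiguousArray<Float>((0..<20).map { Float($0) })
let output = compute.process(data: input)
// With the kernel above, every output element equals the input element multiplied by 2.
print(output) // [0.0, 2.0, 4.0, …]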
For more detailed information, please see Apple's documentation for Metal.
Check out the whole sample application written in Swift 4 here: MetalCompute at GitHub. Make sure to run it on an iOS device, not in the simulator.