Categories
iOS macOS Swift

Building a memory cache for a file reader in Swift

In the previous blog post Reading data from a file with DispatchIO I built a small FileReader which enabled reading data from a file for random byte ranges. Signal Path uses a similar file reader but in addition it also caches read data in some interactions where there are a lot of read requests active, for example, while recording. Therefore, in this blog post we’ll build a small file data cache which stores data chunks for byte ranges. The cache needs to be performant, otherwise it would be faster to just read the data from the file if it is slow.

Defining an interface for the cache

The interface of the FileMemoryCache needs to provide a way for retrieving data, storing data, and limiting the overall memory usage.

final class FileMemoryCache {
init(byteCountLimit: Int)
/// Returns data for the byte range
func data(for byteRange: CountableRange<Int>) -> DispatchData?
/// Caches data for the byte range and evicts older data when exceeding byte count limit
func set(_ data: DispatchData, byteRange: CountableRange<Int>)
/// Removes everything from the cache
func removeAll()
}

Storing and retrieving cached data

The strategy of caching is to use a sorted array where the array is sorted by the first index of a byte range. Items in the array are structs containing a byte range and data. Having a sorted array makes it easier to merge data chunks when retrieving data for a longer byte range than the individual array items provide. As the aim is to keep it as fast as possible then data is merged only when cache is accessed which reduces the need to copy memory multiple times. The downside is that the cache can store overlapping chunks of data which reduces the efficiency. But on the other hand it is fast. Sorted array also enables to use binary search if linear search happens to be too slow with large amount of cached data chunks.

final class FileMemoryCache {
private var sortedItems = [StorageItem]()
struct StorageItem {
let byteRange: CountableRange<Int>
let data: DispatchData
}
}

Let’s start with the store method which takes a data and a byte range it corresponds to. For keeping the array sorted, we’ll use the first index of the byte range as the attribute for sorting. We’ll also make sure that the current item at the insertion index does not already contain the byte range we are trying to cache. If it does contain, we can safely ignore the current store request as the data is already cached. Additionally, we’ll check if the previous item in the array as it might also contain the byte range.

func set(_ data: DispatchData, byteRange: CountableRange<Int>) {
if let insertionIndex = sortedItems.firstIndex(where: { $0.byteRange.startIndex >= byteRange.startIndex }) {
let canDiscard: Bool = {
guard !sortedItems[insertionIndex].byteRange.contains(byteRange) else { return true }
guard insertionIndex > sortedItems.startIndex else { return false }
return sortedItems[insertionIndex.advanced(by: -1)].byteRange.contains(byteRange)
}()
if !canDiscard {
sortedItems.insert(StorageItem(byteRange: byteRange, data: data), at: insertionIndex)
}
}
else {
sortedItems.append(StorageItem(byteRange: byteRange, data: data))
}
}

When it comes to retrieving data then we’ll need to find intersecting cached data chunks and merge them if needed. Therefore, the first step is to collect cached data chunks which intersect with the requested range. For keeping it as simple as possible, will just loop over the array without using binary search and collect the items which are intersecting. The next step after that is to make sure the whole requested byte range is already present in the cache. This can be checked quite easily with IndexSet which contains indexes for every requested bytes. By removing byte indexes in intersecting ranges, the index set becomes empty. It means that every byte index is cached.

func data(for byteRange: CountableRange<Int>) -> DispatchData? {
// Find items which intersect with the search range
guard let firstIntersectingItemIndex = sortedItems.firstIndex(where: { byteRange.intersects($0.byteRange) }) else { return nil }
let intersectingItems = sortedItems[firstIntersectingItemIndex…].prefix(while: { byteRange.intersects($0.byteRange) })
// Check if the whole range is covered
var byteIndexesInCache = IndexSet(integersIn: byteRange)
intersectingItems.forEach({ byteIndexesInCache.remove(integersIn: $0.byteRange) })
guard byteIndexesInCache.isEmpty else { return nil }
// …
}

We’ll need to copy intersecting byte indexes into a final contiguous data. It could be that the whole cached data chunk can be copied over or just a part of it. After figuring out which part of the cached data chunk to copy, the next step is to calculate the byte index in the resulting memory region which starts with the zero byte index. The implementation can be seen below which includes byte range index conversions.

func data(for byteRange: CountableRange<Int>) -> DispatchData? {
// …
guard let byteRangePointer = malloc(byteRange.count) else { return nil }
for storageItem in intersectingItems {
let rangeToCopyInData: CountableRange<Int> = {
switch (byteRange.contains(storageItem.byteRange.startIndex), byteRange.contains(storageItem.byteRange.endIndex – 1)) {
case (true, true):
// Everything: |+++++++++++++|
return 0..<storageItem.byteRange.count
case (false, true):
// End: |—++++++++++|
return byteRange.startIndex – storageItem.byteRange.startIndex..<storageItem.byteRange.count
case (true, false):
// Start: |++++++++++—|
return 0..<byteRange.endIndex – storageItem.byteRange.startIndex
case (false, false):
// Middle: |—+++++++—|
return byteRange.startIndex – storageItem.byteRange.startIndex..<byteRange.endIndex – storageItem.byteRange.startIndex
}
}()
// Find the byte range in the allocated memory range where to copy the cached data section
let destinationByteIndex = storageItem.byteRange.startIndex – byteRange.startIndex + rangeToCopyInData.startIndex
let destinationPointer = byteRangePointer.advanced(by: destinationByteIndex)
let destinationBufferPointer = UnsafeMutableRawBufferPointer(start: destinationPointer, count: rangeToCopyInData.count)
storageItem.data.copyBytes(to: destinationBufferPointer, from: rangeToCopyInData)
// Ignore remaining intersections because everything is copied
if destinationByteIndex + rangeToCopyInData.count >= byteRange.endIndex {
break
}
}
let byteRangeBufferPointer = UnsafeRawBufferPointer(start: byteRangePointer, count: byteRange.count)
return DispatchData(bytesNoCopy: byteRangeBufferPointer, deallocator: .free)
}

Summary

We built a simple cache for storing data chunks for byte ranges. We used a sorted array for storing individual data chunks which made it easier to merge them into one when the cache is accessed. Make sure to check the full implementation on GitHub which also includes evicting cached data.

If this was helpful, please let me know on Mastodon@toomasvahter or Twitter @toomasvahter. Feel free to subscribe to RSS feed. Thank you for reading.

Example Project

FileReaderPlayground (Xcode 12.4)

Categories
Swift

Reading data from a file with DispatchIO

Signal Path is an app which works with large files, even in several gigabytes. The app reads ranges from the file and visualizes the data. Signal Path heavily relies on DispatchIO for having efficient access to data in a file. The aim of the blog post is to build a FileReader class which wraps DispatchIO and provides a similar functionality.

DispatchIO manages a file descriptor and coordinates file accesses. DispatchIO object can be created by providing a path to the file and specifying the stream type: stream or random access. In the context of this post we are interested in random access as we would like to read random ranges of bytes from a file. Therefore, let’s start with defining an interface for the FileReader.

final class FileReader {
init(fileURL: URL)
/// Opens the I/O channel with random access semantics
func open()
/// Closes the opened I/O channel
func close()
/// Reads data at byte range.
func read(byteRange: CountableRange<Int>, queue: DispatchQueue = .main, completionHandler: @escaping (DispatchData?) -> Void)
}
The interface for DispatchIO wrapping class.

The interface is pretty straight-forward with functions to open and close the file and with an asynchronous read function. DispatchIO returns read data as DispatchData which is a contiguous block of memory and what also can be cast into Data if needed.

We can use an init method on DispatchIO which has path argument when dealing with file paths. Note that the queue parameter on DispatchIO just specifies which queue is used for the cleanup handler, most of the cases .main will suffice. After creating the channel we can additionally control if data is returned once or partially with multiple handler callbacks. In our case, we would like to get a single callback, and therefore we need to set the low limit to Int.max. Then the read data is returned in with a single callback.

func open() -> Bool {
guard channel == nil else { return true }
guard let path = (fileURL.path as NSString).utf8String else { return false }
channel = DispatchIO(type: .random, path: path, oflag: 0, mode: 0, queue: .main, cleanupHandler: { error in
print("Closed a channel with status: \(error)")
})
// Load the whole requested byte range at once
channel?.setLimit(lowWater: .max)
guard self.channel != nil else { return false }
print("Opened a channel at \(fileURL)")
return true
}
func close() {
channel?.close()
channel = nil
}

Reading from a file requires having an opened channel and defining a byte range.

func read(byteRange: CountableRange<Int>, queue: DispatchQueue = .main, completionHandler: @escaping (DispatchData?) -> Void) {
if let channel = channel {
channel.read(offset: off_t(byteRange.startIndex), length: byteRange.count, queue: queue, ioHandler: { done, data, error in
print(done, data?.count ?? -1, error)
completionHandler(data)
})
}
else {
print("Channel is closed")
completionHandler(nil)
}
}

And the file reader can be used like this:

let fileURL = Bundle.main.url(forResource: "DataFile", withExtension: nil)!
let reader = FileReader(fileURL: fileURL)
if reader.open() {
reader.read(byteRange: 0..<20) { data in
if let data = data {
print("Read bytes: \(data.map({ UInt8($0) }))")
}
else {
print("Failed to read data")
}
}
}
else {
print("Failed to open")
}

Summary

DispatchIO provides an efficient way for accessing raw bytes in a file. Wrapping it into a FileReader class gives us a compact interface for working with file data. Please checkout the playground which contains the full implementation and the example: FileReaderPlayground.

If this was helpful, please let me know on Mastodon@toomasvahter or Twitter @toomasvahter. Feel free to subscribe to RSS feed. Thank you for reading.

Example Project

FileReaderPlayground (Xcode 12.4)