In the previous blog post Reading data from a file with DispatchIO I built a small FileReader which enabled reading data from a file for random byte ranges. Signal Path uses a similar file reader but in addition it also caches read data in some interactions where there are a lot of read requests active, for example, while recording. Therefore, in this blog post we’ll build a small file data cache which stores data chunks for byte ranges. The cache needs to be performant, otherwise it would be faster to just read the data from the file if it is slow.
Defining an interface for the cache
The interface of the FileMemoryCache needs to provide a way for retrieving data, storing data, and limiting the overall memory usage.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The strategy of caching is to use a sorted array where the array is sorted by the first index of a byte range. Items in the array are structs containing a byte range and data. Having a sorted array makes it easier to merge data chunks when retrieving data for a longer byte range than the individual array items provide. As the aim is to keep it as fast as possible then data is merged only when cache is accessed which reduces the need to copy memory multiple times. The downside is that the cache can store overlapping chunks of data which reduces the efficiency. But on the other hand it is fast. Sorted array also enables to use binary search if linear search happens to be too slow with large amount of cached data chunks.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Let’s start with the store method which takes a data and a byte range it corresponds to. For keeping the array sorted, we’ll use the first index of the byte range as the attribute for sorting. We’ll also make sure that the current item at the insertion index does not already contain the byte range we are trying to cache. If it does contain, we can safely ignore the current store request as the data is already cached. Additionally, we’ll check if the previous item in the array as it might also contain the byte range.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When it comes to retrieving data then we’ll need to find intersecting cached data chunks and merge them if needed. Therefore, the first step is to collect cached data chunks which intersect with the requested range. For keeping it as simple as possible, will just loop over the array without using binary search and collect the items which are intersecting. The next step after that is to make sure the whole requested byte range is already present in the cache. This can be checked quite easily with IndexSet which contains indexes for every requested bytes. By removing byte indexes in intersecting ranges, the index set becomes empty. It means that every byte index is cached.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We’ll need to copy intersecting byte indexes into a final contiguous data. It could be that the whole cached data chunk can be copied over or just a part of it. After figuring out which part of the cached data chunk to copy, the next step is to calculate the byte index in the resulting memory region which starts with the zero byte index. The implementation can be seen below which includes byte range index conversions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We built a simple cache for storing data chunks for byte ranges. We used a sorted array for storing individual data chunks which made it easier to merge them into one when the cache is accessed. Make sure to check the full implementation on GitHub which also includes evicting cached data.
Signal Path is an app which works with large files, even in several gigabytes. The app reads ranges from the file and visualizes the data. Signal Path heavily relies on DispatchIO for having efficient access to data in a file. The aim of the blog post is to build a FileReader class which wraps DispatchIO and provides a similar functionality.
DispatchIO manages a file descriptor and coordinates file accesses. DispatchIO object can be created by providing a path to the file and specifying the stream type: stream or random access. In the context of this post we are interested in random access as we would like to read random ranges of bytes from a file. Therefore, let’s start with defining an interface for the FileReader.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The interface is pretty straight-forward with functions to open and close the file and with an asynchronous read function. DispatchIO returns read data as DispatchData which is a contiguous block of memory and what also can be cast into Data if needed.
We can use an init method on DispatchIO which has path argument when dealing with file paths. Note that the queue parameter on DispatchIO just specifies which queue is used for the cleanup handler, most of the cases .main will suffice. After creating the channel we can additionally control if data is returned once or partially with multiple handler callbacks. In our case, we would like to get a single callback, and therefore we need to set the low limit to Int.max. Then the read data is returned in with a single callback.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Reading from a file requires having an opened channel and defining a byte range.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
DispatchIO provides an efficient way for accessing raw bytes in a file. Wrapping it into a FileReader class gives us a compact interface for working with file data. Please checkout the playground which contains the full implementation and the example: FileReaderPlayground.