Storage and Retrieval #
Data Structures that power your DB #
log -> In the DB context, refers to an “append-only sequence of records” (not the same thing as an application’s text log).
Many DBs use a log to store data, but need additional machinery on top, e.g. concurrency control, reclaiming disk space so the log doesn’t grow forever, handling errors and partially written records, etc.
Writing to a log is fast since it only appends to the end of the file, but reading is slow since it requires scanning the entire file (O(n) per lookup); hence DBs need an index.
An index speeds up reads but slows down writes, since it must be updated every time the data changes. That’s why not everything is indexed by default: the developer has to choose indexes that suit the application’s typical query patterns.
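The append-fast/scan-slow tradeoff can be sketched as a toy key-value store (names like `db_set`/`db_get` and the one-`key,value`-record-per-line format are illustrative assumptions, not any real DB’s API):

```python
import os
import tempfile

# Toy log-structured store: writes append, reads scan the whole file.
DB_FILE = os.path.join(tempfile.mkdtemp(), "database.log")

def db_set(key, value):
    # Fast write: just append one record to the end of the log
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Slow read: scan every record; the last match wins, since newer
    # values for a key are appended after older ones
    result = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db_set("user:1", "alice")
db_set("user:1", "bob")      # supersedes the earlier record
print(db_get("user:1"))      # prints "bob": the most recent value wins
```

Note that `db_set` never overwrites anything in place; the “update” is just a newer record later in the file, which is exactly why reads must take the last match.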
Hash Indexes #
Simplest strategy:
- Keep an in-memory hash map where every key is mapped to a byte offset in the data file where the value can be found
- When a new key-value pair is appended to the file, add or update the hash map with the new offset
- To look up a value, get the offset from the hash map, seek to that position in the data file, and read the value
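The three steps above can be sketched as follows (a simplified illustration using the same one-record-per-line format; the class and method names are assumptions, not Bitcask’s actual implementation):

```python
import os
import tempfile

class HashIndexedLog:
    """Append-only data file plus an in-memory hash map of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> byte offset of its latest record
        open(path, "ab").close()   # ensure the data file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()                    # record starts at this byte
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset                 # add/update the hash map

    def get(self, key):
        offset = self.index.get(key)             # O(1) in-memory lookup
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                       # one seek, no full scan
            return f.readline().decode().rstrip("\n").partition(",")[2]

db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "data.log"))
db.set("user:1", "alice")
db.set("user:1", "bob")    # new record appended; index now points to it
print(db.get("user:1"))    # prints "bob" without scanning the file
```

The data file is still a plain append-only log; only the dict of offsets lives in memory, which is why this design requires all keys (though not all values) to fit in RAM.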
This is used by Bitcask, the default storage engine for Riak. It suits workloads where the value for each key is updated frequently: there are many writes, but few enough distinct keys that it’s feasible to keep all of them in memory.
Compaction - A strategy to reclaim disk space:
- Break the log into segments of a certain size and save them in separate files
- Perform compaction on each segment, i.e., throw away duplicate keys and keep only the most recent update for each key
- Since compaction makes segment files smaller, several segments can also be merged together into a new segment file (the old segment files can then be discarded)
- Merging and compaction can be done in a background thread while requests continue to be served from the old segment files, so there’s no perf hit to live workloads; once the merge completes, reads switch to the new segment and the old files are deleted
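The compaction-and-merge step can be sketched like this (a minimal illustration assuming the same `key,value`-per-line format and that `segment_paths` is ordered oldest to newest; the function name is made up):

```python
import os
import tempfile

def compact_and_merge(segment_paths, output_path):
    # Later segments override earlier ones, and within a segment later
    # records override earlier ones, so a plain dict of "latest value
    # seen so far" is enough to drop all the duplicates.
    latest = {}
    for path in segment_paths:
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                latest[key] = value
    # Write one record per key into the new, smaller segment file
    with open(output_path, "w") as out:
        for key, value in latest.items():
            out.write(f"{key},{value}\n")
    return latest

# Build two small segments in a temp directory to demonstrate
tmp = tempfile.mkdtemp()
seg1 = os.path.join(tmp, "segment1.log")
seg2 = os.path.join(tmp, "segment2.log")
with open(seg1, "w") as f:
    f.write("cat,1\ncat,2\ndog,1\n")   # "cat" was updated twice
with open(seg2, "w") as f:
    f.write("dog,2\nbird,1\n")         # newer segment overrides "dog"

merged = compact_and_merge([seg1, seg2], os.path.join(tmp, "merged.log"))
print(merged)  # {'cat': '2', 'dog': '2', 'bird': '1'}
```

Five input records collapse to three, which is the disk-space reclamation the bullets above describe; the original segment files stay untouched until the merged file is complete, so they can keep serving reads in the meantime.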