Distributed Systems — Let’s start from the basics
Here I am trying to explain the concepts you are going to encounter while developing data-intensive applications. This blog is a combination of my own thoughts and my notes from the book “Designing Data-Intensive Applications”.
In this post, we will explore how distributed storage systems can be implemented. You will see the same overlapping concepts reappear across the underlying subsystems, so it’s a good idea to get familiar with them.
Fundamentally there are only two interrelated problems to solve.
- How to store the data?
- How to retrieve the data?
There are two philosophies that often get debated: NoSQL (log-structured storage systems) and SQL (page-oriented storage systems).
Let’s focus on Log-Structured Storage Systems first.
We can start with a simple script that exposes put() and get(): put() appends the key and value to a file, and get() scans the file and returns the most recent entry for that key.
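To make this concrete, here is a minimal sketch of that append-only idea. The post describes it as a bash script; this Python version is just an equivalent illustration, and the file name database.txt is a placeholder of my own, not something from the book.

```python
DB_FILE = "database.txt"   # illustrative file name

def put(key, value):
    # Appending is cheap: old values for the key are never overwritten, only superseded.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def get(key):
    # O(n): scan every record and remember the latest value seen for this key.
    result = None
    try:
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result

put("42", '{"name": "example"}')
print(get("42"))   # prints the value most recently stored for key "42"
```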
What are the problems?
Problem 1: Read complexity is O(n)
There are other problems as well, but let’s address them one at a time; we will get to the rest later. So the problem to focus on first is: how do we reduce the read complexity?
Okay, an index is the concept any developer will fall back on.
What should we keep in mind when selecting indexes?
- Writes get slower with every additional index that has to be maintained, so don’t create more indexes than you need.
- Indexes should be kept in memory for reads to be fast. Otherwise, there is not much point in having an index! Okay, don’t take this statement too seriously :) but keep the concept in mind.
Implementation Notes.
- When building indexes, one detail to take care of is how fast you can rebuild the index (for example, after a crash).
Hash Indexes:
So let’s start with hash indexes: maintain an in-memory map of <key, byte offset of the value in the file>.
Example: put(42, {jsonblob}).
- We append the JSON blob to the file (say it lands at byte offset 100).
- We create an in-memory index entry <42, 100>, so that while retrieving we can ‘random seek’ to byte 100 and read back the value for key 42 (see the sketch after this list).
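A minimal sketch of this hash-index idea, assuming the same kind of append-only file as before; the file name, record format, and function names are my own illustration, not prescribed by the book.

```python
import json

DB_FILE = "data.log"   # illustrative file name
index = {}             # in-memory hash index: key -> byte offset of the latest record

def put(key, value):
    record = (str(key) + "," + json.dumps(value) + "\n").encode("utf-8")
    with open(DB_FILE, "ab") as f:
        offset = f.tell()      # byte offset where this record starts
        f.write(record)
    index[key] = offset        # newest offset wins; older records become garbage

def get(key):
    offset = index.get(key)
    if offset is None:
        return None            # key was never written (or the index was lost)
    with open(DB_FILE, "rb") as f:
        f.seek(offset)         # 'random seek' straight to the record for this key
        line = f.readline().decode("utf-8").rstrip("\n")
    _, _, raw = line.partition(",")
    return json.loads(raw)

put(42, {"answer": True})
print(get(42))   # -> {'answer': True}
```

Note that reads now cost one seek instead of a full scan, at the price of keeping every key in memory.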
The constraint this approach imposes is that all the keys must fit in memory. It works well when you know the number of distinct keys is limited, so it is definitely worth considering for payloads that match that pattern.
Now, on to the next problem.
Problem 2: What if the data file outgrows the available disk space?
Approach
- Split the data into multiple segment files; each file can live on a different machine.
- Delete/compact the data that is no longer needed: compaction keeps only the most recent value for each key (a rough sketch follows this list).
- Since the data is no longer in a single file, we need to maintain a hash table that tells us which file to search when serving a read request for a particular key.
- Handling deletes: usually by appending a special deletion record (a tombstone); Bloom filters also help avoid searching segments for keys that don’t exist. I will elaborate on this later.
- Crash recovery: periodically save a snapshot of the index to disk so it can be reloaded quickly instead of being rebuilt from scratch.
- Partial writes: a record may be written to disk successfully without the hash table being updated (for example, due to a crash mid-write); such records have to be detected and discarded eventually.
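To make the approach above concrete, here is a rough sketch of compaction across segment files, plus saving an index snapshot for crash recovery. The segment layout, tombstone marker, and function names are my own illustration, not something specified in the book or earlier in this post.

```python
import json

TOMBSTONE = "__deleted__"   # hypothetical marker written when a key is deleted

def read_segment(path):
    """Yield (key, value) pairs from one append-only segment file, oldest first."""
    with open(path) as f:
        for line in f:
            key, _, raw = line.rstrip("\n").partition(",")
            yield key, json.loads(raw)

def compact(segment_paths, output_path):
    """Merge segments from oldest to newest, keeping only the latest value per key."""
    latest = {}
    for path in segment_paths:            # later segments overwrite earlier ones
        for key, value in read_segment(path):
            latest[key] = value
    index = {}                            # rebuilt hash index: key -> byte offset
    with open(output_path, "w") as out:
        for key, value in latest.items():
            if value == TOMBSTONE:        # deleted keys are dropped during compaction
                continue
            index[key] = out.tell()
            out.write(f"{key},{json.dumps(value)}\n")
    return index

def save_index_snapshot(index, path):
    """Persist the in-memory index so it can be reloaded quickly after a crash."""
    with open(path, "w") as f:
        json.dump(index, f)

# Usage sketch (file names are illustrative):
# new_index = compact(["segment-1.log", "segment-2.log"], "segment-merged.log")
# save_index_snapshot(new_index, "segment-merged.index")
# ...after which the old segment files can be deleted to reclaim disk space.
```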
Next Problem: Stay tuned :)