Foundations of Data Systems #

3 Important Characteristics of Data Systems #

Reliability: System should continue to work correctly even in the face of adversity and maintain same perf levels
Scalability: As the system grows, there should be reasonable ways to deal with the growth
Maintainability: Different people should be able to work on it productively across time

Reliability #

This can include:

Performing expected functions correctly
Tolerate user mistakes or unexpected inputs
Maintain good enough performance under load
Prevent unauthorized access and abuse

Types of faults:

Hardware -> Typically mitigated by redundant hardware
Software -> Typically mitigated by thinking upfront about design and careful testing
Human error -> Typically mitigated by careful design and testing

Scalability #

This is system’s ability to cope with increased load.

Load -> Can be described with the help of numbers we can call as load parameters. The parameters would depend on the system architecture. E.g.:

Web server -> Number of requests per second
Database -> Number of read/write operations per second or read to write ratio
Chat Room -> Simultaneously active users
Cache -> Hit/miss rate

Twitter Example #

Tweet post rate -> 4.6K RPS average, 12K RPS peak Home TL View -> 300k RPS average

Option 1:

Insert each posted tweet to a global collection. When a user views their TL, fetch all their followee’s tweets and merge them. Simple but slow due to all the joins.

Option 2:

Maintain a cache for each user’s TL. On a tweet post, update TL cache for all followers of tweet poster. This approach is fast on read but can become very expensive for people with large following. E.g. a user with 30 mn followers would mean 30mn writes per tweet.

Final Solution used at twitter: Hybrid approach of above 2. For most users, do 2nd option. For users with high number of followers (e.g. celebs), do option 1 (i.e. their tweets get merged into the caches at view time).

Performance #

Batch processing systems (e.g. Hadoop) care more about throughput while online systems care more about response time.

Latency -> duration where request is waiting to be handled

Response time -> Total time spent in serving a request that a client sees (thus, including latency)

Response time is not a single number but a distribution since you’d get slightly different response times for every request. A good metric to use while reporting perf data like response time is percentiles, not averages. This is because averages can hide outliers and also don’t tell about how many users experienced a given response time.

Median (P50), P95, P99 etc are good metrics. High percentiles of response times are also called tail latencies.

Note: Amazon describes response time requirements for internal services in terms of P99.9, even though it affects only 1 in 1000 requests. Because often those are the users with most data because they’ve made most purchases.

Amazon observes that a 100ms increase in response time for a request reduces sales by 1%. Others report that a 1 second slow down reduces customer satisfaction by 16%.

Note that trying to satisfy too high tail latencies is also not useful so there should be a balance of ROI.

Queueing delays are also known as head of line blocking.

Load Test -> Should keep generating and sending requests without waiting for previous response to finish, to simulate the behavior of a real system where there could be queueing delays.

Tail latency amplification -> When a user request results in multiple backend calls, even 1 slow call slows down the whole response & there’s a high chance of several users experiencing this.

Approaches for coping with load:

Vertical scaling -> Increase the resources of a single machine
Horizontal scaling -> Distribute load across multiple smaller machines. Also called shared-nothing architecture.

Good architectures employ a pragmatic mix of both. Elasticity, or automatic scaling, is useful for unpredictable loads.

Maintainability #

3 Design Principles:

Operability -> Making life easy for operations
Simplicity -> Managing complexity
Evolvability -> Making change easy

Accidental Complexity -> Complexity is accidental if it is not inherent in the problem that the software solves, but arises only from the implementation. Abstractions can help reduce accidental complexity. High level programming languages are abstractions too.