System Design: Reliable, Scalable, and Maintainable Applications
Modern applications have shifted from being compute-intensive to data-intensive, a transition spearheaded by the AI revolution. Most such applications need to:
- Store data so that they, or another application, can find it again later (databases)
- Remember the result of an expensive operation, to speed up reads (caches)
- Allow users to search data by keyword or filter it in various ways (search indexes)
- Send a message to another process, to be handled asynchronously (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
This implies that building modern software often means combining readily available tools that abstract away much of the technical detail; for example, no one typically builds a search engine from scratch, but instead uses what is already available.
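As an illustration, here is a minimal sketch of composing two such tools: a cache sitting in front of a database, using the cache-aside read pattern. The names are hypothetical, an in-memory dict stands in for a real cache (such as Redis), and `query_user_from_db` is a placeholder for an expensive database read.

```python
import time

# Hypothetical stand-ins: a dict plays the role of a cache (e.g. Redis),
# and a function plays the role of a slow database query.
cache = {}
CACHE_TTL_SECONDS = 60

def query_user_from_db(user_id):
    """Placeholder for an expensive database read."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside read: serve from the cache if the entry is fresh,
    otherwise fall back to the database and remember the result."""
    entry = cache.get(user_id)
    if entry is not None:
        cached_at, user = entry
        if time.time() - cached_at < CACHE_TTL_SECONDS:
            return user                       # cache hit
    user = query_user_from_db(user_id)        # cache miss: hit the database
    cache[user_id] = (time.time(), user)      # populate the cache for later reads
    return user

print(get_user(42))  # first call goes to the database
print(get_user(42))  # second call is served from the cache
```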
As such, building data-intensive applications largely comes down to choosing the most appropriate tools for the job, while keeping reliability, scalability, and maintainability in mind.
Reliability
To understand reliability, it helps to distinguish faults from failures. A fault means one component of the system deviates from its spec, whereas a failure means the whole system stops providing the required service to the user. Since it is impossible to reduce the probability of a fault to zero, it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
Faults can result from hardware problems (e.g., disk crashes, power outages), software errors (e.g., a bug that hogs resources and makes the system unresponsive), and human error (e.g., design or configuration choices that spell doom later on).
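For illustration, below is a minimal sketch of one common fault-tolerance mechanism: retrying an operation that raised a transient fault, with exponential backoff and jitter, so a single fault does not immediately surface as a user-visible failure. The `TransientFault` exception and `call_with_retries` helper are assumptions invented for the example.

```python
import random
import time

class TransientFault(Exception):
    """Stand-in for a recoverable fault, e.g. a timed-out network call."""

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Retry a flaky operation with exponential backoff and jitter,
    escalating to the caller only after repeated faults."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                raise                                  # out of retries: the fault escalates
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)                          # back off before trying again

# Demo: a simulated operation that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientFault("simulated timeout")
    return "ok"

print(call_with_retries(flaky))  # prints "ok" on the third attempt
```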
Scalability
Scalability is the system's ability to stay reliable as the load on it grows. Load can be described by parameters such as requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, or the hit rate on a cache. Performance under that load is usually described with response-time percentiles; for example, Amazon optimizes for the 99.9th percentile, meaning only 1 in 1,000 requests is allowed to be slower than the target (say, 200 ms).
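To make the percentile idea concrete, here is a small sketch that computes response-time percentiles from a list of measured latencies using the nearest-rank method; the sample numbers are made up.

```python
def percentile(latencies_ms, p):
    """Return the p-th percentile (0-100) of a list of response times,
    using the nearest-rank method."""
    ranked = sorted(latencies_ms)
    index = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[index]

# Hypothetical measurements from a load test, in milliseconds.
samples = [12, 15, 18, 22, 25, 31, 40, 55, 90, 480]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(samples, p)} ms")   # the tail dominates p95/p99
```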
A system can be scaled vertically (adding resources to make a single, more powerful server) or horizontally (adding more, smaller servers to form a distributed system). The choice depends on the system itself, cost, and technical complexity: Google can afford a hybrid of vertical and horizontal scaling, but a new startup may not.
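As a rough illustration of the horizontal approach, the sketch below spreads requests across a pool of hypothetical servers by hashing a session id. Real deployments would sit behind a load balancer or use a proper partitioning scheme, but the principle of distributing load across many small machines is the same.

```python
import hashlib

# Hypothetical pool of smaller servers in a horizontally scaled deployment.
servers = ["app-1:8080", "app-2:8080", "app-3:8080"]

def route(session_id):
    """Pick a server by hashing the session id, so the same session keeps
    landing on the same machine while load spreads across the pool."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

for sid in ("alice", "bob", "carol"):
    print(sid, "->", route(sid))
```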
Maintainability
The majority of the cost of software is not in its initial development, but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features. Three design principles that help minimize pain during maintenance are:
- Operability: Make it easy to keep the system running smoothly.
- Simplicity: Make it easy to understand the system.
- Evolvability: Make it easy to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.
Summary
There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, certain patterns and techniques keep reappearing across different kinds of applications, and they form the building blocks of data-intensive systems that achieve all three.
Reference
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppmann