Review: Designing Data Intensive Applications

03 May 2022

Designing Data Intensive Applications book cover

In this post we will review the book Designing Data Intensive Applications by Martin Kleppmann.

If I had to summarize this book in a sentence I’d say it discusses several topics in databases and distributed systems with industry applications in mind. This is in contrast to textbook treatments of these topics, which tend to be a lot more theoretical and academic.

This book covers a broad range of topics, which makes it challenging to summarize in enough detail. Instead, I’ll go over the major themes and comment on specific things I learned and found interesting.

Organization of the book

The book is divided into 3 parts comprising a total of 12 chapters. In part I the author covers database topics, in part II he discusses distributed systems, and in part III batch and stream processing in the context of data pipelines.

Selected Notes

Here I’m going to go over each chapter, providing a brief summary and then a list of bullet points with my random notes.

Chapter 1 - Reliable, Scalable and Maintainable Applications

This chapter discusses three of the most important non-functional aspects of data applications: reliability, scalability and maintainability.

Notes:

Chapter 2 - Data Models and Query Languages

This chapter describes a few data models: the relational, the document (key-value, NoSQL) and the graph model. It also describes different ways to access the data: imperative code, declarative languages like SQL, and Domain Specific Languages (DSLs) for graph models.
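To make the imperative vs. declarative distinction concrete, here is a minimal sketch in Python using the standard library’s sqlite3 module (the table and data are made up for illustration, not taken from the book):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)",
                 [("gorilla", "Hominidae"), ("chimp", "Hominidae"),
                  ("lion", "Felidae")])

# Imperative: fetch everything and spell out *how* to filter, step by step.
apes = []
for name, family in conn.execute("SELECT name, family FROM animals"):
    if family == "Hominidae":
        apes.append(name)

# Declarative (SQL): state *what* we want and let the query optimizer
# decide how to execute it.
rows = conn.execute(
    "SELECT name FROM animals WHERE family = 'Hominidae'").fetchall()
```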

Notes:

Chapter 3 - Storage and Retrieval

This chapter describes the data structures used to implement database indexes, such as Hash Indexes, B-Trees and LSM-Trees. It also presents the differences between OLTP (Online Transaction Processing) and OLAP (Online Analytic Processing) databases.
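The hash index idea is simple enough to sketch: append every write to a log file and keep an in-memory map from each key to the byte offset of its latest record. The book illustrates this with two tiny bash functions; the Python version below is my own sketch and assumes keys contain no commas and values no newlines:

```python
class LogKV:
    """Append-only log with an in-memory hash index (a sketch)."""

    def __init__(self, path="kv.log"):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for it
        open(self.path, "ab").close()  # make sure the log file exists

    def set(self, key: str, value: str) -> None:
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()  # new record starts at the current end
            f.write(f"{key},{value}\n".encode("utf-8"))

    def get(self, key: str):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])
            _, _, value = f.readline().decode("utf-8").rstrip("\n").partition(",")
            return value
```

Writes are a single append, which is why this scheme is fast; the cost is that the log grows forever, which is where the chapter’s discussion of compaction, SSTables and LSM-Trees picks up.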

Notes:

Chapter 4 - Encoding and Evolution

This chapter describes ways to encode data that support evolution (i.e. the schema can change in backward-compatible ways). Examples include Protocol Buffers and Thrift. It also touches upon inter-node communication concepts such as REST and RPC.
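The essence of evolvable encodings like Protocol Buffers and Thrift is that readers ignore fields they don’t know about (forward compatibility) and apply defaults for fields the writer didn’t have yet (backward compatibility). A rough Python analogue of such a reader (the record layout here is hypothetical):

```python
def decode_user(record: dict) -> dict:
    # Fields we don't recognize are silently ignored (forward compatibility);
    # fields added after the writer's schema get defaults (backward
    # compatibility), roughly the role of optional fields with stable tag
    # numbers in Protocol Buffers and Thrift.
    return {
        "id": record["id"],                        # present since the first schema
        "name": record.get("name", ""),            # added later: default if missing
        "interests": record.get("interests", []),  # newer still
    }

print(decode_user({"id": 1}))  # written by an old writer
print(decode_user({"id": 2, "name": "Ada", "favorite_color": "teal"}))  # newer writer
```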

Chapter 5 - Replication

Replication means duplicating data across multiple machines for increased throughput and fault tolerance. This chapter discusses single-leader replication (master-slave), multi-leader replication and leaderless replication. It also delves into the issue of replication lag.
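For leaderless replication, the key condition is that the write quorum w and the read quorum r overlap: w + r > n for n replicas. Below is a toy simulation of that overlap; it ignores concurrent writes and read repair, which the chapter also covers:

```python
N, W, R = 3, 2, 2  # W + R > N: read and write sets must intersect

replicas = [dict() for _ in range(N)]  # each replica: key -> (version, value)

def write(key, value, version):
    # A real system sends the write to all N replicas and waits for W acks;
    # here we pretend the last N - W replicas missed the update.
    for replica in replicas[:W]:
        replica[key] = (version, value)

def read(key):
    # Query R replicas and keep the value with the highest version. Because
    # W + R > N, at least one queried replica saw the latest write.
    responses = [r[key] for r in replicas[-R:] if key in r]
    return max(responses)[1] if responses else None

write("color", "blue", version=1)
write("color", "green", version=2)
print(read("color"))  # "green": the overlapping replica has the newest version
```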

Notes:

Chapter 6 - Partitioning

Partitioning means splitting the data across multiple machines for scalability and potentially increased throughput. This chapter discusses different partitioning strategies, re-balancing of partitions and routing of requests to the right partitions.
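A minimal sketch of hash partitioning in Python (the function is my own; the chapter discusses this strategy alongside key-range partitioning):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Hashing spreads keys evenly across partitions, at the cost of losing
    # the key ordering that range partitioning preserves (no efficient
    # range scans).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

print(partition_for("user:42", 8))
```

Note that the naive hash(key) mod N scheme makes rebalancing expensive: changing N remaps most keys, which is why the chapter favors strategies like a fixed number of partitions that get moved between nodes.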

Notes:

Chapter 7 - Transactions

Transactions allow multiple operations to be performed atomically, that is, either all operations succeed or none of them does. This chapter discusses transactions in a single machine and considers the different guarantees and tradeoffs that we can achieve. It delves into weak isolation and serializability.
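Atomicity is easy to demonstrate on a single machine. Here is a small sketch with Python’s built-in sqlite3 where a transfer either fully applies or fully rolls back (the table and the simulated crash are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 0)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        raise RuntimeError("simulated crash between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
except RuntimeError:
    pass

# Neither update took effect: the partial transfer was rolled back.
print(conn.execute("SELECT id, balance FROM accounts").fetchall())  # [(1, 500), (2, 0)]
```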

Notes:

Chapter 8 - The Trouble with Distributed Systems

In this chapter the author discusses transactions in the context of distributed systems, where many more things can go wrong: partial failures, unreliable networks and unreliable clocks. In addition, distributed algorithms and protocols are based on models, which make assumptions about what kinds of failures can and cannot happen, and those assumptions might not correspond 100% to reality.

Notes:

Chapter 9 - Consistency and Consensus

This chapter describes different consistency models: eventual consistency and strong consistency (linearizability). It also talks about ordering guarantees, distributed transactions and consensus. The author then makes a connection between ordering, linearizability and consensus.
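One ordering tool from this chapter is the Lamport timestamp: a (counter, node id) pair that yields a total order consistent with causality. A minimal sketch:

```python
class LamportClock:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # Every local event increments the counter.
        self.counter += 1
        return (self.counter, self.node_id)

    def on_receive(self, remote_counter: int):
        # On receiving a message, jump past the sender's counter so the
        # causally later event always gets the larger timestamp.
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportClock(node_id=1), LamportClock(node_id=2)
t1 = a.local_event()      # (1, 1)
t2 = b.on_receive(t1[0])  # (2, 2), ordered after t1
assert t1 < t2            # tuple comparison gives the total order
```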

Notes:

Although CAP has been historically influential, it has little practical value for designing systems.

Chapter 10 - Batch Processing

In this chapter the author discusses MapReduce and draws an analogy with the Unix philosophy: do one thing and do it well, and compose simple programs via pipes.
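The same shape can be written out in a few lines of Python, with explicit map, shuffle (group by key) and reduce phases. The word count below is the classic MapReduce example, not one taken verbatim from the book:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: bring all values for the same key together (done by sorting
# in both MapReduce and the Unix pipeline analogy).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: collapse each key's values into a single result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```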

Notes:

Chapter 11 - Stream Processing

In contrast to batch processing, stream processing treats the input as unbounded: data arrives gradually over time. We can reduce stream processing to batch processing by accumulating the data for some amount of time (say, a day), but this adds a delay.
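That "accumulate for an amount of time" idea corresponds to windowing. Here is a toy tumbling-window count in Python (the event format and window size are made up; a real stream processor also has to decide how long to wait for late-arriving events):

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds=60):
    # Count events per fixed, non-overlapping (tumbling) window.
    # events: iterable of (timestamp_in_seconds, key) pairs.
    windows = {}
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return windows

events = [(3, "click"), (10, "click"), (65, "view"), (70, "click")]
print(tumbling_window_counts(events))
# {0: Counter({'click': 2}), 60: Counter({'view': 1, 'click': 1})}
```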

Notes:

Chapter 12 - The Future of Data Systems

In this chapter the author shares a more personal take. From my understanding, he advocates for a big architectural shift, along the lines of storing unstructured raw logs and having all other structured data be derived from them via batch and stream processing. This would avoid many complicated issues around ordering (the log defines a total order) and consistency (one source of truth).
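A tiny illustration of that idea: if the log defines a total order of events, any consumer that replays it deterministically arrives at the same state, so caches, indexes and other derived views can all be rebuilt from one source of truth (the event schema below is hypothetical):

```python
def derive_state(log):
    # Fold the ordered event log into a key-value view.
    state = {}
    for event in log:
        if event["op"] == "set":
            state[event["key"]] = event["value"]
        elif event["op"] == "del":
            state.pop(event["key"], None)
    return state

log = [
    {"op": "set", "key": "x", "value": 1},
    {"op": "set", "key": "y", "value": 2},
    {"op": "del", "key": "x"},
]
print(derive_state(log))  # {'y': 2}
```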

The author also touches on data privacy and the ethics of dealing with real-world data.

Conclusion

Overall I learned a lot from this book. The author makes difficult topics digestible via examples and diagrams. I found Chapters 7, 8 and 9 the most difficult.

I recently started working with stream processing and noticed Chapter 11 is not very comprehensive, though there’s only so much detail one can cover in a general distributed systems book. I already have Streaming Systems by Akidau, Chernyak and Lax lined up.

Many of the posts I wrote under distributed systems have been mentioned in the book: