
10 things I learned from Designing Data Intensive Applications by Martin Kleppmann


    1. To build in fault tolerance you need to design in failures

    You should be able to randomly shut parts of the system down and have it recover.
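
    A minimal sketch of that idea, assuming a made-up worker/supervisor setup (nothing from the book itself): a few worker processes get killed at random, and a supervisor loop notices and restarts whatever died.

```python
# A minimal sketch of the "randomly shut parts down" idea (chaos-style fault
# injection). The worker/supervisor names are made up for illustration; the
# only point is that the system keeps running when a piece is killed.
import multiprocessing as mp
import random
import time

def worker(name: str) -> None:
    while True:
        time.sleep(0.5)          # stand-in for real work

def start(name: str) -> mp.Process:
    p = mp.Process(target=worker, args=(name,), daemon=True)
    p.start()
    return p

if __name__ == "__main__":
    procs = {f"worker-{i}": start(f"worker-{i}") for i in range(4)}
    for _ in range(10):
        time.sleep(1)
        victim = random.choice(list(procs))
        procs[victim].terminate()               # inject a failure
        procs[victim].join()
        print(f"killed {victim}")
        for name, p in procs.items():           # supervisor: restart the dead
            if not p.is_alive():
                procs[name] = start(name)
                print(f"restarted {name}")
```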

    2. Focus on fixing slow requests: they impact user experience

    This comes up in a discussion of Twitter and its infrastructure approach. His point is that if you have metrics on response times, you should not look at the average (the mean). You should use a percentile like p95, which means 95% of requests are faster than that number and 5% are slower, and then focus on that slow 5%. He also points out that if a page fans out 20 parallel requests to backend servers, the response comes back at the pace of the single slowest request.
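
    A rough sketch of the point, with invented latency numbers rather than anything from the book: the mean looks fine, p95 looks much worse, and a 20-way fan-out makes the user wait for the slowest of the 20.

```python
# Invented numbers: most requests are fast, one in ten is very slow.
import random

def percentile(samples, p):
    s = sorted(samples)
    idx = min(len(s) - 1, int(round(p / 100 * (len(s) - 1))))
    return s[idx]

latencies = [random.gauss(50, 10) for _ in range(900)] + \
            [random.gauss(500, 100) for _ in range(100)]

mean = sum(latencies) / len(latencies)
p95 = percentile(latencies, 95)
print(f"mean = {mean:.0f} ms, p95 = {p95:.0f} ms")   # mean hides the slow tail

# Fan out 20 parallel backend calls: the user waits for the slowest one.
fanout = [max(random.choice(latencies) for _ in range(20)) for _ in range(1000)]
print(f"typical 20-way fan-out response = {sum(fanout) / len(fanout):.0f} ms")
```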

    3. Never heard graph database concepts referred to geometrically: talking about vertices and edges
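
    A tiny sketch of that vocabulary, with made-up vertices and edges stored in a plain adjacency dict, just to show what the geometric terms refer to.

```python
# Made-up example data: each vertex maps to its outgoing (label, target) edges.
edges = {
    "Alice": [("knows", "Bob"), ("lives_in", "London")],
    "Bob": [("knows", "Carol")],
    "London": [("in_country", "UK")],
}

def neighbours(vertex):
    """Follow every outgoing edge from a vertex."""
    return edges.get(vertex, [])

# Walk two hops out from Alice.
for label, v in neighbours("Alice"):
    print("Alice", label, v)
    for label2, w in neighbours(v):
        print("  ", v, label2, w)
```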

    4. It is hard to write tech books for a general audience

    This author is doing a commendable job, but in the same chapter he describes how browsers render CSS and then gets into an overview of how graph databases work. If you have any idea how graph databases work, or even know that graph databases exist, writing CSS is ridiculously simple. If you don't know how CSS works yet and you are reading this book, you should put it down and start with a "web for dummies" book first.

    5. Different databases can be used for different use cases

    OK, this is a fairly obvious statement. However, I have worked with relational databases for so long, and they seem to be such a perfect analogy to how the real world is structured, that I have not come across a problem that can't be solved by a relational database. Then again, I just don't have those kinds of problems. It would be interesting to learn more about graph databases because the concepts seem interesting.

    6. NoSQL and schemaless databases have uses

    Obviously people use these right now. I have worked with some APIs that use them and they seemed kind of confusing, but maybe there are some good use cases for them.
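
    A minimal sketch of what "schemaless" means in practice, using plain JSON documents rather than any particular database: two records in the same collection don't have to share the same fields, and adding a field to one record doesn't require changing anything else.

```python
# Made-up documents; a real document store (MongoDB, CouchDB, ...) holds the
# same kind of free-form records, just with indexes and queries on top.
import json

people = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "phones": ["+44 20 7946 0000"], "employer": "Acme"},
]

# No ALTER TABLE needed to add a field to one record only.
people[0]["twitter"] = "@alice"

print(json.dumps(people, indent=2))
```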

    7. For extremely large data projects the disk system needs to be taken into consideration for indexing

    For example, B-tree indexing is very popular, but some disk space is left unused because pages are only partially filled. Log-structured (LSM-tree) indexing can make better use of disk space, but there is a trade-off, since the data has to be sorted and compacted as it is stored. There is also a difference between SSDs and magnetic, spinning hard drives in how they read and write data. This is the kind of thing that Google and Twitter need to worry about.
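
    A toy sketch of the log-structured idea as I understand it: recent writes go into an in-memory table, which is flushed into sorted, immutable segments, and reads check the newest data first. Real engines add background compaction, bloom filters, and much more.

```python
# Toy LSM-style store: an in-memory table plus sorted, immutable segments.
memtable = {}          # recent writes, kept in memory
segments = []          # older data, each a sorted list of (key, value)

def put(key, value):
    memtable[key] = value
    if len(memtable) >= 3:                     # tiny flush threshold for demo
        segments.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for seg in reversed(segments):             # newest segment wins
        for k, v in seg:
            if k == key:
                return v
    return None

for i in range(7):
    put(f"user:{i}", f"name-{i}")
print(get("user:2"), get("user:6"), get("user:99"))
```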

    8. I wonder whether the voice-over actor had any idea what he was talking about

    He was very easy to understand and sounded confident.

    9. Some data lakes use column-based queries

    In the SQL world, I am used to row-based queries.
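
    A sketch of the difference, with invented rows: when the table is stored column by column, summing one field only has to touch that one column's values instead of reading every whole row.

```python
# Invented sales rows, stored two ways.
rows = [
    {"date": "2024-01-01", "store": "A", "product": "widget", "qty": 3},
    {"date": "2024-01-01", "store": "B", "product": "gadget", "qty": 5},
    {"date": "2024-01-02", "store": "A", "product": "widget", "qty": 2},
]

# Row-oriented: every whole row is read just to sum one field.
total_row = sum(r["qty"] for r in rows)

# Column-oriented: the same table stored as one array per column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
total_col = sum(columns["qty"])                 # only the qty column is touched

print(total_row, total_col)
```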

    10. Hypercube data storage

    This is used in some data lakes where you want to store information along multiple axes at a time. For example, you want to know which product sold on which date in which store. That would be a three-dimensional hypercube, with each cell holding the pre-calculated information for that store, that product, and that day. It is a way to de-normalize data and still have it accessible at high speed.
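
    A toy sketch of that cube idea, with invented sales data: totals are pre-aggregated per (store, product, date) cell, so a lookup reads one cell instead of scanning the raw records.

```python
# Invented raw sales records: (store, product, date, quantity).
from collections import defaultdict

sales = [
    ("store-A", "widget", "2024-01-01", 3),
    ("store-A", "widget", "2024-01-01", 2),
    ("store-B", "gadget", "2024-01-02", 5),
]

# Each cell of the cube holds a pre-calculated total for one (store, product, date).
cube = defaultdict(int)
for store, product, date, qty in sales:
    cube[(store, product, date)] += qty

print(cube[("store-A", "widget", "2024-01-01")])   # 5, no scan needed
```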
