12 more things I learned from Designing Data-Intensive Applications by Martin Kleppmann

This was a mind-blowing tour of how the "big guys" have to think about and manage data at the scale of Twitter, Amazon and Google. In the near future, more people will need to contend with this amount of data.

1. Lots of ways to do database replication in distributed systems

This is good to keep in mind even in the early stages of a project that might get very large. The author points out the advantages of row-based replication over statement-based replication. When you replicate rows, you are simply copying the data over. When you replicate a statement, and that statement uses a time or date function, a stored procedure, or anything else that depends on the machine it runs on, each replica can end up with different results.
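
A minimal sketch of why this matters, assuming a toy in-memory table. The helpers `run_statement` and `copy_row` and the clock values are hypothetical, made up for illustration, and are not from the book or any real database API:

```python
# Toy illustration: statement-based vs row-based replication.
# run_statement and copy_row are hypothetical helpers, not a real database API.

def run_statement(table, local_now):
    # Stand-in for "UPDATE post SET updated_at = NOW()":
    # NOW() is evaluated on whichever machine executes the statement.
    table["updated_at"] = local_now

def copy_row(table, row):
    # Row-based replication ships the resulting values, not the statement.
    table.update(row)

leader, statement_follower, row_follower = {}, {}, {}

run_statement(leader, local_now=1000.00)              # leader's clock
run_statement(statement_follower, local_now=1000.25)  # follower's clock is slightly off
copy_row(row_follower, dict(leader))                  # copies the value the leader stored

print(statement_follower == leader)  # False: the replicas have diverged
print(row_follower == leader)        # True: the data was copied as-is
```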

2. Database system using leader, follower and write ahead logs

With a write-ahead log, any change such as an insert, update, or delete gets written to a log before it is applied to the database. The change is then applied on the leader and later replayed on the followers (possibly only milliseconds later).
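
Here is a rough sketch of that flow, assuming an in-memory list as the log and a hypothetical `Node` class. A real write-ahead log lives on disk and records much lower-level changes, so treat this only as an illustration of the ordering:

```python
# Sketch: append to the log first, apply on the leader, replay on followers later.
# Node and write are hypothetical names used only for this illustration.

class Node:
    def __init__(self):
        self.data = {}
        self.applied = 0  # how many log entries this node has applied so far

    def replay(self, log):
        # Apply every log entry this node has not seen yet, in log order.
        for op, key, value in log[self.applied:]:
            if op == "delete":
                self.data.pop(key, None)
            else:  # "insert" and "update" both set the value
                self.data[key] = value
            self.applied += 1

log = []
leader, follower = Node(), Node()

def write(op, key, value=None):
    log.append((op, key, value))  # 1. record the change in the log
    leader.replay(log)            # 2. then apply it on the leader

write("insert", "user:1", "alice")
write("update", "user:1", "alicia")

lagging_view = dict(follower.data)  # follower has not caught up yet
follower.replay(log)                # some time later it replays the log

print(lagging_view)                  # {}
print(follower.data == leader.data)  # True
```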

3. Technique for working with leader and follower databases

In the case of something like Twitter, where users contribute values through a website, you would read from the leader to show a user the values they have just changed, and read from the followers to show those changes to everybody else. This does a few things: it significantly reduces the read load on the leader database, and it gives immediate feedback to the person who entered the data. The follower databases can then catch up later, which could be seconds, minutes, or even longer in the case of system failures, but the databases should eventually agree.
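
One way to sketch that routing rule is to send a user's reads to the leader for a short while after they write. The `Router` class and the 60-second window below are assumptions for illustration, not values from the book:

```python
# Sketch: route a user's reads to the leader shortly after they write,
# and to a follower otherwise. Router and the window size are hypothetical.

RECENT_WRITE_WINDOW = 60.0  # seconds; an assumed value, tune for real replication lag

class Router:
    def __init__(self):
        self.last_write_at = {}  # user_id -> time of that user's latest write

    def record_write(self, user_id, now):
        self.last_write_at[user_id] = now

    def choose_replica(self, user_id, now):
        last = self.last_write_at.get(user_id)
        if last is not None and now - last < RECENT_WRITE_WINDOW:
            return "leader"    # the writer must see their own change
        return "follower"      # everyone else can tolerate a little lag

router = Router()
router.record_write("alice", now=100.0)

print(router.choose_replica("alice", now=110.0))  # leader
print(router.choose_replica("bob", now=110.0))    # follower
print(router.choose_replica("alice", now=200.0))  # follower
```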

4. Monotonic reads from distributed databases for data consistency

This is a way to make sure that a user never sees data go backwards in time. The most straightforward way to achieve this is to make sure that every user session always reads from the same database or data center. The example given: somebody connects to one data center and sees a post that exists there, then refreshes the page and connects to a different data center that doesn't have that post yet. They will see the post disappear. So unless there is a failure, you want to pin a user to the same data system.
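
A small sketch of pinning a user to one replica by hashing the user ID. The replica names and the `replica_for` function are made up for illustration, and a real system would also need a fallback when the chosen replica is down:

```python
# Sketch: always send the same user to the same replica by hashing their ID.
# REPLICAS and replica_for are hypothetical names for this illustration.
import hashlib

REPLICAS = ["dc-east", "dc-west", "dc-eu"]

def replica_for(user_id, replicas=REPLICAS):
    # Use a stable hash (not Python's built-in hash(), which is randomized
    # per process for strings) so every web server picks the same replica.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

first_request = replica_for("alice")
after_refresh = replica_for("alice")
print(first_request == after_refresh)  # True
```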

5. Consistent prefix reads

Make sure that things are always seen in the same order, like a tweet thread or a timeline. If you are using a sharded database, where records that were written in order might not live on the same physical machine, readers can see those records out of order. You will need to sort your results so that they are always returned in the same chronological order.
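
As a rough sketch of the sorting step, assume every record carries a sequence number assigned when it was written and that each shard returns its records already sorted. Both are assumptions for this example; the shard contents are invented:

```python
# Sketch: merge per-shard results back into one ordered timeline.
# Assumes each record is (sequence_number, text) and each shard is pre-sorted.
import heapq

shard_a = [(1, "question"), (3, "follow-up question")]
shard_b = [(2, "answer"), (4, "follow-up answer")]

# heapq.merge lazily merges already-sorted inputs into one sorted stream.
timeline = list(heapq.merge(shard_a, shard_b, key=lambda record: record[0]))

for seq, text in timeline:
    print(seq, text)
```

This only helps when the sequence numbers themselves reflect the true order of the writes, which is its own problem in a distributed system.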

6. Write conflicts are challenging with large databases

Group document editing also has the same replication issues, with each browser acting as a data center.

7. Stream analytics

This is kind of the opposite of a traditional database. In a traditional database, the data is stored and the queries are ephemeral. In stream analytics, the queries are stored and they analyze a stream of data as it passes by. When something matches a query, it is picked off and stored or acted on, depending on the application.
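
The inversion can be sketched like this, with the standing queries kept in a dictionary and each event checked against them as it arrives. The query names, event fields, and `make_stream_processor` are hypothetical:

```python
# Sketch: the queries are long-lived, the events are what flows past them.
# make_stream_processor and the example queries are hypothetical.

def make_stream_processor(standing_queries):
    matches = {name: [] for name in standing_queries}

    def on_event(event):
        # Every incoming event is tested against every stored query.
        for name, predicate in standing_queries.items():
            if predicate(event):
                matches[name].append(event)

    return on_event, matches

queries = {
    "large_orders": lambda e: e["type"] == "order" and e["amount"] > 1000,
    "errors": lambda e: e["type"] == "error",
}
on_event, matches = make_stream_processor(queries)

stream = [
    {"type": "order", "amount": 50},
    {"type": "order", "amount": 5000},
    {"type": "error", "amount": 0},
]
for event in stream:
    on_event(event)

print(len(matches["large_orders"]), len(matches["errors"]))  # 1 1
```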

8. Large scale projects have auditability built in

These are ways to check and verify data integrity. For example, constantly running checksums against databases to be sure that the data has not been corrupted. Corruption can come from either software or hardware errors, but this example focuses mostly on hardware failures.
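
A minimal sketch of a checksum audit, assuming each record is stored alongside a SHA-256 checksum computed at write time. The `store` layout and the `audit` function are invented for illustration:

```python
# Sketch: recompute checksums and report records that no longer match.
# The store layout (key -> (payload, recorded checksum)) is an assumption.
import hashlib

def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def audit(store):
    # Return the keys whose current payload does not match its recorded checksum.
    return [key for key, (payload, recorded) in store.items()
            if checksum(payload) != recorded]

store = {
    "row-1": (b"hello", checksum(b"hello")),
    "row-2": (b"world", checksum(b"world")),
}

# Simulate silent corruption: the payload changes but the recorded checksum does not.
store["row-2"] = (b"w0rld", store["row-2"][1])

print(audit(store))  # ['row-2']
```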

9. Event based design instead of transactions

ACID and other technologies are based on assumptions that the database will always store what you tell it to. Events are easier to keep track of than individual insert statements and they can be more resilient to failure.
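
To make the idea concrete, here is a small sketch where the current state is never stored directly and is instead rebuilt by replaying a log of events. The event names and the account example are assumptions, not taken from the book:

```python
# Sketch: keep an append-only log of events and derive state by replaying it.
# Event kinds ("deposited", "withdrawn") and account IDs are hypothetical.

def apply_event(balances, event):
    kind, account, amount = event
    current = balances.get(account, 0)
    if kind == "deposited":
        balances[account] = current + amount
    elif kind == "withdrawn":
        balances[account] = current - amount
    return balances

def rebuild(events):
    # Replaying the same events in the same order always gives the same state.
    balances = {}
    for event in events:
        apply_event(balances, event)
    return balances

event_log = [
    ("deposited", "acct-1", 100),
    ("withdrawn", "acct-1", 30),
    ("deposited", "acct-2", 5),
]

state = rebuild(event_log)
print(state)  # {'acct-1': 70, 'acct-2': 5}
```

If the derived state is ever lost or suspected to be wrong, it can be thrown away and rebuilt from the log, which is part of what makes this approach resilient.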

10. Predictive decision making is a risk to society

He describes this as an "algorithmic prison", where people are hurt by a biased system. There is bias built into society, and any algorithmic way of making predictions can amplify it. This is similar to what ANT is doing in using AI to guarantee loans for people. It will be biased against certain geographic areas, etc.

11. Thought experiment: replace uses of the word “data” with “surveillance” in marketing material for your own company or a company that you are interested in.

"We use customer data to improve our products" becomes "We use customer surveillance to improve our products".

12. He compared data storage to the beginnings of the industrial economy: we are now in the beginning stages of the data economy

We need to monitor and build our data stewardship with a mindset that we are entering a new era. We are going to be storing all data forever from here forward and will need ways to scale it.

Chris407x