Building Secure and Reliable Systems

Google published a book this year about site reliability and security engineering. I would like to give you a brief overview of it, mixed with my own analysis and thoughts on the subject, and save you the time of reading at least part of it.

Take a few of your customers and ask them: what are the top 5 features of my product that you like? The answers you are likely to get are "I really like how polished the UI is", "the daily report I get by mail is just fantastic", "since I started using your product I save an hour a day and my productivity went up", or "the share/chat button you recently added to documents is doing a great job".

Your customers are very unlikely to answer that question with "I really like its security" or "I really like that we have not lost a single chat message since I started using it". No real customer will even think of it. Moreover, assuming you did a very good job, they won't notice that these features exist at all. There is a big lesson here: some features matter more to customers than the ones they name, yet customers are unaware of them, and the better you implement them, the less they are noticed.

This of course assumes that the customer is the real end customer and not, let's say, the on-call engineer. Still, the common view in many organizations is that the product team sets the feature list and the engineers implement it. The way to resolve this is to extend the list of customers. You should not only interview the end user, you should also interview the on-call engineer as a customer, since they are the best person to tell you what is important to them. Then you should put together a feature list that includes solutions for all customers. This is not easy to do. It requires a whole culture and set of standards in the organization, but taking this route is crucial to the success of your business.

The secret, therefore, is to start with the list of customers you have, and they are not only the end users. In a healthy organization with a healthy product, the list of customers also includes the internal maintainers of your product: developers, support, marketing, and of course on-call.

As the Google book claims, security and reliability should be top priorities for your organization. This opens up a complexity of its own, because while security and reliability share many properties, they also have distinct properties and design considerations. The door, or the login, to your servers can be very well secured, but an overly secured login would block the engineer who needs to get in and fix something broken. Buildings have escape routes in case of fire, but each one is also another door into the building for attackers, so again we see that security and reliability design considerations often collide. Another example: from a security perspective you want as few people as possible to be able to access the logs of your servers, but from a reliability perspective you want more, because the more people who can look at a problem, the greater the chances of solving issues and learning how the system works.

Bringing the end customer into the picture could help justify the effort required. A customer to whom we make the security and reliability highlights of the system visible will be much happier, even if they did not ask for it. If you used Zoom during COVID-19 and you are not a technical user, you would be much happier to see on the front page that the organization invests in the security and reliability of your video chats. This gives us a kind of solution: we include the customer in our effort. We explicitly tell the customer that their voice calls were not lost or leaked to the outside world and that not everything is stored on our servers, because we proactively took these measures. When a user logs in to our website, we remind them that their mobile phone is a backup, and so forth. We keep users informed and educate them about the importance of the security and reliability of our system.

Now imagine that you have an application, let's say a stock trading website, and you have an issue that your developers and on-call need to fix. What do you prefer: simple code or complex code, a simple system with clear boundaries or a complex system with poorly defined boundaries? The simple one, of course. This does not mean that simple is always better, but when there are a few choices, we should keep in mind how important simplicity is. During a crisis you want the system to be clear and simple. You want to be able to run tests quickly, navigate the code quickly, and understand where the issue is. This is a choice we can make explicitly when designing the system and deciding where to invest time.

The market moves fast and features have to be delivered, which can lead architects and developers to invest less in reducing technical debt. But think of it this way: this is real debt, so take it on only if you explicitly want to. Also remember that complexity can accumulate inadvertently, so it deserves explicit thought. One example is dedicating a whole team member each sprint to cutting technical debt. This way there is no question about it: you always dedicate time to it, continuously. This is only one method of doing so.

A small change in 2018 caused YouTube to be down for one hour. It was the removal of 2 innocent-looking lines of code. This only stresses that technical debt often accumulates silently, and only when you reach the tipping point do you see how it manifests itself, in this example as downtime. In many cases this makes it much harder to allocate time for it, because you don't see the actual effect until you reach the tipping point.

Defensive approaches, such as limiting the blast radius when designing and coding features, can help. In this approach you ask yourself: if a total failure happens here, how can I ensure that it does not take the whole system down with it? Or even better, how can I ensure the blast radius is as small as possible while keeping the system simple and easy to maintain? Some of this can be done by limiting access to different projects, so that a person cannot access projects they are not supposed to. This reduces the blast radius of that developer and makes each project a distinct failure domain.

We usually tend to think of attackers as coming from outside the organization, but they can also come from inside, either intentionally or unintentionally. In fact, it is best not to distinguish between the two. Giving everyone, insider or outsider, only the access they need for the task at hand is called the principle of least privilege.
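
To make the idea concrete, here is a minimal sketch of a deny-by-default access check. The role names, project names, and the `can_access` helper are hypothetical and exist only for illustration. A real system would delegate this to its identity and access management layer.

```python
# Hypothetical mapping from a role to the (project, action) pairs it may use.
# Anything not listed here is denied by default.
GRANTS = {
    "billing-dev": {("billing", "read"), ("billing", "write")},
    "oncall": {("billing", "read"), ("chat", "read")},
}


def can_access(role: str, project: str, action: str) -> bool:
    """Return True only if the role was explicitly granted this action."""
    return (project, action) in GRANTS.get(role, set())
```

With these assumed grants, the on-call role can read the billing project but cannot write to it, and an unknown role gets nothing at all.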

Multi-party authorization is another tool. It ensures that sensitive operations are performed only when multiple parties authorize them, just like multiple people needing to hit the red button before an atom bomb can be launched. This protects both against malicious insiders and against inadvertent operations and mistakes.
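
A rough sketch of how such a gate could look in code follows. The `SensitiveOperation` class, its fields, and the default of two approvals are assumptions made for this example, not something taken from the book.

```python
from dataclasses import dataclass, field


@dataclass
class SensitiveOperation:
    """A hypothetical request that needs sign-off from several distinct people."""
    name: str
    requester: str
    required_approvals: int = 2  # assumed threshold, pick what fits your risk
    approvers: set = field(default_factory=set)

    def approve(self, user: str) -> None:
        # The requester may not approve their own request.
        if user == self.requester:
            raise PermissionError("requester cannot approve their own operation")
        # A set ignores repeated approvals from the same user.
        self.approvers.add(user)

    def is_authorized(self) -> bool:
        return len(self.approvers) >= self.required_approvals
```

In this sketch, one person approving twice still counts as a single approval, so the operation stays blocked until a second, different person signs off.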

When deploying a system, a process should make sure it behaves well both before and after the deployment, because almost every reliability and security problem happens after a change to the system. Even a solid design needs the appropriate tuning once it is translated to code in order to be resilient in reality. It is not enough for the architecture to be resilient. Slow rollouts help by checking the feature on a small set of users and less risky consumers first, and then propagating it further and further. Even if, as is often said, we deploy 100 times a day, it is not wise to deploy to everyone at once. A slowed-down deployment is also useful because the feature can go through testing in stages.
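
One common way to implement a slow rollout is to hash each user into a stable bucket and enable the feature only for buckets below the current percentage. This is a sketch under my own assumptions: the function names and the 100-bucket scheme are illustrative and not taken from the book.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Enable the feature for roughly `percent` percent of users."""
    return rollout_bucket(user_id, feature) < percent
```

Because the bucket depends only on the user and the feature name, raising `percent` from 1 to 10 to 100 keeps the early users enabled and only adds new ones, which is what lets the feature go through its stages.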

Despite all our efforts, errors will happen, and we cannot be totally resilient or secure. When an error happens we want to have logs and monitoring. For logs specifically, we want more of them, and we want to be able to search them quickly and effectively. However, at a certain scale logs become too expensive. You cannot log everything, and you don't really want to, because it does not scale. It can also impose security risks if you log sensitive information. Logging itself can even cause reliability problems: if you log too much and the system cannot keep up with the pace of the logs, you either lose logs, make the machine slower, or stop it from working altogether. So a balance is best here, and the answer is proper logging. Log each request and each response well enough that you can take that request and replay it in your debugger (IntelliJ, for example), and put much of the remaining burden on monitoring, graphs, and tables.
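
Two of these trade-offs, not logging sensitive values and not logging every routine event, can be sketched as follows. The field names in `SENSITIVE_KEYS`, the `level` key, and both helper functions are hypothetical, and a real setup would plug something like this into its logging library.

```python
import random

# Hypothetical names of fields that must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "card_number"}


def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive values masked."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in event.items()}


def should_log(event: dict, sample_rate: float, rng=random.random) -> bool:
    """Always keep errors; keep only a fraction of routine events."""
    if event.get("level") == "error":
        return True
    return rng() < sample_rate
```

Sampling routine events while always keeping errors is one way to hold the log volume down without losing the entries you need during an incident.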

In 2014 an attacker put the service Code Spaces out of business within hours by taking over its administrator tools and deleting all its data, including the backups. This emphasizes how little time we have to respond to issues. The team and its developers must be able to work fast in a crisis, and this will not happen unless we prepare for it. It is best to have a premade plan: think more beforehand and less afterwards. In such a crisis you are operating under stress, so do as much of the complex work as possible beforehand and leave only the simple steps for when the crisis happens. Checklists are the root of problem solving in such a crisis. There is actually a standard for this called the Incident Command System (ICS), a standardized approach to the command, control, and coordination of emergency response.

Google has an internal program called DiRT (Disaster Recovery Testing), which regularly simulates various internal system failures and forces teams to cope with these types of scenarios. Hard in training, easy in battle.

When you need to patch or fix code in your systems, you want to have a fast lane and a fast process for it. Even if your continuous deployment normally takes days to reach production, you want to ensure that in these cases you have a fast lane and can close vulnerabilities quickly. However, to be able to do that you need great visibility into the state of the system, so you know that everything is OK once you have applied the patch. Otherwise it could well be that things got worse. The book discusses approaches for this, mainly around proper visibility.
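
As a very small illustration of what visibility buys you, a fast-lane patch can be gated on a comparison of the error rate before and after it was applied. This is my own simplified sketch, not an approach taken from the book. The function names and the tolerance value are assumptions, and the error rates are expected to come from whatever monitoring you already have.

```python
def patch_is_healthy(baseline_error_rate: float,
                     patched_error_rate: float,
                     tolerance: float = 0.01) -> bool:
    """Treat the patch as healthy if errors did not rise beyond the tolerance."""
    return patched_error_rate <= baseline_error_rate + tolerance


def decide(baseline_error_rate: float, patched_error_rate: float) -> str:
    """Return "keep" or "rollback" for the freshly applied patch."""
    if patch_is_healthy(baseline_error_rate, patched_error_rate):
        return "keep"
    return "rollback"
```

Without a trustworthy error-rate signal this check has nothing to compare, which is exactly why visibility comes first.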

Summary

While security and reliability have much in common, they also have qualities and requirements that collide. The key aspects are how to address those collisions, best practices for visibility and for deploying patches, investing time in security and reliability beforehand, and continuously investing time in reducing technical debt. Consumers are not always aware of this work, but they should be made aware of it, and they will then appreciate security and reliability as part of the features they require.
