Database reliability is a goal for many teams. They want their databases to stay up and never cause trouble. However, reliability is often misunderstood, and teams overlook important pieces when working toward it. They focus on improving stability metrics without building the proper processes around them. That may improve reliability in the short term, but it leaves the team spending more time on manual maintenance instead of efficient procedures. Let’s look at three hot trends in database reliability that every team should incorporate into their organization.
Observability Instead of Monitoring
The first step is changing the way we monitor our ecosystems. To keep our databases reliable, we need to prevent bad changes from reaching production. We need to find issues before they make it through our pipelines and fix them as early as possible.
Monitoring alone is not enough for that. We can’t just capture metrics from various software layers and sources in production and call it a day. That way we only observe the end effect of a deployment, when it may already be far too late to prevent the issue. We need to cover the whole software development life cycle and capture data points from every stage and element that may affect the production database.
To give you a sense of what we may be missing today, here is what we should include:
- SQL execution plans from the developers’ environments
- Queries from unit tests executed in CI/CD
- Index usage in the production database
- Extension settings in the database
- Routine maintenance tasks around infrastructure
- Cache utilization
- Data distribution of incoming requests
This list is much longer and depends on exactly what you use and what domain you work in. The snippet below shows what capturing a few of these signals can look like in practice.
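As a minimal sketch, assuming a PostgreSQL database and a made-up orders table, capturing two of these signals could look like this:

```sql
-- Capture a query's execution plan as JSON (e.g. in a CI/CD step),
-- so it can be stored and diffed against earlier runs.
EXPLAIN (FORMAT JSON)
SELECT id, total
FROM orders
WHERE customer_id = 42;

-- Check index usage in production: indexes that are never scanned are
-- candidates for removal, or a sign that query patterns have changed.
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
```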
We need to move from merely monitoring production to seeing everything in our ecosystem. This way we can identify slow queries before they hit the database, invalid schema migrations that may take the database down, and wrong scripts that may lead to data loss. Bear in mind this is not just unit testing. Tests focus on the correctness of what we read and write, but they disregard how we extract the data and whether the SQL queries are fast enough.
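Schema migrations are a good example: a change that passes every unit test can still lock a busy table in production. A hedged PostgreSQL illustration, with hypothetical table and column names:

```sql
-- Risky on a busy table: a plain CREATE INDEX blocks writes to the table
-- for the whole duration of the index build.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Safer alternative: build the index without blocking concurrent writes
-- (note it cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
```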
Understanding Instead of Seeing
Once we observe everything, we need to move from just observing to understanding how pieces interact with each other.
Our current monitoring solutions are great at showing us where something happens. They can easily show that there is a fire somewhere, present metrics, and even draw timelines. However, they don’t explain what happened. They can’t connect the dots between the layers to explain that a metric increased because of a code bug, a change in data distribution, an operating system update, or a new task running unexpectedly.
Seeing is not enough, though. Knowing that a metric increased doesn’t help much when we deploy many times a day and run many heterogeneous applications. Services are very chatty these days, with constant network traffic and many moving pieces affecting each other.
What we need instead is understanding. We need our monitoring solutions to explain what happened, why metrics changed, what exactly contributed to the issues we observe, and how things are evolving. This is much easier with the observability described in the previous section, because we can correlate code changes with the routine tasks and updates flowing through the system. Instead of alarms telling us that a metric spiked, we should get a consistent history explaining what happened and where we are now.
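As a simple illustration of that correlation, assuming pg_stat_statements is enabled (PostgreSQL 13 or newer for mean_exec_time) and a hypothetical deployments table that records when each release shipped:

```sql
-- The slowest statements right now, by average execution time.
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- What changed recently, so a regression can be tied to a release.
SELECT version, deployed_at
FROM deployments
WHERE deployed_at > now() - interval '1 day'
ORDER BY deployed_at DESC;
```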
Automated Troubleshooting Instead of Manual Work
Last but not least, computers are great at repetitive tasks, so they should handle troubleshooting for us. Once we get comprehensive explanations of what changed and why the system doesn’t behave as expected, we should expect the tooling to troubleshoot the issues automatically.
Think about your typical steps when you see a slow query. You start by capturing the execution plan (which we already have thanks to the observability described above), you then examine why the query is slow (which the understanding from the previous section should hand to us), and finally you try different versions of the query with different indexes or configurations. All of that can be automated. The system can add hypothetical indexes and check the execution plans, compare the costs of different queries, and suggest improvements.
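One possible way to run such “what if” experiments in PostgreSQL is the HypoPG extension. A sketch, again with a hypothetical orders table and assuming the extension is available:

```sql
CREATE EXTENSION IF NOT EXISTS hypopg;

-- Create a hypothetical index: it costs nothing to build and only
-- influences the planner, never real reads or writes.
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

-- Re-check the plan: if the planner picks the hypothetical index and the
-- estimated cost drops, the tool can suggest creating the index for real.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Remove all hypothetical indexes when the experiment is done.
SELECT hypopg_reset();
```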
We can take this even further. Since we already have the whole SDLC covered, the tooling can submit pull requests that fix the code and configuration automatically. The only thing we would need to do as users is accept those pull requests and let them flow through the pipeline. One day we may not even need to accept them explicitly.
How Metis Does All of That
Metis turns monitoring into observability by plugging into every step of the SDLC. It analyzes your developers’ queries and captures their execution plans, then provides actionable insights into how to fix those queries and how to change the database to make things faster.
Metis can also analyze production databases and find anomalies or issues with live queries. It captures extensions, schemas, configurations, and live transactions to explain why things are slow, and its AI engine suggests improvements and shows how to rewrite queries to make them faster.
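Without speculating about how Metis does this internally, the raw signals involved are the kind PostgreSQL itself exposes; for example:

```sql
-- Installed extensions and their versions.
SELECT extname, extversion FROM pg_extension;

-- A few configuration settings that commonly affect query performance.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size');

-- Live sessions and what they are currently waiting on.
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle';
```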
Monitoring metrics alone is not enough. We need to move on to preventing bad code changes from ever being deployed to production. With the right tooling, all of that can be automated, and we can almost forget that our databases exist.