Many things have changed in the last decade. In our quest for greater scalability, resilience, and flexibility, we have strategically pivoted away from traditional monolithic application architectures toward modern practices such as microservices and cloud-native applications. This shift acknowledges that in today's fast-paced technological landscape, isolated, independently deployable services offer significant advantages over the intertwined codebases characteristic of monolithic systems.
Moreover, by adopting cloud-native principles tailored for public and hybrid cloud environments, we have streamlined application development and delivery while ensuring efficient resource utilization through container orchestration tools like Kubernetes, which enable scalable deployment patterns such as horizontal scaling to match demand. This paradigm shift also supports a DevOps culture in which continuous integration and delivery accelerate time-to-market for new features and enhancements, in alignment with our business objectives.
To keep pace with this fast-changing world, we have reduced the complexity of deployments: they have become frequent daily tasks rather than rare, risky events, thanks to a move from laborious manual processes to streamlined CI/CD pipelines and infrastructure deployment tooling. At the same time, system architectures have grown substantially more complex across many dimensions, including infrastructure, configuration, security, and machine learning integrations, and we have learned to manage this complexity in our deployments.
Nevertheless, database complexity hasn't been addressed adequately; it has surged dramatically, with each application now leveraging multiple database types, ranging from SQL and NoSQL systems to specialized engines for tasks like machine learning or vector search. Because changes are rolled out frequently and asynchronously, schema alterations or background jobs can occur at any time without warning, with cascading effects on performance throughout our interconnected systems.
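To make the failure mode concrete, here is a minimal sketch (assuming PostgreSQL with the psycopg2 driver; the table, column, and connection string are hypothetical) of how a single schema change, shipped asynchronously, can stall every service sharing a table, and how splitting it into lock-friendly steps avoids that:

```python
# Hypothetical example: the `orders` table and `status` column are
# assumptions, not taken from a real system.
import psycopg2

def risky_migration(cur):
    # On older PostgreSQL versions this rewrites the whole table under an
    # exclusive lock; every service reading or writing `orders` stalls until
    # the rewrite finishes, even though "add a column" looks harmless.
    cur.execute("ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new'")

def safer_migration(cur):
    # The same change split into lock-friendly steps: add a nullable column,
    # backfill it (in batches on a real table), then enforce the constraint.
    cur.execute("ALTER TABLE orders ADD COLUMN status text")
    cur.execute("UPDATE orders SET status = 'new' WHERE status IS NULL")
    cur.execute("ALTER TABLE orders ALTER COLUMN status SET NOT NULL")

with psycopg2.connect("dbname=app") as conn:  # hypothetical connection string
    with conn.cursor() as cur:
        safer_migration(cur)
```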
Such issues affect the business directly, and they also complicate resolution for developers and DevOps engineers, who often lack the expertise to troubleshoot database-centric problems alone and must call in operations experts or specialized database administrators (DBAs). Without automated solutions, the process remains vulnerable because it depends on manual intervention. In the past, we could put the burden of this complexity on specialized teams like DBAs or operations. That is no longer possible: we now face multi-tenant architectures with hundreds of databases, thousands of serverless applications, and millions of changes going through the pipelines each day. Even if we wanted to handle this scale with dedicated teams of DBAs or DevOps engineers, it's simply impossible.
Thinking that this is irrelevant to mainstream business applications couldn't be farther from the truth. Read on to understand why.
Developers Evaluate Your Business
Many companies have realized that streamlining developers' work brings multiple benefits to the whole company. This happens mostly for two reasons: performance improvement and the rise of new domains.
Automation in development can significantly reduce the mean time to resolution (MTTR) and improve velocity. Today's business problems are addressed by digital solutions that are ultimately developed and maintained by developers. Keeping developers far from the end of the funnel means higher MTTR, more bugs, and longer troubleshooting. Conversely, if we reorganize the environment so developers can work faster, they directly impact every organizational metric. Therefore, our goal is to involve developers in all activities and shift left as much as possible. By putting more tasks directly on development teams, we impact not only technical metrics but also business KPIs and customer-facing OKRs.
The second reason is the rise of new domains, especially around machine learning. AI solutions are significantly reshaping today's world. With large language models, recommendation systems, image recognition, and smart devices all around us, we can build better products and solve our customers' issues faster. However, AI changes so rapidly that only developers can tame its complexity. This requires developers to understand not only the technical side of AI solutions but also the domain knowledge of the business they work in. Developers need to know how to build and train recommendation systems, but also why these systems recommend specific products and how societies work. This turns developers into experts in sociology, politics, economics, finance, communication, psychology, and any other domain that benefits from AI.
Both of these reasons lead to developers playing a crucial role in running our businesses. The days of developers simply picking up tasks from a Jira board are long gone. Developers now lead initiatives end-to-end, and the performance of the business strongly depends on their performance. Therefore, we need to make our solutions more developer-centric to lower MTTR, improve velocity, and enable developers to move faster.
Developers are increasingly advocating for an ecosystem where every component, from configuration changes to deployment processes, is encapsulated in code - a philosophy known as infrastructure as code (IaC). This approach streamlines setup and ensures consistency across environments. The shift toward full automation reinforces this trend: developers want continuous integration and delivery pipelines that build, test, and deploy software without human intervention wherever possible, removing manual steps that invite human error and slow down the development cycle. They also expect these automated processes to be transparent and reversible, providing quick feedback loops when issues arise during testing and allowing seamless rollbacks after a failed deployment or unexpected behavior in production. Ultimately, the goal is an efficient, error-resistant workflow in which code governs not only functionality but also infrastructure changes and automation protocols - a vision of development that relies on software for its operational needs rather than traditional manual processes.
Developers critically evaluate every tool under their purview - whether infrastructure management platforms like Puppet or Chef, continuous integration systems such as Jenkins, deployment frameworks like Kubernetes, monitoring solutions such as Prometheus or Grafana, or AI and machine learning applications. They examine how maintenance-friendly each product is: can it handle frequent updates without downtime? Does its architecture allow easy upgrades to newer versions with minimal configuration changes? The level of built-in automation is a central focus - does an update or change trigger tasks automatically, streamlining workflows and reducing the need for manual intervention in routine maintenance?
Beyond mere functionality, how well does a tool integrate with existing pipelines? Are its APIs easily accessible, so developers can extend its capabilities with custom scripts when necessary? For instance, integrating monitoring tools into CI/CD processes to automatically alert when a release has failed or been rolled back due to critical issues is an essential feature for developers who understand the cascading effects of downtime in today's interconnected digital infrastructure.
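As an illustrative sketch only (the Prometheus address, metric names, and the 1% error budget are all assumptions), a post-deployment pipeline gate along these lines is the kind of integration developers look for:

```python
# Query Prometheus after a deployment and fail the pipeline stage if the
# error rate crossed a budget. Illustrative values throughout.
import sys
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # hypothetical address
QUERY = ('sum(rate(http_requests_total{status=~"5.."}[5m])) '
         '/ sum(rate(http_requests_total[5m]))')

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                    params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

# An empty result means no traffic yet; treat it as a zero error rate here.
error_rate = float(result[0]["value"][1]) if result else 0.0
if error_rate > 0.01:
    print(f"Error rate {error_rate:.2%} exceeds budget; failing the release")
    sys.exit(1)  # a non-zero exit makes the CI/CD stage fail and roll back
print(f"Error rate {error_rate:.2%} within budget; promoting the release")
```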
Their focus is not just immediate utility but future-proofing: they seek systems whose design anticipates growth, both in infrastructure complexity and in the sheer volume of data handled by monitoring tools or AI applications across their stacks, ensuring that what is cutting-edge today remains viable for years to come. Developers aim not only to build products but to curate ecosystem components that require minimal manual input for everyday upkeep while maximizing productivity through built-in mechanisms that predict, prevent, or swiftly rectify issues.
Developers play an essential role in shaping technology within organizations by cooperating with teams at various levels - management, platform engineering, and senior leadership - to present findings, proposed enhancements, and innovative solutions aimed at improving efficiency, security, scalability, user experience, and other critical factors. These collaborations ensure that technological strategies align closely with business objectives while leveraging developers' expertise in software creation and maintenance. By communicating their insights through structured meetings such as code reviews, daily stand-ups, retrospectives, or dedicated strategy sessions, they help guide informed decision-making at every level of leadership. All of this suggests that systems must keep developers in mind to be successful.
Your System Must Be Developer-First
Companies are increasingly moving to platform solutions to enhance their operational velocity, enabling faster development cycles and quicker time-to-market. By leveraging integrated tools and services, platform solutions streamline workflows, reduce the complexity of managing multiple systems, and foster greater collaboration across teams. This consolidated approach allows companies to accelerate innovation, respond swiftly to market changes, and deliver value to customers more efficiently, ultimately gaining a competitive edge in a fast-paced business environment. However, to truly enhance operational velocity, these solutions must be developer-first.
Let's look at some examples of products that have shifted towards prioritizing developers. The first is cloud computing. Manual deployments are a thing of the past. Developers now prefer to manage everything as code, enabling repeatable, automated, and reliable deployments. Cloud platforms have embraced this approach by offering code-centric mechanisms for creating infrastructure, monitoring, wikis, and even documentation. Solutions like AWS CloudFormation and Azure Resource Manager allow developers to represent the system's state as code, which they can easily browse and modify using their preferred tools.
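For illustration, here is a minimal sketch using the AWS CDK for Python, one code-centric layer on top of CloudFormation (the stack and bucket names are made up), of what representing the system's state as code looks like in practice:

```python
# A minimal AWS CDK (v2) sketch; stack and bucket names are hypothetical.
from aws_cdk import App, Stack, aws_s3 as s3
from constructs import Construct

class StorageStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # The bucket's desired state lives in version control: reviewable,
        # diffable, and reproducibly deployable like any other code.
        s3.Bucket(self, "ReportsBucket", versioned=True)

app = App()
StorageStack(app, "storage-stack")
app.synth()  # emits a plain CloudFormation template
```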
Another example is internal developer platforms (IDPs), which empower developers to build and deploy their services independently. Developers no longer need to coordinate with other teams to create infrastructure and pipelines; tasks that once required manual input from multiple teams are now automated and accessible through self-service, removing dependencies and letting developers work more efficiently.
Yet another example is artificial intelligence tools. AI is significantly enhancing developer efficiency by seamlessly integrating with their tools and workflows. By automating repetitive tasks, such as code generation, debugging, and testing, AI allows developers to focus more on creative problem-solving and innovation. AI-powered tools can also provide real-time suggestions, detect potential issues before they become problems, and optimize code performance, all within the development environment. This integration not only accelerates the development process but also improves the quality of the code, leading to faster, more reliable deployments and ultimately, a more productive and efficient development cycle. Many tools (especially at Microsoft) are now enabled with AI assistants that streamline the developers’ work.
Observability 2.0 To The Rescue
We have seen a couple of solutions that keep the developer experience in mind. Let's now look at a domain that lacks this approach: monitoring and databases.
Monitoring systems often prioritize raw and generic metrics because they are readily accessible and applicable across various systems and applications. These metrics typically include data that can be universally measured, such as CPU usage or memory consumption. Regardless of whether an application is CPU-intensive or memory-intensive, these basic metrics are always available. Similarly, metrics like network activity, the number of open files, CPU count, and runtime can be consistently monitored across different environments.
The issue with these metrics is that they are too general and don’t provide much insight. For instance, a spike in CPU usage might be observed, but what does it mean? Or perhaps the application is consuming a lot of memory - does that indicate a problem? Without a deeper understanding of the application, it's challenging to interpret these metrics meaningfully.
Another important consideration is determining how many metrics to collect and how to group them. Simply tracking "CPU usage" isn't sufficient; we need to categorize metrics based on factors like node type, application, country, or other relevant dimensions. However, this approach can introduce challenges. If we aggregate all metrics under a single "CPU" label, we might miss critical issues affecting only a subset of the sources. For example, if you have 100 hosts and only one experiences a CPU spike, this won't be apparent in aggregated data. While metrics like p99 or tm99 can offer more insights than averages, they still fall short. If each host experiences a CPU spike at different times, these metrics might not detect the problem. When we recognize this issue, we might attempt to capture additional dimensions, create more dashboards for various subsets, and set thresholds and alarms for each one individually. However, this approach can quickly lead to an overwhelming number of metrics.
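A tiny, self-contained illustration of this aggregation trap, using synthetic numbers: one host out of a hundred is pegged at 100% CPU, yet the fleet-wide average never looks alarming.

```python
import random
import statistics

random.seed(7)
SAMPLES = 60  # one minute of per-second CPU samples per host

# 100 hosts idling between 20% and 40% CPU, except one that is pegged.
hosts = {f"host-{i}": [random.uniform(20, 40) for _ in range(SAMPLES)]
         for i in range(100)}
hosts["host-42"] = [100.0] * SAMPLES

# Fleet-wide aggregation: average across all hosts at each timestamp.
aggregated = [statistics.mean(series[t] for series in hosts.values())
              for t in range(SAMPLES)]

print(f"aggregated mean: {statistics.mean(aggregated):.1f}%")  # ~31%, looks healthy
print(f"aggregated max:  {max(aggregated):.1f}%")              # still looks healthy

# Only a per-host breakdown reveals the incident.
worst = max(hosts, key=lambda h: statistics.mean(hosts[h]))
print(f"worst host: {worst} at {statistics.mean(hosts[worst]):.1f}% CPU")
```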
There is a discrepancy between what developers want and what evangelists or architects consider the right way. Architects and C-level executives promote monitoring solutions that developers simply can't stand. These solutions get it wrong because they swamp users with raw data instead of presenting curated aggregates and actionable insights. To make things better, monitoring solutions need to switch gears to observability 2.0 and database guardrails.
First and foremost, developers aim to avoid issues altogether. They seek modern observability solutions that prevent problems before they occur. This goes beyond merely monitoring metrics; it encompasses the entire software development lifecycle (SDLC) and every stage of development within the organization. Production issues don't begin with a sudden surge in traffic; they originate much earlier, when developers first implement their solutions, and surface only once those solutions reach production and customers start using them. Observability solutions must therefore monitor every aspect of the SDLC and every activity in the development pipeline: the production code and how it runs, but also the CI/CD pipeline, development activities, and every single test executed against the database.
Second, developers deal with hundreds of applications each day. They can't waste their time manually tuning alerting for each application separately. Monitoring solutions must automatically detect anomalies, fix issues before they happen, and tune alarms based on real traffic. They shouldn't raise alarms based on hard limits like 80% CPU load; instead, they should understand whether high CPU usage is abnormal or simply inherent to the application's domain.
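As a deliberately simplified stand-in for what a real platform would do (a rolling z-score instead of a learned model; the window sizes and thresholds below are arbitrary), the difference between a hard limit and a learned baseline looks roughly like this:

```python
import random
import statistics
from collections import deque

class AnomalyDetector:
    """Flags values that deviate from this workload's own recent baseline."""

    def __init__(self, window: int = 120, z_threshold: float = 4.0):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 30:  # wait for a baseline before judging
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > self.z_threshold
        if not anomalous:
            self.history.append(value)  # keep outliers out of the baseline
        return anomalous

random.seed(1)
detector = AnomalyDetector()
for _ in range(60):
    detector.observe(random.gauss(85, 2))  # steady ~85% CPU: normal here

print(detector.observe(86.0))  # False, although a hard 80% limit would page someone
print(detector.observe(99.0))  # True: a genuine deviation for this service
```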
Last but not least, monitoring solutions can't just monitor. They need to fix issues as soon as they appear. Many database problems can be solved automatically by introducing indexes, updating statistics, or changing the system's configuration, and monitoring systems can perform these activities on their own. Developers should be called if and only if there is a business decision to make, and when that happens, they should be given the full context: what happened, why, where, and what choice they need to make. They shouldn't be debugging anything; the troubleshooting should be done automatically by the tooling.
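A minimal sketch of that philosophy (PostgreSQL with psycopg2 again; the thresholds and connection string are arbitrary assumptions): detect tables that are being sequentially scanned instead of index-scanned, apply the cheap, safe fix automatically, and surface the rest with full context rather than asking a human to debug.

```python
import psycopg2

with psycopg2.connect("dbname=app") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        # Tables scanned sequentially far more often than via an index are
        # candidates for stale statistics or a missing index.
        cur.execute("""
            SELECT relname, seq_scan, idx_scan
            FROM pg_stat_user_tables
            WHERE seq_scan > 10 * COALESCE(idx_scan, 0) AND seq_scan > 1000
        """)
        for table, seq_scan, idx_scan in cur.fetchall():
            # Safe, automatic, reversible fix first: refresh planner stats.
            cur.execute(f'ANALYZE "{table}"')
            # Escalate with context instead of asking a human to debug.
            print(f"{table}: {seq_scan} sequential vs {idx_scan or 0} index "
                  f"scans; statistics refreshed, consider indexing hot predicates")
```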
Stay In The Loop With Developers In Mind
Over the past decade, the pursuit of scalability, resilience, and flexibility has moved us from monolithic architectures to microservices and cloud-native applications, with all the deployment and database complexity that shift entails.
To make this transition complete, we need to make all our systems developer-centric. This shifts the focus of what we build, and how we build it, toward developers and their environments. Instead of swamping them with data and forcing them to do the hard work, we need to provide solutions and answers. Many products have already made this shift. Your product shouldn't stay behind.