Observability is becoming the foundation of cloud security in the AI era, unifying visibility, detection, and analysis to better understand and prioritize risks.
Teams rely on observability data to understand why a system behaved a certain way during an outage or security incident. Long associated with performance monitoring alone, this visibility is just as essential for security analysis. Reconstructing the progression of an attack requires understanding how cloud identities, services, and resources interacted with one another. Observability data (metrics, events, logs, and traces, or MELT) provides precisely this level of context.
With the rise of LLM applications and services in increasingly distributed environments, the costs and visibility problems caused by fragmented tools and infrastructure keep growing. Vibe coding significantly increases the number of application vulnerabilities, while new attacks that exploit AI can cause unpredictable behavior in certain systems. At the same time, AI can be extremely effective in helping security teams, speeding up investigations thanks to its ability to correlate large volumes of signals. As a result, observability data is becoming an essential building block of the security stack, at the crossroads of application performance and security.
Leverage observability data to detect, analyze and respond to threats
While unified, centralized visibility is essential to understanding cloud incidents, security signals alone often prove insufficient. Each incident generates a multitude of data streams tied to authentication, applications, and infrastructure, each capturing a different piece of the context at a different moment in time. Without the ability to correlate security and observability contexts, teams are left to piece together the puzzle after the fact.
The Cloudflare compromise at the end of 2023 concretely illustrates the benefit of combining operational data and security signals during an incident. Using compromised credentials and an access token obtained via a third-party provider, the attackers penetrated internal systems, including the self-hosted Atlassian environment. The incident produced only a series of weak signals (unusual authentication flows, reconnaissance activity, attempts to access other systems), none of which was enough on its own to explain the attack. It was the correlation of observability data (metrics, events, logs, traces) that ultimately allowed the teams to reconstruct the sequence of events and establish a precise chronology.
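To make the idea of correlation concrete, here is a minimal Python sketch of merging several time-sorted signal streams into one chronology for a single identity. The `Signal` type, the `build_timeline` helper, and all sample values are hypothetical and invented for illustration; they do not describe any real incident or any vendor's schema.

```python
from dataclasses import dataclass
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class Signal:
    timestamp: datetime
    source: str    # hypothetical label such as "auth_log" or "trace"
    identity: str  # user, service account, or token tied to the signal
    detail: str


def build_timeline(identity, *streams):
    """Merge streams that are each already sorted by time, then keep
    only the signals tied to the given identity."""
    merged = merge(*streams, key=lambda s: s.timestamp)
    return [s for s in merged if s.identity == identity]


# Made-up sample data, not taken from any real incident.
auth_log = [
    Signal(datetime(2024, 1, 1, 9, 0), "auth_log", "svc-deploy", "login from unfamiliar network"),
    Signal(datetime(2024, 1, 1, 9, 30), "auth_log", "alice", "routine login"),
]
traces = [
    Signal(datetime(2024, 1, 1, 9, 5), "trace", "svc-deploy", "GET /wiki/search"),
    Signal(datetime(2024, 1, 1, 9, 12), "trace", "svc-deploy", "GET /admin/users"),
]

timeline = build_timeline("svc-deploy", auth_log, traces)
for s in timeline:
    print(s.timestamp.isoformat(), s.source, s.detail)
```

Each signal taken alone looks minor; placed on a single timeline, the login followed by search and admin requests reads as one sequence. A real pipeline would also need to reconcile clocks and identity formats across sources, which this sketch assumes away.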
As this example shows, teams can leverage their existing data to answer the key questions of a security investigation: What changed in the system? Who or what caused the change? What other elements were impacted? Which endpoints were called? For AI-enabled services, an additional question arises: which prompt or tool call triggered the action?
This need for a shared context pushes organizations to bring together or even merge their SRE and security teams. By combining detailed knowledge of architecture and security expertise, they make their systems more resilient to failures. Cloud security is then no longer seen as a separate layer, but as a natural extension of the observability of environments. Security signals are thus more faithful to the actual behavior of systems, and therefore more relevant and usable in the event of an incident.
This approach also allows AI to better prioritize incidents by synthesizing changes, identifying the actors involved and mapping the impacted systems.
Use observability data for threat analysis
When threat analysis leverages existing observability data, teams can reuse the same context as performance monitoring to relate a system's behavior to an attack path or an exploited vulnerability. This data foundation also makes AI more relevant, by facilitating the generation of investigation summaries and remediation recommendations.
We see immediate value in linking observability data to security signals through AI-driven analytics, including generating clear, actionable event maps during incidents. This contribution is particularly visible during the triage and investigation phases, often long and iterative, where teams must check the relevance and priority of the different signals.
For example, consider a SIEM signal indicating the addition of an AdministratorAccess policy to a service account, associated with a source IP address identified as a suspicious residential proxy. In a traditional investigation, the analysis would generally follow several steps:
1. Check if the IP address matches a legitimate administrator and if the session is consistent with their login habits.
2. Reconstruct events that occurred at the same time: policy changes, access key creation, authentication failures, or unusual API calls.
3. Identify the services and resources impacted in order to assess the extent of possible access.
4. Analyze associated network behaviors, including unusual locations and outbound traffic spikes.
AI-assisted analysis, leveraging observability data, can condense this complex process into a single assessment. Teams can then go from hours of investigation to minutes and focus on corrective actions, such as disabling compromised credentials or reinforcing least-privilege principles.
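The four manual steps listed earlier can be sketched as a single context-gathering function, which is the kind of input an AI-assisted assessment would then summarize. This is only an illustration under stated assumptions: the event fields, the `KNOWN_ADMIN_IPS` set, the `triage` function, and the 15-minute window are all hypothetical and do not correspond to any particular SIEM or cloud provider API.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical event records; field names are illustrative only.
EVENTS = [
    {"time": datetime(2024, 1, 1, 10, 0), "actor": "svc-ci", "kind": "policy_change",
     "resource": "iam:svc-ci", "ip": "203.0.113.7"},
    {"time": datetime(2024, 1, 1, 10, 2), "actor": "svc-ci", "kind": "access_key_created",
     "resource": "iam:svc-ci", "ip": "203.0.113.7"},
    {"time": datetime(2024, 1, 1, 10, 4), "actor": "svc-ci", "kind": "api_call",
     "resource": "s3:billing-exports", "ip": "203.0.113.7"},
    {"time": datetime(2024, 1, 1, 14, 0), "actor": "svc-ci", "kind": "api_call",
     "resource": "s3:build-cache", "ip": "198.51.100.20"},
]
KNOWN_ADMIN_IPS = {"198.51.100.20"}  # assumed allowlist, for illustration


def triage(alert, events, window=timedelta(minutes=15)):
    """Gather, for one alert, the context an analyst would collect by hand."""
    nearby = [
        e for e in events
        if e["actor"] == alert["actor"]
        and abs(e["time"] - alert["time"]) <= window
    ]
    by_kind = defaultdict(int)
    for e in nearby:
        by_kind[e["kind"]] += 1
    return {
        # Step 1: does the source IP match a known administrator?
        "ip_is_known_admin": alert["ip"] in KNOWN_ADMIN_IPS,
        # Step 2: what else happened around the same time?
        "concurrent_events": dict(by_kind),
        # Step 3: which resources were touched?
        "impacted_resources": sorted({e["resource"] for e in nearby}),
        # Step 4 (partial): which source addresses appear in the window?
        "source_ips": sorted({e["ip"] for e in nearby}),
    }


alert = EVENTS[0]  # the AdministratorAccess-style policy change
summary = triage(alert, EVENTS)
print(summary)
```

The sketch covers only the data-gathering side. Judging whether a session fits an administrator's habits, or whether an outbound traffic spike is abnormal, needs baselines that are out of scope here.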
Use observability data to prioritize risks
Teams must be able to rely on observability data at every stage of the software development life cycle (SDLC). With the acceleration of code production, driven by AI-assisted development workflows, this data becomes essential to effectively prioritize risks. In reality, only a fraction of critical vulnerabilities merit immediate attention. By linking code analysis results to production behaviors and impacted services, it becomes possible to identify truly significant risks earlier. When leveraging this data, AI-based approaches further reduce false positives and avoid unnecessary remediation efforts.
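As a rough illustration of linking code analysis results to production behavior, the Python sketch below ranks findings by combining a severity score with runtime context. The `Finding` and `RuntimeContext` types, the service names, and the multipliers are hypothetical; the weights are arbitrary and chosen only to show the mechanism, not to recommend a scoring model.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    id: str
    service: str
    severity: float  # assumed 0-10 base score from a code scanner


@dataclass
class RuntimeContext:
    in_production: bool
    internet_exposed: bool
    vulnerable_code_executed: bool  # e.g. observed in traces


def priority(finding, ctx):
    """Weight a finding's severity by what observability data shows."""
    if ctx is None or not ctx.in_production:
        return 0.0  # not deployed: defer, but do not discard
    score = finding.severity
    score *= 1.5 if ctx.internet_exposed else 1.0
    score *= 1.0 if ctx.vulnerable_code_executed else 0.3
    return round(score, 2)


# Made-up findings and runtime observations.
findings = [
    Finding("F1", "checkout", 9.8),
    Finding("F2", "batch-report", 9.8),
    Finding("F3", "staging-tool", 9.1),
]
runtime = {
    "checkout": RuntimeContext(True, True, True),
    "batch-report": RuntimeContext(True, False, False),
    "staging-tool": RuntimeContext(False, False, False),
}

ranked = sorted(
    findings,
    key=lambda f: priority(f, runtime.get(f.service)),
    reverse=True,
)
for f in ranked:
    print(f.id, f.service, priority(f, runtime.get(f.service)))
```

All three findings are nominally critical, yet only the one on an exposed service whose vulnerable code actually runs in production rises to the top.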
Integrating this data into code reviews, through LLM-based analysis of pull requests (PRs), makes it possible to identify risks before code reaches production. AI is particularly useful for analyzing large PRs, where malicious code can be hidden among seemingly innocuous changes.
To distinguish malicious code from legitimate modifications, models need context about both real attacks and common development practices. This is why datasets must be continually enriched, with observed or simulated attacks as well as ordinary PRs, in order to reduce false positives.
Design resilient systems by unifying observability and security
Cloud environments will continue to transform with the adoption of new technologies, while the integration of AI adds a further level of complexity. With each new layer, new blind spots appear and the attack surface widens. In this context, observability data should be considered the foundation of cloud security, especially in AI-enabled environments, because it is what allows teams to accurately interpret, investigate, and remediate incidents.