Architecture
Causely is built on a split-architecture model that balances local control with cloud-powered intelligence. This design ensures low overhead, strong data privacy, and seamless integration with existing tools.
This page provides detailed information about Causely's deployment architecture and component structure. For a high-level overview of how Causely's causal reasoning engine works, see How Causely Works.
System Architecture
Deployment Architecture
Causely can be deployed across various environments, including Kubernetes clusters, standalone Docker hosts, Nomad clusters, and more. The deployment architecture consists of several components that work together to provide real-time root cause analysis.
Mediation Layer
The mediation layer is deployed locally in your infrastructure and processes telemetry data to extract only the signals needed for reasoning. It performs:
- Symptom Detection: Converts telemetry from Prometheus, CloudWatch, Datadog, OpenTelemetry (including eBPF), and other sources into a binary stream of active/inactive symptoms.
- Topology Discovery and Ingestion: Uses integrated telemetry sources to discover entities and their dependencies, and ingests topology from systems such as OpenTelemetry, cloud provider APIs, and other sources.
- Local Processing: Processes telemetry locally to minimize data transfer, control cost, and preserve privacy. Most raw telemetry remains local, with only distilled insights and targeted evidence sent to the cloud.
The mediation layer primarily sends distilled insights to the cloud. After a root cause is identified, a targeted subset of relevant telemetry (metrics, traces, and log-derived errors/events) may be sent as evidence to enhance root cause clarity. For more detail on supported telemetry sources, see Supported Telemetry.
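As a rough illustration of the symptom detection described above, the following sketch turns raw metric samples into binary active/inactive symptom states and reports only the states that changed. The `SymptomRule` type, the threshold-based rule, and the metric and symptom names are all hypothetical; Causely's actual detection logic is not documented here and may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymptomRule:
    """Hypothetical rule: a symptom is active while a metric exceeds a threshold."""
    symptom: str
    metric: str
    threshold: float


def detect_symptoms(samples, rules):
    """Map raw metric samples to binary symptom states (True means active)."""
    states = {}
    for rule in rules:
        value = samples.get(rule.metric)
        # Assumption: a metric with no sample is treated as an inactive symptom.
        states[rule.symptom] = value is not None and value > rule.threshold
    return states


def state_changes(previous, current):
    """Return only the symptoms whose state flipped since the last evaluation."""
    return {
        name: active
        for name, active in current.items()
        if previous.get(name, False) != active
    }
```

Forwarding only state changes, rather than the samples themselves, is one way a local component could keep raw telemetry in place while still giving a remote engine what it needs.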
The mediation layer consists of the following components:
Mediator
The Mediator is the core component that runs locally in your environment and serves as the data processing layer:
- Symptom Detection: Converts telemetry from various sources into binary symptom states
- Topology Discovery: Automatically discovers services, infrastructure, and dependencies
- Local Processing: Processes telemetry locally, with most raw telemetry remaining in your datacenter
- OTLP Endpoint: Listens on port 4317 for OpenTelemetry Protocol data
The Mediator handles secure communication with Causely's cloud-based causal reasoning engine, primarily sending distilled insights. After root causes are identified, a targeted subset of relevant telemetry may be sent as evidence to enhance root cause clarity.
The Mediator can also be configured to retrieve metrics from Prometheus or to discover and monitor managed cloud services from cloud providers.
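To point an existing OpenTelemetry SDK at the Mediator's OTLP endpoint, the standard OpenTelemetry exporter environment variables can be used. The sketch below only builds those variables. The host name passed in is a placeholder, and the choice of gRPC over plaintext `http://` is an assumption based on 4317 being the conventional OTLP/gRPC port; check your deployment for the actual address and transport security settings.

```python
def otlp_exporter_env(mediator_host, port=4317):
    """Build standard OpenTelemetry SDK environment variables for OTLP export.

    mediator_host is a placeholder for wherever the Mediator is reachable.
    """
    return {
        # Assumption: plaintext gRPC; use https:// if TLS is enabled.
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{mediator_host}:{port}",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_TRACES_EXPORTER": "otlp",
    }
```

The returned dictionary can be merged into a process environment or rendered into a container spec, depending on how your applications are launched.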
Agents
Agents are deployed across your infrastructure to gather node and container level metrics. The deployment method varies depending on your environment:
- Kubernetes: Agents are deployed as a DaemonSet across all nodes in the cluster
- Docker: Agents run as containers on standalone Docker hosts
- Nomad: Agents are deployed as Nomad jobs across the cluster
Agents leverage eBPF technology, which requires privileged access to the host system. This enables automatic instrumentation without code changes.
Agents don't establish any outbound connections to the internet or any other service apart from the Mediator and VictoriaMetrics. The agents periodically forward the topology and manifestation data to the Mediator, which, in turn, sends it to the Causely SaaS backend for analysis.
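The connection constraint above lends itself to a simple audit. The following sketch compares a list of observed agent destinations against the two permitted ones. The destination labels `mediator` and `victoriametrics` are hypothetical names for illustration; in practice you would substitute the service names or addresses used in your environment and feed in data from your own network tooling.

```python
# Hypothetical labels for the only two destinations agents are expected to reach.
ALLOWED_AGENT_DESTINATIONS = frozenset({"mediator", "victoriametrics"})


def unexpected_destinations(observed):
    """Return observed agent destinations that fall outside the allowlist."""
    return sorted(set(observed) - ALLOWED_AGENT_DESTINATIONS)
```

An empty result is consistent with the documented behavior; any other value points at traffic worth investigating.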
Agent Architecture
Executor
The Executor is an optional component responsible for executing remediation actions within your infrastructure. The Executor can be enabled as part of the deployment process.
The specific permissions required depend on your deployment environment:
- Kubernetes: The Executor's ServiceAccount is granted the cluster-admin role
- Docker/Nomad: The Executor requires appropriate permissions to execute remediation actions
VictoriaMetrics
VictoriaMetrics is a time series database that the agents and the Mediator use (on port 8428) to store additional time series data locally in your environment.
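VictoriaMetrics exposes a Prometheus-compatible query API, so locally stored series can in principle be inspected with an instant query. The helper below only constructs the request URL. Whether this endpoint is reachable in a Causely deployment, and under which host name, is an assumption to verify against your environment.

```python
from urllib.parse import urlencode


def instant_query_url(host, promql, port=8428):
    """Build a Prometheus-compatible instant query URL for VictoriaMetrics."""
    # urlencode escapes characters such as braces and quotes in the expression.
    query = urlencode({"query": promql})
    return f"http://{host}:{port}/api/v1/query?{query}"
```

The resulting URL can be fetched with any HTTP client from a host that has network access to the VictoriaMetrics instance.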
Causal Engine
The Causal Engine runs in Causely's secure cloud environment (or can be self-hosted). It receives the stream of symptom states and performs real-time analysis using probabilistic modeling, system graphs, and causal inference. It infers causes, evaluates blast radius, validates constraints, and prioritizes remediation, all without requiring manual correlation.
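To give a feel for reasoning over symptom states, here is a deliberately simplified toy that ranks candidate root causes by how closely the symptoms each one is expected to produce match the symptoms currently active. This is not Causely's algorithm: the overlap (Jaccard) score stands in for the probabilistic modeling and causal inference described above, and the model contents, function name, and symptom names are hypothetical.

```python
def rank_root_causes(causal_model, active_symptoms):
    """Rank candidate causes by overlap between expected and active symptoms.

    causal_model maps each candidate cause to the symptoms it would produce.
    """
    active = set(active_symptoms)
    scored = []
    for cause, expected in causal_model.items():
        expected = set(expected)
        union = expected | active
        # Jaccard similarity: 1.0 means the cause explains exactly what is observed.
        score = len(expected & active) / len(union) if union else 0.0
        scored.append((cause, score))
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Even in this toy form, the input is just the set of active symptoms, which is why a binary symptom stream is enough for the engine to work with.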
Telemetry Sources
Causely supports a wide range of telemetry sources, including OpenTelemetry, Prometheus, CloudWatch, Datadog, and more. For a full list of supported telemetry sources, see Supported Telemetry.
Causely Agents are deployed in your infrastructure and are responsible for collecting the telemetry data from those sources.
By default, Causely automatically instruments your applications to receive OpenTelemetry traces. This allows Causely to discover service dependencies and monitor synchronous and asynchronous communication signals. Additionally, you can export traces to Causely from your existing OpenTelemetry Collectors. We recommend always sending OpenTelemetry traces to Causely, because they allow Causely to provide cross-service insights.
Workflow Integration
Causely integrates directly into your tools of choice, delivering causal insights into Slack, Alertmanager, Opsgenie, Grafana, and more.
For details on how to connect Causely to your workflows, see Supported Workflows.
This architecture allows Causely to deliver precise, real-time insights without burdening your data pipelines or violating privacy requirements.
Security Considerations
For detailed information about security, permissions, and data handling, see the Security documentation.