MCP Server Integration
The Causely MCP server gives agents and AI assistants direct access to Causely's causal reasoning engine. 25 tools across 5 categories let your agent move from raw alerts to structured root cause analysis, dependency maps, and reliability reports, without writing custom integrations.
Key Workflows
These are the four workflows agents use most often. Each maps to a specific sequence of MCP tool calls.
Incident Triage
Identify what's broken and how far it has spread.
Try:
- "What's broken in production right now?"
- "Checkout is throwing errors and we have Alertmanager alerts firing. What's the actual root cause?"
- "Three services are alerting at once. Which is the real problem and which are downstream noise?"
- `get_symptoms()`: see all active symptoms across the entire environment (no filters needed)
- `get_root_causes()`: identify all active root causes and impacted services
- `get_alerts(alert_name_filters=...)`: drill into a specific alert's cause
- `get_topology(entity_id=..., mode="dependents")`: map which upstream services are affected
Handled automatically by the causely-correlated-incidents skill, or by causely-alert-triage if you're starting from a specific alert.
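For a custom agent, the triage sequence above can be sketched as an ordered plan of MCP tool calls. The snippet below only assembles the call names and arguments an agent would issue; the alert name and entity ID are illustrative placeholders, not real values.

```python
# Ordered MCP tool calls an agent would issue for incident triage.
# Tool names come from the reference below; "checkout-latency" and
# "svc-123" are placeholders for illustration only.
triage_plan = [
    ("get_symptoms", {}),                # all active symptoms, no filters
    ("get_root_causes", {}),             # active root causes and impacted services
    ("get_alerts", {"alert_name_filters": ["checkout-latency"]}),      # drill into one alert
    ("get_topology", {"entity_id": "svc-123", "mode": "dependents"}),  # upstream blast radius
]

for tool, args in triage_plan:
    print(tool, args)
```

The ordering matters: environment-wide symptoms and root causes come first, so the agent can decide which alert and entity are worth drilling into before issuing the scoped calls.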
Quick Service Health
Get a complete health picture for a specific service in two calls.
Try:
- "Is checkout healthy?"
- "Give me a full health picture for the payments service: status, open issues, SLOs."
- "Before I page anyone, is there actually a problem with database-service or is this alert noise?"
- "Are there any concerning errors or warnings in the logs for the frontend service over the last hour?"
- `get_entities(query="service-name", entity_types=["Service"])`: resolve the service name to its entity ID
- `get_service_summary(service="service-name")`: full snapshot: status, active symptoms, root causes, SLOs, metrics, recent events, error logs
- `get_logs(entity_id=...)`: retrieve live log output for a running service
Handled automatically by the causely-health-reporting skill.
Post-Deploy Validation
Check whether a deployment introduced regressions.
Try:
- "Did the last deploy to payments cause any regressions?"
- "We deployed cart service 30 minutes ago. How does it look compared to before?"
- "Check all services my team owns, did anything degrade after today's deploys?"
- `reliability_delta(service="service-name")`: compare CPU, memory, latency, and error rate before vs. after the most recent deployment
- `fleet_reliability_delta(team="team-name")`: batch check across all services for a team, namespace, or explicit list
Handled automatically by the causely-change-impact skill.
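If you consume reliability_delta programmatically, the core decision is a per-metric before/after comparison. A minimal sketch of that logic follows; the response shape and field names here are assumptions for illustration, so check the actual tool output before relying on them.

```python
# Flag metrics that degraded beyond a threshold after a deploy.
# The `delta` shape ({"metric": {"before": x, "after": y}}) is an
# assumed, illustrative structure, not the tool's actual schema.
def find_regressions(delta, threshold=0.10):
    """Return metrics whose post-deploy value grew more than `threshold` (default 10%)."""
    regressions = {}
    for metric, values in delta.items():
        before, after = values["before"], values["after"]
        if before > 0 and (after - before) / before > threshold:
            regressions[metric] = round((after - before) / before, 3)
    return regressions

sample = {
    "p95_latency_ms": {"before": 120.0, "after": 180.0},  # +50%: regression
    "error_rate": {"before": 0.01, "after": 0.01},        # unchanged
    "cpu_cores": {"before": 0.5, "after": 0.52},          # +4%: under threshold
}
print(find_regressions(sample))
```

A relative threshold like this is a reasonable default, but for low-baseline metrics (an error rate near zero, for instance) you may want an absolute floor as well.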
Post-Incident Reporting
Generate postmortem documentation and action items from a resolved incident.
Try:
- "The payments outage is resolved. Draft a postmortem."
- "Write up what happened to checkout this morning: timeline, root cause, and what was affected."
- "Payments is back up. Draft the postmortem and create a follow-up ticket for the team."
- `get_root_causes(root_cause_id=...)`: retrieve full root cause details, timeline, and blast radius
- `postmortem(root_cause_id=...)`: generate a structured postmortem draft
- `generate_ticket(task="...")`: create a follow-up engineering ticket for Jira, GitHub Issues, or Linear
Handled automatically by the causely-postmortem skill.
Skills (Recommended)
Skills automate the tool-selection step shown in the workflows above. You describe your situation in natural language; the right specialist activates and runs the correct tool sequence for you. Skills are available for Claude Code, Claude Desktop, and Cursor.
| Situation | Skill | Try |
|---|---|---|
| Incoming alert | causely-alert-triage | "PagerDuty just paged for checkout-latency. What's the actual cause?" |
| Post-deploy validation | causely-change-impact | "Did the last deploy to payments cause any regressions?" |
| Multi-service outage | causely-correlated-incidents | "Three services are alerting at once. What's the real problem?" |
| Health summary / morning standup | causely-health-reporting | "Give me a morning health report for production." |
| Kubernetes investigation | causely-k8s-investigation | "The orders pod keeps OOMKilling. Why?" |
| Postmortem / ticket | causely-postmortem | "Draft a postmortem for the checkout outage that resolved an hour ago." |
See the Skills page for install instructions, full skill detail, and override options.
Choose Your Client
Select your tool for a copy-paste config snippet, config file location, and restart instructions.
| Client | Transport | Config format |
|---|---|---|
| Claude Code | HTTP | .mcp.json (mcpServers) |
| Claude Desktop | stdio via mcp-remote | claude_desktop_config.json (mcpServers) |
| Codex | HTTP | config.toml (mcp_servers) |
| Cursor | HTTP | .cursor/mcp.json (mcpServers) |
| VS Code (GitHub Copilot) | HTTP | .vscode/mcp.json (servers) |
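As a concrete example, a Claude Code `.mcp.json` entry might look like the following. The server name `causely` and the `type`/`url` keys are assumptions based on common HTTP MCP configs; copy the exact snippet from your client's setup page.

```json
{
  "mcpServers": {
    "causely": {
      "type": "http",
      "url": "https://api.causely.app/mcp"
    }
  }
}
```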
Verify your connection by asking: "Causely: What defects are currently active?"
Other MCP-compatible Clients
The clients above have dedicated setup pages. The following tools also support the Causely MCP server; point them at https://api.causely.app/mcp using your tool's HTTP MCP config. See Advanced Authentication for credential options.
IDEs and Editors: JetBrains IDEs (IntelliJ IDEA, PyCharm, WebStorm, GoLand, and others), Windsurf, Zed
CLIs: Kiro CLI, Amp, Atlassian Rovo DEV CLI, and other MCP-compatible CLI tools
Agent Frameworks: HolmesGPT
Authentication
The MCP server validates Frontegg-issued Bearer tokens. For most clients, browser-based OAuth runs automatically; no manual setup is needed. For non-interactive setups (automation, CI) or clients that only support stdio, including the stdio/mcp-remote fallback, see Advanced Authentication.
Using the Tool Reference
The reference below is for teams building custom agents that need explicit tool control. If you're using Claude, Cursor, Codex, or any conversational agent, you can skim it for capability awareness. In most cases, you can describe what you want and the agent picks the right tools. One thing worth knowing if you do go programmatic: most structured tools require an entity ID, so get_entities() is usually the right first call.
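Under the hood, every MCP tool invocation is a JSON-RPC 2.0 request with method `tools/call`, per the MCP specification. A sketch of the payload a custom agent would send for the entity-resolution first call; the request id and argument values are placeholders.

```python
import json

# Build a JSON-RPC 2.0 "tools/call" request, the wire format MCP uses
# for tool invocations. Argument values are illustrative placeholders.
def build_tool_call(request_id, tool_name, arguments):
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

req = build_tool_call(1, "get_entities", {"query": "payments", "entity_types": ["Service"]})
print(json.dumps(req, indent=2))
```

In practice you would send this through an MCP client library rather than constructing it by hand, but the shape is useful to know when debugging a custom agent's traffic.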
Tool Selection: Ask Causely vs Structured Tools
The MCP server exposes two interaction styles. Choose based on what your agent needs to do with the result.
| Use case | Recommended tool |
|---|---|
| Narrative health summary (“Is checkout healthy?”) | get_service_summary |
| Historical questions (“What happened last night?”) | ask_causely |
| Incident standup summary (“What happened to checkout yesterday?”) | ask_causely |
| SLO overview, error budget, and burn rate (“Are any SLOs at risk?” / “Is the payments SLO burning?”) | get_slo |
| Programmatic root cause output (“What is the root cause of latency on payments?”) | get_root_causes |
| Time-series metric data (“What is the p95 latency for the last hour on payments?”) | get_metrics |
| Entity ID resolution (“Resolve the entity ID for the payments service”) | get_entities |
| Dependency graph (“What services depend on payments?”) | get_topology |
| Post-deploy regression check (“Did the latest payments deploy introduce a regression?”) | reliability_delta |
- **Ask Causely**: natural language in. Best for open-ended exploration and synthesis.
- **Structured tools**: explicit named inputs. Best when your agent needs to act on the result, apply logic, or chain calls.
What Agents Get vs Raw Telemetry
| Capability | Raw telemetry | Causely MCP |
|---|---|---|
| Root cause identification | Correlation-based, requires analysis | Deterministic causal analysis |
| Dependency awareness | Manual mapping required | Live topology from observed traffic |
| Blast radius | Estimated | Computed from causal graph |
| Structured output | Custom parsing required | Typed tool responses |
| Time to insight | Minutes of analysis | Single tool call |
Full Tool Reference
25 tools across 5 categories. All tools are available to any MCP-compatible agent or assistant.
Entity Resolution
| Tool | When to use |
|---|---|
| get_entities | Start here. Resolve a service or database name to its ID; list all entities in a namespace; check current health status |
| get_label_values | Enumerate valid label values (team, product, cluster, namespace) before fanning out queries across environments |
| list_namespaces | Discover Kubernetes namespace names before resolving entities or scanning a namespace |
| list_clusters | Discover cluster names before scoping multi-cluster queries |
Data Retrieval
| Tool | When to use |
|---|---|
| get_metrics | Retrieve numeric metric data (p95 latency, error rate, CPU, memory, throughput): the only tool that returns time-series data |
| get_logs | Inspect live service logs, or retrieve evidence logs captured at root cause detection time |
| get_alerts | Start triage from an alert name (PagerDuty, Slack, Datadog); distinguish alerts mapped to causal analysis from noise |
| get_events | Correlate symptom onset with deployments, restarts, scaling events, or config changes |
| get_config | Investigate configuration drift; verify a deployment manifest matches expectations |
| get_slow_queries | Identify database queries consuming the most execution time; follow up on database root causes |
Health & Diagnosis
| Tool | When to use |
|---|---|
| ask_causely | Open-ended questions and synthesis: historical summaries, standup recaps, anything where narrative output is more useful than structured data |
| get_symptoms | Call with no filters to see all active symptoms across the entire environment, or filter by entity, namespace, or cluster |
| get_root_causes | Identify all active root causes; filter by impacted service, symptom, or root cause ID |
| get_entity_health | Structured health summary for non-Service entities (databases, pods, queues, topics, tables) |
| get_environment_health | Structured health summary for the environment; can be scoped to specific namespaces or services |
| get_slo | Check SLO state, error budget remaining, and burn rate |
| get_topology | Find upstream blast radius (dependents), downstream dependencies, or the full data-flow graph |
| get_integration_status | Verify monitoring coverage; check scraper health by cluster |
| triage | Focused health summary by entity name or root cause ID; no entity ID pre-resolution needed |
| team_health | Health summary for all services owned by a team; degraded and critical services listed first |
| get_service_summary | Comprehensive health snapshot for a single service: status, symptoms, root causes, SLOs, metrics, events, logs |
Reporting & Postmortems
| Tool | When to use |
|---|---|
| postmortem | Generate a deterministic postmortem draft for a resolved incident from Causely data |
| generate_ticket | Create a structured engineering ticket suitable for Jira, GitHub Issues, or Linear |
Reliability & Deployment
| Tool | When to use |
|---|---|
| reliability_delta | Post-deploy regression check for a single service: compare resource consumption before/after the most recent deployment |
| fleet_reliability_delta | Batch regression check across a team, namespace, or explicit service list (up to 20 services per call) |