Stream events from a DAG-failure diagnosis run

Returns a Server-Sent Events stream of diagnosis events. Supports reconnection via the lastEventId query parameter.

Authentication

Authorization Bearer

Bearer authentication of the form Bearer <token>, where token is your auth token.
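
As a minimal sketch of the header described above, the helper below builds the `Authorization` value from a token. The function name `bearer_headers` is hypothetical, and the `Accept` header is an assumption based on the conventional media type for Server-Sent Events, not something this page specifies.

```python
def bearer_headers(token: str) -> dict[str, str]:
    """Build request headers for the diagnosis event stream."""
    return {
        # Form documented above: "Bearer <token>".
        "Authorization": f"Bearer {token}",
        # Assumption: the conventional SSE media type; not stated on this page.
        "Accept": "text/event-stream",
    }
```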

Path parameters

organizationId string Required
The ID of the Organization to which the Deployment belongs.
deploymentId string Required
The Deployment's ID.
diagnosisRunId string Required
The diagnosis run ID.

Query parameters

lastEventId string Optional
The last event ID received, used for reconnection.
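
A reconnecting client could append the parameter as sketched below. The helper name `with_last_event_id` and the `stream_url` argument are hypothetical, since this page does not show the endpoint URL, and the sketch assumes the URL carries no other query string.

```python
from typing import Optional
from urllib.parse import urlencode


def with_last_event_id(stream_url: str, last_event_id: Optional[str]) -> str:
    """Return the stream URL, adding lastEventId when resuming a stream."""
    if not last_event_id:
        # First connection: nothing to resume from.
        return stream_url
    # urlencode percent-encodes the value so arbitrary event IDs are safe.
    return f"{stream_url}?{urlencode({'lastEventId': last_event_id})}"
```

On the first connection, pass `None`. After a dropped connection, pass the ID of the last event the client processed.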

Response

A server-sent events stream of diagnosis events. Progress events (`text_delta`) stream as the investigation runs, a `heartbeat` event keeps the connection warm, and the `rca_diagnosis` event carries the structured diagnosis described by this schema. An `end` event marks the end of the stream, and an `error` event reports a failure.
title string
Short descriptive title for the finding.
summary string
Concise incident-style summary of what happened and the root cause.
root_cause string
The root cause best supported by the retrieved evidence.
root_cause_task string
The task ID (dag_id.task_id) identified as the root cause. Empty for a Dag-level issue.
root_cause_type string
Root-cause category code, for example USER_CODE_PYTHON_TYPE_ERROR, NETWORK_READ_TIMEOUT, or OUT_OF_MEMORY_ERROR. Returns INSUFFICIENT_EVIDENCE when no retrieved evidence points to a cause, or OTHER when no listed code fits. Workspace or Deployment guidance can define custom values.
transience enum
Failure transience: PERMANENT fails every time, TRANSIENT is a one-off or self-healing failure, INTERMITTENT fails sometimes.
severity enum
Severity of the issue.
priority enum
Incident priority, from P1 (critical outage) to P4 (low).
confidence double
Confidence in the diagnosis, from 0.0 to 1.0.
confidence_justification string
Explanation of why the confidence score was assigned.
evidence_status enum
How well-evidenced the diagnosis is: confirmed (grounded in retrieved logs plus code or config), hypothesis (plausible but key evidence missing), or insufficient_evidence (no retrieved signal points to a cause).
evidence list of strings
Key pieces of evidence supporting the diagnosis.
symptoms list of strings
Observable symptoms of the failure.
contributing_factors list of strings
Additional factors that contributed to or amplified the failure.
evidence_gaps list of objects
One entry per key field the agent could not populate, with the reason.
log_snippet string
The most relevant raw log excerpt showing the error.
exception_class string
The Python exception class name, for example DatabaseError or ConnectionError.
exception_message string
The exception message string.
log_classification string
Classification of the log error pattern, for example connection_timeout, auth_failure, resource_exhaustion, data_validation, parse_error, import_error, or permission_denied.
log_signals list of strings
Structured signal tags extracted from logs.
dag_level_checks list of objects
Dag-wide diagnostic checks performed before per-task analysis.
tasks list of objects
Per-task diagnostic breakdown, one entry per failed or upstream-failed task.
cascade_chain list of objects
Ordered chain of tasks or Dags in the failure cascade, from root to leaf.
cascade_depth integer
Number of hops from the root failure to the furthest affected task.
is_cascade_amplifier boolean
Whether a single task failure was amplified into many downstream failures.
amplifier_description string
Explanation of how the cascade amplification occurred.
blast_radius object
The impact scope of the failure.
cofailure_pairs list of objects
Pairs of tasks or Dags that fail together because of a shared resource.
effective_availability double
Actual availability percentage, accounting for retries and partial failures.
reported_success_rate double
The nominal success rate reported by Airflow.
health_gap string
Explanation of the gap between effective availability and reported success rate.
failure_onset_date string
ISO 8601 timestamp of when failures first started.
last_success_date string
ISO 8601 timestamp of the last successful run.
timeline_events list of objects
Chronological sequence of events leading to and during the failure.
change_signals list of objects
Config or deploy changes correlated with the failure onset.
suggested_fix string
Actionable fix, with code snippets or config changes when applicable.
remediation string
Step-by-step remediation instructions.
prevention_measures list of objects
Forward-looking recommendations to prevent recurrence.
vendor_remediation string
Vendor-specific fix guidance, separate from the Airflow-side fix.
vendor_name string
External vendor or service involved, for example Snowflake, S3, or BigQuery.
vendor_destination string
Specific vendor resource target.
vendor_dashboard_url string
URL to a vendor monitoring dashboard for further investigation.
rendered_conn_ids list of strings
Airflow connection IDs involved in the failure.
affected_dags list of strings
All Dag IDs affected by the failure, including downstream.
child_dag_findings list of objects
Findings for downstream or child Dags affected by the failure cascade.
match_count integer
Number of matching failures with the same error signature in the observed window.
coverage_quality string
Depth of analysis performed: deep, moderate, shallow, or partial.
duration_percentiles object
Task duration distribution percentiles, in seconds.
trend_direction string
Trend direction of the failure rate or duration: improving, stable, degrading, or unknown.
change_point_date string
ISO 8601 date when a significant change in behavior was detected.
seasonality_signal string
Description of any seasonal or periodic pattern detected.
fleet_health_score double
Overall Deployment health score, from 0.0 (unhealthy) to 1.0 (healthy).
from_dag_run_id string
Original Dag run ID when the diagnosis was replayed from the cache. Absent on fresh diagnoses.
session_id string
Durable ID of the investigation session. Include it when you contact Astronomer support.
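
The event types listed under Response can be handled with a small client-side loop. The sketch below parses already-decoded stream lines using the standard Server-Sent Events framing (`event:`, `data:`, and `id:` fields, with a blank line ending each event) and keeps the `rca_diagnosis` payload. It rests on several assumptions not stated on this page: that the `rca_diagnosis` data is a JSON object with the fields above, and that events carry `id:` values usable as lastEventId. The function names and the 0.7 confidence threshold are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per event from an iterable of decoded SSE lines."""
    event, data, event_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the event; events without data are dropped.
            if data:
                yield {"event": event, "data": "\n".join(data), "id": event_id}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        elif field == "id":
            event_id = value  # persists until another id field replaces it


def collect_diagnosis(lines: Iterable[str]) -> Tuple[Optional[dict], Optional[str]]:
    """Return (diagnosis, last_event_id); raise if the stream reports an error."""
    diagnosis, last_id = None, None
    for ev in parse_sse(lines):
        last_id = ev["id"] or last_id
        if ev["event"] == "rca_diagnosis":
            # Assumption: the payload is a JSON object.
            diagnosis = json.loads(ev["data"])
        elif ev["event"] == "error":
            raise RuntimeError(f"diagnosis stream reported an error: {ev['data']}")
        elif ev["event"] == "end":
            break
        # text_delta and heartbeat events are ignored in this sketch.
    return diagnosis, last_id


def is_actionable(diagnosis: dict, min_confidence: float = 0.7) -> bool:
    """Illustrative gate: confirmed evidence and confidence above a chosen bar."""
    return (
        diagnosis.get("evidence_status") == "confirmed"
        and diagnosis.get("confidence", 0.0) >= min_confidence
    )
```

If the connection drops before the `end` event arrives, the `last_event_id` returned by `collect_diagnosis` is the value a client would send back as lastEventId. A production client would likely also surface `text_delta` progress and use `heartbeat` events to detect a stalled connection.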