Stream events from a Dag-failure diagnosis run

Returns a Server-Sent Events stream of diagnosis events. Supports reconnection via the lastEventId query parameter.

Authentication

Authorization Bearer

Bearer authentication of the form Bearer <token>, where <token> is your auth token.

Path parameters

organizationId string Required

The ID of the Organization to which the Deployment belongs.

deploymentId string Required

The Deployment's ID.

diagnosisRunId string Required

The diagnosis run ID.

Query parameters

lastEventId string Optional

The last event ID received, used for reconnection.
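As a rough sketch of how the bearer token, the three path parameters, and lastEventId fit together, the following builds (but does not send) a reconnect request with the Python standard library. The host and the URL path template are placeholders, because this page does not state them. The Accept header is the conventional one for Server-Sent Events and is an assumption, not something this page requires.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

# Hypothetical placeholders: this page does not give the host or the URL path.
BASE_URL = "https://api.example.com"
EVENTS_PATH = (
    "/organizations/{organizationId}/deployments/{deploymentId}"
    "/diagnosis-runs/{diagnosisRunId}/events"
)


def build_stream_request(
    organization_id: str,
    deployment_id: str,
    diagnosis_run_id: str,
    token: str,
    last_event_id: Optional[str] = None,
) -> Request:
    """Build a GET request for the event stream without sending it."""
    path = EVENTS_PATH.format(
        organizationId=quote(organization_id, safe=""),
        deploymentId=quote(deployment_id, safe=""),
        diagnosisRunId=quote(diagnosis_run_id, safe=""),
    )
    url = BASE_URL + path
    # Only add lastEventId when reconnecting after at least one received event.
    if last_event_id is not None:
        url += "?" + urlencode({"lastEventId": last_event_id})
    return Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "text/event-stream",  # assumed, conventional for SSE
        },
    )
```

On a first connection, omit last_event_id. After a dropped connection, pass the ID of the last event you processed.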

Response

A server-sent events stream of diagnosis events. Progress events (text_delta) stream as the investigation runs, a heartbeat event keeps the connection warm, and the rca_diagnosis event carries the structured diagnosis described by this schema. An end event marks the end of the stream, and an error event reports a failure.
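The event types named above can be handled with a small parser and dispatcher. The sketch below follows the general Server-Sent Events framing (id, event, and data fields, with a blank line ending each event). It assumes the rca_diagnosis data field is a JSON object, which this page implies but does not state. The SseEvent type, both function names, and the sample lines are illustrative only.

```python
import json
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class SseEvent(NamedTuple):
    event: str
    data: str
    id: Optional[str]


def parse_sse(lines: Iterable[str]) -> Iterator[SseEvent]:
    """Group SSE lines into events; a blank line terminates each event."""
    event, data, last_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # events with no data field are not dispatched
                yield SseEvent(event, "\n".join(data), last_id)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        elif field == "id":
            last_id = value


def collect_diagnosis(
    events: Iterable[SseEvent],
) -> Tuple[Optional[dict], List[str], Optional[str]]:
    """Read events until `end`; return (diagnosis, progress chunks, last event ID)."""
    diagnosis, progress, last_id = None, [], None
    for ev in events:
        if ev.id is not None:
            last_id = ev.id  # keep this for the lastEventId query parameter
        if ev.event == "text_delta":
            progress.append(ev.data)
        elif ev.event == "rca_diagnosis":
            diagnosis = json.loads(ev.data)  # assumption: payload is JSON
        elif ev.event == "error":
            raise RuntimeError(f"diagnosis stream reported an error: {ev.data}")
        elif ev.event == "end":
            break
        # heartbeat and unrecognized events carry nothing to act on here
    return diagnosis, progress, last_id


# Made-up sample lines, for illustration only.
SAMPLE = [
    "id: 1", "event: text_delta", "data: Checking task logs", "",
    "id: 2", "event: heartbeat", "data:", "",
    "id: 3", "event: rca_diagnosis", 'data: {"title": "Example finding"}', "",
    "id: 4", "event: end", "data:", "",
]
diagnosis, progress, last_id = collect_diagnosis(parse_sse(SAMPLE))
```

If the connection drops before the end event arrives, the last_id value returned here is what you would send back as lastEventId.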

title string

Short descriptive title for the finding.

summary string

Concise incident-style summary of what happened and the root cause.

root_cause string

The root cause best supported by the retrieved evidence.

root_cause_task string

The task ID (dag_id.task_id) identified as the root cause. Empty for a Dag-level issue.

root_cause_type string

Root-cause category code, for example USER_CODE_PYTHON_TYPE_ERROR, NETWORK_READ_TIMEOUT, or OUT_OF_MEMORY_ERROR. Returns INSUFFICIENT_EVIDENCE when no retrieved evidence points to a cause, or OTHER when no listed code fits. Workspace or Deployment guidance can define custom values.

transience enum

Failure transience: PERMANENT fails every time, TRANSIENT is a one-off or self-healing failure, INTERMITTENT fails sometimes.

severity enum

Severity of the issue.

priority enum

Incident priority, from P1 (critical outage) to P4 (low).

confidence double

Confidence in the diagnosis, from 0.0 to 1.0.

confidence_justification string

Explanation of why the confidence score was assigned.

evidence_status enum

How well-evidenced the diagnosis is: confirmed (grounded in retrieved logs plus code or config), hypothesis (plausible but key evidence missing), or insufficient_evidence (no retrieved signal points to a cause).

evidence list of strings

Key pieces of evidence supporting the diagnosis.

symptoms list of strings

Observable symptoms of the failure.

contributing_factors list of strings

Additional factors that contributed to or amplified the failure.

evidence_gaps list of objects

One entry per key field the agent could not populate, with the reason.

log_snippet string

The most relevant raw log excerpt showing the error.

exception_class string

The Python exception class name, for example DatabaseError or ConnectionError.

exception_message string

The exception message string.

log_classification string

Classification of the log error pattern, for example connection_timeout, auth_failure, resource_exhaustion, data_validation, parse_error, import_error, or permission_denied.

log_signals list of strings

Structured signal tags extracted from logs.

dag_level_checks list of objects

Dag-wide diagnostic checks performed before per-task analysis.

tasks list of objects

Per-task diagnostic breakdown, one entry per failed or upstream-failed task.

cascade_chain list of objects

Ordered chain of tasks or Dags in the failure cascade, from root to leaf.

cascade_depth integer

Number of hops from the root failure to the furthest affected task.

is_cascade_amplifier boolean

Whether a single task failure was amplified into many downstream failures.

amplifier_description string

Explanation of how the cascade amplification occurred.

blast_radius object

The impact scope of the failure.

cofailure_pairs list of objects

Pairs of tasks or Dags that fail together because of a shared resource.

effective_availability double

Actual availability percentage, accounting for retries and partial failures.

reported_success_rate double

The nominal success rate reported by Airflow.

health_gap string

Explanation of the gap between effective availability and reported success rate.

failure_onset_date string

ISO 8601 timestamp of when failures first started.

last_success_date string

ISO 8601 timestamp of the last successful run.

timeline_events list of objects

Chronological sequence of events leading to and during the failure.

change_signals list of objects

Config or deploy changes correlated with the failure onset.

suggested_fix string

Actionable fix, with code snippets or config changes when applicable.

remediation string

Step-by-step remediation instructions.

prevention_measures list of objects

Forward-looking recommendations to prevent recurrence.

vendor_remediation string

Vendor-specific fix guidance, separate from the Airflow-side fix.

vendor_name string

External vendor or service involved, for example Snowflake, S3, or BigQuery.

vendor_destination string

Specific vendor resource target.

vendor_dashboard_url string

URL to a vendor monitoring dashboard for further investigation.

rendered_conn_ids list of strings

Airflow connection IDs involved in the failure.

affected_dags list of strings

All Dag IDs affected by the failure, including downstream.

child_dag_findings list of objects

Findings for downstream or child Dags affected by the failure cascade.

match_count integer

Number of matching failures with the same error signature in the observed window.

coverage_quality string

Depth of analysis performed: deep, moderate, shallow, or partial.

duration_percentiles object

Task duration distribution percentiles, in seconds.

trend_direction string

Trend direction of the failure rate or duration: improving, stable, degrading, or unknown.

change_point_date string

ISO 8601 date when a significant change in behavior was detected.

seasonality_signal string

Description of any seasonal or periodic pattern detected.

fleet_health_score double

Overall Deployment health score, from 0.0 (unhealthy) to 1.0 (healthy).

from_dag_run_id string

Original Dag run ID when the diagnosis was replayed from the cache. Absent on fresh diagnoses.

session_id string

Durable ID of the investigation session. Include it when you contact Astronomer support.
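
Several fields in the schema above have special values that are worth checking before you act on a diagnosis: the insufficient-evidence markers, an empty root_cause_task for a Dag-level issue, and a from_dag_run_id that is absent on fresh diagnoses. The sketch below reads those fields from an already-decoded rca_diagnosis payload and computes the interval between last_success_date and failure_onset_date. It assumes the payload is a plain dictionary keyed by the field names above and that both timestamps use the same offset style. The function names, the summary keys, and the example payload are illustrative only.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Optional


def parse_iso8601(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, tolerating a trailing "Z"."""
    # datetime.fromisoformat accepts "Z" only on Python 3.11+, so normalize it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def summarize(diagnosis: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a few decision-relevant flags from a decoded diagnosis payload."""
    onset = diagnosis.get("failure_onset_date")
    last_ok = diagnosis.get("last_success_date")
    gap: Optional[timedelta] = None
    if onset and last_ok:
        gap = parse_iso8601(onset) - parse_iso8601(last_ok)
    return {
        # Either marker means no retrieved evidence points to a cause.
        "insufficient_evidence": (
            diagnosis.get("evidence_status") == "insufficient_evidence"
            or diagnosis.get("root_cause_type") == "INSUFFICIENT_EVIDENCE"
        ),
        # root_cause_task is empty for a Dag-level issue.
        "dag_level_issue": diagnosis.get("root_cause_task", "") == "",
        # from_dag_run_id is absent on fresh diagnoses.
        "is_replay": diagnosis.get("from_dag_run_id") is not None,
        "onset_after_last_success": gap,
    }


# Made-up payload, for illustration only.
example = {
    "evidence_status": "hypothesis",
    "root_cause_type": "NETWORK_READ_TIMEOUT",
    "root_cause_task": "",
    "last_success_date": "2024-01-01T00:00:00Z",
    "failure_onset_date": "2024-01-01T06:30:00Z",
}
summary = summarize(example)
```

The summary deliberately applies no threshold to confidence, because this page does not define what score should count as actionable.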