**AutoOps-Insight turns noisy CI failures into structured incident intelligence and release decisions.**

FastAPI · Kafka · PostgreSQL · React/Vite
```json
{
  "incident_type": "dns_failure",
  "recurrence": 3,
  "release_decision": "hold_release"
}
```

Full incident record:
```json
{
  "predicted_issue": "timeout",
  "confidence": 0.95,
  "failure_family": "timeout",
  "severity": "high",
  "signature": "timeout:733da8a4e20740af",
  "likely_cause": "operation exceeded threshold or dependency responded too slowly",
  "first_remediation_step": "inspect the exact timed-out operation and compare recent latency trends",
  "probable_owner": "platform-networking",
  "release_blocking": true,
  "recurrence": {
    "total_count": 3,
    "is_recurring": true
  }
}
```

When a CI pipeline fails, an on-call engineer opens a wall of logs and starts guessing.
The raw log tells you what happened last. It does not tell you whether this failure has appeared before, whether something changed near the incident window, whether rollback is worth trying, or who owns the problem. AutoOps encodes those answers.
## Release Risk Summary
- Release risk: HIGH
- Total analyses: 3
- Release-blocking incidents: 3
Top recurring signature:
timeout:733da8a4e20740af | family=timeout | severity=high | count=3
Recommendation:
Repeated failure signatures present. Investigate before promoting build.

- Fleet Health and Root-Cause Report — noisy-service ranking, top recurring signatures, root-cause distribution
- Audit Log Traceability — rule update with actor, timestamp, and before/after diff
- Incident Replay and Test Validation — replayed stored incident with recurrence metadata and passing test run
- Audit Diff and Rollback Preview UI — field-level diff inspection for a rule update
Pipeline: ingest logs → classify incident → fingerprint → correlate changes → surface recurrence → release decision

Every analysis answers:
- What kind of incident is this?
- Is this part of a repeated failure pattern?
- Did something change near this incident window?
- Is rollback worth considering?
- Who should own escalation?
- What should be checked first?
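The final stage of the pipeline, the release decision, reduces to a small amount of logic once each analysis carries a blocking flag and a recurrence count. A minimal sketch under that assumption (the field names and thresholds here are illustrative, not the service's actual schema):

```python
from dataclasses import dataclass

@dataclass
class AnalysisRecord:
    failure_family: str
    severity: str            # "low" | "medium" | "high" | "critical"
    release_blocking: bool
    recurrence_count: int

def release_decision(records: list[AnalysisRecord]) -> str:
    """Hold the release if any incident is blocking or part of a repeated signature."""
    blocking = sum(1 for r in records if r.release_blocking)
    recurring = sum(1 for r in records if r.recurrence_count > 1)
    return "hold_release" if blocking or recurring else "promote"

# Example: three timeout incidents sharing one signature -> hold_release
records = [AnalysisRecord("timeout", "high", True, 3) for _ in range(3)]
print(release_decision(records))
```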
From live ingestion runs:
- 3 persisted incident records
- 2 distinct failure families classified: timeout, dns_failure
- 1 recurring signature detected: dns_failure:818a0911c2c842c0, appearing twice
The recurring signature means AutoOps identified that two separate CI failures were the same underlying infrastructure issue — not two independent problems.
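Signatures of the form family:16-hex-digits suggest a hash of a normalized error message. A rough sketch of how such a fingerprint could be computed, assuming volatile tokens like timestamps, IDs, and counters are stripped before hashing (the normalization rules below are guesses, not the project's actual ones):

```python
import hashlib
import re

def fingerprint(failure_family: str, log_line: str) -> str:
    """Produce a stable signature so identical failures collapse to one fingerprint."""
    normalized = log_line.lower()
    # Drop volatile tokens so two occurrences of the same failure hash identically.
    normalized = re.sub(r"\d{4}-\d{2}-\d{2}[t ]\d{2}:\d{2}:\d{2}\S*", "<ts>", normalized)  # timestamps
    normalized = re.sub(r"0x[0-9a-f]+|\b[0-9a-f]{8,}\b", "<hex>", normalized)              # hex ids
    normalized = re.sub(r"\d+", "<n>", normalized)                                         # counters, ports
    digest = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    return f"{failure_family}:{digest}"

# Two logs that differ only in volatile details map to the same signature.
a = fingerprint("dns_failure", "2024-05-01T10:02:11Z lookup api.internal failed after 3 retries")
b = fingerprint("dns_failure", "2024-05-02T08:40:57Z lookup api.internal failed after 5 retries")
assert a == b
print(a)
```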
| Family | Severity | Release blocking |
|---|---|---|
| timeout | high | yes |
| oom | critical | yes |
| connection_refused | high | yes |
| dns_failure | high | yes |
| tls_handshake | high | yes |
| retry_exhausted | medium | yes |
| crash_loop | critical | yes |
| dependency_unavailable | high | yes |
| flaky_test_signature | medium | context-dependent |
| intermittent_network_flap | medium | context-dependent |
Classification is driven by config/rules.yaml — no backend code changes required to add or tune patterns.
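The exact schema of config/rules.yaml is not shown here, but the idea is a declarative pattern-to-metadata mapping. A hedged sketch of what rule-driven classification might look like (the YAML fields and rule names below are assumptions for illustration):

```python
import re
import yaml  # PyYAML

# Hypothetical rule shape -- the real config/rules.yaml schema may differ.
SAMPLE_RULES = """
rules:
  - id: timeout_rule
    family: timeout
    severity: high
    release_blocking: true
    probable_owner: platform-networking
    patterns:
      - "timed out"
      - "deadline exceeded"
  - id: dns_rule
    family: dns_failure
    severity: high
    release_blocking: true
    probable_owner: platform-networking
    patterns:
      - "no such host"
      - "name resolution failed"
"""

def classify(log_text: str, rules: list[dict]) -> dict | None:
    """Return the first rule whose pattern matches the log, or None for the ML fallback."""
    for rule in rules:
        if any(re.search(p, log_text, re.IGNORECASE) for p in rule["patterns"]):
            return rule
    return None

rules = yaml.safe_load(SAMPLE_RULES)["rules"]
hit = classify("ERROR: request to payments-api timed out after 30s", rules)
print(hit["family"] if hit else "fallback_to_ml")   # -> timeout
```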
Correlating an incident against nearby incidents and audit events within a bounded window:

```json
{
  "incident_id": 1,
  "window_minutes": 60,
  "correlated_incidents": [
    { "id": 2, "signature": "timeout:733da8a4e20740af", "minutes_from_anchor": 12 },
    { "id": 3, "signature": "timeout:733da8a4e20740af", "minutes_from_anchor": 24 }
  ],
  "nearby_audit_events": [
    { "event_type": "rule_update", "actor": "kriti", "minutes_from_anchor": 8 }
  ],
  "correlation_summary": {
    "burst_detected": true,
    "single_family_concentration": true,
    "release_blocking_count": 3,
    "nearby_change_detected": true,
    "rollback_review_suggested": true
  }
}
```
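A rough sketch of the window-based correlation the summary above implies: pull incidents within N minutes of the anchor, then derive burst and single-family signals (thresholds and field names are illustrative, not the service's actual ones):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    id: int
    signature: str
    failure_family: str
    release_blocking: bool
    created_at: datetime

def correlate(anchor: Incident, incidents: list[Incident], window_minutes: int = 60) -> dict:
    """Bounded time-window correlation, not causal tracing."""
    window = timedelta(minutes=window_minutes)
    nearby = [
        i for i in incidents
        if i.id != anchor.id and abs(i.created_at - anchor.created_at) <= window
    ]
    families = {i.failure_family for i in nearby} | {anchor.failure_family}
    return {
        "incident_id": anchor.id,
        "window_minutes": window_minutes,
        "correlated_incidents": [
            {"id": i.id, "signature": i.signature,
             "minutes_from_anchor": int((i.created_at - anchor.created_at).total_seconds() // 60)}
            for i in nearby
        ],
        "correlation_summary": {
            "burst_detected": len(nearby) >= 2,            # illustrative threshold
            "single_family_concentration": len(families) == 1,
            "release_blocking_count": sum(i.release_blocking for i in [anchor, *nearby]),
        },
    }

now = datetime(2024, 5, 1, 10, 0)
anchor = Incident(1, "timeout:733da8a4e20740af", "timeout", True, now)
others = [
    Incident(2, "timeout:733da8a4e20740af", "timeout", True, now + timedelta(minutes=12)),
    Incident(3, "timeout:733da8a4e20740af", "timeout", True, now + timedelta(minutes=24)),
]
print(correlate(anchor, others))
```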
Dry-run rule changes against stored incidents before applying:

```json
{
  "rule_id": "timeout_rule",
  "incidents_evaluated": 3,
  "incidents_impacted": 3,
  "probable_owner_changed": 3,
  "sample_impacted_incidents": [{
    "id": 3,
    "changed_fields": ["probable_owner"],
    "original": { "probable_owner": "service-owner" },
    "simulated": { "probable_owner": "platform-networking" }
  }]
}
```

Rollback preview for an audited rule change:

```json
{
  "audit_event_id": 1,
  "rule_id": "timeout_rule",
  "rollback_updates": { "probable_owner": "service-owner" },
  "impact_preview": { "incidents_evaluated": 3, "incidents_impacted": 3 }
}
```

Generated runbook for a failure family:

```json
{
  "failure_family": "dns_failure",
  "first_checks": [
    "verify DNS resolver reachability from affected hosts",
    "check whether one hostname or zone is disproportionately impacted",
    "compare resolution success rate before and after the incident window"
  ],
  "likely_cause": "resolver misconfiguration, zone propagation delay, or service-discovery change near incident window",
  "rollback_guidance": "roll back only if a recent DNS or service-discovery change correlates strongly with incident window",
  "escalation_route": "service-owner → platform-networking → dns/platform team",
  "mitigation_sequence": [
    "retry resolution from multiple hosts or regions",
    "shift to a known-good endpoint if available",
    "roll back recent DNS/service-discovery change if correlation is strong",
    "escalate with affected hostnames, regions, and timestamps"
  ]
}
```

Rule-based layer — deterministic pattern matching for: timeout, dns_failure, connection_refused, tls_handshake, retry_exhausted, oom, flaky_test_signature, dependency_unavailable, crash_loop, latency_spike
ML fallback — TF-IDF vectorization and Logistic Regression trained on labeled log data (ml_model/log_train.csv). Each analysis record indicates which detection path was used.
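The training step is likely close to a standard scikit-learn text-classification pipeline. A sketch under the assumption that ml_model/log_train.csv has a log-text column and a label column (the column names below are guesses):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumed CSV layout: columns "log" (raw text) and "label" (failure family).
df = pd.read_csv("ml_model/log_train.csv")

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(df["log"], df["label"])

# The fallback path: only consulted when no rule in config/rules.yaml matches.
sample = ["connection reset while calling checkout service"]
prediction = model.predict(sample)[0]
confidence = model.predict_proba(sample).max()
print(prediction, round(float(confidence), 2))
```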
| Before | After AutoOps |
|---|---|
| Read raw logs manually | Classify into concrete failure family |
| Guess likely owner from error strings | Surface probable owner and escalation route |
| Check dashboards separately for timing | Correlate nearby incidents and changes in bounded window |
| Search for nearby deploys by hand | Automated timeline correlation |
| Decide rollback with incomplete context | Fleet-level recurrence and blast-radius signals |
Primary: PostgreSQL with Alembic-managed schema migrations
Fallback: SQLite for local development
```bash
docker run -e POSTGRES_PASSWORD=pass -p 5432:5432 postgres:15
alembic upgrade head
uvicorn main:app --reload
```

| Method | Endpoint | Description |
|---|---|---|
| POST | /analyze | Analyze a log and persist the result |
| GET | /history/recurring | Top recurring signatures |
| GET | /reports/summary | Structured release-risk summary |
| GET | /incident/runbook/{family} | Operator runbook for a failure family |
| GET | /incident/correlate | Correlate incident against nearby changes |
| GET | /fleet/health | Fleet-level health and recurrence view |
| POST | /reporting/export-powerbi | Export Power BI-ready CSV artifacts |
| GET | /metrics | Prometheus counters |
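A quick way to exercise the two most common endpoints from Python, assuming the API is running locally on port 8000. The request payload shape for /analyze is an assumption; the FastAPI docs at /docs are authoritative.

```python
import requests

BASE = "http://localhost:8000"

# Field name "log_text" is a guess at the /analyze request schema.
with open("sample.log") as f:
    resp = requests.post(f"{BASE}/analyze", json={"log_text": f.read()})
resp.raise_for_status()
analysis = resp.json()
print(analysis.get("failure_family"), analysis.get("signature"))

# Structured release-risk summary for the stored analyses.
summary = requests.get(f"{BASE}/reports/summary").json()
print(summary)
```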
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Train model
cd ml_model && python train_model.py && cd ..

# Start API
uvicorn main:app --reload

# Analyze a log
python3 cli.py analyze sample.log

# Generate release risk report
python3 cli.py report

# View fleet health
python3 cli.py fleet-health

# Simulate a rule change
python3 cli.py simulate-rule timeout_rule probable_owner platform-networking

# Start dashboard
cd autoops-ui && npm install && npm run dev
```

The GitHub Actions workflow automatically runs a CLI health check, analyzes sample logs, generates markdown and JSON report artifacts, and uploads the artifacts and SQLite DB for inspection.
On-call triage is a time and context problem. Engineers who have been with a system for two years can glance at a log and know if a failure is new or recurring, which team owns it, and whether rollback is worth trying. That knowledge doesn't transfer and doesn't scale. AutoOps structures it: stable fingerprints replace pattern memory, correlation windows replace manual dashboard-hopping, runbook generation replaces tribal knowledge. The result is faster triage and better release judgment regardless of who is on call.
- Log-based analysis, not real-time metric stream ingestion
- ML model trained on labeled sample data; performance on novel log formats requires retraining
- Correlation is time-window based, not causal trace analysis
- SQLite fallback; PostgreSQL recommended for production use
SRE · Production Engineering · Release Engineering · Internal Developer Tooling · Platform / Infrastructure
Python · FastAPI · React/Vite · PostgreSQL · SQLite · Alembic · scikit-learn · Docker · GitHub Actions
- KubePulse — Kubernetes resilience validation and deployment safety
- Faultline — exactly-once execution under distributed failure
- DetTrace — deterministic replay for concurrency failures
- Postmortem Atlas — historical production outage analysis



