Autonomous database self-healing. For the 2am page that shouldn't have woken you.
AegisDB watches your data, diagnoses what broke, tests the fix in an ephemeral sandbox, asks a human to approve, then writes the whole story to your catalog.
How AegisDB heals
1. Detect
Spot anomalies the moment they surface
2. Diagnose
Trace root cause through signals & context
3. Sandbox
Replay & validate fixes in isolation
4. Propose
Generate a safe, ranked remediation plan
5. Apply
Execute the fix with human sign-off
6. Document
Log every decision to your audit trail
The problem everyone ignores
until it's too late.
Data quality failures do not announce themselves. They accumulate quietly, and by the time someone notices, the damage is already downstream.
Corruption is silent until something breaks.
A NULL creeps into a column that should never hold one. No alert fires. No pipeline fails. A downstream report returns wrong numbers for three days before anyone notices. By then, tracing the root cause is guesswork.
The same bug hits again. Nobody remembered the fix.
The orders.discount column goes out of range for the fourth time this year. The engineer who first patched it is gone. The fix lived in a Slack thread, now buried under months of scroll. The team debugs from scratch, again.
Manual patches leave no trail and no guarantee.
Someone runs an UPDATE directly on production. It works this time. There is no record of what changed, no assertion that it held, no documentation of why the value was wrong. The next incident starts with the same confusion.
Six stages. Fully automated.
One human decision.
Every stage is an independent worker consuming from a Redis Stream. No agent calls another directly — failures are isolated, retries are scoped, and the audit trail is complete.
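As a sketch of that pattern, one stage worker might look like the loop below, using redis-py consumer groups; the stream, group, and payload field names are illustrative, not AegisDB's actual ones.

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

STREAM_IN, STREAM_OUT, GROUP = "aegis.detect", "aegis.diagnose", "diagnose-workers"

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM_IN, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass

while True:
    # Block up to 5s waiting for one new message addressed to this group.
    entries = r.xreadgroup(GROUP, "worker-1", {STREAM_IN: ">"}, count=1, block=5000)
    for _, messages in entries or []:
        for msg_id, fields in messages:
            event = json.loads(fields["payload"])
            result = {"incident_id": event["incident_id"], "stage": "diagnosed"}
            # Hand off to the next stage via its stream, never by direct call,
            # then acknowledge so a retry never reprocesses this message.
            r.xadd(STREAM_OUT, {"payload": json.dumps(result)})
            r.xack(STREAM_IN, GROUP, msg_id)
```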
Detect
A data quality test fails in OpenMetadata. NULL violation, range breach, uniqueness error, referential integrity break, format mismatch. The webhook hits FastAPI instantly. No polling. No cron. The pipeline starts the moment the failure is confirmed.
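A minimal sketch of that ingest path, assuming a single FastAPI route that enqueues the raw webhook payload; the route path and stream name are assumptions.

```python
import json

import redis
from fastapi import FastAPI, Request

app = FastAPI()
r = redis.Redis(decode_responses=True)

@app.post("/webhooks/openmetadata")
async def on_test_failure(request: Request):
    # No polling, no cron: the alert payload is enqueued the moment it
    # arrives, and diagnosis happens asynchronously downstream.
    body = await request.json()
    r.xadd("aegis.detect", {"payload": json.dumps(body)})
    return {"queued": True}
```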
Diagnose
A FastAPI worker pulls the failure from the Redis stream and calls LLaMA 3.3 70B via Groq. The prompt includes the failure payload plus the top 3 similar past fixes retrieved from ChromaDB using vector similarity. The model returns candidate SQL and a confidence score. Below the threshold, it escalates. Above it, the pipeline moves forward.
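A sketch of that step, assuming the Groq Python SDK and a ChromaDB collection of past fixes; the prompt shape, collection name, and 0.75 threshold are illustrative.

```python
import json

import chromadb
from groq import Groq

CONFIDENCE_THRESHOLD = 0.75  # assumed value; below this, escalate to a human

chroma = chromadb.HttpClient(host="localhost", port=8000)
fixes = chroma.get_or_create_collection("past_fixes")
llm = Groq()  # reads GROQ_API_KEY from the environment

def diagnose(failure: dict) -> dict:
    # Ground the prompt in the top 3 most similar past fixes.
    similar = fixes.query(query_texts=[json.dumps(failure)], n_results=3)
    prompt = (
        f"Failure payload:\n{json.dumps(failure)}\n\n"
        f"Similar past fixes:\n{similar['documents'][0]}\n\n"
        'Respond with JSON: {"sql": "...", "confidence": 0.0-1.0}'
    )
    resp = llm.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    candidate = json.loads(resp.choices[0].message.content)
    if candidate["confidence"] < CONFIDENCE_THRESHOLD:
        raise RuntimeError("confidence below threshold: escalate")
    return candidate
```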
Sandbox
The fix never touches production first. An ephemeral Postgres container spins up via testcontainers, clones the schema, seeds up to 500 real rows, and runs the SQL. Three retries with adjusted prompts on failure. The container is destroyed after the test. Nothing persists.
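A sketch of that lifecycle with testcontainers; the schema DDL and seed SQL are passed in as plain strings here, a simplification of the real cloning and sampling logic.

```python
from sqlalchemy import create_engine, text
from testcontainers.postgres import PostgresContainer

MAX_RETRIES = 3

def validate_in_sandbox(fix_sql: str, schema_ddl: str, seed_sql: str) -> bool:
    for _ in range(MAX_RETRIES):
        # The context manager guarantees the container is destroyed on exit.
        with PostgresContainer("postgres:16") as pg:
            engine = create_engine(pg.get_connection_url())
            with engine.begin() as conn:
                conn.execute(text(schema_ddl))  # clone the target schema
                conn.execute(text(seed_sql))    # seed up to 500 real rows
            try:
                with engine.begin() as conn:
                    conn.execute(text(fix_sql))
                return True  # the fix survived a realistic replica
            except Exception:
                continue  # upstream, the prompt is adjusted before retrying
    return False
```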
Propose
A Proposal record is written to Postgres and surfaced in two places at once: the Next.js operator console and a live Slack card. Operators can approve, reject with a reason, or let it expire. Dry-run mode is togglable at runtime without a redeploy. No fix ever reaches production without an explicit human decision.
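A sketch of the dual surfacing, assuming a proposals table and a Slack incoming webhook; the table shape, status values, and URL are placeholders.

```python
import uuid

import requests
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@localhost/aegis")
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder URL

def propose(incident_id: str, fix_sql: str, confidence: float) -> str:
    proposal_id = str(uuid.uuid4())
    # One record in Postgres is the source of truth for both surfaces.
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO proposals (id, incident_id, fix_sql, confidence, status) "
                 "VALUES (:id, :inc, :sql, :conf, 'pending')"),
            {"id": proposal_id, "inc": incident_id,
             "sql": fix_sql, "conf": confidence},
        )
    # Second surface: a live Slack card pointing at the same proposal.
    requests.post(SLACK_WEBHOOK, json={
        "text": f"Proposal {proposal_id}: `{fix_sql}` (confidence {confidence:.2f})",
    })
    return proposal_id
```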
Apply
On approval, the fix runs inside an explicit transaction on production. Post-apply assertions execute immediately after. If any assertion fails, the transaction rolls back automatically and the incident escalates. Every apply, rollback, and assertion result is written to an append-only audit log via Redis Streams in real time.
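A sketch of the transactional apply, assuming each post-apply assertion is a query that SELECTs a boolean; the audit stream name is illustrative.

```python
import json
import time

import redis
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@prod-db/app")
audit = redis.Redis(decode_responses=True)

def audit_log(event: str, **fields) -> None:
    # Append-only trail: every apply, rollback, and assertion result.
    audit.xadd("aegis.audit", {"event": event, "ts": str(time.time()),
                               "payload": json.dumps(fields)})

def apply_fix(proposal_id: str, fix_sql: str, assertions: list[str]) -> bool:
    try:
        with engine.begin() as conn:  # explicit transaction; commits only on clean exit
            conn.execute(text(fix_sql))
            for assertion in assertions:
                if not conn.execute(text(assertion)).scalar():
                    raise AssertionError(assertion)
        audit_log("applied", proposal_id=proposal_id)
        return True
    except Exception as exc:
        # engine.begin() already rolled back when the exception propagated.
        audit_log("rolled_back", proposal_id=proposal_id, reason=str(exc))
        return False  # caller escalates the incident
```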
Document
After a successful apply, a FixReport is written to Postgres. The affected column gets annotated in OpenMetadata and tagged AegisDB.healed. The Slack card updates with deep links to the proposal and audit entry. The fix embedding is stored back in ChromaDB so future diagnoses can learn from this outcome.
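A sketch of the write-back, with the OpenMetadata annotation left as a stub since its SDK call depends on the deployment; the table and collection names are assumptions.

```python
import json
import uuid

import chromadb
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@localhost/aegis")
fixes = chromadb.HttpClient(host="localhost", port=8000).get_or_create_collection("past_fixes")

def tag_column_healed(column_fqn: str) -> None:
    """Stub: annotate the column in OpenMetadata and tag it AegisDB.healed.
    The actual SDK call is omitted here."""

def document_fix(incident: dict, fix_sql: str, proposal_id: str) -> None:
    report_id = str(uuid.uuid4())
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO fix_reports (id, proposal_id, report) "
                 "VALUES (:id, :pid, :r)"),
            {"id": report_id, "pid": proposal_id, "r": json.dumps(incident)},
        )
    tag_column_healed(incident["column_fqn"])
    # Store the outcome so future RAG retrievals can cite it as precedent.
    fixes.add(
        ids=[report_id],
        documents=[json.dumps({"failure": incident, "fix_sql": fix_sql,
                               "outcome": "applied"})],
        metadatas=[{"column_fqn": incident["column_fqn"],
                    "anomaly_type": incident["anomaly_type"]}],
    )
```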
Three surfaces.
One source of truth.
See every incident, in flight and resolved.
Live pipeline, queued approvals, agent traces. Built for the engineers who actually fix things.
Talk to it like a teammate.
Not another notification firehose. Ask why, push back, request a different fix.
Workspace
Every fix ever applied. Searchable. Linkable.
Not just automation.
A system that learns.
The pipeline gets better with every fix applied and every proposal rejected. Context accumulates. Mistakes don't repeat.
RAG Knowledge Base
Every diagnosis is grounded in what actually worked before.
Each query embeds the anomaly context with all-MiniLM-L6-v2 and retrieves the top-3 most similar past fixes from ChromaDB via cosine similarity. The LLM never generates cold — it always has precedent.
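A sketch of that retrieval path; the collection name and setup are assumptions, but the all-MiniLM-L6-v2 embedding and cosine ranking match the design described above.

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embed = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
client = chromadb.HttpClient(host="localhost", port=8000)
fixes = client.get_or_create_collection(
    "past_fixes",
    embedding_function=embed,
    metadata={"hnsw:space": "cosine"},  # rank neighbors by cosine similarity
)

def retrieve_precedent(anomaly_context: str) -> list[str]:
    # The top-3 most similar past fixes become context for the LLM prompt.
    result = fixes.query(query_texts=[anomaly_context], n_results=3)
    return result["documents"][0]
```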
Sandbox Safety
No fix touches production before it survives a realistic replica.
An ephemeral Postgres container spins up via testcontainers, clones the target schema, seeds up to 500 rows from production, and runs the fix SQL. Three retries with adjusted prompts on failure. The container is destroyed immediately after — nothing leaks.
Recurrence Detection
The system knows when a failure isn't new — and escalates accordingly.
Every anomaly is keyed by the compound (column_fqn × anomaly_type). Each new occurrence increments a counter stored in Postgres. Recurrence count is surfaced on the proposal card and Slack message so operators can see if they're looking at a first-time glitch or a structural problem.
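A sketch of that counter, assuming a recurrences table keyed on the compound; the upsert returns the new count for display on the proposal card.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@localhost/aegis")

UPSERT = text("""
    INSERT INTO recurrences (column_fqn, anomaly_type, count)
    VALUES (:fqn, :atype, 1)
    ON CONFLICT (column_fqn, anomaly_type)
    DO UPDATE SET count = recurrences.count + 1
    RETURNING count
""")

def record_occurrence(column_fqn: str, anomaly_type: str) -> int:
    # One row per (column_fqn, anomaly_type); a count above 1 means the
    # failure is structural, not a first-time glitch.
    with engine.begin() as conn:
        return conn.execute(UPSERT, {"fqn": column_fqn, "atype": anomaly_type}).scalar()
```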
Learning from Rejections
A rejected fix isn't wasted — it makes the next diagnosis sharper.
When an operator rejects a proposal, the rejection reason is embedded and written back into ChromaDB alongside the original fix candidate. Future RAG retrievals will surface this pair, steering the LLM away from approaches that a human already ruled out.
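A sketch of that write-back, assuming rejections live in the same ChromaDB collection as applied fixes so a single retrieval surfaces both; the IDs and metadata keys are illustrative.

```python
import json
import uuid

import chromadb

fixes = chromadb.HttpClient(host="localhost", port=8000).get_or_create_collection("past_fixes")

def record_rejection(proposal_id: str, fix_sql: str, reason: str) -> None:
    # The rejected SQL and the human's reason are embedded together, so
    # future retrievals steer the LLM away from this approach.
    fixes.add(
        ids=[f"rejection-{uuid.uuid4()}"],
        documents=[json.dumps({"fix_sql": fix_sql, "rejection_reason": reason})],
        metadatas=[{"proposal_id": proposal_id, "outcome": "rejected"}],
    )
```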
The console, in motion.
Purpose-built from
the right primitives.
Every technology chosen because it was the right tool — not because it was familiar. Event-driven end to end. No polling. No shared mutable state between agents.
