v0.1 · Early Access · Open Source

Autonomous database self-healing. For the 2am page that shouldn't have woken you.

AegisDB watches your data, diagnoses what broke, tests the fix in an ephemeral sandbox, asks a human to approve, then writes the whole story to your catalog.

How it works
Every fix documented
Every decision auditable
Built for trust & control
Safe by default · Human in the loop · Audit trail · Open source

Live · How AegisDB heals

1. Detect

Spot anomalies the moment they surface

2. Diagnose

Trace root cause through signals & context

3. Sandbox

Replay & validate fixes in isolation

4. Propose

Generate a safe, ranked remediation plan

5. Apply

Execute the fix with human sign-off

6. Document

Log every decision to your audit trail

The problem

The problem everyone ignores
until it's too late.

Data quality failures do not announce themselves. They accumulate quietly, and by the time someone notices, the damage is already downstream.

Detection gap

Corruption is silent until something breaks.

A NULL creeps into a non-nullable column. No alert fires. No pipeline fails. A downstream report returns wrong numbers for three days before anyone notices. By then, tracing the root cause is guesswork.

Institutional amnesia

The same bug hits again. Nobody remembered the fix.

The orders.discount column goes out of range for the fourth time this year. The engineer who patched it first is gone. The fix lived in a Slack thread, now buried under months of scroll. The team debugs from scratch, again.

Audit void

Manual patches leave no trail and no guarantee.

Someone runs an UPDATE directly on production. It works this time. There is no record of what changed, no assertion that it held, no documentation of why the value was wrong. The next incident starts with the same confusion.

How it works

Six stages. Fully automated.
One human decision.

Every stage is an independent worker consuming from a Redis Stream. No agent calls another directly — failures are isolated, retries are scoped, and the audit trail is complete.
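As an illustration of that topology, a single stage can be sketched as a consumer-group worker. This is a minimal sketch using redis-py; the stream names, group name, and `handle` body are assumptions for illustration, not AegisDB's actual code:

```python
import json

STREAM = "aegisdb:detect"        # this stage's input stream (name assumed)
GROUP = "diagnose-workers"       # this stage's consumer group (assumed)
OUT_STREAM = "aegisdb:diagnose"  # next stage's stream (assumed)

def handle(event: dict) -> dict:
    """Stage-specific logic; returns the event to hand to the next stage."""
    event = dict(event)
    event["stage"] = "diagnose"
    return event

def run_worker(consumer_name: str) -> None:
    import redis  # third-party client, imported lazily so the sketch stays importable

    r = redis.Redis()
    try:
        # Create the consumer group once; ignore "group already exists".
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except redis.ResponseError:
        pass
    while True:
        # Block up to 5s for one unclaimed entry addressed to this group.
        for _stream, messages in r.xreadgroup(
            GROUP, consumer_name, {STREAM: ">"}, count=1, block=5000
        ):
            for msg_id, fields in messages:
                result = handle(json.loads(fields[b"event"]))
                # Hand off to the next stage, then ack. A worker that dies
                # mid-handle leaves the entry pending, so retries stay scoped
                # to this stage: no agent ever calls another directly.
                r.xadd(OUT_STREAM, {"event": json.dumps(result)})
                r.xack(STREAM, GROUP, msg_id)
```

Acking only after the hand-off succeeds is what keeps failures isolated to a single stage.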

OpenMetadata
FastAPI
Redis
LLM · Groq
Sandbox
Human Gate
Production
Stage 1 · Detect
1

Detect

A data quality test fails in OpenMetadata. NULL violation, range breach, uniqueness error, referential integrity break, format mismatch. The webhook hits FastAPI instantly. No polling. No cron. The pipeline starts the moment the failure is confirmed.

FastAPI · OpenMetadata · Redis Streams · Pydantic
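To make the intake concrete, here is a hypothetical sketch of the normalization step. `AnomalyEvent` and the payload keys (`entityFqn`, `testType`, `testName`) are illustrative placeholders, not OpenMetadata's real webhook schema:

```python
from dataclasses import dataclass

@dataclass
class AnomalyEvent:
    column_fqn: str    # e.g. "shop.public.orders.discount"
    anomaly_type: str  # e.g. "null_violation", "range_breach"
    test_name: str

def parse_webhook(body: dict) -> AnomalyEvent:
    """Normalize a test-failure webhook into the pipeline's event shape.
    The keys below are placeholders, not OpenMetadata's actual payload."""
    return AnomalyEvent(
        column_fqn=body["entityFqn"],
        anomaly_type=body["testType"],
        test_name=body["testName"],
    )

# In the FastAPI app this would sit behind something like:
#   @app.post("/webhooks/openmetadata")
#   async def receive(body: dict):
#       event = parse_webhook(body)
#       ...xadd the event onto the detect stream...
```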
2

Diagnose

A FastAPI worker pulls the failure from the Redis stream and calls LLaMA 3.3 70B via Groq. The prompt includes the failure payload plus the top 3 similar past fixes retrieved from ChromaDB using vector similarity. The model returns candidate SQL and a confidence score. Below the threshold, it escalates. Above it, the pipeline moves forward.

Groq · LLaMA 3.3 70B · ChromaDB · sentence-transformers
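A sketch of the retrieval-and-gating step. The prompt wording, the `CONFIDENCE_THRESHOLD` value, and the ChromaDB collection name and path are all assumptions for illustration:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value; the real threshold is configurable

def build_prompt(failure: dict, similar_fixes: list) -> str:
    """Combine the failure payload with retrieved precedent (top 3 from
    ChromaDB); the prompt structure here is illustrative."""
    lines = [
        f"Failure: {failure['anomaly_type']} on {failure['column_fqn']}",
        "Similar past fixes:",
    ]
    for i, fix in enumerate(similar_fixes[:3], start=1):
        lines.append(f"  {i}. {fix['summary']} -> {fix['sql']}")
    lines.append("Return candidate SQL and a confidence score in [0, 1].")
    return "\n".join(lines)

def should_escalate(confidence: float) -> bool:
    """Below the threshold the incident goes to a human instead of onward."""
    return confidence < CONFIDENCE_THRESHOLD

def retrieve_similar(query_text: str, k: int = 3) -> list:
    import chromadb  # lazy import: third-party vector store client
    coll = chromadb.PersistentClient(path="./aegis_kb").get_or_create_collection("fixes")
    res = coll.query(query_texts=[query_text], n_results=k)
    return list(res["metadatas"][0])
```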
3

Sandbox

The fix never touches production first. An ephemeral Postgres container spins up via testcontainers, clones the schema, seeds up to 500 real rows, and runs the SQL. Three retries with adjusted prompts on failure. The container is destroyed after the test. Nothing persists.

testcontainers · asyncpg · Docker · SQLAlchemy async
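The sandbox-and-retry flow might look like the following sketch. The container image, helper names, and DDL plumbing are illustrative, and running `validate_in_sandbox` requires Docker:

```python
def validate_in_sandbox(fix_sql: str, schema_ddl: str, seed_sql: str) -> bool:
    """Spin up a throwaway Postgres, apply the schema and the seeded rows
    (up to 500 in practice), then run the candidate fix. The container is
    destroyed when the `with` block exits, so nothing persists."""
    from testcontainers.postgres import PostgresContainer  # lazy: needs Docker
    import sqlalchemy

    with PostgresContainer("postgres:16") as pg:
        engine = sqlalchemy.create_engine(pg.get_connection_url())
        with engine.begin() as conn:
            conn.exec_driver_sql(schema_ddl)
            conn.exec_driver_sql(seed_sql)
            conn.exec_driver_sql(fix_sql)
    return True

def run_with_retries(attempt_fn, adjust_prompt_fn, max_attempts: int = 3):
    """Three attempts, adjusting the prompt between failures (per the text)."""
    last_err = None
    prompt = None
    for attempt in range(max_attempts):
        try:
            return attempt_fn(prompt)
        except Exception as err:
            last_err = err
            prompt = adjust_prompt_fn(attempt, err)
    raise RuntimeError(f"sandbox validation failed after {max_attempts} attempts") from last_err
```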
4

Propose

A Proposal record is written to Postgres and surfaced in two places at once: the Next.js operator console and a live Slack card. Operators can approve, reject with a reason, or let it expire. Dry-run mode is togglable at runtime without a redeploy. No fix ever reaches production without an explicit human decision.

Next.js · Slack SDK · SQLAlchemy · Zustand
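A hypothetical shape for the Proposal record and its decision rules. The field names and the four-hour expiry are assumptions, not the actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class Proposal:
    """Record surfaced to the operator console and Slack; fields assumed."""
    incident_id: str
    fix_sql: str
    confidence: float
    status: str = "pending"
    reason: Optional[str] = None
    expires_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc) + timedelta(hours=4)
    )

    def decide(self, decision: str, reason: Optional[str] = None) -> None:
        if decision not in {"approved", "rejected"}:
            raise ValueError(f"unknown decision: {decision}")
        if decision == "rejected" and not reason:
            # The console requires a reason; it also feeds the learning loop.
            raise ValueError("rejections must carry a reason")
        self.status = decision
        self.reason = reason

    def expire_if_due(self, now: datetime) -> bool:
        """Unanswered proposals lapse instead of applying silently."""
        if self.status == "pending" and now >= self.expires_at:
            self.status = "expired"
        return self.status == "expired"
```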
5

Apply

On approval, the fix runs inside an explicit transaction on production. Post-apply assertions execute immediately after. If any assertion fails, the transaction rolls back automatically and the incident escalates. Every apply, rollback, and assertion result is written to an append-only audit log via Redis Streams in real time.

asyncpg · SQLAlchemy async · Redis Streams
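A minimal sketch of the apply step, assuming each post-apply assertion is a query returning a single boolean. Raising inside asyncpg's `transaction()` block aborts the transaction, which rolls back the fix:

```python
async def apply_fix(conn, fix_sql: str, assertions: list) -> None:
    """Run the approved fix plus its post-apply assertions in one explicit
    transaction. `conn` is an asyncpg connection; any failing assertion
    raises, which rolls the whole transaction back."""
    async with conn.transaction():
        await conn.execute(fix_sql)
        for check_sql in assertions:
            # Each assertion query is assumed to return a single boolean.
            if not await conn.fetchval(check_sql):
                raise AssertionError(f"post-apply assertion failed: {check_sql}")
```

The audit-log write and escalation would hang off the success and failure paths respectively.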
6

Document

After a successful apply, a FixReport is written to Postgres. The affected column gets annotated in OpenMetadata and tagged AegisDB.healed. The Slack card updates with deep links to the proposal and audit entry. The fix embedding is stored back in ChromaDB so future diagnoses can learn from this outcome.

OpenMetadata · Slack SDK · ChromaDB · asyncpg
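A sketch of the closing bookkeeping; the report keys and the ChromaDB collection name are illustrative:

```python
def fix_report(event: dict, fix_sql: str, audit_id: str) -> dict:
    """Record written after a successful apply; keys are illustrative."""
    return {
        "column_fqn": event["column_fqn"],
        "anomaly_type": event["anomaly_type"],
        "fix_sql": fix_sql,
        "audit_id": audit_id,
        "tag": "AegisDB.healed",  # annotation applied to the column in OpenMetadata
    }

def remember_fix(report: dict) -> None:
    import chromadb  # lazy import: third-party vector store client
    coll = chromadb.PersistentClient(path="./aegis_kb").get_or_create_collection("fixes")
    # Store the outcome so future diagnoses can retrieve it as precedent.
    coll.add(
        ids=[report["audit_id"]],
        documents=[f"{report['anomaly_type']} on {report['column_fqn']}"],
        metadatas=[{"sql": report["fix_sql"], "outcome": "applied"}],
    )
```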
Where you live with it

Three surfaces.
One source of truth.

Dashboard

See every incident, in flight and resolved.

Live pipeline, queued approvals, agent traces. Built for the engineers who actually fix things.

aegisdb.app/incidents
Incidents
Pipeline
Approvals
Agents
Catalog
Active incidents
MTTR
4m 12s
AUTO-FIXED
73%
RECURRING
2.1%
INC-4821 · null spike · users.email
approved
2m
INC-4820 · schema drift · orders.amount type
in review
11m
INC-4819 · freshness lag · etl.daily_revenue
resolved
44m
Slack · Conversational

Talk to it like a teammate.

Not another notification firehose. Ask why, push back, request a different fix.

aegisdb.slack.com / #aegis-ops

Workspace

# general
# aegis-ops
# data-eng
# on-call
aegis-ops · data quality alerts
watching · aegisdb:slack stream
Incident Timeline · With Recurrence
412 fixes · 9 recurring patterns

Every fix ever applied. Searchable. Linkable.

TODAY · 14:02 · users.email null cascade · auto-patch · approved by @furyfist · 1st seen
YESTERDAY · 18:42 · form_submissions schema drift · column type fix · recurrence ×3
2D AGO · 09:11 · etl.daily_revenue freshness · upstream backfill · 1st seen
5D AGO · 23:50 · sessions_idx duplicate keys · index rebuild · recurrence ×2
The intelligence

Not just automation.
A system that learns.

The pipeline gets better with every fix applied and every proposal rejected. Context accumulates. Mistakes don't repeat.

RAG Knowledge Base

Every diagnosis is grounded in what actually worked before.

Each query embeds the anomaly context with all-MiniLM-L6-v2 and retrieves the top-3 most similar past fixes from ChromaDB via cosine similarity. The LLM never generates cold — it always has precedent.
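For intuition, the cosine similarity used for retrieval reduces to the function below; the `embed` helper is a hypothetical wrapper around sentence-transformers (in practice ChromaDB computes the similarity itself):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means
    identical direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(texts: list) -> list:
    """Hypothetical wrapper; downloads the model on first use."""
    from sentence_transformers import SentenceTransformer  # lazy: heavy import
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts).tolist()
```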

Sandbox Safety

No fix touches production before it survives a realistic replica.

An ephemeral Postgres container spins up via testcontainers, clones the target schema, seeds up to 500 rows from production, and runs the fix SQL. Three retries with adjusted prompts on failure. The container is destroyed immediately after — nothing leaks.

Recurrence Detection

The system knows when a failure isn't new — and escalates accordingly.

Every anomaly is keyed by the compound (column_fqn × anomaly_type). Each new occurrence increments a counter stored in Postgres. Recurrence count is surfaced on the proposal card and Slack message so operators can see if they're looking at a first-time glitch or a structural problem.
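One plausible way to keep that counter is a single Postgres upsert keyed on the compound; the table and column names here are assumptions:

```python
# Assumed table: recurrence(column_fqn, anomaly_type, count, last_seen)
# with a unique constraint on (column_fqn, anomaly_type).
RECURRENCE_UPSERT = """
INSERT INTO recurrence (column_fqn, anomaly_type, count, last_seen)
VALUES ($1, $2, 1, now())
ON CONFLICT (column_fqn, anomaly_type)
DO UPDATE SET count = recurrence.count + 1, last_seen = now()
RETURNING count;
"""

def recurrence_key(column_fqn: str, anomaly_type: str) -> str:
    """The compound key the text describes: (column_fqn x anomaly_type)."""
    return f"{column_fqn}:{anomaly_type}"
```

The `RETURNING count` gives the worker the recurrence number to show on the proposal card without a second query.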

Learning from Rejections

A rejected fix isn't wasted — it makes the next diagnosis sharper.

When an operator rejects a proposal, the rejection reason is embedded and written back into ChromaDB alongside the original fix candidate. Future RAG retrievals will surface this pair, steering the LLM away from approaches that a human already ruled out.
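A sketch of the record written back on rejection; the document/metadata shape is assumed, not the actual ChromaDB schema AegisDB uses:

```python
def rejection_record(proposal: dict, reason: str) -> dict:
    """Pair the rejected SQL with the operator's reason so future RAG
    retrievals surface both; keys here are illustrative."""
    return {
        "document": f"{proposal['anomaly_type']} on {proposal['column_fqn']}",
        "metadata": {
            "sql": proposal["fix_sql"],
            "outcome": "rejected",
            "reason": reason,
        },
    }
```

Because the outcome is stored alongside the SQL, a retrieved precedent tells the model not just what was tried, but whether a human accepted it.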

Live Preview

The console, in motion.

aegisdb.app/console

Watching

employees.region
ok
customers.region
ok
orders.amount
alert
customers.email
ok

↳ Stream · events.aegis

Tech stack

Purpose-built from
the right primitives.

Every technology chosen because it was the right tool — not because it was familiar. Event-driven end to end. No polling. No shared mutable state between agents.

Backend
FastAPI
asyncpg
SQLAlchemy async
Pydantic
Docker Compose
AI / ML
Groq LLaMA 3.3 70B
ChromaDB
sentence-transformers
all-MiniLM-L6-v2
Infrastructure
Redis Streams
testcontainers
OpenMetadata
Slack SDK
Frontend
Next.js 16
TypeScript
Tailwind CSS
React Query
Zustand

· Hover any pill for usage context