Autonomous database self-healing. For the 2am page that shouldn't have woken you.
AegisDB watches your data, diagnoses what broke, tests the fix in an ephemeral sandbox, asks a human to approve, then writes the whole story to your catalog.
How AegisDB heals
1. Detect
Spot anomalies the moment they surface
2. Diagnose
Trace root cause through signals & context
3. Sandbox
Replay & validate fixes in isolation
4. Propose
Generate a safe, ranked remediation plan
5. Apply
Execute the fix with human sign-off
6. Document
Log every decision to your audit trail
The problem everyone ignores
until it's too late.
Data quality failures do not announce themselves. They accumulate quietly, and by the time someone notices, the damage is already downstream.
Corruption is silent until something breaks.
A NULL creeps into a column that should never hold one. No alert fires. No pipeline fails. A downstream report returns wrong numbers for three days before anyone notices. By then, tracing the root cause is guesswork.
The same bug hits again. Nobody remembered the fix.
The orders.discount column goes out of range for the fourth time this year. The engineer who first patched it is gone. The fix lived in a Slack thread, now buried under months of scroll. The team debugs from scratch, again.
Manual patches leave no trail and no guarantee.
Someone runs an UPDATE directly on production. It works this time. There is no record of what changed, no assertion that it held, no documentation of why the value was wrong. The next incident starts with the same confusion.
Six stages. Fully automated.
One human decision.
Every stage is an independent worker consuming from a Redis Stream. No agent calls another directly — failures are isolated, retries are scoped, and the audit trail is complete.
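As a sketch of that pattern, one stage worker might look like the loop below, using redis-py consumer groups; the stream, group, and payload field names are illustrative, not AegisDB's actual ones.

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

STREAM_IN, STREAM_OUT, GROUP = "aegis.detect", "aegis.diagnose", "diagnose-workers"

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM_IN, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass

while True:
    # Block up to 5s waiting for one new message addressed to this group.
    entries = r.xreadgroup(GROUP, "worker-1", {STREAM_IN: ">"}, count=1, block=5000)
    for _, messages in entries or []:
        for msg_id, fields in messages:
            event = json.loads(fields["payload"])
            result = {"incident_id": event["incident_id"], "stage": "diagnosed"}
            # Hand off to the next stage via its stream, never by direct call,
            # then acknowledge so a retry never reprocesses this message.
            r.xadd(STREAM_OUT, {"payload": json.dumps(result)})
            r.xack(STREAM_IN, GROUP, msg_id)
```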
Detect
A data quality test fails in OpenMetadata. NULL violation, range breach, uniqueness error, referential integrity break, format mismatch. The webhook hits FastAPI instantly. No polling. No cron. The pipeline starts the moment the failure is confirmed.
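A minimal sketch of that ingest path, assuming a single FastAPI route that enqueues the raw webhook payload; the route path and stream name are assumptions.

```python
import json

import redis
from fastapi import FastAPI, Request

app = FastAPI()
r = redis.Redis(decode_responses=True)

@app.post("/webhooks/openmetadata")
async def on_test_failure(request: Request):
    # No polling, no cron: the alert payload is enqueued the moment it
    # arrives, and diagnosis happens asynchronously downstream.
    body = await request.json()
    r.xadd("aegis.detect", {"payload": json.dumps(body)})
    return {"queued": True}
```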
Diagnose
A FastAPI worker pulls the failure from the Redis stream and calls LLaMA 3.3 70B via Groq. The prompt includes the failure payload plus the top 3 similar past fixes retrieved from ChromaDB using vector similarity. The model returns candidate SQL and a confidence score. Below the threshold, it escalates. Above it, the pipeline moves forward.
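A sketch of that step, assuming the Groq Python SDK and a ChromaDB collection of past fixes; the prompt shape, collection name, and 0.75 threshold are illustrative.

```python
import json

import chromadb
from groq import Groq

CONFIDENCE_THRESHOLD = 0.75  # assumed value; below this, escalate to a human

chroma = chromadb.HttpClient(host="localhost", port=8000)
fixes = chroma.get_or_create_collection("past_fixes")
llm = Groq()  # reads GROQ_API_KEY from the environment

def diagnose(failure: dict) -> dict:
    # Ground the prompt in the top 3 most similar past fixes.
    similar = fixes.query(query_texts=[json.dumps(failure)], n_results=3)
    prompt = (
        f"Failure payload:\n{json.dumps(failure)}\n\n"
        f"Similar past fixes:\n{similar['documents'][0]}\n\n"
        'Respond with JSON: {"sql": "...", "confidence": 0.0-1.0}'
    )
    resp = llm.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    candidate = json.loads(resp.choices[0].message.content)
    if candidate["confidence"] < CONFIDENCE_THRESHOLD:
        raise RuntimeError("confidence below threshold: escalate")
    return candidate
```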
Sandbox
The fix never touches production first. An ephemeral Postgres container spins up via testcontainers, clones the schema, seeds up to 500 real rows, and runs the SQL. Three retries with adjusted prompts on failure. The container is destroyed after the test. Nothing persists.
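A sketch of that lifecycle with testcontainers; the schema DDL and seed SQL are passed in as plain strings here, a simplification of the real cloning and sampling logic.

```python
from sqlalchemy import create_engine, text
from testcontainers.postgres import PostgresContainer

MAX_RETRIES = 3

def validate_in_sandbox(fix_sql: str, schema_ddl: str, seed_sql: str) -> bool:
    for _ in range(MAX_RETRIES):
        # The context manager guarantees the container is destroyed on exit.
        with PostgresContainer("postgres:16") as pg:
            engine = create_engine(pg.get_connection_url())
            with engine.begin() as conn:
                conn.execute(text(schema_ddl))  # clone the target schema
                conn.execute(text(seed_sql))    # seed up to 500 real rows
            try:
                with engine.begin() as conn:
                    conn.execute(text(fix_sql))
                return True  # the fix survived a realistic replica
            except Exception:
                continue  # upstream, the prompt is adjusted before retrying
    return False
```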
Propose
A Proposal record is written to Postgres and surfaced in two places at once: the Next.js operator console and a live Slack card. Operators can approve, reject with a reason, or let it expire. Dry-run mode is togglable at runtime without a redeploy. No fix ever reaches production without an explicit human decision.
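A sketch of the dual surfacing, assuming a proposals table and a Slack incoming webhook; the table shape, status values, and URL are placeholders.

```python
import uuid

import requests
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@localhost/aegis")
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder URL

def propose(incident_id: str, fix_sql: str, confidence: float) -> str:
    proposal_id = str(uuid.uuid4())
    # One record in Postgres is the source of truth for both surfaces.
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO proposals (id, incident_id, fix_sql, confidence, status) "
                 "VALUES (:id, :inc, :sql, :conf, 'pending')"),
            {"id": proposal_id, "inc": incident_id,
             "sql": fix_sql, "conf": confidence},
        )
    # Second surface: a live Slack card pointing at the same proposal.
    requests.post(SLACK_WEBHOOK, json={
        "text": f"Proposal {proposal_id}: `{fix_sql}` (confidence {confidence:.2f})",
    })
    return proposal_id
```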
Apply
On approval, the fix runs inside an explicit transaction on production. Post-apply assertions execute immediately after. If any assertion fails, the transaction rolls back automatically and the incident escalates. Every apply, rollback, and assertion result is written to an append-only audit log via Redis Streams in real time.
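A sketch of the transactional apply, assuming each post-apply assertion is a query that SELECTs a boolean; the audit stream name is illustrative.

```python
import json
import time

import redis
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@prod-db/app")
audit = redis.Redis(decode_responses=True)

def audit_log(event: str, **fields) -> None:
    # Append-only trail: every apply, rollback, and assertion result.
    audit.xadd("aegis.audit", {"event": event, "ts": str(time.time()),
                               "payload": json.dumps(fields)})

def apply_fix(proposal_id: str, fix_sql: str, assertions: list[str]) -> bool:
    try:
        with engine.begin() as conn:  # explicit transaction; commits only on clean exit
            conn.execute(text(fix_sql))
            for assertion in assertions:
                if not conn.execute(text(assertion)).scalar():
                    raise AssertionError(assertion)
        audit_log("applied", proposal_id=proposal_id)
        return True
    except Exception as exc:
        # engine.begin() already rolled back when the exception propagated.
        audit_log("rolled_back", proposal_id=proposal_id, reason=str(exc))
        return False  # caller escalates the incident
```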
Document
After a successful apply, a FixReport is written to Postgres. The affected column gets annotated in OpenMetadata and tagged AegisDB.healed. The Slack card updates with deep links to the proposal and audit entry. The fix embedding is stored back in ChromaDB so future diagnoses can learn from this outcome.
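A sketch of the write-back, with the OpenMetadata annotation left as a stub since its SDK call depends on the deployment; the table and collection names are assumptions.

```python
import json
import uuid

import chromadb
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@localhost/aegis")
fixes = chromadb.HttpClient(host="localhost", port=8000).get_or_create_collection("past_fixes")

def tag_column_healed(column_fqn: str) -> None:
    """Stub: annotate the column in OpenMetadata and tag it AegisDB.healed.
    The actual SDK call is omitted here."""

def document_fix(incident: dict, fix_sql: str, proposal_id: str) -> None:
    report_id = str(uuid.uuid4())
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO fix_reports (id, proposal_id, report) "
                 "VALUES (:id, :pid, :r)"),
            {"id": report_id, "pid": proposal_id, "r": json.dumps(incident)},
        )
    tag_column_healed(incident["column_fqn"])
    # Store the outcome so future RAG retrievals can cite it as precedent.
    fixes.add(
        ids=[report_id],
        documents=[json.dumps({"failure": incident, "fix_sql": fix_sql,
                               "outcome": "applied"})],
        metadatas=[{"column_fqn": incident["column_fqn"],
                    "anomaly_type": incident["anomaly_type"]}],
    )
```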
Three surfaces.
One source of truth.
See every incident, in flight and resolved.
Live pipeline, queued approvals, agent traces. Built for the engineers who actually fix things.
Talk to it like a teammate.
Not another notification firehose. Ask why, push back, request a different fix.
Workspace
Every fix ever applied. Searchable. Linkable.
Not just automation.
A system that learns.
The pipeline gets better with every fix applied and every proposal rejected. Context accumulates. Mistakes don't repeat.
RAG Knowledge Base
Every diagnosis is grounded in what actually worked before.
Each query embeds the anomaly context with all-MiniLM-L6-v2 and retrieves the top-3 most similar past fixes from ChromaDB via cosine similarity. The LLM never generates cold — it always has precedent.
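A sketch of that retrieval path; the collection name and setup are assumptions, but the all-MiniLM-L6-v2 embedding and cosine ranking match the design described above.

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embed = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
client = chromadb.HttpClient(host="localhost", port=8000)
fixes = client.get_or_create_collection(
    "past_fixes",
    embedding_function=embed,
    metadata={"hnsw:space": "cosine"},  # rank neighbors by cosine similarity
)

def retrieve_precedent(anomaly_context: str) -> list[str]:
    # The top-3 most similar past fixes become context for the LLM prompt.
    result = fixes.query(query_texts=[anomaly_context], n_results=3)
    return result["documents"][0]
```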
Sandbox Safety
No fix touches production before it survives a realistic replica.
An ephemeral Postgres container spins up via testcontainers, clones the target schema, seeds up to 500 rows from production, and runs the fix SQL. Three retries with adjusted prompts on failure. The container is destroyed immediately after — nothing leaks.
Recurrence Detection
The system knows when a failure isn't new — and escalates accordingly.
Every anomaly is keyed by the compound (column_fqn × anomaly_type). Each new occurrence increments a counter stored in Postgres. Recurrence count is surfaced on the proposal card and Slack message so operators can see if they're looking at a first-time glitch or a structural problem.
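A sketch of that counter, assuming a recurrences table keyed on the compound; the upsert returns the new count for display on the proposal card.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://aegis@localhost/aegis")

UPSERT = text("""
    INSERT INTO recurrences (column_fqn, anomaly_type, count)
    VALUES (:fqn, :atype, 1)
    ON CONFLICT (column_fqn, anomaly_type)
    DO UPDATE SET count = recurrences.count + 1
    RETURNING count
""")

def record_occurrence(column_fqn: str, anomaly_type: str) -> int:
    # One row per (column_fqn, anomaly_type); a count above 1 means the
    # failure is structural, not a first-time glitch.
    with engine.begin() as conn:
        return conn.execute(UPSERT, {"fqn": column_fqn, "atype": anomaly_type}).scalar()
```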
Learning from Rejections
A rejected fix isn't wasted — it makes the next diagnosis sharper.
When an operator rejects a proposal, the rejection reason is embedded and written back into ChromaDB alongside the original fix candidate. Future RAG retrievals will surface this pair, steering the LLM away from approaches that a human already ruled out.
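A sketch of that write-back, assuming rejections live in the same ChromaDB collection as applied fixes so a single retrieval surfaces both; the IDs and metadata keys are illustrative.

```python
import json
import uuid

import chromadb

fixes = chromadb.HttpClient(host="localhost", port=8000).get_or_create_collection("past_fixes")

def record_rejection(proposal_id: str, fix_sql: str, reason: str) -> None:
    # The rejected SQL and the human's reason are embedded together, so
    # future retrievals steer the LLM away from this approach.
    fixes.add(
        ids=[f"rejection-{uuid.uuid4()}"],
        documents=[json.dumps({"fix_sql": fix_sql, "rejection_reason": reason})],
        metadatas=[{"proposal_id": proposal_id, "outcome": "rejected"}],
    )
```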
The console, in motion.
Purpose-built from
the right primitives.
Every technology chosen because it was the right tool — not because it was familiar. Event-driven end to end. No polling. No shared mutable state between agents.
