Production-ready Data + ML + AI engineering, unified. Works standalone or alongside Airflow, MLflow, and LangChain.
pip install dataenginex
# or
uv add dataenginex
data:
  source: s3://my-bucket/raw/
  format: parquet
  quality:
    null_threshold: 0.05
ml:
  backend: mlflow
  training:
    model: xgboost
    target: revenue
ai:
  provider: openai
  retrieval: hybrid
  agents:
    - name: analyst
      tools: [sql, search]
observability:
  metrics: prometheus
  tracing: otel
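To make the null_threshold: 0.05 setting concrete, here is a minimal Python sketch of what such a quality rule could mean. It assumes the threshold is the maximum allowed fraction of missing values per column; that reading, and the helper names null_fractions and failing_columns, are illustrative assumptions and not DataEngineX's actual implementation.

```python
from typing import Any, Mapping, Sequence


def null_fractions(rows: Sequence[Mapping[str, Any]]) -> dict[str, float]:
    """Return the fraction of missing (None or absent) values per column."""
    if not rows:
        return {}
    columns = {key for row in rows for key in row}
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in sorted(columns)
    }


def failing_columns(
    rows: Sequence[Mapping[str, Any]], null_threshold: float = 0.05
) -> list[str]:
    """Return the columns whose null fraction exceeds the threshold (assumed semantics)."""
    return [
        col for col, frac in null_fractions(rows).items() if frac > null_threshold
    ]


# Hypothetical sample batch: one null in each column out of four rows (25%).
rows = [
    {"revenue": 120.0, "region": "eu"},
    {"revenue": None, "region": "us"},
    {"revenue": 87.5, "region": "us"},
    {"revenue": 42.0, "region": None},
]

print(failing_columns(rows, null_threshold=0.05))
print(failing_columns(rows, null_threshold=0.30))
```

With a 5% threshold both columns in this sample batch would be flagged; at 30% neither would.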
Airflow for orchestration. MLflow for tracking. LangChain for agents. FastAPI wired together by hand. Prometheus bolted on. Each tool brings its own config format, auth system, failure mode, and on-call rotation. Stop building glue. Start shipping products.
Six domains. One framework. No assembly required.
Connectors, transforms, and quality checks from a single config. DuckDB and Spark backends built in.
Experiment tracking, training, serving, and drift detection built in. MLflow, W&B, or the built-in backend — your call.
LLM providers, hybrid BM25+dense retrieval, and LangGraph agent runtime — swappable, not locked in.
Self-hosted web UI — pipelines, warehouse, ML experiments, AI agents, and SQL console. FastAPI/Jinja2, port 7860.
structlog structured logging, Prometheus metrics, and OpenTelemetry tracing — wired up from config, not code.
K3s, Helm, and Terraform via infradex. From dev to production Kubernetes cluster without writing manifests by hand.
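For readers new to hybrid retrieval, the sketch below shows one common way to combine a BM25 (keyword) ranking with a dense (embedding) ranking: reciprocal rank fusion. This is a generic illustration. The page does not say which fusion method DataEngineX uses, and the function name, the constant k=60, and the document IDs are all assumptions.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[str]], k: int = 60
) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Higher total score ranks first; ties are
    broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector index.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank highly (doc_a and doc_c here) rise to the top, which is the usual motivation for pairing keyword and dense search.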
dex.yaml is the single source of truth for your entire platform.
Sources, transforms, quality rules, model config, agent definitions,
API settings, and observability — all in one place.
No more hunting across twelve repos to find why a pipeline broke. No more "it works in dev": dev and prod share the same config schema.
dex validate dex.yaml
# DataEngineX: full stack config
data:
  source: s3://my-bucket/raw/
  format: parquet
  backend: duckdb  # or spark
  quality:
    null_threshold: 0.05
    schema_enforcement: strict
    audit_table: quality.audit
ml:
  backend: mlflow
  tracking_uri: http://mlflow:5000
  training:
    model: xgboost
    target: revenue
    features: [clicks, sessions, region]
  serving:
    endpoint: /api/v1/predict
    drift_detection: true
ai:
  provider: openai
  model: gpt-4o-mini
  retrieval: hybrid  # BM25 + dense
  agents:
    - name: analyst
      tools: [sql, search, python]
observability:
  metrics: prometheus
  tracing: otel
  log_level: info
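As a rough illustration of the kind of check a single-file config makes possible, the Python sketch below inspects a config that has already been parsed into a dict (for example with PyYAML's yaml.safe_load). It is not how dex validate works internally. The function check_config, the list of expected sections, and the rules it applies are assumptions drawn only from the example config above.

```python
from typing import Any

# Top-level sections seen in the example config; the real schema may differ.
EXPECTED_SECTIONS = ("data", "ml", "ai", "observability")
# Backends named in the example ("duckdb  # or spark").
KNOWN_DATA_BACKENDS = {"duckdb", "spark"}


def check_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: list[str] = []

    for section in EXPECTED_SECTIONS:
        if not isinstance(config.get(section), dict):
            problems.append(f"section '{section}' is missing or not a mapping")

    data = config.get("data")
    if isinstance(data, dict):
        source = data.get("source")
        if not isinstance(source, str) or not source:
            problems.append("data.source must be a non-empty string")
        backend = data.get("backend")
        if backend is not None and backend not in KNOWN_DATA_BACKENDS:
            problems.append(f"data.backend '{backend}' is not a known backend")

    return problems


# Hypothetical parsed configs: one complete, one with several mistakes.
good = {
    "data": {"source": "s3://my-bucket/raw/", "format": "parquet", "backend": "duckdb"},
    "ml": {"backend": "mlflow"},
    "ai": {"provider": "openai"},
    "observability": {"metrics": "prometheus"},
}
bad = {"data": {"source": "", "backend": "pandas"}}

print(check_config(good))
for problem in check_config(bad):
    print(problem)
```

Because every section lives in one file, a check like this can run once in CI and cover the data, ML, AI, and observability settings together.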
Each component is independently useful. Together they cover data, ML, AI, and careers.
pip install dataenginex
Core framework — config system, backend registry, CLI, ML lifecycle, AI agents, DuckDB lakehouse. Pure Python library, no server bundled.
Port 7860
B2B web UI — FastAPI + Jinja2 + HTMX. Monitor pipelines, browse data, inspect ML experiments, chat with AI agents, query via SQL console.
Port 7870
B2C career AI — job matching, resume analysis, ATS scanning, interview prep, and application tracking. Powered by the same dex core.
Terraform + Helm
K3s cluster config, Helm charts for Authentik, Langfuse, Qdrant, Prometheus, Grafana, ArgoCD. From blank VPS to production Kubernetes — no manual YAML.
Install the base package or pick the extras you need.
pip install dataenginex
# or
uv add dataenginex
pip install "dataenginex[cloud]" # S3 · GCS · BigQuery
pip install "dataenginex[observability]" # Langfuse LLM tracing
pip install 'litellm>=1.83.3' --no-deps # 100+ LLM providers