v0.1 · accepting private beta sign-ups

Synthetic training data that makes your model measurably better.

Execution-verified coding datasets, benchmark-targeted generation, and a before/after performance guarantee. Built for the teams fine-tuning LLMs — not the ones making slide decks about them.

Start free trial →
Read the docs

100K free tokens · no credit card · cancel any time

da1a ~ shell · job_01HY5PX3W2N4C7JQK8VAHF9D2E

verified

$ da1a jobs create ./job.yaml
▸ job_id       job_01HY5PX3W2N4C7JQK8VAHF9D2E
▸ domain       coding
▸ format       instruction-response
▸ languages    [python, typescript, rust]
▸ volume       2,500 examples
[queued]            accepted · 2,500 pending
[generating]        2,500 / 2,500 drafted
[sandbox.execute]   running in 48 workers ·  t+41s
[filter.dedup]      MinHash · removed 108
[filter.toxicity]   detoxify · removed 3
[verify]            execution-verified · 2,254 passed
→ 200 OK  · job.complete
# response
{
  "job_id":        "job_01HY5PX3W2N4C7JQK8VAHF9D2E",
  "status":        "complete",
  "verified":      true,
  "pass_rate":     0.942,
  "total":         2500,
  "passed":        2254,
  "filtered":      111,
  "dedup_removed": 108,
  "output": {
    "format":  "jsonl",
    "size":    "11.8 MB",
    "url":     "r2://da1a/datasets/job_01H…/data.jsonl"
  },
  "manifest": {
    "version_hash": "sha256:9f2c…a81d",
    "dp_epsilon":   1.2,
    "signed":       true
  }
}
$
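A minimal sketch of how a client might gate a download on the response above. The field names are copied from the sample payload; the `ready_to_download` helper and its sanity rules (counts must nest inside one another) are assumptions for illustration, not documented da1a behavior.

```python
import json

# Sample payload trimmed from the response above (elided URL and hash omitted).
RESPONSE = """
{
  "job_id": "job_01HY5PX3W2N4C7JQK8VAHF9D2E",
  "status": "complete",
  "verified": true,
  "pass_rate": 0.942,
  "total": 2500,
  "passed": 2254,
  "filtered": 111,
  "dedup_removed": 108,
  "manifest": {"dp_epsilon": 1.2, "signed": true}
}
"""


def ready_to_download(job: dict) -> list[str]:
    """Return a list of problems; an empty list means the job looks usable."""
    problems = []
    if job.get("status") != "complete":
        problems.append("job is not complete")
    if not job.get("verified"):
        problems.append("job is not execution-verified")
    if not job.get("manifest", {}).get("signed"):
        problems.append("manifest is not signed")
    # Assumed sanity rules: dedup removals are part of the filtered count,
    # and nothing can pass that did not survive filtering.
    if not job["dedup_removed"] <= job["filtered"] <= job["total"]:
        problems.append("filter counts are inconsistent")
    if job["passed"] > job["total"] - job["filtered"]:
        problems.append("more examples passed than survived filtering")
    return problems


job = json.loads(RESPONSE)
issues = ready_to_download(job)
print("ok" if not issues else issues)
```

Returning a list of problems rather than a boolean keeps the failure reason visible in CI logs.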

pipeline

  ┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
  │  generate  │ ──▶ │  sandbox   │ ──▶ │   filter   │ ──▶ │  manifest  │
  │    llm     │     │  execute   │     │ dedup+pii  │     │   signed   │
  └────────────┘     └────────────┘     └────────────┘     └────────────┘
       draft              run                 keep              deliver
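The job log names MinHash for the dedup filter. As a rough illustration of that technique (a sketch, not da1a's implementation), the code below estimates Jaccard similarity between examples from short hash signatures and drops near-duplicates. The signature length, shingle size, and threshold are assumed values, and the pairwise comparison would normally be replaced by LSH banding at real volumes.

```python
import hashlib
import re

NUM_HASHES = 64   # signature length (assumed)
SHINGLE = 3       # words per shingle (assumed)
THRESHOLD = 0.7   # estimated-Jaccard cutoff for a near-duplicate (assumed)


def shingles(text: str) -> set[str]:
    """Lowercased word n-grams; punctuation is ignored."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < SHINGLE:
        return {" ".join(words)}
    return {" ".join(words[i:i + SHINGLE]) for i in range(len(words) - SHINGLE + 1)}


def _hash(seed: int, token: str) -> int:
    """One member of a seeded hash family, built on salted BLAKE2b."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text: str) -> list[int]:
    """MinHash signature: the minimum hash of any shingle, per seed."""
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(NUM_HASHES)]


def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(examples: list[str]) -> list[str]:
    """Keep the first example of each near-duplicate cluster."""
    kept, sigs = [], []
    for ex in examples:
        sig = signature(ex)
        if any(similarity(sig, s) >= THRESHOLD for s in sigs):
            continue
        kept.append(ex)
        sigs.append(sig)
    return kept


a = "Write a function that returns the sum of two integers a and b"
b = "Write a function that returns the sum of two integers a and b."
c = "Implement a binary search over a sorted list of numbers"
print(len(dedup([a, b, c])))
```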

[scale]

examples executed & verified · average execution pass rate · avg benchmark pts improvement

* rolling 30-day figures from production generation pipeline.

[how it works]

Three steps. One guarantee.

We don't sell you tokens and disappear. Every dataset is generated, filtered, and verified by the same pipeline we use internally, with the receipts to prove it.

step · 01

Define your task

Submit a job spec via the dashboard or REST API. Pick a domain, output format, languages, difficulty mix, and volume. Optionally provide seed examples — we strip PII on ingest with Presidio.

POST /v1/jobs
{
  "domain":     "coding",
  "format":     "instruction-response",
  "languages":  ["python", "ts", "rust"],
  "difficulty": { "beginner": 0.2,
                  "intermediate": 0.5,
                  "advanced": 0.3 },
  "volume":     2500
}
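A minimal sketch of validating this spec client-side and preparing the request, assuming a Bearer-token header and a placeholder host; neither is confirmed by the page. The helpers `build_job_request` and `per_tier_counts` are hypothetical names, and nothing is sent over the network here.

```python
import json
import urllib.request

API_BASE = "https://api.example.invalid"  # placeholder host, not a real endpoint

SPEC = {
    "domain": "coding",
    "format": "instruction-response",
    "languages": ["python", "ts", "rust"],
    "difficulty": {"beginner": 0.2, "intermediate": 0.5, "advanced": 0.3},
    "volume": 2500,
}


def build_job_request(spec: dict, token: str) -> urllib.request.Request:
    """Validate the spec locally and build (but do not send) the POST."""
    if abs(sum(spec["difficulty"].values()) - 1.0) > 1e-9:
        raise ValueError("difficulty mix must sum to 1.0")
    if spec["volume"] <= 0:
        raise ValueError("volume must be positive")
    return urllib.request.Request(
        url=f"{API_BASE}/v1/jobs",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # auth scheme is an assumption
        },
        method="POST",
    )


def per_tier_counts(spec: dict) -> dict[str, int]:
    """Translate the difficulty mix into example counts per tier."""
    return {
        tier: round(share * spec["volume"])
        for tier, share in spec["difficulty"].items()
    }


request = build_job_request(SPEC, token="YOUR_API_TOKEN")
print(request.get_method(), request.full_url)
print(per_tier_counts(SPEC))
```

With the mix shown above, a 2,500-example job works out to 500 beginner, 1,250 intermediate, and 750 advanced examples.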

step · 02

We generate & verify

Every code example is executed in a sandboxed, network-isolated container with a hard 30-second timeout. No syntax-only checks, no LLM-as-critic. If it doesn't run, it doesn't ship.

[sandbox.execute]
  workers        48
  timeout        30s · SIGKILL on overrun
  network        deny-all
  fs             read-only, tmpfs scratch
  verdict        2254 / 2500 passed (94.2%)
  retries        disabled — failures are signal
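The timeout-and-kill behavior in the settings above can be approximated locally with a child process. This is only a sketch of the pass/fail/timeout verdict for Python snippets: it does not provide the network isolation or read-only filesystem listed above, which need a container runtime, and `run_example` is a hypothetical helper name.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_example(code: str, timeout_s: float = 30.0) -> str:
    """Return 'passed', 'failed', or 'timeout' for one Python snippet."""
    with tempfile.TemporaryDirectory() as scratch:
        script = Path(scratch) / "example.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=scratch,              # throwaway working directory
                capture_output=True,      # keep stdout/stderr off the console
                timeout=timeout_s,        # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    # No retries: a non-zero exit code is recorded as a failure.
    return "passed" if result.returncode == 0 else "failed"


print(run_example("assert sum([1, 2, 3]) == 6"))
print(run_example("raise SystemExit(1)"))
print(run_example("while True: pass", timeout_s=1.0))
```

On POSIX systems `subprocess.run` kills the child with SIGKILL when the timeout expires, which matches the "SIGKILL on overrun" setting.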

step · 03

Your model improves

Receive a curated dataset with a full quality report: pass rate, language & difficulty breakdown, dedup stats, and a signed manifest so you can prove to auditors what your model was trained on.

# quality_report.json
{
  "pass_rate":    0.942,
  "dedup":        108,
  "pii_stripped": 0,
  "manifest":     "sha256:9f2c…a81d",
  "dp_epsilon":   1.2,
  "signed":       true
}
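One way a team might use the manifest hash is to confirm that the file they train on is the file that was delivered. The sketch below assumes the hash is a SHA-256 digest over the raw JSONL bytes, which the page does not confirm; signature verification is left out because the signing scheme is not described. `file_digest` and `matches_manifest` are hypothetical helper names.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file and return its digest as 'sha256:<hex>'."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def matches_manifest(path: Path, version_hash: str) -> bool:
    """Compare the local digest with the manifest value in constant time."""
    return hmac.compare_digest(file_digest(path), version_hash)


# Usage with a throwaway file standing in for the downloaded data.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.jsonl"
    data.write_bytes(
        b'{"instruction": "add two ints", "response": "def add(a, b): return a + b"}\n'
    )
    expected = file_digest(data)  # stand-in for the manifest hash
    intact = matches_manifest(data, expected)
    data.write_bytes(b"tampered\n")
    tampered = matches_manifest(data, expected)

print(intact, tampered)
```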

[why da1a]

A model performance partner — not a data marketplace.

Public datasets are stale. Generic synthetic data is incoherent. We built da1a for teams that care about what their model does at eval time, not what was shipped to Hugging Face six months ago.

[verify]

Execution-verified

Every code example runs in a sandboxed Docker container with a hard timeout. Pass/fail is a fact, not an LLM opinion.

[graph]

Causally coherent

A graph-first generator keeps fields consistent. An 'easy' Python problem won't get an expert-level solution. No data-model drift.

[bench]

Benchmark-targeted

Pick HumanEval, MBPP, or a BIG-Bench subset. We run gap analysis and generate data aimed at closing your specific failure modes.
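To make the idea of gap analysis concrete, here is one possible sketch: compute a failure rate per benchmark category, then split the job volume in proportion to those rates. The proportional rule, the category names, and both helper functions are assumptions for illustration; the page does not describe how da1a actually targets failure modes.

```python
from collections import Counter


def failure_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results holds (category, passed) pairs from one benchmark run."""
    totals, failures = Counter(), Counter()
    for category, passed in results:
        totals[category] += 1
        if not passed:
            failures[category] += 1
    return {c: failures[c] / totals[c] for c in totals}


def allocate(volume: int, rates: dict[str, float]) -> dict[str, int]:
    """Split volume across categories in proportion to failure rate."""
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {c: 0 for c in rates}
    raw = {c: volume * r / total_rate for c, r in rates.items()}
    counts = {c: int(x) for c, x in raw.items()}
    # Hand leftover examples to the largest fractional remainders.
    leftover = volume - sum(counts.values())
    by_remainder = sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts


# Hypothetical eval results: (category, passed)
results = (
    [("strings", True)] * 3 + [("strings", False)] * 1
    + [("recursion", True)] * 1 + [("recursion", False)] * 3
    + [("parsing", True)] * 2
)
rates = failure_rates(results)
print(allocate(2500, rates))
```

In this toy run the weakest category receives the most new examples, and a category with no failures receives none.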

[dp]

DP-certified

Every job ships with a differential-privacy epsilon value and a signed, immutable manifest. EU AI Act ready out of the box.

[in production]

case studies shipping Q1

testimonial · 01

“Our internal coding eval jumped 6.1 points after one week of da1a data. Our previous synthetic vendor shipped us 200K duplicate examples.”

Staff ML Engineer

Series B AI startup · 28 engineers

testimonial · 02

“The signed manifest alone paid for the subscription. Our compliance team stopped asking us where our training data came from.”

Head of Platform

Regulated fintech · EU

testimonial · 03

“We replaced four internal data pipelines with one da1a job spec. Execution verification caught failure modes our LLM judge was greenlighting.”

Founding Engineer

Dev-tools stealth · YC W26

[pricing]

Pay for improvement, not for tokens.

All plans include execution verification, signed lineage, and PII-stripping on seed uploads. No hidden overage charges: you get 72 hours' notice before you reach a soft cap.

Compare all plans →

Starter

$199/mo

500K tokens · ~25K examples

For solo engineers and small teams shipping their first fine-tune.

Execution verification
JSONL + JSON export
Signed data lineage
Community support

Start free trial →

Start with 100,000 free tokens.
No credit card required.

Request early access — we'll get you onboarded this week. Tell us what you're fine-tuning and we'll help you scope a first job.

SOC 2 in progress · GDPR & EU AI Act ready · on-prem available at scale
