v0.1 · accepting private beta requests

Synthetic training data that makes your model measurably better.

Execution-verified coding datasets, benchmark-targeted generation, and a before/after performance guarantee. Built for the teams fine-tuning LLMs — not the ones making slide decks about them.

100k free tokens · no credit card · cancel any time
da1a ~ shell · job_01HY5PX3W2N4C7JQK8VAHF9D2E
verified
$ da1a jobs create ./job.yaml
▸ job_id job_01HY5PX3W2N4C7JQK8VAHF9D2E
▸ domain coding
▸ format instruction-response
▸ languages [python, typescript, rust]
▸ volume 2,500 examples
[queued] accepted · 2,500 pending
[generating] 2,500 / 2,500 drafted
[sandbox.execute] running in 48 workers · t+41s
[filter.dedup] MinHash · removed 108
[filter.toxicity] detoxify · removed 3
[verify] execution-verified · 2,254 passed
→ 200 OK · job.complete
# response
{
"job_id": "job_01HY5PX3W2N4C7JQK8VAHF9D2E",
"status": "complete",
"verified": true,
"pass_rate": 0.942,
"total": 2500,
"passed": 2254,
"filtered": 111,
"dedup_removed": 108,
"output": {
"format": "jsonl",
"size": "11.8 MB",
"url": "r2://da1a/datasets/job_01H…/data.jsonl"
},
"manifest": {
"version_hash": "sha256:9f2c…a81d",
"dp_epsilon": 1.2,
"signed": true
}
}
$
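The [filter.dedup] line in the log above names MinHash. For readers new to the technique, here is a minimal sketch of MinHash-based near-duplicate removal. It is an illustration under stated assumptions, not da1a's pipeline: the 5-character shingles, the 64 hash functions, the 0.8 threshold, and every function name are hypothetical choices.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-shingles of the whitespace-normalised text."""
    norm = " ".join(text.split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """One minimum hash value per salted hash function (assumed: 64 of them)."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        # A different salt behaves like a different hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def dedup(examples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first example of each near-duplicate group (O(n^2) sketch)."""
    kept, kept_sigs = [], []
    for ex in examples:
        sig = minhash_signature(ex)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(ex)
            kept_sigs.append(sig)
    return kept
```

The pairwise comparison keeps the sketch short. At larger volumes, signatures are usually bucketed with locality-sensitive hashing so that only likely duplicates get compared.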
pipeline
[scale]
01 · examples executed & verified
02 · average execution pass rate
03 · avg benchmark pts improvement

* rolling 30-day figures from production generation pipeline.

[how it works]

Three steps. One guarantee.

We don't sell you tokens and disappear. Every dataset is generated, filtered, and verified against the same pipeline we use internally — with the receipts to prove it.

01
step

Define your task

Submit a job spec via the dashboard or REST API. Pick a domain, output format, languages, difficulty mix, and volume. Optionally provide seed examples — we strip PII on ingest with Presidio.

POST /v1/jobs
{
  "domain":     "coding",
  "format":     "instruction-response",
  "languages":  ["python", "typescript", "rust"],
  "difficulty": { "beginner": 0.2,
                  "intermediate": 0.5,
                  "advanced": 0.3 },
  "volume":     2500
}
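To make the job spec above concrete, here is a hedged sketch of assembling and sanity-checking it in Python before submission. Only the `POST /v1/jobs` path and the field names come from the spec above. The base URL, the bearer-token header, and the helper names are placeholders assumed for illustration, and the request is built but never sent.

```python
import json
import math
import urllib.request

# Assumption: placeholder host, not a real da1a endpoint.
API_BASE = "https://api.example.invalid"


def build_job_spec(domain, fmt, languages, difficulty, volume) -> dict:
    """Assemble a job spec and check that the difficulty mix is a valid split."""
    if not math.isclose(sum(difficulty.values()), 1.0, abs_tol=1e-9):
        raise ValueError("difficulty weights must sum to 1.0")
    if volume <= 0:
        raise ValueError("volume must be a positive number of examples")
    return {
        "domain": domain,
        "format": fmt,
        "languages": list(languages),
        "difficulty": dict(difficulty),
        "volume": volume,
    }


def build_request(spec: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the spec."""
    return urllib.request.Request(
        f"{API_BASE}/v1/jobs",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the page does not state the scheme.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


spec = build_job_spec(
    domain="coding",
    fmt="instruction-response",
    languages=["python", "typescript", "rust"],
    difficulty={"beginner": 0.2, "intermediate": 0.5, "advanced": 0.3},
    volume=2500,
)
request = build_request(spec, token="YOUR_API_TOKEN")
```

Checking the difficulty weights on the client side catches a mistyped mix before a job is queued.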
02
step

We generate & verify

Every code example is executed in a sandboxed, network-isolated container with a kill switch. No syntactic judging, no LLM-as-critic. If it doesn't run, it doesn't ship.

[sandbox.execute]
  workers        48
  timeout        30s · SIGKILL on overrun
  network        deny-all
  fs             read-only, tmpfs scratch
  verdict        2254 / 2500 passed (94.2%)
  retries        disabled — failures are signal
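The pass/fail rule above can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the production sandbox: it runs a snippet in a child process with a hard timeout and treats a zero exit code as a pass. It does not provide the network isolation or read-only filesystem listed above, and the function name is hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_example(code: str, timeout_s: float = 30.0) -> bool:
    """Return True only if the snippet exits cleanly within the timeout.

    NOT a real sandbox: there is no network or filesystem isolation here.
    """
    with tempfile.TemporaryDirectory() as scratch:
        script = Path(scratch) / "example.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=scratch,
                capture_output=True,
                timeout=timeout_s,  # the child is killed on overrun
            )
        except subprocess.TimeoutExpired:
            return False  # overrun counts as a failure and is not retried
        return result.returncode == 0
```

A snippet that raises, exits non-zero, or loops past the timeout returns `False`. With retries disabled, that single verdict decides whether the example ships.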
03
step

Your model improves

Receive a curated dataset with a full quality report: pass rate, language & difficulty breakdown, dedup stats, and a signed manifest so you can prove to auditors what your model was trained on.

# quality_report.json
{
  "pass_rate":    0.942,
  "dedup":        108,
  "pii_stripped": 0,
  "manifest":     "sha256:9f2c…a81d",
  "dp_epsilon":   1.2,
  "signed":       true
}
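One way to use the manifest hash during an audit is to recompute it locally. The sketch below assumes the manifest value is the SHA-256 of the delivered dataset file's bytes. The page does not confirm what the hash covers, so treat that as an assumption. Signature verification is out of scope here, and the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file and return its digest in the 'sha256:<hex>' form."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches_manifest(path: str, manifest_hash: str) -> bool:
    """Assumption: the manifest hash covers the raw bytes of the dataset file."""
    return sha256_of_file(path) == manifest_hash
```

Streaming in chunks keeps memory use flat even for multi-gigabyte datasets.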
[why da1a]

A model performance partner — not a data marketplace.

Public datasets are stale. Generic synthetic data is incoherent. We built da1a for teams that care about what their model does at eval time, not what was shipped to Hugging Face six months ago.

[verify]
Execution-verified

Every code example runs in a sandboxed Docker container with a hard timeout. Pass/fail is a fact, not an LLM opinion.

[graph]
Causally coherent

A graph-first generator keeps fields consistent. An 'easy' Python problem won't get an expert-level solution. No data-model drift.

[bench]
Benchmark-targeted

Pick HumanEval, MBPP, or a BIG-Bench subset. We run gap analysis and generate data aimed at closing your specific failure modes.

[dp]
DP-certified

Every job ships with a differential-privacy epsilon value and a signed, immutable manifest. EU AI Act ready out of the box.

[in production] · case studies shipping Q1
testimonial · 01
Our internal coding eval jumped 6.1 points after one week of da1a data. Our previous synthetic vendor shipped us 200K duplicate examples.
Staff ML Engineer
Series B AI startup · 28 engineers
testimonial · 02
The signed manifest alone paid for the subscription. Our compliance team stopped asking us where our training data came from.
Head of Platform
Regulated fintech · EU
testimonial · 03
We replaced four internal data pipelines with one da1a job spec. Execution verification caught failure modes our LLM judge was greenlighting.
Founding Engineer
Dev-tools stealth · YC W26
[pricing]

Pay for improvement, not for tokens.

All plans include execution verification, signed lineage, and PII-stripping on seed uploads. No hidden overage charges: usage stops at a soft cap, and we give you 72h advance notice before you reach it.

Compare all plans →
Starter
$199/mo
500K tokens · ~25K examples

For solo engineers and small teams shipping their first fine-tune.

  • Execution verification
  • JSONL + JSON export
  • Signed data lineage
  • Community support
Start free trial
most popular
Builder
$799/mo
3M tokens · ~150K examples

For ML teams fine-tuning weekly and running against benchmarks.

  • Everything in Starter
  • Graph-first causal engine
  • Benchmark gap reports
  • Priority generation queue
  • Slack support
Start free trial
Scale
$2,500/mo
15M tokens · ~750K examples

For teams treating data quality as a first-class engineering lever.

  • Everything in Builder
  • Model-in-the-loop failure diagnosis
  • Before/after delta reports
  • Red-team adversarial data
  • Dedicated solutions engineer
Start free trial
enterprise
Need dedicated infra, VPC delivery, or >50M tokens?
Talk to us →
[/start]

Start with 100,000 free tokens.
No credit card required.

Request early access — we'll get you onboarded this week. Tell us what you're fine-tuning and we'll help you scope a first job.

SOC2 in progress · GDPR & EU AI Act ready · on-prem available at scale