A dedicated toolkit for downloading, processing, and preparing genomic annotation datasets (Ensembl, ClinVar, dbSNP, gnomAD) using Dagster for robust, parallel, and observable pipelines.
Genomic data preparation is complex, involving multi-GB downloads and multi-step transformations. We use Dagster to provide:
- Software-Defined Assets (SDA): Instead of just running "tasks", we define Assets (like a Parquet file). Dagster understands the dependencies between assets and only runs what is necessary (see the sketch after this list).
- Lineage & Observability: You can visualize exactly which source VCF produced which output Parquet file. If a file looks wrong, you can trace it back to its source.
- Dynamic Partitioning: We discover files on remote servers (like Ensembl FTP) and create a "partition" for each. This allows fine-grained progress tracking and the ability to retry only failed files.
- Parallelism & Concurrency: Safe parallel execution with configurable limits to avoid overloading source servers or local system resources.
- Self-Documenting: The Dagster UI provides a live, interactive map of your data pipeline and its current state.
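To make the list above concrete, here is a minimal sketch of dynamically partitioned, software-defined assets in Dagster. The asset and partition names echo the ones used by this project's CLI, but the bodies are illustrative, not the project's actual implementation:

```python
from dagster import (
    AssetExecutionContext,
    Definitions,
    DynamicPartitionsDefinition,
    asset,
    multiprocess_executor,
)

# One dynamic partition per VCF discovered on the remote server.
vcf_partitions = DynamicPartitionsDefinition(name="ensembl_vcf_files")

@asset(partitions_def=vcf_partitions)
def ensembl_vcf_file(context: AssetExecutionContext) -> str:
    # Download one file per partition; a failed partition can be retried alone.
    return f"data/{context.partition_key}"

@asset(partitions_def=vcf_partitions)
def ensembl_parquet(ensembl_vcf_file: str) -> str:
    # Dagster infers the upstream dependency from the argument name and
    # records lineage, so each Parquet file traces back to its source VCF.
    return ensembl_vcf_file.replace(".vcf.gz", ".parquet")

defs = Definitions(
    assets=[ensembl_vcf_file, ensembl_parquet],
    # Bound parallelism so source servers and local resources are not overloaded.
    executor=multiprocess_executor.configured({"max_concurrent": 4}),
)
```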
This project uses uv for dependency management.
```bash
git clone https://github.com/dna-seq/prepare-annotations.git
cd prepare-annotations
uv sync
```

The primary entry points are `dagster-ensembl` for running jobs and `dagster-ui` for the web interface.
```bash
# Run the full Ensembl pipeline (download → convert → upload)
uv run dagster-ensembl

# Run the LongevityMap pipeline (convert → join with Ensembl → upload)
uv run prepare longevitymap

# Start the Dagster UI for monitoring and lineage visualization
uv run dagster-ui

# Run for a specific species
uv run dagster-ensembl --species mus_musculus
```

Use the `prepare` command for more granular control:
```bash
# List all available assets and jobs
uv run prepare assets
uv run prepare jobs

# Materialize specific assets
uv run prepare materialize ensembl_vcf_urls
uv run prepare materialize ensembl_vcf_file --partition homo_sapiens.vcf.gz
```

The `modules` command manages OakVar modules from the dna-seq GitHub organization.
```bash
# Download data files from a module
uv run modules data --repo dna-seq/just_longevitymap

# Convert module data to unified schema
uv run prepare longevitymap --convert-only
```

Available modules: `just_longevitymap`, `just_coronary`, `just_vo2max`, `just_lipidmetabolism`, `just_superhuman`, `just_drugs`, `just_pathogenic`, `just_cancer`, `just_prs`
Module conversion produces three standardized parquet files:
| File | Schema |
|---|---|
| annotations.parquet | rsid, module, gene, phenotype, category |
| studies.parquet | rsid, module, pmid, population, p_value, conclusion, study_design |
| weights.parquet | rsid, genotype, module, weight, state, priority, conclusion, curator, method |
- State: `protective`, `risk`, or `neutral`
- Genotype: list of 2 alleles, alphabetically sorted (see the sketch below)
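A minimal sketch of consuming the converted outputs with polars; the file path and rsid below are illustrative, and the genotype column is assumed to be stored as a list of strings:

```python
import polars as pl

# Illustrative path: wherever the converted module outputs were written.
weights = pl.read_parquet("output/longevitymap/weights.parquet")

# Genotypes are stored as a two-allele list sorted alphabetically, so build
# the lookup key the same way: sort the observed alleles before comparing.
key = "/".join(sorted(["T", "A"]))  # -> "A/T"

hits = weights.filter(
    (pl.col("rsid") == "rs1042522")  # illustrative rsid
    & (pl.col("genotype").list.join("/") == key)
)
print(hits.select(["module", "weight", "state"]))
```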
Converted datasets are uploaded to the just-dna-seq organization on HuggingFace Hub. See the Hugging Face Module Consumption Guide for details on how to use these modules in your own pipelines.
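A sketch of fetching a converted module dataset from the Hub with `huggingface_hub`; the repository id here is an assumption based on the module names listed above, not a confirmed repo:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="just-dna-seq/just_longevitymap",  # assumed dataset repo id
    repo_type="dataset",
)
print(local_dir)  # local directory containing the downloaded parquet files
```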
The package follows Dagster best practices with utilities organized in subpackages:
```
src/prepare_annotations/
├── definitions.py # Main Dagster definitions (assets, jobs, resources)
├── pipelines.py # Standalone API for ClinVar, dbSNP, gnomAD (non-Dagster)
├── cli.py # Typer CLI entrypoint
│
├── core/ # Core utilities
│ ├── io.py # VCF/Parquet I/O
│ ├── models.py # Pydantic models
│ ├── paths.py # Path helpers
│ └── runtime.py # Profiling, environment
│
├── assets/ # Dagster assets
│ ├── ensembl.py # Ensembl VCF pipeline
│ └── modules.py # OakVar module conversion
│
├── downloaders/ # Download utilities
│ ├── vcf.py # VCF download
│ └── genome.py # Genome FASTA download
│
├── huggingface/ # HuggingFace Hub integration
│ ├── uploader.py # Upload utilities
│ └── dataset_cards.py # Dataset card templates
│
└── converters/            # OakVar module converters
```
```bash
# Run all tests (excluding large downloads)
uv run pytest

# Run specific module tests
uv run pytest tests/test_longevitymap_module.py -v
```

Apache 2.0
