A dedicated toolkit for downloading, processing, and preparing genomic annotation datasets (Ensembl, ClinVar, dbSNP, gnomAD) using Dagster for robust, parallel, and observable pipelines.
Genomic data preparation is complex, involving multi-GB downloads and multi-step transformations. We use Dagster to provide:
- Software-Defined Assets (SDA): Instead of just running "tasks", we define Assets (like a Parquet file). Dagster understands the dependencies between assets and only runs what is necessary (see the sketch after this list).
- Lineage & Observability: You can visualize exactly which source VCF produced which output Parquet file. If a file looks wrong, you can trace it back to its source.
- Dynamic Partitioning: We discover files on remote servers (like Ensembl FTP) and create a "partition" for each. This allows fine-grained progress tracking and the ability to retry only failed files.
- Parallelism & Concurrency: Safe parallel execution with configurable limits to avoid overloading source servers or local system resources.
- Self-Documenting: The Dagster UI provides a live, interactive map of your data pipeline and its current state.
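To make the list above concrete, here is a minimal sketch of dynamically partitioned, software-defined assets in Dagster. The asset and partition names echo the ones used by this project's CLI, but the bodies are illustrative, not the project's actual implementation:

```python
from dagster import (
    AssetExecutionContext,
    Definitions,
    DynamicPartitionsDefinition,
    asset,
    multiprocess_executor,
)

# One dynamic partition per VCF discovered on the remote server.
vcf_partitions = DynamicPartitionsDefinition(name="ensembl_vcf_files")

@asset(partitions_def=vcf_partitions)
def ensembl_vcf_file(context: AssetExecutionContext) -> str:
    # Download one file per partition; a failed partition can be retried alone.
    return f"data/{context.partition_key}"

@asset(partitions_def=vcf_partitions)
def ensembl_parquet(ensembl_vcf_file: str) -> str:
    # Dagster infers the upstream dependency from the argument name and
    # records lineage, so each Parquet file traces back to its source VCF.
    return ensembl_vcf_file.replace(".vcf.gz", ".parquet")

defs = Definitions(
    assets=[ensembl_vcf_file, ensembl_parquet],
    # Bound parallelism so source servers and local resources are not overloaded.
    executor=multiprocess_executor.configured({"max_concurrent": 4}),
)
```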
This project uses uv for dependency management.
```bash
git clone https://github.com/dna-seq/prepare-annotations.git
cd prepare-annotations
uv sync
```

The primary entry points are `dagster-ensembl` for running jobs and `dagster-ui` for the web interface.
```bash
# Run the full Ensembl pipeline (download → convert → upload)
uv run dagster-ensembl

# Run the LongevityMap pipeline (convert → join with Ensembl → upload)
uv run prepare longevitymap

# Start the Dagster UI for monitoring and lineage visualization
uv run dagster-ui

# Run for a specific species
uv run dagster-ensembl --species mus_musculus
```

Use the `prepare` command for more granular control:
```bash
# List all available assets and jobs
uv run prepare assets
uv run prepare jobs

# Materialize specific assets
uv run prepare materialize ensembl_vcf_urls
uv run prepare materialize ensembl_vcf_file --partition homo_sapiens.vcf.gz
```

The `modules` command manages OakVar modules from the dna-seq GitHub organization.
```bash
# Download data files from a module
uv run modules data --repo dna-seq/just_longevitymap

# Convert module data to unified schema
uv run prepare longevitymap --convert-only
```

Available modules: `just_longevitymap`, `just_coronary`, `just_vo2max`, `just_lipidmetabolism`, `just_superhuman`, `just_drugs`, `just_pathogenic`, `just_cancer`, `just_prs`
Module conversion produces three standardized parquet files:
| File | Schema |
|---|---|
| annotations.parquet | rsid, module, gene, phenotype, category |
| studies.parquet | rsid, module, pmid, population, p_value, conclusion, study_design |
| weights.parquet | rsid, genotype, module, weight, state, priority, conclusion, curator, method |
- State: `protective`, `risk`, or `neutral`
- Genotype: list of 2 alleles, alphabetically sorted (see the sketch below)
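A minimal sketch of consuming the converted outputs with polars; the file path and rsid below are illustrative, and the genotype column is assumed to be stored as a list of strings:

```python
import polars as pl

# Illustrative path: wherever the converted module outputs were written.
weights = pl.read_parquet("output/longevitymap/weights.parquet")

# Genotypes are stored as a two-allele list sorted alphabetically, so build
# the lookup key the same way: sort the observed alleles before comparing.
key = "/".join(sorted(["T", "A"]))  # -> "A/T"

hits = weights.filter(
    (pl.col("rsid") == "rs1042522")  # illustrative rsid
    & (pl.col("genotype").list.join("/") == key)
)
print(hits.select(["module", "weight", "state"]))
```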
Converted datasets are uploaded to the just-dna-seq organization on HuggingFace Hub. See the Hugging Face Module Consumption Guide for details on how to use these modules in your own pipelines.
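A sketch of fetching a converted module dataset from the Hub with `huggingface_hub`; the repository id here is an assumption based on the module names listed above, not a confirmed repo:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="just-dna-seq/just_longevitymap",  # assumed dataset repo id
    repo_type="dataset",
)
print(local_dir)  # local directory containing the downloaded parquet files
```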
The package follows Dagster best practices with utilities organized in subpackages:
```
src/prepare_annotations/
├── definitions.py # Main Dagster definitions (assets, jobs, resources)
├── pipelines.py # Standalone API for ClinVar, dbSNP, gnomAD (non-Dagster)
├── cli.py # Typer CLI entrypoint
│
├── core/ # Core utilities
│ ├── io.py # VCF/Parquet I/O
│ ├── models.py # Pydantic models
│ ├── paths.py # Path helpers
│ └── runtime.py # Profiling, environment
│
├── assets/ # Dagster assets
│ ├── ensembl.py # Ensembl VCF pipeline
│ └── modules.py # OakVar module conversion
│
├── downloaders/ # Download utilities
│ ├── vcf.py # VCF download
│ └── genome.py # Genome FASTA download
│
├── huggingface/ # HuggingFace Hub integration
│ ├── uploader.py # Upload utilities
│ └── dataset_cards.py # Dataset card templates
│
└── converters/            # OakVar module converters
```
```bash
# Run all tests (excluding large downloads)
uv run pytest

# Run specific module tests
uv run pytest tests/test_longevitymap_module.py -v
```

Apache 2.0
