Merged
15 changes: 13 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -8,14 +8,18 @@ In a RAP project, the README is essential for:
- Documenting setup steps and usage instructions
- Outlining folder structure and key files
- Explaining how to run the pipeline, tests, and automation tools
- Sharing best practices for reproducibility, automation, and transparency
- Any other information to help users and contributors understand and work with the project

A well-written README makes your RAP project accessible and easy for others to use, review, or contribute to. Update it as your project evolves.
The README file is the first file users and contributors will interact with in a RAP.
A well-written README makes the RAP project accessible and easy for others to use, review, or contribute to.
Update it as your project evolves.
-->
# Work in Progress - RAP demonstration repository for Python

Welcome to the RAP (Reproducible Analytical Pipeline) demonstration repository! This repository is designed for beginner to intermediate coders to practice RAP principles, experiment with code, and learn best practices for Reproducible Analytical Pipelines in Python.

See the [Reproducible Analytical Pipelines]([PROVISIONAL_LINK]) materials on the Analysis for Action platform for more information about RAPs and their importance.

**This repository is still in development**

## Getting Started
@@ -99,6 +103,13 @@ Run tests with:
pytest tests
```

## Troubleshooting
If you encounter issues:
- Ensure your virtual environment is activated. The terminal prompt should show `(.venv)` at the start.
- Check that all dependencies are installed by running `pip install -r requirements.txt`.
- Verify you are in the project root directory when running commands. The terminal should show the path ending with `python_rap_demo`.
- For exercise notebooks, clear the cell outputs and restart the kernel.
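
The virtual environment and working directory checks above can also be scripted. The following is a minimal sketch rather than part of the repository: the helper names `in_virtual_env` and `in_project_root` are hypothetical, and it assumes the environment was created with `venv` and that the project root folder is named `python_rap_demo`.

```python
import sys
from pathlib import Path


def in_virtual_env() -> bool:
    """Return True if the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def in_project_root(expected_name: str = "python_rap_demo") -> bool:
    """Return True if the current working directory is named `expected_name`."""
    # Assumption: the project root folder keeps the repository name.
    return Path.cwd().name == expected_name


if __name__ == "__main__":
    print(f"Virtual environment active: {in_virtual_env()}")
    print(f"In project root: {in_project_root()}")
```

If either check prints `False`, activate the environment or change directory before re-running the failing command.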

## AI declaration

AI has been used in the production of this content.
24 changes: 15 additions & 9 deletions exercises/01_introduction.ipynb
@@ -6,8 +6,7 @@
"metadata": {},
"source": [
"# Introduction to RAP pipeline exercises\n",
"\n",
"Welcome to the RAP (Reproducible Analytical Pipeline) exercises for this project. These exercises are designed to help you learn and apply best practices for reproducible data analysis in Python.\n",
"Welcome to the RAP (Reproducible Analytical Pipeline) exercises. These exercises are designed to help you develop understanding of best practices for a RAP and walk through how to apply those practices in Python.\n",
"\n",
"## Contents of the exercises\n",
"- **01_introduction.ipynb**: Overview and guidance for the RAP exercises\n",
@@ -17,7 +16,7 @@
"- **05_continuous_integration.ipynb**: Implement and test continuous integration\n",
"\n",
"## Aim of the exercises\n",
"The aim is to guide to build on your understanding of reproducible analytical pipelines (RAP) with practical experience in a demonstration repository. \n",
"The aim is to guide to build on your understanding of RAP with practical experience in a demonstration repository. \n",
" \n",
"These exercises are intended as a starting point that you can build upon. Please use this repository to practice elements of RAP you would like to improve that may not be covered by these exercises.\n",
"\n",
@@ -34,18 +33,25 @@
"\n",
"## Adding code to the src folder\n",
"- Place reusable functions and classes in the appropriate module in `src/python_rap_demo/` (e.g., `cleaning.py`, `io.py`, `report.py`).\n",
"- Use clear function names, type hints, and docstrings following PEP8 standards.\n",
"- Use clear function names, type hints, and docstrings following [PEP8](https://peps.python.org/pep-0008/) standards.\n",
"\n",
"## How to check if your solutions have worked\n",
"- **Compare outputs**: Check your results against the solutions notebooks and output files.\n",
"- **Check outputs**: Review generated files in `data/outputs/` or `reports/` for expected results.\n",
"- **Run tests**: For unit test exercises you can run all tests with:\n",
"View the solutions for each exercise in the `exercises/solutions/` folder. The solutions will walk through the expected outputs and how to run them. This includes:\n",
"- **Comparing outputs**: Check your results against the solutions notebooks and output files.\n",
"- **Checking outputs**: Review generated files in `data/outputs/` or `reports/` for expected results.\n",
"- **Running tests**: For unit test exercises you can run all tests with:\n",
" ```cmd\n",
" pytest tests\n",
" ```\n",
"- **Use pre-commit hooks and CI**: Ensure your code passes formatting, linting, and CI checks.\n",
"- **Using pre-commit hooks and CI**: Ensure your code passes formatting, linting, and CI checks.\n",
"\n",
"By following these exercises, you will build a reproducible, automated, and transparent analytical pipeline suitable for real-world data analysis."
"## Troubleshooting\n",
"If you encounter issues:\n",
"- Ensure your virtual environment is activated. The terminal prompt should show `(.venv)` at the start.\n",
"- Check your notebook is using the correct Python kernel associated with the virtual environment. You can change the kernel in the Jupyter notebook interface.\n",
"- Clean outputs and restart the kernel if you face issues.\n",
"- Check that all dependencies are installed by running `pip install -r requirements.txt`.\n",
"- Verify you are in the project root directory when running commands. The terminal should show the path ending with `python_rap_demo`.\n"
]
}
],
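
As an illustration of the notebook's guidance on clear function names, type hints, and docstrings, a function added to `src/python_rap_demo/` might look like the sketch below. `calculate_bmi` is a hypothetical example, not a function from the repository, and the NumPy-style docstring layout is an assumption; PEP 8 itself defers docstring conventions to PEP 257.

```python
def calculate_bmi(weight_kg: float, height_m: float) -> float:
    """
    Calculate body mass index (BMI) from weight and height.

    Parameters
    ----------
    weight_kg : float
        Weight in kilograms.
    height_m : float
        Height in metres. Must be greater than zero.

    Returns
    -------
    float
        BMI in kg/m^2.

    Raises
    ------
    ValueError
        If `height_m` is zero or negative.
    """
    if height_m <= 0:
        raise ValueError("height_m must be greater than zero")
    return weight_kg / height_m**2
```

The name says what the function does, the type hints state what goes in and comes out, and the docstring records the units and the one failure case.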
8 changes: 4 additions & 4 deletions exercises/02_modules.ipynb
@@ -43,7 +43,7 @@
"id": "2",
"metadata": {},
"source": [
"## Step 1: Review the Monolithic Script\n",
"## Exercise 1: Review the Monolithic Script\n",
"\n",
"Below is a single script that identifies missing values and performs imputation. Your task is to refactor this into modular code.\n",
"\n",
@@ -106,7 +106,7 @@
"id": "6",
"metadata": {},
"source": [
"## Step 2: Add your functions to the pipeline\n",
"## Exercise 2: Add your functions to the pipeline\n",
"\n",
"**Tasks:**\n",
"- Add your functions to an appropriate module in `src/python_rap_demo`\n",
@@ -122,7 +122,7 @@
"id": "7",
"metadata": {},
"source": [
"## Step 3: Challenge & Reflection\n",
"## Exercise 3: Challenge & Reflection\n",
"\n",
"**Tasks:**\n",
"- Compare the cleaned_data output with and without your changes\n",
@@ -138,7 +138,7 @@
"id": "8",
"metadata": {},
"source": [
"## Step 4: Refactor Visualisation Code\n",
"## Exercise 4: Refactor Visualisation Code\n",
"\n",
"As an additional challenge, Add code to create visualisations. Place all visualisation functions in a separate module (e.g., `report.py`).\n",
"\n",
8 changes: 4 additions & 4 deletions exercises/03_config_files.ipynb
@@ -21,7 +21,7 @@
"id": "1",
"metadata": {},
"source": [
"## Step 1: Add a new parameter to the config file\n",
"## Exercise 1: Add a new parameter to the config file\n",
"\n",
"Open `config/user_config.yaml` and add a new parameter. For example, add a parameter to control the minimum height allowed in your analysis:\n",
"\n",
@@ -42,7 +42,7 @@
"id": "2",
"metadata": {},
"source": [
"## Step 2: Update your script to use the new parameter\n",
"## Exercise 2: Update your script to use the new parameter\n",
"\n",
"Update your pipeline code (e.g., in `main.py` or `cleaning.py`) to read the new parameter from the config file and use it to filter the data.\n",
"\n",
@@ -59,7 +59,7 @@
"id": "3",
"metadata": {},
"source": [
"## Step 3: Test and reflect\n",
"## Exercise 3: Test and reflect\n",
"\n",
"**Tasks:**\n",
"- Run your pipeline and check that the new parameter is applied\n",
@@ -75,7 +75,7 @@
"id": "4",
"metadata": {},
"source": [
"## Step 4: Challenge\n",
"## Exercise 4: Challenge\n",
"\n",
"Add another parameter to your config file, such as `output_format` (e.g., `csv` or `xlsx`). Update your pipeline to use this parameter when saving output files.\n",
"\n",
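
The config exercises in `03_config_files.ipynb` ask for a parameter to be read from `config/user_config.yaml` and used to filter the data. The sketch below shows one way a pipeline step might do that. It assumes the YAML file has already been loaded into a dictionary, that the new key is called `min_height`, and that each record carries a `height` field. All of these names, and the function `filter_by_min_height`, are hypothetical and are not taken from the repository.

```python
from typing import Any


def filter_by_min_height(
    rows: list[dict[str, Any]], config: dict[str, Any]
) -> list[dict[str, Any]]:
    """Keep only rows whose height is at least config["min_height"]."""
    # Hypothetical key: if the parameter is absent, apply no filter at all.
    min_height = config.get("min_height")
    if min_height is None:
        return list(rows)
    # In this sketch, rows with a missing height are dropped by the filter.
    return [
        row
        for row in rows
        if row.get("height") is not None and row["height"] >= min_height
    ]


# Stand-in for the dictionary a YAML parser would return.
config = {"min_height": 100}
rows = [
    {"id": 1, "height": 150.0},
    {"id": 2, "height": 95.0},
    {"id": 3, "height": None},
]
kept = filter_by_min_height(rows, config)
print([row["id"] for row in kept])
```

Keeping the threshold in the config file means the analysis can be re-run with a different cut-off without editing the code.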
8 changes: 4 additions & 4 deletions exercises/04_unit_tests.ipynb
@@ -18,10 +18,10 @@
"## Why are unit tests important?\n",
"\n",
"Unit tests check that individual functions behave as expected. They:\n",
"- Help catch bugs early\n",
"- Make code easier to maintain\n",
"- Support reproducibility and automation\n",
"- Give confidence when refactoring or adding new features\n",
"- Help catch bugs, by testing small pieces of code in isolation\n",
"- Make code easier to maintain by documenting expected behavior\n",
"- Ensure code changes don't break existing functionality by running tests after modifications\n",
"- Give confidence when refactoring or adding new features by verifying existing tests still pass\n",
"\n",
"Read more in the [pytest documentation](https://docs.pytest.org/en/stable/)."
]
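
To make the points above concrete, here is a minimal sketch of a small function and two pytest-style tests for it. The function `mean_ignoring_missing` and both test names are hypothetical and do not come from the repository. In a real project the tests would live in a file such as `tests/test_<module>.py` and import the function from `src/python_rap_demo/`.

```python
from typing import Optional


def mean_ignoring_missing(values: list[Optional[float]]) -> Optional[float]:
    """Return the mean of the non-missing values, or None if there are none."""
    present = [value for value in values if value is not None]
    if not present:
        return None
    return sum(present) / len(present)


def test_mean_ignores_missing_values() -> None:
    # The missing entry is skipped, so the mean is (1.0 + 3.0) / 2.
    assert mean_ignoring_missing([1.0, None, 3.0]) == 2.0


def test_mean_of_all_missing_is_none() -> None:
    # Edge case: nothing to average, so the function signals that with None.
    assert mean_ignoring_missing([None, None]) is None
```

Each test checks one behaviour in isolation, so a failure points directly at the case that broke. pytest collects functions whose names start with `test_`, so `pytest tests` would pick these up.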
44 changes: 16 additions & 28 deletions exercises/solutions/02_modules_solutions.ipynb
@@ -23,6 +23,7 @@
"import sys\n",
"\n",
"import pandas as pd\n",
"import plotly.express as px\n",
"\n",
"sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), \"..\", \"..\", \"src\")))\n",
"\n",
@@ -51,9 +52,9 @@
"id": "2",
"metadata": {},
"source": [
"## Step 1 solution: Imputation function\n",
"## Exercise 1 solution: Imputation function\n",
"\n",
"Below is a function that performs imputation for missing height and weight values using group means (e.g., by sex or diagnosis), and flags which values were imputed. This approach is more robust and transparent than using overall means."
"Below is a function that performs imputation for missing height and weight values using group means (e.g., by sex or diagnosis), and flags which values were imputed."
]
},
{
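
The solution code for Exercise 1 is not shown in this hunk. As a rough illustration of the group-mean imputation with flagging that the text describes, a sketch could look like the following. It is not the repository's `impute_by_group`: the function name `impute_group_mean`, its signature, and the `<column>_imputed` flag naming are assumptions made for this example.

```python
import pandas as pd


def impute_group_mean(df: pd.DataFrame, column: str, group_col: str) -> pd.DataFrame:
    """Fill missing values in `column` with the mean of each `group_col` group."""
    out = df.copy()  # leave the caller's DataFrame untouched
    was_missing = out[column].isna()
    # transform("mean") returns one value per row: the mean of that row's group.
    group_means = out.groupby(group_col)[column].transform("mean")
    out[column] = out[column].fillna(group_means)
    # Flag only values that were actually filled. If a whole group is missing,
    # its mean is NaN, the value stays missing, and the flag stays False.
    out[f"{column}_imputed"] = was_missing & out[column].notna()
    return out


example = pd.DataFrame(
    {
        "sex": ["M", "M", "M", "F"],
        "height": [180.0, None, 170.0, 160.0],
    }
)
result = impute_group_mean(example, column="height", group_col="sex")
print(result)
```

The missing height in the `M` group is filled with the mean of the other `M` values, and the flag column records which rows were changed so downstream steps can tell them apart.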
@@ -130,7 +131,7 @@
"id": "4",
"metadata": {},
"source": [
"## Step 2 solution: Add functions to pipeline\n",
"## Exercise 2 solution: Add functions to pipeline\n",
"Where you place these functions depends on the context and what makes most sense for your pipeline. However, for this exercise one appropriate solution is outlined below:\n",
"\n",
"1. Place the `flag_missing`, `impute_by_group` and `impute_height_weight` function into `src/python_rap_demo/cleaning.py`. You could also have added `flag_missing` to utils as it could be used across other modules."
@@ -147,8 +148,6 @@
"cleaning.py: Data cleaning functions\n",
"\"\"\"\n",
"\n",
"import pandas as pd\n",
"\n",
"\n",
"def clean_health_data(df: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"\n",
@@ -298,11 +297,11 @@
"id": "9",
"metadata": {},
"source": [
"## Step 3 Solution: Reflection\n",
"## Exercise 3 solution: Reflection\n",
"\n",
"- Group-based imputation preserves important differences in the data and improves reproducibility.\n",
"- Flagging imputed values helps with transparency and downstream analysis.\n",
"- Handling edge cases (e.g., all values missing in a group) ensures robustness.\n",
"- Modular code improves reproducibility and maintainability by placing small functions that can be used across the pipeline into clearly named scripts.\n",
"- The advantage of separating code into modules is that it allows users to find and reuse functions easily.\n",
"- This pipeline could be extended further by adding more functions to existing modules such as extra cleaning or analysis functions. New modules could also be created for specific tasks such as creating consistent spreadsheet outputs.\n",
"\n",
"**Extension:**\n",
"The above could be taken further to automatically detect categorical and numerical columns, applying imputation to each. \n",
@@ -374,17 +373,7 @@
"# for cleaning.py\n",
"\n",
"\n",
"from typing import List\n",
"\n",
"# import utils functions above using\n",
"# from python_rap_demo.utils import (\n",
"# get_column_types,\n",
"# is_categorical_column,\n",
"# is_numeric_column,\n",
"# )\n",
"\n",
"\n",
"def flag_missing(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:\n",
"def flag_missing(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:\n",
" \"\"\"\n",
" Add boolean columns to flag missing values for specified columns.\n",
"\n",
@@ -485,7 +474,7 @@
"id": "13",
"metadata": {},
"source": [
"## Step 4 Solution: Refactor Visualisation Code\n",
"## Exercise 4 solution: Refactor Visualisation Code\n",
"\n",
"Below are example functions for visualising missing values and disease prevalence using Plotly. These would go in `src/python_rap_demo/report.py`."
]
@@ -505,7 +494,9 @@
"metadata": {},
"outputs": [],
"source": [
"import plotly.express as px\n",
"\"\"\"\n",
"report.py: Markdown report generation for RAP pipeline\n",
"\"\"\"\n",
"\n",
"\n",
"def plot_missing_values(df: pd.DataFrame, output_path: str) -> None:\n",
@@ -569,11 +560,6 @@
"report.py: Markdown report generation for RAP pipeline\n",
"\"\"\"\n",
"\n",
"import pandas as pd\n",
"\n",
"# Import plotly\n",
"import plotly.express as px\n",
"\n",
"\n",
"def format_month_section(month: str, month_df: pd.DataFrame) -> str:\n",
" \"\"\"\n",
@@ -677,7 +663,9 @@
"metadata": {},
"outputs": [],
"source": [
"# report.py\n",
"\"\"\"\n",
"report.py: Markdown report generation for RAP pipeline\n",
"\"\"\"\n",
"\n",
"\n",
"def generate_markdown_report(\n",