diff --git a/README.md b/README.md index 50284da..e09bcc0 100644 --- a/README.md +++ b/README.md @@ -8,14 +8,18 @@ In a RAP project, the README is essential for: - Documenting setup steps and usage instructions - Outlining folder structure and key files - Explaining how to run the pipeline, tests, and automation tools -- Sharing best practices for reproducibility, automation, and transparency +- Sharing any other information that helps users and contributors understand and work with the project -A well-written README makes your RAP project accessible and easy for others to use, review, or contribute to. Update it as your project evolves. +The README file is the first file users and contributors will interact with in a RAP project. +A well-written README makes the RAP project accessible and easy for others to use, review, or contribute to. +Update it as your project evolves. --> # Work in Progress - RAP demonstration repository for Python Welcome to the RAP (Reproducible Analytical Pipeline) demonstration repository! This repository is designed for beginner to intermediate coders to practice RAP principles, experiment with code, and learn best practices for Reproducible Analytical Pipelines in Python. +See the [Reproducible Analytical Pipelines]([PROVISIONAL_LINK]) materials on the Analysis for Action platform for more information about RAPs and their importance. + **This repository is still in development** ## Getting Started @@ -99,6 +103,13 @@ Run tests with: pytest tests ``` +## Troubleshooting +If you encounter issues: +- Ensure your virtual environment is activated. The terminal prompt should show `(.venv)` at the start. +- Check that all dependencies are installed by running `pip install -r requirements.txt`. +- Verify you are in the project root directory when running commands. The terminal should show the path ending with `python_rap_demo`. +- For exercise notebooks, clean outputs and restart the kernel if you face issues. 
+ ## AI declaration AI has been used in the production of this content. diff --git a/exercises/01_introduction.ipynb b/exercises/01_introduction.ipynb index 0ff207b..d3ca429 100644 --- a/exercises/01_introduction.ipynb +++ b/exercises/01_introduction.ipynb @@ -6,8 +6,7 @@ "metadata": {}, "source": [ "# Introduction to RAP pipeline exercises\n", - "\n", - "Welcome to the RAP (Reproducible Analytical Pipeline) exercises for this project. These exercises are designed to help you learn and apply best practices for reproducible data analysis in Python.\n", + "Welcome to the RAP (Reproducible Analytical Pipeline) exercises. These exercises are designed to help you develop an understanding of best practices for a RAP and to walk through how to apply those practices in Python.\n", "\n", "## Contents of the exercises\n", "- **01_introduction.ipynb**: Overview and guidance for the RAP exercises\n", @@ -17,7 +16,7 @@ "- **05_continuous_integration.ipynb**: Implement and test continuous integration\n", "\n", "## Aim of the exercises\n", - "The aim is to guide to build on your understanding of reproducible analytical pipelines (RAP) with practical experience in a demonstration repository. \n", + "The aim is to guide you in building on your understanding of RAP with practical experience in a demonstration repository. \n", " \n", "These exercises are intended as a starting point that you can build upon. 
Please use this repository to practice elements of RAP you would like to improve that may not be covered by these exercises.\n", "\n", @@ -34,18 +33,25 @@ "\n", "## Adding code to the src folder\n", "- Place reusable functions and classes in the appropriate module in `src/python_rap_demo/` (e.g., `cleaning.py`, `io.py`, `report.py`).\n", - "- Use clear function names, type hints, and docstrings following PEP8 standards.\n", + "- Use clear function names, type hints, and docstrings following [PEP8](https://peps.python.org/pep-0008/) standards.\n", "\n", "## How to check if your solutions have worked\n", - "- **Compare outputs**: Check your results against the solutions notebooks and output files.\n", - "- **Check outputs**: Review generated files in `data/outputs/` or `reports/` for expected results.\n", - "- **Run tests**: For unit test exercises you can run all tests with:\n", + "View the solutions for each exercise in the `exercises/solutions/` folder. The solutions will walk through the expected outputs and how to produce them. This includes:\n", + "- **Comparing outputs**: Check your results against the solutions notebooks and output files.\n", + "- **Checking outputs**: Review generated files in `data/outputs/` or `reports/` for expected results.\n", + "- **Running tests**: For unit test exercises you can run all tests with:\n", " ```cmd\n", " pytest tests\n", " ```\n", - "- **Use pre-commit hooks and CI**: Ensure your code passes formatting, linting, and CI checks.\n", + "- **Using pre-commit hooks and CI**: Ensure your code passes formatting, linting, and CI checks.\n", "\n", - "By following these exercises, you will build a reproducible, automated, and transparent analytical pipeline suitable for real-world data analysis." + "## Troubleshooting\n", + "If you encounter issues:\n", + "- Ensure your virtual environment is activated. 
The terminal prompt should show `(.venv)` at the start.\n", + "- Check your notebook is using the correct Python kernel associated with the virtual environment. You can change the kernel in the Jupyter notebook interface.\n", + "- Clean outputs and restart the kernel if you face issues.\n", + "- Check that all dependencies are installed by running `pip install -r requirements.txt`.\n", + "- Verify you are in the project root directory when running commands. The terminal should show the path ending with `python_rap_demo`.\n" ] } ], diff --git a/exercises/02_modules.ipynb b/exercises/02_modules.ipynb index 34ac560..8aad4ff 100644 --- a/exercises/02_modules.ipynb +++ b/exercises/02_modules.ipynb @@ -43,7 +43,7 @@ "id": "2", "metadata": {}, "source": [ - "## Step 1: Review the Monolithic Script\n", + "## Exercise 1: Review the Monolithic Script\n", "\n", "Below is a single script that identifies missing values and performs imputation. Your task is to refactor this into modular code.\n", "\n", @@ -106,7 +106,7 @@ "id": "6", "metadata": {}, "source": [ - "## Step 2: Add your functions to the pipeline\n", + "## Exercise 2: Add your functions to the pipeline\n", "\n", "**Tasks:**\n", "- Add your functions to an appropriate module in `src/python_rap_demo`\n", @@ -122,7 +122,7 @@ "id": "7", "metadata": {}, "source": [ - "## Step 3: Challenge & Reflection\n", + "## Exercise 3: Challenge & Reflection\n", "\n", "**Tasks:**\n", "- Compare the cleaned_data output with and without your changes\n", @@ -138,7 +138,7 @@ "id": "8", "metadata": {}, "source": [ - "## Step 4: Refactor Visualisation Code\n", + "## Exercise 4: Refactor Visualisation Code\n", "\n", "As an additional challenge, Add code to create visualisations. 
Place all visualisation functions in a separate module (e.g., `report.py`).\n", "\n", diff --git a/exercises/03_config_files.ipynb b/exercises/03_config_files.ipynb index 413e1fc..6e7c7b6 100644 --- a/exercises/03_config_files.ipynb +++ b/exercises/03_config_files.ipynb @@ -21,7 +21,7 @@ "id": "1", "metadata": {}, "source": [ - "## Step 1: Add a new parameter to the config file\n", + "## Exercise 1: Add a new parameter to the config file\n", "\n", "Open `config/user_config.yaml` and add a new parameter. For example, add a parameter to control the minimum height allowed in your analysis:\n", "\n", @@ -42,7 +42,7 @@ "id": "2", "metadata": {}, "source": [ - "## Step 2: Update your script to use the new parameter\n", + "## Exercise 2: Update your script to use the new parameter\n", "\n", "Update your pipeline code (e.g., in `main.py` or `cleaning.py`) to read the new parameter from the config file and use it to filter the data.\n", "\n", @@ -59,7 +59,7 @@ "id": "3", "metadata": {}, "source": [ - "## Step 3: Test and reflect\n", + "## Exercise 3: Test and reflect\n", "\n", "**Tasks:**\n", "- Run your pipeline and check that the new parameter is applied\n", @@ -75,7 +75,7 @@ "id": "4", "metadata": {}, "source": [ - "## Step 4: Challenge\n", + "## Exercise 4: Challenge\n", "\n", "Add another parameter to your config file, such as `output_format` (e.g., `csv` or `xlsx`). Update your pipeline to use this parameter when saving output files.\n", "\n", diff --git a/exercises/04_unit_tests.ipynb b/exercises/04_unit_tests.ipynb index 417d58c..00b6c9d 100644 --- a/exercises/04_unit_tests.ipynb +++ b/exercises/04_unit_tests.ipynb @@ -18,10 +18,10 @@ "## Why are unit tests important?\n", "\n", "Unit tests check that individual functions behave as expected. 
They:\n", - "- Help catch bugs early\n", - "- Make code easier to maintain\n", - "- Support reproducibility and automation\n", - "- Give confidence when refactoring or adding new features\n", + "- Help catch bugs by testing small pieces of code in isolation\n", + "- Make code easier to maintain by documenting expected behaviour\n", + "- Ensure code changes don't break existing functionality by running tests after modifications\n", + "- Give confidence when refactoring or adding new features by verifying existing tests still pass\n", "\n", "Read more in the [pytest documentation](https://docs.pytest.org/en/stable/)." ] } ], diff --git a/exercises/solutions/02_modules_solutions.ipynb b/exercises/solutions/02_modules_solutions.ipynb index 7e32ecb..b1e434b 100644 --- a/exercises/solutions/02_modules_solutions.ipynb +++ b/exercises/solutions/02_modules_solutions.ipynb @@ -23,6 +23,7 @@ "import sys\n", "\n", "import pandas as pd\n", + "import plotly.express as px\n", "\n", "sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), \"..\", \"..\", \"src\")))\n", "\n", @@ -51,9 +52,9 @@ "id": "2", "metadata": {}, "source": [ - "## Step 1 solution: Imputation function\n", - "\n", - "Below is a function that performs imputation for missing height and weight values using group means (e.g., by sex or diagnosis), and flags which values were imputed. This approach is more robust and transparent than using overall means." + "## Exercise 1 solution: Imputation function\n", + "\n", + "Below is a function that performs imputation for missing height and weight values using group means (e.g., by sex or diagnosis), and flags which values were imputed." ] }, { @@ -130,7 +131,7 @@ "id": "4", "metadata": {}, "source": [ - "## Step 2 solution: Add functions to pipeline\n", + "## Exercise 2 solution: Add functions to pipeline\n", "Where you place these functions depends on the context and what makes most sense for your pipeline. 
However, for this exercise one appropriate solution is outlined below:\n", "\n", "1. Place the `flag_missing`, `impute_by_group` and `impute_height_weight` function into `src/python_rap_demo/cleaning.py`. You could also have added `flag_missing` to utils as it could be used across other modules." @@ -147,8 +148,6 @@ "cleaning.py: Data cleaning functions\n", "\"\"\"\n", "\n", - "import pandas as pd\n", - "\n", "\n", "def clean_health_data(df: pd.DataFrame) -> pd.DataFrame:\n", " \"\"\"\n", @@ -298,11 +297,11 @@ "id": "9", "metadata": {}, "source": [ - "## Step 3 Solution: Reflection\n", "\n", - "- Group-based imputation preserves important differences in the data and improves reproducibility.\n", - "- Flagging imputed values helps with transparency and downstream analysis.\n", - "- Handling edge cases (e.g., all values missing in a group) ensures robustness.\n", + "## Exercise 3 solution: Reflection\n", "\n", + "- Modular code improves reproducibility and maintainability by placing small functions that can be used across the pipeline into clearly named scripts.\n", + "- The advantage of separating code into modules is that it allows users to find and reuse functions easily.\n", + "- This pipeline could be extended further by adding more functions, such as extra cleaning or analysis functions, to existing modules. New modules could also be created for specific tasks such as creating consistent spreadsheet outputs.\n", "\n", "**Extension:**\n", "The above could be taken further to automatically detect categorical and numerical columns, applying imputation to each. 
\n", @@ -374,17 +373,7 @@ "# for cleaning.py\n", "\n", "\n", - "from typing import List\n", - "\n", - "# import utils functions above using\n", - "# from python_rap_demo.utils import (\n", - "# get_column_types,\n", - "# is_categorical_column,\n", - "# is_numeric_column,\n", - "# )\n", - "\n", - "\n", - "def flag_missing(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:\n", + "def flag_missing(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:\n", " \"\"\"\n", " Add boolean columns to flag missing values for specified columns.\n", "\n", @@ -485,7 +474,7 @@ "id": "13", "metadata": {}, "source": [ - "## Step 4 Solution: Refactor Visualisation Code\n", + "## Exercise 4 solution: Refactor Visualisation Code\n", "\n", "Below are example functions for visualising missing values and disease prevalence using Plotly. These would go in `src/python_rap_demo/report.py`." ] @@ -505,7 +494,9 @@ "metadata": {}, "outputs": [], "source": [ - "import plotly.express as px\n", + "\"\"\"\n", + "report.py: Markdown report generation for RAP pipeline\n", + "\"\"\"\n", "\n", "\n", "def plot_missing_values(df: pd.DataFrame, output_path: str) -> None:\n", @@ -569,11 +560,6 @@ "report.py: Markdown report generation for RAP pipeline\n", "\"\"\"\n", "\n", - "import pandas as pd\n", - "\n", - "# Import plotly\n", - "import plotly.express as px\n", - "\n", "\n", "def format_month_section(month: str, month_df: pd.DataFrame) -> str:\n", " \"\"\"\n", @@ -677,7 +663,9 @@ "metadata": {}, "outputs": [], "source": [ - "# report.py\n", + "\"\"\"\n", + "report.py: Markdown report generation for RAP pipeline\n", + "\"\"\"\n", "\n", "\n", "def generate_markdown_report(\n", diff --git a/exercises/solutions/03_config_files_solutions.ipynb b/exercises/solutions/03_config_files_solutions.ipynb index dce649f..2882418 100644 --- a/exercises/solutions/03_config_files_solutions.ipynb +++ b/exercises/solutions/03_config_files_solutions.ipynb @@ -38,12 +38,12 @@ "id": "2", "metadata": {}, "source": [ - "## 
Step 1 Solution: Add a new parameter to the config file\n", + "Exercise 1 solution: Add a new parameter to the config file\n", "\n", "Open `config/user_config.yaml` and add the following parameter:\n", "\n", "```yaml\n", - "min_height_cm: 120 # Minimum height (cm) to include in analysis\n", + "min_height_cm: 155 # Minimum height (cm) to include in analysis\n", "```\n", "\n", "This parameter allows you to control which rows are included based on height. Save the file after editing." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ - "## Step 2 Solution: Read config parameters in a modular way\n", + "## Exercise 2 solution: Read config parameters in a modular way\n", "\n", "Create a function to read config values in a new module, e.g., `src/python_rap_demo/io.py`:" ] }, { @@ -66,9 +66,12 @@ "metadata": {}, "outputs": [], "source": [ + "# This block will error if the parameter has not been added to the config file\n", + "# Run the setup code block after adding the parameter to check that it works\n", + "\n", "# src/main.py\n", "\n", - "min_height = config[\"min_height\"]\n" + "min_height = config[\"min_height_cm\"]" ] }, { @@ -86,7 +89,7 @@ "id": "6", "metadata": {}, "source": [ - "## Step 3 Solution: Use config parameters in your pipeline\n", + "## Exercise 3 solution: Use config parameters in your pipeline\n", "\n", "Update your pipeline (e.g., in `main.py` or a cleaning module) to use the config-driven parameter for filtering:" ] }, { @@ -99,7 +102,7 @@ "outputs": [], "source": [ "# src/python_rap_demo/cleaning.py\n", - "import pandas as pd\n", + "\n", "\n", "def filter_by_min_height(df: pd.DataFrame, min_height_cm: int) -> pd.DataFrame:\n", " \"\"\"\n", @@ -133,6 +136,9 @@ "metadata": {}, "outputs": [], "source": [ + "# src/main.py\n", + "\n", + "\n", "min_height_cm = config[\"min_height_cm\"]\n", "\n", "filtered_df = filter_by_min_height(health_df, min_height_cm)" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ - "## Step 4 Solution: Add and use another config 
parameter (output format)\n", + "## Exercise 4 solution: Add and use another config parameter (output format)\n", "\n", "Add another parameter to your config file:\n", "\n", @@ -162,7 +168,7 @@ "outputs": [], "source": [ "# src/python_rap_demo/io.py\n", - "import pandas as pd\n", + "\n", "\n", "def save_output(df: pd.DataFrame, output_path: str, output_format: str) -> None:\n", " \"\"\"\n", @@ -181,15 +187,26 @@ " raise ValueError(\"Unsupported output format.\")" ] }, + { + "cell_type": "markdown", + "id": "12", + "metadata": {}, + "source": [ + "Example usage in `main.py`\n", + "```python\n", + "from python_rap_demo.io import save_output\n", + "```" + ] + }, { "cell_type": "code", "execution_count": null, - "id": "12", + "id": "13", "metadata": {}, "outputs": [], "source": [ - "# Example usage in main.py\n", - "from python_rap_demo.io import save_output\n", + "# src/main.py\n", + "\n", "\n", "output_format = config[\"output_format\"]\n", "save_output(filtered_df, output_path, output_format)" diff --git a/exercises/solutions/04_unit_tests_solutions.ipynb b/exercises/solutions/04_unit_tests_solutions.ipynb index 025498b..d25de2f 100644 --- a/exercises/solutions/04_unit_tests_solutions.ipynb +++ b/exercises/solutions/04_unit_tests_solutions.ipynb @@ -7,7 +7,7 @@ "source": [ "# Solutions: Unit Testing Exercises\n", "\n", - "This notebook provides step-by-step solutions for writing and running unit tests in your RAP pipeline using pytest. Each solution matches the corresponding exercise notebook and is designed for beginners." + "This notebook provides step-by-step solutions for writing and running unit tests in your RAP pipeline using pytest. Each solution matches the corresponding exercise notebook." 
] }, { @@ -15,7 +15,7 @@ "id": "1", "metadata": {}, "source": [ - "## Exercise 1 Solution: Review and adapt an existing unit test\n", + "## Exercise 1 solution: Review and adapt an existing unit test\n", "\n", "Open `tests/test_cleaning.py` and run the test using the following command in the terminal:\n", "```cmd\n", @@ -180,7 +180,7 @@ "id": "10", "metadata": {}, "source": [ - "## Exercise 3 Solution: Run your unit tests\n", + "## Exercise 3 solution: Run your unit tests\n", "\n", "Run the following command in your terminal:\n", "```cmd\n", @@ -195,7 +195,7 @@ "id": "11", "metadata": {}, "source": [ - "## Exercise 4 Solution: Stretch - Check test coverage\n", + "## Exercise 4 solution: Stretch - Check test coverage\n", "\n", "Run the following commands:\n", "```cmd\n", @@ -212,7 +212,7 @@ "id": "12", "metadata": {}, "source": [ - "## Exercise 5 Solution: Stretch - Try parameterisation in pytest\n", + "## Exercise 5 solution: Stretch - Try parameterisation in pytest\n", "\n", "Here are examples using `@pytest.mark.parametrize` for `flag_missing` and `impute_by_group`. Parameterisation lets you run the same test with different inputs, making your tests more robust and easier to maintain." 
] diff --git a/exercises/solutions/05_continuous_integration_solutions.ipynb b/exercises/solutions/05_continuous_integration_solutions.ipynb index 2e14d80..8f75490 100644 --- a/exercises/solutions/05_continuous_integration_solutions.ipynb +++ b/exercises/solutions/05_continuous_integration_solutions.ipynb @@ -15,7 +15,7 @@ "id": "1", "metadata": {}, "source": [ - "## Solution 1: Explore the CI configuration files\n", + "## Exercise 1 solution: Explore the CI configuration files\n", "\n", "- Open `.pre-commit-config.yaml`, `pyproject.toml`, and `requirements.txt` in your project root.\n", "- Review which hooks and tools are enabled for Python scripts and notebooks.\n", @@ -27,7 +27,7 @@ "id": "2", "metadata": {}, "source": [ - "## Solution 2: Add a new pre-commit hook and test with a deliberate error\n", + "## Exercise 2 solution: Add a new pre-commit hook and test with a deliberate error\n", "\n", "- Add the following to `.pre-commit-config.yaml` under the `pre-commit-hooks` repo:\n", " ```yaml\n", @@ -63,7 +63,7 @@ "id": "3", "metadata": {}, "source": [ - "## Solution 3: Detect secrets in your code\n", + "## Exercise 3 solution: Detect secrets in your code\n", "\n", "- Add a fake secret to any Python file. For example: \n", "\n", @@ -84,7 +84,7 @@ "id": "4", "metadata": {}, "source": [ - "## Solution 4: Add a new workflow that activates on push to any branch\n", + "## Exercise 4 solution: Add a new workflow that activates on push to any branch\n", "\n", "- Create a new workflow file in `.github/workflows/ci_lint_and_format_any_branch.yml` with the following content:\n", " ```yaml\n", diff --git a/src/main.py b/src/main.py index 3653e70..0673f65 100644 --- a/src/main.py +++ b/src/main.py @@ -1,5 +1,6 @@ """ -main.py: Entry point for the RAP (Reproducible Analytical Pipeline) demo project. +main.py: Entry point for the RAP (Reproducible Analytical Pipeline) demo +project. 
This script coordinates the full analysis pipeline, including: - Loading configuration settings @@ -39,6 +40,7 @@ def main(): df_clean = clean_health_data(df) df_clean = add_bmi_column(df_clean) write_dataframe(df_clean, cleaned_path) + # Calculate disease prevalence from the cleaned data prevalence_df = calculate_disease_prevalence(df_clean) print(f"Clean data outputted: {cleaned_path}") @@ -49,4 +51,5 @@ if __name__ == "__main__": + # Run the pipeline main()