diff --git a/INSTRUCTIONS.md b/INSTRUCTIONS.md new file mode 100644 index 0000000..3c24f53 --- /dev/null +++ b/INSTRUCTIONS.md @@ -0,0 +1,8 @@ +Offline Python RAP demo instructions + +Getting started + +1. Download and unzip the file +2. Open the folder in your chosen IDE + +Once these steps are complete, refer to the README.md file in the unzipped folder and follow the instructions from step 2 onwards diff --git a/README.md b/README.md index 39b086d..50284da 100644 --- a/README.md +++ b/README.md @@ -47,6 +47,7 @@ Welcome to the RAP (Reproducible Analytical Pipeline) demonstration repository! - `src/` — Main pipeline code and modules - `data/` — Example health data for analysis - `config/` — Configuration files (YAML) +- `reports/` — Graphs and reports - `tests/` — Unit tests for pipeline modules - `exercises/` — **Practice exercises** (see below) - `docs/` — Documentation diff --git a/config/user_config.yaml b/config/user_config.yaml index 2e1ddb7..6a47056 100644 --- a/config/user_config.yaml +++ b/config/user_config.yaml @@ -9,4 +9,4 @@ # - Any other settings you want to change without editing code input_path: data/input/health_data.csv cleaned_path: data/outputs/cleaned/health_data_cleaned.csv -report_dir: data/outputs/reports/ +report_dir: reports/ diff --git a/exercises/01_introduction.ipynb b/exercises/01_introduction.ipynb index 78c7e7b..0ff207b 100644 --- a/exercises/01_introduction.ipynb +++ b/exercises/01_introduction.ipynb @@ -24,13 +24,13 @@ "## Where to find exercises, solutions, and outputs\n", "- **Exercises**: Located in the `exercises/` folder (e.g., `exercises/02_modules.ipynb`)\n", "- **Solutions**: Located in `exercises/solutions/` (e.g., `exercises/solutions/02_modules_solutions.ipynb`)\n", - "- **Outputs**: Saved in `exercises/outputs/` (e.g., cleaned data, reports, charts)\n", + "- **Outputs**: Saved in `data/outputs/` or `reports/` (e.g., cleaned data, reports, charts)\n", "\n", "## How to use the exercises\n", "1. 
**Read each exercise notebook** and follow the instructions step by step.\n", "2. **Write your code** in the notebook cells or in the `src/python_rap_demo/` modules as instructed.\n", "4. **Check your solutions** by comparing your results with the solutions notebooks in `exercises/solutions/`.\n", - "5. **View outputs** in the `data/outputs/` folder (e.g., cleaned data, markdown reports, charts).\n", + "5. **View outputs** in the `data/outputs/` and `reports/` folders (e.g., cleaned data, markdown reports, charts).\n", "\n", "## Adding code to the src folder\n", "- Place reusable functions and classes in the appropriate module in `src/python_rap_demo/` (e.g., `cleaning.py`, `io.py`, `report.py`).\n", @@ -38,7 +38,7 @@ "\n", "## How to check if your solutions have worked\n", "- **Compare outputs**: Check your results against the solutions notebooks and output files.\n", - "- **Check outputs**: Review generated files in `data/outputs/` for expected results.\n", + "- **Check outputs**: Review generated files in `data/outputs/` or `reports/` for expected results.\n", "- **Run tests**: For unit test exercises you can run all tests with:\n", " ```cmd\n", " pytest tests\n", diff --git a/exercises/02_modules.ipynb b/exercises/02_modules.ipynb index 9f3aa7d..34ac560 100644 --- a/exercises/02_modules.ipynb +++ b/exercises/02_modules.ipynb @@ -146,7 +146,10 @@ "- Write a function to plot missing values per column before the data is cleaned\n", "- Write a function to plot disease prevalence for each disease category over time after the data is cleaned\n", "- Add the visualisations to the output report\n", - "- Save the charts to the outputs folder\n" + "\n", + "**Bonus:**\n", + "- Save the charts to the reports folder\n", + "- Customise chart formatting and colours\n" ] } ], @@ -161,7 +164,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.5" + "version": "3.12.3" } }, "nbformat": 4, diff --git 
a/exercises/03_config_files.ipynb b/exercises/03_config_files.ipynb index 5b71948..413e1fc 100644 --- a/exercises/03_config_files.ipynb +++ b/exercises/03_config_files.ipynb @@ -67,7 +67,7 @@ "\n", "**Reflect:**\n", "- How does using config files improve reproducibility and flexibility?\n", - "- What other parameters could you add to make your pipeline more configurable?" + "- What other parameters could you add to make your pipeline more configurable?\n" ] }, { diff --git a/exercises/04_unit_tests.ipynb b/exercises/04_unit_tests.ipynb index a0a5bb8..417d58c 100644 --- a/exercises/04_unit_tests.ipynb +++ b/exercises/04_unit_tests.ipynb @@ -31,7 +31,55 @@ "id": "2", "metadata": {}, "source": [ - "## Exercise 1: Write a simple unit test for a new function\n", + "## Exercise 1: Review and adapt an existing unit test\n", + "\n", + "The function `clean_health_data` in `src/python_rap_demo/cleaning.py` has an existing unit test in\n", + "`tests/test_cleaning.py`.\n", + "\n", + "**Task:** \n", + "1. Run the existing unit test for `clean_health_data` to understand how it works. \n", + "2. Modify the function `clean_health_data` and re-run the tests to understand what causes them to pass or fail, then modify the unit test so that it passes again.\n", + "\n", + "An example of a modified function is given below, but you do not have to use it. Feel free to modify the function in your own way and experiment with the tests."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "def clean_health_data(df: pd.DataFrame) -> pd.DataFrame:\n", + " \"\"\"\n", + " Clean health data by dropping rows with missing values in key columns.\n", + "\n", + " Args:\n", + " df (pd.DataFrame): Raw health data.\n", + "\n", + " Returns:\n", + " pd.DataFrame: Cleaned health data with no missing values in critical columns.\n", + " \"\"\"\n", + " df = df.copy()\n", + "\n", + " # Drop rows with missing values in height_cm, weight_kg, or diagnosis columns\n", + " df = df.dropna(subset=[\"height_cm\", \"weight_kg\", \"diagnosis\"])\n", + "\n", + " # Fill missing smoker values with 'Yes'\n", + " df[\"smoker\"] = df[\"smoker\"].fillna(\"Yes\")\n", + "\n", + " # Ensure gender is uppercase\n", + " df[\"gender\"] = df[\"gender\"].str.upper()\n", + "\n", + " return df\n" + ] + }, + { + "cell_type": "markdown", + "id": "4", + "metadata": {}, + "source": [ + "## Exercise 2: Write a simple unit test for a new function\n", "\n", "Suppose you have created a function called `impute_by_group` in `src/python_rap_demo/cleaning.py`.\n", "\n", @@ -46,10 +94,10 @@ }, { "cell_type": "markdown", - "id": "3", + "id": "5", "metadata": {}, "source": [ - "## Exercise 1a: Walkthrough - Write a unit test for flag_missing\n", + "## Exercise 2a: Walkthrough - Write a unit test for flag_missing\n", "\n", "Let's start with a simple function called `flag_missing`. 
This function adds a new column to your DataFrame to flag missing values in specified columns.\n", "\n", @@ -59,7 +107,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4", + "id": "6", "metadata": {}, "outputs": [], "source": [ @@ -93,7 +141,7 @@ }, { "cell_type": "markdown", - "id": "5", + "id": "7", "metadata": {}, "source": [ "### How to write a unit test for `flag_missing`\n", @@ -129,7 +177,7 @@ }, { "cell_type": "markdown", - "id": "6", + "id": "8", "metadata": {}, "source": [ "#### Understanding the test_flag_missing function\n", @@ -147,10 +195,10 @@ }, { "cell_type": "markdown", - "id": "7", + "id": "9", "metadata": {}, "source": [ - "## Exercise 1b: Write a unit test for impute_by_group\n", + "## Exercise 2b: Write a unit test for impute_by_group\n", "\n", "Now try writing your own tests for the following function. Use the walkthrough above as a guide.\n", "\n", @@ -165,7 +213,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8", + "id": "10", "metadata": {}, "outputs": [], "source": [ @@ -190,7 +238,7 @@ }, { "cell_type": "markdown", - "id": "9", + "id": "11", "metadata": {}, "source": [ "**Task:**\n", @@ -205,10 +253,10 @@ }, { "cell_type": "markdown", - "id": "10", + "id": "12", "metadata": {}, "source": [ - "## Exercise 2: Run your unit tests\n", + "## Exercise 3: Run your unit tests\n", "\n", "**Task:**\n", "- Run all tests in the `tests/` folder using the command below:\n", @@ -223,10 +271,10 @@ }, { "cell_type": "markdown", - "id": "11", + "id": "13", "metadata": {}, "source": [ - "## Exercise 3: Stretch - Check test coverage\n", + "## Exercise 4: Stretch - Check test coverage\n", "\n", "Test coverage shows how much of your code is tested by unit tests.\n", "\n", @@ -245,10 +293,10 @@ }, { "cell_type": "markdown", - "id": "12", + "id": "14", "metadata": {}, "source": [ - "## Exercise 4: Stretch - Try parameterisation in pytest\n", + "## Exercise 5: Stretch - Try parameterisation in pytest\n", "\n", "Parameterisation lets 
you run the same test with different inputs.\n", "\n", @@ -260,7 +308,8 @@ ], "metadata": { "language_info": { - "name": "python" + "name": "python", + "version": "3.12.3" } }, "nbformat": 4, diff --git a/exercises/solutions/02_modules_solutions.ipynb b/exercises/solutions/02_modules_solutions.ipynb index 0aecdd8..7e32ecb 100644 --- a/exercises/solutions/02_modules_solutions.ipynb +++ b/exercises/solutions/02_modules_solutions.ipynb @@ -22,6 +22,8 @@ "import os\n", "import sys\n", "\n", + "import pandas as pd\n", + "\n", "sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), \"..\", \"..\", \"src\")))\n", "\n", "from python_rap_demo.cleaning import clean_health_data\n", @@ -30,8 +32,8 @@ "from python_rap_demo.utils import add_bmi_column\n", "\n", "input_path = \"../../data/input/health_data.csv\"\n", - "cleaned_path = \"../outputs/02_modules/cleaned_data.csv\"\n", - "report_path = \"../outputs/02_modules/\"\n", + "cleaned_path = \"../../data/outputs/cleaned/health_data_cleaned.csv\"\n", + "report_path = \"../../reports/\"\n", "\n", "# I/O: Read data\n", "df = read_health_data(input_path)\n", @@ -62,7 +64,6 @@ "outputs": [], "source": [ "# Modular solution using subfunctions\n", - "import pandas as pd\n", "\n", "\n", "def flag_missing(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:\n", @@ -736,7 +737,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.5" + "version": "3.12.3" } }, "nbformat": 4, diff --git a/exercises/solutions/04_unit_tests_solutions.ipynb b/exercises/solutions/04_unit_tests_solutions.ipynb index daf6883..025498b 100644 --- a/exercises/solutions/04_unit_tests_solutions.ipynb +++ b/exercises/solutions/04_unit_tests_solutions.ipynb @@ -15,9 +15,13 @@ "id": "1", "metadata": {}, "source": [ - "## Solution 1: Write a simple unit test for a new function\n", + "## Exercise 1 Solution: Review and adapt an existing unit test\n", "\n", - "Here are example unit tests for the `flag_missing` and 
`impute_by_group` functions in `src/python_rap_demo/cleaning.py`:" + "Open `tests/test_cleaning.py` and run the test using the following command in the terminal:\n", + "```cmd\n", + "pytest tests/test_cleaning.py\n", + "```\n", + "After the test passes, change the function `clean_health_data` in `src/python_rap_demo/cleaning.py` to the following function, save the file and run the test again:" ] }, { @@ -26,10 +30,95 @@ "id": "2", "metadata": {}, "outputs": [], + "source": [ + "def clean_health_data(df: pd.DataFrame) -> pd.DataFrame:\n", + " \"\"\"\n", + " Clean health data by dropping rows with missing values in key columns.\n", + "\n", + " Args:\n", + " df (pd.DataFrame): Raw health data.\n", + "\n", + " Returns:\n", + " pd.DataFrame: Cleaned health data with no missing values in critical columns.\n", + " \"\"\"\n", + " df = df.copy()\n", + "\n", + " # Drop rows with missing values in height_cm, weight_kg, or diagnosis columns\n", + " df = df.dropna(subset=[\"height_cm\", \"weight_kg\", \"diagnosis\"])\n", + "\n", + " # Fill missing smoker values with 'Yes'\n", + " df[\"smoker\"] = df[\"smoker\"].fillna(\"Yes\")\n", + "\n", + " # Ensure gender is uppercase\n", + " df[\"gender\"] = df[\"gender\"].str.upper()\n", + "\n", + " return df" + ] + }, + { + "cell_type": "markdown", + "id": "3", + "metadata": {}, + "source": [ + "Notice how the check for the missing smoker value fails. The test checks the first row for a \"smoker\" value of \"No\", but the modified `clean_health_data` function fills missing smoker values with 'Yes'. This changes the smoker value in the first row to 'Yes', which causes the test to fail.\n", + "\n", + "The failing test draws developers' attention to the function, so they can check whether the change was intended. If it was not, the developer can fix the error in the function. If it was, the unit test can be adapted to incorporate the change. In this case, assume the change was correct.
In order for the test to pass change the expected \"smoker\" value to \"Yes\" instead of \"No\".\n", + "The original assert statement looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4", + "metadata": {}, + "outputs": [], + "source": [ + "assert cleaned[\"smoker\"].iloc[0] == \"No\"" + ] + }, + { + "cell_type": "markdown", + "id": "5", + "metadata": {}, + "source": [ + "The changed assert statement should look like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6", + "metadata": {}, + "outputs": [], + "source": [ + "assert cleaned[\"smoker\"].iloc[0] == \"Yes\"" + ] + }, + { + "cell_type": "markdown", + "id": "7", + "metadata": {}, + "source": [ + "## Exercise 2 Solution: Write a simple unit test for a new function\n", + "\n", + "Here are example unit tests for the `flag_missing` and `impute_by_group` functions in `src/python_rap_demo/cleaning.py`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8", + "metadata": {}, + "outputs": [], "source": [ "# Walkthrough: Unit test for flag_missing\n", + "import os\n", + "import sys\n", + "\n", "import pandas as pd\n", "\n", + "sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), \"..\", \"..\", \"src\")))\n", + "\n", "from python_rap_demo.cleaning import flag_missing\n", "\n", "\n", @@ -41,19 +130,33 @@ " flagged = flag_missing(df, [\"height_cm\", \"weight_kg\"])\n", " # Check that the _imputed columns are correct\n", " assert flagged[\"height_cm_imputed\"].tolist() == [False, True]\n", - " assert flagged[\"weight_kg_imputed\"].tolist() == [False, True]" + " print(\"height_cm_imputed test passed.\")\n", + " assert flagged[\"weight_kg_imputed\"].tolist() == [False, True]\n", + " print(\"weight_kg_imputed test passed.\")\n", + "\n", + "\n", + "test_flag_missing()" ] }, { "cell_type": "code", "execution_count": null, - "id": "3", + "id": "9", "metadata": {}, "outputs": [], "source": [ "# Example: Unit test for 
impute_by_group\n", + "\n", + "# Note: This test will not run unless impute_by_group has been entered into cleaning.py\n", + "\n", + "\n", + "import os\n", + "import sys\n", + "\n", "import pandas as pd\n", "\n", + "sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), \"..\", \"..\", \"src\")))\n", + "\n", "from python_rap_demo.cleaning import impute_by_group\n", "\n", "\n", @@ -65,15 +168,19 @@ " imputed = impute_by_group(df, \"height_cm\", \"sex\")\n", " # Check that missing value is imputed with group mean\n", " expected = [170, 160, 160]\n", - " assert imputed.tolist() == expected" + " assert imputed.tolist() == expected\n", + " print(\"impute_by_group test passed.\")\n", + "\n", + "\n", + "test_impute_by_group()" ] }, { "cell_type": "markdown", - "id": "4", + "id": "10", "metadata": {}, "source": [ - "## Solution 2: Run your unit tests\n", + "## Exercise 3 Solution: Run your unit tests\n", "\n", "Run the following command in your terminal:\n", "```cmd\n", @@ -85,10 +192,10 @@ }, { "cell_type": "markdown", - "id": "5", + "id": "11", "metadata": {}, "source": [ - "## Solution 3: Stretch - Check test coverage\n", + "## Exercise 4 Solution: Stretch - Check test coverage\n", "\n", "Run the following commands:\n", "```cmd\n", @@ -102,10 +209,10 @@ }, { "cell_type": "markdown", - "id": "6", + "id": "12", "metadata": {}, "source": [ - "## Solution 4: Stretch - Try parameterisation in pytest\n", + "## Exercise 5 Solution: Stretch - Try parameterisation in pytest\n", "\n", "Here are examples using `@pytest.mark.parametrize` for `flag_missing` and `impute_by_group`. Parameterisation lets you run the same test with different inputs, making your tests more robust and easier to maintain." 
] @@ -113,13 +220,20 @@ { "cell_type": "code", "execution_count": null, - "id": "7", + "id": "13", "metadata": {}, "outputs": [], "source": [ + "# Note: These tests will not run unless impute_by_group and flag_missing have been entered into cleaning.py\n", + "\n", + "import os\n", + "import sys\n", + "\n", "import pandas as pd\n", "import pytest\n", "\n", + "sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), \"..\", \"..\", \"src\")))\n", + "\n", "from python_rap_demo.cleaning import flag_missing, impute_by_group\n", "\n", "# Parameterised test for flag_missing\n", @@ -188,7 +302,16 @@ ], "metadata": { "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" } }, "nbformat": 4, diff --git a/data/outputs/reports/.gitkeep b/reports/.gitkeep similarity index 100% rename from data/outputs/reports/.gitkeep rename to reports/.gitkeep diff --git a/exercises/outputs/.gitkeep b/src/__init__.py similarity index 100% rename from exercises/outputs/.gitkeep rename to src/__init__.py diff --git a/exercises/outputs/02_modules/.gitkeep b/src/python_rap_demo/__init__.py similarity index 100% rename from exercises/outputs/02_modules/.gitkeep rename to src/python_rap_demo/__init__.py
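The Exercise 1 solution adapts an assertion in `tests/test_cleaning.py`, but that file is not part of this diff. The following is a minimal, self-contained sketch of the same idea, not the repository's actual test: the `make_sample` fixture and its values are hypothetical, and the function body is the modified `clean_health_data` from the notebook. It may help when checking that the adapted assertion (`"Yes"` instead of `"No"`) behaves as the solution text describes.

```python
import pandas as pd


def clean_health_data(df: pd.DataFrame) -> pd.DataFrame:
    """Modified version from the exercise: missing smoker values become 'Yes'."""
    df = df.copy()
    # Drop rows with missing values in the key columns
    df = df.dropna(subset=["height_cm", "weight_kg", "diagnosis"])
    # Fill missing smoker values with 'Yes' (the change under test)
    df["smoker"] = df["smoker"].fillna("Yes")
    # Ensure gender is uppercase
    df["gender"] = df["gender"].str.upper()
    return df


def make_sample() -> pd.DataFrame:
    # Hypothetical fixture for illustration; the real test data lives in
    # tests/test_cleaning.py and may differ.
    return pd.DataFrame(
        {
            "height_cm": [170.0, None, 165.0],
            "weight_kg": [70.0, 80.0, 60.0],
            "diagnosis": ["Asthma", "Diabetes", "Asthma"],
            "smoker": [None, "No", "Yes"],
            "gender": ["f", "m", "f"],
        }
    )


def test_clean_health_data() -> None:
    cleaned = clean_health_data(make_sample())
    # The row with a missing height is dropped, leaving two rows
    assert len(cleaned) == 2
    # Adapted assertion: the first row's missing smoker value is now 'Yes'
    assert cleaned["smoker"].iloc[0] == "Yes"
    assert cleaned["gender"].tolist() == ["F", "F"]
```

Saved as a file under `tests/`, this would be collected by `pytest tests` like the other tests; calling `test_clean_health_data()` directly in a notebook cell also works, in the same way the solutions notebook calls its own test functions.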