{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simpson's Paradox\n", "\n", "### Authors: \n", "- Christian Michelsen (Niels Bohr Institute)\n", "\n", "### Date: \n", "- 05-11-2018 (latest update)\n", "\n", "***\n", "\n", "The aim of this notebook is to illustrate Simpson's paradox through a simple, quick example.\n", "\n", "For more information on the Simpson's Paradox, see: __[Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox)__\n", "\n", "This small example is meant to show how aggregating different samples can change the overall statistics completely. In this specific case we will look at how the amount of exercise can change the risk of catching a highly infectioness disease. \n", "\n", "***\n", "\n", "First, we import the modules we want to use:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "sns.set(color_codes=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then load the data located in the csv file `Simpsons_paradox.csv` into a Pandas DataFrame and see what it contains:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Exercise</th><th>Age</th><th>Probability</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>9.720159</td><td>20-40</td><td>20.988422</td></tr>\n",
"<tr><th>1</th><td>2.498937</td><td>10-20</td><td>22.323348</td></tr>\n",
"<tr><th>2</th><td>5.520742</td><td>10-20</td><td>5.875953</td></tr>\n",
"<tr><th>3</th><td>8.441350</td><td>60-80</td><td>74.790105</td></tr>\n",
"<tr><th>4</th><td>10.813937</td><td>40-60</td><td>50.120903</td></tr>\n",
"<tr><th>5</th><td>8.362772</td><td>40-60</td><td>50.933656</td></tr>\n",
"<tr><th>6</th><td>5.232431</td><td>20-40</td><td>36.104305</td></tr>\n",
"<tr><th>7</th><td>11.192118</td><td>60-80</td><td>64.977362</td></tr>\n",
"<tr><th>8</th><td>13.174650</td><td>60-80</td><td>76.653905</td></tr>\n",
"<tr><th>9</th><td>8.463829</td><td>10-20</td><td>4.628246</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Exercise Age Probability\n", "0 9.720159 20-40 20.988422\n", "1 2.498937 10-20 22.323348\n", "2 5.520742 10-20 5.875953\n", "3 8.441350 60-80 74.790105\n", "4 10.813937 40-60 50.120903\n", "5 8.362772 40-60 50.933656\n", "6 5.232431 20-40 36.104305\n", "7 11.192118 60-80 64.977362\n", "8 13.174650 60-80 76.653905\n", "9 8.463829 10-20 4.628246" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"Simpsons_paradox.csv\", index_col=0) #index_col=0 to make sure that the first columns is used as the index\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a very quick, but illustrating, example of how to work with Pandas DataFrames, see the following __[link](https://jalammar.github.io/gentle-visual-intro-to-data-analysis-python-pandas)__.\n", "\n", "***\n", "\n", "To further see any relationships in the data, we plot the hours pr. week of exercise vs. the probability of catching the highly infectious disease. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig1, ax1 = plt.subplots(ncols=1, figsize=(14, 8))\n", "ax1 = sns.scatterplot(x=\"Exercise\", y=\"Probability\", data=df, label=\"Age: All\", ax=ax1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the plot is done using Seaborn and its `scatterplot` function. This works closely together with Pandas DataFrames, where we simply write the strings of the dataframe columns and provide the dataframe via the `data` keyword. \n", "\n", "## Questions: \n", "\n", "- What can you conclude from the following plot? \n", "\n", "***\n", "\n", "- In this example we have more data that we haven't used yet; the age of the patient. How would you include this information in the plot and does this alter your above conclusion? " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "# insert code here\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- ___Hint___: _look up the `hue` keyword in the `scatterplot` function_\n", "\n", "***\n", "\n", "### Optional question:\n", "\n", "When finished, you can take at the generating function located in `Simpsons_paradox_generate_data.py` and see if it makes sense.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }