{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simpson's Paradox\n", "\n", "### Authors: \n", "- Christian Michelsen (Niels Bohr Institute)\n", "\n", "### Date: \n", "- 05-11-2018 (latest update)\n", "\n", "***\n", "\n", "The aim of this notebook is to illustrate Simpson's paradox through a simple, quick example.\n", "\n", "For more information on Simpson's paradox, see: __[Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox)__\n", "\n", "This small example is meant to show how aggregating different samples can completely change the overall statistics. In this specific case we will look at how the amount of exercise can change the risk of catching a highly infectious disease.\n", "\n", "***\n", "\n", "First, we import the modules we want to use:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "sns.set(color_codes=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then load the data located in the CSV file `Simpsons_paradox.csv` into a pandas DataFrame and see what it contains:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | Exercise | Age | Probability |
|---|---|---|---|
| 0 | 9.720159 | 20-40 | 20.988422 |
| 1 | 2.498937 | 10-20 | 22.323348 |
| 2 | 5.520742 | 10-20 | 5.875953 |
| 3 | 8.441350 | 60-80 | 74.790105 |
| 4 | 10.813937 | 40-60 | 50.120903 |
| 5 | 8.362772 | 40-60 | 50.933656 |
| 6 | 5.232431 | 20-40 | 36.104305 |
| 7 | 11.192118 | 60-80 | 64.977362 |
| 8 | 13.174650 | 60-80 | 76.653905 |
| 9 | 8.463829 | 10-20 | 4.628246 |