{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting Character Deaths in Game of Thrones\n",
    "\n",
    "Authors: Thomas Jian, Ian MacCarthy, Arturo Rey, Sifan Zhang"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "from myst_nb import glue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "\n",
    "short_results_df = pd.read_csv('random_search_scores.csv')\n",
    "classification_results = pd.read_csv('class_report.csv')\n",
    "missing_percentage = pd.read_csv('calculate_missing_percentage.csv').rename(columns={'Unnamed: 0':'Feature','0':'Percent NaN'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "We build a prediction tool that predicts whether a given character from the Game of Thrones books will survive to the end of the series. To do this, we implement a logistic regression model on a data set containing character information. The model is not able to achieve prediction accuracy any better than 65%. This is likely due to an absence of strong patterns in the plot and cast of characters that would allow us to easily answer such a question."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "The popular novel series *Game of Thrones* by George R. R. Martin takes place in a fictional world inspired by medeival Europe. Warfare and violence are central to the plot, leading to frequent casualties among the expansive cast of characters. Exactly which characters are fated to die before the end of the series has become a topic of heated speculation.\n",
    "\n",
    "In this analysis we try to predict whether a character will come to grief before the end of the series based on that character's attributes. We consider a data set created by Data Society available through data.world {cite}`data_society_2016`, which comprises a comprehensive list of characters appearing in the series. For each character there is a number of attributes, such as whether the character is noble, whether they appeared in a given book in the series, whether a relation of theirs is already dead, and finally whether or not the character is still alive at the end. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Methods and Results\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A concern of ours having seen the data is that there are a number of columns which are filled almost completely with missing values. These represent information that is only known for a very small minority of characters, like whether their parents are alive and who their heir is. We decided to drop columns that were almost entirely missing values, like **isAliveFather**. In other cases like **House**, where the missing values represent characters who are not noble and thus do not belong to a house, we filled the missing values with a new category like 'No house'. Also these columns contained a great number of unique categories; for simplicity's sake we decided to keep only the most common categories and filled all the other values with 'other'. Finally for age, which is unknown for about half of the characters, we filled missing values with the median age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Feature</th>\n",
       "      <th>Percent NaN</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>S.No</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>plod</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>name</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>title</td>\n",
       "      <td>0.517986</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>male</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>culture</td>\n",
       "      <td>0.652107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>dateOfBirth</td>\n",
       "      <td>0.777492</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>DateoFdeath</td>\n",
       "      <td>0.771840</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>mother</td>\n",
       "      <td>0.989209</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>father</td>\n",
       "      <td>0.986639</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>heir</td>\n",
       "      <td>0.988181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>house</td>\n",
       "      <td>0.219424</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>spouse</td>\n",
       "      <td>0.858171</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>book1</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>book2</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>book3</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>book4</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>book5</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>isAliveMother</td>\n",
       "      <td>0.989209</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>isAliveFather</td>\n",
       "      <td>0.986639</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>isAliveHeir</td>\n",
       "      <td>0.988181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>isAliveSpouse</td>\n",
       "      <td>0.858171</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>isMarried</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>isNoble</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>age</td>\n",
       "      <td>0.777492</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>numDeadRelations</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>boolDeadRelations</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>isPopular</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>popularity</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>isAlive</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Feature  Percent NaN\n",
       "0                S.No     0.000000\n",
       "1                plod     0.000000\n",
       "2                name     0.000000\n",
       "3               title     0.517986\n",
       "4                male     0.000000\n",
       "5             culture     0.652107\n",
       "6         dateOfBirth     0.777492\n",
       "7         DateoFdeath     0.771840\n",
       "8              mother     0.989209\n",
       "9              father     0.986639\n",
       "10               heir     0.988181\n",
       "11              house     0.219424\n",
       "12             spouse     0.858171\n",
       "13              book1     0.000000\n",
       "14              book2     0.000000\n",
       "15              book3     0.000000\n",
       "16              book4     0.000000\n",
       "17              book5     0.000000\n",
       "18      isAliveMother     0.989209\n",
       "19      isAliveFather     0.986639\n",
       "20        isAliveHeir     0.988181\n",
       "21      isAliveSpouse     0.858171\n",
       "22          isMarried     0.000000\n",
       "23            isNoble     0.000000\n",
       "24                age     0.777492\n",
       "25   numDeadRelations     0.000000\n",
       "26  boolDeadRelations     0.000000\n",
       "27          isPopular     0.000000\n",
       "28         popularity     0.000000\n",
       "29            isAlive     0.000000"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_percentage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "As a rule of thumb, we usually drop features whose percentage of missing values are over 15/20/25%. This is done because imputing these values, or replacing them with the mean/median/mode, would create bias in our data. Another way to deal with this case, would be to drop all the columns where we have null values in. This is a viable approach when the number of null values are not a big part of the dataset.\n",
    "Considering that a couple of features have a great percentage of missing values (> 25%), we have dropped these features from our training and test set in order to not influence our model by adding bias by imputing values.\n",
    "\n",
    "Among those features we picked for further analysis, some are easily understood by the feature names while some others might be hard to read and understand, so descriptions of these features are provided below.\n",
    "\n",
    "Culture: Social group to which a character belongs \\\n",
    "House: House to which a character belongs \\\n",
    "book1,book2,book3,book4,book5: Whether or not the character is mentioned in the book \\\n",
    "numDeadRelations: Number of dead characters to whom a character is related \\\n",
    "Popularity: popularity score calculated from the number of internal incoming and outgoing\n",
    "links to the characters page in the http://awoiaf.westeros.org wiki \\\n",
    "Target:whether or not the character is alive "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Based on the selected features, we created a heatmap {cite}`ostblom_2023_dsci531` to illustrate the correlation of each feature with **isAlive**. None of the correlations, in {numref}`Figure {number} <correlation_map>`, are particularly strong, but they seem to make sense. There is a notable positive correlation with whether the character is alive in book 4, which makes sense given that a character has no chance of being alive at the end of the series if they aren't still alive at book 4. There are also notable negative correlations with the character's popularity and the number of dead realtions the character has, suggesting that popular characters whose families are already depleted tend to survive in the story.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} correlation_heatmap.png\n",
    "---\n",
    "height: 500px\n",
    "name: correlation_map\n",
    "---\n",
    "Correlation heatmap for all features in processed data\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using this cleaned data set, we then constructed a column transformer to prepare the data for machine learning purposes. We tried a selection of different machine learning models: dummy, k-NN, SVM, and logistic regression {cite}`kolhatkar_2023_dsci571`. We found that the logistic regression model performed the best based on f1 score, with **test_f1** of 0.427, and so undertook randomized hyper-parameter optimization {cite}`ostblom_2023_dsci573` to fine-tune that model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results of a randomized grid search with cross validation is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>param_logisticregression__C</th>\n",
       "      <th>param_logisticregression__max_iter</th>\n",
       "      <th>param_logisticregression__class_weight</th>\n",
       "      <th>mean_test_score</th>\n",
       "      <th>mean_fit_time</th>\n",
       "      <th>mean_train_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.000</td>\n",
       "      <td>500</td>\n",
       "      <td>balanced</td>\n",
       "      <td>0.516</td>\n",
       "      <td>0.020</td>\n",
       "      <td>0.518</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000</td>\n",
       "      <td>2000</td>\n",
       "      <td>balanced</td>\n",
       "      <td>0.516</td>\n",
       "      <td>0.020</td>\n",
       "      <td>0.518</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.048</td>\n",
       "      <td>100</td>\n",
       "      <td>balanced</td>\n",
       "      <td>0.502</td>\n",
       "      <td>0.038</td>\n",
       "      <td>0.516</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.886</td>\n",
       "      <td>2000</td>\n",
       "      <td>balanced</td>\n",
       "      <td>0.501</td>\n",
       "      <td>0.041</td>\n",
       "      <td>0.519</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>69.519</td>\n",
       "      <td>2000</td>\n",
       "      <td>balanced</td>\n",
       "      <td>0.498</td>\n",
       "      <td>0.042</td>\n",
       "      <td>0.525</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   param_logisticregression__C  param_logisticregression__max_iter  \\\n",
       "0                        0.000                                 500   \n",
       "1                        0.000                                2000   \n",
       "2                        0.048                                 100   \n",
       "3                        0.886                                2000   \n",
       "4                       69.519                                2000   \n",
       "\n",
       "  param_logisticregression__class_weight  mean_test_score  mean_fit_time  \\\n",
       "0                               balanced            0.516          0.020   \n",
       "1                               balanced            0.516          0.020   \n",
       "2                               balanced            0.502          0.038   \n",
       "3                               balanced            0.501          0.041   \n",
       "4                               balanced            0.498          0.042   \n",
       "\n",
       "   mean_train_score  \n",
       "0             0.518  \n",
       "1             0.518  \n",
       "2             0.516  \n",
       "3             0.519  \n",
       "4             0.525  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "short_results_df.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having found this best performing model, we tried it out by deploying it on the test set. The classification report is shown below, along with a precision recall curve and a reciever operating characteristic curve"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>precision</th>\n",
       "      <th>recall</th>\n",
       "      <th>f1-score</th>\n",
       "      <th>support</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.876984</td>\n",
       "      <td>0.609235</td>\n",
       "      <td>0.718991</td>\n",
       "      <td>1451.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.395522</td>\n",
       "      <td>0.749495</td>\n",
       "      <td>0.517795</td>\n",
       "      <td>495.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.644913</td>\n",
       "      <td>0.644913</td>\n",
       "      <td>0.644913</td>\n",
       "      <td>0.644913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.636253</td>\n",
       "      <td>0.679365</td>\n",
       "      <td>0.618393</td>\n",
       "      <td>1946.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.754516</td>\n",
       "      <td>0.644913</td>\n",
       "      <td>0.667813</td>\n",
       "      <td>1946.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   precision    recall  f1-score      support\n",
       "0   0.876984  0.609235  0.718991  1451.000000\n",
       "1   0.395522  0.749495  0.517795   495.000000\n",
       "2   0.644913  0.644913  0.644913     0.644913\n",
       "3   0.636253  0.679365  0.618393  1946.000000\n",
       "4   0.754516  0.644913  0.667813  1946.000000"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classification_results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} pr_curve.png\n",
    "---\n",
    "height: 500px\n",
    "name: pr_curve\n",
    "---\n",
    "Precision Recall Curve for the best model\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} roc_curve.png\n",
    "---\n",
    "height: 500px\n",
    "name: roc_curve\n",
    "---\n",
    "RoC Curve for the best model\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For both classes (0 = character survives, 1 = character dies), the model's recall is comparable, show in {numref}`Figure {number} <pr_curve>` and {numref}`Figure {number} <roc_curve>`. However for class 1 the model's precision is much lower, indicating that it is a bit too quick to predict the demise of a character. This could perhaps be improved in future by altering the class weights."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discussion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overall the model's accuracy is fairly unimpressive, correctly predicting the fate of a character in only about half of all cases. While this might seem a bit disappointing, it did not particularly surprise or discourage us: George R. R. Martin is a celebrated author and master story teller, and the fact that we can't easily predict a whether a character will survive based on their attributes is a testament to the quality of his writing rather than the inadequacy of our model. Of course, this prediction tool is largely humerous and unlikely to make much of an impact beyond this project, it is nevertheless interesting to have a quanititative assessment of a plotline's predictability!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{bibliography}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}