{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Character Deaths in Game of Thrones\n", "\n", "Authors: Thomas Jian, Ian MacCarthy, Arturo Rey, Sifan Zhang" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "\n", "import pandas as pd\n", "from myst_nb import glue" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "\n", "short_results_df = pd.read_csv('random_search_scores.csv')\n", "classification_results = pd.read_csv('class_report.csv')\n", "missing_percentage = pd.read_csv('calculate_missing_percentage.csv').rename(columns={'Unnamed: 0':'Feature','0':'Percent NaN'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "We build a prediction tool that predicts whether a given character from the Game of Thrones books will survive to the end of the series. To do this, we implement a logistic regression model on a data set containing character information. The model is not able to achieve prediction accuracy any better than 65%. This is likely due to an absence of strong patterns in the plot and cast of characters that would allow us to easily answer such a question." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "The popular novel series *Game of Thrones* by George R. R. Martin takes place in a fictional world inspired by medeival Europe. Warfare and violence are central to the plot, leading to frequent casualties among the expansive cast of characters. Exactly which characters are fated to die before the end of the series has become a topic of heated speculation.\n", "\n", "In this analysis we try to predict whether a character will come to grief before the end of the series based on that character's attributes. We consider a data set created by Data Society available through data.world {cite}`data_society_2016`, which comprises a comprehensive list of characters appearing in the series. For each character there is a number of attributes, such as whether the character is noble, whether they appeared in a given book in the series, whether a relation of theirs is already dead, and finally whether or not the character is still alive at the end. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Methods and Results\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A concern of ours having seen the data is that there are a number of columns which are filled almost completely with missing values. These represent information that is only known for a very small minority of characters, like whether their parents are alive and who their heir is. We decided to drop columns that were almost entirely missing values, like **isAliveFather**. In other cases like **House**, where the missing values represent characters who are not noble and thus do not belong to a house, we filled the missing values with a new category like 'No house'. Also these columns contained a great number of unique categories; for simplicity's sake we decided to keep only the most common categories and filled all the other values with 'other'. Finally for age, which is unknown for about half of the characters, we filled missing values with the median age." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FeaturePercent NaN
0S.No0.000000
1plod0.000000
2name0.000000
3title0.517986
4male0.000000
5culture0.652107
6dateOfBirth0.777492
7DateoFdeath0.771840
8mother0.989209
9father0.986639
10heir0.988181
11house0.219424
12spouse0.858171
13book10.000000
14book20.000000
15book30.000000
16book40.000000
17book50.000000
18isAliveMother0.989209
19isAliveFather0.986639
20isAliveHeir0.988181
21isAliveSpouse0.858171
22isMarried0.000000
23isNoble0.000000
24age0.777492
25numDeadRelations0.000000
26boolDeadRelations0.000000
27isPopular0.000000
28popularity0.000000
29isAlive0.000000
\n", "
" ], "text/plain": [ " Feature Percent NaN\n", "0 S.No 0.000000\n", "1 plod 0.000000\n", "2 name 0.000000\n", "3 title 0.517986\n", "4 male 0.000000\n", "5 culture 0.652107\n", "6 dateOfBirth 0.777492\n", "7 DateoFdeath 0.771840\n", "8 mother 0.989209\n", "9 father 0.986639\n", "10 heir 0.988181\n", "11 house 0.219424\n", "12 spouse 0.858171\n", "13 book1 0.000000\n", "14 book2 0.000000\n", "15 book3 0.000000\n", "16 book4 0.000000\n", "17 book5 0.000000\n", "18 isAliveMother 0.989209\n", "19 isAliveFather 0.986639\n", "20 isAliveHeir 0.988181\n", "21 isAliveSpouse 0.858171\n", "22 isMarried 0.000000\n", "23 isNoble 0.000000\n", "24 age 0.777492\n", "25 numDeadRelations 0.000000\n", "26 boolDeadRelations 0.000000\n", "27 isPopular 0.000000\n", "28 popularity 0.000000\n", "29 isAlive 0.000000" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "missing_percentage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As a rule of thumb, we usually drop features whose percentage of missing values are over 15/20/25%. This is done because imputing these values, or replacing them with the mean/median/mode, would create bias in our data. Another way to deal with this case, would be to drop all the columns where we have null values in. This is a viable approach when the number of null values are not a big part of the dataset.\n", "Considering that a couple of features have a great percentage of missing values (> 25%), we have dropped these features from our training and test set in order to not influence our model by adding bias by imputing values.\n", "\n", "Among those features we picked for further analysis, some are easily understood by the feature names while some others might be hard to read and understand, so descriptions of these features are provided below.\n", "\n", "Culture: Social group to which a character belongs \\\n", "House: House to which a character belongs \\\n", "book1,book2,book3,book4,book5: Whether or not the character is mentioned in the book \\\n", "numDeadRelations: Number of dead characters to whom a character is related \\\n", "Popularity: popularity score calculated from the number of internal incoming and outgoing\n", "links to the characters page in the http://awoiaf.westeros.org wiki \\\n", "Target:whether or not the character is alive " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Based on the selected features, we created a heatmap {cite}`ostblom_2023_dsci531` to illustrate the correlation of each feature with **isAlive**. None of the correlations, in {numref}`Figure {number} `, are particularly strong, but they seem to make sense. There is a notable positive correlation with whether the character is alive in book 4, which makes sense given that a character has no chance of being alive at the end of the series if they aren't still alive at book 4. There are also notable negative correlations with the character's popularity and the number of dead realtions the character has, suggesting that popular characters whose families are already depleted tend to survive in the story.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{figure} correlation_heatmap.png\n", "---\n", "height: 500px\n", "name: correlation_map\n", "---\n", "Correlation heatmap for all features in processed data\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this cleaned data set, we then constructed a column transformer to prepare the data for machine learning purposes. We tried a selection of different machine learning models: dummy, k-NN, SVM, and logistic regression {cite}`kolhatkar_2023_dsci571`. We found that the logistic regression model performed the best based on f1 score, with **test_f1** of 0.427, and so undertook randomized hyper-parameter optimization {cite}`ostblom_2023_dsci573` to fine-tune that model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of a randomized grid search with cross validation is shown below." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
param_logisticregression__Cparam_logisticregression__max_iterparam_logisticregression__class_weightmean_test_scoremean_fit_timemean_train_score
00.000500balanced0.5160.0200.518
10.0002000balanced0.5160.0200.518
20.048100balanced0.5020.0380.516
30.8862000balanced0.5010.0410.519
469.5192000balanced0.4980.0420.525
\n", "
" ], "text/plain": [ " param_logisticregression__C param_logisticregression__max_iter \\\n", "0 0.000 500 \n", "1 0.000 2000 \n", "2 0.048 100 \n", "3 0.886 2000 \n", "4 69.519 2000 \n", "\n", " param_logisticregression__class_weight mean_test_score mean_fit_time \\\n", "0 balanced 0.516 0.020 \n", "1 balanced 0.516 0.020 \n", "2 balanced 0.502 0.038 \n", "3 balanced 0.501 0.041 \n", "4 balanced 0.498 0.042 \n", "\n", " mean_train_score \n", "0 0.518 \n", "1 0.518 \n", "2 0.516 \n", "3 0.519 \n", "4 0.525 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "short_results_df.round(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having found this best performing model, we tried it out by deploying it on the test set. The classification report is shown below, along with a precision recall curve and a reciever operating characteristic curve" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
precisionrecallf1-scoresupport
00.8769840.6092350.7189911451.000000
10.3955220.7494950.517795495.000000
20.6449130.6449130.6449130.644913
30.6362530.6793650.6183931946.000000
40.7545160.6449130.6678131946.000000
\n", "
" ], "text/plain": [ " precision recall f1-score support\n", "0 0.876984 0.609235 0.718991 1451.000000\n", "1 0.395522 0.749495 0.517795 495.000000\n", "2 0.644913 0.644913 0.644913 0.644913\n", "3 0.636253 0.679365 0.618393 1946.000000\n", "4 0.754516 0.644913 0.667813 1946.000000" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classification_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{figure} pr_curve.png\n", "---\n", "height: 500px\n", "name: pr_curve\n", "---\n", "Precision Recall Curve for the best model\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{figure} roc_curve.png\n", "---\n", "height: 500px\n", "name: roc_curve\n", "---\n", "RoC Curve for the best model\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For both classes (0 = character survives, 1 = character dies), the model's recall is comparable, show in {numref}`Figure {number} ` and {numref}`Figure {number} `. However for class 1 the model's precision is much lower, indicating that it is a bit too quick to predict the demise of a character. This could perhaps be improved in future by altering the class weights." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall the model's accuracy is fairly unimpressive, correctly predicting the fate of a character in only about half of all cases. While this might seem a bit disappointing, it did not particularly surprise or discourage us: George R. R. Martin is a celebrated author and master story teller, and the fact that we can't easily predict a whether a character will survive based on their attributes is a testament to the quality of his writing rather than the inadequacy of our model. Of course, this prediction tool is largely humerous and unlikely to make much of an impact beyond this project, it is nevertheless interesting to have a quanititative assessment of a plotline's predictability!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{bibliography}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 4 }