Predicting Character Deaths in Game of Thrones#

Authors: Thomas Jian, Ian MacCarthy, Arturo Rey, Sifan Zhang

Summary#

We build a prediction tool that predicts whether a given character from the Game of Thrones books will survive to the end of the series. To do this, we implement a logistic regression model on a data set containing character information. The model is not able to achieve prediction accuracy any better than 65%. This is likely due to an absence of strong patterns in the plot and cast of characters that would allow us to easily answer such a question.

Introduction#

The popular novel series Game of Thrones by George R. R. Martin takes place in a fictional world inspired by medeival Europe. Warfare and violence are central to the plot, leading to frequent casualties among the expansive cast of characters. Exactly which characters are fated to die before the end of the series has become a topic of heated speculation.

In this analysis we try to predict whether a character will come to grief before the end of the series based on that character’s attributes. We consider a data set created by Data Society available through data.world [Data Society, 2016], which comprises a comprehensive list of characters appearing in the series. For each character there is a number of attributes, such as whether the character is noble, whether they appeared in a given book in the series, whether a relation of theirs is already dead, and finally whether or not the character is still alive at the end.

Methods and Results#

A concern of ours having seen the data is that there are a number of columns which are filled almost completely with missing values. These represent information that is only known for a very small minority of characters, like whether their parents are alive and who their heir is. We decided to drop columns that were almost entirely missing values, like isAliveFather. In other cases like House, where the missing values represent characters who are not noble and thus do not belong to a house, we filled the missing values with a new category like ‘No house’. Also these columns contained a great number of unique categories; for simplicity’s sake we decided to keep only the most common categories and filled all the other values with ‘other’. Finally for age, which is unknown for about half of the characters, we filled missing values with the median age.

Feature Percent NaN
0 S.No 0.000000
1 plod 0.000000
2 name 0.000000
3 title 0.517986
4 male 0.000000
5 culture 0.652107
6 dateOfBirth 0.777492
7 DateoFdeath 0.771840
8 mother 0.989209
9 father 0.986639
10 heir 0.988181
11 house 0.219424
12 spouse 0.858171
13 book1 0.000000
14 book2 0.000000
15 book3 0.000000
16 book4 0.000000
17 book5 0.000000
18 isAliveMother 0.989209
19 isAliveFather 0.986639
20 isAliveHeir 0.988181
21 isAliveSpouse 0.858171
22 isMarried 0.000000
23 isNoble 0.000000
24 age 0.777492
25 numDeadRelations 0.000000
26 boolDeadRelations 0.000000
27 isPopular 0.000000
28 popularity 0.000000
29 isAlive 0.000000

As a rule of thumb, we usually drop features whose percentage of missing values are over 15/20/25%. This is done because imputing these values, or replacing them with the mean/median/mode, would create bias in our data. Another way to deal with this case, would be to drop all the columns where we have null values in. This is a viable approach when the number of null values are not a big part of the dataset. Considering that a couple of features have a great percentage of missing values (> 25%), we have dropped these features from our training and test set in order to not influence our model by adding bias by imputing values.

Among those features we picked for further analysis, some are easily understood by the feature names while some others might be hard to read and understand, so descriptions of these features are provided below.

Culture: Social group to which a character belongs
House: House to which a character belongs
book1,book2,book3,book4,book5: Whether or not the character is mentioned in the book
numDeadRelations: Number of dead characters to whom a character is related
Popularity: popularity score calculated from the number of internal incoming and outgoing links to the characters page in the http://awoiaf.westeros.org wiki
Target:whether or not the character is alive

Based on the selected features, we created a heatmap [Östblom, 2023] to illustrate the correlation of each feature with isAlive. None of the correlations, in Figure 1, are particularly strong, but they seem to make sense. There is a notable positive correlation with whether the character is alive in book 4, which makes sense given that a character has no chance of being alive at the end of the series if they aren’t still alive at book 4. There are also notable negative correlations with the character’s popularity and the number of dead realtions the character has, suggesting that popular characters whose families are already depleted tend to survive in the story.

_images/correlation_heatmap.png

Fig. 1 Correlation heatmap for all features in processed data#

Using this cleaned data set, we then constructed a column transformer to prepare the data for machine learning purposes. We tried a selection of different machine learning models: dummy, k-NN, SVM, and logistic regression [Kolhatkar, 2023]. We found that the logistic regression model performed the best based on f1 score, with test_f1 of 0.427, and so undertook randomized hyper-parameter optimization [Östblom, 2023] to fine-tune that model.

The results of a randomized grid search with cross validation is shown below.

param_logisticregression__C param_logisticregression__max_iter param_logisticregression__class_weight mean_test_score mean_fit_time mean_train_score
0 0.000 500 balanced 0.516 0.020 0.518
1 0.000 2000 balanced 0.516 0.020 0.518
2 0.048 100 balanced 0.502 0.038 0.516
3 0.886 2000 balanced 0.501 0.041 0.519
4 69.519 2000 balanced 0.498 0.042 0.525

Having found this best performing model, we tried it out by deploying it on the test set. The classification report is shown below, along with a precision recall curve and a reciever operating characteristic curve

precision recall f1-score support
0 0.876984 0.609235 0.718991 1451.000000
1 0.395522 0.749495 0.517795 495.000000
2 0.644913 0.644913 0.644913 0.644913
3 0.636253 0.679365 0.618393 1946.000000
4 0.754516 0.644913 0.667813 1946.000000
_images/pr_curve.png

Fig. 2 Precision Recall Curve for the best model#

_images/roc_curve.png

Fig. 3 RoC Curve for the best model#

For both classes (0 = character survives, 1 = character dies), the model’s recall is comparable, show in Figure 2 and Figure 3. However for class 1 the model’s precision is much lower, indicating that it is a bit too quick to predict the demise of a character. This could perhaps be improved in future by altering the class weights.

Discussion#

Overall the model’s accuracy is fairly unimpressive, correctly predicting the fate of a character in only about half of all cases. While this might seem a bit disappointing, it did not particularly surprise or discourage us: George R. R. Martin is a celebrated author and master story teller, and the fact that we can’t easily predict a whether a character will survive based on their attributes is a testament to the quality of his writing rather than the inadequacy of our model. Of course, this prediction tool is largely humerous and unlikely to make much of an impact beyond this project, it is nevertheless interesting to have a quanititative assessment of a plotline’s predictability!

References#