Analysis of Hosted Game Data

In [1]:
# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
In [2]:
data = pd.read_csv('data.csv', sep='\t')
In [3]:
data.head()
Out[3]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free?
0 Popcorn, Soda… Murder? Pauzle Mystery NaN 23000 65 4.1 NaN NaN 1229.0 100k+ 3.6 3/4/2010 $1.99 Yup
1 The Nightmare Maze Alex Livingston Horror Puzzle 5200 29 4.0 NaN NaN 1069.0 50k+ 3.5 6/17/2010 $0.99 Yup
2 What Happened Last Night? Kie Brooks Mystery Humor 4900 134 3.9 NaN NaN 753.0 50k+ 3.1 7/19/2010 $0.99 Yup
3 Paranoia Kie Brooks Puzzle NaN 5800 180 3.9 NaN NaN 1139.0 50k+ 3.4 4/16/2011 $0.99 Yup
4 Imprisoned Myth Thrazz Fantasy Puzzle 40000 358 4.2 NaN NaN 2211.0 100k+ 3.6 4/21/2011 $0.99 Yup

1. How many games have been released each year?

In [4]:
import datetime
In [5]:
# just converting dates to a standard format
release_dates = pd.to_datetime(data['Release Date'])
In [6]:
data['Release Date'] = release_dates
In [7]:
date_group = data.groupby([data['Release Date'].dt.year]).count()
In [8]:
date_group
Out[8]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free?
Release Date
2010 3 3 3 2 3 3 3 0 0 3 3 3 3 3 3
2011 10 10 10 8 10 10 10 1 1 10 10 10 10 10 10
2012 3 3 3 2 3 3 3 0 0 3 3 3 3 3 3
2013 5 5 5 3 5 5 5 1 1 5 5 5 5 5 5
2014 13 13 13 7 13 13 13 3 3 11 11 11 13 13 13
2015 17 17 17 11 17 17 17 5 5 17 17 17 17 17 17
2016 22 22 22 15 22 22 22 10 10 22 22 22 22 22 22
2017 16 16 16 12 16 16 16 7 6 16 16 16 16 16 16
2018 25 25 25 15 25 25 25 15 14 25 25 25 25 25 25
2019 21 21 21 10 21 21 21 10 10 21 21 21 21 21 21
2020 21 21 21 17 21 21 21 9 9 20 21 20 21 21 21
2021 5 5 5 4 5 5 5 2 2 5 5 5 5 5 5
In [9]:
plt.plot(date_group['Release Date'].index, date_group['HG Story Title'], marker='x')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.title('Stories released per year')
plt.grid(True)
plt.xticks(date_group['Release Date'].index, rotation=45)
plt.show()

Overall, the trend is increasing until a peak in 2018. Data for 2021 is incomplete.

2. How many games have been released in each genre?

In [10]:
genre_group = data[['Genre', 'HG Story Title']].groupby('Genre').count()
In [11]:
genre_group
Out[11]:
HG Story Title
Genre
Adventure 1
Crime 6
Fantasy 46
Historical 7
Horror 6
Humor 3
Mystery 11
Post-apocalyptic 6
Puzzle 6
Romance 5
School 1
Sci-Fi 16
Slice of Life 6
Spy 1
Steampunk 4
Superhero 9
Supernatural 20
War 7
In [12]:
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
In [13]:
fig = go.Figure(data=[go.Pie(labels=genre_group.index, values=genre_group['HG Story Title'])])
iplot(fig)
In [14]:
plt.figure(figsize=(10, 8))
plt.barh(genre_group.index, genre_group['HG Story Title'])
plt.xlabel('Count')
plt.ylabel('Genre')
plt.title('Stories released per genre')
plt.grid(True, axis='x')
plt.show()

Fantasy is the most common genre, followed by Supernatural and Sci-Fi.

How do we measure popularity?

Maybe we can define "popularity" as the number of omnibus ratings plus the number of Google reviews, since the Google Play installs data is very coarse-grained.

As a sanity check, let's plot the number of omnibus ratings against the number of Google reviews, to see if they correlate.

In [15]:
data['Popularity'] = data['# of Omnibus Ratings'] + data['# of Google Reviews']
In [16]:
px.scatter(data, x='# of Omnibus Ratings', y='# of Google Reviews', hover_data=['HG Story Title'], trendline='ols')

This doesn't look so good but let's just calculate a correlation...

In [17]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

It turns out that some of the games don't have Google review data. Which are those games?

In [18]:
data[data['# of Google Reviews'].isna()]
Out[18]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free? Popularity
27 Path of Light Ivailo Daskalov Fantasy NaN 100000 14 4.4 NaN NaN NaN NaN NaN 2014-08-15 $0.99 Nope NaN
33 Silent Gear Lee Yuan Sci-Fi War 40000 169 4.2 NaN NaN NaN NaN NaN 2014-12-29 $1.99 Yup NaN
143 Journey into Darkness Jonathan Clark Historical Puzzle 110000 10 3.6 NaN NaN NaN 1k+ NaN 2020-05-21 $3.99 Nope NaN

Alright, let's just remove those games from the data.

In [19]:
data_clean = data[~data['# of Google Reviews'].isna()]
In [20]:
results = sm.OLS(data_clean['# of Omnibus Ratings'], sm.add_constant(data_clean['# of Google Reviews'])).fit()
In [21]:
print(results.summary())
                             OLS Regression Results                             
================================================================================
Dep. Variable:     # of Omnibus Ratings   R-squared:                       0.437
Model:                              OLS   Adj. R-squared:                  0.433
Method:                   Least Squares   F-statistic:                     121.1
Date:                  Tue, 08 Jun 2021   Prob (F-statistic):           3.31e-21
Time:                          00:54:06   Log-Likelihood:                -1400.9
No. Observations:                   158   AIC:                             2806.
Df Residuals:                       156   BIC:                             2812.
Df Model:                             1                                         
Covariance Type:              nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                 531.4866    145.277      3.658      0.000     244.522     818.451
# of Google Reviews     0.5258      0.048     11.004      0.000       0.431       0.620
==============================================================================
Omnibus:                      141.753   Durbin-Watson:                   1.813
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1762.216
Skew:                           3.350   Prob(JB):                         0.00
Kurtosis:                      17.926   Cond. No.                     3.22e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.22e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

That's a lot of numbers! Here I am attempting to discern the relationship between the number of omnibus ratings and the number of Google reviews.

An important number here is R-squared. Basically, this shows the strength of the relationship between the two variables. It's 0.437, which is pretty decent. So there definitely is a relationship between the two popularity metrics. But I'm guessing the outlier kind of skews the results... What is the outlier, anyway? That seems to be by far the most popular game published by Hosted Games.
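As a quick cross-check (for a simple regression with an intercept, $R^2$ is just the squared Pearson correlation), something like this should reproduce that number:

# squared Pearson correlation between the two popularity metrics
r = data_clean['# of Omnibus Ratings'].corr(data_clean['# of Google Reviews'])
print(r ** 2)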

In [22]:
data_popular = data.sort_values('Popularity', ascending=False)
In [23]:
data_popular.head(10)
Out[23]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free? Popularity
74 The Great Tournament Philip Kempton Fantasy Historical 190000 13831 4.7 21.0 90% 32649.0 500k+ 4.7 2017-03-03 $4.99 Yup 46480.0
113 Life of a Mercenary Philip Kempton Fantasy Historical 340000 13260 4.6 6.0 100% 10060.0 100k+ 4.4 2018-12-18 $3.99 Yup 23320.0
92 Wayhaven Chronicles: Book One Mishka Jenkins Supernatural Romance 440000 11807 4.8 238.0 95% 5909.0 100k+ 4.7 2018-03-08 $4.99 Nope 17716.0
141 Wayhaven Chronicles: Book Two* Mishka Jenkins Supernatural Romance 790000 12458 4.9 140.0 99% 2319.0 10k+ 4.9 2020-05-07 $6.99 Nope 14777.0
127 Hero or Villain: Genesis Adrao Superhero NaN 330000 8984 4.6 25.0 68% 2380.0 100k+ 4.2 2019-09-19 $3.99 Nope 11364.0
71 Zombie Exodus: Safe Haven Jim Dattilo Supernatural Post-apocalyptic 1100000 5187 4.8 129.0 89% 4176.0 100k+ 4.4 2016-10-28 $4.99 Nope 9363.0
8 Zombie Exodus Jim Dattilo Supernatural Post-apocalyptic 700000 1463 4.7 138.0 89% 7191.0 100k+ 4.3 2011-12-04 $1.99 Nope 8654.0
134 The War for the West Lucas Zaper Fantasy NaN 490000 4740 4.8 43.0 81% 2458.0 50k+ 4.5 2019-11-14 $6.99 Nope 7198.0
133 Breach: The Archangel Job Michael Maxwell and S. Ben Luigi Crime NaN 820000 4972 4.9 53.0 94% 1792.0 10k+ 4.6 2019-11-14 $6.99 Nope 6764.0
42 Samurai of Hyuga Devon Connell Fantasy Historical 140000 2344 4.8 41.0 82% 3719.0 100k+ 4.5 2015-07-17 $3.99 Nope 6063.0

The outlier that we saw earlier was The Great Tournament, which is the most popular Hosted Game.

By most metrics, Philip Kempton's The Great Tournament and Life of a Mercenary are the two most popular Hosted Games. Congrats I guess.

What is the distribution of popularity across games?

Okay, now let's plot a histogram of popularity:

In [131]:
px.histogram(data, x='Popularity', nbins=50)

This is a very long-tailed distribution: most games have fewer than 1000 total ratings, while a handful of games are wildly popular.
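A quick way to put numbers on that claim, using the Popularity column built above:

# quantiles of total ratings (omnibus + Google); the upper tail should dwarf the median
print(data['Popularity'].quantile([0.25, 0.5, 0.75, 0.9, 0.99]).round())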

4. What factors relate to the popularity of a game?

a) How does word count relate to popularity?

Let's do another linear regression!

In [24]:
results = sm.OLS(data_clean['Popularity'], sm.add_constant(data_clean['Word Count'])).fit()
In [25]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.083
Model:                            OLS   Adj. R-squared:                  0.077
Method:                 Least Squares   F-statistic:                     14.03
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           0.000252
Time:                        00:54:06   Log-Likelihood:                -1553.6
No. Observations:                 158   AIC:                             3111.
Df Residuals:                     156   BIC:                             3117.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        762.4819    497.399      1.533      0.127    -220.025    1744.989
Word Count     0.0065      0.002      3.746      0.000       0.003       0.010
==============================================================================
Omnibus:                      248.175   Durbin-Watson:                   1.913
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            25216.973
Skew:                           6.873   Prob(JB):                         0.00
Kurtosis:                      63.344   Cond. No.                     3.92e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.92e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [26]:
px.scatter(data_clean, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols')

There's definitely a positive relationship between word count and popularity, but the R^2 between word count and popularity is about 0.08, which is... very weak.

What happens if we remove The Great Tournament?

In [27]:
data_no_outliers = data_clean[data_clean['HG Story Title'] != 'The Great Tournament']
In [28]:
results = sm.OLS(data_no_outliers['Popularity'], sm.add_constant(data_no_outliers['Word Count'])).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.193
Model:                            OLS   Adj. R-squared:                  0.188
Method:                 Least Squares   F-statistic:                     37.05
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           8.70e-09
Time:                        00:54:07   Log-Likelihood:                -1468.5
No. Observations:                 157   AIC:                             2941.
Df Residuals:                     155   BIC:                             2947.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        471.8026    308.456      1.530      0.128    -137.518    1081.123
Word Count     0.0066      0.001      6.087      0.000       0.004       0.009
==============================================================================
Omnibus:                      160.082   Durbin-Watson:                   1.945
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3694.022
Skew:                           3.717   Prob(JB):                         0.00
Kurtosis:                      25.571   Cond. No.                     3.92e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.92e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Now, the R^2 improves to 0.193, which is... better?

The regression equation is $Popularity = 0.0066*WordCount + 471.8$

This means that, for example, the expected popularity of a 500k-word story would be about 3800.
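Plugging the same hypothetical 500,000-word story into the fitted model gives that back-of-the-envelope number directly:

# predicted popularity for a hypothetical 500k-word story, using the no-outlier fit above
wc = 500_000
print(results.params['const'] + results.params['Word Count'] * wc)  # roughly 3800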

In [29]:
px.scatter(data_no_outliers, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols')
In [30]:
px.scatter(data_no_outliers, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols', log_x=True, log_y=True)

How does release date relate to popularity?

We might expect earlier released games to be more popular, just because they've had more time to acquire readers. But we might also expect later released games to be more popular, because of greater visibility for HG in more recent years, and the more recent release of the omnibus app. Or maybe there is no relation at all. Which hypothesis is true?

In [31]:
plt.scatter(data['Release Date'], data['Popularity'])
Out[31]:
<matplotlib.collections.PathCollection at 0x7f8866ebfb38>
In [32]:
# okay that wasn't very helpful... but what is the outlier? (It's The Great Tournament)
px.scatter(data, x='Release Date', y='Popularity', hover_data=['HG Story Title'], trendline='ols')
In [126]:
# what about within the omnibus only?
px.scatter(data_no_outliers, x='Release Date', y='# of Omnibus Ratings', hover_data=['HG Story Title'], trendline='ols')
In [125]:
# what about within GPS only? and we remove The Great Tournament?
px.scatter(data_no_outliers, x='Release Date', y='# of Google Reviews', hover_data=['HG Story Title'], trendline='ols')

It doesn't look like there's much of a relationship between popularity and release date.

How does genre (and subgenre) relate to popularity?

Let's plot the average popularity per genre:

In [33]:
genre_pop_group = data_clean[['Genre', 'Popularity']].groupby('Genre').mean()
In [34]:
px.bar(genre_pop_group, x=genre_pop_group.index, y='Popularity')
In [35]:
#Let's do a box plot with error bars...
px.box(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'])
In [36]:
# log popularity by genre
games_plot = px.box(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'], log_y=True, points='all')
games_plot.show()
In [37]:
games_plot.write_html('./log_popularity_by_genre.html')
In [38]:
px.strip(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'], log_y=True)
In [39]:
#Let's do a box plot with The Great Tournament removed...
px.box(data_no_outliers, x='Genre', y='Popularity', hover_data=['HG Story Title'])

Supernatural, Fantasy, and Superhero are the most popular genres on average. They are also the most popular genres in total, so writers' interests and readers' interests seem to overlap for the most part. Interestingly, it seems that Superhero has the highest floor of popularity: the least popular Superhero game has a popularity of 419, which is the highest minimum popularity of any genre!
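To back up that "highest floor" observation, we can look at the minimum popularity within each genre (a quick check against the cleaned data):

# minimum popularity per genre; Superhero should have the highest floor
print(data_clean.groupby('Genre')['Popularity'].min().sort_values(ascending=False))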

Can we create a super-model that predicts popularity using Word Count, Genre, and Free?

We'll introduce dummy variables for all of the genres, as well as "Free?". Let's also throw in a post-2018 release indicator and author_count (the number of other games by the same author).

In [151]:
# 1. Create categorical variables to represent genre inclusions
genre_cols = []
genre_cols_quoted = []
for genre in set(data['Genre']):
    data['is_' + genre] = [int(x) for x in (data['Genre'] == genre)]
    genre_cols_quoted.append('"is_' + genre + '"')
    genre_cols.append('is_'+genre)
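Not run here, but pandas can build the same dummy columns in one line; get_dummies with prefix='is' produces the same 'is_<Genre>' names as the loop above:

# equivalent, more idiomatic construction of the genre dummies (sketch, not executed above)
genre_dummies = pd.get_dummies(data['Genre'], prefix='is').astype(int)
genre_cols = list(genre_dummies.columns)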
In [152]:
# 2. identify author release count
known_authors = data.groupby('Author').count()
known_authors_dict = {r[0] : r[1]['HG Story Title'] for r in known_authors.iterrows()}
In [153]:
data['author_count'] = [known_authors_dict[a] - 1 for a in data['Author']]
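An equivalent one-liner (a sketch, not what was actually run) uses groupby().transform:

# number of *other* games by the same author
data['author_count'] = data.groupby('Author')['HG Story Title'].transform('count') - 1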
In [154]:
data['is_free'] = [int(x) for x in (data['Free?'] == 'Yup')]
data['year'] = data['Release Date'].dt.year
data['post_2018'] = [int(x) for x in (data['Release Date'].dt.year >= 2018)]
In [155]:
data_clean = data[~data['# of Google Reviews'].isna()]
data_clean_X = data_clean[['Word Count', 'is_free', 'author_count', 'post_2018'] + genre_cols]

data_no_outliers = data_clean[data_clean['HG Story Title'] != 'The Great Tournament']
data_noo_X = data_no_outliers[['Word Count', 'is_free', 'author_count', 'post_2018'] + genre_cols]
In [158]:
results = sm.OLS(data_clean['Popularity'], sm.add_constant(data_clean_X)).fit()
In [159]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.247
Model:                            OLS   Adj. R-squared:                  0.130
Method:                 Least Squares   F-statistic:                     2.121
Date:                Tue, 08 Jun 2021   Prob (F-statistic):            0.00540
Time:                        02:14:35   Log-Likelihood:                -1538.0
No. Observations:                 158   AIC:                             3120.
Df Residuals:                     136   BIC:                             3187.
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const               -1886.5615    837.457     -2.253      0.026   -3542.683    -230.440
Word Count              0.0080      0.002      4.232      0.000       0.004       0.012
is_free              3861.3661   1007.857      3.831      0.000    1868.268    5854.464
author_count          348.8697    219.929      1.586      0.115     -86.053     783.793
post_2018            1152.2250    835.233      1.380      0.170    -499.498    2803.948
is_Fantasy           1502.1626    932.633      1.611      0.110    -342.176    3346.501
is_Romance           -282.4591   1996.862     -0.141      0.888   -4231.376    3666.458
is_War                382.1840   1668.938      0.229      0.819   -2918.242    3682.610
is_Steampunk          127.9120   2171.633      0.059      0.953   -4166.625    4422.449
is_Humor             -714.4625   2531.328     -0.282      0.778   -5720.317    4291.392
is_Mystery           -990.4184   1399.662     -0.708      0.480   -3758.335    1777.498
is_Horror           -1505.7639   1805.048     -0.834      0.406   -5075.356    2063.828
is_Puzzle           -1254.6319   1830.997     -0.685      0.494   -4875.540    2366.276
is_Adventure        -2224.1754   4264.422     -0.522      0.603   -1.07e+04    6208.978
is_School            1441.3312   4241.130      0.340      0.734   -6945.761    9828.423
is_Sci-Fi           -1428.5891   1224.141     -1.167      0.245   -3849.402     992.224
is_Crime              -98.9062   1800.703     -0.055      0.956   -3659.905    3462.093
is_Post-apocalyptic   417.9971   1835.512      0.228      0.820   -3211.840    4047.834
is_Supernatural      2252.9464   1090.297      2.066      0.041      96.818    4409.075
is_Slice of Life     -637.6427   1799.353     -0.354      0.724   -4195.973    2920.688
is_Historical       -1802.6053   1837.868     -0.981      0.328   -5437.101    1831.891
is_Spy                563.4471   4230.343      0.133      0.894   -7802.313    8929.207
is_Superhero         2365.1127   1514.226      1.562      0.121    -629.360    5359.586
==============================================================================
Omnibus:                      238.803   Durbin-Watson:                   1.963
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            22807.802
Skew:                           6.406   Prob(JB):                         0.00
Kurtosis:                      60.449   Cond. No.                     5.08e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.96e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Now, what the heck does this mean?

  • The 'coef' field for each genre basically represents how much popularity (in terms of Google + omnibus ratings) you gain/lose by using that genre. Note that the error ranges (the last two columns) are very large, due to the small sample size.

If we interpret the linear regression coefficients very literally...

  • Every additional word gets a game between 0.004 and 0.012 additional ratings.
  • If a game is free, it'll gain between ~1900 and ~5900 ratings.
  • If a game is in the fantasy genre, the point estimate is a gain of roughly 1500 ratings, but the interval (-342 to 3347) crosses zero, so it isn't significant here (and it's likely inflated by The Great Tournament and Life of a Mercenary, which I did not remove for this regression).
  • If a game is in the supernatural genre, it'll gain between ~100 and ~4400 ratings (this is barely significant at p<0.05, and would not be "significant" if an FDR correction were applied).
  • If a game is in the superhero genre, it'll gain between -629 and 5360 ratings (this is "almost significant" in p-hacker terminology).

None of the other conclusions are significant. Based on this analysis, we cannot conclusively say that, for example, puzzle or adventure or horror games will have a penalty to popularity, simply because there are too few games in those categories. Similarly, we can't say that school or post-apocalyptic games will have a bonus to popularity.
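For reference, the confidence intervals quoted above can be read straight off the fitted model rather than eyeballed from the summary table:

# 95% confidence intervals for each coefficient of the full model
ci = results.conf_int()
ci.columns = ['2.5%', '97.5%']
print(ci.round(1))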

Now, what if we remove The Great Tournament?

In [160]:
results = sm.OLS(data_no_outliers['Popularity'], sm.add_constant(data_noo_X)).fit()
In [161]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.387
Model:                            OLS   Adj. R-squared:                  0.292
Method:                 Least Squares   F-statistic:                     4.062
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           2.90e-07
Time:                        02:14:45   Log-Likelihood:                -1446.9
No. Observations:                 157   AIC:                             2938.
Df Residuals:                     135   BIC:                             3005.
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const               -1339.3364    499.983     -2.679      0.008   -2328.149    -350.524
Word Count              0.0071      0.001      6.363      0.000       0.005       0.009
is_free              2199.7936    609.451      3.609      0.000     994.487    3405.100
author_count          219.8997    131.242      1.676      0.096     -39.656     479.455
post_2018            1283.5531    497.521      2.580      0.011     299.610    2267.496
is_Fantasy            739.9794    557.562      1.327      0.187    -362.707    1842.666
is_Romance           -723.2791   1189.628     -0.608      0.544   -3075.997    1629.439
is_War                309.0342    994.003      0.311      0.756   -1656.799    2274.867
is_Steampunk         -187.3208   1293.545     -0.145      0.885   -2745.554    2370.913
is_Humor             -805.6486   1507.629     -0.534      0.594   -3787.276    2175.979
is_Mystery           -661.3874    833.878     -0.793      0.429   -2310.541     987.766
is_Horror           -1014.9813   1075.509     -0.944      0.347   -3142.006    1112.044
is_Puzzle            -613.4644   1091.271     -0.562      0.575   -2771.663    1544.734
is_Adventure        -1076.8678   2540.864     -0.424      0.672   -6101.915    3948.180
is_School             950.1384   2526.141      0.376      0.707   -4045.791    5946.068
is_Sci-Fi           -1064.7284    729.444     -1.460      0.147   -2507.345     377.888
is_Crime              -26.1618   1072.480     -0.024      0.981   -2147.196    2094.873
is_Post-apocalyptic    95.7625   1093.393      0.088      0.930   -2066.632    2258.157
is_Supernatural      2305.0228    649.372      3.550      0.001    1020.765    3589.280
is_Slice of Life     -191.1609   1072.040     -0.178      0.859   -2311.327    1929.005
is_Historical       -1761.7643   1094.608     -1.609      0.110   -3926.562     403.033
is_Spy                347.4984   2519.562      0.138      0.891   -4635.419    5330.416
is_Superhero         2039.9928    902.084      2.261      0.025     255.947    3824.038
==============================================================================
Omnibus:                      141.459   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2402.857
Skew:                           3.192   Prob(JB):                         0.00
Kurtosis:                      21.071   Cond. No.                     2.92e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.5e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

These results actually look quite a bit better. Here's a summary:

  • Every additional word gets a game between 0.005 and 0.009 additional ratings.
  • If a game is free, it'll gain between ~990 and ~3400 ratings.
  • If a game was released in 2018 or later, it'll gain between ~300 and ~2270 ratings.
  • If a game is in the supernatural genre, it'll gain between ~1020 and ~3590 ratings.
  • If a game is in the superhero genre, it'll gain between ~260 and ~3820 ratings.

These are all the "significant" conclusions here. The fantasy coefficient is still positive (~740), but it's no longer significant once The Great Tournament is removed. According to the regression coefficients, puzzle, adventure, sci-fi, horror, historical, and steampunk all have penalties, while war, school, and post-apocalyptic all have slight bonuses. But none of those are "significant", because the sample size is very, very small.

We can look at the residuals, which are the true value minus the predicted value. A positive residual indicates that a game was more popular than predicted, while a negative residual indicates that a game was less popular than predicted.

In [162]:
predictions = results.predict(sm.add_constant(data_noo_X))
In [163]:
px.scatter(data_no_outliers, x='Popularity', y=predictions, hover_data=['HG Story Title'], labels={'y': 'Predicted popularity'})
In [164]:
px.scatter(data_no_outliers, x='Popularity', y=results.resid, hover_data=['HG Story Title'], labels={'y': 'Residual'})
In [165]:
data_no_outliers['resid'] = results.resid
more_popular = data_no_outliers.sort_values('resid', ascending=False)
more_popular.head(10)
Out[165]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews ... is_Supernatural is_Slice of Life is_Historical is_Spy is_Superhero is_free year author_count post_2018 resid
113 Life of a Mercenary Philip Kempton Fantasy Historical 340000 13260 4.6 6.0 100% 10060.0 ... 0 0 0 0 0 1 2018 4 1 17130.421251
92 Wayhaven Chronicles: Book One Mishka Jenkins Supernatural Romance 440000 11807 4.8 238.0 95% 5909.0 ... 1 0 0 0 0 0 2018 1 1 12107.344089
141 Wayhaven Chronicles: Book Two* Mishka Jenkins Supernatural Romance 790000 12458 4.9 140.0 99% 2319.0 ... 1 0 0 0 0 0 2020 1 1 6671.001188
127 Hero or Villain: Genesis Adrao Superhero NaN 330000 8984 4.6 25.0 68% 2380.0 ... 0 0 0 0 1 0 2019 3 1 6365.453864
60 Doomsday on Demand Norbert Mohos Post-apocalyptic NaN 120000 417 4.6 7.0 43% 5395.0 ... 0 0 0 0 0 0 2016 1 0 5979.442328
42 Samurai of Hyuga Devon Connell Fantasy Historical 140000 2344 4.8 41.0 82% 3719.0 ... 0 0 0 0 0 0 2015 4 0 4783.821011
73 Evertree Inn Thom Baylay Fantasy Mystery 270000 3563 4.8 28.0 75% 1898.0 ... 0 0 0 0 0 0 2017 2 0 3694.035914
134 The War for the West Lucas Zaper Fantasy NaN 490000 4740 4.8 43.0 81% 2458.0 ... 0 0 0 0 0 0 2019 0 1 3017.523862
128 The Parenting Simulator Matt Simpson Slice of Life Humor 190000 2995 4.7 28.0 93% 1015.0 ... 0 1 0 0 0 0 2019 1 1 2681.344144
8 Zombie Exodus Jim Dattilo Supernatural Post-apocalyptic 700000 1463 4.7 138.0 89% 7191.0 ... 1 0 0 0 0 0 2011 1 0 2473.728144

10 rows × 39 columns

The games that are more popular than expected are, in order (not including The Great Tournament): Life of a Mercenary, Wayhaven Chronicles: Book One, Wayhaven Chronicles: Book Two, Hero or Villain: Genesis, Doomsday on Demand, Samurai of Hyuga, Evertree Inn, The War for the West, The Parenting Simulator, and Zombie Exodus.

In [166]:
less_popular = data_no_outliers.sort_values('resid', ascending=True)
less_popular.head(10)
Out[166]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews ... is_Supernatural is_Slice of Life is_Historical is_Spy is_Superhero is_free year author_count post_2018 resid
22 Tin Star Allen Gies Historical NaN 1400000 847 4.9 263.0 96% 1188.0 ... 0 0 1 0 0 0 2014 3 0 -5512.970040
54 Magikiras Gabriel Cha Sci-Fi Mech 1100000 204 4.2 9.0 44% 660.0 ... 0 0 0 0 0 0 2016 0 0 -4580.727125
59 Gambling With Eternity Ashlee Sierra Supernatural Horror 53000 158 4.0 NaN NaN 165.0 ... 1 0 0 0 0 1 2016 0 0 -3220.649023
70 Twin Flames Ivailo Daskalov Fantasy NaN 46000 60 4.3 NaN NaN 111.0 ... 0 0 0 0 0 1 2016 6 0 -3077.057054
68 Elemental Saga: The Awakening Mandar Deshmukh Supernatural School 100000 273 4.2 NaN NaN 716.0 ... 1 0 0 0 0 1 2016 0 0 -2890.006498
36 Seven Bullets Cloud Buchholz Crime NaN 290000 205 4.4 NaN NaN 85.0 ... 0 0 0 0 0 1 2015 1 0 -2833.422039
20 Burn(t) Vivi Tran Supernatural Post-apocalyptic 71000 378 3.4 NaN NaN 561.0 ... 1 0 0 0 0 1 2013 0 0 -2733.083801
132 The Oldest Dream Onurhan Tunyadin Fantasy NaN 36000 374 4.3 NaN NaN 64.0 ... 0 0 0 0 0 1 2019 0 1 -2702.859238
158 Keeper of the Day and Night Brynn Chernosky Supernatural School 390000 2499 4.8 16.0 100% 265.0 ... 1 0 0 0 0 0 2021 1 1 -2487.892639
149 A Sorcerer’s Story Chris Viola Supernatural Fantasy 55000 126 3.8 NaN NaN 48.0 ... 1 0 0 0 0 0 2020 0 1 -2467.679015

10 rows × 39 columns

The games that are less popular than expected are, in order: Tin Star, Magikiras, Gambling With Eternity, Twin Flames, Elemental Saga: The Awakening, Seven Bullets, Burn(t), The Oldest Dream, Keeper of the Day and Night, and A Sorcerer's Story.

Do the results improve if we use sub-genres?

What if we added data for secondary genres as well as primary genres?

In [167]:
# 1. Create categorical variables to represent genre inclusions
data_subgenres = data.copy()
genre_cols = []
genre_cols_quoted = []
for genre in set(data['Genre']):
    data_subgenres['is_' + genre] = [int(x) for x in ((data['Genre'] == genre) | (data['Subgenre (if applicable)'] == genre))]
    genre_cols_quoted.append('"is_' + genre + '"')
    genre_cols.append('is_'+genre)
In [176]:
data_sg_clean = data_subgenres[~data_subgenres['# of Google Reviews'].isna()]
data_sg_clean_X = data_sg_clean[['Word Count', 'is_free'] + genre_cols]

data_sg_no_outliers = data_sg_clean[data_sg_clean['HG Story Title'] != 'The Great Tournament']
data_sg_noo_X = data_sg_no_outliers[['Word Count', 'is_free', 'author_count', 'post_2018'] + genre_cols]
In [177]:
results = sm.OLS(data_sg_no_outliers['Popularity'], sm.add_constant(data_sg_noo_X)).fit()
In [178]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.388
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                     3.867
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           5.49e-07
Time:                        02:19:52   Log-Likelihood:                -1446.7
No. Observations:                 157   AIC:                             2939.
Df Residuals:                     134   BIC:                             3010.
Df Model:                          22                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const               -2123.8722   1011.813     -2.099      0.038   -4125.062    -122.683
Word Count              0.0066      0.001      5.910      0.000       0.004       0.009
is_free              2224.7200    617.008      3.606      0.000    1004.385    3445.055
author_count          276.3651    128.411      2.152      0.033      22.392     530.339
post_2018             691.2199    488.604      1.415      0.159    -275.154    1657.594
is_Fantasy           1001.5810    722.618      1.386      0.168    -427.631    2430.793
is_Romance           3305.5412   1146.493      2.883      0.005    1037.977    5573.105
is_War                472.9144   1125.549      0.420      0.675   -1753.225    2699.053
is_Steampunk          585.2100   1278.911      0.458      0.648   -1944.253    3114.673
is_Humor              250.2483    990.683      0.253      0.801   -1709.151    2209.647
is_Mystery            430.8027    856.996      0.503      0.616   -1264.186    2125.791
is_Horror            -531.8813    994.898     -0.535      0.594   -2499.617    1435.854
is_Puzzle            -207.5382    931.052     -0.223      0.824   -2048.997    1633.920
is_Adventure         -294.2165   2819.537     -0.104      0.917   -5870.770    5282.337
is_School             115.8046    914.547      0.127      0.899   -1693.010    1924.619
is_Sci-Fi            -255.9177    928.121     -0.276      0.783   -2091.580    1579.744
is_Crime              615.3941   1156.570      0.532      0.596   -1672.100    2902.889
is_Post-apocalyptic  1369.0400   1110.181      1.233      0.220    -826.705    3564.785
is_Supernatural      2130.7975    720.365      2.958      0.004     706.040    3555.555
is_Slice of Life      741.6874   1344.121      0.552      0.582   -1916.750    3400.125
is_Historical         948.6244    745.294      1.273      0.205    -525.436    2422.685
is_Spy                639.5689   2762.170      0.232      0.817   -4823.523    6102.661
is_Superhero         3164.3899   1117.977      2.830      0.005     953.226    5375.553
==============================================================================
Omnibus:                      130.654   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2101.129
Skew:                           2.847   Prob(JB):                         0.00
Kurtosis:                      19.994   Cond. No.                     4.18e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.18e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Well, the $R^2$ is only marginally higher here, at 0.388 vs 0.387 for the primary-genre-only model, so the sub-genres add little overall explanatory power, though they shuffle the individual genre coefficients around quite a bit. The coefficients for Word Count and is_free are pretty similar to before.

The interesting thing is that it totally changes the directions of some of the coefficients for the genres:

  • Romance is now the best genre, giving between 1038 and 5573 additional ratings (it was negative when only considering the primary genre; I'm blaming Wayhaven).
  • The bonus from Fantasy is no longer significant (the range is -428 to 2431).
  • Superhero and Supernatural still give significant bonuses.

In order of highest to lowest predicted popularity gain, the genres are:

  1. Romance*
  2. Superhero*
  3. Supernatural*
  4. Post-apocalyptic
  5. Fantasy
  6. Historical
  7. Slice of Life
  8. Spy
  9. Crime
  10. Steampunk
  11. War
  12. Mystery
  13. Humor
  14. School

and these genres are predicted to cause popularity loss:

  1. Puzzle
  2. Sci-Fi
  3. Adventure
  4. Horror

(* means that the effect of that genre is statistically significant at p<0.05.)
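That ranking is just the sub-genre model's coefficients sorted; something like this reproduces it from the fitted results object:

# genre coefficients from the sub-genre regression, largest to smallest
genre_coefs = results.params[[c for c in results.params.index
                              if c.startswith('is_') and c != 'is_free']]
print(genre_coefs.sort_values(ascending=False).round(1))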

Use log-word count instead of word count?

What if we tried to do a regression on log(Word Count) instead?

In [228]:
import numpy as np
data_no_outliers['log_WC'] = np.log2(data_no_outliers['Word Count'])
In [229]:
data_noo_X = data_no_outliers[['log_WC', 'is_free', 'post_2018', 'author_count'] + genre_cols]
In [230]:
results_loglog = sm.OLS(np.log2(data_no_outliers['Popularity']), sm.add_constant(data_noo_X)).fit()
In [231]:
print(results_loglog.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.463
Model:                            OLS   Adj. R-squared:                  0.380
Method:                 Least Squares   F-statistic:                     5.552
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           1.91e-10
Time:                        02:48:25   Log-Likelihood:                -294.77
No. Observations:                 157   AIC:                             633.5
Df Residuals:                     135   BIC:                             700.8
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -2.7596      1.906     -1.448      0.150      -6.528       1.009
log_WC                  0.6476      0.117      5.519      0.000       0.416       0.880
is_free                 1.9323      0.420      4.598      0.000       1.101       2.763
post_2018               0.0324      0.325      0.100      0.921      -0.611       0.676
author_count            0.0900      0.086      1.050      0.296      -0.079       0.259
is_Fantasy              0.9776      0.407      2.399      0.018       0.172       1.783
is_Romance              0.6800      0.783      0.869      0.386      -0.868       2.228
is_War                  0.8983      0.662      1.357      0.177      -0.411       2.208
is_Steampunk           -0.4826      0.856     -0.564      0.574      -2.175       1.210
is_Humor               -1.7536      0.979     -1.791      0.075      -3.690       0.182
is_Mystery             -0.4509      0.540     -0.834      0.406      -1.520       0.618
is_Horror              -1.1725      0.704     -1.665      0.098      -2.565       0.220
is_Puzzle              -1.2666      0.710     -1.783      0.077      -2.671       0.138
is_Adventure           -2.9651      1.654     -1.793      0.075      -6.236       0.306
is_School              -1.0521      1.640     -0.642      0.522      -4.295       2.191
is_Sci-Fi              -0.6080      0.497     -1.223      0.223      -1.591       0.375
is_Crime                0.6075      0.716      0.848      0.398      -0.809       2.024
is_Post-apocalyptic     0.1113      0.719      0.155      0.877      -1.311       1.534
is_Supernatural         1.6888      0.448      3.767      0.000       0.802       2.575
is_Slice of Life        0.6362      0.710      0.896      0.372      -0.768       2.041
is_Historical           0.7948      0.714      1.113      0.268      -0.617       2.207
is_Spy                 -1.6934      1.638     -1.034      0.303      -4.934       1.547
is_Superhero            2.2907      0.605      3.783      0.000       1.093       3.488
==============================================================================
Omnibus:                        0.391   Durbin-Watson:                   1.795
Prob(Omnibus):                  0.823   Jarque-Bera (JB):                0.536
Skew:                           0.091   Prob(JB):                        0.765
Kurtosis:                       2.779   Cond. No.                     1.71e+17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.58e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [232]:
predictions = results_loglog.predict(sm.add_constant(data_noo_X))
In [233]:
predictions = 2**predictions
In [234]:
px.scatter(data_no_outliers, x='Popularity', y=predictions, hover_data=['HG Story Title'], labels={'y': 'Predictions'})

Is the log regression better than the linear regression? Hard to say. The errors are just as large, but at least there aren't any negative predictions.
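One rough way to compare the two fits is mean absolute error on the original scale (a sketch, assuming the objects from the cells above are still in scope; the linear model is refit here so both error series line up):

# MAE of the linear model vs the back-transformed log-log model
lin_X = sm.add_constant(data_no_outliers[['Word Count', 'is_free', 'post_2018', 'author_count'] + genre_cols])
linear_fit = sm.OLS(data_no_outliers['Popularity'], lin_X).fit()
mae_linear = (data_no_outliers['Popularity'] - linear_fit.fittedvalues).abs().mean()
mae_log = (data_no_outliers['Popularity'] - predictions).abs().mean()
print(round(mae_linear), round(mae_log))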

5. What is the relationship between the ratings on different platforms?

Let's just try plotting the Google ratings vs the omnibus ratings.

In [63]:
px.scatter(data_clean, x='GPS Score', y='Omnibus Rating', trendline='ols', hover_data=['HG Story Title'],
           title = 'Omnibus vs Google ratings')

Overall, the Omnibus rating and GPS rating correlate very well, with an $R^2$ of 0.7 in the above line.

The regression equation is $OmnibusRating = 0.477*GPS + 2.565$.
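The same slope, intercept, and $R^2$ can be pulled out of a plain statsmodels fit (a sketch; missing='drop' guards against any stray NaNs):

# reproduce the plotly trendline: Omnibus Rating as a function of GPS Score
rating_fit = sm.OLS(data_clean['Omnibus Rating'],
                    sm.add_constant(data_clean['GPS Score']), missing='drop').fit()
print(rating_fit.params)    # should match the ~2.565 intercept and ~0.477 slope above
print(rating_fit.rsquared)  # should be around 0.7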

How do GPS and Omnibus ratings correlate with popularity?

Let's plot GPS Score vs GPS ratings count, and Omnibus Rating vs Omnibus rating count.

In [64]:
px.scatter(data_clean, x='# of Google Reviews', y='GPS Score', trendline='lowess', hover_data=['HG Story Title'])
In [65]:
px.scatter(data_clean, x='# of Omnibus Ratings', y='Omnibus Rating', trendline='lowess', hover_data=['HG Story Title'])

From this, it's pretty clear that more popular games tend to have higher ratings on both platforms. Not going to try to do any regressions here.
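If we wanted a single number instead of a regression, a rank correlation sidesteps the skew in the review counts (a sketch):

# Spearman rank correlations between popularity and rating on each platform
print(data_clean['# of Google Reviews'].corr(data_clean['GPS Score'], method='spearman'))
print(data_clean['# of Omnibus Ratings'].corr(data_clean['Omnibus Rating'], method='spearman'))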

What is the distribution of ratings on different platforms?

In [132]:
px.histogram(data_clean, x='Omnibus Rating')
In [179]:
data_clean['Omnibus Rating'].mean()
Out[179]:
4.375949367088608
In [135]:
px.histogram(data_clean, x='GPS Score')
In [189]:
data_clean['GPS Score'].mean()
Out[189]:
3.798101265822784

6. What is the relationship between genre, word count, and rating?

Okay, we're going to do some more regressions. There's probably a better way to do this than OLS because ratings are in a pretty narrow range.

Let's consider the data without sub-genres first:

In [66]:
data_noo_X = data_no_outliers[['Word Count', 'is_free'] + genre_cols]
results = sm.OLS(data_no_outliers['Omnibus Rating'], sm.add_constant(data_noo_X)).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Omnibus Rating   R-squared:                       0.434
Model:                            OLS   Adj. R-squared:                  0.355
Method:                 Least Squares   F-statistic:                     5.522
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           7.03e-10
Time:                        00:54:10   Log-Likelihood:                -25.505
No. Observations:                 157   AIC:                             91.01
Df Residuals:                     137   BIC:                             152.1
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   4.0306      0.047     85.805      0.000       3.938       4.123
Word Count           5.611e-07    1.3e-07      4.327      0.000    3.05e-07    8.18e-07
is_free                -0.2379      0.067     -3.525      0.001      -0.371      -0.104
is_Fantasy              0.4101      0.058      7.118      0.000       0.296       0.524
is_Romance              0.1402      0.135      1.039      0.301      -0.127       0.407
is_War                  0.3616      0.115      3.140      0.002       0.134       0.589
is_Steampunk            0.0799      0.150      0.534      0.594      -0.216       0.376
is_Humor               -0.2408      0.171     -1.408      0.161      -0.579       0.097
is_Mystery              0.1149      0.096      1.200      0.232      -0.074       0.304
is_Horror               0.2575      0.125      2.067      0.041       0.011       0.504
is_Puzzle              -0.0626      0.126     -0.495      0.621      -0.313       0.187
is_Adventure            0.4848      0.295      1.645      0.102      -0.098       1.068
is_School              -0.0687      0.291     -0.236      0.814      -0.645       0.508
is_Sci-Fi               0.2538      0.084      3.035      0.003       0.088       0.419
is_Crime                0.2981      0.124      2.397      0.018       0.052       0.544
is_Post-apocalyptic     0.3047      0.124      2.449      0.016       0.059       0.551
is_Supernatural         0.3247      0.075      4.305      0.000       0.176       0.474
is_Slice of Life        0.3561      0.124      2.861      0.005       0.110       0.602
is_Historical           0.1961      0.126      1.551      0.123      -0.054       0.446
is_Spy                  0.4195      0.291      1.440      0.152      -0.157       0.995
is_Superhero            0.4006      0.104      3.848      0.000       0.195       0.606
==============================================================================
Omnibus:                       25.888   Durbin-Watson:                   2.126
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               35.264
Skew:                          -0.939   Prob(JB):                     2.20e-08
Kurtosis:                       4.364   Cond. No.                     2.92e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.49e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Interpretation:

  • The R^2 is 0.434, which is not bad.
  • The "baseline" rating is about a 4.
  • Higher word counts give higher ratings.
  • Free games have a lowered rating by about 0.24 stars.

Genres sorted by rating bonuses:

Adventure, Spy, Fantasy*, Superhero*, War*, Slice of Life*, Supernatural*, Post-apocalyptic*, Crime*, Horror*, Sci-Fi*, Historical, Romance, Mystery, Steampunk

And here are the predicted "negative" genres: Puzzle, School, Humor (none of them significant).

Again, a * indicates statistical significance at p<0.05.

Interestingly, Adventure and Spy were among the worst-performing genres in terms of popularity, yet they seem to give rating bonuses in the omnibus (each of those genres has only a single game, though).

Now, let's use the secondary genres:

In [67]:
data_sg_noo_X = data_sg_no_outliers[['Word Count', 'is_free'] + genre_cols]
results = sm.OLS(data_sg_no_outliers['Omnibus Rating'], sm.add_constant(data_sg_noo_X)).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Omnibus Rating   R-squared:                       0.376
Model:                            OLS   Adj. R-squared:                  0.285
Method:                 Least Squares   F-statistic:                     4.102
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           3.54e-07
Time:                        00:54:10   Log-Likelihood:                -33.084
No. Observations:                 157   AIC:                             108.2
Df Residuals:                     136   BIC:                             172.3
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   4.2571      0.109     39.146      0.000       4.042       4.472
Word Count           6.616e-07   1.35e-07      4.914      0.000    3.95e-07    9.28e-07
is_free                -0.2504      0.071     -3.516      0.001      -0.391      -0.110
is_Fantasy              0.1271      0.088      1.443      0.151      -0.047       0.301
is_Romance             -0.0131      0.136     -0.096      0.924      -0.283       0.257
is_War                  0.0312      0.135      0.232      0.817      -0.235       0.297
is_Steampunk           -0.0722      0.155     -0.466      0.642      -0.379       0.234
is_Humor               -0.0542      0.120     -0.453      0.651      -0.291       0.183
is_Mystery             -0.0696      0.104     -0.670      0.504      -0.275       0.136
is_Horror               0.1363      0.120      1.132      0.259      -0.102       0.374
is_Puzzle              -0.1788      0.111     -1.610      0.110      -0.399       0.041
is_Adventure            0.2668      0.342      0.781      0.436      -0.409       0.942
is_School               0.1719      0.110      1.557      0.122      -0.046       0.390
is_Sci-Fi              -0.0273      0.112     -0.243      0.808      -0.250       0.195
is_Crime               -0.0708      0.139     -0.510      0.611      -0.345       0.204
is_Post-apocalyptic    -0.0370      0.135     -0.274      0.784      -0.304       0.230
is_Supernatural        -0.0390      0.086     -0.451      0.653      -0.210       0.132
is_Slice of Life        0.1392      0.162      0.858      0.392      -0.182       0.460
is_Historical           0.0254      0.091      0.280      0.780      -0.154       0.205
is_Spy                  0.2536      0.335      0.756      0.451      -0.410       0.917
is_Superhero            0.1045      0.135      0.774      0.440      -0.163       0.371
==============================================================================
Omnibus:                       24.766   Durbin-Watson:                   2.171
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.231
Skew:                          -0.979   Prob(JB):                     1.65e-07
Kurtosis:                       3.968   Cond. No.                     4.10e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

The R^2 here is slightly worse than only using the primary genre.

The trends with regard to genre are a lot less clear than when only using the primary genre: none of the genre coefficients are significant at p<0.05.

In [68]:
genre_ratings_plot = px.box(data_clean, x='Genre', y='Omnibus Rating', hover_data=['HG Story Title'], points='all')
genre_ratings_plot.show()
In [69]:
genre_ratings_plot.write_html('genre_ratings_plot.html')

What are the most underrated games?

How are we defining "underrated" here? Let's just say, the games whose ratings are higher than expected given their popularity. I think we should do log-popularity because it looks kinda more linear.

Also, given the low rating counts for some games, we should really be using Bayesian averaging to shrink noisy ratings toward a prior... we don't have enough information to calibrate that prior properly, so let's just pick one somewhat arbitrarily.
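Concretely, the shrunken rating computed in the next cell is $BA = \frac{C \cdot m + n \cdot r}{C + n}$, where $r$ is a game's average rating, $n$ its number of ratings, $m$ the prior mean, and $C$ the prior's weight in pseudo-ratings.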

In [217]:
# compute Bayesian-average ratings
# prior mean m = the overall average rating; prior weight C = 20 pseudo-ratings (both chosen arbitrarily)
C = 20
m = data_clean['Omnibus Rating'].mean()
ba_omnibus_ratings = (C*m + data_clean['Omnibus Rating']*data_clean['# of Omnibus Ratings'])/(C + data_clean['# of Omnibus Ratings'])
C = 20
m = data_clean['GPS Score'].mean()
ba_google_ratings = (C*m + data_clean['GPS Score']*data_clean['# of Google Reviews'])/(C + data_clean['# of Google Reviews'])
In [218]:
data_clean['ba_omnibus_rating'] = ba_omnibus_ratings
data_clean['ba_google_rating'] = ba_google_ratings
data_no_outliers = data_clean[data_clean['HG Story Title'] != 'The Great Tournament']
In [219]:
px.scatter(data_clean, x='# of Omnibus Ratings', y='ba_omnibus_rating', trendline='lowess', hover_data=['HG Story Title'], log_x=True)
In [220]:
px.scatter(data_clean, x='# of Google Reviews', y='ba_google_rating', trendline='lowess', hover_data=['HG Story Title'], log_x=True)

Let's do a linear regression of the omnibus and GPS ratings as a function of log(review count).

In [221]:
omnibus_results = sm.OLS(data_no_outliers['ba_omnibus_rating'], sm.add_constant(np.log2(data_no_outliers['# of Omnibus Ratings']))).fit()
print(omnibus_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      ba_omnibus_rating   R-squared:                       0.309
Model:                            OLS   Adj. R-squared:                  0.304
Method:                 Least Squares   F-statistic:                     69.29
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           4.16e-14
Time:                        02:44:38   Log-Likelihood:                -7.6741
No. Observations:                 157   AIC:                             19.35
Df Residuals:                     155   BIC:                             25.46
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    3.8763      0.067     57.789      0.000       3.744       4.009
# of Omnibus Ratings     0.0672      0.008      8.324      0.000       0.051       0.083
==============================================================================
Omnibus:                       27.853   Durbin-Watson:                   1.880
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               36.926
Skew:                          -1.045   Prob(JB):                     9.59e-09
Kurtosis:                       4.129   Cond. No.                         27.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [222]:
gps_results = sm.OLS(data_no_outliers['ba_google_rating'], sm.add_constant(np.log2(data_no_outliers['# of Google Reviews']))).fit()
print(gps_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:       ba_google_rating   R-squared:                       0.306
Model:                            OLS   Adj. R-squared:                  0.301
Method:                 Least Squares   F-statistic:                     68.30
Date:                Tue, 08 Jun 2021   Prob (F-statistic):           5.87e-14
Time:                        02:44:39   Log-Likelihood:                -107.94
No. Observations:                 157   AIC:                             219.9
Df Residuals:                     155   BIC:                             226.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   2.6112      0.153     17.038      0.000       2.308       2.914
# of Google Reviews     0.1508      0.018      8.264      0.000       0.115       0.187
==============================================================================
Omnibus:                       16.205   Durbin-Watson:                   1.876
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               18.111
Skew:                          -0.744   Prob(JB):                     0.000117
Kurtosis:                       3.746   Cond. No.                         33.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [223]:
omnibus_predictions = omnibus_results.predict(sm.add_constant(np.log2(data_no_outliers['# of Omnibus Ratings'])))
data_no_outliers['omnibus_rating_predictions'] = omnibus_predictions
data_no_outliers['omnibus_resid'] = omnibus_results.resid
In [224]:
gps_predictions = gps_results.predict(sm.add_constant(np.log2(data_no_outliers['# of Google Reviews'])))
data_no_outliers['gps_rating_predictions'] = gps_predictions
data_no_outliers['gps_resid'] = gps_results.resid
In [225]:
data_no_outliers.sort_values('omnibus_resid', ascending=False).head(10)
Out[225]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews ... is_free year author_count post_2018 ba_omnibus_rating ba_google_rating omnibus_rating_predictions omnibus_resid gps_rating_predictions gps_resid
145 Trees Don’t Tell Taylor Zane Mystery Horror 36000 9 4.8 NaN NaN 16.0 ... 0 2020 0 1 4.507551 3.843390 4.089295 0.418257 3.214424 0.628965
56 Guns of Infinity Paul Wang War Fantasy 440000 683 4.9 76.0 94% 647.0 ... 0 2016 1 0 4.885091 4.866960 4.509048 0.376043 4.019385 0.847575
83 The Dryad’s Riddle Avery Moore Puzzle Fantasy 71000 8 4.6 NaN NaN 33.0 ... 0 2017 0 0 4.439964 3.986076 4.077875 0.362089 3.371928 0.614148
85 Lost in the Pages Various Supernatural Puzzle 130000 35 4.7 NaN NaN 115.0 ... 0 2017 1 0 4.582163 3.884904 4.220974 0.361189 3.643549 0.241355
22 Tin Star Allen Gies Historical NaN 1400000 847 4.9 263.0 96% 1188.0 ... 0 2014 3 0 4.887911 4.783412 4.529913 0.357998 4.151598 0.631814
47 A Study in Steampunk: Choice by Gaslight Heather Albano Steampunk Mystery 280000 356 4.8 151.0 96% 772.0 ... 0 2015 0 0 4.777444 4.774700 4.445874 0.331570 4.057816 0.716884
86 Highlands, Deep Waters Fernando Neves and Lucas Zaper Horror Mystery 350000 109 4.7 20.0 80% 458.0 ... 0 2017 0 0 4.649760 4.374816 4.331117 0.318642 3.944219 0.430597
98 A Mummy is Not an Antique Randy Condon Humor Supernatural 54000 3 3.8 NaN NaN 15.0 ... 0 2018 0 1 4.300826 3.456058 3.982776 0.318049 3.200383 0.255675
48 Captive of Fortune Samuel Harrison Young Fantasy NaN 130000 35 4.6 3.0 0% 47.0 ... 0 2015 4 0 4.518527 3.729284 4.220974 0.297553 3.448870 0.280414
82 Diamant Rose Teo Kuusela Spy Mystery 89000 15 4.5 NaN NaN 68.0 ... 0 2017 2 0 4.429114 4.031387 4.138823 0.290291 3.529232 0.502155

10 rows × 44 columns

In [226]:
data_no_outliers.sort_values('gps_resid', ascending=False).head(10)
Out[226]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews ... is_free year author_count post_2018 ba_omnibus_rating ba_google_rating omnibus_rating_predictions omnibus_resid gps_rating_predictions gps_resid
56 Guns of Infinity Paul Wang War Fantasy 440000 683 4.9 76.0 94% 647.0 ... 0 2016 1 0 4.885091 4.866960 4.509048 0.376043 4.019385 0.847575
158 Keeper of the Day and Night Brynn Chernosky Supernatural School 390000 2499 4.8 16.0 100% 265.0 ... 0 2021 1 1 4.796633 4.636709 4.634815 0.161818 3.825177 0.811532
47 A Study in Steampunk: Choice by Gaslight Heather Albano Steampunk Mystery 280000 356 4.8 151.0 96% 772.0 ... 0 2015 0 0 4.777444 4.774700 4.445874 0.331570 4.057816 0.716884
119 Sordwin Thom Baylay Fantasy Mystery 440000 3510 4.9 21.0 100% 848.0 ... 0 2019 2 1 4.897031 4.776915 4.667754 0.229277 4.078245 0.698670
111 The Butler Did It Daniel J. Elliot Steampunk Puzzle 300000 125 4.6 NaN NaN 96.0 ... 0 2018 0 1 4.569096 4.296224 4.344397 0.224699 3.604259 0.691965
17 Sabres of Infinity Paul Wang War Fantasy 200000 952 4.8 NaN NaN 924.0 ... 0 2013 1 0 4.791275 4.778773 4.541244 0.250031 4.096919 0.681854
87 Samurai of Hyuga Book 3 Devon Connell Fantasy Historical 230000 1849 4.8 25.0 76% 570.0 ... 0 2017 4 0 4.795462 4.669427 4.605608 0.189855 3.991816 0.677611
19 Way Walkers: University 2 J. Leigh Fantasy School 200000 675 4.7 NaN NaN 693.0 ... 0 2013 1 0 4.690675 4.674701 4.507905 0.182770 4.034328 0.640373
157 Relics of the Lost Age James Shaw Historical Fantasy 400000 603 4.8 9.0 78% 141.0 ... 0 2021 0 1 4.786387 4.325230 4.496969 0.289418 3.687896 0.637334
22 Tin Star Allen Gies Historical NaN 1400000 847 4.9 263.0 96% 1188.0 ... 0 2014 3 0 4.887911 4.783412 4.529913 0.357998 4.151598 0.631814

10 rows × 44 columns

In [ ]: