# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
# Silence library chatter (incl. pandas SettingWithCopyWarning) in notebook output.
warnings.filterwarnings('ignore')
# Load the tab-separated Hosted Games dataset and preview the first rows.
data = pd.read_csv('data.csv', sep='\t')
data.head()
import datetime
# Parse the release dates into proper datetimes, then count stories per release year.
data['Release Date'] = pd.to_datetime(data['Release Date'])
date_group = data.groupby([data['Release Date'].dt.year]).count()
date_group
# Line plot of story counts per release year.
years = date_group['Release Date'].index
counts = date_group['HG Story Title']
plt.plot(years, counts, marker='x')
plt.title('Stories released per year')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.grid(True)
plt.xticks(years, rotation=45)
plt.show()
Overall, the trend is increasing until a peak in 2018. Data for 2021 is incomplete.
# Count stories per genre (title column is the per-group row counter).
genre_group = data.groupby('Genre')[['HG Story Title']].count()
genre_group
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
# Pie chart of story counts per genre.
pie_trace = go.Pie(labels=genre_group.index, values=genre_group['HG Story Title'])
fig = go.Figure(data=[pie_trace])
iplot(fig)
# Horizontal bar chart of story counts per genre.
plt.figure(figsize=(10, 8))
plt.barh(genre_group.index, genre_group['HG Story Title'])
plt.title('Stories released per genre')
plt.ylabel('Genre')
plt.xlabel('Count')
plt.grid(True, axis='x')
plt.show()
Fantasy is the most common genre, followed by Supernatural and Sci-Fi.
# Composite popularity metric: total ratings across both platforms.
# NOTE(review): games missing Google review data get NaN Popularity here,
# since NaN propagates through the sum; they are filtered out further below.
data['Popularity'] = data['# of Omnibus Ratings'] + data['# of Google Reviews']
px.scatter(data, x='# of Omnibus Ratings', y='# of Google Reviews', hover_data=['HG Story Title'], trendline='ols')
This doesn't look so good but let's just calculate a correlation...
import statsmodels.api as sm
import statsmodels.formula.api as smf
It turns out that some of the games don't have Google review data. Which are those games?
# Show the games with no Google review data (isnull is an exact alias of isna).
data[data['# of Google Reviews'].isnull()]
Alright, let's just remove those games from the data.
# Drop games lacking Google review data, then regress the number of Omnibus
# ratings on the number of Google reviews (with an intercept).
data_clean = data[data['# of Google Reviews'].notna()]
ols_model = sm.OLS(data_clean['# of Omnibus Ratings'],
                   sm.add_constant(data_clean['# of Google Reviews']))
results = ols_model.fit()
print(results.summary())
That's a lot of numbers! Here I am attempting to discern the relationship between the number of Omnibus ratings and the number of Google reviews.
An important number here is R-squared. Basically, this shows the strength of the relationship between the two variables. It's 0.496, which is pretty decent. So there definitely is a relationship between the two popularity metrics. But I'm guessing the outlier kind of skews the results... What is the outlier, anyway? That seems to be by far the most popular game published by Hosted Games.
# Rank all games from most to least popular and show the top ten.
data_popular = data.sort_values(by='Popularity', ascending=False)
data_popular.head(10)
The outlier that we saw earlier was The Great Tournament, which is the most popular Hosted Game.
By most metrics, Philip Kempton's The Great Tournament and Life of a Mercenary are the two most popular Hosted Games. Congrats I guess.
Okay, now let's plot a histogram of popularity:
# Distribution of the composite popularity metric.
px.histogram(data, x='Popularity', nbins=50)
# Regress Popularity on Word Count (with an intercept), then visualize the fit.
wc_design = sm.add_constant(data_clean['Word Count'])
results = sm.OLS(data_clean['Popularity'], wc_design).fit()
print(results.summary())
px.scatter(data_clean, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols')
There's definitely a positive relationship between word count and popularity, but the R^2 between word count and popularity is about 0.08, which is... very weak.
What happens if we remove The Great Tournament?
# Remove the single massive outlier and refit Popularity ~ Word Count.
is_tgt = data_clean['HG Story Title'] == 'The Great Tournament'
data_no_outliers = data_clean[~is_tgt]
results = sm.OLS(data_no_outliers['Popularity'],
                 sm.add_constant(data_no_outliers['Word Count'])).fit()
print(results.summary())
Now, the R^2 improves to 0.193, which is... better?
The regression equation is $Popularity = 0.066*WordCount + 471.8$
This means that, for example, the expected popularity of a 500k-word story would be about 3800.
# Word count vs. popularity without the outlier: linear axes, then log-log axes.
px.scatter(data_no_outliers, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols')
px.scatter(data_no_outliers, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols', log_x=True, log_y=True)
We might expect earlier released games to be more popular, just because they've had more time to acquire readers. But we might also expect later released games to be more popular, because of greater visibility for HG in more recent years, and the more recent release of the omnibus app. Or maybe there is no relation at all. Which hypothesis is true?
# Quick matplotlib look at popularity vs. release date.
plt.scatter(data['Release Date'], data['Popularity'])
# okay that wasn't very helpful... but what is the outlier? (It's The Great Tournament)
px.scatter(data, x='Release Date', y='Popularity', hover_data=['HG Story Title'], trendline='ols')
# what about within the omnibus only? (data_no_outliers already excludes The Great Tournament)
px.scatter(data_no_outliers, x='Release Date', y='# of Omnibus Ratings', hover_data=['HG Story Title'], trendline='ols')
# what about within GPS only? (again on data_no_outliers, i.e. The Great Tournament removed)
px.scatter(data_no_outliers, x='Release Date', y='# of Google Reviews', hover_data=['HG Story Title'], trendline='ols')
It doesn't look like there's much of a relationship between popularity and release date.
Let's plot the average popularity per genre:
# Mean popularity per genre, as a bar chart.
genre_pop_group = data_clean[['Genre', 'Popularity']].groupby('Genre').mean()
px.bar(genre_pop_group, x=genre_pop_group.index, y='Popularity')
# Box plot of popularity by genre (quartiles + whiskers).
px.box(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'])
# Same box plot on a log popularity scale, with every individual game shown;
# also saved to an HTML file for sharing.
games_plot = px.box(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'], log_y=True, points='all')
games_plot.show()
games_plot.write_html('./log_popularity_by_genre.html')
px.strip(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'], log_y=True)
# Box plot by genre with The Great Tournament removed.
px.box(data_no_outliers, x='Genre', y='Popularity', hover_data=['HG Story Title'])
Supernatural, Fantasy, and Superhero are the most popular genres on average. They are also the most popular genres in total, so writers' interests and readers' interests seem to overlap for the most part. Interestingly, it seems that Superhero has the highest floor of popularity: the least popular Superhero game has a popularity of 419, which is the highest minimum popularity of any genre!
We'll introduce dummy variables for all of the genres, as well as "Free?". Let's also add Release Year and author_count as predictors.
# 1. Create categorical (one-hot) variables to represent genre inclusions
genre_cols = []
genre_cols_quoted = []  # quoted column names, kept for formula-style APIs
for genre in set(data['Genre']):
    # 1 if the story's primary genre matches, else 0
    data['is_' + genre] = (data['Genre'] == genre).astype(int)
    genre_cols_quoted.append('"is_' + genre + '"')
    genre_cols.append('is_' + genre)
# 2. identify author release count: number of OTHER stories by the same author
# (value_counts/map replaces the groupby->dict->list round trip; assumes every
# row has a non-null 'HG Story Title', which the original count relied on too)
author_counts = data['Author'].value_counts()
data['author_count'] = data['Author'].map(author_counts) - 1
# Binary indicators: free-to-play flag and post-2018 release era
data['is_free'] = (data['Free?'] == 'Yup').astype(int)
data['year'] = data['Release Date'].dt.year
data['post_2018'] = (data['Release Date'].dt.year >= 2018).astype(int)
# Rebuild the cleaned frames so they pick up the new columns
data_clean = data[~data['# of Google Reviews'].isna()]
data_clean_X = data_clean[['Word Count', 'is_free', 'author_count', 'post_2018'] + genre_cols]
data_no_outliers = data_clean[data_clean['HG Story Title'] != 'The Great Tournament']
data_noo_X = data_no_outliers[['Word Count', 'is_free', 'author_count', 'post_2018'] + genre_cols]
# OLS: Popularity ~ word count + free flag + author experience + era + genre dummies
results = sm.OLS(data_clean['Popularity'], sm.add_constant(data_clean_X)).fit()
print(results.summary())
Now, what the heck does this mean?
If we interpret the linear regression coeffients very literally...
THESE DESCRIPTIONS ARE OUTDATED AS THEY DO NOT INCLUDE THE POST_2018 OR AUTHOR_COUNT TERMS
None of the other conclusions are significant. Based on this analysis, we cannot conclusively say that, for example, puzzle or adventure or horror games will have a penalty to popularity, simply because there are too few games in those categories. Similarly, we can't say that school or post-apocalyptic games will have a bonus to popularity.
Now, what if we remove The Great Tournament?
# Refit the multivariate popularity model with The Great Tournament excluded.
results = sm.OLS(data_no_outliers['Popularity'], sm.add_constant(data_noo_X)).fit()
print(results.summary())
These results actually look quite a bit better. Here's a summary:
THESE DESCRIPTIONS ARE OUTDATED AS THEY DO NOT INCLUDE THE POST_2018 OR AUTHOR_COUNT TERMS
These are all the "significant" conclusions here. According to the regression coefficients, puzzle, adventure, sci-fi, horror, historical, and steampunk all have penalties, while war, school, and post-apocalyptic all have slight bonuses. But none of those are "significant", because the sample size is very, very small.
We can look at the residuals, which are the true value minus the predicted value. A positive residual indicates that a game was more popular than predicted, while a negative residual indicates that a game was less popular than predicted.
# Predicted vs. actual popularity, and the residuals, for the no-outlier model.
predictions = results.predict(sm.add_constant(data_noo_X))
px.scatter(data_no_outliers, x='Popularity', y=predictions, hover_data=['HG Story Title'], labels={'y': 'Predicted popularity'})
px.scatter(data_no_outliers, x='Popularity', y=results.resid, hover_data=['HG Story Title'], labels={'y': 'Residual'})
# NOTE(review): data_no_outliers is a slice of data_clean, so this column
# assignment can trigger SettingWithCopyWarning (suppressed at the top of the
# file); consider creating data_no_outliers with .copy().
data_no_outliers['resid'] = results.resid
# Highest residuals first: games more popular than the model predicts.
more_popular = data_no_outliers.sort_values('resid', ascending=False)
more_popular.head(10)
The games that are more popular than expected are, in order (not including The Great Tournament): Life of a Mercenary, Wayhaven 1, Wayhaven 2, Hero or Villain, Doomsday on Demand, Samurai of Hyuga, The Parenting Simulator, War for the West, The Aether, and Evertree Inn.
# Lowest residuals first: games less popular than the model predicts.
less_popular = data_no_outliers.sort_values(by='resid')
less_popular.head(10)
The games that are less popular than expected are, in order: Tin Star, Magikiras, Gambling with Eternity, Elemental Saga, Burn(t), Seven Bullets, Twin Flames, Lost in the Pages, The Aegis Saga, and Best of Us.
What if we added data for secondary genres as well as primary genres?
# 1. Create categorical variables for genre inclusion, this time counting a
#    match in EITHER the primary genre or the secondary genre column.
data_subgenres = data.copy()
genre_cols = []
genre_cols_quoted = []
for genre in set(data['Genre']):
    in_genre = (data['Genre'] == genre) | (data['Subgenre (if applicable)'] == genre)
    data_subgenres['is_' + genre] = in_genre.astype(int)
    genre_cols_quoted.append('"is_' + genre + '"')
    genre_cols.append('is_' + genre)
data_sg_clean = data_subgenres[~data_subgenres['# of Google Reviews'].isna()]
data_sg_clean_X = data_sg_clean[['Word Count', 'is_free'] + genre_cols]
# BUG FIX: the outlier mask was previously built from data_clean; use
# data_sg_clean's own title column so the mask is guaranteed to align with the
# frame it filters (the original only worked because the indexes happened to match).
data_sg_no_outliers = data_sg_clean[data_sg_clean['HG Story Title'] != 'The Great Tournament']
data_sg_noo_X = data_sg_no_outliers[['Word Count', 'is_free', 'author_count', 'post_2018'] + genre_cols]
# Refit the popularity model using the subgenre-aware dummies.
results = sm.OLS(data_sg_no_outliers['Popularity'], sm.add_constant(data_sg_noo_X)).fit()
print(results.summary())
Well, the $R^2$ is a bit higher here, at 0.363 vs 0.351 for the primary genre-only model. So that's good, indicating that the sub-genre provides some additional information. The coefficients for Word Count and is_free are pretty similar as before.
The interesting thing is that it totally changes the directions of some of the coefficients for the genres:
In order of highest to lowest predicted popularity gain, the genres are:
and these genres are predicted to cause popularity loss:
(* means that the effect of that genre is statistically significant at p<0.05.)
What if we tried to do a regression on log(Word Count) instead?
import numpy as np
# Log-log model: regress log2(Popularity) on log2(Word Count) plus the dummies.
# NOTE(review): column assignment on a slice — may warn; see earlier note.
data_no_outliers['log_WC'] = np.log2(data_no_outliers['Word Count'])
data_noo_X = data_no_outliers[['log_WC', 'is_free', 'post_2018', 'author_count'] + genre_cols]
results_loglog = sm.OLS(np.log2(data_no_outliers['Popularity']), sm.add_constant(data_noo_X)).fit()
print(results_loglog.summary())
predictions = results_loglog.predict(sm.add_constant(data_noo_X))
# Back-transform the predictions from log2 space to raw popularity.
predictions = 2**predictions
px.scatter(data_no_outliers, x='Popularity', y=predictions, hover_data=['HG Story Title'], labels={'y': 'Predictions'})
Is the log regression better than the linear regression? Hard to say. The errors are just as large, but at least there aren't any negative predictions.
Let's just try plotting the Google ratings vs the omnibus ratings.
# Omnibus rating vs. Google Play Store score, with an OLS trendline.
px.scatter(data_clean, x='GPS Score', y='Omnibus Rating', trendline='ols',
           hover_data=['HG Story Title'], title='Omnibus vs Google ratings')
Overall, the Omnibus rating and GPS rating correlate very well, with an $R^2$ of 0.7 in the above line.
The regression equation is $OmnibusRating = 0.477*GPS + 2.565$.
Let's plot GPS Score vs GPS ratings count, and Omnibus Rating vs Omnibus rating count.
# Rating vs. number of reviews on each platform, with LOWESS smoothing.
px.scatter(data_clean, x='# of Google Reviews', y='GPS Score', trendline='lowess', hover_data=['HG Story Title'])
px.scatter(data_clean, x='# of Omnibus Ratings', y='Omnibus Rating', trendline='lowess', hover_data=['HG Story Title'])
From this, it's pretty clear that more popular games tend to have higher ratings on both platforms. Not going to try to do any regressions here.
# Distributions and means of the ratings on each platform.
px.histogram(data_clean, x='Omnibus Rating')
data_clean['Omnibus Rating'].mean()
px.histogram(data_clean, x='GPS Score')
data_clean['GPS Score'].mean()
Okay, we're going to do some more regressions. There's probably a better way to do this than OLS because ratings are in a pretty narrow range.
Let's consider the data without sub-genres first:
# Rating model, primary genres only: Omnibus Rating ~ word count + free flag + genre dummies.
data_noo_X = data_no_outliers[['Word Count', 'is_free'] + genre_cols]
design = sm.add_constant(data_noo_X)
results = sm.OLS(data_no_outliers['Omnibus Rating'], design).fit()
print(results.summary())
Interpretation:
Genres sorted by rating bonuses:
Adventure Spy Fantasy Superhero War Supernatural Slice of Life Post-Apocalyptic Crime Horror Sci-Fi* Historical Romance Mystery Steampunk
And here are the predicted "negative" genres: Puzzle School Humor
Again, a * indicates statistical significance at p<0.05.
Interestingly, Adventure and Spy were some of the worst performing genres in terms of popularity. It's interesting that they seem to give ratings bonuses in the omnibus (but these are very small sample sizes).
Now, let's use the secondary genres:
# Rating model using the subgenre-aware dummies instead.
data_sg_noo_X = data_sg_no_outliers[['Word Count', 'is_free'] + genre_cols]
sg_design = sm.add_constant(data_sg_noo_X)
results = sm.OLS(data_sg_no_outliers['Omnibus Rating'], sg_design).fit()
print(results.summary())
The R^2 here is slightly worse than only using the primary genre.
The trends with regard to genre are... a lot more unclear vs only using the primary genre. None of the results are significant at p<0.05.
# Box plot of Omnibus ratings per genre, with every game shown; saved to HTML.
genre_ratings_plot = px.box(data_clean, x='Genre', y='Omnibus Rating', hover_data=['HG Story Title'], points='all')
genre_ratings_plot.show()
genre_ratings_plot.write_html('genre_ratings_plot.html')
How are we defining "underrated" here? Let's just say, the games whose ratings are higher than expected given their popularity. I think we should do log-popularity because it looks kinda more linear.
Also, given low rating counts, we should be using Bayesian averaging... but I don't think we have enough information to actually calculate a Bayesian average.
# compute bayesian average ratings: shrink each game's rating toward a prior,
# weighted as though the prior were backed by C pseudo-ratings.
# NOTE(review): the original comment claimed the prior was "arbitrarily set to 4",
# but the code actually uses the overall mean rating as the prior.
C = 20  # prior weight: equivalent number of pseudo-ratings
m = data_clean['Omnibus Rating'].mean()  # prior = overall mean Omnibus rating
ba_omnibus_ratings = (C*m + data_clean['Omnibus Rating']*data_clean['# of Omnibus Ratings'])/(C + data_clean['# of Omnibus Ratings'])
C = 20
m = data_clean['GPS Score'].mean()  # prior = overall mean GPS score
ba_google_ratings = (C*m + data_clean['GPS Score']*data_clean['# of Google Reviews'])/(C + data_clean['# of Google Reviews'])
data_clean['ba_omnibus_rating'] = ba_omnibus_ratings
data_clean['ba_google_rating'] = ba_google_ratings
# Rebuild the outlier-free frame so it carries the new Bayesian-average columns.
data_no_outliers = data_clean[data_clean['HG Story Title'] != 'The Great Tournament']
# Bayesian-average rating vs. rating count (log x axis) on each platform.
px.scatter(data_clean, x='# of Omnibus Ratings', y='ba_omnibus_rating', trendline='lowess', hover_data=['HG Story Title'], log_x=True)
px.scatter(data_clean, x='# of Google Reviews', y='ba_google_rating', trendline='lowess', hover_data=['HG Story Title'], log_x=True)
Let's do a linear regression of the omnibus and GPS ratings as a function of log(review count).
# Regress each platform's Bayesian-average rating on log2(review count).
omnibus_results = sm.OLS(data_no_outliers['ba_omnibus_rating'], sm.add_constant(np.log2(data_no_outliers['# of Omnibus Ratings']))).fit()
print(omnibus_results.summary())
gps_results = sm.OLS(data_no_outliers['ba_google_rating'], sm.add_constant(np.log2(data_no_outliers['# of Google Reviews']))).fit()
print(gps_results.summary())
# Store per-game predictions and residuals on the frame.
# NOTE(review): these column assignments are on a slice of data_clean and may
# warn (suppressed at the top of the file); consider .copy() at creation.
omnibus_predictions = omnibus_results.predict(sm.add_constant(np.log2(data_no_outliers['# of Omnibus Ratings'])))
data_no_outliers['omnibus_rating_predictions'] = omnibus_predictions
data_no_outliers['omnibus_resid'] = omnibus_results.resid
gps_predictions = gps_results.predict(sm.add_constant(np.log2(data_no_outliers['# of Google Reviews'])))
data_no_outliers['gps_rating_predictions'] = gps_predictions
data_no_outliers['gps_resid'] = gps_results.resid
# Most "underrated" games: rating higher than expected given review count.
data_no_outliers.sort_values('omnibus_resid', ascending=False).head(10)
data_no_outliers.sort_values('gps_resid', ascending=False).head(10)