Research talk:Reading time/Work log/2018-10-14
Sunday, October 14, 2018
Model Selection
I iterated a bit on the work I did on model selection earlier this week.
I fixed a bug in the KS tests. Now they make more sense and they agree with the AIC. Also note that the KS tests frequently accept the null hypothesis that the data was generated by a lognormal distribution or an exponentiated Weibull distribution. This gives a pretty good statistical rationale for using these models. However, note that according to this answer on Cross Validated, if the parameters of the model were fit on the data then the assumptions of the KS test are violated. I fit the models on a 75% "train set" and ran the tests on a 25% "test set", but at this point I'm not sure whether that means the assumptions of the KS test are satisfied. I need to do some more reading or formal analysis to convince myself. The Cross Validated post uses bootstrapping and I could do that too, but it would take a big chunk of time. At this point I'm satisfied that the exponentiated Weibull and lognormal distributions are good enough fits for the data in a large majority of cases. I would be comfortable assuming these parametric models even if they are not perfect fits.
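For reference, the parametric-bootstrap variant of the KS test that the Cross Validated post describes could be sketched roughly as follows. This is a minimal illustration, not code I ran; the helper name and the bootstrap settings are my own, and the idea is simply to refit on each simulated sample so the null distribution of the statistic accounts for parameter estimation:

```python
import numpy as np
from scipy import stats

def bootstrap_ks_pvalue(data, dist, n_boot=500, seed=0):
    """Parametric-bootstrap KS test (Lilliefors-style): simulate from the
    fitted distribution, refit on each simulated sample, and compare the
    observed KS statistic to the bootstrap distribution of statistics."""
    rng = np.random.default_rng(seed)
    params = dist.fit(data, floc=0)
    observed = stats.kstest(data, dist.name, args=params).statistic
    boot_stats = np.empty(n_boot)
    for i in range(n_boot):
        sim = dist.rvs(*params, size=len(data), random_state=rng)
        sim_params = dist.fit(sim, floc=0)  # refit: this is the key step
        boot_stats[i] = stats.kstest(sim, dist.name, args=sim_params).statistic
    # p-value: fraction of bootstrap statistics at least as extreme as observed
    return (boot_stats >= observed).mean()
```

This would replace the train/test split entirely, at the cost of `n_boot` refits per wiki per model, which is why it takes a big chunk of time.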
The code is below.
| model | AIC_rank mean | AIC_rank median | BIC_rank mean | BIC_rank median | ks_rank mean | ks_rank median | ks_pvalue mean | ks_pvalue median |
|---|---|---|---|---|---|---|---|---|
| expon | 4.896694 | 5.0 | 4.888430 | 5.0 | 4.946281 | 5.0 | 0.016276 | 0.000000 |
| exponweib | 1.471074 | 1.0 | 1.607438 | 2.0 | 1.491736 | 1.0 | 0.305993 | 0.222638 |
| gamma | 4.020661 | 4.0 | 4.012397 | 4.0 | 3.896694 | 4.0 | 0.025147 | 0.000000 |
| lognorm | 1.623967 | 2.0 | 1.504132 | 1.0 | 1.685950 | 2.0 | 0.263722 | 0.150254 |
| weibull_min | 2.987603 | 3.0 | 2.987603 | 3.0 | 2.979339 | 3.0 | 0.054931 | 0.000008 |
| model | ks_pass95 mean | ks_pass97_5 mean |
|---|---|---|
| expon | 0.033058 | 0.033058 |
| exponweib | 0.747934 | 0.814050 |
| gamma | 0.082645 | 0.107438 |
| lognorm | 0.681818 | 0.719008 |
| weibull_min | 0.194215 | 0.239669 |
import itertools as it
import numpy as np
import pandas as pd
from scipy import stats

def select_model(dt_all):
    time_threshold = np.inf
    models = [stats.weibull_min, stats.exponweib, stats.lognorm, stats.gamma, stats.expon]
    # number of free parameters per model (location fixed at 0)
    dfs = {mod.name: df for mod, df in zip(models, [2, 3, 2, 2, 1])}
    res = {}
    for wiki in set(dt_all.wiki):
        dtwiki = dt_all.loc[(dt_all.wiki == wiki) &
                            (~dt_all.visiblelength.isnull()) &
                            (dt_all.visiblelength > 0) &
                            (dt_all.visiblelength <= time_threshold), ["visiblelength"]]
        # random 75/25 train/test split
        trainassign = np.random.binomial(1, 0.75, len(dtwiki))
        dtwiki_train = dtwiki.iloc[trainassign == 1]
        dtwiki_test = dtwiki.iloc[trainassign == 0]
        # fit on the train set, run the KS tests on the held-out test set
        fit_models = [m.fit(dtwiki_train['visiblelength'], floc=0) for m in models]
        ks_tests = [stats.kstest(dtwiki_test['visiblelength'], mod.name, args=fit)
                    for mod, fit in zip(models, fit_models)]
        logliks = []
        AICs = []
        BICs = []
        for mod, fit in zip(models, fit_models):
            loglik = mod.logpdf(dtwiki_test['visiblelength'], *fit).sum()
            logliks.append(loglik)
            AICs.append(2 * dfs[mod.name] - 2 * loglik)
            BICs.append(np.log(len(dtwiki_test)) * dfs[mod.name] - 2 * loglik)
        logliks = np.array(logliks)
        lrs = []
        # pairwise likelihood-ratio tests: chi-square with df based on the
        # difference in the number of parameters
        lrtests = {}
        for pair in it.combinations(zip(models, logliks), 2):
            df = dfs[pair[0][0].name] - dfs[pair[1][0].name]
            if df < 0:
                df = -df
                a = pair[1]
                b = pair[0]
            else:
                a = pair[0]
                b = pair[1]
            # a is the model with at least as many parameters as b
            k = "{0}:{1}".format(a[0].name, b[0].name)
            # if lr > 1 then a is more likely than b
            lr = np.exp(a[1] - b[1])
            lrs.append((k, lr))
            # p-value under the null that the simpler model b fits as well as a
            pvalue = stats.chi2.sf(2 * (a[1] - b[1]), df=1 + df)
            lrtests[k] = pvalue
        res[wiki] = [pd.DataFrame({"model": [mod.name for mod in models],
                                   "params": fit_models,
                                   "ks": [ks.statistic for ks in ks_tests],
                                   "ks_pvalue": [ks.pvalue for ks in ks_tests],
                                   "loglik": logliks,
                                   "AIC": AICs,
                                   "BIC": BICs}),
                     lrtests]
    return res

for r in res.keys():
    res[r][0]['wiki'] = r
table = pd.concat([res[r][0] for r in res.keys()]).reset_index()
table = table[['wiki', 'AIC', 'model', 'params', 'ks', 'ks_pvalue', 'loglik', 'BIC']].reset_index()
# within-wiki ranks of each criterion (lower is better)
gb = table.groupby('wiki')[['AIC', 'BIC', 'ks']].rank().reset_index()
gb = gb.rename(columns={'AIC': 'AIC_rank', 'BIC': 'BIC_rank', 'ks': 'ks_rank'})
table = table.merge(gb, on='index')
table.groupby('model')[['AIC_rank', 'BIC_rank', 'ks_rank']].agg(['mean', 'median'])
table['ks_pass95'] = table['ks_pvalue'] >= 0.05
table['ks_pass97_5'] = table['ks_pvalue'] >= 0.025
table.groupby('model')[['AIC_rank', 'BIC_rank', 'ks_rank', 'ks_pvalue']].agg(['mean', 'median'])
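As a sanity check of the held-out AIC comparison (not part of the analysis above, and with made-up numbers), the same fit-on-train, score-on-test pattern can be exercised on synthetic data where the true distribution is known:

```python
import numpy as np
from scipy import stats

# synthetic "reading times" drawn from a lognormal, so lognorm should
# come out ahead of the misspecified alternatives on held-out AIC
data = stats.lognorm.rvs(1.0, scale=30, size=3000, random_state=0)
train, test = data[:2000], data[2000:]

models = [stats.lognorm, stats.gamma, stats.expon]
n_params = {'lognorm': 2, 'gamma': 2, 'expon': 1}  # with floc=0

aics = {}
for m in models:
    fit = m.fit(train, floc=0)
    loglik = m.logpdf(test, *fit).sum()
    aics[m.name] = 2 * n_params[m.name] - 2 * loglik

best = min(aics, key=aics.get)
print(best)
```

On data this skewed the lognormal model should recover the lowest AIC, mirroring what the per-wiki rank tables show for the real data.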
As a final step in the model selection process, I made a series of plots that visualize the distribution fits. For example, here is the plot for English Wikipedia.