Peter Ralph
1 December 2020 – Advanced Biological Statistics
If
then “\(X b\)” is shorthand for the \(n\)-vector \[ (Xb)_i = \sum_{j=1}^k X_{ij} b_j . \]
%*%
.*
.Let’s say we want to predict apple yield
using leaf_area
, tree_age
, and fertilizer
: \[\begin{aligned}
\text{yield}_i &= \beta_0 + \beta_1 \times \text{leaf_area}_i \\
&\qquad {} + \beta_2 \times \text{tree_age}_i
+ \beta_3 \times \text{fertilizer}_i + \epsilon_i.
\end{aligned}\]
If we set the columns of \(X\) to be
leaf_area
tree_age
fertilizer
and \(\beta = (\beta_0, \beta_1, \beta_2, \beta_3),\) then this is \[ \text{yield}_i = (X \beta)_i + \epsilon_i . \]
Or, \[ \text{yield} = X \beta + \epsilon . \]
The model.matrix( )
function gives you back an X
matrix appropriate for factors as well.
(the “family”; describes the output)
(connects input to output)
(connects linear predictor to probability distribution)
Linear: \[ Y \sim \Normal(X \beta, \sigma) .\]
Binomial (“Logistic”): \[ Y \sim \Binom(n, \logistic(X\beta)) .\]
Poisson: \[ Y \sim \Poisson(\exp(X\beta)) .\]
Gamma: \[ Y \sim \Gam(\text{scale}=\exp(X\beta)/k, \text{shape}=k) .\]
glm(y ~ x, family=binomial)
family {stats}
Family objects provide a convenient way to specify the details of the models
used by functions such as glm.
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
Arguments
link: a specification for the model link function. This can be a
name/expression, a literal character string, a length-one character vector, or
an object of class "link-glm" (such as generated by make.link) provided it is
not specified via one of the standard names given next.
The gaussian family accepts the links (as names) identity, log and inverse; the
binomial family the links logit, probit, cauchit, (corresponding to logistic,
normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);
the Gamma family the links inverse, identity and log; the poisson family the
links log, identity, and sqrt; and the inverse.gaussian family the links
1/mu^2, inverse, identity and log.
Note that the link function is the inverse of what we’d write down in the model: for instance, if the mean of the response is the exponential of the linear predictor, i.e., if \[ \E[Y] = \exp(X\beta) . \] then we have
link = "log"
Other inverses:
logistic
has inverse logit
identity
has inverse identity
x^2
(“squared”) has inverse sqrt
100 trials, where probability of success depends on \(x\):
glm()
##
## Call:
## glm(formula = zz ~ x, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79584 -0.14355 0.01418 0.07181 2.58115
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.7512 2.4701 -3.948 7.89e-05 ***
## x 1.9531 0.4849 4.028 5.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.269 on 99 degrees of freedom
## Residual deviance: 28.807 on 98 degrees of freedom
## AIC: 32.807
##
## Number of Fisher Scoring iterations: 7
logistic_model <- stan_model(model_code=logistic_block)
fit <- sampling(logistic_model, chains=3, iter=1000,
data=list(N=100, X=x, Z=z))
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha -11.231396 0.19863109 2.917856 -18.153892 -12.818953 -10.877573 -9.235022 -6.703278 215.7910 1.017132
## beta 2.245831 0.03897684 0.568362 1.364083 1.848582 2.187879 2.551442 3.576015 212.6360 1.017964
## lp__ -15.467009 0.06580101 1.124115 -18.564140 -15.838721 -15.127840 -14.697168 -14.435249 291.8477 1.007922
samples <- extract(fit)
layout(t(1:2))
hist(samples$alpha, main=expression(alpha))
abline(v=alpha, col='red'); abline(v=coef(glm_fit)[1], col='green')
hist(samples$beta, main=expression(beta))
abline(v=beta, col='red'); abline(v=coef(glm_fit)[2], col='green')
legend("topright", lty=1, col=c('red', 'green'), legend=c('truth', 'glm'))
data {
int N;
vector[N] X;
vector[N] Y;
vector<lower=0> Z[N];
}
parameters {
real beta[2];
}
model {
Z ~ gamma(1, exp(- beta[1] * X - beta[2] * Y));
}
What is
the probability distribution? (describes the output)
the linear predictor? (connects input to output)
the link function? (connects linear predictor to probability distribution)
Then, simulate from it.