
Homework 4:

Assignment: Your task is to use Rmarkdown to write a short report, readable by a technically literate person. The code you used should not be visible in the final report (unless you have a good reason to show it).
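One common way to keep code out of the knitted report is to set chunk defaults once, near the top of the Rmd file. The sketch below assumes the `knitr` package (which R Markdown uses to run chunks); hiding messages and warnings as well is a choice made here, not something the assignment asks for.

```r
# Intended for a setup chunk at the top of the Rmd file.
# echo = FALSE hides the code of every later chunk by default;
# message/warning = FALSE (an assumption, optional) also hide package chatter.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

An individual chunk can still show its code by setting `echo=TRUE` in its own header, which covers the "good reason to show it" case.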

Due: Submit your work via Canvas by the end of the day (midnight) on Thursday, October 27th. Please submit both the Rmd file and the resulting html or pdf file. You can work with other members of the class, but I expect each of you to construct and run all of the scripts yourself.

The problem

We have gene expression data from 15,253 genes in 12 male pipefish; half are pregnant and half are not, and we would like to describe how gene expression differs between the pregnant and non-pregnant fish. The data are from Small et al. 2016, and are available in the file pipefish_RNAseq_CPM.tsv (described more in the README). There is one row per gene; the columns starting with P give gene expression measurements for pregnant individuals, and those starting with N give the same for non-pregnant individuals. The gene expression measurements have already been normalized (but you will do some more). The final two columns give some more information about the gene.
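As a rough illustration of that layout, the sketch below reads a tiny made-up table and splits it into pregnant and non-pregnant columns by name prefix. The column names (`P1`, `N1`, `info_a`, and so on) and the values are invented for the example; the real names are in the README.

```r
# Toy stand-in for pipefish_RNAseq_CPM.tsv. Column names and values are
# hypothetical, chosen only to mimic the described layout.
toy_tsv <- "P1\tP2\tN1\tN2\tinfo_a\tinfo_b
5.1\t4.8\t9.9\t10.4\tx\ty
0\t0.3\t0.2\t0\tx\ty
"
expr <- read.delim(text = toy_tsv, check.names = FALSE)
# With the real file this would be something like:
# expr <- read.delim("pipefish_RNAseq_CPM.tsv")

# Select sample columns by prefix. Caveat: check that the two annotation
# columns in the real file do not also start with "P" or "N".
preg_cols <- startsWith(names(expr), "P")
nonpreg_cols <- startsWith(names(expr), "N")

preg <- as.matrix(expr[, preg_cols])
nonpreg <- as.matrix(expr[, nonpreg_cols])
```

Keeping the two groups as numeric matrices makes row-wise summaries such as `rowMeans()` straightforward.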

To do this, you should: (1) Log-transform the data (you’ll need to first add a small number, say 0.1, to eliminate zeros), visualizing the distribution of mean expression across genes before and after the transformation. (2) Do \(t\)-tests comparing expression of pregnant and non-pregnant individuals for every gene, and visualize the distributions of the resulting differences in expression and (unadjusted) \(p\)-values. (3) Report (in a table) the top 20 or so genes, and make a plot depicting the expression levels of the genes with \(p\)-value below some reasonable false-positive-adjusted threshold. (4) The previous plot should allow you to see that those genes with the strongest differences between groups are expressed at a higher level in non-pregnant males (especially if you subtract off means for each gene first). Describe the extent of this enrichment.
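The mechanics of these steps can be sketched on simulated data. Everything below is an assumption made for illustration: the data are random numbers rather than the pipefish measurements, the tests use R's default Welch \(t\)-test, and the Benjamini-Hochberg adjustment is just one possible choice of false-positive adjustment (Bonferroni would be another).

```r
set.seed(1)
n_genes <- 200

# Simulated CPM-like values (NOT the pipefish data): 6 "P" and 6 "N" columns.
cpm <- matrix(rexp(n_genes * 12, rate = 0.1), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              c(paste0("P", 1:6), paste0("N", 1:6))))
is_preg <- startsWith(colnames(cpm), "P")

# (1) Log-transform after adding 0.1 to eliminate zeros.
log_expr <- log(cpm + 0.1)

# (2) One t-test per gene (row); keep the difference in means and the p-value.
tt <- apply(log_expr, 1, function(x) {
  fit <- t.test(x[is_preg], x[!is_preg])  # Welch t-test (R's default)
  c(diff = unname(fit$estimate[1] - fit$estimate[2]), p = fit$p.value)
})
results <- as.data.frame(t(tt))

# (3) Adjust for multiple testing (BH is an assumed choice) and take the
# 20 genes with the smallest unadjusted p-values.
results$p_adj <- p.adjust(results$p, method = "BH")
top <- head(results[order(results$p), ], 20)

# (4) Fraction of those genes with lower mean expression in the "P" group.
frac_lower_in_preg <- mean(top$diff < 0)
```

Here `diff` is the pregnant mean minus the non-pregnant mean on the log scale, so negative values indicate higher expression in non-pregnant fish. Histograms of `rowMeans(cpm)`, `rowMeans(log_expr)`, `results$diff`, and `results$p` would give the requested visualizations. Note that `t.test()` stops with an error on a gene whose values are all identical, so such rows in real data may need to be handled separately.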