On using existing datasets: A fun and exciting part of data analysis is exploration: what will we discover in the data? It can also be frustrating, especially if you don’t know what you’re looking for: have I not found X because I have a bug in my code, or because it’s not there? This is one reason it’s important to have clear goals and questions, not only for final analyses but also for the exploratory phase: be clear about which aspects of the data you want to summarize. If you want to analyze a dataset attached to a paper, you might feel like reading the paper first is cheating, because then you already know “what the answer is”. But reanalyzing a dataset by following along with what someone else did is a very useful exercise, in part because it lets you focus on the statistics separately from figuring out what the goal is. Of course, figuring out what to look at and what questions to ask is another very important skill to practice, but it helps to separate the two when learning.

Other sources of datasets

Here are links to data from some studies that other people have done, and also some datasets for teaching.

Datasets in this directory:

Airbnb_listings_Portland

PanTHERIA:

A dataset containing information about 5,416 extant mammal species.

Ocean temperature near Newport:

CO2 concentration at Mauna Loa:

Cream Cheese sensory data

Lizard morphology before and after hurricanes

Nestling morphology and incubation temperature

stickleback_GFvsCV_RNAseq/

stickleback_MAvsCV_RNAseq/

stickleback_CommonGarden_16S/

pipefish_PregVsNonPreg_RNAseq/

Perchlorate

  • TODO: what is this?

Galton’s parent-child height data:

NYTS tobacco use survey data

A summary dataset for 2011-2017 is available here, and individual-level data is here, with tens of thousands of individuals per year.

  • Data in the YTS/ directory (no README yet).

Data from Logan:

Some datasets from the book Biostatistical Design and Analysis Using R are in the directory Logan_data/ (no README yet).

Other ideas

Fly distributions and associated weather data in the LA basin

From the paper “Temperature accounts for the biodiversity of a hyperdiverse group of insects in urban Los Angeles”: monthly samples from twelve sites in LA in 2014, covering about 100 taxa: dryad

New York Housing Authority data

Lots and lots of data from across decades are available, maybe here. (But note that data cleaning and interpretation is a major task.)