Datasets
On using existing datasets: A fun and exciting part of data analysis is exploration: what will we discover in the data? It can also be frustrating, especially if you don’t know what you’re looking for: have I not found X because I have a bug in my code or because it’s not there? This is one reason it’s important to have clear goals and questions, not only for final analyses but also for the exploratory phase: be clear about which aspects of the data you want to summarize. If you want to analyze a dataset attached to a paper, you might feel like reading the paper first is cheating, because then you already know “what the answer is”. But reanalyzing a dataset by following along with what someone else did is a very useful exercise, in part because it lets you focus on the statistics separately from figuring out what the goal is. Of course, figuring out what to look at and what questions to ask is another very important skill to practice, but it helps to separate these when learning.
Other sources of datasets
Here are links to data from some studies that other people have done, and also some datasets for teaching.
- Data Dryad, Pangaea, KNB - repositories for data sets from scientific studies.
- List of links from Statistical Science
Datasets in this directory:
Airbnb_listings_Portland
PanTHERIA:
A dataset containing information about 5,416 extant mammal species.
Ocean temperature near Newport:
CO2 concentration at Mauna Loa:
Cream Cheese sensory data
Lizard morphology before and after hurricanes
Nestling morphology and incubation temperature
stickleback_GFvsCV_RNAseq/
stickleback_MAvsCV_RNAseq/
stickleback_CommonGarden_16S/
pipefish_PregVsNonPreg_RNAseq/
Perchlorate
- TODO: what is this?
Galton’s parent-child height data:
NYTS tobacco use survey data
A summary dataset is available here, for 2011-2017, but individual-level data is here, with tens of thousands of individuals per year.
- Data in the YTS/ directory (no README yet).
Data from Logan:
Some datasets from the book Biostatistical Design and Analysis Using R are in the directory Logan_data/ (no README yet).
Other ideas
Fly distributions and associated weather data in the LA basin
From the paper “Temperature accounts for the biodiversity of a hyperdiverse group of insects in urban Los Angeles”: twelve sites in LA in 2014, about 100 taxa, from monthly samples. Data: dryad
New York Housing Authority data
Lots and lots of data from across decades are available, maybe here. (But note that data cleaning and interpretation is a major task.)