Codebook and example merging for NHANES Datasets
Note: you can also do this lab and upload local files on its Google Colab version.
Codebook: NHANES Dataset
- Dataset names: PBCD_J.xpt and INS_J.xpt
- Location: NHANES datasets and codebooks for all years are available here.
- There is a single-page list of all download links and codebooks.
- Direct download links from CDC for the 2017-18 PBCD_J.xpt and INS_J.xpt
- Backup download links from GitHub for the same PBCD_J.xpt and INS_J.xpt
- Codebooks for PBCD_J and INS_J
Getting the data
Option 1: Download the datasets directly from CDC. Note: the browser seems to block this download in webR, so it is not run here.
# mode = "wb" writes the files as binary, which keeps .xpt files intact on Windows
download.file(
  "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/PBCD_J.xpt",
  destfile = "PBCD_J.xpt",
  mode = "wb"
)
download.file(
  "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/INS_J.xpt",
  destfile = "INS_J.xpt",
  mode = "wb"
)
Option 2: Download datasets from GitHub backup:
You then need to load these datasets and merge, or join, them into a single data frame. First, load them using the haven package, which can import data from SAS, SPSS, and Stata formats.
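A minimal sketch of the loading step, assuming both files were already saved in the working directory under the file names used in the download commands. The object names pbcd and ins are illustrative choices, not names taken from the original lab.

```r
library(haven)

# Assumed: both files are already in the working directory.
xpt_files <- c("PBCD_J.xpt", "INS_J.xpt")

if (all(file.exists(xpt_files))) {
  pbcd <- read_xpt("PBCD_J.xpt")  # blood metals, including lead (LBXBPB)
  ins  <- read_xpt("INS_J.xpt")   # insulin (LBXIN)
  print(dim(pbcd))
  print(dim(ins))
} else {
  message("Download PBCD_J.xpt and INS_J.xpt first.")
}
```

The file.exists check is only there so the sketch fails gently when the download step was skipped.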
Now join the two datasets, matching by the participant ID column SEQN. This uses full_join, one of the incredibly user-friendly and powerful join functions from the dplyr package.
full_join means that all participants in the SEQN column will be kept, even if a participant exists in only one of the two datasets. There are other options: for example, inner_join keeps only participants present in both datasets, and left_join keeps only participants present in the first dataset (the first dataset being whichever you specify as the first argument to the join function). See the dplyr help on two-table verbs for more information.
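To make the difference between the joins concrete, here is a small sketch on made-up data. The two toy tables stand in for the lead and insulin files; the values and the object names are invented for illustration only.

```r
library(dplyr)

# Toy stand-ins for the two NHANES tables; all values are made up.
lead    <- tibble(SEQN = c(1, 2, 3), LBXBPB = c(0.9, 1.4, 2.1))
insulin <- tibble(SEQN = c(2, 3, 4), LBXIN = c(7.5, 12.0, 9.3))

all_rows  <- full_join(lead, insulin, by = "SEQN")   # participants 1, 2, 3, 4
both_rows <- inner_join(lead, insulin, by = "SEQN")  # participants 2, 3
left_rows <- left_join(lead, insulin, by = "SEQN")   # participants 1, 2, 3

print(all_rows)
print(both_rows)
print(left_rows)
```

In the full_join result, participant 1 has NA for LBXIN and participant 4 has NA for LBXBPB, which is how a full join of real NHANES files ends up with missing values. With the real data, the same call would take the two data frames loaded from PBCD_J.xpt and INS_J.xpt.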
Finally, you may want to simplify the dataset using dplyr::select to keep only the columns you intend to use. You don’t need the dplyr:: prefix, but I like to include it: if you forget library(dplyr) or library(tidyverse), a bare select call either fails because the function is not found or picks up a select function from another loaded package (MASS, for example), which has different usage and would give you a confusing error.
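A short sketch of the column selection, using a made-up merged table with one extra column. The table, its values, and the names merged, nhanes_small, and EXTRA are hypothetical.

```r
library(dplyr)

# Toy merged table with one column we do not need; values are made up.
merged <- tibble(
  SEQN   = c(1, 2, 3),
  LBXBPB = c(0.9, 1.4, NA),
  LBXIN  = c(NA, 7.5, 12.0),
  EXTRA  = c("a", "b", "c")
)

# Keep only the participant ID, blood lead, and insulin columns.
nhanes_small <- dplyr::select(merged, SEQN, LBXBPB, LBXIN)
names(nhanes_small)
```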
Getting and joining other NHANES datasets
You can join two datasets like above then join a third dataset, and so on, repeating as many times as you want. Just make sure to use the same participant ID (SEQN) and the same NHANES cycle. Here’s a shorthand join of several 2021-23 datasets, downloaded then inner_joined to keep only participants for whom all data are available. These datasets are:
- Demographic Variables and Sample Weights
- Dietary Supplement Use 30-day - Total Dietary Supplements
- Body Measures
The following command downloads and reads the dataset on the first line, joins it with the dataset on the second line, and then joins the result with the dataset on the third line.
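The chained pattern described here can be sketched on made-up data as follows. The three toy tables stand in for the demographics, supplement, and body measures files. The object names are hypothetical, and the column names are only illustrative examples of NHANES-style variables.

```r
library(dplyr)

# Toy stand-ins for three files from the same NHANES cycle; values are made up.
demo_tbl <- tibble(SEQN = c(1, 2, 3, 4), RIDAGEYR = c(34, 51, 8, 67))
supp_tbl <- tibble(SEQN = c(1, 2, 4),    DSDCOUNT = c(0, 2, 1))
bmx_tbl  <- tibble(SEQN = c(2, 3, 4),    BMXBMI   = c(27.1, 16.4, 30.2))

# Start from the first table, join the second, then join the third.
nhanes2 <- demo_tbl %>%
  inner_join(supp_tbl, by = "SEQN") %>%
  inner_join(bmx_tbl, by = "SEQN")

print(nhanes2)
```

Only participants 2 and 4 appear in all three toy tables, so only they survive the two inner joins. With real files, each toy table would be replaced by a call that reads the corresponding .xpt file.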
Success! If you want to save the merged nhanes2 object, use readr::write_csv or something like it. Take a look at the dataset to see what you have:
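A minimal sketch of saving and inspecting the result, using a tiny made-up nhanes2 so that it runs on its own. The output location in the temporary directory is an assumption; pick any path you like.

```r
library(dplyr)
library(readr)

# Toy stand-in for the merged object; values are made up.
nhanes2 <- tibble(SEQN = c(2, 4), BMXBMI = c(27.1, 30.2))

# Write to a temporary folder here; use any file path in practice.
out_file <- file.path(tempdir(), "nhanes2.csv")
write_csv(nhanes2, out_file)

# Compact overview: one line per column with its type and first values.
glimpse(nhanes2)
```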
Note: if you are running outside of webR, you can skip the download step and provide the full URL in place of the filename. Most other R data-loading functions can also take a URL as the filename.
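As a sketch of reading straight from a URL, the following passes the CDC address from the direct-download example to haven::read_xpt. It assumes a network connection and that the server allows the request. The tryCatch wrapper and the object names are additions for illustration.

```r
library(haven)

# URL taken from the direct-download example earlier in this lab.
pbcd_url <- "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/PBCD_J.xpt"

# tryCatch lets the script continue if the download is blocked or you are offline.
pbcd <- tryCatch(
  read_xpt(pbcd_url),
  error = function(e) {
    message("Could not read from the URL: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(pbcd)) print(dim(pbcd))
```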
Dataset Overview
The nhanes dataset contains three variables from the 2017-18 NHANES cycle and includes 8,366 observations.
Variables
SEQN
- Description: Sequence number
- Type: Numeric
- Values: Various sequence numbers, e.g., 93703, 93704, etc.
- Missing values: 0
LBXBPB
- Description: Blood lead level (ug/dL)
- Type: Numeric
- Values: Various blood lead levels, e.g., 2.98, 0.74, etc.
- Missing values: 1,482
LBXIN
- Description: Insulin level (uU/mL)
- Type: Numeric
- Values: Various insulin levels, e.g., 9.72, 5.28, etc.
- Missing values: 5,541
Summary Statistics
- SEQN omitted for brevity as it contains unique values for each observation.
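One way to produce summary statistics like these is base R's summary function applied to the two measurement columns. The sketch below uses a made-up five-row nhanes table, so its numbers are not the real NHANES values.

```r
# Toy stand-in for the merged dataset; values are made up.
nhanes <- data.frame(
  SEQN   = c(1, 2, 3, 4, 5),
  LBXBPB = c(0.9, NA, 2.1, 1.4, 0.6),
  LBXIN  = c(NA, 7.5, NA, 12.0, NA)
)

# Leave out SEQN, which is an identifier rather than a measurement.
lead_insulin_summary <- summary(nhanes[, c("LBXBPB", "LBXIN")])
print(lead_insulin_summary)
```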
Data Quality Notes
- The dataset has missing values in the LBXBPB and LBXIN columns.
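Missing-value counts per column can be checked with a one-liner. This sketch uses a made-up five-row table rather than the real data, so the counts differ from those in the Variables section.

```r
# Toy stand-in for the merged dataset; values are made up.
nhanes <- data.frame(
  SEQN   = c(1, 2, 3, 4, 5),
  LBXBPB = c(0.9, NA, 2.1, 1.4, 0.6),
  LBXIN  = c(NA, 7.5, NA, 12.0, NA)
)

# is.na() marks each missing cell; colSums() counts them per column.
missing_counts <- colSums(is.na(nhanes))
print(missing_counts)
```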
Acknowledgements
This codebook was drafted by Microsoft Copilot and edited by Levi Waldron.
Who am I kidding, Copilot was not very helpful with this Codebook because there were too many weird issues with downloading certain files in webR. The only useful things it did were the Dataset Overview and Variables sections, although even there I had to fix the numbers which were incorrect in its output.