Codebook and example merging for NHANES Datasets • PUBH614

Note: you can also do this lab in its Google Colab version, which lets you upload local files.

Codebook: NHANES Dataset

Dataset names: PBCD_J.xpt and INS_J.xpt
Location: NHANES datasets and codebooks for all years are available here.
- There is a single-page list of all download links and codebooks.
- Direct download links from CDC for the 2017-18 PBCD_J.xpt and INS_J.xpt
- Backup download links from GitHub for the same PBCD_J.xpt and INS_J.xpt
- Codebooks for PBCD_J and INS_J

Getting the data

Option 1: Download datasets directly from CDC. Note: the browser seems to block these downloads in webR, so this code is not run here.

# .xpt files are binary, so request binary mode (this matters on Windows)
download.file(
  "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/PBCD_J.xpt",
  destfile = "PBCD_J.xpt",
  mode = "wb"
)
download.file(
  "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/INS_J.xpt",
  destfile = "INS_J.xpt",
  mode = "wb"
)

Option 2: Download datasets from GitHub backup:

Next, load the two datasets and merge (join) them into a single data frame. First, load them using the haven package, which can import data from SAS, SPSS, and Stata formats.
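The loading step might look like the following sketch. So that it runs anywhere, it first writes a tiny stand-in file with haven::write_xpt and reads it back. With the real downloads you would pass "PBCD_J.xpt" and "INS_J.xpt" to read_xpt instead. The object names pbcd and ins, the temporary file, and the toy values are assumptions for illustration, not part of the original lab.

```r
library(haven)

# Toy stand-in for a downloaded NHANES file (values are made up for illustration)
toy <- data.frame(SEQN = c(93703, 93704, 93705), LBXBPB = c(2.98, 0.74, NA))
toy_path <- file.path(tempdir(), "toy.xpt")
write_xpt(toy, toy_path)

# read_xpt() imports a SAS transport (.xpt) file as a tibble
pbcd <- read_xpt(toy_path)

# With the real files downloaded to the working directory:
# pbcd <- read_xpt("PBCD_J.xpt")
# ins  <- read_xpt("INS_J.xpt")
```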

Now join the two datasets, matching by the participant ID column SEQN. This uses full_join, one of the incredibly user-friendly and powerful join functions from the dplyr package.

full_join keeps every participant (every SEQN value), even one who appears in only one of the two datasets; columns from the dataset where that participant is absent are filled with NA. There are other options: for example, inner_join keeps only participants present in both datasets, and left_join keeps only participants present in the first dataset (the first dataset being whichever you specify as the first argument to the join function). See the dplyr help on two-table verbs for more information.
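To make the difference between the joins concrete, here is a minimal sketch on two toy tables keyed by SEQN. The values and the result names (nhanes_full and so on) are made up for illustration; with the real data you would join the objects returned by read_xpt.

```r
library(dplyr)

# Toy tables keyed by SEQN (made-up values): SEQN 2 and 3 appear in both
pbcd <- tibble(SEQN = c(1, 2, 3), LBXBPB = c(2.98, 0.74, 1.10))
ins  <- tibble(SEQN = c(2, 3, 4), LBXIN  = c(9.72, 5.28, 7.00))

nhanes_full  <- full_join(pbcd, ins, by = "SEQN")   # keeps SEQN 1, 2, 3, 4
nhanes_inner <- inner_join(pbcd, ins, by = "SEQN")  # keeps SEQN 2, 3
nhanes_left  <- left_join(pbcd, ins, by = "SEQN")   # keeps SEQN 1, 2, 3

# Participant 1 has no insulin record, so LBXIN is NA in the full join
nhanes_full
```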

Finally, you may want to simplify the dataset using dplyr::select to keep only the columns you intend to use. You don’t need the dplyr:: prefix, but I like to include it: if you forgot library(dplyr) or library(tidyverse), R would either not find select at all or pick up a select function from another loaded package (MASS has one, for example), which has different usage and would give you an error.
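A sketch of the column selection on a toy merged table follows. The column extra_col and the name nhanes_small are hypothetical stand-ins; the real merged data contains more columns than the three kept here.

```r
library(dplyr)

# Toy merged table; extra_col stands in for columns you do not need
nhanes_all <- tibble(
  SEQN      = c(93703, 93704),
  LBXBPB    = c(2.98, 0.74),
  LBXIN     = c(9.72, 5.28),
  extra_col = c("a", "b")
)

# Keep only the participant ID, blood lead, and insulin columns
nhanes_small <- dplyr::select(nhanes_all, SEQN, LBXBPB, LBXIN)
```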

Getting and joining other NHANES datasets

You can join two datasets like above then join a third dataset, and so on, repeating as many times as you want. Just make sure to use the same participant ID (SEQN) and the same NHANES cycle. Here’s a shorthand join of several 2021-23 datasets, downloaded then inner_joined to keep only participants for whom all data are available. These datasets are:

The following command downloads and reads the dataset on the first line, joins it with the dataset on the second line, and then joins the result with the dataset on the third line.
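The chaining pattern can be sketched on toy tables as below. The table names (demo, lab_a, lab_b), their columns, and their values are hypothetical and are not the 2021-23 files; only the result name nhanes2 and the SEQN key come from the text.

```r
library(dplyr)

# Three toy tables sharing the SEQN key (made-up values)
demo  <- tibble(SEQN = c(1, 2, 3, 4), age   = c(34, 51, 27, 62))
lab_a <- tibble(SEQN = c(1, 2, 3),    lab_a = c(0.5, 1.2, 0.9))
lab_b <- tibble(SEQN = c(2, 3, 4),    lab_b = c(7.1, 6.4, 8.0))

# Each inner_join keeps only participants present in everything joined so far
nhanes2 <- demo %>%
  inner_join(lab_a, by = "SEQN") %>%
  inner_join(lab_b, by = "SEQN")
```

Only SEQN 2 and 3 survive here, because they are the only participants present in all three toy tables.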

Success! If you want to save the merged nhanes2 object, use readr::write_csv or something like it. Take a look at the dataset to see what you have:
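Saving and inspecting might look like this minimal sketch. The toy nhanes2 table and the temporary output path are assumptions so that the block runs on its own; with the real merged object you would skip the first step and choose your own filename.

```r
library(dplyr)
library(readr)

# Toy stand-in for the merged object (made-up values)
nhanes2 <- tibble(SEQN = c(2, 3), age = c(51, 27), lab_a = c(1.2, 0.9))

# Save as CSV; a temporary directory is used here only for illustration
out_path <- file.path(tempdir(), "nhanes2.csv")
write_csv(nhanes2, out_path)

# Inspect column names, types, and the first few values
glimpse(nhanes2)
head(nhanes2)
```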

Note: if you are running outside of webR, you can skip the separate download step and pass the full URL in place of the filename. Most other R data-loading functions can also take a URL as the filename.
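For example, outside webR the CDC file from Option 1 could be read in one step, roughly as sketched below. The helper name read_remote_xpt is hypothetical, and the actual read is left commented out because it needs network access.

```r
library(haven)

# CDC URL for the 2017-18 blood lead/cadmium file (same as in Option 1)
pbcd_url <- "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/PBCD_J.xpt"

# Hypothetical thin wrapper: read_xpt() accepts a URL as well as a local path
read_remote_xpt <- function(url) {
  haven::read_xpt(url)
}

# Requires network access, so it is not run here:
# pbcd <- read_remote_xpt(pbcd_url)
```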

Dataset Overview

The nhanes dataset contains 8,366 observations of three variables from the 2017-18 NHANES cycle.

Variables

SEQN

Description: Sequence number
Type: Numeric
Values: Various sequence numbers, e.g., 93703, 93704, etc.
Missing values: 0

LBXBPB

Description: Blood lead level (ug/dL)
Type: Numeric
Values: Various blood lead levels, e.g., 2.98, 0.74, etc.
Missing values: 1,482

LBXIN

Description: Insulin level (uU/mL)
Type: Numeric
Values: Various insulin levels, e.g., 9.72, 5.28, etc.
Missing values: 5,541

Summary Statistics

SEQN omitted for brevity as it contains unique values for each observation.

Data Quality Notes

The dataset has missing values in the LBXBPB column (1,482 missing) and the LBXIN column (5,541 missing).
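Missing-value counts and summary statistics like those in this codebook can be computed along the following lines. This is a sketch on a four-row toy table with made-up missingness, so its numbers are not the real NHANES figures.

```r
# Toy stand-in for the nhanes data frame (made-up values and NAs)
nhanes <- data.frame(
  SEQN   = c(93703, 93704, 93705, 93706),
  LBXBPB = c(2.98, 0.74, NA, 1.20),
  LBXIN  = c(9.72, NA, NA, 5.28)
)

# Number of missing values in each column
missing_counts <- colSums(is.na(nhanes))
missing_counts

# Summary statistics for the two lab measurements (SEQN omitted)
summary(nhanes[, c("LBXBPB", "LBXIN")])
```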

Acknowledgements

This codebook was drafted by Microsoft Copilot and edited by Levi Waldron.

Who am I kidding, Copilot was not very helpful with this Codebook because there were too many weird issues with downloading certain files in webR. The only useful things it did were the Dataset Overview and Variables sections, although even there I had to fix the numbers which were incorrect in its output.