
The data

For this lab we load the Small-cell Lung Cancer dataset (see codebook).

When working with files in R, it’s important to know what your working directory is, and where the file that you want to load is. This lab uses webR/WASM, which runs in your web browser and is isolated from your computer’s filesystem. You’ll see that this R session is in an empty directory that exists only in this session and not otherwise on your computer.

You would use these same commands when running R on your computer, but you would see the same directories and files as in your file explorer.
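As a minimal sketch, the working directory and its contents can be inspected with base R. The variable name `wd` is only an illustrative choice:

```r
wd <- getwd()    # path of the current working directory
wd

list.files(wd)   # files in that directory; none in a fresh webR session
```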

Since the webR/WASM filesystem is isolated from your local filesystem, we’ll use download.file() to download the dataset from the web instead of clicking a link in a web browser. You could use a service such as Dropbox to create a downloadable URL for another file you want to use.

Datasets for PUBH614 labs are kept at https://github.com/CUNY-epibios/PUBH614/tree/main/datasets. After clicking on a dataset, be sure to press the “Raw” button to get the URL for the raw dataset, as opposed to the HTML page displaying it. For this dataset, it is https://raw.githubusercontent.com/CUNY-epibios/PUBH614/refs/heads/main/datasets/Stats4-%20more.csv.

The destfile argument specifies what filename to use for the downloaded file.
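A sketch of the download step, using the raw URL given above. The local filename `lung_cancer.csv` is an assumption made for illustration; any filename can be passed as `destfile`:

```r
# Raw URL copied from the text above
dataset_url <- "https://raw.githubusercontent.com/CUNY-epibios/PUBH614/refs/heads/main/datasets/Stats4-%20more.csv"

# Hypothetical local filename; choose whatever name you prefer
destfile <- "lung_cancer.csv"

# Save the remote file into the current working directory
download.file(url = dataset_url, destfile = destfile)
```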

Check now that this file has been downloaded:
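One way to run this check, assuming the file was saved under the hypothetical name `lung_cancer.csv` (substitute whatever you passed as `destfile`):

```r
destfile <- "lung_cancer.csv"   # hypothetical filename, adjust to your own

file.exists(destfile)           # TRUE if the download succeeded
list.files()                    # the file should also appear in this listing
```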

readr provides read_csv, a more powerful version of the base-R read.csv function.
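To compare the two readers without relying on the lab dataset, the sketch below writes a tiny stand-in CSV to a temporary file. The column names and values are invented for illustration and are not taken from the lab data:

```r
library(readr)

# Tiny stand-in CSV; columns and values are hypothetical
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("smoking,cancer,sex",
    "1,1,male",
    "0,0,female",
    "1,0,female"),
  toy_path
)

toy_readr <- read_csv(toy_path, show_col_types = FALSE)  # returns a tibble
toy_base  <- read.csv(toy_path)                          # returns a plain data.frame

class(toy_readr)
class(toy_base)
```

For the lab itself you would pass the downloaded filename to read_csv() instead of `toy_path`.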

If you are using RStudio, the “File - Import Dataset” option provides a much simpler way to find datasets on your local filesystem and import them - see here.

Smoking and Lung Cancer Analysis

Other functions read other file types, such as read_xlsx() from the readxl package for Excel workbooks.

Use glimpse to see what’s there:
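A short sketch of glimpse() from dplyr, applied to a small made-up data frame. The object name `toy` and its columns are hypothetical stand-ins for the imported dataset:

```r
library(dplyr)

# Made-up rows for illustration only; not the lab data
toy <- data.frame(
  smoking = c(1, 0, 1, 0),
  cancer  = c(1, 0, 0, 0),
  sex     = c("male", "female", "female", "male")
)

# One line per column: name, type, and the first few values
glimpse(toy)
```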

Cross-tabulation of Smoking and Lung Cancer

Compare with slide 8 of session 4:
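A cross-tabulation of this kind can be built with table(). The sketch below uses made-up counts and hypothetical column names (`smoking`, `cancer`), not the lab data, so its numbers will not match the slide:

```r
# Made-up counts for illustration only; NOT the lab data
toy <- data.frame(
  smoking = rep(c(1, 1, 0, 0), times = c(30, 20, 10, 40)),
  cancer  = rep(c(1, 0, 1, 0), times = c(30, 20, 10, 40))
)

# Rows are smoking status, columns are lung cancer status
tab <- table(smoking = toy$smoking, cancer = toy$cancer)
tab

addmargins(tab)               # add row and column totals
prop.table(tab, margin = 1)   # proportion with cancer within each smoking group
```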

Logistic Regression Analysis

Basic Model: Smoking and Lung Cancer

The coefficients are on the log-odds scale, so we exponentiate them to get the odds ratio for lung cancer associated with smoking:
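A minimal sketch of this step with glm(), fitted to made-up counts rather than the lab data. The column names `smoking` and `cancer` are assumptions:

```r
# Made-up counts for illustration only; NOT the lab data
toy <- data.frame(
  smoking = rep(c(1, 1, 0, 0), times = c(30, 20, 10, 40)),
  cancer  = rep(c(1, 0, 1, 0), times = c(30, 20, 10, 40))
)

# Logistic regression: binary outcome, binomial family (logit link by default)
fit <- glm(cancer ~ smoking, data = toy, family = binomial)

coef(fit)                                  # intercept and slope, in log odds
or_smoking <- exp(coef(fit))["smoking"]    # exponentiate to get the odds ratio
or_smoking

exp(confint.default(fit))                  # Wald 95% intervals on the odds-ratio scale
```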

Adjusted Model: Smoking, Lung Cancer, and Sex

Note that the odds ratio for smoking is now adjusted for sex.
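Adjustment amounts to adding the confounder to the model formula. The sketch below compares a crude and a sex-adjusted model on made-up counts. The column names and all numbers are hypothetical:

```r
# Made-up counts for illustration only; NOT the lab data.
# Each run of rows is one sex x smoking x cancer cell.
toy <- data.frame(
  sex     = rep(c("male", "female"), times = c(100, 80)),
  smoking = rep(c(1, 1, 0, 0, 1, 1, 0, 0), times = c(40, 20, 10, 30, 15, 15, 10, 40)),
  cancer  = rep(c(1, 0, 1, 0, 1, 0, 1, 0), times = c(40, 20, 10, 30, 15, 15, 10, 40))
)

crude    <- glm(cancer ~ smoking,       data = toy, family = binomial)
adjusted <- glm(cancer ~ smoking + sex, data = toy, family = binomial)

or_crude <- exp(coef(crude))["smoking"]      # ignores sex
or_adj   <- exp(coef(adjusted))["smoking"]   # holds sex constant

or_crude
or_adj
```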

Sex-specific Estimates

Analysis for Men
Analysis for Women

This demonstrates that estimating odds ratios from a suitable two-by-two table gives you the same answer as using logistic regression, but logistic regression is much easier to use, particularly when adjusting for confounders. However, you do need to have your data in the right format - here, a row for each person with their smoking status, lung cancer status, and sex.
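The equivalence can be checked directly. The sketch below uses made-up counts with one row per person, and the helper names `table_or` and `glm_or` are hypothetical. It computes the odds ratio within each sex from the two-by-two table and from a logistic regression on the same rows:

```r
# Made-up counts for illustration only; NOT the lab data
toy <- data.frame(
  sex     = rep(c("male", "female"), times = c(100, 80)),
  smoking = rep(c(1, 1, 0, 0, 1, 1, 0, 0), times = c(40, 20, 10, 30, 15, 15, 10, 40)),
  cancer  = rep(c(1, 0, 1, 0, 1, 0, 1, 0), times = c(40, 20, 10, 30, 15, 15, 10, 40))
)

# Odds ratio from a two-by-two table: (a * d) / (b * c)
table_or <- function(d) {
  tab <- table(smoking = d$smoking, cancer = d$cancer)
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}

# Odds ratio from a logistic regression fitted to the same rows
glm_or <- function(d) {
  fit <- glm(cancer ~ smoking, data = d, family = binomial)
  unname(exp(coef(fit))["smoking"])
}

men   <- subset(toy, sex == "male")
women <- subset(toy, sex == "female")

c(table = table_or(men),   glm = glm_or(men))
c(table = table_or(women), glm = glm_or(women))
```

With these invented counts the table formula gives (40 × 30) / (20 × 10) = 6 for men and (15 × 40) / (15 × 10) = 4 for women, and the regression estimates agree up to numerical tolerance.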

Job Callback Analysis

Initial Data Exploration

Gender Analysis

Callback by Gender

Logistic Regression for Gender

Education Analysis

Callback by Years of College

Logistic Regression for Education

Adjusted Education Analysis

Race Analysis

Callback by Race

Basic Race Model

Adjusted Race Model

Experience Analysis

Overall Experience Effect

Experience Effect by Demographics

For Black Candidates
For White Candidates
For Male Candidates
For Female Candidates

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.