04b. Logistic regression
The data
For this lab we load the Small-cell Lung Cancer dataset (see codebook).
When working with files in R, it’s important to know what your working directory is, and where the file that you want to load is. This lab uses webR/WASM, which runs in your web browser and is isolated from your computer’s filesystem. You’ll see that this R session is in an empty directory that exists only in this session and not otherwise on your computer.
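A minimal sketch of how to inspect the working directory, using only base R functions. The exact commands used in the original lab are not shown here, so treat these as one reasonable choice.

```r
# Print the path of the current working directory
getwd()

# List the files in the working directory (empty in a fresh webR session)
list.files()
```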
You would use these same commands when running R on your computer, but you would see the same directories and files as in your file explorer.
Since the webR/WASM filesystem is isolated from your local filesystem, we’ll use download.file() to download the dataset from the web, instead of clicking in a web browser. You could use a service such as Dropbox to create a downloadable URL for another file you want to use.
Datasets for PUBH614 labs are kept at https://github.com/CUNY-epibios/PUBH614/tree/main/datasets. After clicking on a dataset, be sure to press the “Raw” button to get the URL for the raw dataset, as opposed to the HTML page displaying it. For this dataset, it is https://raw.githubusercontent.com/CUNY-epibios/PUBH614/refs/heads/main/datasets/Stats4-%20more.csv.
The destfile argument specifies what filename to use for the downloaded file.
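The download step might look like the sketch below. The URL is the one given above; the local filename "Stats4-more.csv" is an assumption chosen for illustration, and any valid filename would work. The tryCatch() wrapper is an addition so that a failed download prints a readable message instead of stopping the session.

```r
# Raw URL of the dataset on GitHub (from the text above)
url <- "https://raw.githubusercontent.com/CUNY-epibios/PUBH614/refs/heads/main/datasets/Stats4-%20more.csv"

# Hypothetical local filename for the downloaded copy
destfile <- "Stats4-more.csv"

# Download the file; report a message rather than an error if it fails
status <- tryCatch(
  download.file(url, destfile = destfile),
  error = function(e) {
    message("Download failed: ", conditionMessage(e))
    NA
  }
)
```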
Check now that this file has been downloaded:
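One way to check, sketched with base R. The filename "Stats4-more.csv" is a hypothetical placeholder; substitute whatever you passed as destfile.

```r
# The downloaded file should now appear in the directory listing
list.files()

# Or test for one specific file (hypothetical filename)
file.exists("Stats4-more.csv")
```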
readr provides read_csv, a more powerful version of the base-R read.csv function.
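As a self-contained sketch of read_csv, the example below writes a tiny stand-in CSV to a temporary file and reads it back. The file contents and the column names id, smoking, and cancer are invented for illustration and are not the lab dataset; in the lab you would pass the name of the downloaded file instead.

```r
library(readr)

# Hypothetical stand-in file so the example runs on its own
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,smoking,cancer",
             "1,yes,no",
             "2,no,no",
             "3,yes,yes"), tmp)

# read_csv returns a tibble; show_col_types = FALSE silences the column summary
dat <- read_csv(tmp, show_col_types = FALSE)
dat
```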
If you are using RStudio, the “File - Import Dataset” option provides a much simpler way to find datasets on your local filesystem and import them.
Other functions read other file types; for example, read_xlsx from the readxl package reads Excel spreadsheets.
Smoking and Lung Cancer Analysis
Use glimpse to see what’s there:
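A short sketch of glimpse, which comes from the dplyr package, applied to a toy data frame. The data and column names are hypothetical stand-ins for the lab dataset.

```r
library(dplyr)

# Hypothetical toy data: one row per person
toy <- data.frame(
  smoking = c("yes", "no", "yes", "no"),
  cancer  = c("yes", "no", "no", "no"),
  sex     = c("M", "F", "F", "M")
)

# One line per column: name, type, and the first few values
glimpse(toy)
```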
Cross-tabulation of Smoking and Lung Cancer
Compare with slide 8 of session 4:
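A cross-tabulation of this kind can be sketched with base R's table(). The counts and the column names smoking and cancer below are hypothetical, chosen only so the example runs on its own; they are not the values from the lab dataset or the slide.

```r
# Hypothetical toy data: one row per person (not the lab dataset)
toy <- data.frame(
  smoking = rep(c("yes", "yes", "no", "no"), times = c(40, 60, 10, 90)),
  cancer  = rep(c("yes", "no", "yes", "no"), times = c(40, 60, 10, 90))
)

# Counts of lung cancer status within each smoking group
tab <- table(smoking = toy$smoking, cancer = toy$cancer)
tab

# Row proportions: the share with and without cancer in each smoking group
prop.table(tab, margin = 1)
```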
Logistic Regression Analysis
Basic Model: Smoking and Lung Cancer
The coefficients are on the log-odds scale, so we exponentiate them to get the odds ratio for lung cancer comparing smokers with non-smokers:
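The fit-then-exponentiate step might look like the sketch below, using glm() with family = binomial. The toy counts and the 0/1 column names smoking and cancer are assumptions for illustration, not the lab dataset.

```r
# Hypothetical toy data, coded 1 = yes, 0 = no (not the lab dataset)
toy <- data.frame(
  smoking = rep(c(1, 1, 0, 0), times = c(40, 60, 10, 90)),
  cancer  = rep(c(1, 0, 1, 0), times = c(40, 60, 10, 90))
)

# Logistic regression of lung cancer on smoking
fit <- glm(cancer ~ smoking, data = toy, family = binomial)

# Coefficients on the log-odds scale
coef(fit)

# Exponentiated: the intercept is the odds of cancer among non-smokers,
# and the smoking term is the odds ratio for smokers vs non-smokers
exp(coef(fit))
```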
Adjusted Model: Smoking, Lung Cancer, and Sex
Note that the odds ratio for smoking is now adjusted for sex.
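Adjustment amounts to adding the confounder to the model formula. The sketch below uses hypothetical toy counts and the invented column names smoking, cancer, and sex; it is not the lab dataset.

```r
# Hypothetical toy data: 100 men followed by 100 women (not the lab dataset)
n <- c(30, 30, 5, 35, 10, 30, 5, 55)
toy <- data.frame(
  sex     = rep(c("M", "F"), times = c(100, 100)),
  smoking = rep(c(1, 1, 0, 0, 1, 1, 0, 0), times = n),
  cancer  = rep(c(1, 0, 1, 0, 1, 0, 1, 0), times = n)
)

# Adding sex to the formula adjusts the smoking coefficient for sex
fit_adj <- glm(cancer ~ smoking + sex, data = toy, family = binomial)

# The smoking term is the sex-adjusted odds ratio
exp(coef(fit_adj))
```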
Sex-specific Estimates
Analysis for Men
Analysis for Women
This demonstrates that estimating odds ratios from a suitable two-by-two table gives you the same answer as using logistic regression, but logistic regression is much easier to use, particularly when adjusting for confounders. However, you do need to have your data in the right format - here, a row for each person with their smoking status, lung cancer status, and sex.
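The equivalence can be checked directly on toy data: within one stratum, the cross-product odds ratio from the two-by-two table matches the exponentiated coefficient from logistic regression. The counts and the column names smoking, cancer, and sex are hypothetical, not the lab dataset.

```r
# Hypothetical toy data: 100 men followed by 100 women (not the lab dataset)
n <- c(30, 30, 5, 35, 10, 30, 5, 55)
toy <- data.frame(
  sex     = rep(c("M", "F"), times = c(100, 100)),
  smoking = rep(c(1, 1, 0, 0, 1, 1, 0, 0), times = n),
  cancer  = rep(c(1, 0, 1, 0, 1, 0, 1, 0), times = n)
)

# Restrict to men and build the two-by-two table
men <- subset(toy, sex == "M")
tab <- table(smoking = men$smoking, cancer = men$cancer)

# Cross-product odds ratio from the table
or_table <- (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])

# Odds ratio from logistic regression on the same rows
fit_men <- glm(cancer ~ smoking, data = men, family = binomial)
or_glm <- exp(coef(fit_men))[["smoking"]]

# The two estimates agree up to numerical tolerance
c(table = or_table, glm = or_glm)
```

The same comparison for women only requires changing the subset condition to sex == "F".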
Job Callback Analysis
Initial Data Exploration
Gender Analysis
Callback by Gender
Logistic Regression for Gender
Education Analysis
Callback by Years of College
Logistic Regression for Education
Adjusted Education Analysis
Race Analysis
Callback by Race
Basic Race Model
Adjusted Race Model
Experience Analysis
Overall Experience Effect
Experience Effect by Demographics
For Black Candidates
For White Candidates
For Male Candidates
For Female Candidates

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.