solution
Logarithmic transformations: the data set pollution.csv (variable definitions are in the file pollution.txt) contains mortality rates and various environmental factors from 60 U.S. metropolitan areas [McDonald and Schwing, 1973]. For this exercise we shall model mortality rate given nitric oxides, sulfur dioxide, and hydrocarbons as inputs. This model is an extreme oversimplification, as it combines all sources of mortality and does not adjust for crucial factors such as age and smoking. We use it to illustrate log transformations in regression.
(a) Create a scatter plot of mortality rate versus level of nitric oxides. Do you think a linear model will fit these data well? Fit the regression and evaluate a residual plot from the regression.
(b) Find an appropriate transformation that will result in data more appropriate for linear regression. Fit a regression to the transformed data and evaluate the new residual plot.
(c) Interpret the slope coefficient from the model you chose in the previous step.
(d) Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs. Use transformations where they are helpful. Plot the fitted regression model and interpret the coefficients.
(e) Cross-validate: split the data into two halves and refit the model you chose in the last step to the first half. Use the resulting model to predict the mortality rate using data from the second half. Discuss the result. (A "real" cross-validation often splits the data into more subsets, e.g., 20, fits the model leaving one subset out, and makes predictions for the set-aside subset.)
(f) Interaction: use conditional plots to investigate potential interaction effects among the three predictors. If you have reason to believe that interaction effects are important, refit the model with these interactions and interpret the fitted model coefficients.
These steps are common in a statistical analysis of observational data. The first four steps, (a) through (d), are considered exploratory; step (e) checks the model's predictive capability. Step (f) is skipped in many studies, yet in many cases the interaction is more interesting and more informative. Logarithmic transformation is frequently used, but its interpretation is rarely explained clearly in the literature. When explaining the models, you should interpret each model coefficient in plain English.
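Because the interpretation of log-transformed coefficients is the point most often left unexplained, a small sketch may help. The Python example below uses synthetic data, not the values in pollution.csv; the variable names (nox, log_mort), the simulated slope of 0.02, and the helper fit_ols are all hypothetical choices made for illustration. It fits a regression of log mortality on log nitric oxides and converts the slope into plain-English statements.

```python
import numpy as np

# Synthetic data for illustration only; these are NOT the pollution.csv values.
rng = np.random.default_rng(0)
n = 60
nox = np.exp(rng.normal(2.5, 1.0, size=n))   # positive, right-skewed predictor
true_slope = 0.02                            # hypothetical slope on the log-log scale
log_mort = 6.8 + true_slope * np.log(nox) + rng.normal(0, 0.05, size=n)


def fit_ols(X, y):
    """Least-squares coefficients for y ~ X; X must already contain an intercept column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Regress log(mortality) on log(nox).
X = np.column_stack([np.ones(n), np.log(nox)])
intercept, slope = fit_ols(X, log_mort)

# With both variables logged, multiplying x by a factor k multiplies
# the predicted y by k**slope.
pct_per_1pct = (1.01 ** slope - 1) * 100      # % change in y per 1% increase in x
pct_per_doubling = (2.0 ** slope - 1) * 100   # % change in y when x doubles

print(f"estimated slope: {slope:.4f}")
print(f"a 1% increase in nox goes with a {pct_per_1pct:.4f}% change in mortality")
print(f"a doubling of nox goes with a {pct_per_doubling:.2f}% change in mortality")
```

Under these assumptions, the slope b reads as follows: a 1% difference in nitric oxides corresponds to approximately a b% difference in mortality rate, and a doubling of nitric oxides multiplies the predicted mortality rate by 2^b. If only the predictor is logged, a 1% difference in the predictor corresponds instead to a difference of roughly b/100 in the units of the outcome.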
Write a short report on your findings.
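For the split-half check in part (e), one possible sketch is shown below. It again uses synthetic data rather than pollution.csv, and it assumes, purely for illustration, that the model chosen in part (d) regresses log mortality on the logs of all three pollutants; the variable names and simulated coefficients are hypothetical. The model is fit to a random first half and used to predict the second half, and the hold-out error is compared with the in-sample error and with a baseline that predicts the first-half mean.

```python
import numpy as np

# Synthetic data for illustration only; these are NOT the pollution.csv values.
rng = np.random.default_rng(1)
n = 60
log_nox = rng.normal(2.5, 1.0, size=n)
log_so2 = rng.normal(3.5, 1.0, size=n)
log_hc = 0.8 * log_nox + rng.normal(0, 0.5, size=n)   # correlated with log_nox
log_mort = (6.8 + 0.03 * log_nox + 0.01 * log_so2
            - 0.02 * log_hc + rng.normal(0, 0.05, size=n))

# Design matrix with an intercept column (assumed model from part (d)).
X = np.column_stack([np.ones(n), log_nox, log_so2, log_hc])

# Random split into two halves, so the halves do not depend on file order.
perm = rng.permutation(n)
first_half, second_half = perm[: n // 2], perm[n // 2:]

# Fit on the first half only.
coef, *_ = np.linalg.lstsq(X[first_half], log_mort[first_half], rcond=None)

# Predict the second half and compare errors.
pred = X[second_half] @ coef
rmse_test = np.sqrt(np.mean((log_mort[second_half] - pred) ** 2))
rmse_train = np.sqrt(np.mean((log_mort[first_half] - X[first_half] @ coef) ** 2))
rmse_baseline = np.sqrt(
    np.mean((log_mort[second_half] - log_mort[first_half].mean()) ** 2)
)

print(f"in-sample RMSE (log scale): {rmse_train:.4f}")
print(f"hold-out RMSE (log scale):  {rmse_test:.4f}")
print(f"baseline RMSE (log scale):  {rmse_baseline:.4f}")
```

A hold-out error much larger than the in-sample error suggests overfitting, and a hold-out error no better than the baseline suggests the predictors add little predictive value. With only 30 observations per half, both comparisons are noisy, which is one reason the multi-fold scheme mentioned in part (e) is usually preferred.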