solution

Many of the ideas of regression first appeared in the work of Sir Francis Galton on the inheritance of characteristics from one generation to the next. In a paper on “Typical Laws of Heredity,” delivered to the Royal Institution on February 9, 1877, Galton discussed some experiments on sweet peas. By comparing the sweet peas produced by parent plants to those produced by offspring plants, he could observe inheritance from one generation to the next. Galton categorized parent plants according to the typical diameter of the peas they produced. For seven size classes from 0.15 to 0.21 inches, he arranged for each of nine of his friends to grow 10 plants from seed in each size class; however, two of the crops were total failures. A summary of Galton’s data was published by Karl Pearson (see table 5.3 and the data file galtonpeas.txt). Only average diameters and standard deviation of the offspring peas are given by Pearson; sample sizes are unknown.

(a) Draw the scatter plot of Progeny versus Parent.

(b) Assuming that the standard deviations given are population values, compute the regression of Progeny on Parent and draw the fitted mean function on the scatter plot.

(c) Galton wanted to know if characteristics of the parent plant such as size were passed on to the offspring plants. In fitting the regression, a parameter value of β1 = 1 would correspond to perfect inheritance, while β1 < 1 would suggest that the offspring are “reverting” towards “what may be roughly and perhaps fairly described as the average ancestral type” (the substitution of “regression” for “reversion” was probably due to Galton in 1885). Test the hypothesis that β1 = 1 versus the alternative that β1 < 1.

(d) In his experiments, Galton took the average size of all peas produced by a plant to determine the size class of the parent plant. Yet for seeds to represent that plant and produce offspring, Galton chose seeds that were as close to the overall average size as possible. Thus, for a small plant, exceptionally large seed was chosen as a representative, while larger, more robust plants were represented by relatively smaller seeds. What effects would you expect these experimental biases to have on (1) estimation of the intercept and slope and (2) estimates of error?

Table 5.3: Galton’s peas data (average diameter and standard deviation of the progeny peas for each parent size class; table not reproduced here, see galtonpeas.txt).
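For parts (b) and (c), one possible approach is weighted least squares with weights equal to the inverse squared standard deviations, followed by a statistic for β1 = 1. The sketch below is only an illustration under that assumption: the function name `wls_line` is hypothetical, the residual scale is estimated from the fit, and the numbers are placeholders rather than Galton's measurements, which should be read from galtonpeas.txt.

```python
import math

def wls_line(x, y, sd):
    """Fit y = b0 + b1*x by weighted least squares with weights 1/sd**2.

    Returns (b0, b1, se_b1, df), where se_b1 uses the estimated residual scale.
    """
    w = [1.0 / s ** 2 for s in sd]
    sw = sum(w)
    # Weighted means of predictor and response
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    # Weighted residual sum of squares and its degrees of freedom
    rss = sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    df = len(x) - 2
    se_b1 = math.sqrt(rss / df / sxx)
    return b0, b1, se_b1, df

# PLACEHOLDER values for illustration only; these are NOT Galton's measurements.
parent = [0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21]
progeny = [0.160, 0.161, 0.163, 0.164, 0.166, 0.169, 0.171]
sd = [0.018, 0.016, 0.017, 0.020, 0.019, 0.019, 0.020]

b0, b1, se_b1, df = wls_line(parent, progeny, sd)
# Statistic for H0: beta1 = 1 against beta1 < 1 (large negative values favor the alternative)
t_stat = (b1 - 1.0) / se_b1
print("intercept:", b0, "slope:", b1, "se(slope):", se_b1)
print("t statistic for beta1 = 1:", t_stat, "on", df, "degrees of freedom")
```

If the standard deviations are taken as exactly known, an alternative is to fix the scale at 1, so that the variance of the slope is 1/sxx, and to refer the statistic to a standard normal distribution instead of a t distribution.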


solution

Logarithmic transformations: the data set pollution.csv (variable definitions are in the file pollution.txt) contains mortality rates and various environmental factors from 60 U.S. metropolitan areas [McDonald and Schwing, 1973]. For this exercise we shall model mortality rate given nitric oxides, sulfur dioxide, and hydrocarbons as inputs. This model is an extreme oversimplification, as it combines all sources of mortality and does not adjust for crucial factors such as age and smoking. We use it to illustrate log transformations in regression.

(a) Create a scatter plot of mortality rate versus level of nitric oxides. Do you think a linear model will fit these data well? Fit the regression and evaluate a residual plot from the regression.

(b) Find an appropriate transformation that will result in data more appropriate for linear regression. Fit a regression to the transformed data and evaluate the new residual plot.

(c) Interpret the slope coefficient from the model you chose in the previous step.

(d) Now fit a model predicting mortality rate using levels of nitric oxides, sulfur dioxide, and hydrocarbons as inputs. Use transformations where appropriate. Plot the fitted regression model and interpret the coefficients.

(e) Cross-validate: split the data into two halves and refit the model you chose in the last step to the first half. Use the resulting model to predict the mortality rate using data from the second half. Discuss the result. (A “real” cross-validation often splits the data into more subsets, e.g., 20, fits the model with one subset left out, and makes predictions for the set-aside subset.)

(f) Interaction: use conditional plots to investigate potential interaction effects among the three predictors. If you have reason to believe that interaction effects are important, refit the model with these interactions and interpret the fitted model coefficients.
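The split-half check in step (e) can be sketched as follows. This is a minimal illustration, not a prescribed solution: it uses a single log-transformed predictor, the helper names `fit_line` and `split_half_rmse` are hypothetical, and the data are synthetic stand-ins generated in the script, not the contents of pollution.csv.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def split_half_rmse(x, y, seed=0):
    """Fit on a random half of the data and return the RMSE on the other half."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)  # random split, reproducible via the seed
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
    sq_err = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test]
    return math.sqrt(sum(sq_err) / len(sq_err))

# SYNTHETIC stand-in data (60 rows); replace with the real columns from pollution.csv.
rng = random.Random(1)
nox = [math.exp(rng.uniform(0.0, 5.0)) for _ in range(60)]
log_nox = [math.log(v) for v in nox]
log_mort = [6.8 + 0.02 * lx + rng.gauss(0.0, 0.05) for lx in log_nox]

print("held-out RMSE on the log scale:", split_half_rmse(log_nox, log_mort))
```

Comparing the held-out error with the residual standard deviation from the training half gives a rough sense of whether the model predicts new data about as well as it fits the data it was trained on.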

These steps are common in a statistical analysis of observational data. The first four steps are considered exploratory; step 5 verifies a model’s predictive capability. Step 6 is ignored in many studies, even though interaction is often the more interesting and informative part of the analysis. Logarithmic transformation is frequently used, but its interpretation is rarely explained clearly in the literature. When explaining the models, you should interpret each model coefficient in plain English.
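As a reminder of how a slope reads under each common log transformation, the following sketch converts a coefficient into a plain-English quantity. The function names and the example slope are hypothetical and are not taken from any fitted model.

```python
import math

def loglog_pct_change(slope, pct_increase_x):
    """log(y) = a + slope*log(x): percent change in y for a percent increase in x."""
    return ((1.0 + pct_increase_x / 100.0) ** slope - 1.0) * 100.0

def loglinear_pct_change(slope, delta_x=1.0):
    """log(y) = a + slope*x: percent change in y when x increases by delta_x units."""
    return (math.exp(slope * delta_x) - 1.0) * 100.0

def linearlog_change(slope, pct_increase_x):
    """y = a + slope*log(x): additive change in y for a percent increase in x."""
    return slope * math.log(1.0 + pct_increase_x / 100.0)

# Hypothetical slope, for illustration only
example_slope = 0.02
print("log-log, x up 10%: y changes by", loglog_pct_change(example_slope, 10.0), "percent")
print("log-linear, x up 1 unit: y changes by", loglinear_pct_change(example_slope), "percent")
print("linear-log, x up 10%: y changes by", linearlog_change(example_slope, 10.0), "units")
```

For small slopes and small percentage changes, the log-log result is close to the slope times the percent increase in x, which is the usual shorthand reading of the coefficient as an elasticity.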

Write a short report on your findings.


solution

Regression analysis is often used as a tool for causal inference. A typical application of regression analysis for causal inference will fit a model using the outcome as the response variable and the potential cause(s) as the predictor(s). Because of the inevitable confounding factors in a typical social science study, the regression model will usually include other predictors to account for the variability associated with different conditions. Including confounding factors in a regression model is often called controlling in social science. It is this controlling that often leads to the misuse of regression analysis. For example, Kanazawa and Vandermassen [2005] suggested that a parent’s occupation can predict the likelihood of having boys or girls. In particular, if the parent’s occupation is “systematizing” (e.g., engineering), he/she tends to have more boys, and if the parent’s occupation is “empathizing” (e.g., nursing), he/she tends to have more girls. The conclusion was reached by applying a regression analysis to the University of Chicago’s General Social Survey data. When studying a parent’s likelihood of having boys, the article used a regression model of the form:

number of boys = β0 + β1 × engineer + β2 × (number of girls) + (other predictors) + ε

That is, the number of boys is predicted by the parent’s occupation after controlling for the number of girls (opposite-sex children), plus other predictors such as income (Table 1 of Kanazawa and Vandermassen [2005]). The model was taken to support the theory because the slope for engineer is positive and statistically significantly different from 0. In a letter to the editor, Gelman [2007] pointed out that this result may be a statistical artifact, and proposed a simulation.

The simulation creates two groups of families (nurses and engineers), each having one or two children. Collectively, the child sex ratios of the two groups of families are both one boy to one girl. The difference between a nurse family and an engineer family is how they decide the number of children: nurses will stop at one child if the first born is a boy, and have two children otherwise; engineers will stop at one child with probability 30% and continue on to a second child with probability 70%, regardless of the sex of the first child. In this simulated data, the probability of a boy is exactly 50% for all births; thus the true effect, the difference in sex ratios between engineer and nurse families, is actually zero. Under this simulated model, nurses will have the following distribution of family types: 50% boy, 25% girl-boy, 25% girl-girl. Engineers will have the distribution: 15% boy, 15% girl, 17.5% boy-boy, 17.5% boy-girl, 17.5% girl-boy, 17.5% girl-girl. Use the following scripts to generate 800 families of engineers and 800 families of nurses and fit the regression model:

[Simulation script not included in this copy.]
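The simulation described above can be sketched in Python. This is not the script the exercise refers to; it is a stand-in that follows the stated stopping rules, with hypothetical helper names (`simulate_family`, `ols`, `run_simulation`) and a model that keeps only the engineer indicator and the number of girls as predictors.

```python
import random

def simulate_family(is_engineer, rng):
    """Return (boys, girls) for one family under the stopping rules in the text."""
    first_is_boy = rng.random() < 0.5
    if is_engineer:
        has_second = rng.random() < 0.7   # continues regardless of the first child's sex
    else:
        has_second = not first_is_boy     # a nurse stops after a first-born boy
    children = [first_is_boy]
    if has_second:
        children.append(rng.random() < 0.5)
    boys = sum(children)
    return boys, len(children) - boys

def ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(p):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][p] / m[i][i] for i in range(p)]

def run_simulation(n_per_group=800, seed=0):
    """Simulate both groups and regress number of boys on engineer and number of girls."""
    rng = random.Random(seed)
    rows, y = [], []
    totals = {0: [0, 0], 1: [0, 0]}       # group -> [boys, children]
    for engineer in (1, 0):
        for _ in range(n_per_group):
            boys, girls = simulate_family(bool(engineer), rng)
            rows.append([1.0, float(engineer), float(girls)])
            y.append(float(boys))
            totals[engineer][0] += boys
            totals[engineer][1] += boys + girls
    coef = ols(rows, y)                   # [intercept, engineer, girls]
    share_boys = {g: t[0] / t[1] for g, t in totals.items()}
    return coef, share_boys

coef, share_boys = run_simulation()
print("intercept, engineer, girls:", [round(c, 3) for c in coef])
print("share of boys among children, nurses:", round(share_boys[0], 3))
print("share of boys among children, engineers:", round(share_boys[1], 3))
```

The fit uses plain normal equations so that the sketch has no dependencies beyond the standard library; a statistics package would additionally report standard errors for the coefficients.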

Is the model result in conflict with the data? Any thoughts on why this would happen (hint: think about the meaning of the slope of engineer)?


solution

You are the owner of the Basket of Rye Corner Market, a small local convenience store that specializes in fresh bread but that also sells various sundries, including cigarettes and alcohol. You hire Tanisha, a clerk, to cashier and Ryan, a manager, to run the store when you are not there. You are almost never in the store. The law in your State states: “Every person attempting to purchase beer or alcohol must be of the legal age of majority, 21, and must show government-issued Identification, unless they are personally known to the seller. It is unlawful to sell beer or alcohol to a minor.” The fine for violating the law is $5,000 per occurrence. To adhere to the law, you instruct Ryan to post signs in the store alerting customers to the rule. One evening, some teenagers visit the store. One of the teenagers, Sarah, is the younger sister of your clerk, Tanisha. The teens send Sarah to the counter to buy the beer. She is able to convince Tanisha to make the sale. None of the teens show identification. An undercover police officer sees the entire transaction and sees the teenagers leave the store with the beer. The officer confiscates the beer and questions Tanisha. You are contacted. Discuss whether the law was violated and who, if anyone, may be held responsible for the violation. What is your BEST argument that your business should not be fined?
