ECONOMETRICS: HOMEWORK # 1
a. Use the ASCII data file DOCTOR1 to create a SAS dataset.
b. Construct the histograms of the frequency distributions for HOURS and DOCWAGE. Given the shapes of the histograms, do you believe it is reasonable to assume that each of these variables has a normal distribution? Yes/no. Explain.
No, it is not reasonable to assume that both of these variables have a normal distribution. The histogram for hours has a bell shape and is roughly symmetric about the mean, so a normal distribution is a reasonable approximation for hours. The histogram for docwage, on the other hand, is skewed to the right, so docwage does not appear to be normal. Its shape looks more like a right-skewed distribution such as a chi-square or an F-distribution.
c. Calculate the sample mean, sample variance, and sample standard deviation for HOURS and DOCWAGE. What information do these three statistics give you about the distributions of HOURS and DOCWAGE? Explain.
The sample mean is a measure of the central tendency of a sample of data. It is calculated as \( \bar{x} = \sum_{i=1}^{n} x_i / n \). The sample mean for hours is approximately 60 hours per week, while the sample mean for docwage is around $26 per hour. This tells us that a typical primary-care physician in this sample works 60 hours per week and earns $26 per hour. The sample variance is an average measure of the dispersion of a sample of data and is calculated as \( s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) \). The sample standard deviation is another measure of dispersion, calculated as the square root of the variance, so it is expressed in the same units as the data. For this data, the sample variance for hours is approximately 269 (hours per week) squared, while the sample variance for docwage is around 155 (dollars per hour) squared. Thus, the sample standard deviation is around 16 hours per week for hours and about $12 per hour for docwage. The sample variance and sample standard deviation tell us the spread of the distribution. The greater the variance, the greater the spread of the distribution; the smaller the variance, the more tightly packed the values are around the mean.
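The three formulas above can be sketched in a few lines of Python. This is only an illustration: the function name sample_stats and the five data values are made up and are not taken from the DOCTOR1 file or from the SAS run.

```python
import math

def sample_stats(xs):
    """Return (mean, variance, standard deviation) using the n - 1 divisor."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, variance, math.sqrt(variance)

# Made-up weekly hours for five physicians (not the DOCTOR1 data).
hours = [40, 55, 60, 65, 80]

mean, variance, sd = sample_stats(hours)
print("mean:", mean)          # hours per week
print("variance:", variance)  # (hours per week) squared
print("std dev:", sd)         # hours per week
```

Note that the variance comes out in squared units, while the standard deviation is back in the units of the data.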
d. Choose a descriptive statistic that can be used to compare the dispersion of HOURS and DOCWAGE. Explain why you selected this particular statistic. Calculate this statistic. According to this statistic, which variable has a relatively larger amount of dispersion?
I would choose the sample coefficient of variation (CV) to compare the dispersion of hours and docwage. I would select this statistic because it is a unit-free measure of dispersion, thereby allowing a comparison of the dispersion levels for hours and docwage. CV is calculated by dividing the sample standard deviation by the sample mean. The CV for hours is .27 and the CV for docwage is .48. Therefore, docwage has a relatively larger amount of dispersion.
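As a quick check on the arithmetic, the two coefficients of variation can be recomputed from the rounded means and variances reported in part c. The function name coefficient_of_variation is just an illustrative choice, and the inputs are the rounded figures from the text, not the raw SAS output.

```python
import math

def coefficient_of_variation(mean, variance):
    """Sample standard deviation divided by the sample mean (unit-free)."""
    return math.sqrt(variance) / mean

# Rounded summary statistics reported in part c.
cv_hours = coefficient_of_variation(mean=60, variance=269)
cv_docwage = coefficient_of_variation(mean=26, variance=155)

print(round(cv_hours, 2), round(cv_docwage, 2))
```

Because the units cancel in the ratio, the two results can be compared directly even though hours and docwage are measured in different units.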
e. Construct a scatter diagram for HOURS and DOCWAGE, with the former variable measured on the vertical axis and the latter variable measured on the horizontal axis. What does this scatter diagram suggest about the relationship between the two variables?
The scatter diagram suggests that there is a negative linear relationship between hours and docwage, because the points slope downward, with most of them located in the 2nd and 4th quadrants (the upper-left and lower-right regions formed by lines drawn at the two means).
f. On the scatter diagram for HOURS and DOCWAGE, draw a vertical line at the mean of DOCWAGE and a horizontal line at the mean of HOURS. This breaks the scatter diagram up into four quadrants. Use this scatter diagram to explain the logic that underlies the measure of covariance between HOURS and DOCWAGE. Make sure to explain how sample covariance is calculated and what it measures.
Sample covariance is an “average” measure of the linear association between two variables. If docwage = x and hours = y, the sample covariance is \( S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1) \). This is the sum of the products of the mean deviations of docwage and hours, averaged over n - 1. A point in quadrant I or III lies above both means or below both means, so its product of mean deviations is positive. A point in quadrant II or IV lies above one mean and below the other, so its product is negative. Since most of the data lie in quadrants II and IV, the negative products dominate the sum and the covariance is negative, which indicates a negative linear relationship between docwage and hours. Thus, as docwage increases (decreases), the number of hours worked per week tends to decrease (increase). This agrees with the SAS output, which reports a covariance of approximately -97 between docwage and hours.
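The quadrant logic can be illustrated with a short Python sketch. The four data points and the helper names sample_covariance and product_signs are made up for illustration and are not the DOCTOR1 data. Each point contributes the product of its two mean deviations, and the sign of that product depends on the quadrant the point falls in.

```python
def sample_covariance(xs, ys):
    """Sum of products of mean deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def product_signs(xs, ys):
    """Count points with positive products (quadrants I and III)
    and negative products (quadrants II and IV)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    positive = sum(1 for p in products if p > 0)
    negative = sum(1 for p in products if p < 0)
    return positive, negative

# Made-up values: wage on the horizontal axis, hours on the vertical axis.
docwage = [10, 20, 30, 40]
hours = [80, 70, 50, 40]

print(product_signs(docwage, hours))      # all four points have negative products
print(sample_covariance(docwage, hours))  # negative covariance
```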
g. Choose a descriptive statistic that measures the strength and direction of the relationship between HOURS and DOCWAGE. Explain how this statistic differs from covariance. Calculate and interpret this statistic. What does it suggest about the relationship between HOURS and DOCWAGE? Explain.
I would use the sample correlation coefficient to measure the strength and direction of the relationship between hours and docwage. This statistic differs from covariance in that it is a “unit-free” measure of the degree of linear association between two variables, and it always lies between -1 and 1. It is an index of strength. The sample correlation coefficient \( r_{xy} \) is calculated by dividing the sample covariance by the product of the sample standard deviations for docwage and hours. For this example, the sample correlation coefficient is -0.47239. Since this value is negative, there is a negative linear relationship between the variables. The strength of the relationship depends on how close the absolute value of the sample correlation coefficient is to 1. The closer its absolute value is to 1, the stronger the relationship. Since the absolute value is approximately 1/2, the relationship is neither extremely strong nor extremely weak; it indicates a moderate negative linear relationship between hours and docwage.
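The calculation can be reproduced roughly from the rounded figures reported earlier (covariance of about -97, variances of about 155 and 269). The function name correlation is only illustrative. Because the inputs are rounded, the result is close to, but not exactly, the -0.47239 that SAS reports.

```python
import math

def correlation(cov_xy, var_x, var_y):
    """Sample covariance divided by the product of the sample standard deviations."""
    return cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

# Rounded values from parts c and f: x = docwage, y = hours.
r = correlation(cov_xy=-97, var_x=155, var_y=269)
print(round(r, 3))
```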
a. Use the ASCII data file CRIME to create a SAS dataset.
b. Divide the sample of 44 states into two sub samples, one consisting of states that have capital punishment and one for states that do not have capital punishment. Note: Whether a state has capital punishment is identified by the variable D1. This is called an indicator (or dummy) variable, and takes the value of 1 if a state has capital punishment, and zero if a state does not have capital punishment. To create the dataset for states with capital punishment include the SAS statement: IF D1 = 0 THEN DELETE. To create the dataset for states without capital punishment include the SAS statement: IF D1 = 1 THEN DELETE.
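For readers who want to see the same split outside SAS, here is a minimal Python sketch of the idea. The four rows, the column name murder_rate, and the helper names are hypothetical; the real CRIME file has 44 states and more variables. Only the dummy variable D1 comes from the assignment.

```python
# Hypothetical rows for illustration only (not the CRIME data).
states = [
    {"state": "A", "D1": 1, "murder_rate": 7.0},
    {"state": "B", "D1": 1, "murder_rate": 5.0},
    {"state": "C", "D1": 0, "murder_rate": 2.5},
    {"state": "D", "D1": 0, "murder_rate": 1.5},
]

def split_by_dummy(rows, dummy="D1"):
    """Return (rows with dummy == 1, rows with dummy == 0)."""
    with_cp = [r for r in rows if r[dummy] == 1]     # keeps what IF D1 = 0 THEN DELETE keeps
    without_cp = [r for r in rows if r[dummy] == 0]  # keeps what IF D1 = 1 THEN DELETE keeps
    return with_cp, without_cp

def column_mean(rows, column):
    """Sample mean of one variable within a sub sample."""
    return sum(r[column] for r in rows) / len(rows)

with_cp, without_cp = split_by_dummy(states)
print(column_mean(with_cp, "murder_rate"), column_mean(without_cp, "murder_rate"))
```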
c. Calculate the sample mean for each variable (except for S) for each sub sample. Use the sample means to describe a typical state with capital punishment and a typical state without capital punishment. How do the two types of states differ?
The sample means from SAS tell us that a typical state with capital punishment in this sample had approximately 6 murders per 100,000 population in 1950, .25 convictions per murder in 1950, .076 executions during 1946-1950, a median family income of $1,755 in 1949, a 53% labor force participation rate in 1950, a nonwhite population proportion of .128 in 1950, and a median time served of 136 months for convicted murderers released in 1951. A proportion of .43 of the states with capital punishment were southern states. On the other hand, a typical state without capital punishment in this sample had approximately 2 murders per 100,000 population in 1950, .30 convictions per murder in 1950, 0 executions during 1946-1950, a median family income of $1,880 in 1949, a 54% labor force participation rate in 1950, a nonwhite population proportion of .018 in 1950, and a median time served of 139 months for convicted murderers released in 1951. None of the states without capital punishment were southern states. Thus, all southern states in this sample have capital punishment. In addition, the states with capital punishment tend to have more murders, a lower conviction rate, a lower median family income, a slightly lower labor force participation rate, and a larger nonwhite proportion of the population, and convicted murderers typically serve less time than in the states without capital punishment.
d. Recall that you are particularly interested in whether capital punishment lowers the murder rate. Compare the sample mean for the murder rate of each sub sample. What does this comparison suggest about the impact of capital punishment on the murder rate? Do you believe that you can draw valid conclusions about the impact of capital punishment on the murder rate by comparing the sample mean murder rate for each sub sample? Why or why not? Explain.
The sample mean for the murder rate in the states with capital punishment is approximately 6 murders per 100,000 population in 1950. The sample mean for the murder rate in the states without capital punishment is approximately 2 murders per 100,000 population in 1950. Taken at face value, this comparison suggests that capital punishment raises the murder rate. However, I don’t believe that you can draw valid conclusions about the impact of capital punishment on the murder rate just by comparing the sample mean murder rate for each sub sample. There are other variables besides capital punishment that affect a state’s murder rate. For instance, the proportion of nonwhites or the poverty level may have an impact on the murder rate. Thus, in order to draw valid conclusions by comparing the sample mean murder rates for each sub sample, you must hold all other things constant: the sub samples must be the same in all regards except for capital punishment. Only then would the comparison show the actual effect of capital punishment on the murder rate. Also, remember that a mean only tells us the average outcome. It doesn’t tell us about the causation process!
X = Quantity of Beef, \( X \sim N(\mu_x, \sigma_x^2) \), Random Sample
Estimator → \( \bar{X} = \sum_{i=1}^{n} x_i / n \), where n = 30
-- The sample mean is used as an estimator because it possesses the qualities of a reliable estimator, such as unbiasedness and efficiency.
a. Use the sample mean as an estimator for population mean beef consumption. Give a logical argument to deduce the type of sampling distribution the sample mean has. Derive the mean and variance of the sampling distribution of the sample mean. Show your work. (Hint: See notes taken in class).
\( \bar{X} = \sum_{i=1}^{n} x_i / n = \frac{1}{n} x_1 + \frac{1}{n} x_2 + \dots + \frac{1}{n} x_n \). Since \( \bar{X} \) is a linear function of independent normal random variables, the sample mean has a normal distribution.
Derivation of the mean and variance of the sampling distribution of the sample mean:
\( E(\bar{X}) = \frac{1}{n} E(x_1) + \frac{1}{n} E(x_2) + \dots + \frac{1}{n} E(x_n) \)
\( E(\bar{X}) = \frac{1}{n} \mu_x + \frac{1}{n} \mu_x + \dots + \frac{1}{n} \mu_x \)
\( E(\bar{X}) = \frac{1}{n} \cdot n \cdot \mu_x \)
\( E(\bar{X}) = \mu_x \)
→ This says that the expected value of the sample mean equals the true value of the mean of the population, so the sample mean is an unbiased estimator.
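\( \mathrm{Var}(\bar{X}) = (1/n)^2 \mathrm{Var}(x_1) + (1/n)^2 \mathrm{Var}(x_2) + \dots + (1/n)^2 \mathrm{Var}(x_n) + \text{(sum of covariance terms)} \)
*But, for independent random variables, covariances = 0. So…
\( \mathrm{Var}(\bar{X}) = (1/n)^2 \sigma_x^2 + (1/n)^2 \sigma_x^2 + \dots + (1/n)^2 \sigma_x^2 \)
\( \mathrm{Var}(\bar{X}) = (1/n)^2 \cdot n \cdot \sigma_x^2 \)
\( \mathrm{Var}(\bar{X}) = \sigma_x^2 / n \)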
Therefore, because the population is assumed to be normal, the sampling distribution of the sample mean is a normal distribution with mean \( \mu_x \) and variance \( \sigma_x^2 / n \).
\( \bar{X} \sim N(\mu_x, \sigma_x^2 / n) \)
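The result can also be checked numerically with a small simulation. This is only a sketch: the population values (mean 30, standard deviation 15), the number of replications, and the function name simulate_sample_means are assumptions chosen for illustration, not values from the assignment. Only n = 30 comes from the problem.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` random samples of size n from N(mu, sigma^2)
    and return the sample mean of each one."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        for _ in range(reps)
    ]

# Hypothetical population parameters, chosen only for illustration.
MU, SIGMA, N = 30.0, 15.0, 30

means = simulate_sample_means(MU, SIGMA, N, reps=5000)
print("average of the sample means:", statistics.mean(means))       # should be near MU
print("variance of the sample means:", statistics.variance(means))  # should be near SIGMA**2 / N
print("theoretical variance:", SIGMA ** 2 / N)
```

The average of the simulated sample means should land close to the population mean, and their variance close to the population variance divided by n, which is what the derivation predicts.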
b. What information does the sampling distribution of the estimator give you?
The sampling distribution of the estimator tells you the probability distribution of the estimator, that is, how the sample mean would vary from one random sample to another. Since the sampling distribution is a normal distribution with mean \( \mu_x \) and variance \( \sigma_x^2 / n \), the distribution takes the form of a bell shape that is symmetric about its mean. The sampling distribution also tells us that the expected value of the sample mean is the same as the mean of the true population. The variance of the sampling distribution tells us that as the size of the sample increases, the variance of the sample mean decreases, so the estimates become more precise.
c. Use your estimator and the sample data to obtain a point estimate for population mean beef consumption. Interpret this point estimate.
Using PROC MEANS in SAS, we obtain a point estimate for the population mean quantity of beef consumed of approximately 27 lbs/year. This means that our best single estimate is that a typical person in the U.S. consumes 27 lbs of beef per year.
d. Use your estimator and the sample data to obtain a 95% interval estimate for population mean beef consumption. Show your work. Interpret this interval estimate.
The rule of thumb for a 95% interval estimate is:
Point estimate ± 2 * (standard error of the estimate)
Since we don’t know the true standard error, we must estimate it!
\( \mathrm{Var}(\bar{X}) = \sigma_x^2 / n \)
s.e. = \( \sigma_x / \sqrt{n} \), which is estimated by \( S_x / \sqrt{n} = 15 / \sqrt{30} = 2.7386 \approx 3 \)
27 ± 2 (3)
27 ± 6 → (21, 33)
This 95% interval estimate of (21, 33) means that we can state with 95% confidence that the population mean beef consumption, that is, the consumption of a typical consumer in the U.S., is between 21 and 33 lbs per year.
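The arithmetic can be reproduced with a few lines of Python, using the sample mean of 27, the sample standard deviation of 15, and n = 30 from the text. The function name interval_estimate is illustrative. The sketch keeps the standard error unrounded, so its interval is slightly narrower than the (21, 33) obtained above after rounding the standard error up to 3.

```python
import math

def interval_estimate(x_bar, s, n, multiplier=2.0):
    """Rule-of-thumb interval: point estimate +/- multiplier * estimated standard error."""
    se = s / math.sqrt(n)
    return x_bar - multiplier * se, x_bar + multiplier * se, se

lower, upper, se = interval_estimate(x_bar=27, s=15, n=30)
print("estimated standard error:", round(se, 4))
print("interval:", (round(lower, 1), round(upper, 1)))
```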
e. Use the sample data to test the null hypothesis that a typical consumer in the population consumes 40 pounds of beef per year. Show your work. Interpret your result.
Since the estimator for \( \mu_x \) is \( \bar{X} \), we can form the z-statistic \( Z = (\bar{X} - \mu_x) / (\sigma_x / \sqrt{n}) \). However, because the population standard deviation \( \sigma_x \) is unknown, we must replace it with the sample standard deviation S. Thus, the z-statistic turns into a t-statistic, which has a t-distribution with n-1 degrees of freedom.
\( T = (\bar{X} - 40) / (S / \sqrt{n}) = (27 - 40) / (15 / \sqrt{30}) \approx -4.75 \)
H0: \( \mu_x = 40 \)
H1: \( \mu_x \neq 40 \) (two-tailed test)
\( t_{n-1} = t_{30-1} = t_{29} \)
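We reject the null hypothesis at the 5% level of significance, because the absolute value of the test statistic, 4.75, is greater than the two-tailed critical value of about 2.045 for a t-distribution with 29 degrees of freedom. If a typical consumer in the U.S. really consumed 40 lbs of beef per year, a sample mean as far from 40 as 27 would be very unlikely, so we conclude that the population mean is not 40 lbs per year. However, a test at the 5% level rejects a true null hypothesis 5% of the time, so it remains possible that a typical consumer in the U.S. actually does consume 40 lbs of beef per year.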
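The test can be sketched in Python using the figures from the text (sample mean 27, sample standard deviation 15, n = 30, hypothesized mean 40). The critical value of 2.045 is an assumption taken from a standard t table for 29 degrees of freedom at the two-tailed 5% level, and the names t_statistic and T_CRITICAL are illustrative.

```python
import math

def t_statistic(x_bar, mu_0, s, n):
    """(sample mean - hypothesized mean) / estimated standard error."""
    return (x_bar - mu_0) / (s / math.sqrt(n))

# Assumed two-tailed 5% critical value for 29 degrees of freedom (from a t table).
T_CRITICAL = 2.045

t = t_statistic(x_bar=27, mu_0=40, s=15, n=30)
reject_null = abs(t) > T_CRITICAL

print("t statistic:", round(t, 2))
print("reject H0 at the 5% level:", reject_null)
```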