EC2509 – Econometrics B
This EC2509 problem set must be handed in via Turnitin by Wednesday 23rd February 2022, 23:59.
Please submit your answer as a single PDF document. The document should contain your results and your written answers to the questions, plus an appendix with your full R code, which could be used to reproduce your results.
The submitted document should clearly state which question you are attempting. For each question, include any R output that is required (ideally as tables or graphs) together with a written interpretation of your results. Econometrics is not just about producing R code: that is a necessary but not sufficient condition. You also need to be able to explain, in English, what you have done, why you have done it, and what we learn from it.
In this problem set you start doing your own piece of analysis in education economics. The research question is: Does attending summer camp improve academic performance?
There is already evidence from Matsudaira (2008) suggesting that attending summer camp improves school performance in maths and reading. You can find the paper on Moodle.
Question 1 – Planning the project (50 marks)
Before you even open R:
Make a research plan.
- In an optimal setting without constraints, how could you evaluate the causal impact of the summer camp on academic performance?
- What would be your hypothesis?
- Spell out the econometric method
- Write out the empirical specification and the key parameter of interest
- Which covariates? (check what is in the data and be specific about what you are going to use and why)
- Which covariates might you have wanted in addition to the ones in your dataset if you had collected the data yourself?
- The impact of the summer camp may not be the same for all kids. Think of two characteristics along which its estimated impact may differ, and explain your hypothesis.
- What may be the threats to internal validity in this economic problem? How can you find out whether they apply here?
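As a starting point for the specification bullet above, a generic textbook form is sketched below. The symbols are assumptions chosen for illustration, not notation prescribed by this problem set: Y_i is the test score of kid i after the camp, D_i is a dummy equal to 1 if kid i attended, X_i is a vector of covariates, and beta is the key parameter of interest.

```latex
Y_{i} = \alpha + \beta D_{i} + X_{i}'\gamma + \varepsilon_{i}
```

If D_i were randomly assigned with full compliance, the error term would be mean-independent of D_i and beta could be read as the average causal effect of attendance. Your own plan should state which assumptions you rely on and adapt the equation to the data you actually have.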
Question 2 – Look at the data and merge (5 marks)
You come across an RCT in which an optional, free summer camp is offered to kids around age 10 via a randomised letter to the parents. The camp includes games to improve math and reading and aims at teaching kids to be persistent to achieve their aims in spite of potential setbacks or obstacles. Signup for the summer camp has to be done by the parents on behalf of their children.
The researchers have abandoned the project and you are not sure how good a job they did.
You get hold of data that has one observation per kid and contains:
- an id for each person,
- their school’s id,
- a dummy variable indicating whether they went to summer camp,
- gender
- parental income
- parental education
- test scores in grade 5, i.e. before the treatment, and in earlier years (grades 2-4)
- test scores in grade 6, i.e. after the treatment, and in subsequent years (grades 7-10)
- a dummy variable indicating whether the individual received a letter inviting them to the summer camp
Unfortunately, the data is split into three datasets that have different formats (csv, excel). You can find them on Moodle:
school_data_1, school_data_2 and school_data_3
Install the package “readr”, which reads delimited text files such as csv into R. For Excel files you will also need the package “readxl”, and the package “haven” reads Stata files.
- Load and look at each dataset to figure out what information is contained where.
- Each dataset has the identifier variable person_id which allows you to combine the three datasets.
Hint: you can get a quick overview of the data structure by displaying the first 6 observations using the command
head(name_of_dataset)
- You now want to combine the three datasets into one. Hint: the common identifier is person_id and you want a final dataset with one observation per person
Check you’ve done it correctly by creating a table of summary statistics that includes your key outcome variables, key explanatory variables and covariates. Then write a short paragraph describing the dataset, the sample (size) and the variables used for analysis.
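The load-and-merge step can be sketched as follows. This is a minimal illustration under stated assumptions: the three toy data frames stand in for the Moodle files, and every column name except person_id is hypothetical, so check the real names with head() first.

```r
library(dplyr)

# Toy stand-ins for the three files (hypothetical columns, for illustration only).
# With the real files you would read each one first, e.g. with
# readr::read_csv() for csv files and readxl::read_excel() for Excel files.
d1 <- data.frame(person_id = c(1, 2, 3), school_id = c(10, 10, 11))
d2 <- data.frame(person_id = c(3, 1, 2), summercamp = c(1, 0, 1))
d3 <- data.frame(person_id = c(2, 3, 1), test_year_5 = c(2.1, 1.8, 2.5))

head(d1)  # shows the first rows (up to 6) of a dataset

# Join on the common identifier; full_join keeps people who are missing from one file
merged <- d1 %>%
  full_join(d2, by = "person_id") %>%
  full_join(d3, by = "person_id")

# One row per person: the identifier must be unique after merging
stopifnot(!any(duplicated(merged$person_id)))

nrow(merged)     # sample size
summary(merged)  # quick summary statistics for every column
```

Whether full_join, left_join or inner_join is appropriate depends on whether every person appears in all three files, which is worth checking and describing in your report.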
Question 3 – Knock the data into shape (5 marks)
- Instead of a variable for each test score in each year, we would now like to create a panel with one observation per grade and person_id. This means that rather than one observation per person (sample size n), we now want to have 9 observations for each person, one per grade, each with a test score.
You want to convert something like this:
person_id | oth_variable | testscoregrade1 | testscoregrade2 | testscoregrade3 |
1 | 5 | 90 | 51 | 35 |
2 | 4 | 75 | 45 | 41 |
3 | 7 | 69 | 32 | 57 |
Into something that looks like this:
person_id | oth_variable | testscore | grade |
1 | 5 | 90 | 1 |
1 | 5 | 51 | 2 |
1 | 5 | 35 | 3 |
2 | 4 | 75 | 1 |
2 | 4 | 45 | 2 |
2 | 4 | 41 | 3 |
3 | 7 | 69 | 1 |
3 | 7 | 32 | 2 |
3 | 7 | 57 | 3 |
Hint: you will need the tidyverse package, which includes the tidyr library. You can do this with the pivot_longer command.
Its syntax is
new_dataset <- old_dataset %>%
  pivot_longer(
    cols = starts_with("testscore"),             # the columns you want to pivot
    names_to = "grade",                          # name of the new running variable
    names_prefix = "testscore",                  # the text that the test score variable names start with
    names_transform = list(grade = as.integer),  # column names arrive as strings, so specify the format of the new running variable
    values_to = "testscore"                      # what you want the values column to be called afterwards
  )
Check how many columns you have after pivoting:
ncol(name_of_dataset)
and describe the result in your report.
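Applied to the small example table above, the reshaping can be sketched like this. Note that the example columns are called testscoregrade1 to testscoregrade3, so the prefix to strip here is "testscoregrade"; the column names in your own data may differ.

```r
library(tidyr)
library(dplyr)

# The wide example table from the text
wide <- data.frame(
  person_id       = c(1, 2, 3),
  oth_variable    = c(5, 4, 7),
  testscoregrade1 = c(90, 75, 69),
  testscoregrade2 = c(51, 45, 32),
  testscoregrade3 = c(35, 41, 57)
)

long <- wide %>%
  pivot_longer(
    cols = starts_with("testscoregrade"),        # all test score columns
    names_to = "grade",                          # new column holding the grade
    names_prefix = "testscoregrade",             # strip everything before the grade number
    names_transform = list(grade = as.integer),  # turn "1", "2", "3" into integers
    values_to = "testscore"                      # new column holding the scores
  )

nrow(long)  # 9: three people times three grades
ncol(long)  # 4: person_id, oth_variable, grade, testscore
```

If the prefix does not remove all the text before the number, the conversion to integer produces missing values, so it is worth inspecting the grade column after pivoting.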
- You have probably noticed that some variables have missing observations. Use the skim() function (from the package “skimr”) to check how many missing values each variable has. Copy the output into your report. How concerned should we be about these missing observations?
- Let’s assume these values are missing at random and remove these rows using filter() from the dplyr library.
The filter() function takes two arguments: first, the name of the original dataset, and second, the condition a row must satisfy to be kept. The condition we want is !is.na(variable name). The expression is.na(variable name) is true if the value of that variable is missing; the “!” negates it, so the condition is true when the observation is not missing.
- You can do this in one step (separating the conditions by a “,” or “&”) or in several separate steps, one for each variable with missing observations.
- Report the number of observations that you are left with.
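The filtering steps above can be sketched as follows. The data frame and the variable parental_income are made up for illustration, and colSums(is.na()) is used here as a base-R stand-in for the fuller report that skim() produces.

```r
library(dplyr)

# Toy panel with some missing values (hypothetical variable names)
panel <- data.frame(
  person_id       = c(1, 1, 2, 2, 3, 3),
  grade           = c(5L, 6L, 5L, 6L, 5L, 6L),
  testscore       = c(2.1, NA, 1.8, 2.0, 2.5, 2.4),
  parental_income = c(30, 30, NA, NA, 45, 45)
)

# Number of missing values per column
colSums(is.na(panel))

# Keep only rows where both variables are observed (conditions separated by ",")
complete <- filter(panel, !is.na(testscore), !is.na(parental_income))

nrow(complete)  # number of observations left
```

Comparing nrow() before and after filtering tells you how many rows were dropped, which is the figure to report.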
Question 4 – Check the measurements (5 marks)
Is your measure of academic performance comparable across schools and grades? Think about what you can do to make measurements comparable that are on different scales. In this example, think particularly about whether your variable is comparable across schools and grades (years).
Produce summary stats for the variables you want to use, and comment on them. Hint: you may need the following commands:
To group observations by a variable such as grade, use the dplyr function group_by (see its documentation for how it works).
To normalise a variable, use the dplyr function mutate.
The structure is mutate(dataframe, newvar = f(oldvar)), where f is a function, here the normalisation operation.
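One possible normalisation is a z-score within each grade, sketched below on made-up data. This is only one option under the assumption that grades are the relevant grouping; whether to group by grade, school or both is part of the question.

```r
library(dplyr)

# Toy data: two grades whose scores are on different scales (illustrative values)
scores <- data.frame(
  person_id = rep(1:3, times = 2),
  grade     = rep(c(5L, 6L), each = 3),
  testscore = c(2, 4, 6, 50, 60, 70)
)

# Standardise within grade: subtract the grade mean, divide by the grade standard deviation
scores_std <- scores %>%
  group_by(grade) %>%
  mutate(testscore_z = (testscore - mean(testscore)) / sd(testscore)) %>%
  ungroup()

scores_std
```

After this step each grade has mean 0 and standard deviation 1, so a given value of testscore_z means the same relative position in every grade.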
Question 5 – Descriptive analysis (20 marks)
- Can we find a correlation between summer camp attendance and test scores? Use a boxplot to check. See here for how to draw a boxplot by group: https://r-charts.com/distribution/box-plot-group/
Include the graph in your report.
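A grouped boxplot can be sketched with base R as follows. The data are simulated purely for illustration and the variable names summercamp and testscore are assumptions, so replace them with the ones in your merged dataset.

```r
set.seed(1)  # simulated data, for illustration only

sim <- data.frame(
  summercamp = rep(c(0, 1), each = 50),
  testscore  = c(rnorm(50, mean = 0), rnorm(50, mean = 0.3))
)

# One box per group: the formula reads "testscore by summercamp"
bp <- boxplot(testscore ~ summercamp, data = sim,
              names = c("No camp", "Camp"),
              xlab = "Summer camp attendance",
              ylab = "Test score (simulated)")

bp$n  # number of observations behind each box
```

A difference in medians between the boxes is descriptive only; it does not by itself say anything about a causal effect.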
- Do the treatment and control group differ in their test scores before grade 6, i.e. before any treatment takes place? Produce an adequate table, and also draw a histogram of test scores (after the summer camp) by your treatment variable.
- Check the balance of the treatment and control groups. What do you conclude?
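A simple balance check compares group means of pre-treatment characteristics and tests whether they differ. The sketch below uses simulated data and hypothetical variable names; which variable defines your treatment and control groups is for you to decide and justify.

```r
library(dplyr)

set.seed(2)  # simulated data, for illustration only

kids <- data.frame(
  treated         = rep(c(0, 1), each = 100),
  female          = rbinom(200, 1, 0.5),
  parental_income = rnorm(200, mean = 40, sd = 10),
  test_pre        = rnorm(200)
)

# Means of pre-treatment characteristics by group, plus group sizes
balance <- kids %>%
  group_by(treated) %>%
  summarise(across(c(female, parental_income, test_pre), mean), n = n())

balance

# Two-sample t-test for a difference in means of one covariate
tt <- t.test(parental_income ~ treated, data = kids)
tt$p.value
```

Repeating the test for each covariate gives a balance table; large or systematic differences would suggest that the groups are not comparable.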
Question 6 – First estimation step (15 marks)
Perform estimation using OLS to answer the research question (without thinking about internal validity etc. yet). You might try more than one model. Present your results in a table (using stargazer).
→ What do you find?
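Estimating and comparing two OLS models can be sketched as follows. The data are simulated for illustration, the variable names are assumptions, and the two specifications are examples rather than the models you are expected to choose.

```r
set.seed(3)  # simulated data, for illustration only

n <- 200
sim <- data.frame(
  summercamp = rbinom(n, 1, 0.5),
  test_pre   = rnorm(n)
)
sim$testscore <- 0.2 * sim$summercamp + 0.5 * sim$test_pre + rnorm(n)

# Model 1: raw difference in means; Model 2: adds a pre-treatment covariate
m1 <- lm(testscore ~ summercamp, data = sim)
m2 <- lm(testscore ~ summercamp + test_pre, data = sim)

summary(m2)$coefficients  # estimates, standard errors, t values, p values

# Side-by-side regression table, if the stargazer package is installed
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(m1, m2, type = "text")
}
```

The coefficient on summercamp is the one to interpret; comparing it across the two columns shows how sensitive the estimate is to adding covariates.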