Foundations of Statistics and Econometrics, MANM467, University of Surrey, Semester 1, 2022-23
Essential Information:
- Students need to prepare the final project individually and submit it before Wednesday 4th January 2023, 4:00 pm via SurreyLearn.
- The project consists of applying econometric analyses based on a real-world dataset, as described below, using statistical software (Stata). The expected level of econometric analyses is based on the lectures and lab sessions.
- Stata is available on computers in all FASS labs and all central labs (including the library). The network version is also accessible via workspaces.surrey.ac.uk. If you need assistance with remote access, please contact IT Services (itservicedesk@surrey.ac.uk).
- All documents and files (e.g., Dataset, Assessment Brief, Sample Report) related to the final project are available from SurreyLearn: Course Materials → Assessment Information.
- Please ensure you are aware of the assessment regulation and submission:
Dataset:
The panel dataset contains variables describing app developers in the Google Play app store. The data were captured for eight periods over the two years 2020 and 2021. The app store is an intensely competitive market in which app developers release apps to gain users’ attention and engagement. Many apps, even after being installed, fail to engage users and are eventually either uninstalled or abandoned. The number of active users is therefore a critical success factor in this market. The objective of this project is to explore some of the determinants of app developers’ total number of active users (as a measure of their success), as explained below.
Below is the description of the variables in the dataset.
- apps: Total number of apps released to the app store by a given app developer until a given period.
- free_percent: Percentage of free apps among all apps released by a given app developer until a given period. For example, free_percent equal to one means that all of the developer’s apps are free (i.e., no paid apps).
o Free apps are those that can be downloaded and installed free of charge from the app store. Paid apps are those for which users must pay the app price before they can download and install them.
- users: Total number of active users for all apps released by a given app developer until a given period.
- installs: Total number of installs by app users across all apps released by a given app developer until a given period.
- log_price: Natural logarithm of the average price for all apps released by a given app developer until a given period. App price is the price that users are required to pay before downloading the paid apps.
- hhi_category: Herfindahl–Hirschman Index (HHI) measuring the diversity of app categories across which a given app developer released its apps until a given period. It is a ratio between zero and one that indicates whether the app developer is focused on only a few app categories or diversified across many. For example, if a developer publishes only Entertainment apps, HHI will be one, but if a developer has a portfolio of apps across many categories such as Games, Business, Finance, Entertainment, Music & Audio, etc., HHI will be very small. In other words, a small HHI (closer to zero) implies that the app developer is a generalist, while a large HHI (closer to one) implies that the app developer is a specialist.
- size: Categorical variable indicating whether the app developer is a small, medium, or large firm.
- dev_id: Unique identifier of the app developer.
- period: Time-period identifier of the panel data. Treat it as a categorical variable.
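The brief does not state the formula behind hhi_category. Assuming the standard Herfindahl–Hirschman definition (the sum of squared category shares), it can be written as below, where s_{ik} is a symbol introduced here for illustration: the share of developer i's apps that fall in category k, out of K categories.

```latex
\[
\mathrm{HHI}_i \;=\; \sum_{k=1}^{K} s_{ik}^{2},
\qquad
s_{ik} \;=\; \frac{\text{number of apps of developer } i \text{ in category } k}
                  {\text{total number of apps of developer } i}
\]
```

Under this assumption, a developer with all apps in one category has HHI = 1, while a developer whose apps are split equally across four categories has HHI = 4 × 0.25² = 0.25.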
Note: The log transformation applied to the price variable (log_price) is Ln(x+1) rather than Ln(x), to avoid losing observations with price=0. If the price is zero, the log-transformed value is therefore also zero, since Ln(0+1)=0. For simplicity, you can interpret the effect size (if needed) as if the transformation were Ln(x).
Note: If you encounter the error “matsize too small” during your analysis (which may or may not happen, depending on your working memory), run the code below and then continue your analysis:
o set matsize 1000
Content and Structure:
Introduction
- Provide a brief explanation of the methodology, covering the data, the definitions of the dependent, independent, and control variables, the objective of the analyses, and the baseline model (as explained in the Main Regression Analysis section).
- The total number of installs, the total number of active users, and the average price should enter all models in natural-log form. All other variables should be used unlogged.
o Hint: In the dataset, the price variable is already logged, but you need to generate the natural log of the other two variables (installs and users) for your analysis.
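As an illustration of the hint above, the two logged variables could be generated along the following lines. This is a minimal sketch, not a prescribed solution: log_users and log_installs are hypothetical variable names, and it assumes users and installs are strictly positive.

```stata
* Sketch only: log_users and log_installs are hypothetical variable names.
* ln() returns missing for zero or negative values, so check the minimum first.
summarize users installs

generate log_users    = ln(users)
generate log_installs = ln(installs)

label variable log_users    "Active users (natural log)"
label variable log_installs "Installs (natural log)"

* log_price is already provided as Ln(price + 1); do not log it again.
```

If summarize shows zeros in either variable, plain ln() would drop those observations, and how to handle them is a modelling decision to justify in the report.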
Descriptive Analysis
- Provide a two-way table showing the summary statistics of the variables for subsamples of small, medium, and large developers, as well as the full sample. Briefly discuss the results.
- Apply an appropriate test to evaluate if there is any statistically significant difference (at 0.05 significance level) between small, medium, and large developers regarding the total number of active users (logged). Briefly discuss the results.
- Provide the correlation matrix of the variables. Briefly discuss the results.
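The three descriptive tasks above map onto standard Stata commands. The sketch below is one possible approach under stated assumptions: log_users and log_installs are hypothetical names for the logged variables, size_num is a hypothetical name, and one-way ANOVA is only one candidate test, whose suitability you should justify yourself.

```stata
* Assumes log_users and log_installs have already been generated (hypothetical names).

* Summary statistics by developer size; the Total row gives the full sample
tabstat log_users log_installs apps hhi_category log_price free_percent, ///
    by(size) statistics(mean sd min max) columns(statistics) longstub

* oneway may need a numeric grouping variable. If size is stored as a string,
* encode it first and use size_num in place of size below:
* encode size, generate(size_num)

* One candidate test for a difference in means across three groups
oneway log_users size, tabulate

* Pairwise correlations with significance levels
pwcorr log_users log_installs apps hhi_category log_price free_percent, sig
```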
Exploratory Analysis
- Inspect the data graphically: for example, present visual summary statistics, check the distribution/skewness of variables, pre-check for possible outliers, and pre-check the relationship between the dependent and independent variables and the longitudinal trend of variables. The details and types of graphs are your decision; the objective is to provide a concise yet informative inspection of the data before running the regression. You may pick a few of the potential graphs listed above (or other graphs) that describe various aspects of the data efficiently. Hint: more than six graphs would be too many!
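The graph types mentioned above correspond to built-in Stata commands. The following is a sketch of four candidates, assuming the hypothetical logged variables log_users and log_installs already exist; the selection and styling of graphs remain your decision.

```stata
* Distribution and skewness of the dependent variable, with a normal overlay
histogram log_users, normal

* Spread and potential outliers by developer size
graph box log_users, over(size)

* Relationship between the dependent variable and one regressor, with a fitted line
twoway (scatter log_users log_installs) (lfit log_users log_installs)

* Longitudinal trend: mean of log_users in each period
graph bar (mean) log_users, over(period)
```

Each graph can be saved as an image with graph export after it is drawn, and given informative titles through the title(), xtitle(), and ytitle() options.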
Main Regression Analysis:
- Conduct an OLS regression to estimate the effect of the total number of installs (logged), the total number of apps, and category HHI on the total number of active users (logged), while controlling for the average app price (logged) and time period. This will be the baseline model. Carefully interpret and discuss the results (e.g., R-squared, the statistical significance of coefficients, and the effect size).
- Briefly justify the positive or negative effects of the regressors conceptually. In particular, based on the results, does it seem to be better to be a specialist or a generalist app developer to gain user engagement?
- Looking at the free_percent variable (i.e., the percentage of free apps among all apps), you can see that some app developers release only free apps, while others release both free and paid apps. Generate a categorical variable that distinguishes these two types of developers: “only free” or “free & paid”. Modify the baseline model to estimate the differential effect of the total number of apps on the total number of active users (logged) for “only free” vs “free & paid” developers. Based on the results, discuss the statistical significance and effect size of the difference. Produce a margins plot and discuss how this graph supports the regression results. Can you explain what this means conceptually?
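The baseline model and its interaction variant described above might be set up as follows. This is a sketch under several assumptions, each flagged in the comments: log_users, log_installs, only_free, and only_free_lbl are hypothetical names, free_percent is assumed to be stored as a share where 1 means all apps are free, and the apps grid in margins is a placeholder.

```stata
* Baseline model (assumes period is stored as a non-negative integer so i.period works)
regress log_users log_installs apps hhi_category log_price i.period

* Indicator for developers with free apps only.
* Assumption: free_percent == 1 means all apps are free;
* use 100 instead if the variable is stored on a 0-100 scale.
generate only_free = (free_percent == 1) if !missing(free_percent)
label define only_free_lbl 0 "Free & paid" 1 "Only free"
label values only_free only_free_lbl

* Interaction of the number of apps with developer type
regress log_users log_installs c.apps##i.only_free hhi_category log_price i.period

* Predicted log_users over a range of apps for each developer type.
* The grid 1(10)51 is a placeholder; choose it from the observed range of apps.
margins only_free, at(apps = (1(10)51))
marginsplot
```

The c.apps##i.only_free notation includes both main effects and their interaction, so the interaction coefficient is the estimated difference in the apps slope between the two developer types.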
Diagnostics and Robustness Analysis:
- Apply diagnostic analyses to the baseline model to check for potential heteroskedasticity, and apply an appropriate remedy if needed. Briefly compare the new results with the original baseline results.
- Investigate the possibility of a quadratic effect of the total number of apps on the total number of active users (logged) and discuss the result. You can use graphical illustrations to enhance your discussion.
- Run the baseline model with app-developer fixed effects and robust standard errors. Briefly compare the new results with the original results of the baseline model. Given the context of the data and variables, explain how the fixed-effects model can mitigate the endogeneity problem in your baseline model.
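The three checks above could be sketched in Stata as follows. The tests and the remedy shown are common choices rather than the required ones, log_users and log_installs are hypothetical names, and the panel commands assume dev_id and period are numeric.

```stata
* Heteroskedasticity checks after the baseline OLS
regress log_users log_installs apps hhi_category log_price i.period
estat hettest          // Breusch-Pagan / Cook-Weisberg test
estat imtest, white    // White's general test

* One common remedy: heteroskedasticity-robust standard errors
regress log_users log_installs apps hhi_category log_price i.period, vce(robust)

* Quadratic term in apps via factor-variable notation
regress log_users log_installs c.apps##c.apps hhi_category log_price i.period, vce(robust)

* Developer fixed effects (assumes dev_id and period are numeric)
xtset dev_id period
xtreg log_users log_installs apps hhi_category log_price i.period, fe vce(robust)
```

With xtreg, fe, the vce(robust) option produces standard errors clustered on the panel variable (here dev_id), which is worth noting when comparing them with the OLS results.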
Appendix
- Copy the programming code into the appendix as text in the Word document. Do not paste the code as a screenshot. Alternatively, you can upload the Stata do-file along with your report on SurreyLearn as a separate file.
Format:
- The project file should be in Microsoft Word format, in 12-point Times New Roman font, double-spaced. The word count of the project should be no more than 3500 words.
- The word count includes everything from the first word of the introduction to the last word of the conclusion. The word count does not include tables, figures or images, and appendices. It does not include abstract, table of contents, abbreviation pages, or references (though these are not mandatory in this project). You should report the word count, your name and student number at the beginning of your project. According to the university policy, exceeding the word count limit is subject to a 10-point penalty.
Guideline and Tips:
- Carry out the required analyses in the order given, section by section (from Introduction to Diagnostics and Robustness Analysis).
- The report—the writing, explanations, tables, and graphs—should be clear and informative as a self-sufficient and stand-alone document for readers who do not have access to this Final Project Description.
- In the introduction, concisely explain the aim of the empirical report, sample and data, and definition of all final variables incorporated in your regression models. Some of this information (such as sample and variables definition) has been provided, but you need to summarise them in your report concisely.
- All tables and graphs should be numbered and titled (with captions if an additional explanation is required) and should be referred to in the report accordingly. Variable labels in tables and graphs should be informative.
- Graphs should be visually clear (axis title, colour, legend, axis scale, etc.). You may use image format for your graphs. Do not populate the report with lots of graphs; be selective and use the most informative ones for your purpose.
- Tables should be exported from the statistical software to a proper and readable Word format. You can report various models in one or two tables (each model in one column). Yet, you need to clearly number your models and refer to them in the discussions accordingly.
- In the regression tables, standard errors should be reported below each coefficient (in parentheses), and the significance level of each coefficient should be indicated by asterisks. The R-squared and number of observations for each model should be reported (see the Sample Report).
- The programming code used for preparing the tables, graphs, and regressions should be provided in a clear, easy-to-trace, and readable format in the appendix or as a separate do-file.
- You do not need to cite any references, but if you do, use a proper citation style and provide the reference list in the appendix.
- Overall, the project’s quality (i.e., clarity, rigour, precision, and depth) is more important than the length.
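For the regression-table guideline above (one model per column, standard errors in parentheses, asterisks, R-squared, and number of observations), one possible route is the user-written estout package. The sketch below assumes that package can be installed on your machine; m1, m2, and results.rtf are placeholder names, and log_users and log_installs are hypothetical variable names.

```stata
* estout is a user-written package, not part of official Stata:
* ssc install estout

eststo clear
eststo m1: regress log_users log_installs apps hhi_category log_price i.period
eststo m2: regress log_users log_installs apps hhi_category log_price i.period, vce(robust)

* Standard errors in parentheses, significance stars, R-squared; N is shown by default.
* "results.rtf" is a placeholder file name; .rtf files open in Word.
esttab m1 m2 using "results.rtf", se r2 label replace ///
    star(* 0.10 ** 0.05 *** 0.01)
```

Other export routes exist, and the Sample Report remains the reference for the expected layout.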