Project:
Advanced Big Data Computing and Programming
(BAN 5600 – I)
Instructor: Hamidreza Ahady Dolatsara (Hamid)
The final project should address an analytic problem or be driven by your literature review. It must be developed in a PySpark or SparkR environment, and the results must be obtained through one of them. The project is to be done individually. The following items must be part of the project workflow:
- Development of the code through a GitHub repository.
- The project should be performed on two different platforms (AWS, Databricks, Google Cloud, local installation, etc.).
You need to justify the project topic, the research questions, and the methods you choose, and provide reproducible results. The last week is for project presentations. Every student should make a (YouTube) video of their presentation.
There are three deliverables for the final project as follows:
- Project Proposal (10 points), Due date: 03/10/2023, submission through Moodle
- Problem statement, including but not limited to the motivation, an introduction to the data and the topic, and the project scope.
- Data source, and description of the data set (including the data dictionary)
- Project Introduction (10 points), Due date: 03/10/2023, submission through Moodle
- Literature review (including citations), identifying the gap in the literature
- Research questions, and the significance of the study
- Descriptive statistics, and visualizations
- Final products (80 points), Due date: 04/10/2023, submission through Moodle
- Presentation (YouTube)
- Final Report as below:
- It should include the previous deliverables plus the methods, analyses, conclusions, business recommendations, code, and software outputs.
- There is no required length for the write-up, but each section of the report should be well developed and clearly explained.
- In this report, students must submit a complete project as described above.
- This report should be comprehensive and reproducible: the TA or instructor should be able to use the code and data to reproduce the same outputs.
- Important Points:
- The project is not just an implementation of a Python project in the Spark environment. The expectation is much higher: it should be a professional report.
- You can use your knowledge of other programming languages such as R, SQL, etc.
- Following CRISP-DM, your project is cyclical: you should try to improve the performance of your project and try at least 3 different approaches. Here is a link to learn more about this standard
- You should provide clear and well-developed comments so that everyone who reads your code can understand the workflow.
- You may submit a video through Zoom.
- A good “Literature Review” means reading plenty of scientific papers and citing them
professionally.
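As one possible illustration of the "at least 3 different approaches" and reproducibility points above, the sketch below fits three PySpark classifiers on the same seeded split and reports a test AUC for each. It is only an assumed setup, not a required design: the function names `pick_best` and `compare_approaches`, the column names `label` and `features`, the choice of a binary classification task, and the seed value are all hypothetical. The input is assumed to be a Spark DataFrame that already has an assembled feature vector column.

```python
from typing import Dict, Tuple

SEED = 42  # hypothetical fixed seed so the split and tree models are repeatable


def pick_best(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the (approach name, score) pair with the highest score."""
    best = max(scores, key=scores.get)
    return best, scores[best]


def compare_approaches(df, label_col: str = "label",
                       features_col: str = "features") -> Dict[str, float]:
    """Fit three classifiers on one train/test split and return test AUC per approach.

    Assumes `df` is a Spark DataFrame with a binary label column and an
    already assembled vector column of features.
    """
    # Imported lazily so pick_best can be used without a Spark installation.
    from pyspark.ml.classification import (
        GBTClassifier,
        LogisticRegression,
        RandomForestClassifier,
    )
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Same seeded split for every approach, so scores are comparable.
    train, test = df.randomSplit([0.8, 0.2], seed=SEED)

    evaluator = BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC"
    )
    approaches = {
        "logistic_regression": LogisticRegression(
            labelCol=label_col, featuresCol=features_col
        ),
        "random_forest": RandomForestClassifier(
            labelCol=label_col, featuresCol=features_col, seed=SEED
        ),
        "gradient_boosted_trees": GBTClassifier(
            labelCol=label_col, featuresCol=features_col, seed=SEED
        ),
    }
    # Fit each estimator on the training split and score it on the test split.
    return {
        name: evaluator.evaluate(estimator.fit(train).transform(test))
        for name, estimator in approaches.items()
    }
```

A call such as `pick_best(compare_approaches(df))` would then name the strongest of the three approaches. Note that a seeded `randomSplit` may still give different splits if the partitioning of `df` changes between platforms, so persisting the split or the prepared data can help reproducibility.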