Introduction:
Organizations often gather data from multiple sources, filter and clean it, and then load it into different systems for visualization and analysis. Over time, many of these systems may become disconnected and stop performing optimally. As a Big Data Developer, you may be tasked with combining data from multiple sources into one or more sinks. The go-to technologies for transferring data are often Apache Kafka and Spark Streaming. In this Final Project you will mimic an ETL (extract, transform and load) process using Spark Streaming with different data sources and sinks, giving you practice in how such systems may be developed and implemented.
ETL Process
The ETL process will have two sources:
- A Flume agent writing data into an HDFS folder.
- An HDFS folder with data being sent from the Linux local file system (optional if doing the project individually).
For this assignment, we will use IoT data from a Nexus phone (as used in the lecture exercises for Spark Streaming). You can choose other datasets if you wish. Spark Streaming will read data from the two sources above. The data will then be aggregated and sent to a Kafka sink, with the destination topic depending on the type of data.
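As a rough illustration of reading both sources, here is a minimal sketch that merges two HDFS folders into one stream. It assumes the Structured Streaming API (`spark.readStream`) and JSON-lines input; the folder paths, the field names (`Arrival_Time`, `Device`, `gt`) and the helper `read_sources` are hypothetical and should be adapted to your own setup and dataset.

```python
from functools import reduce

# Hypothetical HDFS folders: one written by the Flume agent, one fed from
# the local file system. Replace with the folders used in your setup.
FLUME_DIR = "hdfs:///user/student/flume_out"
LOCAL_DIR = "hdfs:///user/student/local_in"

# Assumed subset of fields in each JSON record, as a DDL schema string.
# Streaming file sources need an explicit schema.
SCHEMA_DDL = "Arrival_Time BIGINT, Device STRING, gt STRING"


def read_sources(spark, paths=(FLUME_DIR, LOCAL_DIR)):
    """Return one streaming DataFrame covering every source folder.

    `spark` is an existing SparkSession. Passing a single path (for an
    individual project) returns that one stream unchanged.
    """
    streams = [spark.readStream.schema(SCHEMA_DDL).json(path) for path in paths]
    # Merge the per-folder streams; all share the same schema.
    return reduce(lambda left, right: left.union(right), streams)
```

A call such as `read_sources(spark)` would return a DataFrame on which the aggregation and the Kafka sink can then be defined.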
For any data that involves sitting or standing, the data is sent to a Kafka topic called “idle”. For any data other than sitting or standing, the data is sent to a Kafka topic called “active”.
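The routing rule can be sketched as a small function. The activity labels in `IDLE_ACTIVITIES` are assumptions about how sitting and standing appear in the dataset, and the names `topic_for` and `route` are hypothetical; check the actual label values in your data before relying on them.

```python
# Assumed spellings of the sitting/standing labels; adjust to your dataset.
IDLE_ACTIVITIES = {"sit", "sitting", "stand", "standing"}


def topic_for(activity):
    """Return the Kafka topic name for one activity label."""
    label = (activity or "").strip().lower()
    # Anything that is not sitting or standing, including a missing
    # label, goes to "active" under the rule as stated.
    return "idle" if label in IDLE_ACTIVITIES else "active"


def route(activities):
    """Group a batch of activity labels by destination topic."""
    batches = {"idle": [], "active": []}
    for activity in activities:
        batches[topic_for(activity)].append(activity)
    return batches
```

Whether records with a missing label should be sent to "active" or dropped is a design choice the assignment leaves open; this sketch sends them to "active".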
Two consumers should then read the data and display the activity and time according to the type of activity.
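One possible shape for a consumer is sketched below. It assumes each Kafka message value is a JSON object with an activity field `gt` and a millisecond Unix timestamp `Arrival_Time` (both field names are assumptions), that the third-party `kafka-python` package is installed, and that a broker listens on `localhost:9092`. The function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def format_record(raw):
    """Turn one message value (JSON text or bytes) into a display line."""
    record = json.loads(raw)
    # Assumption: Arrival_Time is a Unix timestamp in milliseconds.
    when = datetime.fromtimestamp(record["Arrival_Time"] / 1000, tz=timezone.utc)
    return f"{when:%Y-%m-%d %H:%M:%S} UTC  {record['gt']}"


def run_consumer(topic, brokers="localhost:9092"):
    """Print every record from one topic; blocks until interrupted."""
    # Imported here so the formatter above works without kafka-python.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=brokers,
        auto_offset_reset="earliest",  # start from the oldest retained message
    )
    for message in consumer:
        print(f"[{topic}] {format_record(message.value)}")
```

Running `run_consumer("idle")` in one terminal and `run_consumer("active")` in another would give the two consumers, one per topic.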
Tasks:
- Watch the videos on Apache Flume and complete the associated exercise to understand how to set up a Flume agent for the project.
- Watch the videos on Apache Kafka to understand how Apache Kafka works, and do the exercises on producers and consumers in order to implement the Kafka source and sink in the project.
- Watch the exercises on Spark Streaming.
- Implement a system as described.
Deliverables:
- Write a report describing the setup of the system.
- Make a video of at most 5 minutes showing how the system works, how it gets data from the sources, and how it sends the data to the sinks.
Group Sign Up Details
- To form a group, students must discuss among themselves and mutually agree to be part of a group before they sign up to a group.
- Group members will then self-enrol into a specific group using the link on the course shell. Let your team members know which Group number to register for.
- You can see a list of the current members of each group, so there should be no confusion about which group to choose.
- You cannot enrol into a group if you have not received permission from its members to do so. Enrolment into a group without permission is considered academic misconduct.
- If you are part of a group and an unknown member joins your group, it is your responsibility to let me know. Failure to do so can result in academic misconduct.
- There can be a maximum of 3 people in a group.
- You can do the project individually if you wish, but you must still sign up in an empty group. If the project is done individually, you do not need to set up the second source (HDFS).