ASSESSMENT 1 – MAPREDUCE, SIMILAR ITEMS AND DATA STREAMS
Due: Week 2, Wednesday at 11.59 pm Weighting: 20%
Purpose
The purpose of this assessment is to evaluate your learning and knowledge of using the MapReduce technique mentioned in Week 1, finding similar items and mining data streams.
Your Task
Your task is to complete the following exercises:
Exercise 1 – Friend Recommendation System (Stanford) (40 points)
- Write a MapReduce program in Spark (see Overview Module for download instructions) that implements a simple “People You Might Know” social network friendship recommendation algorithm.
The key idea is that if two people have a lot of mutual friends, then the system should recommend that they connect with each other.
Input:
- · Download the input file from the link: http://snap.stanford.edu/class/ cs246- data/hw1q1.zip. The input file contains the adjacency list and has multiple lines in the following format: <User><TAB><Friends>
- · Here, <User> is a unique integer ID corresponding to a unique user, and <Friends> is a comma separated list of unique IDs corresponding to the friends of the user with the unique ID <User>.
- · Note that the friendships are mutual (i.e., edges are undirected): if A is friends with B then B is also friends with A. Algorithm: Let us use a simple algorithm such that, for each user U, the algorithm recommends N = 10 users who are not already friends with U, but have the greatest number of mutual friends in common with U.
Output:
- The output should contain one line per user in the following format:
<User><TAB><Recommendations> where <User> is a unique ID corresponding to a user and <Recommendations> is a comma separated list of unique IDs corresponding to the algorithm’s recommendation of people that <User> might know, ordered in decreasing number of mutual friends.
- Even if a user has less than 10 second-degree friends, output all of them in decreasing order of the number of mutual friends.
- If there are recommended users with the same number of mutual friends, then output those user IDs in numerically ascending order.
- Also, please provide a description of how you are going to use MapReduce jobs to solve this problem. Do not write more than 3 to 4 sentences for this: only a very high- level description of your strategy to tackle this problem.
For your submission
- Include your source code
- Include in your writeup a short paragraph describing your algorithm to tackle this problem.
- Include in your writeup the recommendations for the users with following user IDs: 924, 8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.
Exercise 2 S-curve (exercise 3.4.1 in Leskovec, Rajaraman and Ullman) (7+7+7 points)
Evaluate the S-curve 1 − (1 − sr)b for s = 0.1, 0.2, . . ., 0.9, for the following values of r and b;
- r=3 and b=10.
- r=6 and b=20.
- r=5 and b=50.
Exercise 3 Filtering Streams (similar to Exercises of 4.3 in Leskovec, Rajaraman and Ullman) (10 + 10 points)
- For the situation of the running example of Section 4.3.1 in Leskovec, Rajaraman and Ullman with changed conditions (10 billion bits, 2 billion members of the set S).
Calculate the false-positive rate when using three hash functions. Do the same for four hash functions.
- As a function of n, the number of bits and m the number of members in the set S, what number of hash functions minimizes the false-positive rate?
Reference:
Leskovec, J., Rajaraman, A. and Ullman, J.D., 2020. Mining of massive data sets. Cambridge university press.
Outcomes
This task addresses the following course learning outcomes.
Course Learning Outcomes |
CLO1 |
Explain algorithms for big data sets and methodologies in the context of data mining. |
CLO3 Develop and integrate algorithms as a part of software development for mining big data. |
CLO5 Utilise contemporary technologies and practices to effectively handle big datasets. |
Requirements
- You must submit your assessment using the relevant portal in MyUni/Canvas.
- All written assessments (excluding quizzes) must be submitted using a text document e.g., doc, docx, pdf via the link at the top of the page.
- Consult the assessment rubric when preparing your submission.
- Questions can be posted to the relevant assessment Discussion Board.
Academic Integrity
Please ensure that you have read the Academic Integrity Policy.
Grading Criteria
This assessment is worth 20% of your overall grade. Refer to the attached rubric for detailed information on the grading criteria for this assessment.
Rubric title: Assessment 1 — MapReduce, Similar Items and Data Streams | |||||
Criteria | Ratings | Points | |||
Exercise 1: | Points: 40.0 Name: Full points | Points: 30.0 Name: Partial points | Points: 20.0 Name: Partial points | Points: 0.0 Name: No points | 40.0 pts |
MapReduce implementation works and the output meets all the requirements. | MapReduce implementation works but meets only some requirements and includes small mistakes. | Shows understanding of how to solve the task, but the MapReduce implementation only partially works. | No working implementation of MapReduce. | ||
Exercise 2: | Points: 21.0 Name: Full points | Points: 14.0 Name: Partial points | Points: 7.0 Name: Partial points | Points: 0.0 Name: No points | 21.0 pts |
All results are correct. | 2/3 of the results are correct. | 1/3 of the results are correct. | No correct results. | ||
Exercise 3: | Points: 20.0 Name: Full points | Points: 15.0 Name: Partial points | Points: 10.0 Name: Partial points | Points: 0.0 Name: No points | 20.0 pts |
Both parts are completely correct. | One of the parts is completely correct, the other has small mistakes. | One of the parts is partially correct; or, both are not completely correct but follow the correct approach. | Both parts are incorrect and the approach to solve them is incorrect. | ||
Total: | 81 pts |
Get expert help for SIMILAR ITEMS AND DATA STREAMS and many more. 24X7 help, plag free solution. Order online now!