INSTRUCTIONS
In the lecture on the Spark Structured API, we did not specify the schema of our dataset. We relied on the Spark engine's schema inference, which generally loads data as strings. We can instead define a schema explicitly by using an object of a class called StructType, which consists of an array of StructFields. More details on Spark schemas can be found at this link:
https://sparkbyexamples.com/spark/spark-schema-explained-with-examples/
The code that loads the YouTube dataset used in the lectures with a schema has been provided as a guide.
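As a minimal sketch of the same StructType approach applied to the stocks data, the snippet below could be pasted into spark-shell. The column names, their order, their types, the header option, and the HDFS path are all assumptions made for illustration; check them against the actual file before using them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType, LongType}

// Hypothetical column layout: adjust names, order, and types to the real file.
val stockSchema = StructType(Array(
  StructField("symbol", StringType, nullable = true),
  StructField("date", StringType, nullable = true),   // dates kept as strings
  StructField("open", DoubleType, nullable = true),
  StructField("high", DoubleType, nullable = true),
  StructField("low", DoubleType, nullable = true),
  StructField("close", DoubleType, nullable = true),
  StructField("volume", LongType, nullable = true)
))

// In spark-shell a session named `spark` already exists; getOrCreate reuses it.
val spark = SparkSession.builder().appName("stocks").getOrCreate()

val stocks = spark.read
  .option("header", "true")            // assumption: the first line holds column names
  .schema(stockSchema)                 // use the explicit schema instead of inference
  .csv("hdfs:///path/to/stocks.csv")   // hypothetical HDFS path

// Confirm that the numeric columns are no longer loaded as strings.
stocks.printSchema()
```

Supplying the schema up front also avoids the extra pass over the file that schema inference would need, which matters for a file of this size.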
Note that Spark only recognizes dates that are in a specific format. Dates in other formats can be loaded by specifying both the schema and the date format, but this is a bit beyond this course. For simplicity, you can treat dates as strings in this assignment.
Once you have loaded the stocks dataset with the correct schema in Spark, answer the following questions. For any question that requires you to execute a command, take one or more screenshots of the command and its output.
Load the large stocks dataset (400 MB) into HDFS and use it to create a Scala DataFrame with the correct schema specified (see the examples on the website above).
- Write a command to find the stocks with an average daily volume larger than 1 million shares.
- Write a query to find the top 3 stocks by volume for the year 2004.
- Write a query to find the top 3 stocks by volume whose symbols start with the first letter of your name (for example, for Saber, symbols starting with “S”). If there are no stocks with the letter you specify, choose another letter.
- Write a query to find all the stock symbols whose closing price is larger than your age.
- Write a query to find the top 10 stocks with the largest intraday price change (the difference between the high and low price during a trading day) and also display the amount of the change.
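The questions above mostly combine three DataFrame patterns: aggregating per symbol and filtering on the aggregate, filtering on a string column, and sorting by a derived column. The sketch below illustrates these patterns on a tiny made-up sample rather than the real dataset. The column names and the year-first date format are assumptions, and the thresholds are only placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, desc}

// Local session for experimenting; in spark-shell the existing session is reused.
val spark = SparkSession.builder().appName("stocks-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows standing in for the real stocks DataFrame (hypothetical columns).
val stocks = Seq(
  ("AAA", "2004-01-02", 10.0, 12.5, 9.5, 12.0, 1500000L),
  ("AAA", "2004-01-05", 12.0, 13.0, 11.0, 12.5, 900000L),
  ("BBB", "2004-01-02", 50.0, 51.0, 49.0, 50.5, 200000L)
).toDF("symbol", "date", "open", "high", "low", "close", "volume")

// Pattern 1: aggregate per symbol, then filter on the aggregated value.
val avgVolume = stocks
  .groupBy("symbol")
  .agg(avg("volume").as("avg_volume"))
  .filter(col("avg_volume") > 1000000)

// Pattern 2: dates are plain strings, so a year can be matched by prefix
// (assumes the date string starts with the four-digit year).
val in2004 = stocks.filter(col("date").startsWith("2004"))

// Pattern 3: derive a column, sort on it in descending order, keep the top rows.
val intraday = stocks
  .withColumn("change", col("high") - col("low"))
  .orderBy(desc("change"))
  .select("symbol", "date", "change")
  .limit(10)

avgVolume.show()
in2004.show()
intraday.show()
```

The same queries can also be written in SQL by registering the DataFrame with createOrReplaceTempView and calling spark.sql; which style to use is a matter of preference.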
Deliverables:
Answer all questions in a single, well-laid-out PDF or Word document (don’t just submit a bunch of screenshots). For any query commands, make sure to include the screenshot(s) of the command being executed and the corresponding output. Do not submit multiple files. Do not submit any compressed files (for example, zip or rar).