Coursework:
Task: Develop a vertical search engine similar to Google Scholar that only retrieves papers/books published by a member of Coventry University; that is, at least one of the co-authors must be from CU. To that end, you crawl the Google Scholar profiles of academic staff at CU and index the papers in those profiles. The seed page for your crawler, i.e. the first page to crawl, is the Google Scholar page for Coventry University: https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779

Your system crawls this page and the links provided there for each member of staff to access their Google Scholar profiles. Then, for each profile, it goes through the publications and constructs the inverted index from the information about those publications. Because this information changes at a low rate, your crawler may be scheduled to look for new information, say, once per week, but it should ideally be able to do so automatically, as a scheduled task.

From the user's point of view, your system has an interface similar to the Google Scholar main page, where the user can type in their queries/keywords about the resources they want to find. Your system then displays the results, sorted by relevance, in a similar way to Google Scholar. However, only publications with at least one co-author from CU are retrieved. You may further specialise your search engine to a specific field, e.g. computer science, mechanical engineering, bioinformatics, or whatever you would like.

In addition, whether as a separate program or integrated with the search engine, a subject classification facility is needed. More specifically, the input is a scientific text and the output is its subject: zero or more of the classes Health, Engineering, Business, and Art.

You can use any general-purpose programming language of your choice, although Python is recommended because of its rich libraries and the sample code developed in the labs. In case of ambiguity, make reasonable assumptions and/or let me know. Please note that, to show that your system meets each of the above-mentioned requirements, your report must provide sufficient evidence, including a clear description, complete source code, and complete screenshots where applicable.
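The weekly, automatic re-crawl can be realised in several ways: a cron entry, a Windows Task Scheduler job, or an in-process scheduler. As a minimal sketch, assuming the third-party Python package "schedule" and a hypothetical run_crawler() entry point:

# Minimal sketch of a weekly scheduled crawl. Assumes the third-party
# "schedule" package (pip install schedule); run_crawler is a hypothetical
# placeholder for the real crawl-and-index routine.
import time
import schedule

def run_crawler():
    print("Re-crawling CU Google Scholar profiles and updating the index...")

# Run once per week; the day and time chosen here are arbitrary.
schedule.every().monday.at("02:00").do(run_crawler)

while True:
    schedule.run_pending()
    time.sleep(60)  # check once a minute whether the weekly job is due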
Suggested structure for the report
Part 1 – Search engine
1. Crawler:
1.1 Number of staff whose Google Scholar profiles are crawled (approximately).
1.2. Which parts of a Google Scholar profile are retrieved (e.g. only the titles of the publications, or additional parts as well)
1.3. What pre-processing is performed before the data is passed to the indexer/Elasticsearch (preferably with a screenshot of the data before it is passed to the indexer)
1.4. When the crawler operates, e.g. scheduled or run manually
1.5. Explanation of how it works, step by step, based on the written code (one or two pages should be enough; a minimal crawling sketch follows this list)
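To illustrate the crawling step, here is a minimal sketch, assuming the "requests" and "beautifulsoup4" packages. The CSS class names ("gsc_1usr", "gs_ai_name", "gsc_a_at") are assumptions based on Google Scholar's markup at the time of writing and must be verified against the live pages; a real crawler should also respect robots.txt and the site's terms of use, rate-limit its requests, and follow pagination.

# Minimal crawling sketch: list staff profiles from the seed page, then
# pull publication titles from each profile. Class names are assumptions
# to verify against the live Google Scholar pages.
import time
import requests
from bs4 import BeautifulSoup

SEED = ("https://scholar.google.co.uk/citations"
        "?view_op=view_org&hl=en&org=9117984065169182779")

def fetch(url):
    resp = requests.get(url, headers={"User-Agent": "CU-coursework-crawler"})
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

def staff_profiles(seed_url):
    # Yield (name, profile URL) pairs listed on the organisation page.
    # A fuller version would also follow the "next page" button.
    soup = fetch(seed_url)
    for entry in soup.select(".gsc_1usr"):
        link = entry.select_one(".gs_ai_name a")
        if link:
            yield (link.get_text(strip=True),
                   "https://scholar.google.co.uk" + link["href"])

def publication_titles(profile_url):
    # Titles only; Scholar paginates the publication table (cstart/pagesize
    # URL parameters), which a fuller crawler would follow.
    soup = fetch(profile_url)
    return [a.get_text(strip=True) for a in soup.select(".gsc_a_at")]

for name, url in staff_profiles(SEED):
    print(name, publication_titles(url)[:3])
    time.sleep(5)  # crude rate limiting between profile requests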
2. Indexer
2.1. Whether you implemented the index yourself or used Elasticsearch
2.2. If you implemented it, which data structure is used (for example, incidence matrix or inverted index)
2.3. If you implemented it, whether it is incremental, i.e. it grows and gets updated over time, or it is constructed from scratch every time your crawler is run
2.4. If you implemented it, some screenshots of its content
2.5. How it works, step by step, based on the written code (a few pages should be enough; a minimal inverted-index sketch follows this list)
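For the self-implemented option, the following is a minimal sketch of pre-processing plus an inverted index in pure Python. Lowercasing, a small stop-word list, and Porter stemming (via NLTK) are one reasonable pipeline, not the only one; persisting the resulting dictionary (e.g. with pickle or json) is one simple way to make the index incremental across crawler runs.

# Minimal inverted-index sketch in pure Python. The stop-word list is a
# tiny placeholder; NLTK's PorterStemmer handles stemming.
import re
from collections import defaultdict
from nltk.stem import PorterStemmer

STOPWORDS = {"a", "an", "and", "of", "the", "in", "on", "for"}
stemmer = PorterStemmer()

def tokens(text):
    # Lowercase, split into alphanumeric tokens, drop stop words, stem.
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        if tok not in STOPWORDS:
            yield stemmer.stem(tok)

def build_index(docs):
    # docs: {doc_id: text}; returns {term: sorted postings list of doc_ids}.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokens(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Deep learning for bioinformatics",
        2: "Learning analytics in engineering education"}
print(build_index(docs))  # e.g. {'learn': [1, 2], 'deep': [1], ...}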
3. Query processor
3.1. Whether you only support Boolean queries (using AND, OR, NOT, etc.) or also accept plain keywords as Google does (without any need for AND, OR, NOT, etc.)
3.2. If Elasticsearch is used, how you convert a user query into an appropriate query for Elasticsearch
3.3. If Elasticsearch is NOT used, whether or not you perform ranked retrieval; if so, the method used to calculate the ranks
3.4. Screenshots of the search results found for various queries
3.5. How it works, step by step, based on the written code and screenshots, where applicable (a few pages should be enough; a minimal ranked-retrieval sketch follows this list)
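If Elasticsearch is not used, TF-IDF weighting with cosine similarity is one standard way to rank free-text queries. A minimal sketch, using scikit-learn's TfidfVectorizer and made-up placeholder titles rather than real crawled records:

# Minimal ranked-retrieval sketch: TF-IDF vectors plus cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Deep learning for medical imaging",
          "A survey of information retrieval models",
          "Retrieval-augmented deep learning"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(titles)

def search(query, top_k=3):
    # Score every document against the free-text query and rank by score.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(titles[i], round(float(scores[i]), 3))
            for i in ranked if scores[i] > 0]

print(search("deep learning retrieval"))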
4. (Optional)
Any other important point you may want to mention, including any restrictions, extras, or issues.
Part 2 – Subject classification (text classification)
1. How the training data are gathered, and how many records there are
2. Which machine learning method (e.g. Multinomial NB) has been used and how its performance is measured
3. How it works, step by step, based on the written code, with screenshots of its answers for various inputs (a few pages should be enough; a minimal classifier sketch follows this list).
4. (Optional) Any other important point you may want to mention
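As an illustration of the Multinomial NB option, here is a minimal scikit-learn sketch. The six training records are made-up placeholders (a real dataset would have many labelled texts per class, with performance measured on a held-out split), and for the "zero or more subjects" requirement one could threshold predict_proba scores per class instead of taking the single most likely class.

# Minimal subject-classification sketch with multinomial Naive Bayes.
# Training texts and labels below are made-up placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["clinical trial of a new vaccine",
         "patient outcomes in cardiac surgery",
         "finite element analysis of a steel beam",
         "control systems for robotic arms",
         "quarterly revenue and market strategy",
         "brushwork techniques in Renaissance painting"]
labels = ["Health", "Health", "Engineering", "Engineering",
          "Business", "Art"]

# Bag-of-words counts feed the multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["stress testing of turbine blades"]))  # ['Engineering']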
Appendix
1. Complete source code
2. Training dataset used in Part 2
Note: This is only a suggestion and you may change it as you wish, as long as you provide enough evidence for meeting the requirements specified in the assignment brief. Also, please be generous and provide enough screenshots where applicable.