Take-home exam
Course: Information retrieval 1, 7,5 hp (C3LIR1)
Publication date: 2022-10-18
Please note that the exam consists of 13 questions (with subquestions). The maximum number of points is 45. To obtain one of the grades A–E, you need to get at least the following number of points: grade E ≥ 31 points, grade D ≥ 34 points, grade C ≥ 37 points, grade B ≥ 40 points, and grade A ≥ 43 points. This exam is to be performed individually and should be submitted in Canvas at the latest by 2022-10-30.
- Many queries used on the web consist of a few search terms (maybe only one term) and lack operators (e.g., Boolean operators). This makes it in many ways problematic for a search en- gine to retrieve relevant information for the user. Please characterize, using different concepts from the IR theory, two of the problems that the search engine is faced with when dealing with such short, operator-less queries, and indicate for each problem a method or approach that can be used to handle that problem in the design and construction of the search engine. [4 p]
- Assume that we, in an IR system based on the Boolean model, formulate a query q with the following structure:
q = renewable AND (wind OR (NOT oil))
- With regard to the presence / absence of the three query terms – which documents will be retrieved as (system) relevant? [2 p]
- Which one of the three terms should the system first examine (with regard to the possible presence of the term in the document) to ascertain as quickly as possible whether a given document should be retrieved or not? Why this term in particular? [2 p]
- Term weighting. Let k1 and k2 be two terms (if you want to, you can replace these variables with two concrete terms in the answer), D a document collection, and d a document in D. Assume that the term k1 occurs 4 times in the document d and that the term k2 also occurs 4 times in d. Finally, assume that the term k1 occurs in 10 documents in D and that the term k2 term is found in 15 documents in D. Use reasoning * to find out whether the term weight of k1 in d will be greater than, less than or equal to the term weight of k2 in d, given that the following IR models and/or term weighting schemes are used:
- the vector space model with tf-idf weighting. [1 p]
- the classical probabilistic model (binary independence model). [1 p]
*You should not have to use a calculator to obtain the answer.
- A commonly used similarity measure in the vector space model (VSM) is the cosine measure. Please explain the general reasoning behind the design and use of this measure. You do not need to use any mathematical formulas in your answer – it is sufficient to explain with words only. [2 p]
- A model of how users go about when searching for information is called the berrypicking model (formulated by Marcia Bates). Briefly explain this model and how it differs from the traditional understanding of the search process. [2 p]
- One of the basic components of every IR model is a ranking function sim that in the context of a query q assigns to each document a score that indicates the ”similarity” between the document and the query, according to their representations in the specific IR model. We say that two documents are separable in a given IR model if it is possible to formulate at least one query that results in different values of the ranking function sim for the two documents. Consider the (short) documents:
d1 = to go or not to go
and
d2 = to go or not
Are these documents separable in
- the Boolean model, [2 p]
- the vector space model with tf-idf weighting and the cosine similarity measure? [2 p] Please substantiate your answers.
- Lexical analysis and term weighting.
- What is meant by lexical analysis? [1 p]
- Present a disadvantage of using tf weighting instead of tf-idf weighting for document representation? [1 p]
- What is meant by document length normalization and why may it be important to perform when generating document representations? [2 p]
- Text processing.
- Briefly present a potential advantage of using stemming in an IR system. [1 p]
- Also indicate a potential disadvantage of using stemming in an IR system. [1 p]
- What is the Levenshtein distance between the terms string and song? [1 p]
- Relevance feedback.
- What is meant by relevance feedback and how does this work in principle? [2 p]
- Explain how relevance feedback works, by exemplifying with the Rocchio method or the classical probabilistic model. Please note that it is sufficient that you explain only one of these methods, and that you do not need to use mathematical formalism in your answer (though it may facilitate your presentation). [2 p]
- Compare relevance feedback with query expansion. What similarities and differences regarding purpose and approach can you find? [2 p]
- What is the difference between local and global analysis in the context of implicit feed- back? [2 p]
- We perform a search in a reference collection on a topic with 12 known relevant documents. The current IR system is based on the vector space model. The returned documents are rel- evance assessed up until DCV = 20 (where DCV, i.e. document cutoff value is equal to the position up until which the evaluation measurements are calculated), whereby the following list is obtained. We let R represent a relevant document and 0 a non-relevant document.
R R 0 R 0 0 R R 0 0 R 0 0 0 0 R 0 0 0 0
For the search result above and DCV = 20, please calculate
- recall [1 p]
- precision [1 p]
- R-precision [1 p]
- the F-measure (F1) [1 p]
- Evaluation.
- Why is recall an inappropriate measure when evaluating web searches? [1 p]
- A common observation when evaluating search results is that the recall tends to be higher for larger DCV values (higher document positions), while the precision tends to decrease for larger DCV values. Present an explanation of this phenomenon based on the defini- tions of recall and precision. [2 p]
- Compare the algorithm HITS and PageRank for similarities and differences in how they rank web pages. [2 p]
- For this exam question we use the simple definition of PageRank, which is formulated as fol- lows. Let w and v be web pages. By the notation ∀v : v ↣ w we denote ”for all pages v such that v links to page w”. We also let r(w) denote the PageRank value for page w and let g(v) denote the number of links from page v. PageRank is then defined:
∑ |
r(w) =
∀v:v ↣ w
r(v)
g(v)
Consider the link structure in the figure below. The nodes (circles) represent web pages and the arrows represent hypertext links. Assume that the value of r(A), that is, the PageRank value for page A, is known and happens to be independent of the structure in the figure. Furthermore, assume that the PageRank values can be fully calculated based on the structure in the figure. Which of the following statements is correct and how did you come up with the answer?
- r(B) > r(C) [r(B) is greater than r(C)].
- r(B) < r(C) [r(B) is less than r(C)].
- r(B) = r(C) [r(B) equals r(C)].
The answer is not dependent on the exact value of r(A), but you can assume, for example, that
r(A) = 18.0. [3 p]
Get expert help for Information retrieval and many more. 24X7 help, plag-free solution. Order online now!