Lecture
|
Topic Title
|
Topic Detail
|
Primary/Secondary
Resource/Book;
Course Notes for the Topic
|
Page/Section/URL
of the Resource
|
1
|
Introduction
|
Information retrieval
IR system architecture
Web search
History of IR
Related areas
|
Textbook
|
Chapter 1
|
2
|
Information Retrieval Models
Boolean Retrieval Model
|
What is Information Retrieval?
IR Models
The Boolean Model
Considerations on the Boolean Model
|
Textbook, Prof. Joydeep Ghosh (UT ECE)
|
Chapter 1
|
3
|
Boolean Retrieval Model
Ranked Retrieval Model
|
Boolean Retrieval Model
Information Retrieval Ingredients
Westlaw
Ranked retrieval models
|
Chapter 1 of IIR
Boolean Retrieval
|
|
4
|
Vector Space Retrieval Model
|
The Vector Model
Term frequency tf
Document frequency
tf-idf weighting
|
IIR 6.2 – 6.4.3
|
https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
https://www.bionicspirit.com/blog/2012/01/16/cosine-similarity-euclidean-distance.html
|
5
|
TF-IDF Weighting
Document Representation in Vector Space
Query Representation in Vector Space
Similarity Measures
|
Computing TF-IDF
Mapping to vectors
Coefficients
Query Vector
Similarity Measure
Jaccard coefficient
Inner Product
|
Many slides in this section are adapted from Prof. Joydeep Ghosh
(UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech,
Hong Kong)
|
|
6
|
Similarity Measures
Cosine Similarity Measure
|
Cosine Similarity
Basic indexing pipeline
Sparse Vectors
Inverted Index
|
IIR 6.10
|
Chapter 06
|
7
|
Parsing Documents
|
Basic indexing pipeline
Inverted Index
Cosine Similarity Measure
Time Complexity of Indexing
Retrieval with Inverted Index
Inverted Query Retrieval Efficiency
|
Textbook
|
Chapter 1
|
8
|
Token
Numbers
Stop Words
|
Parsing a document
Complications: Format/language
Precision and Recall
Tokenization
Numbers
Tokenization: language issues
Stop words
|
MG 3.6, 4.3; MIR 7.2
|
Porter’s stemmer: http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined
Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
|
9
|
Terms Normalization
|
Normalization
Case folding
Normalization to terms
Thesauri and soundex
|
MG 3.6, 4.3; MIR 7.2
|
Porter’s stemmer: http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined
Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
|
10
|
Lemmatization
Stemming
|
Lemmatization
Stemming
Porter’s algorithm
Language-specificity
|
MG 3.6, 4.3; MIR 7.2
|
Porter’s stemmer: http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined
Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
|
11
|
Compression
|
Compression for inverted indexes
Dictionary storage
Dictionary-as-a-String
Blocking
|
Chapter 5 of IIR
|
Original publication on word-aligned binary codes by Anh and Moffat (2005);
also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer, Williams, Yiannis and
Zobel (2002)
More details on compression (including compression of positions and
frequencies) in Zobel and Moffat (2006)
|
12
|
Compression
|
Blocking
Front coding
Postings compression
Variable Byte (VB) codes
|
Chapter 5 of IIR
|
Original publication on word-aligned binary codes by Anh and Moffat (2005);
also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer, Williams, Yiannis and
Zobel (2002)
More details on compression (including compression of positions and
frequencies) in Zobel and Moffat (2006)
|
13
|
Compression
|
Variable Byte (VB) codes
Unary code
Gamma codes
Gamma code properties
RCV1 compression
|
Chapter 5 of IIR
|
Resources at http://ifnlp.org/ir
Original publication on word-aligned binary codes by Anh and Moffat (2005);
also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer, Williams, Yiannis and
Zobel (2002)
More details on compression (including compression of positions and
frequencies) in Zobel and Moffat (2006)
|
14
|
Index Construction
|
Scaling index construction
Memory Hierarchy
Hard Disk Tracks and Sectors
Hard Disk Blocks
Hardware basics
|
Chapter 4 of IIR
MG Chapter 5
|
Original publication on MapReduce:
Dean and Ghemawat (2004)
Original publication on SPIMI: Heinz and Zobel (2003)
|
15
|
Merge Sort
|
Two-Way Merge Sort
Single-pass in-memory indexing
SPIMI-Invert
|
Chapter 4 of IIR
MG Chapter 5
|
Original publication on MapReduce:
Dean and Ghemawat (2004)
Original publication on SPIMI: Heinz and Zobel (2003)
|
16
|
Phrase queries
|
Types of Queries
Phrase queries
Biword indexes
Extended biwords
Positional indexes
|
MG 3.6, 4.3; MIR 7.2
|
Porter’s stemmer: http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined
Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
|
17
|
Processing a phrase query
Proximity queries
|
Processing a phrase query
Proximity queries
Combination schemes
|
MG 3.6, 4.3; MIR 7.2
|
Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined
Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
|
18
|
Wild-Card Queries
B-Tree
|
How to Handle Wild-Card Queries
Wild-card queries: *
B-Tree
B+ Tree
|
IIR 3, MG 4.2
|
Efficient spell retrieval:
K. Kukich. Techniques for automatically correcting words in text. ACM
Computing Surveys 24(4), Dec 1992.
J. Zobel and P. Dart. Finding approximate matches in large
lexicons. Software: Practice and Experience 25(3), March 1995.
http://citeseer.ist.psu.edu/zobel95finding.html
Mikael Tillenius: Efficient Generation and Ranking of Spelling Error
Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.
http://citeseer.ist.psu.edu/179155.html
|
19
|
Permuterm index
k-gram
|
Permuterm index
k-gram
Soundex
|
IIR 3, MG 4.2
|
Nice, easy reading on spell
correction:
Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html
|
20
|
Spelling Correction
|
Spell correction
Document correction
Query mis-spellings
Isolated word correction
Edit distance
Fibonacci series
|
IIR 3, MG 4.2
|
Efficient spell retrieval:
K. Kukich. Techniques for automatically correcting words in text. ACM
Computing Surveys 24(4), Dec 1992.
J. Zobel and P. Dart. Finding approximate matches in large
lexicons. Software: Practice and Experience 25(3), March 1995.
http://citeseer.ist.psu.edu/zobel95finding.html
Mikael Tillenius: Efficient Generation and Ranking of Spelling Error
Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.
http://citeseer.ist.psu.edu/179155.html
|
21
|
Spelling Correction
|
Edit distance
Using edit distances
Weighted edit distance
n-gram overlap
One option – Jaccard coefficient
|
IIR 3, MG 4.2
|
Nice, easy reading on spell
correction:
Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html
|
22
|
Spelling Correction
|
Matching trigrams
Computing Jaccard coefficient
Context-sensitive spell correction
General issues in spell correction
|
IIR 3, MG 4.2
|
Nice, easy reading on spell
correction:
Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html
|
23
|
Performance Evaluation of Information Retrieval Systems
|
Why System Evaluation?
Difficulties in Evaluating IR Systems
Measures for a search engine
Measuring user happiness
How do you tell if users are happy?
|
Textbook
|
Chapter 8
|
24
|
Benchmarks for the Evaluation of IR Systems
|
Happiness: elusive to measure
Gold Standard
search algorithm
Relevance judgments
Standard relevance benchmarks
Evaluating an IR system
|
Textbook
|
Chapter 8
|
25
|
Benchmarks for the Evaluation of IR Systems
|
Evaluation Measures
Precision and Recall
Unranked retrieval evaluation
Trade-off between Recall and Precision
Computing Recall/Precision Points
|
Textbook
|
Chapter 8
|
26
|
Precision and Recall
|
Computing Recall/Precision
Interpolating a Recall/Precision
Average Recall/Precision Curve
R-Precision
Precision@K
F-Measure
E Measure
|
Textbook
|
Chapter 8
|
27
|
Mean Average Precision
Non-Binary Relevance
DCG
NDCG
|
Mean Average Precision
Mean Reciprocal Rank
Cumulative Gain
Discounted Cumulative Gain
Normalized Discounted Cumulative Gain
|
Textbook
|
Chapter 8
|
28
|
Using User Clicks
|
What do clicks tell us?
Relative vs absolute ratings
Pairwise relative ratings
Interleaved docs
Kendall tau distance
Critique of additive relevance
Kappa measure
A/B testing
|
Textbook
|
Chapter 9
|
29
|
Cosine Ranking
|
Computing cosine-based ranking
Efficient cosine ranking
Computing the K largest cosines
Use heap for selecting top K
High-idf query terms
|
Textbook
|
Chapter 9
|
30
|
Sampling and pre-grouping
|
Term-wise candidates
Preferred List storage
Sampling and pre-grouping
General variants
|
Textbook
|
Chapter 13
|
31
|
Dimensionality reduction
|
Dimensionality reduction
Random projection onto k<<m axes
Computing the random projection
Latent semantic indexing (LSI)
The matrix
Dimension reduction
|
Books: MG 4.6, MIR 2.7.2.
|
Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html
Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html
Latent semantic indexing:
http://citeseer.nj.nec.com/deerwester90indexing.html
|
32
|
Web Search
|
The World Wide Web
Web Pre-History
Web Search History
Web Challenges for IR
Graph Structure in the Web
Zipf’s Law on the Web
Manual Hierarchical Taxonomies
Business Models for Web Search
|
Textbook
|
Chapter 19
|
33
|
Spidering
|
Web Search Using IR
Spiders (Robots/Bots/Crawlers)
What any crawler must do
What any crawler should do
Search Strategies
Basic crawl architecture
Restricting Spidering
Robot Exclusion
|
Textbook
|
Chapter 19
|
34
|
Web Crawler
|
Basic crawl architecture
Communication between nodes
URL frontier: Mercator scheme
Duplicate documents
Shingles + Set Intersection
|
Textbook
|
Chapter 20
|
35
|
Distributed Index
|
Distributed Index
Map Reduce
Mapping onto Hadoop Map Reduce
Dynamic Indexing
Issues in Dynamic Indexing
|
Textbook
|
Chapter 04
|
36
|
Link Analysis
|
Hypertext and Links
The Web as a Directed Graph
Anchor Text
Indexing anchor text
PageRank scoring
|
IIR Chap 21
|
http://www2004.org/proceedings/docs/1p309.pdf
http://www2004.org/proceedings/docs/1p595.pdf
http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
|
37
|
Markov chains
|
Markov chains
Ergodic Markov chains
Markov Chain with Teleporting
Query Processing
Personalized PageRank
|
IIR Chap 21
|
http://www2004.org/proceedings/docs/1p309.pdf
http://www2004.org/proceedings/docs/1p595.pdf
http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
|
38
|
HITS
|
Hyperlink-Induced Topic Search (HITS)
Hubs and Authorities
The hope
Distilling hubs and authorities
HITS for Clustering
|
IIR Chap 21
|
http://www2004.org/proceedings/docs/1p309.pdf
http://www2004.org/proceedings/docs/1p595.pdf
http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
|
39
|
Search Computing
|
Multi-domain queries with ranking
Why can’t search engines do it?
Observed trends
Search Computing
The Search Computing “Manifesto”
Search Computing architecture
|
Search Computing by Stefano Ceri et al.
|
www.search-computing.org
|
40
|
Top-k Query Processing
|
Top-k Query Processing
Simple Database model
Fagin’s Algorithm
Threshold Algorithm
Comparison of Fagin’s and Threshold Algorithm
|
Search Computing by Stefano Ceri et al.
|
http://dl.acm.org/citation.cfm?id=1391730
|
41
|
Clustering
|
What is clustering?
Improving search recall
Issues for clustering
Notion of similarity/distance
Hard vs. soft clustering
Clustering Algorithms
|
Textbook
|
Chapter 13
|
42
|
Classification
|
Why probabilities in IR?
Document Classification
Bayes’ Rule For Text Classification
Bernoulli Random Variables
Smoothing Function
|
Textbook
|
Chapter 13
|
43
|
Clustering
|
Rocchio classification
K Nearest neighbors
Nearest-Neighbor Learning
kNN decision boundaries
Bias vs. variance
|
Textbook
|
Chapter 17
|
44
|
Recommender Systems
|
Recommender Systems
Personalization
Basic Types of Recommender Systems
Collaborative Filtering
Content-Based Recommending
|
C. Sammut, G. Webb (eds.), Encyclopedia of
Machine Learning, Springer-Verlag Berlin Heidelberg, 2010
|
|
45
|
Final Notes on Information Retrieval
|
Other topics and research issues in the information retrieval domain
|
|
|