CS726 : Information Retrieval Techniques

Course Info

Course Category

Computer Science/Information Technology

Course Level

Graduate

Credit Hours

3

Pre-requisites

N/A

Instructor

Dr. Adnan Abid
PhD

Course Contents

Lecture

Topic Title

Topic Detail

Primary/Secondary Resource/Book; Course Notes for the Topic

Page/Section/URL of the Resource

1

Introduction

Information retrieval
IR system architecture
Web search
History of IR
Related areas

Textbook

Chapter 1

2

Information Retrieval Models
Boolean Retrieval Model

What is Information Retrieval?
IR Models
The Boolean Model
Considerations on the Boolean Model

Textbook, Prof. Joydeep Ghosh (UT ECE)

Chapter 1

3

Boolean Retrieval Model
Ranked Retrieval Model

Boolean Retrieval Model
Information Retrieval Ingredients
Westlaw
Ranked retrieval models

Chapter 1 of IIR
Boolean Retrieval


4

Vector Space Retrieval Model

The Vector Model
Term frequency tf
Document frequency
tf-idf weighting

IIR 6.2 – 6.4.3

https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/

https://www.bionicspirit.com/blog/2012/01/16/cosine-similarity-euclidean-distance.html
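
As a rough illustration of the tf-idf weighting and cosine similarity topics covered in Lectures 4 to 6, here is a minimal Python sketch. The weighting variant (log-scaled tf times log10(N/df)), the function names, and the three toy documents are assumptions made for this example; the textbook describes several alternative weighting schemes.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: (1 + log10 tf) * log10(N / df).
        vectors.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, already tokenized.
docs = [
    "new york times".split(),
    "new york post".split(),
    "los angeles times".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]))
```

The second and third documents share no terms, so their cosine similarity is zero, while the first two share "new" and "york" and score higher.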

5

TF-IDF Weighting
Document Representation in Vector Space
Query Representation in Vector Space
Similarity Measures

Computing TF-IDF
Mapping to vectors
Coefficients
Query Vector
Similarity Measure
Jaccard coefficient
Inner Product

Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

6

Similarity Measures
Cosine Similarity Measure

Cosine Similarity
Basic indexing pipeline
Sparse Vectors
Inverted Index

IIR 6.10

Chapter 06

7

Parsing Documents

Basic indexing pipeline
Inverted Index
Cosine Similarity Measure
Time Complexity of Indexing
Retrieval with Inverted Index
Inverted Query Retrieval Efficiency

Textbook

Chapter 1

8

Token
Numbers
Stop Words

Parsing a document
Complications: Format/language
Precision and Recall
Tokenization
Numbers
Tokenization: language issues
Stop words

MG 3.6, 4.3; MIR 7.2

Porter’s stemmer: 

http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4

9

Terms Normalization

Normalization
Case folding
Normalization to terms
Thesauri and soundex

MG 3.6, 4.3; MIR 7.2

Porter’s stemmer:

 http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4

10

Lemmatization
Stemming

Lemmatization
Stemming
Porter’s algorithm
Language-specificity

MG 3.6, 4.3; MIR 7.2

Porter’s stemmer: 

http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4

11

Compression

Compression for inverted indexes
Dictionary storage
Dictionary-as-a-String
Blocking

Chapter 5 of IIR


Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002)
More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006)

12

Compression

Blocking
Front coding
Postings compression
Variable Byte (VB) codes

Chapter 5 of IIR


Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002)
More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006)
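
For the Variable Byte (VB) codes topic, a minimal Python sketch of gap encoding followed by VB encoding may help. It assumes the convention used in Chapter 5 of IIR (7 payload bits per byte, with the high bit set on the last byte of each number); the function names and the sample postings list are illustrative assumptions.

```python
from itertools import accumulate


def vb_encode_number(n):
    """Encode one non-negative integer as a variable byte code."""
    out = []
    while True:
        out.insert(0, n % 128)   # lowest 7 bits go to the front
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # high bit marks the last byte of the number
    return bytes(out)


def vb_encode(numbers):
    """Concatenate the VB codes of a sequence of integers."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    """Decode a VB byte stream back into a list of integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


def to_gaps(doc_ids):
    """Turn a sorted postings list into first ID plus successive gaps."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    return list(accumulate(gaps))


postings = [824, 829, 215406]             # sample docIDs
encoded = vb_encode(to_gaps(postings))    # gaps are 824, 5, 214577
print(len(encoded), from_gaps(vb_decode(encoded)))
```

Small gaps fit in a single byte, which is why postings are stored as gaps rather than as absolute document IDs.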

13

Compression

Variable Byte (VB) codes
Unary code
Gamma codes
Gamma code properties
RCV1 compression

Chapter 5 of IIR

Resources at http://ifnlp.org/ir
Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002)
More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006)

14

Index Construction

Scaling index construction
Memory Hierarchy
Hard Disk Tracks and Sectors
Hard Disk Blocks
Hardware basics

Chapter 4 of IIR
MG Chapter 5

Original publication on MapReduce: Dean and Ghemawat (2004)
Original publication on SPIMI: Heinz and Zobel (2003)

15

Merge Sort

Two-Way Merge Sort
Single-pass in-memory indexing
SPIMI-Invert

Chapter 4 of IIR
MG Chapter 5

Original publication on MapReduce: Dean and Ghemawat (2004)
Original publication on SPIMI: Heinz and Zobel (2003)

16

Phrase queries

Types of Queries
Phrase queries
Biword indexes
Extended biwords
Positional indexes

MG 3.6, 4.3; MIR 7.2

Porter’s stemmer: 

http://people.ischool.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4

17

Processing a phrase query
Proximity queries

Processing a phrase query
Proximity queries
Combination schemes

MG 3.6, 4.3; MIR 7.2

Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4

18

Wild Card Queries
B Tree

How to Handle Wild-Card Queries
Wild-card queries: *
B-Tree
B+ Tree

IIR 3, MG 4.2

Efficient spell retrieval:
K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.
J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - Practice and Experience 25(3), March 1995. http://citeseer.ist.psu.edu/zobel95finding.html
Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master’s thesis at Sweden’s Royal Institute of Technology. http://citeseer.ist.psu.edu/179155.html

19

Permuterm index
k-gram

Permuterm index
k-gram
Soundex

IIR 3, MG 4.2

Nice, easy reading on spell correction:
Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html

20

Spelling Correction

Spell correction
Document correction
Query mis-spellings
Isolated word correction
Edit distance
Fibonacci series

IIR 3, MG 4.2

Efficient spell retrieval:
K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.
J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - Practice and Experience 25(3), March 1995. http://citeseer.ist.psu.edu/zobel95finding.html
Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master’s thesis at Sweden’s Royal Institute of Technology. http://citeseer.ist.psu.edu/179155.html

21

Spelling Correction

Edit distance
Using edit distances
Weighted edit distance
n-gram overlap
One option – Jaccard coefficient

IIR 3, MG 4.2

Nice, easy reading on spell correction:
Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html
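
The edit distance and n-gram overlap topics in Lectures 20 to 22 can be sketched in a few lines of Python. This is a minimal illustration that assumes unit costs for insert, delete, and substitute (Levenshtein distance) and character bigrams without boundary markers; the function names and example words are chosen for this sketch only.

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))      # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                      # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,            # delete cs
                curr[j - 1] + 1,        # insert ct
                prev[j - 1] + cost,     # substitute or match
            ))
        prev = curr
    return prev[-1]


def kgrams(word, k=2):
    """Set of character k-grams of a word (no boundary markers)."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(edit_distance("kitten", "sitting"))
print(jaccard(kgrams("november"), kgrams("december")))
```

In a spelling corrector, the cheaper k-gram overlap is typically used to shortlist candidate terms before the more expensive edit distance is computed on the shortlist.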

22

Spelling Correction

Matching trigrams
Computing Jaccard coefficient
Context-sensitive spell correction
General issues in spell correction

IIR 3, MG 4.2

Nice, easy reading on spell correction:
Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html

23

Performance Evaluation of Information Retrieval Systems

Why System Evaluation?
Difficulties in Evaluating IR Systems
Measures for a search engine
Measuring user happiness
How do you tell if users are happy?

Textbook

Chapter 8

24

Benchmarks for the Evaluation of IR Systems

Happiness: elusive to measure
Gold Standard
Search algorithm
Relevance judgments
Standard relevance benchmarks
Evaluating an IR system

Textbook

Chapter 8

25

Benchmarks for the Evaluation of IR Systems

Evaluation Measures
Precision and Recall
Unranked retrieval evaluation
Trade-off between Recall and Precision
Computing Recall/Precision Points

Textbook

Chapter 8

26

Precision and Recall

Computing Recall/Precision
Interpolating a Recall/Precision
Average Recall/Precision Curve
R-Precision
Precision@K
F-Measure
E-Measure

Textbook

Chapter 8

27

Mean Average Precision
Non-Binary Relevance
DCG
NDCG

Mean Average Precision
Mean Reciprocal Rank
Cumulative Gain
Discounted Cumulative Gain
Normalized Discounted Cumulative Gain

Textbook

Chapter 8
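
For the DCG and NDCG topics in Lecture 27, the following Python sketch shows one common formulation. It assumes the discount 1/log2(rank + 1); other definitions (for example, leaving the first rank undiscounted and dividing by log2(rank) afterwards) are also in use, so treat the exact formula and the sample relevance grades as assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevances in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments (0 to 3) for one ranked result list.
ranking = [3, 2, 3, 0, 1, 2]
print(round(dcg(ranking), 3), round(ndcg(ranking), 3))
```

Normalizing by the ideal ordering keeps scores comparable across queries with different numbers of relevant documents.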

28

Using User Clicks

What do clicks tell us?
Relative vs absolute ratings
Pairwise relative ratings
Interleaved docs
Kendall tau distance
Critique of additive relevance
Kappa measure
A/B testing

Textbook

Chapter 9

29

Cosine Ranking

Computing cosine-based ranking
Efficient cosine ranking
Computing the K largest cosines
Use heap for selecting top K
High-idf query terms

Textbook

Chapter 9

30

Sampling and pre-grouping

Term-wise candidates
Preferred List storage
Sampling and pre-grouping
General variants

Textbook

Chapter 13

31

Dimensionality reduction

Dimensionality reduction
Random projection onto k<<m axes
Computing the random projection
Latent semantic indexing (LSI)
The matrix
Dimension reduction

Books: MG 4.6, MIR 2.7.2.

Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html
Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html
Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html

32

Web Search

The World Wide Web
Web Pre-History
Web Search History
Web Challenges for IR
Graph Structure in the Web
Zipf’s Law on the Web
Manual Hierarchical Taxonomies
Business Models for Web Search

Textbook

Chapter 19

33

Spidering

Web Search Using IR
Spiders (Robots/Bots/Crawlers)
What any crawler must do
What any crawler should do
Search Strategies
Basic crawl architecture
Restricting Spidering
Robot Exclusion

Textbook

Chapter 19

34

Web Crawler

Basic crawl architecture
Communication between nodes
URL frontier: Mercator scheme
Duplicate documents
Shingles + Set Intersection

Textbook

Chapter 20

35

Distributed Index

Distributed Index
Map Reduce
Mapping onto Hadoop Map Reduce
Dynamic Indexing
Issues in Dynamic Indexing

Textbook

Chapter 04

36

Link Analysis

Hypertext and Links
The Web as a Directed Graph
Anchor Text
Indexing anchor text
PageRank scoring

IIR Chap 21

http://www2004.org/proceedings/docs/1p309.pdf
http://www2004.org/proceedings/docs/1p595.pdf
http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html

37

Markov chains

Markov chains
Ergodic Markov chains
Markov Chain with Teleporting
Query Processing
Personalized PageRank

IIR Chap 21

http://www2004.org/proceedings/docs/1p309.pdf
http://www2004.org/proceedings/docs/1p595.pdf
http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html

38

HITS

Hyperlink-Induced Topic Search (HITS)
Hubs and Authorities
The hope
Distilling hubs and authorities
HITS for Clustering

IIR Chap 21

http://www2004.org/proceedings/docs/1p309.pdf
http://www2004.org/proceedings/docs/1p595.pdf
http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html

39

Search Computing

Multi-domain queries with ranking
Why can’t search engines do it?
Observed trends
Search Computing
The Search Computing “Manifesto”
Search Computing architecture

 Search Computing by Stefano Ceri et al.

www.search-computing.org

40

Top-k Query Processing

Top-k Query Processing
Simple Database model
Fagin’s Algorithm
Threshold Algorithm
Comparison of Fagin’s and Threshold Algorithm

Search Computing by Stefano Ceri et al.

http://dl.acm.org/citation.cfm?id=1391730

41

Clustering

What is clustering?
Improving search recall
Issues for clustering
Notion of similarity/distance
Hard vs. soft clustering
Clustering Algorithms

Textbook

Chapter 13

42

Classification

Why probabilities in IR?
Document Classification
Bayes’ Rule For Text Classification
Bernoulli Random Variables
Smoothing Function

Textbook

Chapter 13

43

Clustering

Rocchio classification
K Nearest neighbors
Nearest-Neighbor Learning
kNN decision boundaries
Bias vs. variance

Textbook

Chapter 17

44

Recommender Systems

Recommender Systems
Personalization
Basic Types of Recommender Systems
Collaborative Filtering
Content-Based Recommending

C. Sammut, G. Webb (eds.), Encyclopedia of Machine Learning, Springer-Verlag Berlin Heidelberg, 2010

 

45

Final Notes on Information Retrieval

Other topics and research issues in the information retrieval domain