Author: Xiaozhao Zhao.
Institution: Tianjin University
A Python implementation of information retrieval tasks, including forward and inverted indexes and basic retrieval models (e.g., BM25 and a unigram language model). The indexing module uses thread-safe Python bindings for LevelDB (https://code.google.com/p/py-leveldb/), a fast key-value storage library.
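As a rough illustration of the two index types, the sketch below builds a forward index and an inverted index over a toy corpus. It is not the project's code: the documents and variable names are made up, and plain Python dicts stand in for the LevelDB key-value store.

```python
from collections import defaultdict

# Toy corpus; docids and tokens are invented for illustration.
docs = {
    "d1": ["fast", "key", "value", "store"],
    "d2": ["key", "value", "index"],
}

# Forward index: docid -> {term: term frequency in that document}.
forward_index = {}
# Inverted index: term -> {docid: positions of the term in that document}.
inverted_index = defaultdict(dict)

for docid, tokens in docs.items():
    tf = defaultdict(int)
    for pos, term in enumerate(tokens):
        tf[term] += 1
        inverted_index[term].setdefault(docid, []).append(pos)
    forward_index[docid] = dict(tf)

print(forward_index["d2"])
print(inverted_index["key"])
```

The forward index answers "which terms does this document contain?", while the inverted index answers "which documents contain this term?", which is the lookup a retrieval model needs at query time.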
run: sh buildIndex.sh
tokenize the corpus with NLTK
extract the document information from the tokenized corpus and map words to termids. output format: [docid \t termid \t termtf \t positionsindoc]
sort by the second column (termid) in numeric order (for building the inverted index)
build the forward index for the corpus; output to LevelDB files: /forwardindexdb
build the inverted index for the corpus; output to LevelDB files: /invertedindexdb
get each document's length and each term's document frequency
apply to information retrieval (BM25 and language model)
evaluate with trec_eval (version 9.0)
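The steps above can be sketched end to end in a few lines. This is a minimal sketch under stated assumptions, not the project's implementation: the records are hypothetical tuples in the [docid, termid, termtf, positionsindoc] layout, dicts replace the LevelDB files, and the BM25 variant (a non-negative IDF with k1 = 1.2 and b = 0.75) is one common formulation that may differ from the one used here. The language model is omitted.

```python
import math
from collections import defaultdict

# Hypothetical records: (docid, termid, termtf, positionsindoc).
records = [
    (1, 7, 2, [0, 3]),
    (1, 4, 1, [1]),
    (1, 9, 1, [2]),
    (2, 4, 1, [0]),
    (2, 9, 2, [1, 2]),
    (3, 9, 1, [0]),
]

# Sort by termid numerically so each term's postings are contiguous.
records.sort(key=lambda r: (r[1], r[0]))

inverted = defaultdict(dict)  # termid -> {docid: termtf}
doc_len = defaultdict(int)    # docid -> number of tokens
for docid, termid, tf, _positions in records:
    inverted[termid][docid] = tf
    doc_len[docid] += tf

# Document frequency: number of documents containing each term.
df = {termid: len(postings) for termid, postings in inverted.items()}
n_docs = len(doc_len)
avg_len = sum(doc_len.values()) / n_docs


def bm25(query, k1=1.2, b=0.75):
    """Rank documents for a list of query termids (assumed BM25 variant)."""
    scores = defaultdict(float)
    for termid in query:
        if termid not in inverted:
            continue  # term does not occur in the corpus
        idf = math.log(1 + (n_docs - df[termid] + 0.5) / (df[termid] + 0.5))
        for docid, tf in inverted[termid].items():
            # Length normalization: longer documents are penalized.
            norm = k1 * (1 - b + b * doc_len[docid] / avg_len)
            scores[docid] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: -item[1])


print(bm25([4]))
```

For the query `[4]`, both matching documents contain the term once, so the shorter document ranks first because of length normalization.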
This project can be found on GitHub.