A Python implementation of IR based on LevelDB

Author: Xiaozhao Zhao.

Institution: Tianjin University

A Python implementation of information retrieval tasks, including forward and inverted indexing and basic retrieval models (e.g., BM25 and a unigram language model). The indexing module uses thread-safe Python bindings for LevelDB (https://code.google.com/p/py-leveldb/), a fast key-value storage library.
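As a rough illustration of what the forward and inverted indexes hold, the sketch below builds both from a toy corpus. It is not the project's code: plain Python dicts stand in for the LevelDB databases, and the names `docs`, `forward_index`, and `inverted_index` are hypothetical.

```python
from collections import defaultdict

# Toy corpus (hypothetical): docid -> list of termids, already tokenized.
docs = {
    1: [10, 11, 10],
    2: [11, 12],
}

# Plain dicts stand in for the key-value stores here; the real project
# writes to LevelDB, whose API is not shown in this sketch.
forward_index = {}                   # docid  -> termids in document order
inverted_index = defaultdict(dict)   # termid -> {docid: [positions]}

for docid, termids in docs.items():
    forward_index[docid] = termids
    for position, termid in enumerate(termids):
        inverted_index[termid].setdefault(docid, []).append(position)

# Look up every document (and position) where term 11 occurs.
print(dict(inverted_index[11]))
```

The forward index answers "which terms are in this document", while the inverted index answers "which documents contain this term", which is the lookup a retrieval model needs at query time.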

Run: sh buildIndex.sh

1. Tokenize the corpus with NLTK.
2. Extract document information from the tokenized corpus and map words to term IDs. Output format: [docid \t termid \t termtf \t positionsindoc]
3. Sort by the second column (termid) in numeric order (for building the inverted index).
4. Build the forward index for the corpus; output to LevelDB files: /forwardindexdb
5. Build the inverted index for the corpus; output to LevelDB files: /invertedindexdb
6. Compute each document's length and each term's document frequency.
7. Apply to information retrieval (BM25 and language model).
8. Evaluate with trec_eval (version 9.0).
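The steps from the tab-separated records through BM25 scoring can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the project's implementation: the sample records, the comma-separated encoding of positions, the function name `bm25`, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all choices made for the example.

```python
import math
from collections import defaultdict

# Hypothetical records in the format docid \t termid \t termtf \t positions.
# The comma-separated position encoding is an assumption.
records = [
    "2\t12\t1\t1",
    "1\t10\t2\t0,2",
    "2\t11\t1\t0",
    "1\t11\t1\t1",
]

# Parse, then sort numerically on the termid column.
rows = []
for line in records:
    docid, termid, tf, _positions = line.split("\t")
    rows.append((int(docid), int(termid), int(tf)))
rows.sort(key=lambda row: row[1])

# Inverted index (termid -> {docid: tf}) and document lengths.
postings = defaultdict(dict)
doc_len = defaultdict(int)
for docid, termid, tf in rows:
    postings[termid][docid] = tf
    doc_len[docid] += tf

# Document frequency: number of documents containing each term.
df = {termid: len(docs) for termid, docs in postings.items()}


def bm25(query_termids, k1=1.2, b=0.75):
    """Rank documents for a query with BM25 (assumed parameter defaults)."""
    n_docs = len(doc_len)
    avgdl = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for termid in query_termids:
        if termid not in postings:
            continue  # unseen term contributes nothing
        # Non-negative IDF variant; other BM25 formulations differ here.
        idf = math.log(1 + (n_docs - df[termid] + 0.5) / (df[termid] + 0.5))
        for docid, tf in postings[termid].items():
            norm = tf + k1 * (1 - b + b * doc_len[docid] / avgdl)
            scores[docid] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = bm25([10, 11])
print(ranking)
```

With this toy data, document 1 ranks first for the query `[10, 11]` because it is the only document containing term 10.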

This project can be found on GitHub.
