The Logic Of Science

Me Blog Project Publication

A Python implementation of IR based on LevelDB

23 Oct 2015

Author: Xiaozhao Zhao.

Institution: Tianjin University

A python implementation for information retrieval tasks, including forward/inverted index, basic retrieval models (e.g., BM25, uni-gram language model). The indexing module uses a thread-safe Python bindings for LevelDB (https://code.google.com/p/py-leveldb/). LevelDB is a fast key-value storage library.

run: sh buildIndex.sh

  1. tokenize corpus by nltk

  2. extract the document infomation from the tokenized corpus, transform words into termids. output format: [docid \t termid \t termtf \t positionsindoc]

  3. sort the second column (term_id) by number order (for building inverted index)

  4. build the forward index for corpus output to leveldb files: /forwardindexdb

  5. build the inversed index for corpus output to leveldb files: /invertedindexdb

  6. get document length, get term's document frequency

  7. apply to information retrieval (BM25 and Language model)

  8. evaluation by trec_eval (version 9.0)

This project can be found from github.