The Logic Of Science


Context Representation and Modelling

18 Mar 2014

In information retrieval tasks, we need to analyze contextual information, which consists of high-order word patterns and the semantic associations implied by the text. A useful definition of context should satisfy the following criteria:
(1) it provides enough information;
(2) it contains little noise;
(3) it is efficient in terms of time and space cost.

To process massive, high-dimensional data, traditional data mining algorithms take as the context the neighboring terms within a fixed distance of the core word, the so-called "window". In addition, several traditional word association approaches can be used to analyze contextual information, such as the Apriori algorithm, closed frequent itemsets, co-occurrence analysis, syntactic phrases, and WordNet-based associations.
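The window idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace-tokenized text and a symmetric window; the function names `window_contexts` and `cooccurrence_counts`, the `radius` parameter, and the sample sentence are all made up for this example and do not come from any particular library.

```python
from collections import Counter


def window_contexts(tokens, core, radius=2):
    """Collect the terms within `radius` positions of each occurrence of `core`."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == core:
            left = tokens[max(0, i - radius):i]     # up to `radius` terms before
            right = tokens[i + 1:i + 1 + radius]    # up to `radius` terms after
            contexts.append(left + right)
    return contexts


def cooccurrence_counts(tokens, core, radius=2):
    """Count how often each term falls inside a window around `core`."""
    counts = Counter()
    for ctx in window_contexts(tokens, core, radius):
        counts.update(ctx)
    return counts


# Hypothetical sample text, tokenized on whitespace.
tokens = "the bank raised the interest rate while the river bank flooded".split()
contexts = window_contexts(tokens, "bank", radius=2)
counts = cooccurrence_counts(tokens, "bank", radius=2)
```

A larger `radius` brings in more information but also more noise and cost, which is the trade-off expressed by the three criteria above.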

From a probabilistic point of view, a context composed of several terms can be fully described by the joint probability distribution of those terms. The pure high-order word association (described in the previous post) therefore offers a new perspective for interpreting contextual semantics through the properties of the joint distribution.
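As a rough illustration of what "described by the joint distribution" means in practice, the sketch below estimates the joint probability that a set of terms occurs together in a document and compares it with the value expected if the terms were independent. It assumes binary term occurrence at the document level and a simple relative-frequency estimate; the helper names and the toy documents are hypothetical. It is not the pure high-order association measure from the previous post, only the raw joint probability that such a measure would build on.

```python
from math import prod


def joint_probability(docs, terms):
    """Relative frequency of documents that contain every term in `terms`."""
    terms = set(terms)
    hits = sum(1 for doc in docs if terms <= set(doc))
    return hits / len(docs)


def independence_baseline(docs, terms):
    """Product of the single-term probabilities, i.e. the joint under independence."""
    return prod(joint_probability(docs, [t]) for t in terms)


# Hypothetical toy corpus: each document is a list of terms.
docs = [
    ["bank", "interest", "rate"],
    ["bank", "river", "flood"],
    ["interest", "rate", "loan"],
    ["bank", "interest", "loan"],
]

p_joint = joint_probability(docs, ["bank", "interest"])
p_indep = independence_baseline(docs, ["bank", "interest"])
```

A gap between `p_joint` and `p_indep` indicates that the terms are associated rather than independent. With more than two terms, part of such a gap may come from lower-order (pairwise) associations, which is why a separate notion of pure high-order association is needed.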

We have recently started a collaborative project with Baidu on the Baidu openresearch community (Project No. 5).