- 04-23李青青 Deep learning for drug-drug interaction extraction
- 04-15刁宇峰 AAAI2018中的自注意力机制(Self-attention Mechanism)
- 04-15王治政 Label informed Attributed Network Embedding
- 03-24罗凌 自然语言处理中的自注意力机制（Self-attention Mechanism）
- 01-04任玉琪 Capsule间的动态路由
- 12-24王治政 异构信息网络的表示学习

**A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution**

**Abstract**

l The main goal of a retrieval model is to measure the degree of relevance of a document with respect to the given query.

l Recent research shows that tf normalization that factors in multiple aspects of term salience is an effective scheme.

l The weight of a term in a document is proportional to the probability that the normalized tf is maximum under the hypothesis that the frequencies are generated randomly.

**Introduction**

l To measure the weight of a term in a document, most well known functions combine three major components

◆ the term frequency

◆ the inverse document frequency

◆ the document length

l The key question is then how these components can be integrated to produce a composite weight.

**Models**

l Modeling term weight is the central issue in an information retrieval system.

l Three widely used models in IR:

◆ the probabilistic models

◆ the vector space model

◆ the inference network based model

l Probabilistic models can be broadly classified into three groups :

◆ classical probabilistic model

◆ language model

◆ divergence from randomness model

**Classical Probabilistic Model**

l The key part is to estimate the probability of relevance of the documents for a query.

l A large number of weighting formulae attempt to measure document relevance probabilistically.

l BM25 seems to be the most effective weighting function from among them.

**Language Model**

l LM computes the probability of generating a query from a document, assuming that the query terms are chosen independently.

l Smoothing is crucial and three major smoothing techniques (Dirichlet, Jelinek-Mercer and Two-stage) are commonly used.

l One major deficiency with using a multinomial distribution as a language model is that all term occurrences are treated independently.

**Divergence from Randomness Model**

l The weight of a term is the amount of divergence between a term distribution produced by a random process and the actual term distribution.

l The weighting function of DFR is defined as

◆ Prob1 is computed using various well known distributions (such as Bose-Einstein statistics, Poisson distributions etc)

◆ Prob2 is measured using Laplace law of succession or the ratio of two Binomial distributions.

**Vector Space Models**

l Queries and documents are represented as the vectors of terms.

l The model measures the similarity between the query and document vector using cosine function.

l The central part is to determine the weight of the terms that are present in the query and the documents.

**TF-IDF Model: Probabilistic View**

l The idf factor of term t (idf(t)) measures the information gain of randomly picking a document that will fall in the elite set for t(E(t)).

l The tf factor of t for a document d (tff(t, d)) measures the relative weights of documents within E(t).

l If ntf(t, d) means normalized tf of t in d, tff(t, d) can be defined as

where X is the random variable on normalized tf values in E(t).

**Term Frequency Normalization**

l The normalization factors are called ritf (t, d) (relative intra-document frequency of term t in the document d) and lrtf (t, d) (length normalized frequency of term t in the document d).

l The terms mtf, adl and l(d) denote the mean term frequency of the document that contains t, the average document length of the collection and the length of the document d, and k (>=1) is a smoothing parameter.

**Maximum Value Model**

l We attempt to measure tff(t, d) based on the nature of some of the largest values of normalized tf for that term.

l This scoring is perfectly consistent with tf hypothesis, where the document having highest normalized tf gets highest weight.

l These problems are addressed using a sampling based technique which exclusively focuses on maximum values of samples.

l Rather than relying on a single value, we measure the distribution of values at the right tail where some of the largest values fall.

l We hypothesize that the most potentially relevant documents for a term fall on that part of the distribution.

**Example**

l Figure 1 shows that both the terms seem to be following long tail distributions with monotonically decreasing density functions.

l We consider two such long tail distributions, namely, exponential distribution and Pareto distribution.

l Let us assume that N samples, each of size n are drawn from the same population. From each sample we can get the largest value.

l The maximum value distribution is defined as follows. Let X1,X2,…Xn be independent and identically distributed random variable with distribution F. Let Mn = max(X1,X2, …Xn).

The above equation is equivalent to

l Fisher-Tippett-Gnedenko theorem states that if a pairs of real numbers (an, bn) (an and bn must be functions of n) exist such that an > 0 and

l For a distribution F, then D(x) can be Type I or Type II distribution

◆ The type I distribution (known as Gumbel distribution) is defined as

◆ Type II distribution (Frechet distribution) for positive random variable is defined as

l Next major goal is to verify that the maximum value distributions satisfy the mandatory preconditions in order to be applicable in our task.

l The data must be coming from a distribution F that satisfy Fisher-Tippett-Gnedenko theorem.

l The nature of the tails are different in these two cases.

l Case 1, suppose the data have been distributed from exponential distribution. , choose ,

l Case 2, suppose now the data have Pareto tail.

, if we set ,

l In case 1, maximum value distribution converges to Gumbel distribution. In case 2, it turned out to be Frechet distribution.

l The above results provide us the necessary evidence that the maximum value distributions can be applied on our data.

l If a term is more general (but not really stop words), the frequency distribution for that term likely to have a longer tail than that of more specific term.

l Thus, an attempt to model the distribution of a term using only one of Gumbel and Frechet may lead to lower accuracy.

l Thus, our resulting distribution is defined as

p can be considered as prior of .

l We make the value of p dependent on df. Specifically, if a term has low df (high idf) we give higher weight to .

l We formalize this intuition using the following well known linear model, where ( > 0) is a free parameter.

**Scoring Function**

l Our scoring function uses two aspect tf normalization in maximum value distribution framework.

l If X and Y be the random variables corresponding to ritf (t) and lrtf (t) in E(t), then tff(t, d) is defined as

where 0 < < 1, is the interpolation parameter.

l Consequently, the final scoring function for a query Q = q1q2…qn and a document d is defined as

where is set empirically.

**Model Parameter Estimation**

l We explain the difficulty using maximum likelihood estimate(MLE) with Gumbel distribution only (similar argument holds for Frechet).

l The log-likelihood function of Gumbel based on random sample x1,x2…xn is given by

l The system of differential equations(used for MLE)

l The following estimates for and

l And

**Parameter Estimation for Gumbel**

l The mean of Gumbel distribution is ,while the standard deviation is .

l To estimate the values of and we equate them with corresponding sample mean and standard deviation

where and s are sample mean and standard deviation respectively.

l Surprisingly, point estimate of as is does not perform well in practice. Thus, we use a linear transformation, , where z1 and z2 are set empirically to 2.5 and 0.04 respectively.

**Parameter Estimation for Frechet**

l Mean and variance for Frechet are dened respectively as

The above two expressions are not very convenient to use.

l Fortunately, median and mode of Frechet distribution have much manageable expressions.

l Median is defined as , and mode is defined as

l Median for a sample is easy to determine, we then take the median of the bin having highest frequency as our sample mode.

**Experiment Setup**

l The test collections are taken from TREC web tasks of recent years (2009-2012) as well as from million query 2009 (MQ-2009).

l The documents are crawled from web, Clueweb.B collection contains nearly 50 million documents, while ClueWeb.A collection contains approximately 500 million web pages.

l Documents and queries are stemmed via Porter stemmer. Stop words are removed from documents and queries.

l Statistically significant performance differences are determined using a paired t-test at 95% confidence level (p < 0.05).

l All our experiments are done using title field of the topics.