Paper Reading - Latent Aspect Rating Analysis on Review Text Data

kuniga.me > NP-Incompleteness > Paper Reading - Latent Aspect Rating Analysis on Review Text Data

Paper Reading - Latent Aspect Rating Analysis on Review Text Data

09 Oct 2020

In [1] Wang, Lu and Zhai present a supervised machine learning model to solve a problem they name Latent Aspect Rating Analysis. To explain the problem, let’s start with some context. Sites like Amazon, TripAdvisor or Audible have a review system where the user provides a rating and a review text.

The overall rating might be too vague, so it’s desirable to break down this rating per what is called aspect. The list of aspects depend on the use case: for logding, it could be room, cleaningless or location; for Audible it has performance and story. The model tries to infer these implicit (latent) ratings from only the overall rating and the review text.

The authors define a model that learns from training data and show promising results with test data. In this post we’ll focus on the theory presented in [1].

Overview of the solution

To solve the Latent Aspect Rating Analysis problem, first the authors build a correlation matrix between words and aspects by scanning all the texts from the reviews in the training dataset. This matrix is then used in tandem with both the overall and aspect ratings in the training dataset to construct a model which is then used to infer the aspect ratings for the test dataset. The inferred values are then compared to the actual values to assess the quality of the model.

Definitions

Aspect and associated keywords

We’ll have a fixed set of $k$ aspects (e.g. for hotels it could be location, cleaningness, price). We then let $A_i$ for $i \in [1, …, k]$ be the set of words associated with an aspect $i$. Note that these sets are not mutually exclusive, so for example “view” could be related to aspect “room” or “location”.

Word-aspect correlation

For each document $d \in D$, we have $W_d$ which is a $k \times n$ where $W_{dij}$ is the frequency of word $w_j$ for aspect $i$ normalized by the total number of the words in the text in that aspect, that is:

\(\sum_{j=1}^n W_{ijd} = 1\) for all documents $d$ and aspects $i$.

Aspect ratings and weights

The aspect rating for a document $d$, denoted by $s_d$, is a $k$-dimensional vector where $s_{di} \in [r_{min}, r_{max}]$ is the rating for corresponding to aspect $i$. For example, a user might rate a hotel’s location as 4 (out of 5), but its cleaningness as 1.

The aspect weight for a document $d$, denoted by $\alpha_d$ is also a $k$-dimensional vector where $\alpha_{di} \in [0, 1]$ and $\sum_{i=1}^k \alpha_{di} = 1$ and represents the weight the review $d$ puts on aspect $i$. For example, a user might put more importance of location than price.

These are the two quantities the model will try to predict once it’s properly trainted. Note that for a document $d$:

\[(1) \quad r_d = \sum_{i=1}^{k} \alpha_{di} s_{di}\]

The Model

Word-aspect correlation

We’ll gloss over the methodology for building the matrix $W_d$ and present it only at a high level. The idea is to seed each aspect with initial keywords. Then use a clustering algorithm to include more words into each of the aspects based on the reviews texts. Refer to [1] for more details.

Aspect ratings

The model defines a parameter $\beta$ representing the overall rating associated with a (word-aspect) pair. This can be thought of as the average rating a user would to aspect $i$ solely based on word $j$, so $\beta_{ij} \in [r_{min}, r_{max}]$.

Since \(\sum_{j=1}^n W_{ijd} = 1\), the model assumes that for document $d$ and aspect $i$, $s_{di}$ is a linear combination of $\beta_i$ and $W_{di}$:

\[(2) \quad s_{di} = \sum_{j=1}^n \beta_{ij} W_{dij}\]

Overal ratings

In theory $r_d$ can be obtained by (1), but to encode uncertainty and allow variance, the model assumes $r_d$ is drawn from a Gaussian distribution, with mean (1) and some variance $\delta$ which will be another parameter in the model.

\[r_{d} \sim N(\sum_{i=1}^{k} \alpha_{di} s_{di}, \delta^2)\]

Exapanding $s_{di}$ using (2):

\[(3) \quad r_{d} \sim N(\sum_{i=1}^{k} \alpha_{di} \sum_{j=1}^{n} \beta_{ij} W_{dij}, \delta^2)\]

Correlated aspect weights

The authors argue that aspect weights are not completely independent (e.g. a preference for cleaninliness is correlated with a preference for room, maybe less so for price).

To model this, we assume $\alpha_{d}$ is drawn from a multi-variate Gaussian distribution:

\[\alpha_{d} \sim N(\mu, \Sigma)\]

where $\mu$ is a $k$ dimensional vector and $\Sigma$ a $k \times k$ variance matrix encoding the correlation between pairs of aspects.

Objective function

Let’s bundle the parameters needed to compute $r_d$ into $\Theta = (\mu, \Sigma, delta^2, \beta)$. We then define the likelihood of observing $D$ (i.e. the overall and aspect ratings) as

\[p(D \mid \Theta)\]

we need to find $\hat \Theta = \mbox{argmax}_\Theta \, p(D \mid \Theta)$.

Solving the model

We’ll use an interactive algorithm. We start with an initial estimate $\Theta_0$ for $\Theta$.

Because $\alpha_d$ is also considered a random variable, we can estimate it from $\Theta_t$. We then compute $\Theta_{t+1}$ assuming $\alpha_d$ is constant. We keep iterating until $\Theta_t$ converges.

Estimating aspect weights

Given a rating $r_{d}$, how can we estimate $\alpha_{d}$, that is $p(\alpha_d \mid r_d)$? In particular what is the most probable value of $\alpha_{d}$? We can use the Maximum a Posteriori Estimate (MAP) [2] method to find:

\[\hat \alpha_{d} = \mbox{argmax}_{\alpha_d} p(\alpha_d \mid r_d)\]

Which we show in Appendix A,

\[(4) \quad \hat \alpha_{d} = \mbox{argmax}_{\alpha_d} \bigg[ - \frac{(r_d - \alpha_d^T s_d)^2}{2 \delta^2} -\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu) \bigg]\]

We can use a non-linear optimization such as L-BFGS to solve this, using the gradient of (4):

\[- \frac{(r_d - \alpha_d^T s_d) s_d}{\delta^2} - \Sigma^{-1} (\alpha_d - \mu)\]

Estimating parameters $\mu$ and $\Sigma$

To estimate the $\Theta$ we’ll use Maximum Likelihood Estimate (MLE). Note that since there’s no constraint limiting each of $\mu$, $\Sigma$, $\beta$ and $\delta$, we can estimate them individually.

First we estimate $\mu$ and $\Sigma$ to maximize the likelihood of observing $\alpha$, $p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma)$.

We show in Appendix A that if

\[\mu_{MLE} = \mbox{argmax}_{\mu} p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma)\]

then

\[\mu_{MLE} = \frac{1}{\mid D \mid} \sum_{d \in D} \alpha_d\]

It can be shown [4] that if

\[\Sigma_{MLE} = \mbox{argmax}_{\mu} p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma)\]

then

\[\Sigma_{MLE} = \frac{1}{\mid D \mid} \sum_{d \in D} (\alpha_d - \mu_{MLE})(\alpha_d - \mu_{MLE})^{T}\]

Summarizing, from a iteractive algorithm perspective, we’re given $\alpha$ computed previously and we can estimate the next iteration $\mu_{t+1}$ and $\Sigma_{t+1}$:

\[\mu_{t+1} = \frac{1}{\mid D \mid} \sum_{d = 1}^{\mid D \mid} \alpha_d\] \[\Sigma_{t+1} = \frac{1}{\mid D \mid} \sum_{d \in D} (\alpha_d - \mu_{t+1})(\alpha_d - \mu_{t+1})^{T}\]

Estimating parameters $\beta$ and $\delta^2$

Now we estimate $\beta$ and $\delta^2$ to maximize the likelihood of observing $r$, $p(r_1, \cdots, r_D \mid \beta, \delta^2)$.

We show in Appendix A that if:

\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} p(r_1, \cdots, r_D \mid \beta, \delta^2)\]

then

\[\delta_{MLE}^2 = \frac{r_d - \alpha_d^T s_d}{\mid D \mid}\]

For $\beta$ we need to expand $\alpha_d^T s_d$ since it’s a function of $\beta$, that is $\alpha_d^T s_d = \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di}$. This give us:

\[\beta_{MLE} = \mbox{argmax}_{\beta} \sum_{d \in D} \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2} (\frac{r_d - \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di}}{\delta})^2\]

Discarding the terms independent of $\beta$:

\[\beta_{MLE} = \mbox{argmax}_{\beta} - \sum_{d \in D} (r_d - \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di})^2\]

The authors claim the closed form of this optimization requires inverting a $n \times n$, which is costly. The proposal is to also used a non-linear optimization such as L-BFGS like we did for $\hat \alpha_{d}$. The $j$-th component of the gradient of the log-likelihood function above is:

\[\frac{\partial \mathcal{L}({\beta})}{\partial \beta_j} = \sum_{d \in D} \big( (r_d - \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di}) \alpha_{dj} W_{dj} \big)\]

Summarizing, from a iteractive algorithm perspective, we’re given $\alpha$ computed previously and we can estimate the next iteration $\delta^2_{t+1}$ and $\beta{t+1}$:

\[\delta^2_{t+1} = \frac{r_d - \alpha_d^T s_d}{\mid D \mid}\]

and

\[\beta_{t+1} = \mbox{LBFGS}(\beta_{t}, r, \alpha, s, W)\]

Conclusion

In this post we studied the theory of the paper Latent Aspect Rating Analysis on Review Text Data. I also planned to work on a Python implementation but it will have to wait. It’s the first machine learning paper I can remember reading and it was interesting to see how much calculus is involved. I had past experience with linear (discrete and continuous) optimization, which involves mostly basic linear algebra.

In studying this post I had to watch many basic machine leaning math, via the mathematicakmonk channel. I also realized how rusty I am in regards to basic derivatives operations, so I’ll need a refresher.

References

Appendix A

Finding $\hat \alpha_{d}$

Here we derive $\hat \alpha_{d}$ as a function of $\Theta$.

\[\hat \alpha_{d} = \mbox{argmax}_{\alpha_d} p(\alpha_d \mid r_d)\]

Using Bayes’ Rule, we have

\[p(\alpha_d \mid r_d) = \frac{p(r_d \mid \alpha_d) p(\alpha_d)}{p(r_d)}\]

Since $\log$ is a monotomic function and $p(r_d)$ does’t depend on $\alpha_{d}$, finding $\alpha_{d}$ that maximizes the above is equal to maximizing

\[\hat \alpha_{r_d} = \mbox{argmax}_{\alpha_d} \log p(r_d \mid \alpha_d) + \log p(\alpha_d)\]

Let’s consider $p(r_d \mid \alpha_d)$ first. We know it’s the Gaussian with mean $\alpha_d^T s_d$ and variance $\delta^2$ (3), so by the definition of a Gaussian,

\[p(r_d \mid \alpha_d) = \frac{1}{\delta \sqrt{2 \pi}} \exp(-\frac{1}{2} (\frac{r_d - \alpha_d^T s_d}{\delta})^2)\]

Applying log we get,

\[(5) \quad \log p(r_d \mid \alpha_d) = \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2} (\frac{r_d - \alpha_d^T s_d}{\delta})^2\]

The first term is independent of alpha, so we’re left with

\[\frac{(r_d - \alpha_d^T s_d)^2}{2 \delta^2}\]

Let’s now look at $p(\alpha_d)$. It’a multi-variate Gaussian, so by definition:

\[(G) \quad p(\alpha_d) = \frac{\exp(-\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu))}{\sqrt{(2 \pi)^{k} \mid \Sigma \mid}}\]

Where $k$ is the dimension of $\alpha_d$, $\mu$ and $\Sigma$ ($k \times k$) and $\mid \Sigma \mid$ is the determinant of $\Sigma$. If we apply $\log$ and discard terms unrelated to $\alpha_d$ we get

\[-\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu)\]

Putting everything together, we want to find

\[(4) \quad \hat \alpha_{d} = \mbox{argmax}_{\alpha_d} \bigg[ - \frac{(r_d - \alpha_d^T s_d)^2}{2 \delta^2} -\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu) \bigg]\]

Finding $\mu_{MLE}$

We have that

\(\mu_{MLE} = \mbox{argmax}_{\mu} p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma) = \prod_{d \in D} p(\alpha_d \mid \mu, \Sigma)\).

Recalling $\alpha_d$ is sampled from a multi-variate Gaussian. Sing $\log$ is a monotonically increasing function, we can take the $\log$ on the right side of the equation:

\[\quad \mu_{MLE} = \mbox{argmax}_{\mu} -\sum_{d \in D} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu)\]

We can take the derivative of with respect to $\mu$ to obtain the gradient:

\[\sum_{d \in D} (\alpha_d - \mu) \Sigma^{-1}\]

and set the gradient to 0 for which $\mu$ is optimal, and multiply both sides by $\Sigma$ to obtain:

\[0 = \sum_{d \in D} (\alpha_d - \mu)\]

which implies

\[\sum_{d \in D} \mu = D \mu = \sum_{d \in D} \alpha_d\]

so

\[\mu_{MLE} = \frac{1}{\mid D \mid} \sum_{d \in D} \alpha_d\]

Finding $\delta_{MLE}$

We have that

\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} p(r_1, \cdots, r_D \mid \beta, \delta^2) = \prod_{d \in D} p(r_d \mid \beta, \delta^2)\]

Since $\log$ is a monotomic function, we have,

\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} \sum_{d \in D} \log p(r_d \mid \beta, \delta^2)\]

Recalling $r_d$ is sampled from a univariate Gaussian (3) with some a mean that is a function of $\beta$ (say $f(\beta)$) and variance $\delta^2$ (3), so by the definition of a Gaussian,

\[p(r_d \mid \beta, \delta^2) = \frac{1}{\delta \sqrt{2 \pi}} \exp(-\frac{1}{2} (\frac{r_d - f(\beta)}{\delta})^2)\]

Applying log,

\[\log p(r_d \mid \beta, \delta^2) = \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2} (\frac{r_d - f(\beta)}{\delta})^2\]

Putting it back in the original sum:

\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} \mid D \mid \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2\delta^2} \sum_{d \in D} (r_d - f(\beta))^2\]

If we set $y = x^2$, $a = \mid D \mid$, $b = \sqrt(2\pi)$ and $c = \sum_{d \in D} (r_d - f(\beta))^2$ we can put it in a form that is easier to derive:

\[a \log(\frac{1}{\sqrt(y) b}) - \frac{c}{2y}\]

If we derive on $y$,

\[- \frac{a}{2y} + \frac{c}{2y^2}\]

Setting to 0, we can solve for $y$ to get:

\[y = \frac{c}{a}\]

Replacing the terms back:

\[\delta_{MLE}^2 = \frac{\sum_{d \in D} (r_d - f(\beta))^2}{\mid D \mid}\]
Tags: artificial intelligence numerical optimization