MLE, MAP, and some distributions

  • A little scattered today: two generic (optimizer- and distribution-independent) inference techniques, and then some more probabilistic background material before diving further into generative modeling
  • MLE: maximum likelihood estimation (final review from last time)
  • MAP: maximum a posteriori estimation: keeping the prior in the picture
    • Go a little bit deeper on this today, apply to more general models
  • Three common loss functions and their probabilistic derivation
  • Then... JK, we'll do it next time.

MLE

  • As the name claims: we maximize the data likelihood. Okay. If you take anything away, it's that -- and we do only that, nothing else.
  • Recall the general setup (phrased in supervised learning language -- a good exercise is to rewrite this for the unsupervised case).
    • Discriminative data likelihood $y \sim p_x(y | z)$, and we probably have a dataset of samples $\mathcal D = \{y_i, x_i\}_i$.
    • We rewrite the posterior density as $$ p_x(z | y) \propto p_x(y | z) p(z), $$ and setting our priors to the (probably improper) $p(z) \propto 1$, we crudely estimate $$ p_x(z | y) \propto p_x(y | z), $$ and hence attempt to solve $\max_z p_x(y | z)$.
  • Implementation-wise, this is pleasant: $\mathcal D$ is iid, so this trivially becomes $$ \begin{aligned} \max_z p_x(y | z) &= \max_z \prod_i p_{x_i}(y_i | z) \\ &= \max_z \sum_{i}\log p_{x_i}(y_i | z) \\ &\propto E_{(y_i, x_i) \sim \mathcal D}[ \log p_{x_i}(y_i | z)] \end{aligned} $$ where the second line holds because $\log$ is monotone (same maximizer), and the constant of proportionality in the last line doesn't matter as far as optimization goes.
  • Thinking back to our optimization lectures, we can approximate this average by averages over minibatches a la stochastic gradient descent... all very pleasant; a quick sketch follows.
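
As an illustration (a minimal sketch, not from the lecture -- the linear model and all names are made up for the example), here is minibatch MLE for $y_i \sim \text{Normal}(w \cdot x_i, 1)$, ascending the average log-likelihood by stochastic gradient steps:

```python
# Minibatch MLE sketch: y_i ~ Normal(w . x_i, 1), so the log-likelihood is
# -(1/2) sum_i (y_i - w . x_i)^2 + const, and its per-point gradient in w
# is (y_i - w . x_i) x_i.  We ascend minibatch averages of that gradient.
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + rng.normal(size=N)       # unit noise variance

w = np.zeros(d)                           # the parameter "z" in the notes
lr, batch = 0.01, 32
for step in range(5000):
    idx = rng.integers(0, N, size=batch)  # sample a minibatch
    grad = X[idx].T @ (y[idx] - X[idx] @ w) / batch
    w += lr * grad                        # stochastic gradient *ascent*

print(np.round(w_true, 2), np.round(w, 2))  # should roughly agree
```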

MLE

  • If constants of proportionality don't matter, then do we need the normalization constant? I.e., if we have $p_\theta(x) = \frac{1}{\mathcal Z(\theta)} f_\theta(x)$, then $$ \log p_\theta(x) = \log f_\theta(x) - \log \mathcal Z(\theta) $$
  • So, yes, we need it -- but maybe only some of it. Any purely constant part of $\mathcal Z(\theta)$ can be dropped.
    • E.g.: normal data likelihood, i.e. $$ p(y | x, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left\{ -\frac{1}{2}\left(\frac{y - x}{\sigma}\right)^2\right\} $$ Then log-likelihood is $$ \begin{aligned} \log p(y | x, \sigma) &= -\frac{1}{2\sigma^2}(y - x)^2 - \log \sigma {\color{red}{- \log \sqrt{2\pi}}} \end{aligned} $$
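
A quick numerical check of this (illustrative, not part of the notes): dropping the red constant shifts the whole log-likelihood curve down but leaves its maximizer alone.

```python
# Dropping the constant -log(sqrt(2*pi)) changes the log-likelihood values
# by a constant, but not the argmax over x.
import numpy as np

y, sigma = 1.7, 0.5
xs = np.linspace(-3, 3, 601)

full    = -0.5 * ((y - xs) / sigma)**2 - np.log(sigma) - np.log(np.sqrt(2*np.pi))
dropped = -0.5 * ((y - xs) / sigma)**2 - np.log(sigma)

print(xs[np.argmax(full)], xs[np.argmax(dropped)])             # same maximizer (~1.7)
print(np.allclose(full - dropped, -np.log(np.sqrt(2*np.pi))))  # constant gap: True
```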

MAP

  • Unfortunately, we totally ditched our priors to do this. Based on the extensive discussion we had last time, it should be clear that throwing away the prior isn't usually helpful, and can potentially be very harmful.
  • Instead, do the same optimization procedure where we only care about finding the $z$ that maximizes something, but now maximize something proportional to the true posterior density instead of just the likelihood: $$ \max_z p(z|x) \propto p(x | z) p(z) $$ Using our previous notation, we now try to solve $$ \begin{aligned} \max_z p_x(z | y) &= \max_z p(z) \prod_i p_{x_i}(y_i | z) \\ &= \max_z [\log p(z) + \sum_i \log p_{x_i}(y_i | z)] \end{aligned} $$ (equalities again up to constants and monotone transformations); a worked example follows.
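  • A worked example (added here for illustration; conjugate so everything is closed-form): take scalar observations $y_i \sim \text{Normal}(z, 1)$, $i = 1, \dots, N$, with prior $z \sim \text{Normal}(0, \tau^2)$. The objective becomes $$ \log p(z) + \sum_i \log p(y_i | z) = -\frac{z^2}{2\tau^2} - \frac{1}{2}\sum_i (y_i - z)^2 + \text{const}, $$ and setting the $z$-derivative to zero gives $$ \hat z_{\text{MAP}} = \frac{\sum_i y_i}{N + 1/\tau^2}, $$ i.e. the MLE $\bar y$ shrunk toward the prior mean $0$; as $\tau \to \infty$ (the improper flat prior) we recover the MLE.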

MAP

  • With suitable rescaling, this is where regularization comes from:

    • Want to write last eqn as

      $$ \max_z \{\ \log p(z) + E_{(y_i, x_i)\sim \mathcal D}[\log p_{x_i}(y_i | z)]\ \} $$ (for batch sanity, etc.) -- but dividing the sum by $N$ to get that expectation also scales the prior term by $1/N$, so the scaling isn't quite right on the prior

    • Instead, maybe think about restated Bayes's rule: $$ (*)\qquad p_\lambda(z | x) = \frac{p(x|z) p(z)^\lambda}{p_\lambda(x)}, $$ where we now have $$ p_\lambda(x) = \int\ dz\ p(x|z)\ p(z)^{\lambda} $$
    • Taking the log of $(*)$ and applying the MAP procedure gives $$ (**)\qquad \max_z \{\ \lambda \log p(z) + E_{(y_i, x_i)\sim \mathcal D}[\log p_{x_i}(y_i | z)]\ \}, $$ so tuning $\lambda$ -- the so-called "regularization parameter" -- just means doing MAP under the tempered prior $p(z)^\lambda$; in particular, $\lambda = 1/N$ makes $(**)$ exactly $1/N$ times the original MAP objective (same maximizer).
  • Ridge regression? Lasso? (A Gaussian prior on $z$ gives the ridge penalty $\lambda \lVert z \rVert_2^2$; a Laplace prior gives the lasso penalty $\lambda \lVert z \rVert_1$ -- see the sketch below.)
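
To make the ridge connection concrete (a sketch with made-up data; assuming unit noise variance): a Gaussian prior $z \sim \text{Normal}(0, \tau^2 I)$ turns the MAP objective for linear regression into $\min_z \lVert y - Xz \rVert_2^2 + \lambda \lVert z \rVert_2^2$ with $\lambda = 1/\tau^2$, which has a closed form; a Laplace prior would give the lasso penalty $\lambda \lVert z \rVert_1$ instead (no closed form).

```python
# Ridge regression as MAP (sketch): with likelihood y ~ Normal(X z, I) and
# prior z ~ Normal(0, tau^2 I), maximizing the log-posterior is minimizing
#   ||y - X z||^2 + lam ||z||^2   with lam = 1 / tau^2,
# whose minimizer solves the normal equations (X^T X + lam I) z = X^T y.
import numpy as np

def ridge_map(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

print(ridge_map(X, y, lam=0.0))    # lam = 0: flat prior, i.e. the MLE
print(ridge_map(X, y, lam=10.0))   # stronger prior: estimates shrink toward 0
```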

MAP

  • Another way of going about regularization is via a prior distribution's scale parameter. Suppose $p(z|\sigma)$ depends on $\sigma$ as $$ p(z|\sigma) = \frac{1}{\sigma}p(z/\sigma) $$ Then $\sigma$ is called a scale parameter.
    • For example, the normal distribution: $$ p(z|\sigma) = \frac{1}{\sqrt{2\pi{\color{red}{\sigma^2}}}} \exp\left\{-\frac{1}{2}\left(\frac{z}{{\color{red}{\sigma}}}\right)^2\right\} = \frac{1}{\sigma}p(z/\sigma), $$ where $p(z) = \frac{1}{\sqrt{2\pi}}\exp\{ - z^2/2\}$.
  • You will explore this more in the homework.
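
A one-line numerical check of the identity for the normal case (illustrative; the helper below is made up for the example):

```python
# Check the scale-parameter identity p(z | sigma) = (1/sigma) p(z / sigma)
# for the normal density, where p(.) is the standard normal pdf.
import numpy as np

def normal_pdf(z, sigma=1.0):
    return np.exp(-0.5 * (z / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

z, sigma = 1.3, 2.5
print(np.isclose(normal_pdf(z, sigma), normal_pdf(z / sigma) / sigma))  # True
```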

Common loss functions

  • Connect back most explicitly to your previous ML knowledge -- $L_2$, $L_1$, and cross entropy loss (and many others!) all fall out of distributional estimators or probabilistic considerations.
  • General setup: we have $\hat y = f_\theta(x)$ where $f$ is some generic deterministic model (decision tree, NN, ...). We are trying to minimize a loss function $\ell(y, \hat y) = \ell (y, f_\theta(x))$.
    • $L_2$ loss: $y_i \sim \text{Normal}(f_\theta(x_i), \sigma^2)$. Pre-normalize the data so that the variance $= 1$ (though this actually doesn't necessarily matter, depending on model structure); then, up to additive constants and scale, the negative log-likelihood is $$ -\log p_x(y | \theta) = \frac{1}{2}\sum_i(y_i - f_\theta(x_i))^2 + \text{const} \propto \left\lVert y - f_\theta(x) \right\rVert_2^2 = \ell(y, \hat y) $$
    • $L_1$ loss: repeat the same argument, but with the likelihood given by the Laplace distribution, $p(y | \mu, b) = \frac{1}{2b}\exp\{-|y - \mu|/b\}$, whose negative log-likelihood is $\sum_i |y_i - f_\theta(x_i)|$ up to constants.
  • Cross-entropy: every observed datapoint $y_i$ is now a probability distribution over $k$ categories, albeit potentially a degenerate one (think about that for a second...), and so is the model prediction $\hat y_i = f_\theta(x_i)$. One way to proceed is to minimize the surprisal $\mathcal I(p) = -\log p$ of the model predictions when evaluated under the true distribution (writing $f_\theta(x_i)_j$ for the $j$-th component of the prediction): $$ \begin{aligned} \min_\theta E_{(y_i, x_i) \sim \mathcal D}[ E_{j \sim y_i}[\mathcal I(f_\theta(x_i)_j)]] &= \min_\theta - \sum_{i,j} y_{ij} \log f_\theta(x_i)_j \\ &= \max_\theta \sum_{i,j} y_{ij} \log f_\theta(x_i)_j \\ &= \max_\theta \prod_i\left(\prod_j f_\theta(x_i)_j^{y_{ij}} \right)\\ &= \max_\theta \prod_i p(y_i | f_\theta(x_i)), \end{aligned} $$ a maximum likelihood estimator again... (this time using a categorical likelihood, which we'll talk about presently). A numerical check follows.
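
A small sanity check of that equivalence (a sketch; the softmax model and labels are made up for the example):

```python
# Cross-entropy of softmax predictions equals the negative log categorical
# likelihood, matching the derivation above.  y is one-hot: the "degenerate"
# distributions over k classes mentioned in the notes.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))   # model outputs f_theta(x_i): 4 points, k = 3
probs = softmax(logits)
y = np.eye(3)[[0, 2, 1, 0]]        # one-hot labels y_i

cross_entropy = -(y * np.log(probs)).sum(axis=1).mean()
neg_log_lik = -np.log(probs[np.arange(4), y.argmax(axis=1)]).mean()
print(np.isclose(cross_entropy, neg_log_lik))  # True: same objective
```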