- A little scattered today: two generic (optimizer / distribution independent) inference techniques, and then some more probabilistic background material before diving further into generative modeling
- MLE: maximum likelihood estimation (final review from last time)
- MAP: maximum *a posteriori* estimation: keeping the prior in the picture
- Go a little bit deeper on this today, apply to more general models

- Three common loss functions and their probabilistic derivation
- Then... JK, we'll do it next time.

- As the name claims: we maximize the data likelihood. Okay. If you take anything away, it's that -- and we do *only* that, nothing else.
- Recall the general setup (phrased in supervised-learning language -- a good exercise is to rewrite this for the unsupervised case).
- Discriminative data likelihood $y \sim p_x(y | z)$, and we probably have a dataset of samples $\mathcal D = \{(y_i, x_i)\}_i$.
- We rewrite the posterior density as $$ p_x(z | y) \propto p_x(y | z) p(z), $$ and setting our priors to the (probably improper) $p(z) \propto 1$, we crudely estimate $$ p_x(z | y) \propto p_x(y | z), $$ and hence attempt to solve $\max_z p_x(y | z)$.

- Implementation-wise, this is pleasant: $\mathcal D$ is iid, so this trivially becomes $$ \begin{aligned} \max_z p_x(y | z) &= \max_z \prod_i p_{x_i}(y_i | z) \\ &= \max_z \sum_{i}\log p_{x_i}(y_i | z) \\ &\propto E_{(y_i, x_i) \sim \mathcal D}[ \log p_{x_i}(y_i | z)] \end{aligned} $$ since the constant of proportionality doesn't matter as far as optimization goes.
- Recalling our optimization lectures, we can approximate this average by averages over minibatches *a la* stochastic gradient descent... all very pleasant.
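The minibatch MLE recipe above can be sketched in a few lines of numpy. This is a toy illustration under made-up assumptions: data $y_i \sim \text{Normal}(z, 1)$ (no $x$ dependence, so the "model" is just the unknown mean $z$), estimated by gradient ascent on minibatch average log-likelihoods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y_i ~ Normal(z_true, 1); we recover z by ascending
# the minibatch average log-likelihood, a la SGD.
z_true = 3.0
y = rng.normal(z_true, 1.0, size=1000)

def grad_avg_log_lik(y_batch, z):
    # d/dz of mean_i [ -(y_i - z)^2 / 2 ]  (constants dropped, as in the notes)
    return np.mean(y_batch - z)

z, lr = 0.0, 0.1
for _ in range(500):
    batch = rng.choice(y, size=32)          # minibatch of the iid dataset
    z += lr * grad_avg_log_lik(batch, z)    # stochastic gradient ascent step

# For a Gaussian likelihood the MLE is the sample mean, so z lands near it.
print(z, y.mean())
```

The constant of proportionality (here, the $1/|\text{batch}|$ inside the mean) only rescales the step size, which is why it can be absorbed into the learning rate.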

- If constants of proportionality don't matter, then do we need the normalization constant? I.e., if we have $p_\theta(x) = \frac{1}{\mathcal Z(\theta)} f_\theta(x)$, then $$ \log p_\theta(x) = \log f_\theta(x) - \log \mathcal Z(\theta) $$
- So, yes, we need it -- but maybe only some of it. Any purely *constant* part of $\mathcal Z(\theta)$ can be dropped.
- E.g., a normal data likelihood: $$ p(y | x, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left\{ -\frac{1}{2}\left(\frac{y - x}{\sigma}\right)^2\right\} $$ Then the log-likelihood is $$ \begin{aligned} \log p(y | x, \sigma) &= -\frac{1}{2\sigma^2}(y - x)^2 - \log \sigma {\color{red}{- \log \sqrt{2\pi}}} \end{aligned} $$

- Unfortunately, we totally ditched our priors to do this. Based on the extensive discussion we had last time, it seems clear that doing so is not usually *helpful*, and could potentially be very harmful.
- Instead, run the same optimization procedure -- we only care about finding the $z$ that *maximizes* something -- but now maximize something proportional to the true posterior density instead of just the likelihood: $$ \max_z p(z|x) \propto p(x | z) p(z) $$ Using our previous notation, we'll now try to solve $$ \begin{aligned} \max_z p_x(z | y) &= \max_z p(z) \prod_i p_{x_i}(y_i | z) \\ &= \max_z \left[\log p(z) + \sum_i \log p_{x_i}(y_i | z)\right] \end{aligned} $$

- With suitable rescaling, this is where *regularization* comes from:
- We'd like to write the last equation as $$ \max_z \{\ \log p(z) + E_{(y_i, x_i)\sim \mathcal D}[\log p_{x_i}(y_i | z)]\ \} $$ (for batch sanity, etc.), but that scaling isn't quite right on the prior: the expectation divides the data term by $N$, while the prior term keeps its full weight.

- Instead, maybe think about restated Bayes's rule: $$ (*)\qquad p_\lambda(z | x) = \frac{p(x|z) p(z)^\lambda}{p_\lambda(x)}, $$ where we now have $$ p_\lambda(x) = \int\ dz\ p(x|z)\ p(z)^{\lambda} $$
- Log-ing $(*)$ and applying the MAP procedure gives $$ (**)\qquad \max_z \{\ \lambda \log p(z) + E_{(y_i, x_i)\sim \mathcal D}[\log p_{x_i}(y_i | z)]\ \}, $$ so by tuning $\lambda$ -- the so-called "regularization parameter" -- we can make this proportional to the MAP objective (e.g., $\lambda = 1/N$ restores the original balance between prior and data).

- Ridge regression? Lasso?
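For the ridge case, here's a minimal numpy check under made-up assumptions (synthetic linear data, $\lambda = 0.8$, unit noise variance): the closed-form ridge solution coincides with the minimizer of the negative log posterior for a Gaussian likelihood with a Gaussian prior $w \sim \text{Normal}(0, I/\lambda)$. (The lasso is the same story with a Laplace prior.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear-model data: y = X w_true + noise.
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 0.8  # regularization strength = relative weight of the prior

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2  (closed form).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP view: Gaussian likelihood (sigma = 1) and prior w ~ N(0, I/lam) give
# -log posterior = ||y - Xw||^2 / 2 + lam * ||w||^2 / 2 + const.
# Minimize it directly by gradient descent and compare.
w = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (X @ w - y) + lam * w
    w -= 1e-3 * grad

print(w_ridge, w)  # same minimizer, two derivations
```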

- Another way of going about regularization is via a prior distribution's *scale parameter*. Suppose $p(z|\sigma)$ depends on $\sigma$ as $$ p(z|\sigma) = \frac{1}{\sigma}p(z/\sigma) $$ Then $\sigma$ is called a scale parameter.
- For example, the normal distribution: $$ p(z|\sigma) = \frac{1}{\sqrt{2\pi{\color{red}{\sigma^2}}}} \exp\left\{-\frac{1}{2}\left(\frac{z}{{\color{red}{\sigma}}}\right)^2\right\} = \frac{1}{\sigma}p(z/\sigma), $$ where $p(z) = \frac{1}{\sqrt{2\pi}}\exp\{ - z^2/2\}$.
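The scale-parameter identity is easy to verify numerically for the normal example (arbitrary grid and $\sigma$ chosen just for illustration):

```python
import numpy as np

def std_normal_pdf(z):
    # p(z) = exp(-z^2 / 2) / sqrt(2*pi), the sigma = 1 "base" density
    return np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

def normal_pdf(z, sigma):
    # p(z | sigma), with sigma^2 in the normalizer and sigma in the exponent
    return np.exp(-0.5 * (z / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

z = np.linspace(-4.0, 4.0, 17)
sigma = 2.5

# The identity p(z | sigma) = (1/sigma) * p(z / sigma) holds pointwise:
print(np.allclose(normal_pdf(z, sigma), std_normal_pdf(z / sigma) / sigma))
```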

- You will explore this more in the homework.

- Connect back most explicitly to your previous ML knowledge -- $L_2$, $L_1$, and cross entropy loss (and many others!) all fall out of distributional estimators or probabilistic considerations.
- General setup: we have $\hat y = f_\theta(x)$ where $f$ is some generic deterministic model (decision tree, NN, ...). We are trying to minimize a loss function $\ell(y, \hat y) = \ell (y, f_\theta(x))$.
- $L_2$ loss: $y_i \sim \text{Normal}(f_\theta(x_i), \sigma^2)$. Pre-normalize the data so that the variance $= 1$ (though this actually doesn't necessarily matter depending on model structure); then, up to constants, we have the negative log-likelihood $$ - \log p_x(y | \theta) = \sum_i (y_i - f_\theta(x_i))^2 = \left\lVert y - f_\theta(x) \right\rVert_2^2 = \ell(y, \hat y) $$
- $L_1$ loss: repeat the same argument but with the likelihood function given by the Laplace distribution.
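A small numerical check of both claims, with made-up targets and predictions standing in for $y$ and $f_\theta(x)$: the Gaussian and Laplace negative log-likelihoods equal the $L_2$ and $L_1$ losses up to additive constants (and the usual factor of $1/2$ for the Gaussian).

```python
import numpy as np

# Hypothetical targets and model outputs (stand-ins for f_theta(x_i)).
y = np.array([0.3, -1.2, 2.0])
yhat = np.array([0.1, -1.0, 2.5])

def gauss_nll(y, mu):
    # -log Normal(y; mu, sigma=1), summed over datapoints
    return np.sum(0.5 * (y - mu) ** 2 + 0.5 * np.log(2 * np.pi))

def laplace_nll(y, mu):
    # -log Laplace(y; mu, b=1) = |y - mu| + log 2, summed over datapoints
    return np.sum(np.abs(y - mu) + np.log(2.0))

n = len(y)
# Subtracting the constants leaves exactly the (scaled) L2 and L1 losses:
print(np.isclose(gauss_nll(y, yhat) - n * 0.5 * np.log(2 * np.pi),
                 0.5 * np.sum((y - yhat) ** 2)))
print(np.isclose(laplace_nll(y, yhat) - n * np.log(2.0),
                 np.sum(np.abs(y - yhat))))
```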

- Cross-entropy: every observed datapoint $y_i$ is now a probability distribution over $k$ categories, albeit potentially a degenerate one (think about that for a second...), and so is the model prediction $\hat y_i = f_\theta(x_i)$. One way to compare them is to minimize the expected surprisal of the model predictions when they're evaluated under the true probability distribution:
$$
\begin{aligned}
\min_\theta E_{(y_i, x_i) \sim \mathcal D}[ E_{j \sim y_i}[\mathcal I(f_\theta(x_i)_j)]]
&= \min_\theta - \sum_{i,j} y_{ij} \log f_\theta(x_i)_j \\
&= \max_\theta \sum_{i,j} y_{ij} \log f_\theta(x_i)_j \\
&= \max_\theta \prod_i\left(\prod_j f_\theta(x_i)_j^{y_{ij}} \right)\\
&= \max_\theta \prod_i p(y_i | f_\theta(x_i)),
\end{aligned}
$$
a maximum likelihood estimator again... (this time using a *categorical* likelihood, which we'll talk about presently).
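The cross-entropy/categorical-likelihood equivalence in the derivation above can be checked directly, using hypothetical one-hot labels and model probabilities standing in for $f_\theta(x_i)$:

```python
import numpy as np

# One-hot ("degenerate") label distributions over k = 3 categories, and
# made-up model probabilities standing in for f_theta(x_i).
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

# Cross-entropy: expected surprisal of the predictions under the labels.
ce = -np.sum(y * np.log(p))

# Categorical negative log-likelihood of the observed classes (0 and 2).
nll = -(np.log(p[0, 0]) + np.log(p[1, 2]))

print(np.isclose(ce, nll))  # the two objectives coincide for one-hot labels
```

The one-hot case makes the inner expectation $E_{j \sim y_i}$ trivial: it just picks out the observed class, which is why the sum collapses to a log-likelihood.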