- What the 2-gram says...
- The world around us exists. We will never be able to understand it fully, but we might be able to describe it with some accuracy, and if we're lucky and good at what we do, with some precision.
- The scientific method, embodied by Bayes's theorem, allows us to propose structure models of the world and score them based on the data that we have at hand.

- We must now understand *what* these models can possibly look like.
- In this course, models are always *generative* -- we will always be able to generate data from our models by writing an algorithm to do so. (We might need to specify additional parameters too, such as exogenous observed data. More on that later.)

- Generative model -> we can write an algorithm to generate from the model -> we should be able to *order* the model components somehow. More formalization needed.

In [2]:

```
import matplotlib.image as img
import matplotlib.pyplot as plt

image = img.imread("./plots/2021-02-17topo-sort.png")
fig, ax = plt.subplots(figsize=(8, 8))
ax.axis('off')
ax.imshow(image)
plt.show()
```
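The ordering that the figure illustrates can be computed directly. Here is a minimal sketch using the standard library's `graphlib` (Python 3.9+), with the dependency structure of the meta-scale model used later in these notes:

```
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on (its parents).
deps = {
    "loc": {"meta_scale"},
    "scale": {"meta_scale"},
    "data": {"loc", "scale"},
}

# static_order() yields nodes so that parents always precede children,
# e.g. ['meta_scale', 'loc', 'scale', 'data'].
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The relative order of `loc` and `scale` is not fixed (they don't depend on each other), which hints at the "equivalence class" point made below about plate notation.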

- Statistical models: use Bayes's theorem again. Assume that the nodes $x_1, ..., x_N$ are topologically sorted. Then the joint density can be factored as
$$
p(x_1, ..., x_N) = \prod_{n=1}^N p(x_n | x_1, ..., x_{n-1}).
$$

- So, this is a fundamentally *generative* story (as suggested by the topological sort!): first we generate from $x_1$, then from $x_2 | x_1$, then from $x_3 | x_2, x_1$, and so on.
- If we are lucky (as in the previous slide), $x_n$ doesn't depend on *all* $n' < n$ but rather only on some of them.
- There are usually three ways to write down our generative models:
- (Pseudo)code (necessary and sufficient)
- Mathematically
- Graphically

Pseudocode

```
meta_scale ~ Gamma(5.0, 1.0)
loc ~ Normal(0.0, meta_scale)
scale ~ Gamma(meta_scale, meta_scale)
for i = 1,...,N:
    data[i] ~ Normal(loc, scale, obs=real_data[i])
end
```

- Mathematically
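One way to write the joint density of the meta-scale model mathematically, reading one factor per sample statement off the pseudocode (the chain rule applied in topological order):

$$
p(\text{data}, \text{loc}, \text{scale}, \text{meta\_scale}) = \left[\prod_{i=1}^{N} p(\text{data}_i \mid \text{loc}, \text{scale})\right] p(\text{loc} \mid \text{meta\_scale})\ p(\text{scale} \mid \text{meta\_scale})\ p(\text{meta\_scale})
$$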

In [3]:

```
# graphically
image = img.imread("./plots/2021-02-17meta-scale-model.png")
fig, ax = plt.subplots(figsize=(6, 6))
ax.axis('off')
ax.imshow(image)
plt.show()
```

In [4]:

```
# and as actual code (we'll get to how this works soon)
def meta_scale_model(data, size=1, verbose=False):
    meta_scale = pyro.sample(
        "meta_scale",
        dist.Gamma(5.0, 1.0)
    )
    loc = pyro.sample(
        "loc",
        dist.Normal(0.0, meta_scale),
    )
    scale = pyro.sample(
        "scale",
        dist.Gamma(meta_scale, meta_scale)
    )
    if verbose:
        print(f"meta_scale = {meta_scale}, loc = {loc}, and scale = {scale}")
    with pyro.plate("data_plate", size=size):
        obs = pyro.sample(
            "data",
            dist.Normal(loc, scale),
            obs=data
        )
    return obs
```

In [5]:

```
# model is fully generative here
seed_me()
mock_data = meta_scale_model(None, size=100, verbose=True)
plot_hist(mock_data)
```

meta_scale = 11.556055068969727, loc = 7.411444664001465, and scale = 0.814912736415863
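For readers without Pyro installed, the same generative story can be sketched in plain NumPy. One caveat (an assumption worth flagging): NumPy's `gamma` is parameterized by shape and scale, whereas Pyro's `Gamma` takes concentration and rate, so rate $b$ becomes scale $1/b$. The seed here is arbitrary.

```
import numpy as np

rng = np.random.default_rng(0)

# Gamma(concentration=a, rate=b) in Pyro corresponds to rng.gamma(a, 1.0 / b).
meta_scale = rng.gamma(5.0, 1.0)                 # meta_scale ~ Gamma(5, 1)
loc = rng.normal(0.0, meta_scale)                # loc ~ Normal(0, meta_scale)
scale = rng.gamma(meta_scale, 1.0 / meta_scale)  # scale ~ Gamma(meta_scale, meta_scale)
data = rng.normal(loc, scale, size=100)          # data[i] ~ Normal(loc, scale)
print(data.shape)  # (100,)
```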

- Important to be able to convert between each of these model representations
- Especially important to note the ups and downs of each. In particular, the graphical representation (also called plate notation) does *not* represent a model with instantiations of random variables, but rather a topological equivalence class of models (see this for more info)
- Basic plate notation rules: plates correspond to an independent tensor index (equivalently, a product), shaded nodes correspond to observed rvs, and clear nodes are latents (copied from previous link)

- Mathematical representation is very general and corresponds 1-1 with graphical notation. You can instantiate each pdf / pmf in the mathematical representation to make the model concrete, but at a loss of generality (e.g., $p(\text{meta\_scale}) \mapsto \text{Gamma}(5.0, 1.0)$ in our example here)
- Pseudocode is necessary if we're being generative / constructive (we are), but you lose some generality (or it's just equivalent to the graphical presentation if you don't lose generality)


*Discriminative*: a generative model that also depends on a set of free, exogenous parameters.
- For us as data scientists, those free exogenous parameters are usually called "data"
- Models tend to have a structure like
$$
p_X(y, z) = p_X(y | z)\ p(z),
$$
where $p_X(y | z)$ is a likelihood function, $p(z)$ is a prior over the latent random variables in the likelihood function, and $X$ are the free exogenous parameters.
- We are contrasting "exogenous" parameters with ones whose optimal values we'll try to find later in the course.

*Generative*: a model that corresponds to (one or more) terminating computer programs that output data.
- Looks like $$ p(y, z) = p(y|z)\ p(z) $$
- Note that you can always turn a discriminative model into a generative model by specifying another prior: $$ p(X, y, z) = p(y | X, z)\ p(X|z)\ p(z) $$

- These basically fall into two categories -- maximum likelihood and maximum *a posteriori*. Both types still use Bayes's theorem, though maybe not in the way you'd expect.

- Maximum likelihood models
- Our data $y$ has a likelihood function $p_X(y | z)$ depending on some latent variables $z$ and exogenous free parameters $X$ (probably more data; we think that there's a function $y = f(X, z)$ involved here).
- This is equivalent to saying that our model's joint density is
$$
p_X(y, z) = p_X(y | z)\ p(z),\ p(z) \propto 1
$$
i.e., that we have a (probably improper) flat prior over *all* of our latent random variables.
- Improper: it is probably the case that $\int_{z\in \mathcal Z}\ dz\ p(z) = +\infty$ if $p(z) \propto 1$, e.g., if $\mathcal Z$ has infinite cardinality and isn't compact (more about this in a few minutes).
- This is a very strong assumption -- do you really know *nothing* about *any* of your latent variables?

- Example maximum likelihood model: Ordinary linear regression.
- $y \sim \text{Normal}(X\beta, \sigma^2)$, $\sigma \sim p(\sigma)$, $\beta \sim p(\beta)$.
- Usually, we set $p(\beta) \propto 1$, marginalize out $\sigma$, and care only to find the *single value* of $\beta$ that maximizes $p(y | \beta)$. That is, we try to solve
$$
\begin{aligned}
\max_{\beta} \log p_X(y | \beta) &= \max_\beta -\frac{1}{2}\sum_{n=1}^N ((y_n - X_n\beta)/\sigma)^2 \\
&= \min_{\beta}\left\lVert y - X\beta \right\rVert_{2}^2,
\end{aligned}
$$
which is where the method of least squares comes from. We see how this falls *directly* out of the Gaussian assumption plus the highly questionable assertion that our priors are completely uniform...
- Homework exercise: solve $\min_{\beta}\left\lVert y - X\beta \right\rVert_{2}^2$ analytically. (This is a *classic* linear algebra problem, but you can also easily use the multivariate chain rule.)
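Without spoiling the derivation, the closed-form answer to the homework is the normal equations, $\hat\beta = (X^\top X)^{-1} X^\top y$. A quick numerical sanity check against NumPy's least-squares solver (the data here is synthetic, made up for the check):

```
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form solution of min_beta ||y - X beta||_2^2 via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with the library's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```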

- Side note: improper priors. We just introduced a prior distribution $p(\beta) \propto 1$ where $\beta \in \mathbb R^D$. Obviously this is not a true probability density, but we can still use it as a prior distribution *if* we don't care about generating values from it.
- More generally, it's sometimes possible to use an improper $p(z)$ as long as certain integrability and data conditions hold.
- Let $x = (x_1, ..., x_n)$ be observed datapoints and $z$ be the collection of latent rvs. The posterior density of $z$ is given by $$ p(z|x) = \frac{p(x|z)\ p(z)}{p(x)}, $$ so it's necessary to compute (or approximate) the integral $$ p(x) = \int\ dz\ p(x|z)\ p(z). $$ An improper prior will work out for you iff this integral converges.
- E.g., when $p(z) \propto 1$ as above, we require the ability to integrate $z$ out of the likelihood function:
$$
p(x) = \int\ dz\ p(x|z),
$$
*which may or may not be possible depending on how many data points you have* (trivially impossible with 0 datapoints, for example).
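A numerical illustration of this convergence condition, under an assumed toy setup (a $\text{Normal}(z, 1)$ likelihood for a scalar latent mean $z$, flat prior $p(z) \propto 1$, and three made-up observations): the evidence integral is finite as soon as there is at least one data point, whereas with zero data points the integrand is identically 1 and the integral diverges.

```
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = np.array([0.3, -0.7, 1.1])  # three observed points (made up)

# Likelihood of the data as a function of the latent mean z.
def likelihood(z):
    return np.prod(norm.pdf(x, loc=z, scale=1.0))

# Evidence under the flat improper prior p(z) ∝ 1.
# Finite here because n >= 1; with n = 0 the integrand would be 1 everywhere.
p_x, _ = quad(likelihood, -np.inf, np.inf)
print(np.isfinite(p_x) and p_x > 0)  # True
```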