Statistical modeling

  • What the 2-gram says...
    • The world around us exists. We will never be able to understand it fully, but we might be able to describe it with some accuracy, and if we're lucky and good at what we do, with some precision.
    • The scientific method, embodied by Bayes's theorem, allows us to propose structured models of the world and score them based on the data we have at hand.
  • We must now understand what these models can possibly look like. In this course models are always generative -- we will always be able to generate data from our models by writing an algorithm to do so.
    • (We might need to specify additional parameters too, such as exogenous observed data. More on that later.)
  • Generative model -> we can write an algorithm to generate from the model -> we should be able to order the model components somehow. More formalization needed.
In [2]:
image = img.imread("./plots/2021-02-17topo-sort.png")
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(image)
ax.axis("off")
  • Statistical models: use Bayes's theorem again. Assume that the nodes $x_1,..., x_N$ are topologically sorted. Then the joint density can be factored as
$$ \begin{aligned} p(x_1, x_2,...,x_N) &= \prod_{n=1}^N p(x_n | x_{n' < n})\\ &\equiv p(x_N|x_{N-1},...,x_1)\ p(x_{N - 1}|x_{N - 2},...,x_1) \cdots p(x_1) \end{aligned} $$
  • So, this is a fundamentally generative story (as suggested by the topological sort!). First we generate from $x_1$, then from $x_2 | x_1$, then from $x_3 | x_2, x_1$, and so on.
  • If we are lucky (as in previous slide) then there isn't dependence on all $n' < n$ but rather only some of them.
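The factorization above is exactly ancestral sampling: visit the nodes in topologically sorted order, sampling each conditional on its already-sampled parents. A minimal numpy sketch of the idea, with hypothetical conditional densities chosen purely for illustration (not part of the course code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain x1 -> x2 -> x3: each node depends only on its
# parent, so we sample in topologically sorted order.
def ancestral_sample():
    x1 = rng.normal(0.0, 1.0)   # p(x1)
    x2 = rng.normal(x1, 0.5)    # p(x2 | x1)
    x3 = rng.normal(x2, 0.25)   # p(x3 | x2)
    return x1, x2, x3

samples = [ancestral_sample() for _ in range(1000)]
```

Because each node conditions only on its parent (not all predecessors), this is the "lucky" sparse case: the joint factors as $p(x_1)\,p(x_2|x_1)\,p(x_3|x_2)$.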
  • Usually three ways to write down our generative models:
    • (Pseudo)code (necessary and sufficient)
    • Mathematically
    • Graphically
  • Pseudocode

      meta_scale ~ Gamma(5.0, 1.0)
      loc ~ Normal(0.0, meta_scale)
      scale ~ Gamma(meta_scale, meta_scale)
      for i = 1,...,N:
          data[i] ~ Normal(loc, scale, obs=real_data[i])
  • Mathematically
$$ \begin{aligned} &p(\text{meta_scale}, \text{loc}, \text{scale}, \text{data}) = \\ &\quad p(\text{meta_scale}) p(\text{loc}|\text{meta_scale}) p(\text{scale} | \text{meta_scale}) \prod_{i=1}^N p(\text{data}_i | \text{loc}, \text{scale}) \end{aligned} $$
In [3]:
# graphically
image = img.imread("./plots/2021-02-17meta-scale-model.png")
fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(image)
ax.axis("off")
In [4]:
# and as actual code (we'll get to how this works soon)

def meta_scale_model(data, size=1, verbose=False):
    meta_scale = pyro.sample(
        "meta_scale", dist.Gamma(5.0, 1.0)
    )
    loc = pyro.sample(
        "loc", dist.Normal(0.0, meta_scale)
    )
    scale = pyro.sample(
        "scale", dist.Gamma(meta_scale, meta_scale)
    )
    if verbose:
        print(f"meta_scale = {meta_scale}, loc = {loc}, and scale = {scale}")
    with pyro.plate("data_plate", size=size):
        # obs=None means we sample; obs=data means we condition on data
        obs = pyro.sample(
            "data", dist.Normal(loc, scale), obs=data
        )
    return obs
In [5]:
# model is fully generative here
mock_data = meta_scale_model(None, size=100, verbose=True)
meta_scale = 11.556055068969727, loc = 7.411444664001465, and scale = 0.814912736415863
  • Important to be able to convert between each of these model representations
    • Especially important to note the ups and downs of each. In particular, the graphical representation (also called plate notation) does not represent a model with instantiations of random variables but rather a topological equivalence class of models (see this for more info)
      • Basic plate notation rules: plates correspond to an independent tensor index (equivalently, a product), shaded nodes correspond to observed rvs, and clear nodes are latents (copied from previous link)
    • Mathematical representation is very general and corresponds one-to-one with the graphical notation. You can instantiate each pdf / pmf in the mathematical representation to make the model concrete, but at a loss of generality (e.g., $p(\text{meta_scale}) \mapsto \text{Gamma}(5.0, 1.0)$ in our example here)
    • Pseudocode is necessary if we're being generative / constructive (we are), but you lose some generality (or, if you don't lose generality, it's just equivalent to the graphical representation)

Discriminative vs generative models

  • Discriminative: a generative model that also depends on a set of free, exogenous parameters.
    • For us as data scientists, those free exogenous parameters are usually called "data"
    • Models tend to have a structure like $$ p_X(y, z) = p_X(y | z)\ p(z), $$ where $p_X(y | z)$ is a likelihood function, $p(z)$ is a prior over the latent random variables in the likelihood function, and $X$ are the free exogenous parameters.
      • We are contrasting "exogenous" parameters with ones whose optimal values we'll try to find later in the course.
  • Generative: a model that corresponds to (one or more) terminating computer programs that output data.
    • Look like $$ p(y, z) = p(y|z)\ p(z) $$
    • Note that you can always turn a discriminative model into a generative model by specifying another prior, $$ p(X, y, z) = p(y | X, z)\ p(X|z)\ p(z) $$
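The distinction above can be sketched as two sampling programs in plain numpy (names and densities here are hypothetical, chosen only to make the factorizations concrete):

```python
import numpy as np

rng = np.random.default_rng(1)

# Exogenous "data": held fixed, never sampled by the model itself.
X_fixed = rng.normal(size=10)

def sample_discriminative(X):
    # p_X(y, z) = p_X(y | z) p(z): X enters only as a fixed parameter.
    z = rng.normal(0.0, 1.0)        # p(z)
    y = rng.normal(X * z, 0.1)      # p_X(y | z)
    return y, z

def sample_generative(n=10):
    # p(X, y, z) = p(y | X, z) p(X | z) p(z): adding a prior on X makes
    # the program fully generative -- it now produces X as well.
    z = rng.normal(0.0, 1.0)        # p(z)
    X = rng.normal(z, 1.0, size=n)  # p(X | z)
    y = rng.normal(X * z, 0.1)      # p(y | X, z)
    return X, y, z
```

Note that `sample_discriminative` cannot run without someone handing it an `X`, while `sample_generative` is a self-contained terminating program that outputs data.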

Discriminative models

  • These basically fall into two categories -- maximum likelihood and maximum a posteriori. Both types still use Bayes's theorem, though maybe not in the way you'd expect.
  • Maximum likelihood models
    • Our data $y$ has a likelihood function $p_X(y | z)$ depending on some latent variables $z$ and exogenous free parameters $X$ (probably more data; we think that there's a function $y = f(X, z)$ involved here).
    • This is equivalent to saying that our model's joint density is $$ p_X(y, z) = p_X(y | z)\ p(z),\ p(z) \propto 1 $$ i.e., that we have a (probably improper) flat prior over all of our latent random variables.
    • Improper: it is probably the case that $\int_{z\in \mathcal Z}\ dz\ p(z) = +\infty$ if $p(z) \propto 1$, e.g., if $\mathcal Z$ has infinite cardinality and isn't compact (more about this in a few minutes).
    • This is a very strong assumption -- do you really know nothing about any of your latent variables?
  • Example maximum likelihood model: Ordinary linear regression.
    • $y \sim \text{Normal}(X\beta, \sigma^2)$, $\sigma \sim p(\sigma)$, $\beta \sim p(\beta)$.
    • Usually, we set $p(\beta) \propto 1$, marginalize out $\sigma$, and care only to find the single value of $\beta$ that maximizes $p(y | \beta)$. That is, we try to solve $$ \begin{aligned} \max_{\beta} \log p_X(y | \beta) &= \max_\beta -\frac{1}{2}\sum_{n=1}^N ((y_n - X_n\beta)/\sigma)^2 \\ &= \min_{\beta}\left\lVert y - X\beta \right\rVert_{2}^2, \end{aligned} $$ which is where the method of least squares comes from. We see how this falls directly out of the Gaussian assumption + making the highly questionable assertion that our priors are completely uniform...
    • Homework exercise: solve $\min_{\beta}\left\lVert y - X\beta \right\rVert_{2}^2$ analytically. (This is a classic linear algebra problem but you can also easily use the multivariate chain rule.)
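A quick numerical check of the equivalence (a sketch with simulated data; sizes and coefficients are arbitrary, and we use `numpy.linalg.lstsq` rather than the analytic solution so the homework isn't spoiled):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated regression data: y = X beta + Gaussian noise.
N, D = 200, 3
X = rng.normal(size=(N, D))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.1
y = X @ beta_true + rng.normal(0.0, sigma, size=N)

# Least-squares estimate: minimizes ||y - X beta||_2^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def gaussian_loglik(beta):
    # Log-likelihood under the Gaussian assumption, up to a constant.
    resid = y - X @ beta
    return -0.5 * np.sum((resid / sigma) ** 2)

# The least-squares solution should have at least as high a Gaussian
# log-likelihood as any perturbation of it.
perturbed = beta_hat + 0.05 * rng.normal(size=D)
```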
  • Side note: improper priors. We just introduced a prior distribution $p(\beta) \propto 1$ where $\beta \in \mathbb R^D$. Obviously this is not a true probability density, but we can still use it as a prior distribution if we don't care about generating values from it.
  • More generally, it's sometimes possible to use improper $p(z)$ as long as certain integrability and data conditions hold.
  • Let $x = (x_1,...,x_n)$ be observed datapoints and $z$ be the collection of latent rvs. The posterior density of $z$ is given by $$ p(z|x) = \frac{p(x|z)\ p(z)}{p(x)}, $$ so it's necessary to compute (or approximate) the integral $$ p(x) = \int\ dz\ p(x|z)\ p(z) $$ An improper prior will work out for you iff this integral converges.
  • E.g., when $p(z) \propto 1$ as above, we require the ability to integrate $z$ out of the likelihood function: $$ p(x) = \int\ dz\ p(x|z), $$ which may or may not be possible depending on how many data points you have (trivially impossible with 0 datapoints, for example).
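  • Concrete instance of the convergence condition: take a single observation $x \sim \text{Normal}(z, 1)$ with the flat prior $p(z) \propto 1$. Then $$ p(x) = \int\ dz\ \text{Normal}(x | z, 1) = 1 < \infty, $$ since the integrand is symmetric in $x$ and $z$ and hence a normal density in $z$. The posterior $p(z|x) = \text{Normal}(z | x, 1)$ is therefore perfectly proper even though the prior wasn't; with zero observations there is no likelihood to integrate against, and the construction fails.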