Distributions and models

  • A little (completely non-exhaustive) zoo of probability distributions, and more assorted facts
  • Honest-to-goodness expressive probabilistic modeling using a modern-day probabilistic programming language (Pyro)
In [2]:
# normal distribution
# support: (-oo, oo)
# Usage: *Very* useful, as it arises via the CLT, diffusion processes, Taylor expansions...
xvals = torch.linspace(-2.0, 2.0, 100)
fig, ax = plt.subplots()
for loc, scale in itertools.product([0.0, 1.0], [0.5, 1.0]):
    plot_continuous(dist.Normal, xvals, loc, scale, ax=ax)
In [3]:
# exponential distribution
# support: (0, oo)
# Usage: the memoryless continuous distribution, useful for modeling waiting times
# between events in continuous space / time
xvals = torch.linspace(0.01, 5.0, 100)
fig, ax = plt.subplots()
for rate in [0.5, 1.0, 2.0]:
    plot_continuous(dist.Exponential, xvals, rate, ax=ax)
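  • A quick numerical check of memorylessness, $P(X > s + t \mid X > s) = P(X > t)$, may help here. A minimal sketch, assuming `torch` and `dist` are imported as in the cells above; the rate 1.5 and the times 0.7 and 1.2 are arbitrary choices:
    e = dist.Exponential(1.5)
    s, t = torch.tensor(0.7), torch.tensor(1.2)
    lhs = (1 - e.cdf(s + t)) / (1 - e.cdf(s))   # P(X > s + t | X > s)
    rhs = 1 - e.cdf(t)                          # P(X > t)
    print(lhs, rhs)                             # the two survival probabilities should agree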
In [4]:
# gamma distribution
# support: (0, oo)
# Usage: exponential is a special case of this, maximum entropy distribution with respect
# to some interesting constraints (see wikipedia), useful in *conjugacy* (more on that 
# momentarily)
xvals = torch.linspace(0.01, 5.0, 100)
fig, ax = plt.subplots()
for alpha, beta in itertools.product([1.0, 4.0], [1.0, 2.0]):
    plot_continuous(dist.Gamma, xvals, alpha, beta, ax=ax)
In [5]:
# geometric distribution
# support: {0, 1, 2, ...}
# Usage: discrete counterpart to the exponential distribution, memoryless discrete
# distribution, "how many failed coin flips before the first heads"
xvals = torch.arange(10)
fig, ax = plt.subplots()
for p in [0.5, 0.25, 0.1]:
    plot_discrete(dist.Geometric, xvals, p, ax=ax)
In [6]:
# poisson distribution
# support: {0, 1, 2, ...}
# Usage: number of events occurring within a time interval
# (exercise: derive this assuming that events happen according to a continuous
# memoryless distribution), more generally as a generic "counting measure"
xvals = torch.arange(20, dtype=torch.float)
fig, ax = plt.subplots()
for rate in [1.0, 5.0, 10.0]:
    plot_discrete(dist.Poisson, xvals, rate, ax=ax)

# compare to normal...
plot_continuous(dist.Normal, xvals, 10.0, torch.tensor(10.0).sqrt(), ax=ax);
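  • A quick empirical check of the exercise above: simulate exponential inter-arrival times and count how many events land before time 1. A minimal sketch, assuming `torch` and `dist` are imported as in the cells above; the rate 5.0 and the cap of 50 arrivals per trial are arbitrary choices:
    rate = 5.0
    waits = dist.Exponential(rate).sample((10000, 50))   # exponential inter-arrival times
    counts = (waits.cumsum(-1) < 1.0).sum(-1)            # number of arrivals before time 1
    print(counts.float().mean(), counts.float().var())   # both should be close to rate, as for Poisson(rate)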
In [7]:
# beta distribution
# support: [0, 1]
# Usage: a random probability
xvals = torch.linspace(0.001, 0.999, 250, dtype=torch.float)
fig, ax = plt.subplots()
for alpha, beta in (
    [0.5, 0.5], [1.0, 1.0], [2.0, 2.0], [4.0, 2.0]
):
    plot_continuous(dist.Beta, xvals, alpha, beta, ax=ax)
In [8]:
# categorical distribution
# support: any finite set
# Usage: the most general distribution over a finite set
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, domain, probs in zip(
    axes, 
    ([0.1, 12.8], ["a", "b", "frogs", 63]),
    ([0.3, 0.7], [0.15, 0.5, 0.2, 0.15])
):
    plot_categorical(domain, *probs, ax=ax)
fig.tight_layout()
  • It's literally not possible to cover all "useful" probability distributions, but these will give you a decent start. A good exercise for you is to think about multivariate generalizations when I haven't given those explicitly. E.g., what would a multivariate generalization of a beta distribution look like? (It's called a Dirichlet distribution, look it up.)
  • Useful to you in your analytical pursuits is the idea of conjugacy:
    • Write Bayes' theorem as $p(z | x) \propto p(x | z) p(z)$. If the posterior $p(z | x)$ has the same functional form as the prior, i.e., $p(z | x) = p_{\theta'}(z)$ for some updated parameters $\theta' = \theta'(x)$, we say that $p(z)$ is a conjugate prior for the likelihood $p(x | z)$.
    • Let's unpack this a little bit...
  • Suppose we observe a sequence of geometric trials (e.g., we flip a coin until it comes up heads, and we do this $n$ times in a row). A reasonable prior to put on the probability of heads $\pi$ is a Beta distribution, $\pi \sim \text{Beta}(\alpha, \beta)$. This prior has the density $$ p(\pi | \alpha, \beta) \propto \pi^{\alpha - 1}(1 - \pi)^{\beta - 1}, $$ while the geometric distribution (counting the number of tails $x$ observed before the first heads) has mass function $$ p(x | \pi) = \pi (1 - \pi)^{x} $$
  • Thus, the posterior density is $$ \begin{aligned} p(\pi | x_1, ..., x_n) \propto p(x_1, ..., x_n, \pi) &= p(\pi | \alpha, \beta) \prod_{1 \leq i \leq n} p(x_i | \pi) \\ &\propto \pi^{\alpha - 1}(1 - \pi)^{\beta - 1} \prod_{1 \leq i \leq n} \pi (1 - \pi)^{x_i} \\ &= \pi^{\alpha + n - 1}(1 - \pi)^{\beta + \sum_i x_i - 1} \propto \text{Beta}(\alpha + n, \beta + \sum_i x_i) \end{aligned} $$
  • The posterior has the same form as the prior, just with updated parameter values. This is mostly a convenience and not strictly necessary with modern computing (usually...), but it is very good to understand. A quick numerical check of this update follows.
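  • A minimal sketch of that check, assuming `dist` is imported as in the cells above; the true probability 0.3 and the Beta(2, 2) prior are arbitrary choices:
    pi_true = 0.3
    x = dist.Geometric(pi_true).sample((1000,))            # failures before the first heads, n = 1000 trials
    alpha, beta = 2.0, 2.0                                  # Beta(alpha, beta) prior on pi
    posterior = dist.Beta(alpha + len(x), beta + x.sum())   # Beta(alpha + n, beta + sum_i x_i)
    print(posterior.mean)                                   # should be close to pi_true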

Expressive modeling

  • Introduce the idea of a probabilistic programming language: embedding probabilistic modeling in a programming language to facilitate
    • easier modeling (just write code!)
    • easier inference (use the power of compilers, recursion, memoization, parallelization, etc.)
  • You read three papers before this class -- each paper introduced a different PPL: Pyro, Figaro, and Gen. Each PPL is universal in the sense that it can represent any computable probability distribution (a rather large class...), but each makes markedly different design decisions:
    • Trace-based (Pyro and Gen) vs non-trace-based (Figaro)
    • Object-oriented (Figaro) vs largely procedural (Pyro and Gen)
    • Focused on sampling (Figaro and Gen), variable elimination and belief propagation (Figaro and Pyro), or variational inference (Pyro and Gen).
  • We will focus on concepts of generative, trace-based models and use sampling and optimization methods to perform inference.

Trace-based PPL

  • We are going to broaden the notion of a probabilistic model quite a bit. Now we will talk about probability density functions over the space of traces of computer programs. This requires us to know what that means -- you should read the Gen documentation for some help with that.
  • A trace $t$ maps an address $a \in \mathcal A$ to a value $v \in \mathcal V$. A trace comes about by recording the execution of a (stochastic, in our case) program, $$ t = \text{trace}[f](i), $$ where $i \in \mathcal I$ denotes the inputs to that execution of the program. We will think about probability distributions over execution traces of the program, $p_i(t)$, which satisfy $$ \sum_t p_i(t) = 1\ \text{ for all }i \in \mathcal I. $$
  • This greatly, greatly generalizes the notion of generative model that we have been discussing thus far.
  • Consider the following generative models (assume data is some fixed datapoint):
    loc ~ normal(0.0, 1.0)
    scale ~ gamma(2.0, 2.0)
    values ~ observe(normal(loc, scale), data)
    vs.
    prob ~ beta(1.0, 1.0)
    branch ~ bernoulli(prob)
    if branch:
      loc ~ normal(0.0, 1.0)
    else:
      loc = 2.0
    scale ~ gamma(2.0, 2.0)
    values ~ observe(normal(loc, scale), data)
  • Consider the following generative models (assume data is some fixed datapoint):
    T = 100
    x ~ normal(0.0, 1.0)
    for t in 1:T:
      x ~ normal(x, 1.0)
      y ~ normal(x, 0.5)
    vs.
    stop ~ bernoulli(0.01)
    x ~ normal(0.0, 1.0)
    while not stop:
      x ~ normal(x, 1.0)
      y ~ normal(x, 0.5)
      stop ~ bernoulli(0.01)
  • In each of these examples, the first program corresponds to a static graph, while the second does not:
    • In the first example, the second program exhibits stochastic control flow, where the branch taken depends on a random choice
    • In the second example, the second program exhibits open-universe structure, where the number of random variables is unbounded (though finite with probability one)
  • We'll write each of these programs in a PPL (Pyro) and examine their outputs both in standard interpretation (i.e., evaluating their logpdfs at some data) and in nonstandard interpretation (using them to generate data in the first place). The first pair is implemented in the cells below; a sketch of the open-universe program appears right after this list.
    • By examining their traces, we can also see the difference between static and non-static structure.
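  • A hedged sketch of how the open-universe program (the second model in the second pair) might be written in Pyro; the function name is made up for illustration, and each random choice gets a unique, step-indexed address so the trace records every draw:
    def open_universe_model():
        stop = pyro.sample("stop_0", dist.Bernoulli(0.01))
        x = pyro.sample("x_0", dist.Normal(0.0, 1.0))
        t = 0
        while not stop:
            t += 1
            x = pyro.sample(f"x_{t}", dist.Normal(x, 1.0))   # latent random walk
            y = pyro.sample(f"y_{t}", dist.Normal(x, 0.5))   # noisy observation of the latent
            stop = pyro.sample(f"stop_{t}", dist.Bernoulli(0.01))
        return x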
In [9]:
def normal_static_model(size=1):
    # static structure: the same three sample sites appear in every execution
    loc = pyro.sample("loc", dist.Normal(0.0, 1.0))
    scale = pyro.sample("scale", dist.Gamma(2.0, 2.0))
    values = pyro.sample("values", dist.Normal(loc, scale).expand((size,)))
    return values


def normal_dynamic_model(size=1):
    # stochastic control flow: whether the "loc" sample site exists depends on "branch"
    probs = pyro.sample("probs", dist.Beta(1.0, 1.0))
    branch = pyro.sample("branch", dist.Bernoulli(probs))
    if branch > 0:
        loc = pyro.sample("loc", dist.Normal(0.0, 1.0))
    else:
        loc = torch.tensor(2.0)
    scale = pyro.sample("scale", dist.Gamma(2.0, 2.0))
    values = pyro.sample("values", dist.Normal(loc, scale).expand((size,)))
    return values
In [10]:
# nonstandard interpretation -- make some data
size = 500

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
plot_pyro_program(normal_static_model, ax=axes[0], size=size)
plot_pyro_program(normal_dynamic_model, ax=axes[1], size=size)
fig.tight_layout()
In [11]:
# let's examine the traces from some model calls.
static_trace = pyro.poutine.trace(normal_static_model).get_trace(size=5)
static_trace.compute_log_prob()
clean_trace_view(static_trace.nodes)
Out[11]:
OrderedDict([('input', {'size': 5}),
             ('loc',
              {'fn': Normal(loc: 0.0, scale: 1.0),
               'value': tensor(1.9132),
               'lpdf': tensor(-2.7491)}),
             ('scale',
              {'fn': Gamma(concentration: 2.0, rate: 2.0),
               'value': tensor(0.6253),
               'lpdf': tensor(-0.3338)}),
             ('values',
              {'fn': Normal(loc: torch.Size([5]), scale: torch.Size([5])),
               'value': tensor([1.6168, 1.7076, 1.6175, 1.2709, 1.7594]),
               'lpdf': tensor(-3.0830)}),
             ('output',
              {'value': tensor([1.6168, 1.7076, 1.6175, 1.2709, 1.7594])})])
In [12]:
# let's examine the traces from some model calls.
dynamic_trace = pyro.poutine.trace(normal_dynamic_model).get_trace(size=5)
dynamic_trace.compute_log_prob()
clean_trace_view(dynamic_trace.nodes)
Out[12]:
OrderedDict([('input', {'size': 5}),
             ('probs',
              {'fn': Beta(), 'value': tensor(0.5612), 'lpdf': tensor(0.)}),
             ('branch',
              {'fn': Bernoulli(probs: 0.5611799955368042, logits: 0.24595242738723755),
               'value': tensor(0.),
               'lpdf': tensor(-0.8237)}),
             ('scale',
              {'fn': Gamma(concentration: 2.0, rate: 2.0),
               'value': tensor(0.9755),
               'lpdf': tensor(-0.5895)}),
             ('values',
              {'fn': Normal(loc: torch.Size([5]), scale: torch.Size([5])),
               'value': tensor([2.5593, 3.5110, 3.5871, 2.3304, 1.7199]),
               'lpdf': tensor(-7.2568)}),
             ('output',
              {'value': tensor([2.5593, 3.5110, 3.5871, 2.3304, 1.7199])})])
In [13]:
# condition the static model on observed data, then examine the trace
conditioned_static_model = pyro.poutine.condition(
    normal_static_model, data=dict(values=torch.tensor(3.0))
)
static_trace = pyro.poutine.trace(conditioned_static_model).get_trace(size=1)
static_trace.compute_log_prob()
clean_trace_view(static_trace.nodes)
Out[13]:
OrderedDict([('input', {'size': 1}),
             ('loc',
              {'fn': Normal(loc: 0.0, scale: 1.0),
               'value': tensor(-1.7348),
               'lpdf': tensor(-2.4237)}),
             ('scale',
              {'fn': Gamma(concentration: 2.0, rate: 2.0),
               'value': tensor(1.3444),
               'lpdf': tensor(-1.0066)}),
             ('values',
              {'fn': Normal(loc: tensor([-1.7348]), scale: tensor([1.3444])),
               'value': tensor(3.),
               'lpdf': tensor(-7.4164),
               'observed': True}),
             ('output', {'value': tensor(3.)})])
In [14]:
# condition the dynamic model on observed data (including the latent loc), then examine the trace
conditioned_dynamic_model = pyro.poutine.condition(
    normal_dynamic_model, data=dict(values=torch.tensor(3.0), loc=torch.tensor(1.0))
)
dynamic_trace = pyro.poutine.trace(conditioned_dynamic_model).get_trace(size=1)
dynamic_trace.compute_log_prob()
clean_trace_view(dynamic_trace.nodes)
Out[14]:
OrderedDict([('input', {'size': 1}),
             ('probs',
              {'fn': Beta(), 'value': tensor(0.1226), 'lpdf': tensor(0.)}),
             ('branch',
              {'fn': Bernoulli(probs: 0.12255457043647766, logits: -1.9684582948684692),
               'value': tensor(1.),
               'lpdf': tensor(-2.0992)}),
             ('loc',
              {'fn': Normal(loc: 0.0, scale: 1.0),
               'value': tensor(1.),
               'lpdf': tensor(-1.4189),
               'observed': True}),
             ('scale',
              {'fn': Gamma(concentration: 2.0, rate: 2.0),
               'value': tensor(0.7808),
               'lpdf': tensor(-0.4228)}),
             ('values',
              {'fn': Normal(loc: tensor([1.]), scale: tensor([0.7808])),
               'value': tensor(3.),
               'lpdf': tensor(-3.9519),
               'observed': True}),
             ('output', {'value': tensor(3.)})])
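  • The per-site lpdf values above sum to the joint log density of the trace, which is the quantity inference algorithms manipulate. Pyro's Trace objects expose this sum directly; a minimal usage sketch, continuing from the two conditioned traces above:
    print(static_trace.log_prob_sum())    # joint log density of the conditioned static trace
    print(dynamic_trace.log_prob_sum())   # joint log density of the conditioned dynamic trace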