## Distributions and models¶

• A little (completely non-exhaustive) zoo of probability distributions, and more assorted facts
• Honest-to-goodness expressive probabilistic modeling using a modern-day probabilistic programming language (Pyro)
In [2]:
# normal distribution
# support: (-oo, oo)
# Usage: *very* useful, as it arises via the CLT, diffusion processes, Taylor expansions...
xvals = torch.linspace(-2.0, 2.0, 100)
fig, ax = plt.subplots()
for loc, scale in itertools.product([0.0, 1.0], [0.5, 1.0]):
    plot_continuous(dist.Normal, xvals, loc, scale, ax=ax)
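The CLT claim in the comment above is easy to spot-check without any plotting machinery. This is a minimal pure-Python sketch (standard library only, no torch): standardized sums of i.i.d. uniforms should have roughly mean 0 and variance 1, as a standard normal would.

```python
import random

random.seed(0)

# CLT check: standardize sums of n i.i.d. Uniform(0, 1) draws and verify that
# the empirical mean and variance are close to 0 and 1.
n, trials = 30, 20000
mean_u, var_u = 0.5, 1.0 / 12.0
zs = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    zs.append((s - n * mean_u) / (n * var_u) ** 0.5)

m = sum(zs) / trials
v = sum((z - m) ** 2 for z in zs) / trials
```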

In [3]:
# exponential distribution
# support: (0, oo)
# Usage: the memoryless continuous distribution; useful for modeling waiting
# times between events in continuous space / time
xvals = torch.linspace(0.01, 5.0, 100)
fig, ax = plt.subplots()
for rate in [0.5, 1.0, 2.0]:
    plot_continuous(dist.Exponential, xvals, rate, ax=ax)

In [4]:
# gamma distribution
# support: (0, oo)
# Usage: exponential is a special case of this, maximum entropy distribution with respect
# to some interesting constraints (see wikipedia), useful in *conjugacy* (more on that
# momentarily)
xvals = torch.linspace(0.01, 5.0, 100)
fig, ax = plt.subplots()
for alpha, beta in itertools.product([1.0, 4.0], [1.0, 2.0]):
    plot_continuous(dist.Gamma, xvals, alpha, beta, ax=ax)

In [5]:
# geometric distribution
# support: {0, 1, 2, ...}
# Usage: discrete counterpart to the exponential distribution (the memoryless
# discrete distribution), "how many coin flips until heads"
xvals = torch.tensor(list(range(10))).type(torch.long)
fig, ax = plt.subplots()
for p in [0.5, 0.25, 0.1]:
    plot_discrete(dist.Geometric, xvals, p, ax=ax)

In [6]:
# poisson distribution
# support: {0, 1, 2, ...}
# Usage: number of events occurring within a time interval
# (exercise: derive this assuming that events happen according to continuous
# memoryless distribution), more generally as a generic "counting measure"
xvals = torch.tensor(list(range(20))).type(torch.float)
fig, ax = plt.subplots()
for rate in [1.0, 5.0, 10.0]:
    plot_discrete(dist.Poisson, xvals, rate, ax=ax)

# compare to normal...
plot_continuous(dist.Normal, xvals, 10.0, torch.tensor(10.0).sqrt(), ax=ax);
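The exercise in the comment above can also be checked by simulation. A standard-library sketch: if inter-arrival times are Exponential(rate), the number of events landing in a unit interval should be Poisson(rate), so the empirical mean and variance of the counts should both be close to the rate.

```python
import random

random.seed(0)

# Simulate arrivals with Exponential(rate) gaps and count events in [0, 1];
# the counts should look Poisson(rate), i.e. mean ~ variance ~ rate.
rate, trials = 5.0, 20000
counts = []
for _ in range(trials):
    t, k = random.expovariate(rate), 0
    while t <= 1.0:
        k += 1
        t += random.expovariate(rate)
    counts.append(k)

mean_count = sum(counts) / trials
var_count = sum((c - mean_count) ** 2 for c in counts) / trials
```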

In [7]:
# beta distribution
# support: [0, 1]
# Usage: a random probability
xvals = torch.linspace(0.001, 0.999, 250, dtype=torch.float)
fig, ax = plt.subplots()
for alpha, beta in (
    [0.5, 0.5], [1.0, 1.0], [2.0, 2.0], [4.0, 2.0]
):
    plot_continuous(dist.Beta, xvals, alpha, beta, ax=ax)

In [8]:
# categorical distribution
# support: any finite set
# Usage: most general discrete distribution
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, domain, probs in zip(
    axes,
    ([0.1, 12.8], ["a", "b", "frogs", 63]),
    ([0.3, 0.7], [0.15, 0.5, 0.2, 0.15])
):
    plot_categorical(domain, *probs, ax=ax)
fig.tight_layout()

• It's not possible to cover every "useful" probability distribution, but these will give you a decent start. A good exercise is to think about multivariate generalizations where I haven't given them explicitly. E.g., what would a multivariate generalization of the beta distribution look like? (It's called the Dirichlet distribution; look it up.)
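As a concrete hook for that exercise: a Dirichlet sample is a random probability vector, and one standard way to draw one is to normalize independent Gamma draws (the K = 2 case collapses back to a Beta). A standard-library sketch, with no torch or Pyro needed:

```python
import random

random.seed(0)

def dirichlet_sample(alphas):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma(alpha_i, 1) draws."""
    ys = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(ys)
    return [y / total for y in ys]

# K = 2 recovers the Beta: the first coordinate is distributed Beta(2, 2)
sample2 = dirichlet_sample([2.0, 2.0])
# K = 4: a random probability vector over four categories
sample4 = dirichlet_sample([1.0, 1.0, 1.0, 1.0])
```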
• Useful to you in your analytical pursuits is the idea of conjugacy:
• Write Bayes' theorem as $p(z | x) \propto p(x | z) p(z)$. If the posterior $p(z | x)$ belongs to the same family as the prior, i.e. $p(z | x) = p(z; \theta'(x))$ for some updated parameters $\theta'(x)$, we say that $p(z)$ is a conjugate prior for the likelihood $p(x | z)$.
• Let's unpack this a little bit...
• Suppose we observe a sequence of geometric trials (e.g., we flip a coin until it comes up heads and record the number of tails, and we do this $n$ times in a row). A reasonable prior to put on the probability of heads $\pi$ is a Beta distribution, $\pi \sim \text{Beta}(\alpha, \beta)$. This prior has the density $$p(\pi | \alpha, \beta) \propto \pi^{\alpha - 1}(1 - \pi)^{\beta - 1},$$ while the geometric distribution has mass function $$p(x | \pi) = \pi (1 - \pi)^{x}.$$
• Thus, the posterior density is \begin{aligned} p(\pi | x_1,..., x_n) \propto p(x_1,...,x_n, \pi) &= \text{Beta}(\pi; \alpha, \beta) \prod_{1 \leq i \leq n} \text{Geometric}(x_i; \pi) \\ &\propto \pi^{\alpha - 1}(1 - \pi)^{\beta - 1} \prod_{1 \leq i \leq n} \pi (1 - \pi)^{x_i} \\ &= \pi^{\alpha + n - 1}(1 - \pi)^{\beta + \sum_i x_i - 1} \propto \text{Beta}(\pi; \alpha + n, \beta + \sum_i x_i) \end{aligned}
• The posterior has the same form as the prior, just with different parameter values. This is mostly a convenience and not necessary with modern computing (usually...), but is very good to understand.
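The update above is easy to sanity-check by simulation. The sketch below uses only the standard library (a Pyro version would use dist.Beta and dist.Geometric): it draws many geometric trials with a known heads probability and confirms that the posterior Beta(α + n, β + Σᵢ xᵢ) concentrates near it.

```python
import random

random.seed(0)

def geometric_trial(pi):
    """Number of tails before the first heads (support {0, 1, 2, ...})."""
    x = 0
    while random.random() >= pi:
        x += 1
    return x

# prior Beta(alpha, beta); true (unknown to the model) heads probability pi
alpha, beta, pi = 2.0, 2.0, 0.3
n = 20000
xs = [geometric_trial(pi) for _ in range(n)]

# conjugate update from the derivation above
alpha_post = alpha + n
beta_post = beta + sum(xs)

# posterior mean of Beta(a, b) is a / (a + b); with lots of data it should
# concentrate near the true pi
posterior_mean = alpha_post / (alpha_post + beta_post)
```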

## Expressive modeling¶

• Introduce the idea of a probabilistic programming language: embedding probabilistic modeling in a programming language to facilitate
• easier modeling (just write code!)
• easier inference (use the power of compilers, recursion, memoization, parallelization, etc.)
• You read three papers before this class -- each paper introduced a different PPL: Pyro, Figaro, and Gen. Each PPL is universal in the sense that it can represent any computable probability distribution (a rather large class...) but each also has markedly different design decisions
• Trace-based (Pyro and Gen) vs non-trace-based (Figaro)
• Object oriented (Figaro) or largely procedural (Pyro and Gen)
• Focused on sampling (Figaro and Gen), variable elimination and belief propagation (Figaro and Pyro), or variational inference (Pyro and Gen).
• We will focus on concepts of generative, trace-based models and use sampling and optimization methods to perform inference.

## Trace-based PPL¶

• We are going to broaden the notion of a probabilistic model quite a bit. Now we will talk about probability density functions over the space of traces of computer programs. This requires us to know what that means -- you should read the Gen documentation for some help with that.
• A trace $t$ maps an address $a \in \mathcal A$ to a value $v \in \mathcal V$. A trace comes about by recording the execution of a (stochastic, in our case) program, $$t = \text{trace}[f](i),$$ where $i \in \mathcal I$ is the input to that execution of the program. We will think about probability distributions over execution traces of the program, $p_i(t)$, which satisfy $$\sum_t p_i(t) = 1\ \text{ for all }i \in \mathcal I.$$
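To make the address-to-value picture concrete, here is a toy trace recorder in plain Python. It is a drastically simplified stand-in for what pyro.poutine.trace does later in this notebook: one execution of the program records each named random choice into a dict keyed by address.

```python
import random

def run_with_trace(program):
    """Execute `program`, recording each named random choice (address -> value)."""
    t = {}
    def sample(address, sampler):
        value = sampler()
        t[address] = value
        return value
    program(sample)
    return t

def model(sample):
    # stochastic control flow: the "loc" address only exists on some runs
    branch = sample("branch", lambda: random.random() < 0.5)
    if branch:
        sample("loc", lambda: random.gauss(0.0, 1.0))

random.seed(1)
t = run_with_trace(model)
```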
• This greatly, greatly generalizes the notion of generative model that we have been discussing thus far.
• Consider the following generative models (assume data is some fixed datapoint):
loc ~ normal(0.0, 1.0)
scale ~ gamma(2.0, 2.0)
values ~ observe(normal(loc, scale), data)
vs.
prob ~ beta(1.0, 1.0)
branch ~ bernoulli(prob)
if branch:
    loc ~ normal(0.0, 1.0)
else:
    loc = 2.0
scale ~ gamma(2.0, 2.0)
values ~ observe(normal(loc, scale), data)
• Consider the following generative models (assume data is some fixed datapoint):
T = 100
x ~ normal(0.0, 1.0)
for t in 1:T:
    x ~ normal(x, 1.0)
    y ~ normal(x, 0.5)
vs
stop ~ bernoulli(0.01)
x ~ normal(0.0, 1.0)
while not stop:
    x ~ normal(x, 1.0)
    y ~ normal(x, 0.5)
    stop ~ bernoulli(0.01)
• In each of these examples, the first program corresponds to a static graph, while the second does not:
• In the first example, the second program exhibits stochastic control flow, where branching choice depends on randomness
• In the second example, the second program exhibits open universe structure where the number of random variables is technically unbounded
• We'll write each of these programs in a PPL (Pyro) and examine their outputs both in standard interpretation (i.e., evaluating their logpdfs at some data) and in nonstandard interpretation (using them to generate data in the first place).
• By examining their traces, we can also see the difference between static and non-static structure.
In [9]:
def normal_static_model(size=1):
    loc = pyro.sample("loc", dist.Normal(0.0, 1.0))
    scale = pyro.sample("scale", dist.Gamma(2.0, 2.0))
    values = pyro.sample("values", dist.Normal(loc, scale).expand((size,)))
    return values

def normal_dynamic_model(size=1):
    probs = pyro.sample("probs", dist.Beta(1.0, 1.0))
    branch = pyro.sample("branch", dist.Bernoulli(probs))
    if branch > 0:
        loc = pyro.sample("loc", dist.Normal(0.0, 1.0))
    else:
        loc = torch.tensor(2.0)
    scale = pyro.sample("scale", dist.Gamma(2.0, 2.0))
    values = pyro.sample("values", dist.Normal(loc, scale).expand((size,)))
    return values

In [10]:
# nonstandard interpretation -- make some data
size = 500

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
plot_pyro_program(normal_static_model, ax=axes[0], size=size)
plot_pyro_program(normal_dynamic_model, ax=axes[1], size=size)
fig.tight_layout()

In [11]:
# let's examine the traces from some model calls.
static_trace = pyro.poutine.trace(normal_static_model).get_trace(size=5)
static_trace.compute_log_prob()
clean_trace_view(static_trace.nodes)

Out[11]:
OrderedDict([('input', {'size': 5}),
('loc',
{'fn': Normal(loc: 0.0, scale: 1.0),
'value': tensor(1.9132),
'lpdf': tensor(-2.7491)}),
('scale',
{'fn': Gamma(concentration: 2.0, rate: 2.0),
'value': tensor(0.6253),
'lpdf': tensor(-0.3338)}),
('values',
{'fn': Normal(loc: torch.Size([5]), scale: torch.Size([5])),
'value': tensor([1.6168, 1.7076, 1.6175, 1.2709, 1.7594]),
'lpdf': tensor(-3.0830)}),
('output',
{'value': tensor([1.6168, 1.7076, 1.6175, 1.2709, 1.7594])})])
In [12]:
# let's examine the traces from some model calls.
dynamic_trace = pyro.poutine.trace(normal_dynamic_model).get_trace(size=5)
dynamic_trace.compute_log_prob()
clean_trace_view(dynamic_trace.nodes)

Out[12]:
OrderedDict([('input', {'size': 5}),
('probs',
{'fn': Beta(), 'value': tensor(0.5612), 'lpdf': tensor(0.)}),
('branch',
{'fn': Bernoulli(probs: 0.5611799955368042, logits: 0.24595242738723755),
'value': tensor(0.),
'lpdf': tensor(-0.8237)}),
('scale',
{'fn': Gamma(concentration: 2.0, rate: 2.0),
'value': tensor(0.9755),
'lpdf': tensor(-0.5895)}),
('values',
{'fn': Normal(loc: torch.Size([5]), scale: torch.Size([5])),
'value': tensor([2.5593, 3.5110, 3.5871, 2.3304, 1.7199]),
'lpdf': tensor(-7.2568)}),
('output',
{'value': tensor([2.5593, 3.5110, 3.5871, 2.3304, 1.7199])})])
In [13]:
# let's examine the traces from some model calls.
conditioned_static_model = pyro.poutine.condition(
    normal_static_model, data=dict(values=torch.tensor(3.0))
)
static_trace = pyro.poutine.trace(conditioned_static_model).get_trace(size=1)
static_trace.compute_log_prob()
clean_trace_view(static_trace.nodes)

Out[13]:
OrderedDict([('input', {'size': 1}),
('loc',
{'fn': Normal(loc: 0.0, scale: 1.0),
'value': tensor(-1.7348),
'lpdf': tensor(-2.4237)}),
('scale',
{'fn': Gamma(concentration: 2.0, rate: 2.0),
'value': tensor(1.3444),
'lpdf': tensor(-1.0066)}),
('values',
{'fn': Normal(loc: tensor([-1.7348]), scale: tensor([1.3444])),
'value': tensor(3.),
'lpdf': tensor(-7.4164),
'observed': True}),
('output', {'value': tensor(3.)})])
In [14]:
# let's examine the traces from some model calls.
conditioned_dynamic_model = pyro.poutine.condition(
    normal_dynamic_model, data=dict(values=torch.tensor(3.0), loc=torch.tensor(1.0))
)
dynamic_trace = pyro.poutine.trace(conditioned_dynamic_model).get_trace(size=1)
dynamic_trace.compute_log_prob()
clean_trace_view(dynamic_trace.nodes)

Out[14]:
OrderedDict([('input', {'size': 1}),
('probs',
{'fn': Beta(), 'value': tensor(0.1226), 'lpdf': tensor(0.)}),
('branch',
{'fn': Bernoulli(probs: 0.12255457043647766, logits: -1.9684582948684692),
'value': tensor(1.),
'lpdf': tensor(-2.0992)}),
('loc',
{'fn': Normal(loc: 0.0, scale: 1.0),
'value': tensor(1.),
'lpdf': tensor(-1.4189),
'observed': True}),
('scale',
{'fn': Gamma(concentration: 2.0, rate: 2.0),
'value': tensor(0.7808),
'lpdf': tensor(-0.4228)}),
('values',
{'fn': Normal(loc: tensor([1.]), scale: tensor([0.7808])),
'value': tensor(3.),
'lpdf': tensor(-3.9519),
'observed': True}),
('output', {'value': tensor(3.)})])
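The open-universe program from earlier never made it into code above. Here is a pure-Python sketch of the traces it induces (in Pyro you would give each pyro.sample call a unique string address, e.g. f"x_{t}", inside the loop). The point is that the set of addresses, and hence the trace length, is itself random and unbounded.

```python
import random

def open_universe_trace(p_stop=0.01):
    """One execution trace of the open-universe model: address -> value."""
    t = {}
    t["stop_0"] = random.random() < p_stop
    t["x_0"] = random.gauss(0.0, 1.0)
    i = 0
    while not t[f"stop_{i}"]:
        i += 1
        t[f"x_{i}"] = random.gauss(t[f"x_{i - 1}"], 1.0)
        t[f"y_{i}"] = random.gauss(t[f"x_{i}"], 0.5)
        t[f"stop_{i}"] = random.random() < p_stop
    return t

random.seed(3)
# different runs yield traces with different numbers of addresses
lengths = {len(open_universe_trace()) for _ in range(50)}
```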