MCMC

  • Markov-chain Monte Carlo: the idea is to incrementally modify a trace in order to obtain a better approximation to the posterior density
  • MCMC sampling uses a proposal distribution (like $q(z)$ above) that proposes a new value $z'$ for the latent variables given the "current" value $z$. This proposal behavior is usually local in the sense that $z'$ is usually "close" to $z$.
  • Though we will discuss MCMC sampling only in the context of sampling from probability distributions, the method is useful for many other things (numerical integration and optimization are two other common use cases)

Metropolis-Hastings

  • Probably the most common MCMC algorithm, and very, very useful. Easily summarized:

    z ~ p(z)                          # initialize, e.g., from the prior
    samples = []
    
    for i = 1,...,N:
      z' ~ q(z'|z)                    # propose a new value near z
      Q = q(z|z'); Q' = q(z'|z)       # reverse and forward proposal densities
      P = p(x, z); P' = p(x, z')      # joint densities at the fixed observed data x
      a = (P' / Q') / (P / Q)         # MH acceptance ratio
      if uniform() < min(1, a):
          z = z'                      # accept; otherwise keep the current z
      samples.append(z)
  • There are two principal difficulties:
    1. How should we define a proposal kernel $q(z'|z)$ that can be evaluated at both $z'$ and $z$? Part of this is deciding how best to define "closeness" between the points $z'$ and $z$.
    2. How should we deal with a program that emits traces of different lengths?
    (A minimal runnable sketch of the basic algorithm in Julia follows below.)
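  • Below is a minimal, self-contained random-walk MH sketch (ours, not from the original notes) for a toy model $z \sim \text{Normal}(0, 1)$, $x_i \sim \text{Normal}(z, 1)$; the synthetic data, proposal scale, and iteration count are all arbitrary choices:

    using Distributions, Random, Statistics

    Random.seed!(2021)
    x = rand(Normal(1.5, 1.0), 50)                # synthetic "observed" data

    # log joint density log p(x, z), with the observed data x held fixed
    logjoint(z) = logpdf(Normal(0, 1), z) + sum(logpdf.(Normal(z, 1), x))

    function random_walk_mh(logjoint, n_iter; sigma = 0.5)
        z = 0.0                                   # arbitrary initialization
        samples = Float64[]
        for _ in 1:n_iter
            zp = rand(Normal(z, sigma))           # symmetric proposal, so Q == Q'
            a = exp(logjoint(zp) - logjoint(z))   # acceptance ratio reduces to P' / P
            if rand() < min(1, a)
                z = zp                            # accept; otherwise keep z
            end
            push!(samples, z)
        end
        samples
    end

    samples = random_walk_mh(logjoint, 10_000)
    mean(samples)                                 # ≈ posterior mean of z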

Metropolis-Hastings

  • Solving question (1) is pretty simple -- or, at least, there are simple answers that aren't too terrible. There are some common symmetric proposal kernels that work well when domains are either unbounded or we expect the posterior density to lie very far from a boundary:
    • Symmetric normal: $z' \sim \text{Normal}(z, \sigma^2)$, where $\sigma^2$ is usually calibrated during sampling
    • Symmetric categorical (for a discrete $z$): $z' \sim \text{Categorical}(\{z - 1, z, z + 1\}, p = 1/3 \text{ each})$
  • Any solution like this can be easily modified to incorporate boundary conditions (as long as probability density / mass is modified accordingly)
  • These univariate proposals can be used to sample from the posterior distributions of multivariate densities -- at each iteration, we choose a random site and propose only to that one, keeping the other sites in the trace constant (see Algorithm 17 on p177 of this text for more details). In exchange for ease of calculation, we'll need to acquire more samples in order to cover the posterior mass.
  • Recommend checking out Gen and / or Pyro for how this might be implemented in practice (a small sketch using Gen's custom-proposal form of mh follows below).
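  • For example, a symmetric drift proposal for the :log_sigma site of the model defined in the next cell might look like the following sketch (the proposal scale 0.25 is an arbitrary choice; this assumes Gen's custom-proposal form of mh):

    @gen function drift_proposal(trace, sigma)
        # propose a new :log_sigma close to its current value; all other
        # sites in the trace are left unchanged
        @trace(normal(trace[:log_sigma], sigma), :log_sigma)
    end

    # one MH step with this kernel, given a trace of the model:
    # trace, accepted = mh(trace, drift_proposal, (0.25,))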
In [2]:
@gen function model(n_samples, verbose :: Bool)
    b = @trace(beta(2.0, 2.0), :b)
    z = @trace(bernoulli(b), :z)
    log_sigma = @trace(normal(-1.0, 1.0), :log_sigma)
    data = Array{Float64, 1}(undef, n_samples)
    for n in 1:n_samples
        data[n] = @trace(normal(z, exp(log_sigma)), (:data, n))
    end
    if verbose
       println("Used b = $b, z = $z, and log_sigma = $log_sigma") 
    end
    data
end
Out[2]:
DynamicDSLFunction{Any}(Dict{Symbol,Any}(), Dict{Symbol,Any}(), Type[Any, Bool], false, Union{Nothing, Some{Any}}[nothing, nothing], ##model#253, Bool[0, 0], false)
  • We'll write a simple proposal kernel for this model. The latent random variables are :b, :z, and :log_sigma. We have to be careful not to propose to any of the (:data, n) since these are our observable rvs (at least, in this context).
In [3]:
# one selection-based MH move at a single address
mh_step(trace, address) = mh(trace, select(address))[1]

# sweep over several addresses, one move each
function mh_step(trace, addresses :: AbstractArray)
    for address in addresses
       trace = mh_step(trace, address)
    end
    trace
end

function proposal(n_iter, data)
    # condition the model on the observed data
    cm = choicemap()
    for (n, d) in enumerate(data)
        cm[(:data, n)] = d
    end
    trace, _ = generate(model, (length(data), false), cm)
    posterior = Dict(:b => [], :z => [], :log_sigma => [])
    addresses = [:b, :z, :log_sigma]
    for i in 1:n_iter
        trace = mh_step(trace, addresses)
        for addr in addresses
           push!(posterior[addr], trace[addr]) 
        end
    end
    posterior
end
Out[3]:
proposal (generic function with 1 method)
In [4]:
Random.seed!(2021)
data = model(100, true)
post = proposal(100000, data)
Used b = 0.6105591808638493, z = true, and log_sigma = -1.9209084060204957
Out[4]:
Dict{Symbol,Array{Any,1}} with 3 entries:
  :b         => Any[0.488831, 0.807375, 0.75511, 0.75511, 0.418341, 0.898416, 0…
  :z         => Any[false, true, true, true, true, true, true, true, true, true…
  :log_sigma => Any[-0.902225, -0.902225, -1.29089, -1.91315, -1.91315, -1.9131…
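  • A quick way to summarize these chains (a sketch, not from the original notebook; the burn-in length is an arbitrary choice):

    using Statistics
    burn = 10_000
    for addr in (:b, :z, :log_sigma)
        println(addr, " => ", mean(post[addr][burn + 1:end]))
    end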
In [6]:
# p1 and p2 are plot objects defined in an omitted cell (presumably
# visualizations of the posterior samples above)
p = plot(p1, p2)
Out[6]:
[plot output omitted]

Variational inference

  • Up until now, we've focused on inference methods that let us "exactly approximate" the posterior. What we mean by this: we can prove that, if the inference method runs for long enough, our approximate posterior is guaranteed to converge to the analytical posterior. However, this might take a very long time.
  • Variational inference turns this idea on its head. Instead of attempting to "exactly approximate" the posterior, we give up on ever actually knowing the true posterior and instead settle for finding optimal parameters of a known probability distribution (called the variational posterior or guide, and often denoted by $q_\psi(z)$ where $\psi$ denotes the vector of parameters; a concrete example family is sketched after this list). There are positives and negatives to this approach:
    • (+) $q_\psi(z)$ is a parametric probability distribution -- in theory, this means we know lots of facts about it and might be able to perform analytical calculations with it (as opposed to the true posterior $p(z|x)$ about which we usually know almost nothing)
    • (+) This formulation turns the hard problem of sampling into an optimization problem -- and we are very, very good at gradient-based optimization.
    • (-) $q_\psi(z) \neq p(z|x)$ in general, for any value of $\psi$. How much of an issue this is depends on the characteristics of the joint density and the observed data. If $q_\psi$ isn't expressive enough (we'll give an example of this momentarily), this approximation can be really bad.
    • (-) There are several technical shortcomings of (naive) VI, including a tendency to underestimate dispersion.
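  • For concreteness, a common choice (and the form of the guide g1 used below) is a mean-field Gaussian family: $$ q_\psi(z) = \prod_{i=1}^{d} \text{Normal}(z_i \mid \mu_i, \sigma_i^2), \qquad \psi = (\mu_1, \log \sigma_1, \ldots, \mu_d, \log \sigma_d), $$ i.e., each latent site gets an independent normal with a learnable mean and (log) standard deviation.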

Variational inference

  • We have to cover a little more information theory: Kullback-Leibler divergence (KLD) and the evidence lower bound (ELBO).
  • KLD: Defined as $$ D(q||p) = E_{z\sim q(z)}[\log q(z) - \log p(z)] = \int\ dz\ q(z) [\log q(z) - \log p(z)]. $$ In other words, it's the average under $q(z)$ of the difference in score functions of $q$ and $p$. Unpacking this: if we observe $z$, we can compute its score (a statistic of how likely it is) under each of $p$ and $q$. KLD does this for each possible $z$ while assuming that $z$ is truly generated by $q(z)$. This is often referred to as "the distance from $p$ to $q$", but that isn't accurate since KLD isn't a distance metric: it isn't symmetric (heuristically, how distance is measured depends on the endpoint $q$). A worked check with two normals follows below.
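  • As a concrete check (ours, not from the original notes): for two univariate normals $q = \text{Normal}(\mu_q, \sigma_q^2)$ and $p = \text{Normal}(\mu_p, \sigma_p^2)$, the KLD has the closed form $D(q||p) = \log(\sigma_p / \sigma_q) + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2\sigma_p^2} - \frac{1}{2}$, which we can verify against a Monte Carlo estimate of the definition:

    using Distributions, Random, Statistics

    kld_normal(mq, sq, mp, sp) = log(sp / sq) + (sq^2 + (mq - mp)^2) / (2sp^2) - 0.5

    Random.seed!(2021)
    q, p = Normal(0.0, 1.0), Normal(1.0, 2.0)
    zs = rand(q, 1_000_000)
    mc = mean(logpdf.(q, zs) .- logpdf.(p, zs))  # E_{z~q}[log q(z) - log p(z)]
    kld_normal(0.0, 1.0, 1.0, 2.0), mc           # these should agree closely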

Variational inference

  • We'd like our variational posterior $q_\psi(z)$ to minimize the KLD from the true posterior. That is, we want to minimize $$ \begin{aligned} D(q_\psi(z)||p(z|x)) &= E_{z\sim q_\psi(z)}[\log q_\psi(z) - \log p(z|x)] = E_{z\sim q_\psi(z)}[\log q_\psi(z) - \log p(x, z) + \log p(x)]\\ &= \log p(x) + E_{z\sim q_\psi(z)}[\log q_\psi(z) - \log p(x, z)]. \end{aligned} $$ From this last equation, we have $\log p(x) - D(q_\psi(z)||p(z|x)) = E_{z\sim q_\psi(z)}[\log p(x, z) - \log q_\psi(z)] = ELBO(q_\psi(z) || p(x, z))$, the evidence lower bound. Since $D \geq 0$, the ELBO really is a lower bound on the (log) evidence $\log p(x)$ -- and since $\log p(x)$ doesn't depend on $\psi$, maximizing the ELBO is equivalent to minimizing the KLD.
  • Now we can try to minimize $-ELBO(q_\psi(z) || p(x, z))$ using some variant of gradient descent. There are subtleties associated with this that we won't go into here -- see the Pyro docs if you're very interested in the actual computation of $\nabla_{\psi}ELBO(q_\psi(z) || p(x, z))$. A naive Monte Carlo ELBO estimator is sketched below.
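  • The ELBO itself is straightforward to estimate by Monte Carlo, as in the following sketch (logjoint, sample_q, and logq are hypothetical stand-ins for $\log p(x, z)$, a sampler for $q_\psi$, and $\log q_\psi$):

    function elbo_estimate(logjoint, sample_q, logq; n = 1_000)
        total = 0.0
        for _ in 1:n
            z = sample_q()                    # z ~ q_psi(z)
            total += logjoint(z) - logq(z)    # log p(x, z) - log q_psi(z)
        end
        total / n
    end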
In [7]:
Random.seed!(2021)

@gen function m1(size :: Int, verbose :: Bool)
    loc = @trace(normal(0.0, 1.0), :loc)
    log_scale = @trace(normal(0.0, 1.0), :log_scale)
    if verbose
        println("loc = $loc, log_scale = $log_scale")
    end
    data = Array{Float64, 1}(undef, size)
    for i in 1:size
         data[i] = @trace(normal(loc, exp(log_scale)), (:data, i))
    end
    data
end
Out[7]:
DynamicDSLFunction{Any}(Dict{Symbol,Any}(), Dict{Symbol,Any}(), Type[Int64, Bool], false, Union{Nothing, Some{Any}}[nothing, nothing], ##m1#258, Bool[0, 0], false)
In [8]:
# we'll use grad to annotate the fact that we can differentiate w.r.t. parameters in g1
@gen (grad) function g1()
    @param loc_mu :: Float64
    @param loc_log_sigma :: Float64
    @param log_scale_mu :: Float64
    @param log_scale_log_sigma :: Float64
    @trace(normal(loc_mu, exp(loc_log_sigma)), :loc)
    @trace(normal(log_scale_mu, exp(log_scale_log_sigma)), :log_scale)
end
Out[8]:
DynamicDSLFunction{Any}(Dict{Symbol,Any}(), Dict{Symbol,Any}(), Type[], false, Union{Nothing, Some{Any}}[], ##g1#259, Bool[], true)
In [9]:
# generate some fake data
size = 500
data = m1(size, true)
data_cm = choicemap()
for (i, pt) in enumerate(data)
    data_cm[(:data, i)] = pt
end

# initialize parameters and set optimizer
t_pars = [:loc_mu, :loc_log_sigma, :log_scale_mu, :log_scale_log_sigma]  # everything we can grad
for p in t_pars
     init_param!(g1, p, 0.0)  # more intelligent choices are possible!
end
vi_update = ParamUpdate(GradientDescent(0.0005, 5), g1 => t_pars);
loc = 0.516653904160697, log_scale = -0.08057717729047323
In [10]:
(elbo_estimate, traces, elbo_history) = black_box_vi!(
    m1, (size, false),
    data_cm,
    g1, (),
    vi_update,
    iters=250
);
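  • To see what was learned, we can read off the fitted variational parameters and compare them with the true loc and log_scale printed above (a sketch, assuming Gen's get_param accessor):

    for p in t_pars
        println(p, " => ", get_param(g1, p))
    end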
In [12]:
# p1, p2, and p3 are plot objects defined in an omitted cell (presumably
# the ELBO history and the fitted variational marginals)
plot(p1, p2, p3)
Out[12]:
[plot output omitted]