drd | prob/lppl/deep-dive

lppl is a library that aims to give the ordinary C++ developer the ability to embed probabilistic code, and perform inference over that code, in a way that is extraordinarily low overhead while maintaining the maximal expressiveness of something like Pyro. The lppl vision is some convex combination of the following scenarios: (a) many, possibly hundreds, of small probabilistic programs executing simultaneously on an ordinary commodity laptop; or (b) between one and a few (say, 10) probabilistic programs executing in a very resource constrained environment (e.g., a dashcam or other microprocessor); and in either scenario doing inference over models that are useful in making complex decisions, not solving simple supervised learning problems or generating streams of text.

How it works

The core modeling API is disarmingly simple and follows textbook implementations closely. To state that a random variable x is sampled from a particular distribution (say, Normal(μ, σ)), you’d use a sample statement:

auto x = sample(r, "x", Normal(mu, sigma), rng);

where RNG meets the pseudorandom number engine interface. Likewise, to state that you observe a particular value that you are modeling as Normal(μ, σ), you’d use an observe statement:

auto obs = observe(r, "value", Normal(mu, sigma), value);

Note that x and obs are ordinary C++ values – in this case, each would be a double. To construct a probabilistic program, you just write a pure C++ function that uses sample and observe statements. There’s nothing else to it. (At least, there’s nothing else to it programmatically; knowing how to construct a useful generative probabilistic model of a process is an entirely different question and wholly out of scope for this document.) Do, however, note the pure qualifier – the probabilistic program should act like a pure function, as it will be executed potentially many times during inference with no consideration of any side effects it might have.

Once you have a probabilistic program, you probably want to do inference over it. Inference methods do expect probabilistic programs to meet a particular API:

template<typename I, typename O, typename... Ts>
using pp_t = std::function<O(record_t<DTypes<Ts...>>&, I)>;

but, of course, you can write a lambda to match the interface of whatever function you’ve written to the pp_t interface. lppl decomposes inference into two orthogonal components: (a) generating samples from the posterior probability distribution; and (b) constructing a view of the posterior. Decomposing inference into these components leads naturally to computation of query results using online algorithms – first you draw a sample from the posterior, then you update your online view, repeat. So, to do inference, you need to specify both an inference algorithm you want to use to draw samples from the posterior and a queryer you want to use to construct a view. By default, any inference algorithm can be used with any queryer (the only time that wouldn’t be true is if you specialized a template for some reason). For example, supposing our probabilistic program were called f, we might write

auto q = weighted_mean<double, Normal, Gamma>("scale");
...
auto infer = inference<GenericMetropolis>(f, *q, opts);
auto res = infer(data);

Distinguishing features

Flexibility

lppl is designed to be extraordinarily flexible and a useful tool for research in foundations of probabilistic reasoning, not only a library for low-resource probabilistic modeling and statistical inference. lppl does not stop you from doing all sorts of things that don’t make sense from a typicaly statistical inference perspective – this can be seen as a feature or a bug depending on your level of comfort with probabilsitic program semantics.

Streaming-first

lppl;s interfaces are flexible, so you can implement nearly any kind of inference algorithm and querying capability you can think of in the framework and it should play nicely with existing components. However, lppl is designed specifically around low-memory applications (not lowest memory applications; for that see the program synthesis offshoot). Using an 𝒪(1) memory complexity queryer and expressing the probabilstic program using the filtering interface, it should be possible to do inference over arbitrarily large data streams – whether those are view-once streams sourced from physical sensors or the result of incrementally exectuting a database query.

Probabilistic program variants

The basic probabilistic program is a callable that takes in a record and an input and returns an output, as described above:

template<typename I, typename O, typename... Ts>
using pp_t = std::function<O(record_t<DTypes<Ts...>>&, I)>;

An lppl probabilistic program defines a joint density p(y) where y = {y_a}_a ∈ A, A being the set of addresses in the program and y the corresponding random variable. An address uniquely identifies a random variable, and neither the contents nor the cardinality of A may be known at compile time. Where it does not run the risk of confusion, we will often write the joint density as p(x, z) where x are observed random variables – random variables with an observe statement – and the z are latent random variables – random variables with a sample statement. Inference methods draw samples from the posterior distribution p(z|x), while queryers return a view of that posterior distribution V[p(z|x)] computed by consuming posterior samples.

There are other probabilsitic program types that are specialize this alias or are similar to it, and are useful for other purposes.

Examples

There’s a repository of examples that showcases some of lppl’s capabilities. The examples are biased toward use cases that I think might be well-suited for the library, such as time series filtering, forecasting, and nowcasting, and open-universe modeling (e.g., clustering with an unknown number of clusters). Here’s a dive into two of the examples, with more to discover in the repo.

pipe-filter-forecast

This example constructs two related models – one for filtering a time series, the other for forecasting future values – and performs inference, variational updates, and forecasting tasks. The filtering model f(x_t, z_t) is a simple normal filtering model:
f(x_t, μ_t, τ_t) = Normal(x_t|μ_t, τ_t^− 1/2)q(μ_t)q(τ_t),
where q(μ_t) = Normal(μ_t|μ₀, σ₀) and q(τ_t) = Gamma(k, θ) initially. This is implemented as an updatable probabilistic program:

upp_t<DefaultPolicy, double, double, Normal, Gamma> f = [](
        record_t<DTypes<Normal, Gamma>>& r, 
        typed_map<DefaultPolicy, double, Normal, Gamma>& input
    ) {
        auto loc = sample_u<Normal>(r, "loc", input, rng);
        auto precision = sample_u<Gamma>(r, "precision", input, rng);
        return observe_u<double>(r, "obs", Normal(loc, 1.0 / std::sqrt(precision)), input);
    };

The forecasting model g(z_t, z_t − 1) uses the current estimate of latent state to construct a linear forecast of observed state:
g(x_t, z_t − 1, μ_t, τ_t) = Normal(x_t|z_t − 1 + μ_tdt, τ_t^− 1/2)q(μ_t)q(τ_t),
with the same initial distributions for q(μ_t) and q(τ_t) as in f. This is also implemented as an updateable probabilistic program:

upp_t<DefaultPolicy, double, double, Normal, Gamma> g = [](
        record_t<DTypes<Normal, Gamma>>& r, 
        typed_map<DefaultPolicy, double, Normal, Gamma>& input
    ){  
        auto dt = input.template extract_value<double>("dt");
        auto mu = sample_u<Normal>(r, "mu", input, rng);
        auto precision = sample_u<Gamma>(r, "precision", input, rng);
        auto x_prev = sample_u<Normal>(r, "x_prev", input, rng);
        return observe_u<double>(r, "x", Normal(x_prev + dt * mu, std::sqrt(dt) / std::sqrt(precision)), input);
    };

// queryer -- O(1) query computing sufficient statistics
auto q = ProductGenerator<WeightedMeanStd, std::pair<double, double>, double, Normal, Gamma>::generate({"loc", "precision"});
auto q_forecast = ProductGenerator<WeightedMeanStd, std::pair<double, double>, double, Normal, Gamma>::generate({"mu", "precision"});
auto opts = inf_options_t(10000);

// inference algorithm
auto infer = inference<LikelihoodWeighting>(f, q, opts);
auto infer_forecast = inference<LikelihoodWeighting>(g, q_forecast, opts);

// automatically derive a filtering algorithm
auto algo = filter<ParameterMatching, DefaultPolicy>(f, q, infer);
auto algo_forecast = filter<ParameterMatching, DefaultPolicy>(g, q_forecast, infer_forecast);

The filtering algorithm derived here is similar to a standard particle filter using the prior as a proposal distribution, but differs in some key ways: + The posterior is never built up in memory but instead generated one sample at a time + Each sample is used to incrementally compute a streaming mean and standard deviation of each latent variable site + Analytical prior distributions are used at the subsequent timestep (as opposed to using a weighted collection of particles). The DefaultPolicy mapping is used, meaning that the functional form of the prior distribution will be the same at the subsequent timestep.

During inference, the posterior distribution of μ_t from f is used as the input distribution for z_t − 1 in g, which is how the models are connected:

auto x_prev_dist = m.template extract<Normal>("loc");
m_forecast.insert_or_assign("x_prev", x_prev_dist);

cat ../data/doubles.stream 
1.0
1.1
2.0
2.5
3.0
3.1
3.125
3.5
3.2
2.6
2.5
2.0

The algorithm executes 10,000 iterations of importance sampling, posterior computations, and posterior -> prior updating for each of the 12 datapoints in about 2.3 seconds total while using 4MB RAM:

cat ../data/doubles.stream | /usr/bin/time -v ./pipe-filter-forecast > ../data/pipe-filter-forecast.csv 
    Command being timed: "./pipe-filter-forecast"
    User time (seconds): 2.31
    System time (seconds): 0.00
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.32
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3968
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 151
    Voluntary context switches: 2
    Involuntary context switches: 9
    Swaps: 0
    File system inputs: 0
    File system outputs: 8
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

We can see the result of this (yes, I admit, visualized using Python…)

On the top chart, the blue curve are the actual “observed” (come on, play along) datapoints. The grey probability distributions are distributions of latent state, while the red lines are draws from the posterior forecast distribution over the next observed datapoint.

pipe-clustering

This demonstrates implementation of and inference over a canonical weak-open-universe clustering model. In words, the model a priori doesn’t bound the number of clusters, but instead draws the number from a probability distribution defined over the positive integers. Given the number of clusters, the structure of the rest of the program is known at compile time. Implemented, it looks like this:

template<typename RNG>
int clustering(cluster_t& r, data_t& data, RNG& rng) {

    auto num_clusters = sample(r, "num_clusters - 1", Poisson(1), rng) + 1;

    // cluster parameterization
    std::vector<Normal> dists;
    for (int ix = 0; ix != num_clusters; ++ix) {
        auto loc = sample(r, "cluster/" + std::to_string(ix) + "/loc", Normal(0.0, 4.0), rng);
        auto scale = sample(r, "cluster/" + std::to_string(ix) + "/scale", Gamma(2.0, 2.0), rng);
        dists.push_back(Normal(loc, scale));
    }

    // data point assignment
    int ix = 0;
    for (const auto& d : data) {
        auto weights = make_weights_l2(d, dists);
        auto this_cluster = sample(r, "data/" + std::to_string(ix) + "/cluster", Categorical(weights), rng);
        observe(r, "obs/" + std::to_string(ix), dists[this_cluster], d);
        ++ix;
    }
    return num_clusters;
}

There are lots of possible ways to assign points to clusters. This model uses make_weights_l2 to assign the point to a random cluster according to its score under the analytical cluster distribution,
$$ \pi_{n,m} \propto \exp\left\{-\frac{1}{2}\left( \frac{x_n - \mu_m}{\sigma_m}\right)^2\right\}. $$
We can make inference over this model interesting. The obvious way to do it is just to run some sort of importance sampling or MCMC algorithm over it, but this could lead to inaccurate and high-variance estimates due to the discrete latent random variable all the way upstream in the computation graph. In Stan, you’d handle this by manually enumerating out this random variable (e.g., see the Stan changepoint detection example). In Pyro’s infinite awesomeness, you’d breezily ask Pyro to get rid of that pesky latent random variable for you by passing {"config": "enumerate"} as an inference argument. lppl takes a middle approach, making interfaces flexible enough for you to implement your own automatic enumeration inference algorithm generator. This whole story is in infer_enumerate.hpp, but the core of the idea is contained in the call method of the generated inference algorithm and should be fairly self-explanatory:

VType operator()(I& input) {
    VType out;
    // could convert to embarassingly parallel algo
    for (int ix = low; ix != high; ++ix) {
        auto cond_f = condition<D>(this->f, address, ix);
        auto infer = Inference<A, I, O, V, Q, Ts...>(cond_f, this->queryer, this->opts);
        if constexpr (has_clear<decltype(this->queryer)>::value) {
            this->queryer.clear();
        }
        out[ix] = infer(input);
    }
    return out;
}

The enumeration functionality we define can automatically enumerate any type of discrete latent random variable. Here, we use enumeration in a maximum a posteriori inference problem, using Metropolis-Hastings to compute the MAP for each possible number of clusters in the reduced search space {1, ..., 7}.

auto q = Optimizer<double, int, Normal, Gamma, Categorical, Poisson>(
        [](cluster_t& r, int&, double&) { return logprob(r); },
        [](cluster_t& r, int&, double&) { return logprob(r); }
    );
...
auto enum_alg = enumerate<AncestorMetropolis, Poisson>(f, q, opts);
auto infer = enum_alg("num_clusters - 1", 0, 6);

cat ../data/clusters.stream 
1.1
1.2
1.14
0.85
0.9
3.1
3.3
3.2
3.1
3.0
2.9

cat ../data/clusters.stream | /usr/bin/time -v ./pipe-clustering 
0: -21.3459
1: -16.5283
2: -28.0056
3: -29.4129
4: -39.4436
5: -54.8752
    Command being timed: "./pipe-clustering"
    User time (seconds): 6.05
    System time (seconds): 0.00
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.06
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4224
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 163
    Voluntary context switches: 1
    Involuntary context switches: 110
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

correctly inferring the (admittedly not that challenging) number of two clusters in about six seconds using about 4MB RAM.

Low-resource universal stochastic computation with `lppl`

How it works

Distinguishing features

Flexibility

Streaming-first

Probabilistic program variants

Examples

`pipe-filter-forecast`

`pipe-clustering`

Low-resource universal stochastic computation with lppl

How it works

Distinguishing features

Flexibility

Streaming-first

Probabilistic program variants

Examples

pipe-filter-forecast

pipe-clustering

Low-resource universal stochastic computation with `lppl`

`pipe-filter-forecast`

`pipe-clustering`