Low-resource universal stochastic computation with lppl

lppl is a library that aims to give the ordinary C++ developer the ability to embed probabilistic code, and perform inference over that code, in a way that is extraordinarily low overhead while maintaining the maximal expressiveness of something like Pyro. The lppl vision is some convex combination of the following scenarios: (a) many, possibly hundreds, of small probabilistic programs executing simultaneously on an ordinary commodity laptop; or (b) between one and a few (say, 10) probabilistic programs executing in a very resource constrained environment (e.g., a dashcam or other microprocessor); and in either scenario doing inference over models that are useful in making complex decisions, not solving simple supervised learning problems or generating streams of text.

How it works

The core modeling API is disarmingly simple and follows textbook implementations closely. To state that a random variable x is sampled from a particular distribution (say, Normal(μ, σ)), you’d use a sample statement:

auto x = sample(r, "x", Normal(mu, sigma), rng);

where RNG meets the pseudorandom number engine interface. Likewise, to state that you observe a particular value that you are modeling as Normal(μ, σ), you’d use an observe statement:

auto obs = observe(r, "value", Normal(mu, sigma), value);

Note that x and obs are ordinary C++ values – in this case, each would be a double. To construct a probabilistic program, you just write a pure C++ function that uses sample and observe statements. There’s nothing else to it. (At least, there’s nothing else to it programmatically; knowing how to construct a useful generative probabilistic model of a process is an entirely different question and wholly out of scope for this document.) Do, however, note the pure qualifier – the probabilistic program should act like a pure function, as it will be executed potentially many times during inference with no consideration of any side effects it might have.

Once you have a probabilistic program, you probably want to do inference over it. Inference methods do expect probabilistic programs to meet a particular API:

template<typename I, typename O, typename... Ts>
using pp_t = std::function<O(record_t<DTypes<Ts...>>&, I)>;

but, of course, you can write a lambda to match the interface of whatever function you’ve written to the pp_t interface. lppl decomposes inference into two orthogonal components: (a) generating samples from the posterior probability distribution; and (b) constructing a view of the posterior. Decomposing inference into these components leads naturally to computation of query results using online algorithms – first you draw a sample from the posterior, then you update your online view, repeat. So, to do inference, you need to specify both an inference algorithm you want to use to draw samples from the posterior and a queryer you want to use to construct a view. By default, any inference algorithm can be used with any queryer (the only time that wouldn’t be true is if you specialized a template for some reason). For example, supposing our probabilistic program were called f, we might write

auto q = weighted_mean<double, Normal, Gamma>("scale");
...
auto infer = inference<GenericMetropolis>(f, *q, opts);
auto res = infer(data);

to (a) infer the posterior mean of a random variable named "scale" (b) using an online algorithm and (c) using Metropolis-Hastings with a user-defined proposal kernel. (The opts thing is just a bunch of options to pass to the inference algorithm.) You can see all the inference algorithms here and queryers here. Have fun!

Distinguishing features

Flexibility

lppl is designed to be extraordinarily flexible and a useful tool for research in foundations of probabilistic reasoning, not only a library for low-resource probabilistic modeling and statistical inference. lppl does not stop you from doing all sorts of things that don’t make sense from a typicaly statistical inference perspective – this can be seen as a feature or a bug depending on your level of comfort with probabilsitic program semantics.

Streaming-first

lppl;s interfaces are flexible, so you can implement nearly any kind of inference algorithm and querying capability you can think of in the framework and it should play nicely with existing components. However, lppl is designed specifically around low-memory applications (not lowest memory applications; for that see the program synthesis offshoot). Using an 𝒪(1) memory complexity queryer and expressing the probabilstic program using the filtering interface, it should be possible to do inference over arbitrarily large data streams – whether those are view-once streams sourced from physical sensors or the result of incrementally exectuting a database query.

Probabilistic program variants

The basic probabilistic program is a callable that takes in a record and an input and returns an output, as described above:

template<typename I, typename O, typename... Ts>
using pp_t = std::function<O(record_t<DTypes<Ts...>>&, I)>;

An lppl probabilistic program defines a joint density p(y) where y = {ya}a ∈ A, A being the set of addresses in the program and y the corresponding random variable. An address uniquely identifies a random variable, and neither the contents nor the cardinality of A may be known at compile time. Where it does not run the risk of confusion, we will often write the joint density as p(x, z) where x are observed random variables – random variables with an observe statement – and the z are latent random variables – random variables with a sample statement. Inference methods draw samples from the posterior distribution p(z|x), while queryers return a view of that posterior distribution V[p(z|x)] computed by consuming posterior samples.

There are other probabilsitic program types that are specialize this alias or are similar to it, and are useful for other purposes.

• Graph probabilistic programs: sometimes, particularly for programs with static structure or a priori finite size, we might want to do inference using an explicit graph data structure. This could enable us to write lower-variance inference algorithms or use message passing methods, for example. Graph probabilsitic programs look similar to ordinary ones: template<typename I, typename O, typename... Ts> using gpp_t = std::function<O(gr_pair<Ts...>&, I)>; where the data structure gr_pair combines a record with a directed graph data structure. To sample and observe into a graph data structure requires slightly different syntax – sample now looks like cpp auto x = sample_g<Normal>(gr, "x", rng)(loc, scale); while observing a value now looks like cpp observe_g<Normal>(gr, "value", value)(loc, scale); The sample_g and observe_g calls construct an intermediate data structure that records the distribution choice, parent/child relationships, and posisble values to observe against the distribution in a graph data structure, while the second call defines the parameters of that constructed distribution.
• Updatable probabilistic programs: a filtering process defines a sequence of densities or mass functions that update over time as a function of streaming observed data. Particle filtering methods are subject to shortcomings such as particle starvation, where there are very few high-likelihood particles to use for subsequent filtering iterations. lppl introduces a parametric variational approach to filtering that is memory-efficient and cannot suffer from particle starvation (since it does not use particles to represent posterior distributions). This approach, which is generic over inference algorithm choice, introduces two new primitive inference operations: a filtering policy and an update policy. Given a sequence of joint densities {pn(x, z)}n ∈ {1, 2, ...} that factor as pn(x, z) = p(x|z)qn(z; θn), a filtering policy P is a functional qn = P[qn − 1], while an update policy U computes θn = U(V[pn − 1(z|x)]). Updatable probabilistic programs, defined template< template<typename> class Policy, typename I, typename O, typename... Ts > using upp_t = pp_t<typed_map<Policy, I, Ts...>&, O, Ts...>; are used in variational filtering applications. The Policy template class, corresponding to the filtering policy P, specifies what the family of the variational posterior / next iteration’s prior should be, A typed_map is an associative data structure that holds the distribution to be sampled from at each non-observed address. Similar to graph probabilistic programs, updatable probabilsitic programs have their own sample and observe syntax, cpp auto x = sample_u<Normal>(r, "x", input, rng); and cpp observe_u<double>(r, "obs", Normal(loc, scale), input); respectively. Here sample_u translates to “look in the input (a typed_map) for the distribution at the site "x", and sample from that; the initial distribution family for this site is Normal”. The distribution template parameter is necessary, as some policies will change the type of distribution used at a site during inference. For example, a NormalPolicy will represent every distribution whose output type is double by a Normal distribution, automatically backtransformign sampled values into the correct domain (e.g., from ( − ∞, ∞) to (0, ∞) if the original distribution were Gamma). There’s an example of putting an inference algorithm, queryer, filtering policy, and update policy all together further down.

Examples

There’s a repository of examples that showcases some of lppl’s capabilities. The examples are biased toward use cases that I think might be well-suited for the library, such as time series filtering, forecasting, and nowcasting, and open-universe modeling (e.g., clustering with an unknown number of clusters). Here’s a dive into two of the examples, with more to discover in the repo.

pipe-filter-forecast

This example constructs two related models – one for filtering a time series, the other for forecasting future values – and performs inference, variational updates, and forecasting tasks. The filtering model f(xt, zt) is a simple normal filtering model:
f(xt, μt, τt) = Normal(xt|μt, τt − 1/2)q(μt)q(τt),
where q(μt) = Normal(μt|μ0, σ0) and q(τt) = Gamma(k, θ) initially. This is implemented as an updatable probabilistic program:

upp_t<DefaultPolicy, double, double, Normal, Gamma> f = [](
record_t<DTypes<Normal, Gamma>>& r,
typed_map<DefaultPolicy, double, Normal, Gamma>& input
) {
auto loc = sample_u<Normal>(r, "loc", input, rng);
auto precision = sample_u<Gamma>(r, "precision", input, rng);
return observe_u<double>(r, "obs", Normal(loc, 1.0 / std::sqrt(precision)), input);
};

The forecasting model g(zt, zt − 1) uses the current estimate of latent state to construct a linear forecast of observed state:
g(xt, zt − 1, μt, τt) = Normal(xt|zt − 1 + μtdt, τt − 1/2)q(μt)q(τt),
with the same initial distributions for q(μt) and q(τt) as in f. This is also implemented as an updateable probabilistic program:

upp_t<DefaultPolicy, double, double, Normal, Gamma> g = [](
record_t<DTypes<Normal, Gamma>>& r,
typed_map<DefaultPolicy, double, Normal, Gamma>& input
){
auto dt = input.template extract_value<double>("dt");
auto mu = sample_u<Normal>(r, "mu", input, rng);
auto precision = sample_u<Gamma>(r, "precision", input, rng);
auto x_prev = sample_u<Normal>(r, "x_prev", input, rng);
return observe_u<double>(r, "x", Normal(x_prev + dt * mu, std::sqrt(dt) / std::sqrt(precision)), input);
};

Queryers and inference / filtering algorithms are automatically derived:

// queryer -- O(1) query computing sufficient statistics
auto q = ProductGenerator<WeightedMeanStd, std::pair<double, double>, double, Normal, Gamma>::generate({"loc", "precision"});
auto q_forecast = ProductGenerator<WeightedMeanStd, std::pair<double, double>, double, Normal, Gamma>::generate({"mu", "precision"});
auto opts = inf_options_t(10000);

// inference algorithm
auto infer = inference<LikelihoodWeighting>(f, q, opts);
auto infer_forecast = inference<LikelihoodWeighting>(g, q_forecast, opts);

// automatically derive a filtering algorithm
auto algo = filter<ParameterMatching, DefaultPolicy>(f, q, infer);
auto algo_forecast = filter<ParameterMatching, DefaultPolicy>(g, q_forecast, infer_forecast);

The filtering algorithm derived here is similar to a standard particle filter using the prior as a proposal distribution, but differs in some key ways: + The posterior is never built up in memory but instead generated one sample at a time + Each sample is used to incrementally compute a streaming mean and standard deviation of each latent variable site + Analytical prior distributions are used at the subsequent timestep (as opposed to using a weighted collection of particles). The DefaultPolicy mapping is used, meaning that the functional form of the prior distribution will be the same at the subsequent timestep.

During inference, the posterior distribution of μt from f is used as the input distribution for zt − 1 in g, which is how the models are connected:

auto x_prev_dist = m.template extract<Normal>("loc");
m_forecast.insert_or_assign("x_prev", x_prev_dist);

We use the tiny test dataset doubles.stream:

cat ../data/doubles.stream
1.0
1.1
2.0
2.5
3.0
3.1
3.125
3.5
3.2
2.6
2.5
2.0

The algorithm executes 10,000 iterations of importance sampling, posterior computations, and posterior -> prior updating for each of the 12 datapoints in about 2.3 seconds total while using 4MB RAM:

cat ../data/doubles.stream | /usr/bin/time -v ./pipe-filter-forecast > ../data/pipe-filter-forecast.csv
Command being timed: "./pipe-filter-forecast"
User time (seconds): 2.31
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.32
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 151
Voluntary context switches: 2
Involuntary context switches: 9
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

We can see the result of this (yes, I admit, visualized using Python…) On the top chart, the blue curve are the actual “observed” (come on, play along) datapoints. The grey probability distributions are distributions of latent state, while the red lines are draws from the posterior forecast distribution over the next observed datapoint.

pipe-clustering

This demonstrates implementation of and inference over a canonical weak-open-universe clustering model. In words, the model a priori doesn’t bound the number of clusters, but instead draws the number from a probability distribution defined over the positive integers. Given the number of clusters, the structure of the rest of the program is known at compile time. Implemented, it looks like this:

template<typename RNG>
int clustering(cluster_t& r, data_t& data, RNG& rng) {

auto num_clusters = sample(r, "num_clusters - 1", Poisson(1), rng) + 1;

// cluster parameterization
std::vector<Normal> dists;
for (int ix = 0; ix != num_clusters; ++ix) {
auto loc = sample(r, "cluster/" + std::to_string(ix) + "/loc", Normal(0.0, 4.0), rng);
auto scale = sample(r, "cluster/" + std::to_string(ix) + "/scale", Gamma(2.0, 2.0), rng);
dists.push_back(Normal(loc, scale));
}

// data point assignment
int ix = 0;
for (const auto& d : data) {
auto weights = make_weights_l2(d, dists);
auto this_cluster = sample(r, "data/" + std::to_string(ix) + "/cluster", Categorical(weights), rng);
observe(r, "obs/" + std::to_string(ix), dists[this_cluster], d);
++ix;
}
return num_clusters;
}

There are lots of possible ways to assign points to clusters. This model uses make_weights_l2 to assign the point to a random cluster according to its score under the analytical cluster distribution,
$$\pi_{n,m} \propto \exp\left\{-\frac{1}{2}\left( \frac{x_n - \mu_m}{\sigma_m}\right)^2\right\}.$$
We can make inference over this model interesting. The obvious way to do it is just to run some sort of importance sampling or MCMC algorithm over it, but this could lead to inaccurate and high-variance estimates due to the discrete latent random variable all the way upstream in the computation graph. In Stan, you’d handle this by manually enumerating out this random variable (e.g., see the Stan changepoint detection example). In Pyro’s infinite awesomeness, you’d breezily ask Pyro to get rid of that pesky latent random variable for you by passing {"config": "enumerate"} as an inference argument. lppl takes a middle approach, making interfaces flexible enough for you to implement your own automatic enumeration inference algorithm generator. This whole story is in infer_enumerate.hpp, but the core of the idea is contained in the call method of the generated inference algorithm and should be fairly self-explanatory:

VType operator()(I& input) {
VType out;
// could convert to embarassingly parallel algo
for (int ix = low; ix != high; ++ix) {
auto cond_f = condition<D>(this->f, address, ix);
auto infer = Inference<A, I, O, V, Q, Ts...>(cond_f, this->queryer, this->opts);
if constexpr (has_clear<decltype(this->queryer)>::value) {
this->queryer.clear();
}
out[ix] = infer(input);
}
return out;
}

The enumeration functionality we define can automatically enumerate any type of discrete latent random variable. Here, we use enumeration in a maximum a posteriori inference problem, using Metropolis-Hastings to compute the MAP for each possible number of clusters in the reduced search space {1, ..., 7}.

auto q = Optimizer<double, int, Normal, Gamma, Categorical, Poisson>(
[](cluster_t& r, int&, double&) { return logprob(r); },
[](cluster_t& r, int&, double&) { return logprob(r); }
);
...
auto enum_alg = enumerate<AncestorMetropolis, Poisson>(f, q, opts);
auto infer = enum_alg("num_clusters - 1", 0, 6);

Using the tinyclusters.stream data set,

cat ../data/clusters.stream
1.1
1.2
1.14
0.85
0.9
3.1
3.3
3.2
3.1
3.0
2.9

inference results look like

cat ../data/clusters.stream | /usr/bin/time -v ./pipe-clustering
0: -21.3459
1: -16.5283
2: -28.0056
3: -29.4129
4: -39.4436
5: -54.8752
Command being timed: "./pipe-clustering"
User time (seconds): 6.05
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.06
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4224
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 163
Voluntary context switches: 1
Involuntary context switches: 110
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Exit status: 0