lppl
lppl is a library that aims to give the ordinary C++ developer the ability to embed probabilistic code, and perform inference over that code, with extraordinarily low overhead while maintaining the expressiveness of something like Pyro. The lppl vision is some convex combination of the following scenarios: (a) many, possibly hundreds, of small probabilistic programs executing simultaneously on an ordinary commodity laptop; or (b) between one and a few (say, 10) probabilistic programs executing in a very resource-constrained environment (e.g., a dashcam or other microprocessor); and, in either scenario, doing inference over models that are useful in making complex decisions, not solving simple supervised learning problems or generating streams of text.
The core modeling API is disarmingly simple and follows textbook implementations closely. To state that a random variable x is sampled from a particular distribution (say, Normal(μ, σ)), you’d use a sample statement:
```cpp
auto x = sample(r, "x", Normal(mu, sigma), rng);
```
where `rng` is an object whose type meets the pseudorandom number engine interface. Likewise, to state that you observe a particular value that you are modeling as Normal(μ, σ), you’d use an observe statement:

```cpp
auto obs = observe(r, "value", Normal(mu, sigma), value);
```
Note that `x` and `obs` are ordinary C++ values; in this case, each would be a `double`. To construct a probabilistic program, you just write a pure C++ function that uses `sample` and `observe` statements. There’s nothing else to it. (At least, there’s nothing else to it programmatically; knowing how to construct a useful generative probabilistic model of a process is an entirely different question and wholly out of scope for this document.) Do, however, note the *pure* qualifier: the probabilistic program should act like a pure function, as it will potentially be executed many times during inference with no consideration of any side effects it might have.
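For concreteness, here is a minimal sketch of what a complete probabilistic program might look like; it matches the `pp_t` interface introduced just below, and assumes the `Normal` and `Gamma` distribution types used throughout this document plus a global `rng`, as in the snippets above:

```cpp
// Sketch: a toy generative model with one latent and one observed site.
// Assumes a global `rng` and the Normal/Gamma distribution types.
double scale_model(record_t<DTypes<Normal, Gamma>>& r, double value) {
    // latent site "scale": a Gamma prior over the observation noise scale
    auto scale = sample(r, "scale", Gamma(2.0, 2.0), rng);
    // observed site "value": score the datum under Normal(0, scale)
    return observe(r, "value", Normal(0.0, scale), value);
}
```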
Once you have a probabilistic program, you probably want to do inference over it. Inference methods do expect probabilistic programs to meet a particular API:
```cpp
template<typename I, typename O, typename... Ts>
using pp_t = std::function<O(record_t<DTypes<Ts...>>&, I)>;
```
but, of course, you can write a lambda to adapt whatever function you’ve written to the `pp_t` interface.
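For example, a model written as a function with an extra RNG parameter (like the clustering example later in this document) might be adapted with a capturing lambda; `my_model` and `rng` here are stand-in names, not library identifiers:

```cpp
// Sketch: bind the extra RNG argument so the callable matches pp_t.
pp_t<double, double, Normal, Gamma> f =
    [&rng](record_t<DTypes<Normal, Gamma>>& r, double value) {
        return my_model(r, value, rng);
    };
```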
lppl decomposes inference into two orthogonal components: (a) generating samples from the posterior probability distribution; and (b) constructing a view of the posterior. Decomposing inference into these components leads naturally to computing query results with online algorithms: first draw a sample from the posterior, then update your online view, repeat. So, to do inference, you need to specify both an inference algorithm to draw samples from the posterior and a queryer to construct a view. By default, any inference algorithm can be used with any queryer (the only time that wouldn’t be true is if you specialized a template for some reason). For example, supposing our probabilistic program were called `f`, we might write
```cpp
auto q = weighted_mean<double, Normal, Gamma>("scale");
...
auto infer = inference<GenericMetropolis>(f, *q, opts);
auto res = infer(data);
```
to (a) infer the posterior mean of a random variable named `"scale"`, (b) using an online algorithm, and (c) using Metropolis-Hastings with a user-defined proposal kernel. (The `opts` object is just a bunch of options to pass to the inference algorithm.) You can see all the inference algorithms here and queryers here. Have fun!
lppl is designed to be extraordinarily flexible and a useful tool for research in the foundations of probabilistic reasoning, not only a library for low-resource probabilistic modeling and statistical inference. lppl does not stop you from doing all sorts of things that don’t make sense from a typical statistical inference perspective; this can be seen as a feature or a bug depending on your level of comfort with probabilistic program semantics.

lppl’s interfaces are flexible, so you can implement nearly any kind of inference algorithm and querying capability you can think of in the framework, and it should play nicely with existing components. However, lppl is designed specifically around low-memory applications (not lowest-memory applications; for that see the program synthesis offshoot). Using an 𝒪(1) memory complexity queryer and expressing the probabilistic program using the filtering interface, it should be possible to do inference over arbitrarily large data streams, whether those are view-once streams sourced from physical sensors or the result of incrementally executing a database query.
The basic probabilistic program is a callable that takes in a record and an input and returns an output, as described above:
```cpp
template<typename I, typename O, typename... Ts>
using pp_t = std::function<O(record_t<DTypes<Ts...>>&, I)>;
```
An lppl probabilistic program defines a joint density $p(y)$, where $y = \{y_a\}_{a \in A}$, $A$ being the set of addresses in the program and $y_a$ the random variable at address $a$. An address uniquely identifies a random variable, and neither the contents nor the cardinality of $A$ may be known at compile time. Where it does not run the risk of confusion, we will often write the joint density as $p(x, z)$, where the $x$ are observed random variables (random variables with an `observe` statement) and the $z$ are latent random variables (random variables with a `sample` statement). Inference methods draw samples from the posterior distribution $p(z|x)$, while queryers return a view of that posterior distribution $V[p(z|x)]$ computed by consuming posterior samples.
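Spelled out (this is a restatement of the above, not additional library machinery), inference algorithms target the posterior given by Bayes’s rule, and a queryer computes a functional of it; a posterior-mean queryer, for example, corresponds to the view

$$
p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\int p(x \mid z)\, p(z)\, \mathrm{d}z}, \qquad V[p(z \mid x)] = \mathbb{E}_{p(z \mid x)}[z].
$$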
There are other probabilistic program types that specialize this alias or are similar to it, and are useful for other purposes. The first is the graph probabilistic program type:

```cpp
template<typename I, typename O, typename... Ts>
using gpp_t = std::function<O(gr_pair<Ts...>&, I)>;
```
where the data structure `gr_pair` combines a record with a directed graph data structure. Sampling and observing into a graph data structure requires slightly different syntax; a sample now looks like

```cpp
auto x = sample_g<Normal>(gr, "x", rng)(loc, scale);
```

while observing a value now looks like

```cpp
observe_g<Normal>(gr, "value", value)(loc, scale);
```

The `sample_g` and `observe_g` calls construct an intermediate data structure that records the distribution choice, parent/child relationships, and possible values to observe against the distribution in a graph data structure, while the second call (the trailing parameter list) defines the parameters of that constructed distribution.
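As a rough illustration (a sketch, not verbatim library code; it assumes the distribution types from earlier and a global `rng`), the toy scale model from the beginning of this document could be written against the graph interface like so:

```cpp
// Sketch: the toy scale model, recorded into a graph so that the
// dependency structure (scale -> value) is available to inference.
double scale_model_g(gr_pair<Normal, Gamma>& gr, double value) {
    // latent node "scale": Gamma(2, 2) prior, parameters bound by the second call
    auto scale = sample_g<Gamma>(gr, "scale", rng)(2.0, 2.0);
    // observed node "value", scored against Normal(0, scale)
    observe_g<Normal>(gr, "value", value)(0.0, scale);
    return value;
}
```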
Updatable probabilistic programs,

```cpp
template<
    template<typename> class Policy,
    typename I, typename O, typename... Ts
>
using upp_t = pp_t<typed_map<Policy, I, Ts...>&, O, Ts...>;
```

are used in variational filtering applications. The `Policy` template class, corresponding to the filtering policy P, specifies what the family of the variational posterior / next iteration’s prior should be. A `typed_map` is an associative data structure that holds the distribution to be sampled from at each non-observed address. Similar to graph probabilistic programs, updatable probabilistic programs have their own sample and observe syntax,

```cpp
auto x = sample_u<Normal>(r, "x", input, rng);
```

and

```cpp
observe_u<double>(r, "obs", Normal(loc, scale), input);
```
respectively. Here `sample_u` translates to “look in the `input` (a `typed_map`) for the distribution at the site `"x"`, and sample from that; the initial distribution family for this site is `Normal`”. The distribution template parameter is necessary, as some policies will change the type of distribution used at a site during inference. For example, a `NormalPolicy` will represent every distribution whose output type is `double` by a Normal distribution, automatically back-transforming sampled values into the correct domain (e.g., from (−∞, ∞) to (0, ∞) if the original distribution were Gamma). There’s an example of putting an inference algorithm, queryer, filtering policy, and update policy all together further down.
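Before moving on, here is a rough sketch of how the `input` map might be populated and read back; `insert_or_assign` and `extract` both appear in the filtering example below, but everything else here (default construction in particular) is an assumption:

```cpp
// Sketch: seed the typed_map with a prior for the latent site "x" ...
typed_map<DefaultPolicy, double, Normal, Gamma> input;
input.insert_or_assign("x", Normal(0.0, 1.0));
// ... and, after a filtering step, read the updated distribution back out.
auto x_dist = input.extract<Normal>("x");
```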
There’s a repository of examples that showcases some of lppl’s capabilities. The examples are biased toward use cases that I think might be well-suited for the library, such as time series filtering, forecasting, and nowcasting, and open-universe modeling (e.g., clustering with an unknown number of clusters). Here’s a dive into two of the examples, with more to discover in the repo.
pipe-filter-forecast
This example constructs two related models, one for filtering a time series, the other for forecasting future values, and performs inference, variational updates, and forecasting tasks. The filtering model $f(x_t, z_t)$, where $z_t = (\mu_t, \tau_t)$, is a simple normal filtering model:

$$
f(x_t, \mu_t, \tau_t) = \text{Normal}\left(x_t \mid \mu_t, \tau_t^{-1/2}\right) q(\mu_t)\, q(\tau_t),
$$

where $q(\mu_t) = \text{Normal}(\mu_t \mid \mu_0, \sigma_0)$ and $q(\tau_t) = \text{Gamma}(k, \theta)$ initially. This is implemented as an updatable probabilistic program:
```cpp
upp_t<DefaultPolicy, double, double, Normal, Gamma> f = [](
    record_t<DTypes<Normal, Gamma>>& r,
    typed_map<DefaultPolicy, double, Normal, Gamma>& input
) {
    auto loc = sample_u<Normal>(r, "loc", input, rng);
    auto precision = sample_u<Gamma>(r, "precision", input, rng);
    return observe_u<double>(r, "obs", Normal(loc, 1.0 / std::sqrt(precision)), input);
};
```
The forecasting model $g(x_t, z_{t-1})$ uses the current estimate of latent state to construct a linear forecast of observed state:

$$
g(x_t, z_{t-1}, \mu_t, \tau_t) = \text{Normal}\left(x_t \mid z_{t-1} + \mu_t\, dt, \tau_t^{-1/2}\right) q(\mu_t)\, q(\tau_t),
$$

with the same initial distributions for $q(\mu_t)$ and $q(\tau_t)$ as in $f$. This is also implemented as an updatable probabilistic program:
```cpp
upp_t<DefaultPolicy, double, double, Normal, Gamma> g = [](
    record_t<DTypes<Normal, Gamma>>& r,
    typed_map<DefaultPolicy, double, Normal, Gamma>& input
) {
    auto dt = input.template extract_value<double>("dt");
    auto mu = sample_u<Normal>(r, "mu", input, rng);
    auto precision = sample_u<Gamma>(r, "precision", input, rng);
    auto x_prev = sample_u<Normal>(r, "x_prev", input, rng);
    return observe_u<double>(r, "x", Normal(x_prev + dt * mu, std::sqrt(dt) / std::sqrt(precision)), input);
};
```
Queryers and inference / filtering algorithms are automatically derived:
```cpp
// queryer -- O(1) query computing sufficient statistics
auto q = ProductGenerator<WeightedMeanStd, std::pair<double, double>, double, Normal, Gamma>::generate({"loc", "precision"});
auto q_forecast = ProductGenerator<WeightedMeanStd, std::pair<double, double>, double, Normal, Gamma>::generate({"mu", "precision"});
auto opts = inf_options_t(10000);
// inference algorithm
auto infer = inference<LikelihoodWeighting>(f, q, opts);
auto infer_forecast = inference<LikelihoodWeighting>(g, q_forecast, opts);
// automatically derive a filtering algorithm
auto algo = filter<ParameterMatching, DefaultPolicy>(f, q, infer);
auto algo_forecast = filter<ParameterMatching, DefaultPolicy>(g, q_forecast, infer_forecast);
```
The filtering algorithm derived here is similar to a standard particle filter using the prior as a proposal distribution, but differs in some key ways:

+ The posterior is never built up in memory but instead generated one sample at a time.
+ Each sample is used to incrementally compute a streaming mean and standard deviation of each latent variable site (see the sketch below).
+ Analytical prior distributions are used at the subsequent timestep (as opposed to using a weighted collection of particles).

The `DefaultPolicy` mapping is used, meaning that the functional form of the prior distribution will be the same at the subsequent timestep.
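To make the second bullet concrete, here is a minimal sketch of the kind of 𝒪(1)-memory weighted streaming mean/standard-deviation update such a queryer can perform (West’s weighted incremental algorithm; illustrative only, not lppl’s actual `WeightedMeanStd` implementation):

```cpp
#include <cmath>

// Sketch: O(1)-memory weighted running mean and standard deviation.
// Not lppl's actual WeightedMeanStd; shown only to illustrate the idea.
struct RunningWeightedStats {
    double w_sum = 0.0, mean = 0.0, s = 0.0;

    void update(double x, double w) {
        w_sum += w;
        double delta = x - mean;
        mean += (w / w_sum) * delta;    // weighted mean update
        s += w * delta * (x - mean);    // weighted sum of squared deviations
    }

    double stddev() const { return std::sqrt(s / w_sum); }
};
```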
During inference, the posterior distribution of $\mu_t$ from $f$ is used as the input distribution for $z_{t-1}$ in $g$, which is how the models are connected:

```cpp
auto x_prev_dist = m.template extract<Normal>("loc");
m_forecast.insert_or_assign("x_prev", x_prev_dist);
```
We use the tiny test dataset `doubles.stream`:

```
cat ../data/doubles.stream
1.0
1.1
2.0
2.5
3.0
3.1
3.125
3.5
3.2
2.6
2.5
2.0
```
The algorithm executes 10,000 iterations of importance sampling, posterior computations, and posterior -> prior updating for each of the 12 datapoints in about 2.3 seconds total while using 4MB RAM:
```
cat ../data/doubles.stream | /usr/bin/time -v ./pipe-filter-forecast > ../data/pipe-filter-forecast.csv
Command being timed: "./pipe-filter-forecast"
User time (seconds): 2.31
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.32
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 151
Voluntary context switches: 2
Involuntary context switches: 9
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
We can see the result of this (yes, I admit, visualized using Python…). On the top chart, the blue curve shows the actual “observed” (come on, play along) datapoints. The grey probability distributions are distributions of latent state, while the red lines are draws from the posterior forecast distribution over the next observed datapoint.
pipe-clustering
This example demonstrates the implementation of, and inference over, a canonical weak-open-universe clustering model. In words, the model doesn’t bound the number of clusters a priori, but instead draws that number from a probability distribution defined over the positive integers. Given the number of clusters, the structure of the rest of the program is known at compile time. Implemented, it looks like this:
```cpp
template<typename RNG>
int clustering(cluster_t& r, data_t& data, RNG& rng) {
    auto num_clusters = sample(r, "num_clusters - 1", Poisson(1), rng) + 1;
    // cluster parameterization
    std::vector<Normal> dists;
    for (int ix = 0; ix != num_clusters; ++ix) {
        auto loc = sample(r, "cluster/" + std::to_string(ix) + "/loc", Normal(0.0, 4.0), rng);
        auto scale = sample(r, "cluster/" + std::to_string(ix) + "/scale", Gamma(2.0, 2.0), rng);
        dists.push_back(Normal(loc, scale));
    }
    // data point assignment
    int ix = 0;
    for (const auto& d : data) {
        auto weights = make_weights_l2(d, dists);
        auto this_cluster = sample(r, "data/" + std::to_string(ix) + "/cluster", Categorical(weights), rng);
        observe(r, "obs/" + std::to_string(ix), dists[this_cluster], d);
        ++ix;
    }
    return num_clusters;
}
```
There are lots of possible ways to assign points to clusters. This model uses `make_weights_l2` to assign each point to a random cluster according to its score under the analytical cluster distribution,
$$
\pi_{n,m} \propto \exp\left\{-\frac{1}{2}\left( \frac{x_n - \mu_m}{\sigma_m}\right)^2\right\}.
$$
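A minimal sketch of a weight function implementing this formula follows; the `loc()` and `scale()` accessors on `Normal` are hypothetical, and `Categorical` is assumed to tolerate unnormalized weights, so lppl’s actual `make_weights_l2` may differ:

```cpp
#include <cmath>
#include <vector>

// Sketch: unnormalized weights pi_{n,m} = exp(-0.5 * ((x_n - mu_m) / sigma_m)^2)
// for assigning data point x to each cluster m.
std::vector<double> make_weights_l2_sketch(double x, const std::vector<Normal>& dists) {
    std::vector<double> weights;
    weights.reserve(dists.size());
    for (const auto& d : dists) {
        double z = (x - d.loc()) / d.scale();  // hypothetical accessors
        weights.push_back(std::exp(-0.5 * z * z));
    }
    return weights;
}
```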
We can make inference over this model interesting. The obvious way to do it is just to run some sort of importance sampling or MCMC algorithm over the whole program, but this could lead to inaccurate and high-variance estimates due to the discrete latent random variable all the way upstream in the computation graph. In Stan, you’d handle this by manually enumerating out this random variable (e.g., see the Stan changepoint detection example). In Pyro’s infinite awesomeness, you’d breezily ask Pyro to get rid of that pesky latent random variable for you by passing `{"config": "enumerate"}` as an inference argument. lppl takes a middle approach, making interfaces flexible enough for you to implement your own automatic enumeration inference algorithm generator. This whole story is in `infer_enumerate.hpp`, but the core of the idea is contained in the call method of the generated inference algorithm and should be fairly self-explanatory:
```cpp
VType operator()(I& input) {
    VType out;
    // could convert to embarrassingly parallel algo
    for (int ix = low; ix != high; ++ix) {
        auto cond_f = condition<D>(this->f, address, ix);
        auto infer = Inference<A, I, O, V, Q, Ts...>(cond_f, this->queryer, this->opts);
        if constexpr (has_clear<decltype(this->queryer)>::value) {
            this->queryer.clear();
        }
        out[ix] = infer(input);
    }
    return out;
}
```
The enumeration functionality we define can automatically enumerate any type of discrete latent random variable. Here, we use enumeration in a maximum a posteriori inference problem, using Metropolis-Hastings to compute the MAP for each possible number of clusters in the reduced search space {1, ..., 6}.
```cpp
auto q = Optimizer<double, int, Normal, Gamma, Categorical, Poisson>(
    [](cluster_t& r, int&, double&) { return logprob(r); },
    [](cluster_t& r, int&, double&) { return logprob(r); }
);
...
auto enum_alg = enumerate<AncestorMetropolis, Poisson>(f, q, opts);
auto infer = enum_alg("num_clusters - 1", 0, 6);
```
Using the tiny `clusters.stream` data set,
```
cat ../data/clusters.stream
1.1
1.2
1.14
0.85
0.9
3.1
3.3
3.2
3.1
3.0
2.9
```
inference results look like
```
cat ../data/clusters.stream | /usr/bin/time -v ./pipe-clustering
0: -21.3459
1: -16.5283
2: -28.0056
3: -29.4129
4: -39.4436
5: -54.8752
Command being timed: "./pipe-clustering"
User time (seconds): 6.05
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.06
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4224
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 163
Voluntary context switches: 1
Involuntary context switches: 110
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
correctly inferring the (admittedly not that challenging) number of two clusters in about six seconds using about 4MB of RAM. (The printed index is the enumerated value of `"num_clusters - 1"`, so the maximum log probability at index 1 corresponds to two clusters.)