Probabilistic programming uses tools of programming languages to encode expressive probabilistic models of data generating processes and to perform statistical inference – inferring latent or unobserved quantities – using those models. Many models of behavior encode an a priori unbounded number of latent random variables. Such models are known as open universe models. For example, in a measurement and signature intelligence (MASINT) application, a generative model might encode the possibility of observing arbitrarily many emitters of radiation; corresponding inference problems might be to infer, based on observed evidence, how many emitters of radiation were observed, how strong the 85th percentile radiation emitter was, or the most likely direction from which the emitters originated.
Performing statistical inference over open-universe models is a challenging task. The task is made harder in an environment that imposes severe constraints on available computation. Today, a distributed network of low-powered sensor nodes is rarely, if ever, equipped with technology that enables statistical inference on the sensors themselves – let alone technology that enables reasoning over models containing arbitrarily many emitters of radiation! One reason is that such sensors are often built using very small microcontrollers that operate without many of the creature comforts of modern computing environments, such as an operating system and heap memory. Enabling complex statistical inference on such sensors could transform intelligence collection, predictive maintenance, and other fields of vital importance to defense, intelligence, and commercial operations of many types (e.g., oil and gas) in austere environments.
In this short document I discuss a few related topics that point the way toward realizing this goal.
We can (almost!) segment probabilistic programs as follows:

- Closed universe (CU) programs, in which the set of latent random variables – and hence the memory needed to represent an execution of the program – is fixed and known at compile time.
- Weak open universe (WOU) programs, in which the number of latent random variables is not known at compile time, but becomes known at runtime once finitely many previously sampled values have been drawn.
- Strong open universe (SOU) programs, in which the number of latent random variables is unbounded: whether another random variable is sampled depends on the values sampled so far, as with a stochastic stopping rule.
Examples of each of these types of programs are easy to create. In what follows, we display example probabilistic programs written in fmcs. A simple example of a CU program is a normal model:
double norm(record_t<DTypes<Normal, Gamma>>& r, double obs) {
    auto loc = sample(r, "loc", Normal(0, 5), rng);
    auto scale = sample(r, "scale", Gamma(2, 2), rng);
    return observe(r, "value", Normal(loc, scale), obs);
}
A path of random length is a simple example of a WOU program:
unsigned path(record_t<DTypes<Poisson, Normal>>& r, unsigned mean) {
    auto num = sample(r, "num", Poisson(mean), rng);
    double loc = 0.0, scale = 1.0;
    for (unsigned t = 0; t != num; ++t) {
        loc = sample(r, "pl/" + std::to_string(t), Normal(loc, scale), rng);
    }
    return num;
}
A very simple – and instructive – SOU program is the generative form of a Geometric distribution:
unsigned geom(record_t<DTypes<Categorical>>& r, double p) {
    unsigned n = 0, s = 0;
    while (s < 1) {
        s = sample(r, "choice/" + std::to_string(n), Categorical({p, 1.0 - p}), rng);
        ++n;
    }
    return n;
}
CU, WOU, and SOU programs do not cover the space of probabilistic programs. For example, imagine a program that chooses to execute geom with probability q or to execute path with probability 1 − q. We say that a program has the q-strong open universe (q-SOU) property if it can be determined at compile time that, with probability 1 − q, the subprogram it executes does not exhibit the SOU property.
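For concreteness, here is a minimal sketch of such a program, written in the same fmcs style as the examples above. It assumes that geom and path have been written against a shared record type (they are declared against narrower record types above), so treat it as illustrative rather than as compiling fmcs code:

unsigned mixed(record_t<DTypes<Categorical, Poisson, Normal>>& r, double q) {
    // With probability q take the SOU branch, otherwise the WOU branch.
    auto branch = sample(r, "branch", Categorical({q, 1.0 - q}), rng);
    if (branch == 0) {
        return geom(r, 0.5);  // SOU subprogram, executed with probability q
    }
    return path(r, 3);        // WOU subprogram, executed with probability 1 - q
}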
How likely is it for a SOU program to terminate? Obviously this question is unanswerable for an arbitrary program; however, there are simple probabilistic arguments that are applicable to many examples. Suppose that, at each address in a SOU program, the probability that the value sampled at that address terminates the program is bounded below by a constant p. The probability that the program has terminated by the time values have been sampled at n addresses is then bounded below by \(1 - (1 - p)^n\), approaching 1 at an exponential rate as n becomes large. Indeed, for there to be a non-vanishing probability of non-termination at every n, the probability of termination at the n-th sampled address must tail off extremely steeply. To make this precise, drop the constant lower bound and let p(k) denote the probability that the value sampled at the k-th address terminates the program. Supposing that Q ∈ (0, 1) is the limiting probability of termination of the SOU program, a similar argument gives the limiting probability of termination as
\[Q = 1 - \prod_{k=1}^\infty (1-p(k)),\]
by which we mean
\[
\sum_{k=1}^\infty \log (1 - p(k)) = \log(1-Q),
\]
where we are making the assumption that the infinite series converges (which requires, in particular, that p(k) < 1 for all k).
Taylor expanding the right-hand side as \(\log(1 - Q) = -\sum_{k=1}^\infty Q^k / k\) and matching the k-th terms of the two series, we find that
\[
p(k) = 1-\exp(-Q^k/k),
\]
which should be true in the limit (though, of course, this scaling behavior may be violated for any k less than some fixed constant).
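As a quick sanity check of this scaling (our own illustration, not part of fmcs or opencyan), the following standalone snippet sums \(\log(1 - p(k)) = -Q^k / k\) over the first fifty addresses for Q = 0.9 and prints the probability of termination within those addresses, which is already very close to Q:

#include <cmath>
#include <cstdio>

int main() {
    const double Q = 0.9;
    double log_survive = 0.0;  // running sum of log(1 - p(k)) = -Q^k / k
    for (unsigned k = 1; k <= 50; ++k) {
        log_survive += -std::pow(Q, k) / k;
    }
    // 1 - exp(sum) is the probability of terminating within the first 50 addresses.
    std::printf("P(terminate within 50 addresses) = %f (Q = %f)\n",
                1.0 - std::exp(log_survive), Q);
    return 0;
}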
From the analysis just presented we can guess that most q-SOU and even SOU programs that we are likely to write are extraordinarily likely to terminate rapidly. From a practical perspective, this means that we may not need to resort to effectively unbounded heap memory in order to sample from – and thus perform inference over – such programs. A practical implementation will require our solving a few related problems, all of which boil down to a fundamental question: how can we maintain a finite-size representation of a program state that could grow without bound?
We outline a methodology for performing inference over and querying SOU programs using only automatic and static allocation. This is feasible – even without arena-based coroutines – while imposing a mild constraint on the form of the query. The methodology is based on further decomposing the inference process into three parts:

- Collection: as the program executes, the values (and log probabilities) sampled at a bounded set of addresses of interest are recorded into a bounded-size collection.
- View updating: after each complete execution of the program, the collection is folded into a bounded-size running summary of the posterior view.
- Emission: whenever an estimate is needed, the view of the posterior is computed from that summary alone.
This decomposition extends the two-part decomposition introduced in fmcs. The process for drawing one sample of the view of the posterior of the program is displayed in the diagram below.
There are two restrictions placed on this process to guarantee operation using stack memory only:

- The posterior query must be computable through a running summary whose size is bounded above at compile time.
- Addresses must have a fixed size known at compile time, so that recording and comparing them never requires dynamic allocation.
In other words, the space of posterior queries is limited to those that can be computed using a sequence of query graphs \(\{G_n\}_{1 \le n \le N}\) with \(\max_n |G_n|\) bounded above at compile time, and with the following constraint. Suppose there is a mapping \(f\) such that \(V_N = f(\{G_n\}_{1 \le n \le N})\) for any \(N\), where \(V\) is the desired view of the posterior and \(V_N\) is the \(N\)-th empirical estimate of it. Then there must exist mappings \(g\) and \(h\) such that \(V_n = h(U_n)\), where \(U_n = g(U_{n-1}, G_n)\) and with \(\max_n |U_n|\) bounded above at compile time.
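As a concrete worked instance of this constraint (our own example), consider estimating the posterior mean of a scalar value recorded at a single address via likelihood weighting. If the n-th execution of the program yields the value \(x_n\) at that address with likelihood weight \(w_n\), take \(G_n = (x_n, w_n)\) and \(U_n = (S_n, W_n)\) with
\[
U_n = g(U_{n-1}, G_n) = (S_{n-1} + w_n x_n,\; W_{n-1} + w_n),
\qquad
V_n = h(U_n) = \frac{S_n}{W_n}.
\]
Both \(G_n\) and \(U_n\) have sizes fixed at compile time no matter how many samples are drawn; this is exactly the pattern that the Mean view below implements.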
opencyan
To demonstrate that performing inference over SOU programs using only static and automatic allocation is feasible, we implemented opencyan, a proof-of-concept probabilistic programming language. While definitely not intended for production use, it is fully functional and concise (299 lines of C++17 code). Interested readers can view the entire library and integration tests at the link in the previous sentence. We overview several key features of the library here.
The implementation of collection and view updating simultaneously addresses the finite-size query representation and address size restriction. Collector structs are parameterized over the address type – delegating to the user the responsibility of ensuring that addresses do not dynamically allocate – and over implementations of the collection strategy. Collector structs, by themselves, default to only enabling storage of the log probabilities of the latent random variables log p(z) and of the observed random variables log p(x|z):
template<template<typename> class Impl, typename Address>
struct Collector {
    Impl<Address> impl;
    double log_latent;
    double log_likelihood;
    Collector(Impl<Address> impl) : impl(impl), log_latent(0.0), log_likelihood(0.0) {}
    void operator()(std::optional<Address> a, Value v, Dist d) {
        impl(a, v, d);
    }
    void clear() {
        log_latent = 0.0;
        log_likelihood = 0.0;
        impl.clear();
    }
};
What is probably the simplest Impl of a Collector simply records the value at a single address:
template<typename Address>
struct GrabOne {
    Address addr;
    MaybeValue val;
    GrabOne(Address address) : addr(address), val(std::nullopt) {}
    void operator()(std::optional<Address> a, Value v, Dist d) {
        if (a.has_value()) {
            if (a.value() == addr) {
                val = v;
            }
        }
    }
    void clear() {
        val = std::nullopt;
    }
};
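As a brief usage sketch (assuming, for illustration, that addresses are plain unsigned integers, which keeps them free of dynamic allocation):

// A collector whose Impl records the value sampled at address 3.
GrabOne<unsigned> grab(3u);
Collector<GrabOne, unsigned> coll(grab);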
At this point, you may be wondering why the Impl handles a std::optional<Address> rather than an Address. opencyan treats all but a finite number of random variables as untraced randomness; when passing the result of a sampled random variable into an Impl, it may be that such randomness does not have an address. Concomitantly, there are two related sample methods – one for dealing with randomness that is addressable and one for dealing with randomness that is not addressable:
template<template<typename> class Impl, typename Address>
Value sample(Collector<Impl, Address>& coll, Address a, Dist d, rngstate * rng, MaybeValue o = std::nullopt) {
    if (!o.has_value()) {
        Value v = std::visit(
            [&rng](auto the_dist){ return Value(the_dist.sample(rng)); },
            d
        );
        coll.log_latent += std::visit(
            [&v](auto the_dist){ return the_dist.logprob(v); },
            d
        );
        coll(a, v, d);
        return v;
    } else {
        coll.log_likelihood += std::visit(
            [&o](auto the_dist){ return the_dist.logprob(o.value()); },
            d
        );
        coll(a, o.value(), d);
        return o.value();
    }
}
template<template<typename> class Impl, typename Address>
Value sample(Collector<Impl, Address>& coll, Dist d, rngstate * rng) {
    Value v = std::visit(
        [&rng](auto the_dist){ return Value(the_dist.sample(rng)); },
        d
    );
    coll.log_latent += std::visit(
        [&v](auto the_dist){ return the_dist.logprob(v); },
        d
    );
    coll(std::nullopt, v, d);
    return v;
}
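To make the calling convention concrete, here is the geometric program from earlier sketched against this collector-based API. The way Value wraps sampled values and the exact form of the Categorical alternative of Dist are assumptions on our part, so consult the opencyan source for the authoritative version:

// Geometric program, opencyan-style sketch: each addressed draw goes through
// the collector, which accumulates log p(z) as a side effect.
template<template<typename> class Impl>
unsigned geom(Collector<Impl, unsigned>& coll, double p, rngstate * rng) {
    unsigned n = 0, s = 0;
    while (s < 1) {
        Value v = sample(coll, n, Categorical({p, 1.0 - p}), rng);
        s = std::get<unsigned>(v);  // assumes Categorical draws are stored as unsigned
        ++n;
    }
    return n;
}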
This design is very similar to that of fmcs; compare the syntax of the first implementation of sample here with that of the top-level definition of sampling in fmcs:
template<typename D, typename RNG, typename... Ts>
DSType<D> sample(record_t<DTypes<Ts...>>& r, std::string address, D dist, RNG& rng) {
    if (r.interp.empty()) return _sample_with_node_interp<D, RNG, Ts...>(r, address, dist, rng); // \todo when is this empty?
    return std::visit(
        overloaded(
            [&r, &address, &dist, &rng](RecordStandard) { return _sample_with_node_interp<D, RNG, Ts...>(r, address, dist, rng); },
            [&r, &address, &dist, &rng](auto) { return _sample_with_record_interp<D, RNG, Ts...>(r, address, dist, rng); }
        ),
        r.interp.front()
    );
}
Constructing a view in opencyan is also similar in syntax and semantics to doing so in fmcs. An instance of a struct receives a collection of variables extracted from an instance of the program along with a possibly unnormalized posterior weight of the collection. After every sample from the program, it computes \(U_n = g(U_{n-1}, G_n)\); then, when .emit() is called, it computes and returns \(V_n = h(U_n)\). A very simple example of this computes the online mean of a single address:
template<template<typename> class CollImpl, typename Address, typename> struct Mean;

template<template<typename> class CollImpl, typename Address>
struct Mean<CollImpl, Address, ViewParams<>> {
    Address addr;
    double value;
    double weight;
    Mean(Address addr) : addr(addr), value(0.0), weight(0.0) {}
    void operator()(Collector<CollImpl, Address>& c, unsigned n) {
        if (addr == c.impl.addr) {
            if (c.impl.val.has_value()) {
                value += std::exp(c.log_likelihood) * std::get<double>(c.impl.val.value());
                weight += std::exp(c.log_likelihood);
            }
        }
    }
    double emit() {
        return value / weight;
    }
};
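A brief usage sketch shows how these pieces line up with the mappings g and h from before (run_program is a hypothetical stand-in for one forward execution of the user's program; see the source and integration tests for the real inference loop):

// Track the posterior mean of the value sampled at address 0.
Collector<GrabOne, unsigned> coll(GrabOne<unsigned>(0u));
Mean<GrabOne, unsigned, ViewParams<>> mean(0u);
// For each of the N forward executions:
//   coll.clear();
//   run_program(coll);              // hypothetical user program
//   mean(coll, n);                  // U_n = g(U_{n-1}, G_n)
// double estimate = mean.emit();    // V_N = h(U_N)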
A slightly more complicated example creates an online histogram of the maximum value of the addresses sampled from a SOU program:
template<template<typename> class CollImpl, typename AddressType, unsigned N>
struct AddressHist<CollImpl, AddressType, Dimension<N>> {
    std::array<AddressType, N> bins;
    std::array<unsigned, N> values;
    AddressHist(std::array<AddressType, N> bins) : bins(bins) {
        for (unsigned ix = 0; ix != N; ++ix) values[ix] = 0;
    }
    void operator()(Collector<CollImpl, AddressType>& c, unsigned n) {
        AddressType left = bins[0];
        AddressType right = bins[N - 1];
        AddressType mid;
        AddressType val = c.impl.addr;
        while (left < right) {
            mid = (left + right) / 2;
            if (bins[mid] < val) {
                left += 1;
            } else {
                right = mid;
            }
        }
        values[left] += 1;
    }
    std::pair<std::array<AddressType, N>, std::array<unsigned, N>> emit() {
        return std::pair(bins, values);
    }
};
For more details on how all these pieces work together, as well as the definition of the (very simple) inference method, you should look at the source code.
There are numerous shortcomings of the library. Aside from the fact that it's in no way a robust, production-ready implementation (e.g., it is awfully trusting of users of a raw pointer), its core flaw is that it doesn't easily enable implementation of inference methods that do anything other than sample the program forward. While this does enable importance sampling via likelihood weighting – which is the inference method we implemented – that's not very sample efficient. A very useful extension of this library would implement the ability to halt and resume execution of probabilistic programs, thereby enabling other forms of importance sampling, variational inference, and Markov chain Monte Carlo methods. This is possible in C++ while only automatically allocating memory by overloading operator new in the coroutine's promise type, or while using only statically allocated memory by overloading operator new to draw from a statically allocated memory arena. Another version of this library could eschew C++ altogether and re-implement the above design in Rust, whose coroutines are state machines with compile-time-known size that need not live on the heap.
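To illustrate the coroutine route, here is a minimal C++20 sketch of the general mechanism (not code from opencyan; it ignores alignment of individual frames, arena exhaustion, and reclamation entirely): the frame of any coroutine whose promise type defines operator new is carved out of a statically allocated arena rather than the heap.

#include <coroutine>
#include <cstddef>

alignas(std::max_align_t) static std::byte arena[4096];  // statically allocated storage
static std::size_t arena_offset = 0;                      // bump-pointer allocator state

struct task {
    struct promise_type {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
        // Coroutine frames for this promise type come from the arena, so a
        // halt-and-resume probabilistic program need never touch the heap.
        static void* operator new(std::size_t n) {
            void* p = arena + arena_offset;
            arena_offset += n;  // no bounds or alignment handling in this sketch
            return p;
        }
        static void operator delete(void*, std::size_t) { /* never reclaimed in this sketch */ }
    };
};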