# Keep using plate notation

I recently read an older article by Rob Zinkov called Stop using Plate Notation. The author had a few main points, which I'll do my best to summarize fairly here --- never mind, the author summarized them himself:

• Plate notation "doesn't handle complex model [sic] very well"
• It "hides too many details while in many ways not hiding enough of them"
• It is "very hard to read and understand"

These critiques are fairly general, so the author helpfully provides us with some more specific ones. He first presents a few examples of plate notation that are used to describe (please note my specific word choice "used to describe", more on this momentarily) latent Dirichlet allocation, claiming that it is not obvious that this is what the diagrams represent. He then asserts that plate notation is particularly bad at expressing "long chains of dependencies" such as those present in a hidden Markov model, giving the examples of an infinite hidden Markov model and TrueSkill architectures. He then proposes that the solution to what he perceives as a set of problems is simply to provide the generative story -- describing, either in words or in pseudocode, how draws from the joint distribution would be computed algorithmically.

The author is wrong on the first two bullet points listed above, while the third bullet point is an opinion (to which the author is, of course, entitled). Plate notation is not equivalent to a specific generative story but rather a more general one -- a point on which I will elaborate below. Plate notation is also perfectly capable of representing long chains of dependencies concisely. Whether or not we choose to do so is a question of taste.

## What is plate notation?

The fundamental issue with Zinkov's article is the apparent misconception of what a model expressed in plate notation actually represents. Let's go back to one panel of the "latent Dirichlet allocation" example that Zinkov presents, which I'll replicate here.

"Is it obvious from the above image?" Zinkov queries. Well, yeah, actually. This is pretty straightforward if we just remember basic plate notation rules: plates correspond to an independent tensor index (equivalently, a product), shaded nodes correspond to observed rvs, and clear nodes are latents. So: just by looking at the plate notation, we can easily convert to the joint likelihood:

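As a sketch of that conversion, assume the diagram uses the usual LDA node names, which are labels I'm supplying here rather than anything fixed by the figure: topic parameters $\phi_t$ in a plate of size $T$, per-document proportions $\theta_d$ in a plate of size $D$, and per-word assignments $z_{dn}$ with observed words $w_{dn}$ in a nested plate of size $N$. Each plate becomes a product and each node contributes one conditional factor (fixed hyperparameters omitted):

$$
p(w, z, \theta, \phi) = \prod_{t=1}^{T} p(\phi_t) \prod_{d=1}^{D} \left[ p(\theta_d) \prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \phi) \right]
$$
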
Great. Wait -- where are the Dirichlet distributions? I thought that this plate notation corresponded to a latent Dirichlet allocation! Wrong. It corresponds to no such thing, and therein lies Zinkov's first mistake -- thinking that plate notation corresponds to actual functional forms of models. This plate diagram says nothing about a Dirichlet (or Categorical, or any) distribution because it's not intended to. It describes the topology of a graph encoding a specific causal relationship between observed and latent variables. How those latent variables are actually distributed does not matter at all.

And this should make sense to us intuitively. Suppose we had a basic LDA model and I changed the Dirichlet distribution governing topics to a multivariate logit normal distribution. Why in the world would that change anything about the model's topology / causality structure? How would that affect the fact that there are $T$ topics and that those topics affect the $N$ words in each of the $D$ documents? Of course it wouldn't -- it is just a functional reparameterization of the exact same concept -- and so the plate notation wouldn't change either.
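
To make that concrete, here is a minimal sketch in plain NumPy rather than Pyro. The function names, the vocabulary size `V`, and the two prior samplers are hypothetical choices for illustration, and the "logit normal" draw is just a softmax of Gaussian noise, which is one simple way to land on the simplex. The plate structure is written once in `sample_corpus`; only the prior handed to it changes.

```python
import numpy as np


def dirichlet_prior(rng, n, k):
    """Draw n points on the k-simplex from a symmetric Dirichlet."""
    return rng.dirichlet(np.ones(k), size=n)


def logit_normal_prior(rng, n, k):
    """Draw n points on the k-simplex by softmaxing Gaussian noise."""
    g = rng.normal(size=(n, k))
    e = np.exp(g - g.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def sample_corpus(rng, simplex_prior, T=3, D=4, N=5, V=10):
    """Same graph whichever prior is passed in."""
    topics = simplex_prior(rng, T, V)   # topic plate: T distributions over V words
    theta = simplex_prior(rng, D, T)    # document plate: D distributions over T topics
    z = np.empty((D, N), dtype=int)
    w = np.empty((D, N), dtype=int)
    for d in range(D):                  # document plate
        for n in range(N):              # word plate, nested inside the document plate
            z[d, n] = rng.choice(T, p=theta[d])
            w[d, n] = rng.choice(V, p=topics[z[d, n]])
    return topics, theta, z, w


rng = np.random.default_rng(0)
corpus_dirichlet = sample_corpus(rng, dirichlet_prior)
corpus_logit_normal = sample_corpus(rng, logit_normal_prior)
```

Both calls produce arrays of identical shapes with identical dependencies between them, which is all the plate diagram ever claimed.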

## What about "long chains of dependencies"?

Well, what about them? Sure, sure -- the usual example here is a hidden Markov model, which is inarguably usually written as below:

First off, I will argue strenuously against the claim that this is a "bad" way of expressing a "long chain of dependencies". If you were into neural networks or compiler design, you'd call this a graphical expression of unrolling. We're making things more explicit there and sacrificing space in order to do so. But, fine, there's still no reason that we can't express this particular type of dependency more compactly:

Now we've "rolled up" the Markov dependency structure into a single recurrent edge. The double product structure fundamental to the joint density is also more explicit in this representation, but it takes a little annotation on the recurrent edge to express exactly what we mean -- that's the trade-off between complexity and space taken up on the page that I was talking about above.

If you're upset about my statement above that "plates correspond to an independent tensor index" while here I'm talking about a single-time-lag dependency, just remember that, given $z_{t-1}$, the densities of the rvs $z_t|z_{t-1}$ and $x_t | z_t$ factor in exactly the same way as those of independent rvs: as products of conditionally independent terms, $\prod_{t=1}^T p(z_t|z_{t-1})$ and $\prod_{t=1}^T p(x_t|z_t)$ respectively. So the nested plates are justified here.
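
A small sketch of that double product for a discrete HMM, in NumPy. The function name `hmm_log_joint` and the arrays `init`, `trans`, and `emit` are hypothetical, and I'm assuming an initial distribution stands in for the $t = 1$ transition factor $p(z_1 | z_0)$.

```python
import numpy as np


def hmm_log_joint(z, x, init, trans, emit):
    """log p(x, z) for a discrete HMM: one transition and one emission factor per t."""
    z = np.asarray(z)
    x = np.asarray(x)
    log_p = np.log(init[z[0]])                    # p(z_1), standing in for p(z_1 | z_0)
    log_p += np.log(trans[z[:-1], z[1:]]).sum()   # product over t of p(z_t | z_{t-1})
    log_p += np.log(emit[z, x]).sum()             # product over t of p(x_t | z_t)
    return log_p


init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],
                 [0.3, 0.7]])

log_p = hmm_log_joint(z=[0, 0, 1], x=[0, 1, 1], init=init, trans=trans, emit=emit)
```

The two `.sum()` calls are the two plates: each adds up terms that are conditionally independent once the previous state is given.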

This single-lag Markov model is actually pretty general, since we can convert any $p$-th order ($p > 1$) AR / HMM model into a first-order model by increasing the dimensionality of the state space (see Lütkepohl or any other good multivariate time series book for more on this).
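
A quick sketch of that conversion for a scalar AR(2) process, using the standard companion-matrix stacking. The helper name `companion_matrix` and the coefficient values are made up for illustration.

```python
import numpy as np


def companion_matrix(coeffs):
    """Stack AR(p) coefficients into the p x p transition matrix of a first-order model."""
    p = len(coeffs)
    A = np.zeros((p, p))
    A[0, :] = coeffs               # first row applies the AR(p) recursion
    A[1:, :-1] = np.eye(p - 1)     # remaining rows shift the lags down by one
    return A


coeffs = np.array([0.5, -0.3])     # y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + eps_t
rng = np.random.default_rng(0)
eps = rng.normal(size=50)

# Direct AR(2) recursion, started from y_{-1} = y_{-2} = 0.
y = np.zeros(len(eps) + 2)
for t in range(len(eps)):
    y[t + 2] = coeffs[0] * y[t + 1] + coeffs[1] * y[t] + eps[t]
y = y[2:]

# Equivalent first-order recursion on the stacked state s_t = (y_t, y_{t-1}).
A = companion_matrix(coeffs)
s = np.zeros(2)
y_stacked = []
for e in eps:
    s = A @ s + np.array([e, 0.0])
    y_stacked.append(s[0])
y_stacked = np.array(y_stacked)
```

The two trajectories agree, so the second-order dependency has been traded for a two-dimensional state with a single-lag edge.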

## Not xor, but and

I agree with Zinkov wholeheartedly that we should tell the "generative story", as he puts it -- i.e., that we should write down some (pseudo or real) code corresponding to the data generating process. His complaint that "we never get such a story/pseudocode in many papers that use complex graphical models" I think is spot-on. My issue is that he presents this as an alternative to graphical models; I think that this is ill-advised for two reasons.

1. They don't serve the same purpose. The types of generative stories he's talking about look like (in real code, not pseudocode):

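```python
import pyro
import pyro.distributions as dist


def linear_model(X, y=None):
    p = X.shape[-1]  # number of features
    N = X.shape[0]   # number of observations

    # Global latents: noise scale and regression coefficients.
    noise_scale = pyro.sample('noise_scale', dist.LogNormal(0., 1.))
    # to_event(1) declares the p coefficients as a single multivariate draw.
    beta = pyro.sample('beta', dist.Normal(0., 1.).expand((p,)).to_event(1))
    data_plate = pyro.plate('data_plate', N)

    # Observations are conditionally independent given beta and noise_scale.
    with data_plate as n:
        mu = pyro.deterministic('mu', X.matmul(beta))
        response = pyro.sample('response', dist.Normal(mu, noise_scale), obs=y)
```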

This is helpful. It also partially obscures the underlying simplicity of the graphical model:

Great -- we have two latent rvs, $\beta$ and $\sigma$, that affect an observed rv $y$ which is also affected by observed deterministic parameters $X$. This describes a wide variety of models, linear regression (as in the code example above) among them, and makes clear their commonalities. The code above is more specific, which is a good thing when we're trying to actually implement the models. Having both is important.
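
Reading the graph the same way as before, and assuming a single data plate of size $N$ with $x_n$ denoting the $n$-th row of $X$ (my notation, not the figure's), the factorization it encodes is:

$$
p(y, \beta, \sigma \mid X) = p(\beta)\, p(\sigma) \prod_{n=1}^{N} p(y_n \mid x_n, \beta, \sigma)
$$

Plugging Normal and LogNormal factors into this recovers the code above; leaving the factors unspecified keeps the whole class of models in view.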

2. (Pseudo)code has the potential to really, really complicate things. A simple example: the code describing a variational autoencoder in Pyro / PyTorch clocks in at around 70 lines (and that's being generous). The plates are dead simple, though:

It's easy to tell what's going on. Fundamentally, the joint density consists of a latent rv that affects an observed rv along with an unknown parameter (for which we'll presumably also solve). The variational posterior turns this idea on its head, positing a distribution for a latent rv that's affected by another unknown parameter and observed data. The precise functional forms of these parameters and the functional form of the densities aren't included because they could change -- e.g., we could swap out a convolutional neural network for an LSTM -- but that doesn't change the fundamental concept of the model.

A possible rebuttal of this example might focus on the fact that I've cited "real code" length in my argument, not pseudocode. Fine, then, how about this:

```
function guide(x)
  for n = 1,...,N
    transformed_data[n] = g(x[n], psi)
    z[n] ~ q(z | transformed_data[n])
  return z

function joint_density(x)
  for n = 1,...,N
    z[n] ~ p(z)
    x'[n] ~ p(x' | f(z[n], phi)), observed = x[n]
  return x'
```

To paraphrase Zinkov, is it obvious from the above pseudocode? Well...kind of! This code gives us exactly the same information as do the above plates. It takes up 11 lines of space and requires us to read and understand each line (as opposed to aggregating information presented in a figure). We could go back and forth with claims that the above pseudocode isn't specific enough about what f and g are, or that it doesn't describe what the densities q and p are, to which I'd reply -- right, exactly. If that information is included, the pseudocode is less general / more specific to the particular instantiation of the class of models that are topologically and semantically equivalent to the above plates. It doesn't do the same job as the plate notation anymore.

This shouldn't be an either / xor situation. In a well-written paper we should hope for a model to be expressed in three different ways, each of which highlights different information about the general class of models and that specific instantiation:

1. The graphical model / plate notation. For the reasons I've given here (namely: most general, concisely describes causal relationships) this is fundamental to the understanding of the entire class of models under study.
2. The generative story. Of course we want to know what distributions were used in actually writing the model down in code. We need these if we're going to understand, replicate, and extend the work.
3. The joint density (if possible, though actually writing this down is often prohibitively difficult in more complicated models). If someone reads the paper and doesn't have any probabilistic programming language to use, we would at least like to give them a function of which they can take the logarithm and maximize to get an MAP parameter estimate. Writing down the joint density is just being a good citizen.
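
As a rough illustration of the third item, here is a sketch of a MAP estimate read straight off a log joint density, with no probabilistic programming language involved. It assumes the linear model from earlier but with the noise scale fixed at a known `sigma` (unlike the Pyro version, which puts a LogNormal prior on it), so the maximizer has a closed form. The names `log_joint` and `map_estimate` are hypothetical.

```python
import numpy as np


def log_joint(beta, X, y, sigma=1.0):
    """log p(y, beta | X) up to a constant: Normal likelihood plus Normal(0, 1) prior."""
    resid = y - X @ beta
    return -0.5 * resid @ resid / sigma**2 - 0.5 * beta @ beta


def map_estimate(X, y, sigma=1.0):
    """Closed-form maximizer of log_joint (ridge regression with penalty sigma**2)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + sigma**2 * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)

beta_map = map_estimate(X, y)
```

With a free noise scale or a non-conjugate prior there is no closed form, but the recipe is the same: write down the log joint and hand it to whatever optimizer is available.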