I think (editorial alert) that it's relatively common for data scientists to use tools that have theoretical underpinnings with which they aren't really familiar. I think I might start a semi-regular series of posts concerning some theoretical backgrounds behind different algorithms and tools commonly used by data scientists. I'll start with this post about functional optimization and how it relates to some common statistical learning algorithms.

Functional optimization really comes from analytical mechanics, and that's where I'll start this post. Before we can see how these techniques of functional optimization apply to machine learning, we have to solve one of the basic problems of analytical mechanics --- how can we find a function to extremize an integral?

As abstract as this problem sounds, it's really quite applicable. If $x(t)$ is a time-dependent trajectory of a system, maybe there's a cost $C(x, \dot{x})$ associated with where this trajectory is in phase space. You want to minimize this total cost added up over time, i.e., you want to find $x(t)$ such that $\int dt\ C(x(t), \dot{x}(t))$ is minimized. This is a general formulation of a problem that's very common in macroeconomics, ecology, physics, and other fields.

So, for the general setup: we want to find a path $x(t)$ that minimizes the integral

$$ J = \int dt\ C(t, x(t), \dot{x}(t)). $$Even though I write the integral with $t$, evocative of time, this doeesn't have to be the case --- actually, we will demonstrate a non-temporal example shortly. How can we go about minimizing this integral? I will derive the definition of the optimal path using the more physics-y derivation. It's similar to the one on wikipedia but (hopefully) a little more clear. If we assume lots of things about $C(t, x, \dot{x})$ and about $x(t)$ itself (technicalities that aren't worth getting into here), then we can approximate the integral by the finite sum

$$ J_N = \sum_{n=1}^{N} \delta t\ C\left(t_n, x_n, \frac{x_{n + 1} - x_n}{\delta t}\right), $$where $x_n \equiv x(t_n)$ and so on. This is just an ordinary function of $N + 1$ variables. To find the $x_n$ that extremizes $J_N$, we just have to do what comes naturally (differentiate and set equal to zero):

\begin{align} \frac{\partial J_N}{\partial x_k} &= \sum_{n=1}^{N} \delta t\ \frac{\partial}{\partial x_k} C\left(t_n, x_n, \frac{x_{n + 1} - x_n}{\delta t}\right) = 0. \end{align}There are two terms in this sum that contain $x_k$, thanks to the first derivative term $x_n' \equiv \frac{x_{n + 1} - x_n}{\delta t}$. Extracting these, the derivative simplifies to

\begin{align} 0 = \frac{\partial J_N}{\partial x_k} &= \color{red}{\delta t\ \frac{\partial}{\partial x_k} C\Big(t_{k-1}, x_{k-1}, \underbrace{\frac{x_{k} - x_{k-1}}{\delta t}}_{x_{k-1}'}\Big)} + \color{blue}{\delta t\ \frac{\partial}{\partial x_k} C\Big(t_{k}, x_{k}, \underbrace{\frac{x_{k + 1} - x_{k}}{\delta t}}_{x_{k}'}\Big)} \\ &= \color{red}{\delta t\frac{\partial C}{\partial x_{k - 1}'}\frac{\partial x_{k-1'}}{\partial x_k}} + \color{blue}{\delta t \frac{\partial C}{\partial x_k} + \delta t\frac{\partial C}{\partial x_k'}\frac{\partial x_k'}{\partial x_k}} \\ &= \color{red}{\delta t\frac{\partial C}{\partial x_{k - 1}'} \times \frac{1}{\delta t}} + \color{blue}{\delta t \frac{\partial C}{\partial x_k} + \delta t\frac{\partial C}{\partial x_k'} \times - \frac{1}{\delta t}}. \end{align}Cancelling common terms of $\delta t$ and then dividing both sides by this quantity, we have

$$ 0 = \frac{1}{\delta t}\frac{\partial J_N}{\partial x_k} = \frac{\partial C}{\partial x_k} - \frac{1}{\delta t} \Big( \frac{\partial C}{\partial x_k'} - \frac{\partial C}{\partial x_{k - 1}'}\Big). $$Upon taking $\delta t \rightarrow 0$, the left-hand side of this equation becomes a pretty ODE,

\begin{equation} \frac{\partial C}{\partial x(t)} - \frac{d}{dt}\frac{\partial C}{\partial \dot{x}(t)}. \qquad (*) \end{equation}This is called the Euler-Lagrange (EL) equation. Specifically, this is the EL equation for a univariate function of a single variable --- but generalizations are really easy using the above derivation. See if you can figure out the generalization for a) a multivariate function of a single variable (e.g., a vector system of ordinary differential equations), and b) a univariate function of multiple variables $f(x_1,..,x_n)$. So you can check yourself: the answer for (b) is a nonlinear PDE,

$$ \frac{\partial C}{\partial f} - \nabla \cdot \frac{\partial C}{\partial \nabla f} = 0. \qquad (**) $$Before we move on, I'll just note that this general technique --- taking a continuum problem (an infinite-dimensional problem) and discretizing it so that we can use our basic multivariate calculus tools --- is an invaluable technique that makes lots of stuff easier, not just functional optimization. For example, calculations of the properties of various stochastic processes often become much easier when you approach them this way.

Okay, fine. What if we have constraints on our path $x(t)$? For example, maybe there's a restriction on the maximum distance traveled like $\int dt\ x(t)^2 \leq D$. In general, the problem we want to solve now is given by

$$ \min_{x(t)} \int dt \ C(t, x(t), \dot{x}(t)) \quad \text{subject to } \int dt\ g(t, x(t)) \leq D. $$Let's use the same trick that we used above again. If this were just a discrete optimization problem like

$$ \min_{x_k} \sum_{n=1}^N \delta t\ C(t_n, x_n, x_n') \quad \text{subject to }\quad \sum_{n=1}^N \delta t\ g(t_n, x_n) \leq D, $$we'd know exactly what to do: introduce a multiplier $\lambda$ and instead minimize the unconstrained function

$$ J_N = \sum_{n=1}^N \delta t\ \left[ C(t_n, x_n, x_n') - \lambda g(t_n, x_n) \right] $$in exactly the same fashion that we did above. And that's exactly what we'll do here and then just pass to the $\delta t \rightarrow 0$ limit. The resulting EL equation is

$$ \frac{\partial C}{\partial x(t)} - \frac{d}{dt}\frac{\partial C}{\partial \dot{x}(t)} - \lambda \frac{\partial g}{\partial x(t)} = 0. $$Cool.

I promised applications to data science-y things, so I'll try not to disappoint. (No guarantees.) We'll look at two scenarios: first, the generalized supervised learning problem; and second, a time series smoothing problem.

The general setup for supervised learning is as follows: given observed data $x$ and a response variable $y$, find a function $f(x)$ that best explains why. In other words, we want to solve

$$ \min_{f(x)} \int dy\ dx\ p(x, y) L(y, f(x)), $$where $L$ is some loss function. There are lots of reasonable choices for $L$ but we'll choose the quadratic loss, $L(a, b) = \frac{1}{2}(a - b)^\dagger (a - b)$, for both theoretical reasons (makes *lots* of sense if we think that errors after modelling should be normally distributed) and computational reasons. There isn't any derivative term in this loss function, so we'll essentially just be taking the derivative of what's inside the integral with respect to $f(x)$:

Rearranging things, we see that

$$ \int dy\ p(x, y) y = \int dy\ p(x, y) f(x) = f(x) p(x), $$since we just marginalized out the distribution of the response variable $y$. Now we see that the optimal $f(x)$ is given by

$$ f(x) = \frac{1}{p(x)}\int dy\ p(x, y)\ y \equiv E_{y \sim p(y|x)}[y]. $$Well, duh --- this whole procedure told us what we would have known if we'd just stepped back and thought about it. What's the best $f(x)$ to approximate $y$ under the quadratic loss? Well, the mean of $y$ conditioned on the fact that we observed $x$, of course. But it's reassuring to see this result come out of our formal optimization problem, and it's actually this exact solution that simple algorithms like $k$-nearest neighbors try to replicate as directly as possible.

Also, N.B.: I think this exact example, or something very close to it, is in Elements of Statistical Learning but I don't have my copy in front of me (it's actually in another state currently...) and so I can't check.

Here's an example that's a little more involved. Suppose we observe a time series $x_{1:T} \equiv (x_1, x_2, ..., x_T)$ but it's noisy, so we want to smooth it a little. What's an optimal way to do that? This is obviously a rather open-ended question, but we can impose a particular functional form on it. We'll minimize the MSE of our smoothed series $s_{1:T} \equiv (s_1, s_2, ..., s_T)$ from $x_{1:T}$ while also enforcing that it's not too far from the mean of the distribution of $x_{1:T}$, which I'll denote by $\mu = \int dx_{1:T}\ p(x_{1:T})\ x_{1:T}$.
The density $p(x_{1:T})$ is the probability density of time series *paths* $x_{1:T}$ --- we'll leave this alone for now, but how one derives this pathwise density for a variety of models is a whole other discussion that I might cover in a later post.
Anyway, in other words, we want to solve

Bang zoom, we use our Langrange multiplier trick from above, instead minimizing

$$ \int d x_{1:T}\ \left[ p(x_{1:T})\ (x_{1:T} - s_{1:T}(x_{1:T}))^\dagger (x_{1:T} - s_{1:T}(x_{1:T})) - \lambda (s_{1:T}(x_{1:t}) - \mu )^\dagger (s_{1:T}(x_{1:t}) - \mu ) \right]. $$After doing the differentiating and rearranging everything (while hopefully not messing up any algebra), we find that, in the optimum,

$$ s_{1:T}(x_{1:T}) = \frac{p(x_{1:T})x_{1:T} + \lambda \mu}{p(x_{1:T}) + \lambda} $$Let's look at some limits to see if this makes sense. If $\lambda \rightarrow 0$ then $s_{1:T}$ just becomes increasingly un-smooth, eventually equalling $x_{1:T}$ when $\lambda = 0$. Conversely, as $\lambda \rightarrow \infty$, the smoothed time series becomes closer and closer to the mean timeseries $\mu$. This solution also controls for outlier paths: if $x_{1:T}$ is very rare under the density $p(x_{1:T})$, the smoothed version will be far closer to $\mu$ than if $x_{1:T}$ has higher probability density. This solution very much describes what is going on in Gaussian process smoothing when we assume a multivariate normal model and $p(x_{1:T})$ is the posterior predictive distribution of the timeseries --- check out the final figures in that linked notebook and compare with the limiting description above.

You could imagine lots of other smoothing algorithms. For example, we could try to minimize the squared gradient of the smoothing curve subject to the restriction that the smoothing curve is still a decent approximation to the original time series:

$$ \min_{s_{1:T}(x_{1:T})} \int dx_{1:T} |\nabla s_{1:T}(x_{1:T})|^2 \quad \text{subject to } \quad \int dx_{1:T}\ p(x_{1:T})\ (x_{1:T} - s_{1:T}(x_{1:T}))^\dagger (x_{1:T} - s_{1:T}(x_{1:T})) \leq D. $$Translated, this objective function is saying that we don't want our smoothing approximation to change too rapidly when we change the value of $x_{1:T}$ a little bit. Following the generalization of $(**)$ above (that you should figure out), the resulting equation for $s_{1:T}$ is

$$ \nabla^2 s_{1:T}(x_{1:T}) = -\lambda p(x_{1:T}) [x_{1:T} - s_{1:T}(x_{1:T})]. $$This is a $T$-dimensional inhomogeneous Hemholtz equation which needs to be solved with appropriate boundary conditions. (For the PDE people out there: my taking the Laplacian of the vector $s_{1:T}$ is just the element-wise Laplacian.)

That's all for now. I might upload a youtube video or something explaining this a little bit more and hope to continue on with this series of posts.