In [1]:
using Pkg
Pkg.activate(".")
Pkg.add("Flux")
Pkg.add("Plots")
Pkg.add("LaTeXStrings")
Pkg.add("Distributions")
using Flux
using Plots
using LaTeXStrings
using Random
using Distributions

Random.seed!(2021);
 Activating environment at `~/Documents/teaching/2021-spring-ds2/scratch/Project.toml`
   Updating registry at `~/.julia/registries/General`
  Resolving package versions...
No Changes to `~/Documents/teaching/2021-spring-ds2/scratch/Project.toml`
No Changes to `~/Documents/teaching/2021-spring-ds2/scratch/Manifest.toml`

Continuous optimization, pt. 2

From last time -- we had begun to dig into solving problems that looked like

$$ \min_\theta f(\theta),\ \theta \in \mathbb R^D $$

or

$$ \min_\theta f(\theta) \text{ subject to } g_i(\theta) = 0,\ \theta \in \mathbb R^D,\ i = 1,...,I $$

We saw that one way to (try to) do this is to solve a system of nonlinear equations: either

$$ \nabla_\theta f(\theta) = 0 $$

in the first case, or (after introducing some multipliers, about which you've hopefully read the theory) something like

$$ \nabla_\theta f(\theta) + \sum_i \lambda_i \nabla_\theta g_i(\theta) = 0, $$

where each $\lambda_i \in \mathbb R$.
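
To make this concrete, here's a quick numerical sanity check (illustrative only -- the names `f_toy`, `g_toy`, and the problem itself are made up for this note, not part of the lecture code): minimize $f(\theta) = \theta_1^2 + \theta_2^2$ subject to $\theta_1 + \theta_2 - 1 = 0$, whose solution is $\theta^* = (1/2, 1/2)$ with multiplier $\lambda = -1$.

# illustrative sanity check of the stationarity condition
# minimize f(θ) = θ₁² + θ₂² subject to g(θ) = θ₁ + θ₂ - 1 = 0,
# which has solution θ* = (1/2, 1/2) with multiplier λ = -1
f_toy(θ) = θ[1]^2 + θ[2]^2
g_toy(θ) = θ[1] + θ[2] - 1

θ_star = [0.5, 0.5]
λ = -1.0

# ∇f(θ*) + λ ∇g(θ*) should be (numerically) zero, and the constraint should be satisfied
gradient(f_toy, θ_star)[1] .+ λ .* gradient(g_toy, θ_star)[1], g_toy(θ_star)

Both the stationarity condition and the constraint come out (numerically) zero, as the theory says they should.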

This is all well and good, but we know that these equations are, in all but the simplest cases, intractable. How can we go about solving them approximately?

One way is via an iterative scheme. The idea is this: if we can move $\theta$ in the right direction -- but just a little bit -- over a probably-large number of iterations, maybe it will converge to the $\theta^*$ that solves the above problems.

When formalized, this intuition is called gradient descent. We find a sequence of $\theta_n$ that hopefully converge to $\theta^*$ as $n \rightarrow \infty$ by stepping a small amount proportional to the negative of the gradient of $f$ at each iteration $n$:

$$ \theta_{n + 1} = \theta_n - \gamma \nabla_{\theta_n} f(\theta_n). $$

The (scalar or vector) $\gamma$ is called the learning rate (or step size, or...) and controls the size of the step.

In [2]:
# a simple cubic with a local minimum on (0, 2), and its gradient via Flux (Zygote)
f(θ) = θ^3 - 2θ^2
grad_f(θ) = gradient(f, θ)[1]
p = plot()
plot!(p, f, xlim=(0.0, 2.0), ylim=(-1.5, 0.5), label=L"f")
plot!(p, grad_f, label=L"\nabla f")
plot!(p, [0.0, 2.0], [0.0, 0.0], label=L"y = 0",
    xlabel=L"\theta", ylabel="Function value")
Out[2]:

Let's see if we can find the minimum of $f$ by rolling our own (very basic!) gradient descent.

In [3]:
function grad_descent(
    grad_f,
    init_guess :: T,
    timesteps :: Int,
    gamma :: S,
    verbose :: Bool
) :: Array{T, 1} where T where S <: Real
    @assert timesteps > 1
    thetas = Array{T, 1}()
    push!(thetas, init_guess)
    theta = init_guess
    for t in 2:timesteps
        if verbose && (t % 100 == 2)
            println("On t = $t, theta = $theta")
        end
        # take one step of size gamma in the direction of steepest descent
        theta = theta - gamma * grad_f(theta)
        push!(thetas, theta)
    end
    thetas
end
Out[3]:
grad_descent (generic function with 1 method)
In [4]:
init_guess = 0.5
gamma = 0.01
timesteps = 1000

thetas = grad_descent(grad_f, init_guess, timesteps, gamma, true);
On t = 2, theta = 0.5
On t = 102, theta = 1.2954467609277882
On t = 202, theta = 1.332675030311007
On t = 302, theta = 1.3333222219329939
On t = 402, theta = 1.333333145878861
On t = 502, theta = 1.3333333301709158
On t = 602, theta = 1.3333333332799826
On t = 702, theta = 1.3333333333324338
On t = 802, theta = 1.333333333333318
On t = 902, theta = 1.3333333333333306
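
As a sanity check against the analytic answer: for $f(\theta) = \theta^3 - 2\theta^2$ we have

$$ \nabla_\theta f(\theta) = 3\theta^2 - 4\theta = \theta(3\theta - 4), $$

which vanishes at $\theta = 0$ (a local maximum, since $f''(0) = -4 < 0$) and at $\theta = 4/3 \approx 1.3333$ (a local minimum, since $f''(4/3) = 4 > 0$) -- exactly the value the iterates above are converging to.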
In [5]:
p = plot()
plot!(p, f, xlim=(0.0, 2.0), ylim=(-1.5, 0.5), label=L"f")
plot!(p, grad_f, label=L"\nabla f")
plot!(p, [0.0, 2.0], [0.0, 0.0], label=L"y = 0",
    xlabel=L"\theta", ylabel="Function value")
plot!(p, [thetas[end], thetas[end]], [-1.5, 0.5], label=L"\theta^*")
Out[5]:
In [6]:
p = plot()
plot!(p, thetas, label=L"\theta", xlabel=L"n", ylabel=L"\theta")
Out[6]:

This won't always work out so nicely -- on the next homework (Thursday) you'll get to play with some functions that aren't as kind to you as a nice, locally-convex $f$.
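
Just as a taste of what can go wrong (this is purely illustrative and not one of the homework functions), here's the same `grad_descent` applied to a quartic with two local minima; where you end up depends entirely on where you start:

# purely illustrative: a non-convex function with two local minima
h(θ) = θ^4 - 3θ^2 + θ
grad_h(θ) = gradient(h, θ)[1]

# starting on opposite sides of the central hump lands in different minima
thetas_left  = grad_descent(grad_h, -2.0, 1000, 0.01, false)
thetas_right = grad_descent(grad_h, 2.0, 1000, 0.01, false)
thetas_left[end], thetas_right[end]

One run ends up near the global minimum (roughly $\theta \approx -1.30$), the other near the shallower local minimum (roughly $\theta \approx 1.13$), even though both use the same $\gamma$ and number of steps.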

Let's make everything a little more realistic. Suppose, instead of a nice clean pretty function, we're in the machine learning trenches. We have some data $y|x \sim p(y|x)$, but we don't know what $p(y|x)$ is; all we have is a dataset $\mathcal D = \{(y_i,x_i)\}_{1 \leq i \leq N}$ of draws from $p(y|x)$. We have a model of this data, say $\hat y = f_\theta(x)$, and we'd like to find the value of the parameter $\theta$ that minimizes some loss function $\ell(\hat y, y)$ comparing our model's predictions $\hat y$ with the true data $y$.

Well, using our gradient descent equation, we have

$$ \theta_{n + 1} = \theta_n - \gamma \nabla_{\theta_n} \ell(f_{\theta_n}(x), y), $$

but this forces us to confront what exactly we mean by $\ell(f_{\theta_n}(x), y)$.

One way to compute this is to average our loss over the entire $\mathcal D$, i.e., compute

$$ \begin{aligned} \nabla_{\theta_n} \ell(f_{\theta_n}(x), y) &= \nabla_{\theta_n} E_{(y_i, x_i)\sim \mathcal D}[\ell(f_{\theta_n}(x_i), y_i)] \\ &= \nabla_{\theta_n} \frac{1}{N}\sum_{i} \ell(f_{\theta_n}(x_i), y_i) \\ &= \frac{1}{N}\sum_{i} \nabla_{\theta_n} \ell(f_{\theta_n}(x_i), y_i), \end{aligned} $$

so that we're averaging the gradient function evaluated at each data point $(y_i, x_i)$. This is great when we can do it, but unfortunately it scales linearly with the cardinality of $\mathcal D$. I.e., if it takes $\mathcal O(M)$ operations to compute $\nabla_{\theta_n} \ell(f_{\theta_n}(x_i), y_i)$, then a single gradient descent step over the entirety of $\mathcal D$ takes $\mathcal O(|\mathcal D| M)$ operations. Practically, this might not scale well at all, particularly when $|\mathcal D|$ and the number of gradient descent iterations are both large.
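
To make the linearity step in the derivation above concrete, here's a tiny illustrative check (the names `toy_x`, `toy_y`, `θ0` and the one-parameter model $f_\theta(x) = \theta x$ with squared loss are made up for this example) that the mean of the per-example gradients equals the gradient of the mean loss:

# illustrative only: mean of per-example gradients == gradient of the mean loss
toy_x = [1.0, 2.0, 3.0]
toy_y = [2.1, 3.9, 6.2]
θ0 = 0.5

# gradient of one example's squared loss, evaluated at θ
per_example_grad(θ, x, y) = gradient(t -> (t * x - y)^2, θ)[1]
mean_of_grads = sum(per_example_grad(θ0, x, y) for (x, y) in zip(toy_x, toy_y)) / length(toy_x)
grad_of_mean = gradient(t -> sum((t .* toy_x .- toy_y) .^ 2) / length(toy_x), θ0)[1]

mean_of_grads ≈ grad_of_mean  # true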

Instead, we might implement what's called stochastic gradient descent, where we sample smaller datasets (minibatches) $\mathcal D' \sim p(\cdot \mid \mathcal D)$, with each $|\mathcal D'| \ll |\mathcal D|$ fixed at some small value. The expectation is then approximated by

$$ \nabla_{\theta_n} \ell(f_{\theta_n}(x), y) \approx \nabla_{\theta_n} E_{(y_i, x_i)\sim \mathcal D'_n}[\ell(f_{\theta_n}(x_i), y_i)] $$

and a new $\mathcal D'_n$ is chosen at each iteration of the algorithm.

In [7]:
# check out how this works in practice
# we'll use both regular gradient and stochastic gradient descent on a logistic regression problem
Random.seed!(2021)
out_dim = 1
in_dim = 3
n_datapoints = 100

# first, create some fake training data
input_data = rand(in_dim, n_datapoints)
true_b = rand(out_dim)
true_W = rand(out_dim, in_dim)
output_probs = σ.(true_W * input_data .+ true_b)
output_data = [float(rand(Bernoulli(p))) for p in output_probs]

# now define our discriminative (new word, check back soon!) model
linear_model = Flux.Chain(Flux.Dense(in_dim, out_dim), p -> σ.(p))
bce_loss(x, y) = sum(Flux.crossentropy(linear_model(x), y))
Out[7]:
bce_loss (generic function with 1 method)
In [8]:
# set up gradient descent -- how we choose to send our data through is up to us
optim = Descent(0.01)  # learning rate gamma = 0.01, as above
n_iterations = 1500
verbose = 100
losses = Array{Float64, 1}()

for t in 1:n_iterations
    local loss
    grads = Flux.gradient(Flux.params(linear_model)) do  # this is an extended anonymous function in julia
        loss = bce_loss(input_data, output_data)
        loss
    end
    Flux.Optimise.update!(optim, Flux.params(linear_model), grads)
    if (t - 1) % 100 == 0
        println("On iteration $t, bce loss = $loss")
    end
    push!(losses, loss)
end
On iteration 1, bce loss = 0.6474316114187241
On iteration 101, bce loss = 0.39710098594427107
On iteration 201, bce loss = 0.2705532546341419
On iteration 301, bce loss = 0.20012248083949088
On iteration 401, bce loss = 0.15691558741033076
On iteration 501, bce loss = 0.12825543873012066
On iteration 601, bce loss = 0.108067628107965
On iteration 701, bce loss = 0.09317279726266861
On iteration 801, bce loss = 0.08177614130079747
On iteration 901, bce loss = 0.0727986303344369
On iteration 1001, bce loss = 0.06555718693882227
On iteration 1101, bce loss = 0.059600473828613755
On iteration 1201, bce loss = 0.054619304481893775
On iteration 1301, bce loss = 0.050395216569304464
On iteration 1401, bce loss = 0.04676979076117277
In [9]:
p = plot()
plot!(p, losses, label="The nicest gradient descent convergence ever",
    xlabel="Iterations", ylabel="BCE loss")
Out[9]:
In [10]:
# now try this again with the same model structure, but stochastic GD
other_linear_model = Flux.Chain(Flux.Dense(in_dim, out_dim), p -> σ.(p))
other_bce_loss(x, y) = sum(Flux.crossentropy(other_linear_model(x), y))
optim = Descent(0.01)
other_losses = Array{Float64, 1}()
batch_size = 10

for t in 1:n_iterations
    local loss
    this_ix = sample(1:n_datapoints, batch_size, replace=true)
    grads = Flux.gradient(Flux.params(other_linear_model)) do
        loss = other_bce_loss(input_data[:, this_ix], output_data[:, this_ix])
        loss
    end
    Flux.Optimise.update!(optim, Flux.params(other_linear_model), grads)
    if (t - 1) % 100 == 0
        println("On iteration $t, bce loss = $loss")
    end
    push!(other_losses, loss)
end
On iteration 1, bce loss = 0.5019648492336273
On iteration 101, bce loss = 0.3267782121896744
On iteration 201, bce loss = 0.25163538604974744
On iteration 301, bce loss = 0.12920053377747537
On iteration 401, bce loss = 0.13604533299803734
On iteration 501, bce loss = 0.12636871859431267
On iteration 601, bce loss = 0.11784218475222588
On iteration 701, bce loss = 0.09574187844991684
On iteration 801, bce loss = 0.08606717474758625
On iteration 901, bce loss = 0.08363512754440308
On iteration 1001, bce loss = 0.07590431272983551
On iteration 1101, bce loss = 0.05576060116291046
On iteration 1201, bce loss = 0.07348808608949184
On iteration 1301, bce loss = 0.03130142353475094
On iteration 1401, bce loss = 0.059724750369787215
In [11]:
p = plot()
plot!(p, other_losses, label="A little-less-nice gradient descent curve")
plot!(p, losses, label="The nicest gradient descent convergence ever",
    xlabel="Iterations", ylabel="BCE loss")
Out[11]:

By using stochastic gradient descent to approximate the average over our whole dataset (which is itself approximating an expectation over a probability distribution), we were still able to minimize the loss function, but with a lot less computational effort per gradient step.

We'll continue to explore the tradeoff between precision and computational cost throughout the course.