using Pkg
Pkg.activate(".")
Pkg.add("Flux")
Pkg.add("Plots")
Pkg.add("LaTeXStrings")
Pkg.add("Distributions")
using Flux
using Plots
using LaTeXStrings
using Random
using Distributions
Random.seed!(2021);
From last time -- we had begun to dig into solving problems that looked like
$$ \min_\theta f(\theta),\ \theta \in \mathbb R^D $$
or
$$ \min_\theta f(\theta) \text{ subject to } g_i(\theta) = 0,\ \theta \in \mathbb R^D,\ i = 1,...,I. $$
We saw that one way to (try to) do this is to solve a system of nonlinear equations: either
$$ \nabla_\theta f(\theta) = 0 $$
in the first case, or (after introducing some multipliers, about which hopefully you read the theory) like
$$ \nabla_\theta f(\theta) + \sum_i \lambda_i \nabla_\theta g_i(\theta) = 0, $$
where each $\lambda_i \in \mathbb R$.
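As a quick concrete example (a toy problem chosen purely for illustration): minimize $f(\theta) = \theta_1^2 + \theta_2^2$ subject to $g(\theta) = \theta_1 + \theta_2 - 1 = 0$. The multiplier condition plus the constraint read
$$ 2\theta_1 + \lambda = 0,\quad 2\theta_2 + \lambda = 0,\quad \theta_1 + \theta_2 = 1, $$
so $\theta_1 = \theta_2 = 1/2$ and $\lambda = -1$. That system was small enough to solve exactly; the ones we care about usually aren't.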
This is all well and good, but we know that these equations are, in all but the simplest cases, intractable. How can we go about solving them approximately?
One way is via an iterative scheme. The idea is this: if we can move $\theta$ in the right direction -- but just a little bit -- over a probably-large number of iterations, maybe it will converge to the $\theta^*$ that solves the above problems.
When formalized, this intuition is called gradient descent. We find a sequence of $\theta_n$ that hopefully converge to $\theta^*$ as $n \rightarrow \infty$ by stepping a small amount proportional to the negative of the gradient of $f$ at each iteration $n$:
$$ \theta_{n + 1} = \theta_n - \gamma \nabla_{\theta_n} f(\theta_n). $$
The (scalar or vector) $\gamma$ is called the learning rate (or step size, or...) and controls the size of the step.
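Before writing the full loop, here is a single update computed by hand for the toy function we're about to define, $f(\theta) = \theta^3 - 2\theta^2$ (a minimal sketch; the starting point and learning rate are just illustrative values):
θ = 0.5                  # current iterate θ_n
γ = 0.01                 # learning rate
∇f = 3θ^2 - 4θ           # analytic gradient of f(θ) = θ^3 - 2θ^2; equals -1.25 at θ = 0.5
θ_next = θ - γ * ∇f      # 0.5 - 0.01 * (-1.25) = 0.5125, a small step toward the minimum
The gradient is negative at $\theta = 0.5$, so the update nudges $\theta$ upward, toward the local minimum at $\theta = 4/3$.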
# toy objective and its gradient, computed via Flux's automatic differentiation
f(θ) = θ^3 - 2θ^2
grad_f(θ) = gradient(f, θ)[1]
p = plot()
plot!(p, f, xlim=(0.0, 2.0), ylim=(-1.5, 0.5), label=L"f")
plot!(p, grad_f, label=L"\nabla f")
plot!(p, [0.0, 2.0], [0.0, 0.0], label=L"y = 0",
xlabel=L"\theta", ylabel="Function value")
Let's see if we can find the minimum of $f$ by rolling our own (very basic!) gradient descent.
function grad_descent(
    grad_f,
    init_guess :: T,
    timesteps :: Int,
    gamma :: S,
    verbose :: Bool
) :: Array{T, 1} where {T, S <: Real}
    @assert timesteps > 1
    thetas = Array{T, 1}()
    push!(thetas, init_guess)
    theta = init_guess
    for t in 2:timesteps
        if verbose && (t % 100 == 2)
            println("On t = $t, theta = $theta")
        end
        # take a small step in the direction of steepest descent
        theta = theta - gamma * grad_f(theta)
        push!(thetas, theta)
    end
    thetas
end
grad_descent (generic function with 1 method)
init_guess = 0.5
gamma = 0.01
timesteps = 1000
thetas = grad_descent(grad_f, init_guess, timesteps, gamma, true);
On t = 2, theta = 0.5
On t = 102, theta = 1.2954467609277882
On t = 202, theta = 1.332675030311007
On t = 302, theta = 1.3333222219329939
On t = 402, theta = 1.333333145878861
On t = 502, theta = 1.3333333301709158
On t = 602, theta = 1.3333333332799826
On t = 702, theta = 1.3333333333324338
On t = 802, theta = 1.333333333333318
On t = 902, theta = 1.3333333333333306
p = plot()
plot!(p, f, xlim=(0.0, 2.0), ylim=(-1.5, 0.5), label=L"f")
plot!(p, grad_f, label=L"\nabla f")
plot!(p, [0.0, 2.0], [0.0, 0.0], label=L"y = 0",
xlabel=L"\theta", ylabel="Function value")
plot!(p, [thetas[end], thetas[end]], [-1.5, 0.5], label=L"\theta^*")
p = plot()
plot!(p, thetas, label=L"\theta", xlabel=L"n", ylabel=L"\theta")
This won't always work out so nicely -- on the next homework (Thursday) you'll get to play with some functions that aren't as kind to you as a nice, locally-convex $f$.
Let's make everything a little more realistic. Suppose, instead of a nice clean pretty function, we're in the machine learning trenches. We have some data $y|x \sim p(y|x)$, but we don't know what $p(y|x)$ is; instead, all we have is a dataset $\mathcal D = \{(y_i,x_i)\}_{1 \leq i \leq N}$ of draws from $p(y|x)$. We have a model of this data, say $\hat y = f_\theta(x)$, and we'd like to find the value of the parameter $\theta$ that minimizes some loss function $\ell(\hat y, y)$ of our model's predictions $\hat y$ and the true data $y$.
Well, using our gradient descent equation, we have
$$ \theta_{n + 1} = \theta_n - \gamma \nabla_{\theta_n} \ell(f_{\theta_n}(x), y), $$
but this forces us to confront what exactly we mean by $\ell(f_{\theta_n}(x), y)$.
One way to compute this is to average our loss over the entire $\mathcal D$, i.e., compute
$$ \begin{aligned} \nabla_{\theta_n} \ell(f_{\theta_n}(x), y) &= \nabla_{\theta_n} E_{(y_i, x_i)\sim \mathcal D}[\ell(f_{\theta_n}(x_i), y_i)] \\ &= \nabla_{\theta_n} \frac{1}{N}\sum_{i} \ell(f_{\theta_n}(x_i), y_i) \\ &= \frac{1}{N}\sum_{i} \nabla_{\theta_n} \ell(f_{\theta_n}(x_i), y_i), \end{aligned} $$
so that we're averaging the gradient function evaluated at each data point $(y_i, x_i)$. This is great when we can do it, but unfortunately it scales linearly with the cardinality of $\mathcal D$. I.e., if it takes $\mathcal O(M)$ operations to compute $\nabla_{\theta_n} \ell(f_{\theta_n}(x_i), y_i)$, then a single gradient descent step over the entirety of $\mathcal D$ takes $\mathcal O(|\mathcal D| M)$ operations. Practically, this might not scale well at all, particularly when $|\mathcal D|$ and the number of gradient descent iterations are both large.
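The last step above -- pulling the gradient inside the sum -- is just linearity of the gradient. Here's a tiny numerical sanity check using Flux's gradient on a made-up squared-error loss (all names and numbers below are hypothetical, purely for illustration):
# hypothetical toy setup: a parameter vector θ and 5 data points, with a squared-error loss
θ = rand(3)
X = rand(3, 5)                                   # columns are the x_i
y = rand(5)
ℓ(θ, x_i, y_i) = (x_i' * θ - y_i)^2              # per-example loss
avg_loss(θ) = sum((X' * θ .- y) .^ 2) / 5        # loss averaged over all 5 points
g_full = gradient(avg_loss, θ)[1]                                          # gradient of the averaged loss
g_avg = sum(gradient(t -> ℓ(t, X[:, i], y[i]), θ)[1] for i in 1:5) / 5     # average of the per-example gradients
isapprox(g_full, g_avg)                                                    # true, up to floating point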
Instead, we might implement what's called stochastic gradient descent, where at each iteration we sample a much smaller dataset (a minibatch) $\mathcal D' \sim p(\cdot | \mathcal D)$, whose size $|\mathcal D'| \ll |\mathcal D|$ is fixed at some small value. The expectation is then approximated by
$$ \nabla_{\theta_n} \ell(f_{\theta_n}(x), y) \approx \nabla_{\theta_n} E_{(y_i, x_i)\sim \mathcal D'_n}[\ell(f_{\theta_n}(x_i), y_i)] $$
and a new $\mathcal D'_n$ is chosen at each iteration of the algorithm.
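Continuing the toy sketch from above (same hypothetical ℓ, X, y, θ, and g_full), a minibatch gives a noisy but much cheaper estimate of that full-data gradient:
batch = rand(1:5, 2)                             # a minibatch D'_n of 2 indices, drawn with replacement
g_mini = sum(gradient(t -> ℓ(t, X[:, i], y[i]), θ)[1] for i in batch) / length(batch)
# g_mini matches g_full only in expectation; each fresh draw of the minibatch fluctuates around it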
# check out how this works in practice
# we'll use both regular gradient and stochastic gradient descent on a logistic regression problem
Random.seed!(2021)
out_dim = 1
in_dim = 3
n_datapoints = 100
# first, create some fake training data
input_data = rand(in_dim, n_datapoints)
true_b = rand(out_dim)
true_W = rand(out_dim, in_dim)
output_probs = σ.(true_W * input_data .+ true_b)
output_data = [float(rand(Bernoulli(p))) for p in output_probs]
# now define our discriminative (new word, check back soon!) model
linear_model = Flux.Chain(Flux.Dense(in_dim, out_dim), p -> σ.(p))
bce_loss(x, y) = sum(Flux.crossentropy(linear_model(x), y))
bce_loss (generic function with 1 method)
# set up gradient descent -- how we choose to send our data through is up to us
optim = Descent(0.01) # learning rate gamma = 0.01, as above
n_iterations = 1500
verbose = 100 # how often (in iterations) to print the loss
losses = Array{Float64, 1}()
for t in 1:n_iterations
local loss
grads = Flux.gradient(Flux.params(linear_model)) do # this is an extended anonymous function in julia
loss = bce_loss(input_data, output_data)
loss
end
Flux.Optimise.update!(optim, Flux.params(linear_model), grads)
if (t - 1) % verbose == 0
println("On iteration $t, bce loss = $loss")
end
push!(losses, loss)
end
On iteration 1, bce loss = 0.6474316114187241
On iteration 101, bce loss = 0.39710098594427107
On iteration 201, bce loss = 0.2705532546341419
On iteration 301, bce loss = 0.20012248083949088
On iteration 401, bce loss = 0.15691558741033076
On iteration 501, bce loss = 0.12825543873012066
On iteration 601, bce loss = 0.108067628107965
On iteration 701, bce loss = 0.09317279726266861
On iteration 801, bce loss = 0.08177614130079747
On iteration 901, bce loss = 0.0727986303344369
On iteration 1001, bce loss = 0.06555718693882227
On iteration 1101, bce loss = 0.059600473828613755
On iteration 1201, bce loss = 0.054619304481893775
On iteration 1301, bce loss = 0.050395216569304464
On iteration 1401, bce loss = 0.04676979076117277
p = plot()
plot!(p, losses, label="The nicest gradient descent convergence ever",
xlabel="Iterations", ylabel="BCE loss")
# now try this again with same model structure, but stochastic GD
other_linear_model = Flux.Chain(Flux.Dense(in_dim, out_dim), p -> σ.(p))
other_bce_loss(x, y) = sum(Flux.crossentropy(other_linear_model(x), y))
optim = Descent(0.01)
other_losses = Array{Float64, 1}()
batch_size = 10
for t in 1:n_iterations
local loss
this_ix = sample(1:n_datapoints, batch_size, replace=true) # draw a minibatch of batch_size indices, with replacement
grads = Flux.gradient(Flux.params(other_linear_model)) do
loss = other_bce_loss(input_data[:, this_ix], output_data[:, this_ix])
loss
end
Flux.Optimise.update!(optim, Flux.params(other_linear_model), grads)
if (t - 1) % verbose == 0
println("On iteration $t, bce loss = $loss")
end
push!(other_losses, loss)
end
On iteration 1, bce loss = 0.5019648492336273
On iteration 101, bce loss = 0.3267782121896744
On iteration 201, bce loss = 0.25163538604974744
On iteration 301, bce loss = 0.12920053377747537
On iteration 401, bce loss = 0.13604533299803734
On iteration 501, bce loss = 0.12636871859431267
On iteration 601, bce loss = 0.11784218475222588
On iteration 701, bce loss = 0.09574187844991684
On iteration 801, bce loss = 0.08606717474758625
On iteration 901, bce loss = 0.08363512754440308
On iteration 1001, bce loss = 0.07590431272983551
On iteration 1101, bce loss = 0.05576060116291046
On iteration 1201, bce loss = 0.07348808608949184
On iteration 1301, bce loss = 0.03130142353475094
On iteration 1401, bce loss = 0.059724750369787215
p = plot()
plot!(p, other_losses, label="A little-less-nice gradient descent curve")
plot!(p, losses, label="The nicest gradient descent convergence ever",
xlabel="Iterations", ylabel="BCE loss")
By using stochastic gradient descent -- approximating the expectation over our whole dataset, which is itself an approximation of the expectation over the underlying distribution $p(y|x)$ -- we were still able to minimize the loss function, but with far less computation per iteration.
We'll continue to explore the tradeoff between precision and computational cost throughout the course.