In [1]:

```
using Pkg
Pkg.activate(".")
Pkg.add("Flux")
Pkg.add("Plots")
Pkg.add("LaTeXStrings")
Pkg.add("Distributions")
using Flux
using Plots
using LaTeXStrings
using Random
using Distributions
Random.seed!(2021);
```

From last time -- we had begun to dig into solving problems that looked like

$$ \min_\theta f(\theta),\ \theta \in \mathbb R^D $$

or

$$ \min_\theta f(\theta) \text{ subject to } g_i(\theta) = 0,\ \theta \in \mathbb R^D,\ i = 1,\dots,I. $$

We saw that one way to (try to) do this is to solve a system of nonlinear equations: either

$$ \nabla_\theta f(\theta) = 0 $$

in the first case, or (after introducing some multipliers, about which hopefully you read the theory) like

$$ \nabla_\theta f(\theta) + \sum_i \lambda_i \nabla_\theta g_i(\theta) = 0, $$

where each $\lambda_i \in \mathbb R$.
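For instance (a standard textbook example, not one from last time): to minimize $f(\theta) = \theta_1^2 + \theta_2^2$ subject to $g(\theta) = \theta_1 + \theta_2 - 1 = 0$, the multiplier condition reads

$$ \begin{pmatrix} 2\theta_1 \\ 2\theta_2 \end{pmatrix} + \lambda \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0, $$

which together with $g(\theta) = 0$ gives $\theta_1 = \theta_2 = 1/2$ and $\lambda = -1$.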

This is all well and good, but we know that these equations are, in all but the simplest cases, intractable. How can we go about solving them approximately?

One way is via an iterative scheme. The idea is this: if we can move $\theta$ in the right *direction* -- but just a little bit -- over a probably-large number of iterations, maybe it will converge to the $\theta^*$ that solves the above problems.

When formalized, this intuition is called *gradient descent*. We find a sequence of $\theta_n$ that hopefully converges to $\theta^*$ as $n \rightarrow \infty$ by stepping a small amount proportional to the negative of the gradient of $f$ at each iteration $n$:

$$ \theta_{n + 1} = \theta_n - \gamma \nabla_{\theta_n} f(\theta_n). $$

The (scalar or vector) $\gamma$ is called the *learning rate* (or step size, or...) and controls the size of the step.
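To make the update rule concrete before we build the fuller version below, here's a minimal pure-Julia sketch with a hand-computed gradient (the quadratic $f$ and the particular $\gamma$ are illustrative assumptions, not the example we'll actually use):

```julia
# one-dimensional gradient descent on f(θ) = (θ - 3)^2,
# whose minimizer is θ* = 3 and whose gradient is 2(θ - 3)
f(θ) = (θ - 3)^2
grad_f(θ) = 2 * (θ - 3)        # analytic gradient, no AD needed

function descend(grad_f, θ, γ, steps)
    for _ in 1:steps
        θ -= γ * grad_f(θ)     # θ_{n+1} = θ_n - γ ∇f(θ_n)
    end
    θ
end

θ_star = descend(grad_f, 0.0, 0.1, 200)   # → very close to 3.0
```

Each step contracts the error by a factor of $|1 - 2\gamma|$ here, so with $\gamma = 0.1$ the iterates converge geometrically to $\theta^* = 3$.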

In [2]:

```
f(θ) = θ^3 - 2θ^2
grad_f(θ) = gradient(f, θ)[1]
p = plot()
plot!(p, f, xlim=(0.0, 2.0), ylim=(-1.5, 0.5), label=L"f")
plot!(p, grad_f, label=L"\nabla f")
plot!(p, [0.0, 2.0], [0.0, 0.0], label=L"y = 0",
xlabel=L"\theta", ylabel="Function value")
```

Out[2]:

Let's see if we can find the minimum of $f$ by rolling our own (very basic!) gradient descent.

In [3]:

```
function grad_descent(
    grad_f,
    init_guess :: T,
    timesteps :: Int,
    gamma :: S,
    verbose :: Bool
) :: Array{T, 1} where T where S <: Real
    thetas = Array{T, 1}()
    push!(thetas, init_guess)
    @assert timesteps > 1
    theta = init_guess
    for t in 2:timesteps
        if (t % 100 == 2) && verbose
            println("On t = $t, theta = $theta")
        end
        theta = theta - gamma * grad_f(theta)
        push!(thetas, theta)
    end
    thetas
end
```

Out[3]:

grad_descent (generic function with 1 method)

In [4]:

```
init_guess = 0.5
gamma = 0.01
timesteps = 1000
thetas = grad_descent(grad_f, init_guess, timesteps, gamma, true);
```

In [5]:

```
p = plot()
plot!(p, f, xlim=(0.0, 2.0), ylim=(-1.5, 0.5), label=L"f")
plot!(p, grad_f, label=L"\nabla f")
plot!(p, [0.0, 2.0], [0.0, 0.0], label=L"y = 0",
xlabel=L"\theta", ylabel="Function value")
plot!(p, [thetas[end], thetas[end]], [-1.5, 0.5], label=L"\theta^*")
```

Out[5]:

In [6]:

```
p = plot()
plot!(p, thetas, label=L"\theta", xlabel=L"n", ylabel=L"\theta")
```

Out[6]:

This won't always work out so nicely -- on the next homework (Thursday) you'll get to play with some functions that aren't as kind to you as a nice, locally-convex $f$.

Let's make everything a little more realistic. Suppose, instead of a nice clean pretty function, we're in the machine learning trenches. We have some data $y|x \sim p(y|x)$, but we don't know what $p(y|x)$ is and instead all we have is a dataset $\mathcal D = \{(y_i,x_i)\}_{1 \leq i \leq N}$ of draws from $p(y|x)$. We have a model of this data,
say $\hat y = f_\theta(x)$, and we'd like to find the value of the parameter $\theta$ that minimizes some *loss function* $\ell(\hat y, y)$ comparing our model's predictions to the true data $y$.
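As a toy illustration of these ingredients (the scalar linear model and the squared-error loss here are assumptions for illustration, not the model we'll fit below):

```julia
# model, loss, and per-example gradient for a scalar linear model
f(θ, x) = θ * x                          # model prediction ŷ = f_θ(x)
ℓ(ŷ, y) = (ŷ - y)^2                      # squared-error loss ℓ(ŷ, y)
# by the chain rule, ∂ℓ/∂θ = 2(f_θ(x) - y) · x
grad_ℓ(θ, x, y) = 2 * (f(θ, x) - y) * x
```

With $\theta = 2$, $x = 3$, $y = 5$ we get $\hat y = 6$, $\ell = 1$, and a gradient of $6$, so a descent step would decrease $\theta$ toward the truth.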

Well, using our gradient descent equation, we have

$$ \theta_{n + 1} = \theta_n - \gamma \nabla_{\theta_n} \ell(f_{\theta_n}(x), y), $$

but this forces us to confront what exactly we mean by $\ell(f_{\theta_n}(x), y)$.

One way to compute this is to average our loss over the entire $\mathcal D$, i.e., compute

$$ \begin{aligned} \nabla_{\theta_n} \ell(f_{\theta_n}(x), y) &= \nabla_{\theta_n} E_{(y_i, x_i)\sim \mathcal D}[\ell(f_{\theta_n}(x_i), y_i)] \\ &= \nabla_{\theta_n} \frac{1}{N}\sum_{i} \ell(f_{\theta_n}(x_i), y_i) \\ &= \frac{1}{N}\sum_{i} \nabla_{\theta_n} \ell(f_{\theta_n}(x_i), y_i), \end{aligned} $$

so that we're averaging the gradient evaluated at each data point $(y_i, x_i)$. This is great when we can do it, but unfortunately it scales linearly with the cardinality of $\mathcal D$: if it takes $\mathcal O(M)$ operations to compute $\nabla_{\theta_n} \ell(f_{\theta_n}(x_i), y_i)$, then a single gradient descent step over the entirety of $\mathcal D$ takes $\mathcal O(|\mathcal D| M)$ operations. Practically, this might not scale well at all, particularly when $|\mathcal D|$ and the number of gradient descent iterations are both large.
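The full-batch average can be sketched in a few lines of plain Julia (the linear model, squared loss, and toy dataset here are assumptions for illustration):

```julia
# per-example gradient for a scalar linear model with squared-error loss
grad_ℓ(θ, x, y) = 2 * (θ * x - y) * x

xs = [1.0, 2.0, 3.0, 4.0]    # toy dataset D ...
ys = 2 .* xs                 # ... generated by the true parameter θ = 2

# one full-batch gradient: average grad_ℓ over every point in D,
# costing O(|D| · M) per descent step
full_grad(θ) = sum(grad_ℓ(θ, x, y) for (x, y) in zip(xs, ys)) / length(xs)
```

At the true parameter the averaged gradient vanishes (`full_grad(2.0) == 0.0`), and away from it the sign points back toward the minimizer.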

Instead, we might implement what's called *stochastic gradient descent*, where we sample smaller datasets $\mathcal D' \sim p(\cdot \mid \mathcal D)$, with each $|\mathcal D'| \ll |\mathcal D|$ fixed at some small value. The expectation is then approximated by

$$ \nabla_{\theta_n} \ell(f_{\theta_n}(x), y) \approx \frac{1}{|\mathcal D'_n|} \sum_{(y_i, x_i) \in \mathcal D'_n} \nabla_{\theta_n} \ell(f_{\theta_n}(x_i), y_i), $$

and a new $\mathcal D'_n$ is chosen at each iteration of the algorithm.
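A minimal sketch of the minibatch estimate, again with an assumed toy model and dataset (the Flux version follows below):

```julia
using Random
Random.seed!(0)

# per-example gradient for a scalar linear model with squared-error loss
grad_ℓ(θ, x, y) = 2 * (θ * x - y) * x

xs = collect(1.0:100.0)      # toy dataset D ...
ys = 2 .* xs                 # ... generated by the true parameter θ = 2

# one stochastic gradient: sample a small D'_n with replacement and
# average grad_ℓ over it, costing only O(batch_size · M) per step
function sgd_grad(θ, batch_size)
    ix = rand(1:length(xs), batch_size)
    sum(grad_ℓ(θ, xs[i], ys[i]) for i in ix) / batch_size
end
```

The estimate is noisy from batch to batch, but it is unbiased: its expectation over the sampled indices equals the full-batch gradient.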

In [7]:

```
# check out how this works in practice
# we'll use both regular gradient and stochastic gradient descent on a logistic regression problem
Random.seed!(2021)
out_dim = 1
in_dim = 3
n_datapoints = 100
# first, create some fake training data
input_data = rand(in_dim, n_datapoints)
true_b = rand(out_dim)
true_W = rand(out_dim, in_dim)
output_probs = σ.(true_W * input_data .+ true_b)
output_data = [float(rand(Bernoulli(p))) for p in output_probs]
# now define our discriminative (new word, check back soon!) model
linear_model = Flux.Chain(Flux.Dense(in_dim, out_dim), p -> σ.(p))
# binary cross-entropy: use binarycrossentropy, not crossentropy, since labels are 0/1
bce_loss(x, y) = sum(Flux.binarycrossentropy.(linear_model(x), y))
```

Out[7]:

bce_loss (generic function with 1 method)

In [8]:

```
# set up gradient descent -- how we choose to send our data through is up to us
optim = Descent(0.01) # learning rate gamma = 0.01, as above
n_iterations = 1500
verbose = 100
losses = Array{Float64, 1}()
for t in 1:n_iterations
local loss
grads = Flux.gradient(Flux.params(linear_model)) do # this is an extended anonymous function in julia
loss = bce_loss(input_data, output_data)
loss
end
Flux.Optimise.update!(optim, Flux.params(linear_model), grads)
if (t - 1) % 100 == 0
println("On iteration $t, bce loss = $loss")
end
push!(losses, loss)
end
```

In [9]:

```
p = plot()
plot!(p, losses, label="The nicest gradient descent convergence ever",
xlabel="Iterations", ylabel="BCE loss")
```

Out[9]:

In [10]:

```
# now try this again with the same model structure, but stochastic GD
other_linear_model = Flux.Chain(Flux.Dense(in_dim, out_dim), p -> σ.(p))
other_bce_loss(x, y) = sum(Flux.binarycrossentropy.(other_linear_model(x), y))
optim = Descent(0.01)
other_losses = Array{Float64, 1}()
batch_size = 10
for t in 1:n_iterations
    local loss
    this_ix = sample(1:n_datapoints, batch_size, replace=true)
    grads = Flux.gradient(Flux.params(other_linear_model)) do
        loss = other_bce_loss(input_data[:, this_ix], output_data[:, this_ix])
        loss
    end
    Flux.Optimise.update!(optim, Flux.params(other_linear_model), grads)
    if (t - 1) % 100 == 0
        println("On iteration $t, bce loss = $loss")
    end
    push!(other_losses, loss)
end
```

In [11]:

```
p = plot()
plot!(p, other_losses, label="A little-less-nice gradient descent curve")
plot!(p, losses, label="The nicest gradient descent convergence ever",
xlabel="Iterations", ylabel="BCE loss")
```

Out[11]:

By using stochastic gradient descent -- approximating the average over our whole dataset (which in turn approximates an expectation over a probability distribution) -- we were still able to minimize the loss function, with much less computational effort per iteration.

We'll continue to explore the tradeoff between precision and computational cost throughout the course.