Um, What Is This?
It's a fork of TensorFlow Playground with additional features.
Deep learning research is famously empirical. But traditionally, getting good intuitions has required a cluster of servers with which to try out a wide variety of hyperparameters. Visualization lets you quickly get a sense for how a neural net behaves, which speeds up feedback loops in research.
What did you add to the original Playground?
First, I changed the loss function for classification from the squared loss (used by the original Playground developers) to log loss. This change isn't visible in the interface. But it's essential for building intuition, since log loss is what's used for classification in the real world.
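To make the difference concrete, here is a minimal sketch of the two losses for a single example. It assumes the network output has been mapped to a probability p in (0, 1) and that labels are 0 or 1; the function names and the 0.5 factor in the squared loss are illustrative choices, not taken from the Playground source.

```python
import math

def squared_loss(p: float, y: int) -> float:
    """Squared error between predicted probability p and label y in {0, 1}."""
    return 0.5 * (p - y) ** 2

def log_loss(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy (log loss); eps keeps p away from 0 and 1 so log() is defined."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# The true label is 1 in every case; only the confidence of the prediction changes.
for p in (0.9, 0.5, 0.01):
    print(f"p={p:<5} squared={squared_loss(p, 1):.4f}  log={log_loss(p, 1):.4f}")
```

With these definitions the squared loss can never exceed 0.5, while the log loss grows without bound as a prediction becomes confidently wrong, so the two losses penalize confident mistakes very differently.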
The remaining changes are a mixture of standard deep learning techniques not present in the original and research ideas of my own that I wanted to explore through visualization.
Activation: I added several activation functions not present in the original Playground: Leaky ReLU (leak parameter = 0.01), ELU (𝛼 = 1), "Swish" (𝛽 = 1; equivalent to the sigmoid-weighted linear unit), and SoftPlus.
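For reference, the four activations can be written as scalar functions using their standard textbook definitions and the parameter values listed above. This is a sketch for illustration, not the Playground's own code; the function names are my own.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def leaky_relu(x: float, leak: float = 0.01) -> float:
    """Identity for positive inputs, a small slope `leak` for negative inputs."""
    return x if x > 0 else leak * x

def elu(x: float, alpha: float = 1.0) -> float:
    """Identity for positive inputs, alpha * (e^x - 1) for negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def swish(x: float, beta: float = 1.0) -> float:
    """x * sigmoid(beta * x); with beta = 1 this is the sigmoid-weighted linear unit."""
    return x * sigmoid(beta * x)

def softplus(x: float) -> float:
    """log(1 + e^x), computed as max(x, 0) + log1p(e^-|x|) for numerical stability."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for x in (-2.0, 0.0, 2.0):
    print(x, leaky_relu(x), elu(x), swish(x), softplus(x))
```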
Regularization rate: The original Playground resets the simulation whenever the regularization rate is modified. This fits with standard deep learning practice, where the regularization rate remains fixed during training.
But I was interested in varying the regularization rate during training, to see if it's possible to discover a simpler model this way. And indeed, this appears to be the case.
Here are the default Playground settings modified so that the regularization rate is a relatively high 0.1. The net usually isn't able to learn anything with the regularization set this high, and all the weights get zeroed out. But now try pressing "play" with the default settings unmodified, and adjust the regularization rate to 0.1 once a circle has been learned. You should get a nice faded circle shape.
Another example: These settings usually do an OK job of learning a spiral. Increasing the regularization to 0.1 makes it impossible for them to learn anything... unless you increase it to 0.1 after the spiral has already been learned, in which case you get a nice faded spiral shape.
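The mechanics of changing the regularization rate mid-run, without resetting the weights, can be sketched on a toy model. The example below is an assumption-laden illustration: it uses 1-D logistic regression with an L2 penalty, made-up data, and a made-up learning rate. Because this toy problem is convex, it only shows the schedule itself (train first, regularize afterwards, weights shrink but the learned separation survives); it does not reproduce the "can't learn from scratch" behavior seen in the Playground's non-convex networks.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, w, b, steps, lr, reg_rate):
    """Full-batch gradient descent on log loss plus an L2 penalty 0.5 * reg_rate * w**2."""
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * (grad_w + reg_rate * w)  # the penalty only touches the weight
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: negatives on the left, positives on the right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# Phase 1: train with no regularization.
w_trained, b_trained = train(xs, ys, 0.0, 0.0, steps=500, lr=0.5, reg_rate=0.0)

# Phase 2: keep the learned parameters and raise the regularization rate to 0.1.
w_reg, b_reg = train(xs, ys, w_trained, b_trained, steps=500, lr=0.5, reg_rate=0.1)

print("weight after training:       ", w_trained)
print("weight after regularization: ", w_reg)
```

The key design point is that `train` takes the starting parameters as arguments, so the second call continues from the trained model instead of from zero.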
"Post-training regularization" is a simple idea, but I think it could be really important.
- Copying a fully trained network and doing post-training regularization on each copy could let you quickly try out a variety of regularization parameters.
- Perturbing inputs just enough to push them across a network's decision boundary is one method for constructing adversarial examples. This attack is easier if a network has a sharp decision boundary, where a small change in the input can cause a big change in the assigned class. Post-training regularization could help us make decision boundaries softer and guard against this attack.
- It's well known that simple models tend to generalize better than complex ones. Gradient descent benefits from complexity in some sense, because the more parameters you're optimizing, the less likely that they're all simultaneously stuck in local optima. So methods to simplify neural nets trained through gradient descent hold promise for improving generalization. A broad view of "post-training simplification" suggests ideas beyond regularization, like tweaking the loss function to penalize "dissimilarity from the most similar neuron in this layer" to merge neurons together.
Animation speed: Controls the speed of the animation.
To see the network's test-time decision boundary, pause the sim and select "Drop 0%".
To help you avoid a seizure, the animation speed will automatically slow down when dropout is active. If you want to crank the speed back up, I suggest first resizing the window so the network is no longer visible.
Layerwise gradient normalization: A major issue that makes training deep neural networks hard is the vanishing gradient problem. As a case study, observe what happens when we take the default Playground settings and add a bunch more layers. Typically, there's a long period of slow learning before the net finally discovers a circle shape. Sometimes it never learns a circle at all.
My idea for layerwise gradient normalization is motivated by the following metaphor for vanishing gradients. Imagine the neurons in the network as employees in a corporation. During forward prop, each neuron takes a linear combination of the work done by its "subordinates" (neurons on the previous layer), applies a nonlinearity, and delivers the result to its "superiors" (neurons on the next layer). During backprop, a neuron gets feedback in the form of partial derivatives from its "superiors", adjusts its behavior, and passes its own feedback on to "subordinates".
A key observation is the circular dependency between the neurons at the top and the neurons at the bottom. Without knowing the roles of their subordinates, the neurons at the top don't know how to combine their work. But without knowing the demands of their superiors, the neurons at the bottom don't know what they are supposed to be doing. To overcome this chicken-and-egg problem, it's ideal for low and high level neurons to learn at roughly the same rate.
This gets us to the vanishing gradient issue. Because the gradient is small for lower-level neurons, they learn slowly. We can't just increase the learning rate for the neural net as a whole: that would cause the neurons at the top to learn faster too, and the feedback they deliver to their subordinates would no longer be consistent.
The solution is to nudge the gradient towards being the same size at each layer. Layerwise gradient normalization accomplishes this by splitting the gradient into a separate vector for each layer and scaling each vector to unit length. This allows us to guarantee that each layer of the net is changing at the same rate.
In practice, information about how the gradient magnitude differs between layers is still useful: a gradient with a smaller magnitude indicates less confidence about the correct direction to adjust. So the dropdown lets you interpolate between ordinary gradient descent (p = 0) and forcing a unit vector at each layer (p = 1). This is accomplished by dividing each layer's gradient by its norm raised to the power p before updating weights.
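The rescaling step can be sketched in a few lines. This is a minimal illustration under assumptions of my own: each layer's gradient is flattened into a plain list of floats, the norm is the Euclidean norm, and a layer whose gradient is exactly zero is left untouched. The function name is hypothetical.

```python
import math

def normalize_layerwise(layer_grads, p):
    """Divide each layer's gradient vector by its L2 norm raised to the power p.

    p = 0 leaves the gradients unchanged (ordinary gradient descent);
    p = 1 rescales every layer's gradient to unit length.
    """
    scaled = []
    for grad in layer_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            scaled.append(list(grad))  # nothing to rescale
            continue
        scale = norm ** (-p)
        scaled.append([g * scale for g in grad])
    return scaled

# Two layers whose gradients point the same way but differ 1000x in magnitude.
grads = [[3.0, 4.0], [0.003, 0.004]]
for p in (0.0, 0.8, 1.0):
    norms = [math.sqrt(sum(g * g for g in layer)) for layer in normalize_layerwise(grads, p)]
    print(f"p={p}: per-layer norms = {norms}")
```

After rescaling, a layer whose gradient had norm n has a gradient of norm n^(1 - p), so intermediate values of p shrink the gap between layers without erasing it.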
To see this in action, observe the example from before with p = 0.8. Despite the depth of the network, a circle is quickly learned. Some of the weights are a bit high because we haven't adjusted the learning rate to account for the rescaled gradient. To address this, we can reduce the learning rate a bit, or make use of post-training regularization.
Unlike other fixes for vanishing gradients, such as skip connections or ReLU activations, layerwise gradient normalization doesn't restrict the kind of model we learn. In fact, the corporation metaphor hints at a method for training a much broader range of deep models. Machine learning researcher Michael I. Jordan writes:
In other engineering areas, the idea of using pipelines, flow diagrams and layered architectures to build complex systems is quite well entrenched... I hope and expect to see more people developing architectures that use other kinds of modules and pipelines, not restricting themselves to layers of "neurons".
Instead of seeing a neural network as a differentiable function approximator, the corporation metaphor suggests seeing a neural network as a bunch of online learning algorithms, stacked on top of each other and learning in parallel. In principle, any online learning algorithm that smoothly adapts to new patterns could serve as a "hidden unit". (Almost any machine learning algorithm should meet this criterion by just weighting data according to recency.)
Learning rate autotuning: Goodfellow et al. write:
One drawback common to most hyperparameter optimization algorithms with more sophistication than random search is that they require for a training experiment to run to completion before they are able to extract any information from the experiment. This is much less efficient, in the sense of how much information can be gleaned early in an experiment, than manual search by a human practitioner, since one can usually tell early on if some set of hyperparameters is completely pathological.
Some recent work has addressed this by computing "hypergradients", i.e. taking the derivative with respect to hyperparameters. This allows a hyperparameter such as the learning rate to be updated dynamically during optimization.
With the learning rate autotuning dropdown, I take a very different approach to dynamic learning rate updates. Gradient descent is viewed as a multi-armed bandit problem. At the beginning of each epoch, there are three actions available: increase the learning rate, decrease the learning rate, or continue with no change. I compute the average relative loss decrease associated with each learning rate. Then I use softmax action selection to choose an action: action a is picked with probability proportional to exp(Qt(a) / 𝜏), where Qt(a) is the average observed score for that action. The dropdown controls 𝜏; larger values correspond to more exploration. At the end of the epoch, the change in the loss is recorded for the learning rate used.
To prevent missteps, the increase learning rate action is only available if the average loss decrease for the current learning rate is favorable. (You might also want to err on the side of choosing a small learning rate to start with.)
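Here is one way the bandit described above could be sketched. Several details are assumptions rather than facts about the Playground: learning rates are placed on a grid base_lr * factor^k, a learning rate that has never been tried gets a score of 0, "favorable" is taken to mean a positive average, and the class and method names are hypothetical.

```python
import math
import random
from collections import defaultdict

class LearningRateBandit:
    """Hypothetical sketch: softmax action selection over learning rate moves."""

    def __init__(self, base_lr=0.03, factor=2.0, tau=0.1, seed=0):
        self.base_lr, self.factor, self.tau = base_lr, factor, tau
        self.k = 0                       # current position on the learning rate grid
        self.scores = defaultdict(list)  # grid position -> relative loss decreases seen there
        self.rng = random.Random(seed)

    @property
    def lr(self):
        return self.base_lr * self.factor ** self.k

    def q(self, k):
        """Average observed score at grid position k (0.0 if never tried: an assumption)."""
        seen = self.scores.get(k)
        return sum(seen) / len(seen) if seen else 0.0

    def probabilities(self):
        """Softmax over the available moves: -1 = decrease, 0 = keep, +1 = increase."""
        moves = [-1, 0]
        if self.q(self.k) > 0:           # increase only if the current rate looks favorable
            moves.append(1)
        qs = [self.q(self.k + m) for m in moves]
        top = max(qs)                    # subtract the max for numerical stability
        exps = [math.exp((v - top) / self.tau) for v in qs]
        total = sum(exps)
        return moves, [e / total for e in exps]

    def start_epoch(self):
        """Sample a move, apply it, and return the learning rate for this epoch."""
        moves, probs = self.probabilities()
        self.k += self.rng.choices(moves, weights=probs, k=1)[0]
        return self.lr

    def end_epoch(self, loss_before, loss_after):
        """Record the relative loss decrease for the rate just used (assumes loss_before > 0)."""
        self.scores[self.k].append((loss_before - loss_after) / loss_before)

bandit = LearningRateBandit()
bandit.end_epoch(loss_before=1.0, loss_after=0.8)  # pretend the first epoch went well
print(bandit.probabilities())
print(bandit.start_epoch())
```

A small 𝜏 concentrates probability on the move with the best average score, while a large 𝜏 pushes the probabilities toward uniform, which is the exploration behavior the dropdown controls.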
Prevent loss increases: One problem with making the learning rate too high is that we may take a large step which overshoots and actually increases the loss. The "prevent loss increases" dropdown explores a simple idea for addressing this. Before each epoch, a snapshot of the neural net gets saved. Then at the end of the epoch, if the classification loss has increased, we roll back to the snapshot, decrease the learning rate, and try again. Turning on this option will also turn on learning rate autotuning by default, so the learning rate doesn't just continually ratchet downward.
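The snapshot-and-rollback loop might look roughly like the sketch below. The halving factor, the retry cap, and the function and parameter names are all assumptions made for illustration; the model here is just a dictionary holding one weight, and an "epoch" is a single gradient step on w^2.

```python
import copy

def train_epoch_with_rollback(model, lr, run_epoch, compute_loss,
                              shrink=0.5, max_retries=20):
    """Run one epoch; if the loss went up, restore the snapshot, shrink lr, and retry."""
    loss_before = compute_loss(model)
    snapshot = copy.deepcopy(model)       # saved before the epoch starts
    for _ in range(max_retries):
        run_epoch(model, lr)              # updates the model in place
        if compute_loss(model) <= loss_before:
            return model, lr              # the epoch did not make things worse
        model = copy.deepcopy(snapshot)   # roll back to the saved parameters
        lr *= shrink                      # assumed decrease rule: halve the rate
    return model, lr

# Toy problem: minimize w**2 starting from w = 1 with a learning rate that overshoots.
def compute_loss(model):
    return model["w"] ** 2

def run_epoch(model, lr):
    model["w"] -= lr * 2.0 * model["w"]   # gradient of w**2 is 2w

model, lr = train_epoch_with_rollback({"w": 1.0}, 1.5, run_epoch, compute_loss)
print(model, lr)
```

In this toy run the first attempt at lr = 1.5 raises the loss, so the step is undone and repeated with a smaller rate. Pairing this with learning rate autotuning, as the option does by default, is what lets the rate recover afterwards.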