A one-hidden-layer MLP is a Universal Boolean Function
But the required number of perceptrons can be exponential: up to 2^N
How about depth?
Will require only 3(N−1) perceptrons, linear in N, to express the same function
Using associativity, these can be arranged in 2 log2(N) layers
e.g. model O = W ⊕ X ⊕ Y ⊕ Z
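A minimal sketch of this construction, assuming the standard encoding of XOR as three threshold units (OR, NAND, then AND). Pairing the inputs into a balanced tree gives N−1 XOR gates of 3 perceptrons each, i.e. 3(N−1) perceptrons in 2 log2(N) layers; the function names here are illustrative, not from the source.

```python
from itertools import product

def step(z):
    # Threshold (perceptron) activation.
    return 1 if z >= 0 else 0

def xor_gate(x, y):
    # One XOR = 3 perceptrons in 2 layers: OR and NAND, then AND.
    h_or   = step(x + y - 0.5)
    h_nand = step(1.5 - x - y)
    return step(h_or + h_nand - 1.5)

def parity4(w, x, y, z):
    # Associativity pairs inputs into a tree: (W xor X) xor (Y xor Z)
    # -> 3 gates = 9 perceptrons in 4 layers, instead of up to 2^4 units wide.
    return xor_gate(xor_gate(w, x), xor_gate(y, z))

# Verify against the parity of the input bits.
for bits in product([0, 1], repeat=4):
    assert parity4(*bits) == sum(bits) % 2
```

The same tree extends to any N, which is where the 3(N−1) count comes from.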
The challenge of depth
Using only K hidden layers will require O(2^(CN)) neurons in the widest layer, where C = 2^(−(K−1)/2)
A network with fewer than the minimum required number of neurons cannot model the function
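A quick numeric illustration of the stated bound (the function and the choice N = 32 are mine, not from the source): plugging a few depths K into 2^(CN) with C = 2^(−(K−1)/2) shows how fast the required width shrinks as depth grows.

```python
def widest_layer_bound(N, K):
    # Width bound from the slide: O(2^(C*N)) with C = 2^(-(K-1)/2).
    C = 2 ** (-(K - 1) / 2)
    return 2 ** (C * N)

# For N = 32 inputs, the widest layer shrinks rapidly with depth:
for K in (1, 3, 5, 9):
    print(K, widest_layer_bound(32, K))
# K=1 needs ~2^32 neurons; K=9 needs only ~2^2.
```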
Universal classifiers
Composing complicated “decision” boundaries
Using OR to create more decision boundaries
Can compose arbitrarily complex decision boundaries
Even using one-layer MLP
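A toy sketch of the composition, under assumptions of my own (axis-aligned squares as the convex pieces): each convex region is an AND over half-plane perceptrons, and an OR unit takes the union, giving a non-convex decision boundary.

```python
def step(z):
    # Threshold (perceptron) activation.
    return 1 if z >= 0 else 0

def in_square(x, y, cx, cy, r):
    # AND of four half-plane perceptrons bounding an axis-aligned square.
    gates = [step(x - (cx - r)), step((cx + r) - x),
             step(y - (cy - r)), step((cy + r) - y)]
    return step(sum(gates) - 3.5)  # fires only if all four fire

def classify(x, y):
    # OR of two convex regions: fires if either sub-network fires,
    # yielding a disconnected (non-convex) decision region.
    return step(in_square(x, y, 0, 0, 1) + in_square(x, y, 3, 0, 1) - 0.5)

classify(0, 0)    # inside first square  -> 1
classify(3, 0.5)  # inside second square -> 1
classify(1.8, 0)  # between the squares  -> 0
```

Stacking more AND/OR stages in the same style yields arbitrarily complex boundaries.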
Need for depth
A naïve one-hidden-layer neural network would require infinitely many hidden neurons
Constructing a basic unit and adding more layers decreases the number of neurons
The number of neurons required in a shallow network is potentially exponential in the dimensionality of the input
Universal approximators
A one-hidden-layer MLP can model an arbitrary function of a single input
MLPs can actually compose arbitrary functions in any number of dimensions
Even without “activation”
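One way to see the single-input case, sketched under my own assumptions (sin as the target, a staircase of shifted step units as the construction): each hidden unit fires past a knot point, and its output weight carries the increment of the function, so the hidden layer sums to a piecewise-constant approximation that improves as units are added.

```python
import math

def step(z):
    # Threshold activation.
    return 1.0 if z >= 0 else 0.0

def build_staircase(f, lo, hi, n):
    # Hidden unit i fires when x >= x_i; its output weight is the
    # increment f(x_i) - f(x_{i-1}), so the sum reproduces f stepwise.
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    weights = [f(xs[0])] + [f(xs[i]) - f(xs[i - 1]) for i in range(1, n + 1)]
    def net(x):
        return sum(w * step(x - xi) for w, xi in zip(weights, xs))
    return net

# One hidden layer of 1001 step units approximating sin on [0, 2*pi].
net = build_staircase(math.sin, 0.0, 2 * math.pi, 1000)
max_err = max(abs(net(x) - math.sin(x))
              for x in [i * 2 * math.pi / 500 for i in range(501)])
# max_err shrinks as n grows; exact recovery needs n -> infinity.
```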
Activation
A universal map from the entire domain of input values to the entire range of the output activation
Optimal depth and width
Deeper networks will require far fewer neurons for the same approximation error
Sufficiency of architecture
Not all architectures can represent any function
Continuous activation functions result in graded output at the layer
To capture information “missed” by the lower layer
Width vs. Activations vs. Depth
Narrow layers can still pass information to subsequent layers if the activation function is sufficiently graded
But this will require greater depth, to permit later layers to capture these patterns
Capacity of the network
Information or Storage: how many patterns can it remember
VC dimension: bounded by the square of the number of weights in the network
Straightforward measure: the largest number of disconnected convex regions it can represent
A network with insufficient capacity cannot exactly model a function that requires a greater minimal number of convex regions than the capacity of the network