for deep learning; improves on linear classifiers
requires non-linear activation functions; otherwise stacked linear layers collapse back into a single linear map
new hyperparams (sketch below)
- num layers
- size of each layer (its width)
 

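A minimal NumPy sketch (my own illustration; the sizes are made up) of both points above: without an activation, two stacked linear layers are equivalent to one linear layer, and the layer count and width are exactly the knobs that become new hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes for illustration: 4 inputs, one hidden layer of width 8, 3 outputs;
# the number of layers and each layer's width are the new hyperparameters
W1 = rng.normal(size=(4, 8))   # input -> hidden
W2 = rng.normal(size=(8, 3))   # hidden -> output
x = rng.normal(size=(1, 4))

# without an activation, two linear layers collapse into one linear map
assert np.allclose((x @ W1) @ W2, x @ (W1 @ W2))

# a ReLU between the layers makes the composition genuinely non-linear
relu = lambda z: np.maximum(z, 0.0)
y = relu(x @ W1) @ W2
```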
activation functions
- step functions - binary output; gradient is zero almost everywhere, so hard to optimize
- sigmoid - smooth, but vanishing gradients (flat regions = slow learning)
- ReLU - keeps gradients constant for positive inputs, fixes the flat-region problem (gradient comparison below)
 
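A quick NumPy comparison (my own sketch, using the standard definitions) of the two gradients: the sigmoid gradient shrinks toward zero in its flat regions, while the ReLU gradient is exactly 1 for any positive input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25, nearly 0 in the flat regions

def relu_grad(z):
    return (z > 0).astype(float)  # exactly 1 for positive inputs, 0 otherwise

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(z))  # ~[0.00005, 0.20, 0.25, 0.20, 0.00005] -> vanishes at the tails
print(relu_grad(z))     # [0, 0, 0, 1, 1] -> constant for positives
```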
architecture
- fully connected layers: every input connects to every output
- param efficiency: large inputs require billions of params… (rough count below)
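A back-of-the-envelope count (the input size is a hypothetical example, not from the notes) showing how a single fully connected layer on an image-sized input already needs billions of parameters.

```python
# hypothetical example: a 1000x1000 RGB image flattened into a vector,
# feeding a fully connected layer with 1000 output units
input_dim = 1000 * 1000 * 3         # 3,000,000 inputs
output_units = 1000

weights = input_dim * output_units  # one weight per (input, output) pair
biases = output_units
total = weights + biases
print(f"{total:,}")                 # 3,000,001,000 -> ~3 billion params for one layer
```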