is a set of perceptrons that produces one score for every category



- add on bias vector
 - argmax gives classification
 - interpret output as probability with softmax
 - interpret weights as templates:
- reshape the output vector back into shape of an image

 
 - reshape the output vector back into shape of an image
 - interpret weighs geometrically

 
training
- learn how to pick weights
 - find W such that 
- where y is true label, y_hat is predicted label
 - define loss function
- when classifier is correct, loss should be low
 - when classifier makes mistakes, loss should be high
 
 
 
summary
Key
- input layer (vector)
 - weight matrix - params that transform inputs to outputs
 - output layer - transformed predictions Learning
 - softmax activation - convert scores to probs
 - loss function (cross-entropy) - measure prediction error
- Loss = -log(prob of correct class)
 
 - gradient descent and backpropagation
- optimize weights using chain rule linear models are the foundation for neural networks
 
 
limitations
- assume linear separability (data can be divided by a hyperplane)
 - we need nonlinearity
- feature transformation: learn mapping that makes data linearly separable
 - neural networks - nonlinear activations