loss over entire dataset = avg loss over examples:
$L = \frac{1}{N} \sum_{i=1}^{N} L_i(y_i, \hat{y}_i)$
examples:
L2 loss: squared error $(y_i - \hat{y}_i)^2$, not robust to outliers
L1 loss: absolute error $|y_i - \hat{y}_i|$
to minimize loss → gradient descent
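A minimal sketch of the notes above in plain Python, not a definitive implementation. The names `mean_loss`, `l2`, `l1`, and `fit_constant` are made up for illustration, and the "model" is assumed to be a single constant prediction `c`, which keeps the gradient easy to write by hand. It shows the dataset loss as an average of per-example losses, how one outlier inflates L2 far more than L1, and a few gradient descent steps on the L2 loss.

```python
def l2(y, y_hat):
    # Squared error for one example.
    return (y - y_hat) ** 2


def l1(y, y_hat):
    # Absolute error for one example.
    return abs(y - y_hat)


def mean_loss(per_example_loss, ys, y_hats):
    # Dataset loss = average of the per-example losses.
    total = sum(per_example_loss(y, y_hat) for y, y_hat in zip(ys, y_hats))
    return total / len(ys)


def fit_constant(ys, lr=0.1, steps=200):
    # Gradient descent on the mean L2 loss for a constant prediction c.
    # d/dc of mean((c - y_i)^2) is mean(2 * (c - y_i)).
    c = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (c - y) for y in ys) / len(ys)
        c -= lr * grad
    return c


# Toy data (hypothetical): the last target is an outlier.
ys = [1.0, 2.0, 3.0, 100.0]
y_hats = [1.0, 2.0, 3.0, 4.0]

print(mean_loss(l2, ys, y_hats))  # 2304.0, dominated by the outlier
print(mean_loss(l1, ys, y_hats))  # 24.0
print(fit_constant(ys))           # approaches the mean of ys (26.5)
```

The single error of 96 contributes 96² to the L2 sum but only 96 to the L1 sum, which is the sense in which L2 is not robust to outliers. The L1 loss is not differentiable at zero error, so a gradient descent version for it would need a subgradient such as the sign of the error; that variant is left out here.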