Structure from Motion

reconstruct 3D structure from a collection of 2D images taken from different viewpoints

triangulation: 2D points to 3D structure

Given: camera matrices P and P’, and a pair of matched image points x and x’

Goal: Find the 3D point X that projects to x and x’

$x = PX$ and $x^{'} = P^{'} X$

applying camera matrix to 3D point gives pixel point
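homogeneous coordinates: projection only holds up to scale, so depth is lost
backprojection
- apply pseudo-inverse of P to x to get one point on the viewing ray, then connect it to the camera center; this gives the ray but not the full 3D position, since depth is unknown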
fix: we enforce collinearity constraints:
- $x \times (PX) = 0$ (cross product of parallel vectors is zero)
- this equality removes the scale factor $\alpha$ from $x = \alpha PX$
with two linear equations per view and at least two views, we can solve for X with singular value decomposition (SVD) (similar to camera calibration & pose estimation)
triangulation requires at least two cameras to have enough equations to solve for all unknowns
- rays intersect at 3D object point
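
A minimal NumPy sketch of the linear triangulation described above. The function name `triangulate` and its argument layout are assumptions made for illustration, not part of the notes; each view contributes the two independent rows of $x \times (PX) = 0$.

```python
import numpy as np

def triangulate(points_2d, cameras):
    """Linear triangulation of one 3D point from two or more views.

    points_2d: sequence of (u, v) pixel coordinates, one per view
    cameras:   sequence of 3x4 projection matrices P, one per view
    """
    rows = []
    for (u, v), P in zip(points_2d, cameras):
        # x × (P X) = 0 yields two independent linear equations per view
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Solution of A X = 0 with ||X|| = 1: right singular vector
    # belonging to the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    # Convert from homogeneous to Euclidean coordinates
    return X[:3] / X[3]
```

Passing more than two views simply adds rows to `A`; the SVD step stays the same.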

challenges

noise may prevent rays from intersecting at a single point
- fix: add more rows to the matrix by using more cameras
singular value decomposition (SVD) provides least-squares solution (best fit)
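
Written out for n views, the stacked system looks roughly like this. The notation is assumed here: $p_i^{kT}$ is row k of camera matrix $P_i$ and $(u_i, v_i)$ is the observed pixel in view i.

$$
A X =
\begin{bmatrix}
u_1 p_1^{3T} - p_1^{1T} \\
v_1 p_1^{3T} - p_1^{2T} \\
\vdots \\
u_n p_n^{3T} - p_n^{1T} \\
v_n p_n^{3T} - p_n^{2T}
\end{bmatrix} X = 0,
\qquad
\hat{X} = \arg\min_{\|X\| = 1} \|A X\|^2
$$

With noise, $AX = 0$ has no exact solution. The minimizer $\hat{X}$ is the right singular vector of A with the smallest singular value. Note this minimizes an algebraic error, not the reprojection error in pixels.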

by enforcing constraints, reduce complexity of matching points between images

baseline: line connecting camera optical centers O and O’
epipoles (e,e’): where baseline intersects the image planes
- e is the projection of O’ onto the first image plane; e’ is the projection of O onto the second
epipolar plane: plane formed by baseline and 3D point X
epipolar line: intersection of epipolar plane and image plane (all possible matches are here)
epipolar constraint: for point x in first image, its match x’ in second image must lie on epipolar line l’
- reduces search space to 1D, so matching is more efficient
importance: no need for depth sensors, pure geometry-based 3D reconstruction
fundamental

Where are epipoles?

encodes camera motion

essential matrix $E = [t]_{\times} R$ - encodes rotation and translation between cameras ($[t]_{\times}$ is the skew-symmetric cross-product matrix of t)
- constraint: for points in normalized camera coordinates, $x^{' T} E x = 0$
- properties
  - rank = 2 (due to cross-product)
  - singular values: two equal non-zero values and one zero
- multiplying a point by E gives its epipolar line in the second view: $l^{'} = E x$
- differs from image homographies because
  - E maps point to line
  - homography maps point to point
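
A small sketch for checking the properties of E numerically. It assumes the convention that a point X in the first camera frame maps to $X^{'} = RX + t$ in the second; the helper names `skew`, `essential_from_pose` and `epipolar_line` are made up for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_pose(R, t):
    """E = [t]_x R for the relative pose X' = R X + t."""
    return skew(t) @ R

def epipolar_line(E, x):
    """Epipolar line l' = E x in the second view (x in normalized homogeneous coords)."""
    return E @ x
```

For any pose, `np.linalg.svd(essential_from_pose(R, t), compute_uv=False)` should return two values equal to the length of t and one value near zero, which is the rank-2 property listed above.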
$F = K^{'-T} E K^{-1}$ (fundamental matrix; K and K’ are the intrinsics of the two cameras)
- constraint: for points in image (pixel) coordinates, $x^{' T} F x = 0$
- estimation:
  - use 8-point algorithm and RANSAC
  - solvable from correspondences, doesn’t need known camera poses
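
A sketch of the 8-point algorithm in NumPy. It uses the normalized variant (points are rescaled before solving), which is a common choice but an assumption here, and it leaves out the RANSAC loop, so it expects outlier-free correspondences. Function names are illustrative.

```python
import numpy as np

def normalize(pts):
    """Shift points to zero mean and scale them to mean distance sqrt(2).

    pts: N x 2 array. Returns (N x 3 homogeneous points, 3x3 transform T).
    """
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return pts_h @ T.T, T

def eight_point(x1, x2):
    """Estimate F with x2^T F x1 = 0 from N >= 8 matches (two N x 2 arrays)."""
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # One row per match: the outer product of x2 and x1, flattened row-major
    A = np.einsum('ni,nj->nij', p2, p1).reshape(len(p1), 9)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization, then fix the arbitrary scale
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)
```

In a RANSAC setting, `eight_point` would be run on random subsets of 8 matches and scored by how many other matches lie close to their epipolar lines.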
Recovering camera motion
- decompose E into R and t (up to scale ambiguity)
- triangulate 3D points with recovered poses
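
One standard way to do the decomposition is via the SVD of E, sketched below. The function name is an assumption for this example. It returns four candidate poses; the usual way to pick among them is to triangulate a point and keep the pose that puts it in front of both cameras, a step not shown here.

```python
import numpy as np

def decompose_essential(E):
    """Return the four (R, t) candidates consistent with E; t has unit length."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs if needed so that the recovered R has det = +1
    # (E is only defined up to scale, so its sign is free)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    # Translation direction is the left null vector of E; its length is not recoverable
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```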

combines everything for full 3D reconstruction

Given many images, how can we

Calibrate → Triangulate

for added views
- determine motion using all known 3D points that have correspondence in new image
- add structure by estimating new points in new image
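
For the "determine motion" step, one simple option is linear camera resection: estimate the new view's camera matrix P directly from known 3D points and their pixels in the new image. This is a sketch under that assumption; the notes do not say which pose method is used, and practical pipelines often use a calibrated PnP solver with RANSAC instead. The name `resect_camera` is illustrative.

```python
import numpy as np

def resect_camera(X_world, x_image):
    """Estimate a 3x4 camera matrix P (up to scale) from N >= 6 matches.

    X_world: N x 3 array of known 3D points
    x_image: N x 2 array of their pixel coordinates in the new image
    """
    rows = []
    zero = np.zeros(4)
    for X, (u, v) in zip(X_world, x_image):
        Xh = np.append(X, 1.0)
        # Same collinearity constraint as triangulation,
        # but now the 12 entries of P are the unknowns
        rows.append(np.concatenate([Xh, zero, -u * Xh]))
        rows.append(np.concatenate([zero, Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 4)
```

With P for the new view in hand, new matches between it and earlier views can be triangulated to add structure.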