Lecture 5b: Ways to achieve viewpoint invariance

“invariant” means, literally, that it doesn’t vary: it doesn’t change as a result of a change of viewpoint.

This means that if the neuron for the feature detector is fairly active (say it’s a logistic neuron and it has a value close to 1) for one input image, then if we give the neural network a image of that same scene from a somewhat different viewpoint, that same neuron will still be fairly active. Its activity is invariant under viewpoint changes.

“invariant” is a matter of degrees: there’s very little that’s completely invariant, or that has no invariance at all, but some things are more invariant than others.

The invariant features are things like “there’s a red circle somewhere in the image”, and the neuron for that feature detector should somehow learn to turn on when there is indeed a red circle in the input, and turn off if there isn’t.

Try to come up with examples of features that are largely invariant under viewpoint changes, and examples of features that don’t have that property.

Some ways to achieve viewpoint invariance

We are so good at viewpoint invariance that it is hard to appreciate how difficult it is.
- Its one of the main difficulties in making computers perceive.
- We still don’t have generally accepted solutions.
There are several different approaches:
- Use redundant invariant features.
- Put a box around the object and use normalized pixels.
Lecture 5c: Use replicated features with pooling. This is called “convolutional neural nets”
Use a hierarchy of parts that have explicit poses relative to the camera (this will be described in detail later in the course).

The invariant feature approach

Extract a large, redundant set of features that are invariant under transformations
- e.g. pair of roughly parallel lines with a red dot between them.
- This is what baby herring gulls use to know where to peck for food.
With enough invariant features, there is only one way to assemble them into an object.
- We don’t need to represent the relationships between features directly because they are captured by other features.
For recognition, we must avoid forming features from parts of different objects.

The judicious normalization approach

Put a box around the object and use it as a coordinate frame for a set of normalized pixels.
- This solves the dimension-hopping problem. If we choose the box correctly, the same part of an object always occurs on the same normalized pixels.
- The box can provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch …
- But choosing the box is difficult because of:
- Segmentation errors, occlusion, unusual orientations.
We need to recognize the shape to get the box right!

The brute force normalization approach

When training the recognizer, use well-segmented, upright images to fit the correct box.
At test time try all possible boxes in a range of positions and scales.
- This approach is widely used for detecting upright things like faces and house numbers in unsegmented images.
It is much more efficient if the recognizer can cope with some variation in position and scale so that we can use a coarse grid when trying all possible boxes.

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:

@online{bochman2017,
  author = {Bochman, Oren},
  title = {Deep {Neural} {Networks} - {Notes} for Lecture 5b},
  date = {2017-08-19},
  url = {https://orenbochman.github.io/notes/dnn/dnn-05/l05b.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2017. “Deep Neural Networks - Notes for Lecture 5b.” August 19, 2017. https://orenbochman.github.io/notes/dnn/dnn-05/l05b.html.