
Figure 6: Data of a single cylinder from a passive range finding system (stereopsis) and output of our system


Figure 7: Data of two objects with occlusion from stereopsis and output of our system


Figure 8: Data of a simple cylinder from an active range finder (after processing using the system in Fan et al. '86 and segment fitting) and output of our system.

Object Recognition Using Alignment

Daniel P. Huttenlocher

Shimon Ullman

Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139

Abstract. This paper presents an approach to recognition where an object is first aligned with an image using a small number of pairs of model and image features, and then the aligned model is compared directly against the image. For instance, the position, orientation, and scale of an object in three-space can be determined from three pairs of corresponding model and image features. By using a small fixed number of features to determine position and orientation, the alignment process avoids structuring the recognition problem as an exponential search. To demonstrate the method, we present some examples of recognizing flat rigid objects with arbitrary three-dimensional position, orientation, and scale, from a single two-dimensional image. The recognition system chooses features for alignment using a scale-space segmentation of edge contours. Finally, the method is extended to the domain of rigid objects in general.

Introduction

Object recognition involves identifying a correspondence between part of an image and a particular view of a known object. This requires matching the image against stored object models to determine if any of the models could produce a portion of the image. Even for a single model, a given object can appear very different depending on its position and angle with respect to the viewer. First, from a particular view, part of an object will generally be occluded. Second, an object may be distorted by projection into the image plane (e.g., foreshortening). Finally, an object may itself undergo transformations, such as having parts which move independently, or being stretched or bent. Most recognition systems assume that objects are rigid and do not undergo any transformation [12] [4] [9] [2]. Some systems allow for perspective projection [16], and some have parameterized models which can articulate at certain points [7].

The presence of more than one object in an image also complicates the recognition problem. First, objects may occlude one another. Second, different objects in the image must somehow be individuated. In the case of touching and overlapping objects this generally cannot be done prior to recognition, but rather must be part of the recognition process itself.

The Task

In this paper we consider the problem of matching a two-dimensional view of an object against a potential model. The viewed object can have arbitrary three-dimensional position, orientation, and scale, and may be touching or occluded by other objects (3D from 2D recognition). First we consider the domain of flat rigid objects such as the widget shown in Figure 1. While the viewed object is flat, the problem is not two-dimensional, because a flat object positioned in three-space can undergo distortion such as foreshortening when projected into the image plane.
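To make the distortion concrete, the following sketch (ours, not from the paper; the function name is illustrative) orthographically projects a flat contour that has been tilted out of the image plane. The contour's extent along one axis is compressed by the cosine of the tilt, which is exactly the foreshortening described above:

```python
import numpy as np

def project_flat_object(points_2d, tilt_deg, scale=1.0):
    """Tilt a planar contour out of the image plane by tilt_deg
    (rotation about the x-axis), then project orthographically back
    onto the x-y plane. The y-extent shrinks by cos(tilt):
    foreshortening."""
    tilt = np.radians(tilt_deg)
    pts = np.asarray(points_2d, dtype=float)          # shape (n, 2)
    # Embed at z = 0, rotate about x, drop z: only y is compressed.
    return scale * np.column_stack([pts[:, 0], pts[:, 1] * np.cos(tilt)])

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(project_flat_object(square, tilt_deg=60.0))     # y-extent halved
```

A matcher that treated the problem as purely two-dimensional would fail on such a view, even though the object itself is flat.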

Like the general recognition task, this task suffers from problems of occlusion and of individuating multiple objects in an image. There is also a limited kind of shape distortion caused by projecting a rigid object into the image. We then consider extending the recognizer to the domain of rigid objects in general, such as the personnel carrier shown in Figure 2.


Figure 2. A personnel carrier used in recognition.

The extension assumes that objects are polyhedral, and that they are viewed such that parallel surfaces appear more or less parallel.

In order to solve this task, we present a new approach where the recognition process is divided into two stages. In the first stage, a rigid object is aligned with an image using a small number of model and image features. In the second stage, the alignment is used to transform the model into image coordinates. The key observation underlying the alignment operation is that the position and orientation of a rigid object can be determined from a small number of position and orientation measures. Once the position and orientation have been determined, the model can be compared directly with the image. In contrast, current recognition systems search for the largest set of model and image feature pairs which are consistent with a single position and orientation of a rigid object. The number of such sets is exponential, requiring the use of various techniques to limit the search.
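The alignment computation for the flat-object case can be made concrete with a small sketch. Under a weak-perspective approximation, a planar model maps into the image by a two-dimensional affine transform, which three non-collinear pairs of model and image points determine exactly. The code below is a minimal illustration under that assumption, not the paper's implementation; solve_alignment and verify are our own names:

```python
import numpy as np

def solve_alignment(model_pts, image_pts):
    """Solve for the 2-D affine map (A, t) with image = A @ model + t
    from exactly three non-collinear point correspondences. Under weak
    perspective, a flat object in arbitrary 3-D position, orientation,
    and scale projects into the image by such a map."""
    M = np.asarray(model_pts, float)             # shape (3, 2)
    I = np.asarray(image_pts, float)             # shape (3, 2)
    X = np.hstack([M, np.ones((3, 1))])          # six equations,
    sol = np.linalg.solve(X, I)                  # six unknowns
    return sol[:2].T, sol[2]                     # A (2x2), t (2,)

def verify(model_pts, image_pts, A, t, tol=2.0):
    """Stage two: transform every model point and check that it lands
    near some image point (a stand-in for comparing edge contours)."""
    proj = np.asarray(model_pts, float) @ A.T + t
    img = np.asarray(image_pts, float)
    d = np.linalg.norm(proj[:, None] - img[None, :], axis=2).min(axis=1)
    return float(np.mean(d < tol))               # fraction accounted for

# Three correspondences fix the transform; remaining points verify it.
A, t = solve_alignment([(0, 0), (4, 0), (0, 3)], [(10, 5), (14, 5), (10, 8)])
```

Because three pairs fix the transform completely, hypotheses can be generated and tested one at a time, rather than searching over all consistent sets of feature pairings.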

Matching Models and Images: Previous Approaches

In this section we briefly discuss the limitations of some recent recognition systems with respect to the recognition task described above (for a more general review see [3]). These systems all exploit rigidity by noting that for a given position and orientation of a rigid object, there must be a single transformation which maps each model feature onto its corresponding image feature. This transformation consists of a three-dimensional rotation and translation for 3D from 3D recognition, and a solution to the perspective viewing equation for 3D from 2D recognition [16].

Recognition is generally structured as a search for the largest pairing of model and image features for which there exists a single transformation mapping each model feature to its corresponding image feature [12] [4] [16] [7] [9]. For i image features and m model features there are at most p = i × m pairs of model and image features. Because of occluded image points, and image points which do not correspond to the model, any subset of these p pairs could be the largest set of matching model and image points, and thus the number of possible matches is exponential in p. Two methods are used to limit this space of possible matchings of model and image features.
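As a quick worked illustration of that count (numbers ours, purely for scale):

```python
# With only ten features on each side there are p = 100 candidate
# pairs, and any of the 2**p subsets could be the largest match.
i, m = 10, 10
p = i * m
print(f"p = {p} pairs, 2**p = {2**p:.3e} possible subsets")
```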

The first method of limiting the possible matches is to use the identity of features to restrict the pairing of model and image features. However, even in the ideal case where each model feature has only a single corresponding image feature, there may be multiple matches to consider because of image features which actually correspond to other objects in the scene. Shape descriptions such as SLS [6], codons [14], and the curvature primal sketch [1] are all intended to produce relatively unique features for use in recognition.

The problem with using the identity of features in recognition is that there is a tradeoff between the uniqueness of a feature and the robustness with which it can be recognized. Since systems which rely heavily on the identity of features must use relatively unique features, they tend to be sensitive to noise and occlusion in the image.

For instance, the LFF [4] and 3DPO [5] recognition systems form feature descriptions by using local clusters of features. A "focus feature" in each cluster is chosen for use in matching. This feature is described in terms of its type (e.g., corner, hole), and the type, distance, and angle of the other features in the cluster. The use of local feature clusters yields relatively unique features. However, it is difficult to ensure that each cluster is composed of features from a single object, making the system sensitive to the position and orientation of neighboring and occluding objects.
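A rough sketch of the kind of local-cluster description just outlined (the data layout is our guess at the idea, not the actual LFF or 3DPO format):

```python
import math
from dataclasses import dataclass

@dataclass
class Feature:
    kind: str   # e.g. "corner" or "hole"
    x: float
    y: float

def cluster_description(focus, neighbors):
    """Describe a focus feature by its type together with the type,
    distance, and angle of each nearby feature. Distinctive, but only
    valid if every neighbor truly belongs to the same object."""
    rel = sorted((f.kind,
                  math.hypot(f.x - focus.x, f.y - focus.y),
                  math.atan2(f.y - focus.y, f.x - focus.x))
                 for f in neighbors)
    return (focus.kind, rel)
```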

The second method of limiting the possible matches is to use relations between features to eliminate inconsistent pairs of model and image features [12] [9] [4]. For instance, in order for two pairs of model and image features (m1, i1) and (m2, i2) to be part of a consistent set, the distance between the image features i1 and i2 must be the same as the distance between the model features m1 and m2, within some error bound. Similarly, the angle between orientation measures for any pair of image features must match the angle between the corresponding pair of model features.
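The pairwise test is easy to state in code. This sketch (ours, written for the case where distances are directly measurable, i.e., 2D from 2D or 3D from 3D recognition) keeps two pairings only when the model and image inter-feature distances agree within an error bound:

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def consistent(m1, i1, m2, i2, eps=2.0):
    """Can (m1, i1) and (m2, i2) belong to the same rigid match?
    Only if the model distance matches the image distance."""
    return abs(dist(m1, m2) - dist(i1, i2)) <= eps

# Prune the pair table before searching for a maximal matching:
# keep (m1, i1) and (m2, i2) together only when consistent(...) holds.
```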

The problem with using relations between features in recognition is that the relations must be measurable in the image. Since relations such as distance and angle are not invariant under projection, three-dimensional recognition systems which use these relations require three-dimensional data. Relations which are invariant under projection tend to be much weaker than distance and angle relations.

It has been demonstrated that distance and angle relations can be used to greatly limit the number of possible matches of a model to an image, for both 2D from 2D [12] [4] and 3D from 3D [13] [5] recognition tasks. However, the method cannot readily be extended to handle 3D from 2D tasks. In addition to empirical demonstrations of the power of distance and angle relations among features, it has been
