Indirect methods of obtaining depth maps, based largely on triangulation techniques, have made the largest contribution in this area. This is thought to be due in part to existing optoelectronics technology (camera tubes, photodiode arrays, etc.), whose devices are inherently 2-D and therefore require triangulation techniques to integrate them within a 3-D vision system, and in part to the analogy with the human vision system, which is also based on a triangulation technique (Marr and Poggio, 1976, 1977).
Stereo vision, in particular, has received considerable attention. The disparity technique is based on the correlation between images of the same object taken by two different cameras under the same lighting conditions (Marr and Poggio, 1976), while the photometric technique is based on the correlation between images taken by the same camera under two different lighting conditions (Ikeuchi and Horn, 1979).
Stereo vision sensors, like their 2-D counterparts, are also based on optical array transducers, both vacuum and solid-state (such as camera tubes, CCD, DRAM and photodiode arrays). Their main function is to provide views of the object in question from multiple positions (usually two). Figure 1 shows a diagrammatic view of how these two images are obtained.
To draw the imaging lines in Figure 1 (which, for simplicity’s sake, were limited to two per image) one must consider that each point of the object’s image corresponds to one point on the object surface (assuming properly focused optics). This means that this object point must lie along the line joining the image point and the focal point of the imaging device lens, its distance along the line being unknown. If the object is now viewed from a different angle and the same point is visible in both views, then it must lie at the intersection of the lines determined from the two separate views; its position (i.e. the distance from the imaging devices) can then be calculated by triangulation.
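The intersection of the two imaging lines can be sketched in a few lines of Python. This is only an illustration in plan view (the x-z plane), assuming an idealised pinhole model; the function name `intersect_rays`, the focal length, the camera positions and the image coordinates are all hypothetical values chosen for the example, not taken from Figure 1.

```python
def intersect_rays(p1, d1, p2, d2):
    """Intersect the 2-D lines p1 + t*d1 and p2 + s*d2 (plan view, x-z plane).

    p1, p2 are the focal points of the two imaging devices; d1, d2 are the
    directions of the lines joining each image point to its focal point.
    """
    # 2-D cross product of the directions; zero means the lines are parallel.
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(cross) < 1e-12:
        raise ValueError("imaging lines are parallel; no unique intersection")
    dx, dz = p2[0] - p1[0], p2[1] - p1[1]
    # Solve p1 + t*d1 = p2 + s*d2 for t.
    t = (dx * d2[1] - dz * d2[0]) / cross
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])


# Assumed set-up: both devices look along +z, focal length 0.05 m,
# focal points 0.2 m apart on the x axis.
focal_length = 0.05
left_focal_point = (0.0, 0.0)
right_focal_point = (0.2, 0.0)

# Assumed image coordinates of the same object point in each view (metres).
u_left, u_right = 0.005, -0.005

# Each imaging line runs from the focal point through the image point.
left_direction = (u_left, focal_length)
right_direction = (u_right, focal_length)

object_point = intersect_rays(left_focal_point, left_direction,
                              right_focal_point, right_direction)
print(object_point)  # close to x = 0.1 m, z = 1.0 m for these values
```

With real images the two lines rarely intersect exactly because of measurement noise, so practical systems usually take the point of closest approach instead; that refinement is omitted here.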
For the computer to carry out this calculation automatically, however, it needs to process the two images in three main steps:
- Determine the point pairs in the two images, that is, determine which point in the right image corresponds to which point in the left image. This is the hardest, and therefore computationally the most expensive, part of stereo vision. It may, in fact, be very difficult to identify the same features in both images. The image of a small area on the object surface may be different in the two images because of the different perspective and surface reflectivity due to the viewing angles. Moreover, some of the points in one image may not be visible in the other.
- Translate the corresponding two points in the left and right images to yield the disparity measurement (i.e. the difference in the x-y position of the point in the left image compared with the x-y position of the corresponding point in the right image).
- Determine the distance of the object point from the imaging devices by triangulation. (This operation requires data on the relative positions and orientations of the stereo imaging device(s) which produced the left and right images.)
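The three steps above can be sketched as follows. This is a minimal illustration, not a production algorithm: it assumes two parallel, horizontally displaced cameras so that corresponding points lie on the same image row and the disparity reduces to a difference in x only. The matching criterion (sum of absolute differences over a small window), the function names, and the focal length and baseline values are all assumptions made for the example.

```python
def match_along_row(left_row, right_row, x_left, half_window, max_disparity):
    """Step 1: find the column of right_row whose neighbourhood best matches
    the neighbourhood of x_left in left_row (sum of absolute differences)."""
    if x_left - half_window < 0 or x_left + half_window >= len(left_row):
        raise ValueError("window around x_left falls outside the image row")
    best_x, best_cost = None, float("inf")
    for d in range(max_disparity + 1):
        x_right = x_left - d  # candidate positions lie to the left
        if x_right - half_window < 0:
            break
        cost = sum(abs(left_row[x_left + k] - right_row[x_right + k])
                   for k in range(-half_window, half_window + 1))
        if cost < best_cost:
            best_x, best_cost = x_right, cost
    return best_x


def disparity(x_left, x_right):
    """Step 2: disparity of a matched pair (x only, for parallel cameras)."""
    return x_left - x_right


def depth_from_disparity(d, focal_length_px, baseline):
    """Step 3: triangulation for parallel cameras, Z = f * B / d."""
    if d <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline / d


# One image row from each camera; the bright feature is shifted by 3 pixels.
left_row = [0, 0, 0, 0, 0, 9, 9, 0, 0, 0]
right_row = [0, 0, 9, 9, 0, 0, 0, 0, 0, 0]

x_left = 5
x_right = match_along_row(left_row, right_row, x_left,
                          half_window=1, max_disparity=5)
d = disparity(x_left, x_right)
# Assumed calibration: focal length 500 pixels, baseline 0.12 m.
z = depth_from_disparity(d, focal_length_px=500.0, baseline=0.12)
print(x_right, d, z)
```

The search loop in step 1 dominates the cost, which reflects the remark above that finding point pairs is the most expensive part. The sketch also does nothing about points hidden in one view; it simply returns the least-bad candidate.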
The essence of stereo vision, therefore, is the first step, namely the solution of the disparity (correspondence) problem. To gauge the difficulty of this step, note that any disparity measurement computed from local similarities (or features) may be ambiguous if two or more regions of the image have similar properties.
Consider, for example, the left and right images shown in Figure 2, each consisting of three dark squares as marked. Each square in one image is similar to any of the three in the other. If we now match L1 with R1, L2 with R2 and L3 with R3, the three squares will be computed to be at the same height above the background, as shown by the filled squares. If L1 were to be matched with R2, L2 with R3 and L3 with R1, then the computed heights would be those shown by the empty triangles. Another possible interpretation is shown by the unfilled circles, thereby giving an indication of how critical the correspondence problem can become in the absence of any known and unique object features.
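The ambiguity can be made concrete with a small sketch that computes the depths implied by every one-to-one matching of three identical features. The column positions, focal length and baseline below are hypothetical and are not read from Figure 2; the sketch again assumes parallel cameras, so that depth is `focal_px * baseline / disparity`.

```python
from itertools import permutations

# Hypothetical image columns (pixels) of the three squares in each image.
left_x = [40.0, 50.0, 60.0]    # L1, L2, L3
right_x = [10.0, 20.0, 30.0]   # R1, R2, R3


def depths_for_matching(left_x, right_x, order, focal_px, baseline):
    """Depths implied by matching left feature i with right feature order[i].

    Returns None for a pair whose disparity is not positive, since such a
    pair cannot be triangulated to a finite depth in front of the cameras.
    """
    depths = []
    for i, j in enumerate(order):
        d = left_x[i] - right_x[j]
        depths.append(focal_px * baseline / d if d > 0 else None)
    return depths


# Assumed calibration: focal length 500 pixels, baseline 0.1 m.
interpretations = {
    order: depths_for_matching(left_x, right_x, order, 500.0, 0.1)
    for order in permutations(range(3))
}

for order, depths in interpretations.items():
    pairs = ", ".join(f"L{i + 1}-R{j + 1}" for i, j in enumerate(order))
    print(pairs, depths)
```

With these values the matching L1-R1, L2-R2, L3-R3 places all three squares at the same depth, while the matching L1-R2, L2-R3, L3-R1 places them at two different depths. Nothing in the local image data distinguishes between the candidates, so extra constraints or unique features are needed to choose one.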
In spite of these difficulties and the relatively high expected price tag, robot stereo vision is a desirable goal. Stereo vision has the highest inherent 3-D image resolution (limited only by the type of camera and its optics) and the greatest flexibility (for instance, it is the only method that can provide colour images relatively easily), and as such it comes closest to the aforementioned definition of a general-purpose, flexible vision sensor. It does, however, require large investments and long project lead times.
US financial investment in stereo vision research, for instance, has already been considerable (approx. $3,000,000 to date), but the results so far have been limited mostly to laboratory prototypes. The reasons are thought to be many and varied, ranging from the aforementioned difficulty of solving the disparity problem in a sufficiently short time to the sheer complexity of a system that is essentially trying to emulate a major function of the human brain. Recently, however, there have been reports of successful industrial applications of stereo vision, such as the positioning of car bodies (Rooks, 1986).
There are three main methods of using 2-D vision sensors to obtain the multiple views required for stereo vision:
- Disparity method 1. Use of two stationary imaging devices.
This could be defined as the ‘classical’ stereo vision method because of its analogy to the human vision system. As shown in Figure 3, it consists of an illumination system and two stationary cameras which provide the required two 2-D images. This method is inherently more expensive than the other two because it uses two cameras, but it does not require any mechanical movement; compared with disparity method 2 it is therefore faster and can provide more accurate measurement of the cameras’ positions, as required for the disparity calculations.
- Disparity method 2. Use of one imaging device moved to different known positions.
This is essentially a lower-cost variation on disparity method 1 since, as shown in Figure 4, it differs only in its use of a single camera which, to provide images from a different angle, is mechanically moved to a different known position.
- Photometric method. Use of one stationary imaging device under different lighting conditions.
This method relies on maintaining a camera in the same position, thereby avoiding the pixel correspondence problem, and obtaining multiple images by changing the illumination conditions. Processing of these images can uniquely determine the orientation of the object surface, thus enabling its 3-D mapping (Woodham, 1978).
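As a rough illustration of how images under different lighting can fix the surface orientation at a pixel, the sketch below makes several assumptions that are not stated in the text: a matte (Lambertian) surface, three images, and three known, non-coplanar light directions. Under those assumptions each intensity is `albedo * (light . normal)`, which gives three linear equations in the three components of the scaled normal. The function names and all numerical values are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, b):
    """Solve the 3x3 linear system m * x = b by Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("light directions are coplanar; system is singular")
    solution = []
    for col in range(3):
        replaced = [list(row) for row in m]
        for r in range(3):
            replaced[r][col] = b[r]
        solution.append(det3(replaced) / d)
    return solution


def surface_normal(light_dirs, intensities):
    """Recover the unit surface normal and albedo at one pixel.

    Assumes a Lambertian surface: intensity_k = albedo * dot(light_k, normal).
    """
    g = solve3(light_dirs, intensities)          # g = albedo * normal
    albedo = math.sqrt(sum(c * c for c in g))
    if albedo == 0.0:
        raise ValueError("all intensities are zero; normal is undefined")
    return [c / albedo for c in g], albedo


# Three assumed unit light directions (camera looks along -z towards the object).
lights = [[0.0, 0.0, 1.0],
          [0.6, 0.0, 0.8],
          [0.0, 0.6, 0.8]]

# Intensities measured at the same pixel in the three images (made-up values
# consistent with a surface normal of (0.6, 0, 0.8) and an albedo of 1).
intensities = [0.8, 1.0, 0.64]

normal, albedo = surface_normal(lights, intensities)
print(normal, albedo)
```

Because the camera does not move, the same pixel is used in all three images and no correspondence search is needed. Turning the per-pixel orientations into a depth map requires a further integration step that is not shown.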