
Saturday, August 29, 2009

OBJECT TRACKING

ABSTRACT
This paper gives an overview of object tracking: the problem of estimating the trajectory of an object in the image plane as it moves around a scene. In this paper, we divide the tracking methods into three categories based on the use of object representations, namely, methods establishing point correspondence, methods using primitive geometric models, and methods using contour evolution. Note that all these classes require object detection at some point. Recognizing the importance of object detection for tracking systems, we include a short discussion on popular object detection methods. This paper includes discussion on the object representations, motion models, and the parameter estimation schemes employed by the tracking algorithms.

1. INTRODUCTION
Object tracking is an important task within the field of computer vision. The proliferation of high-powered computers, the availability of high-quality and inexpensive video cameras, and the increasing need for automated video analysis have generated a great deal of interest in object tracking algorithms. There are three key steps in video analysis:

1. Detection of interesting moving objects,

2. Tracking of such objects from frame to frame, and

3. Analysis of object tracks to recognize their behaviour.

In its simplest form, tracking can be defined as the problem of estimating the trajectory of an object in the image plane as it moves around a scene. In other words, a tracker assigns consistent labels to the tracked objects in different frames of a video. Additionally, depending on the tracking domain, a tracker can also provide object-centric information, such as orientation, area, or shape of an object.
One can simplify tracking by imposing constraints on the motion and/or appearance of objects. For example, almost all tracking algorithms assume that the object motion is smooth with no abrupt changes. One can further constrain the object motion to be of constant velocity or constant acceleration based on a priori information. Prior knowledge about the number and the size of objects, or the object appearance and shape, can also be used to simplify the problem.
Numerous approaches for object tracking have been proposed. These primarily differ from each other based on the way they approach the following questions: Which object representation is suitable for tracking? Which image features should be used? How should the motion, appearance, and shape of the object be modeled? The answers to these questions depend on the context/environment in which the tracking is performed and the end use for which the tracking information is being sought. A large number of tracking methods have been proposed which attempt to answer these questions for a variety of scenarios.
Figure 1 shows the schematic of a generic object tracking system. As can be seen, visual input is usually achieved through digitized images obtained from a camera connected to a digital computer. This camera can be either stationary or moving depending on the application. Beyond image acquisition, the computer performs the necessary tracking and any higher-level tasks using the tracking results.
Fig. 1. Schematic of a generic object tracking system. The camera obtains visual images, and the computer tracks the observed object(s).
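As a rough illustration, the following Python sketch wires these three steps into a minimal loop. It assumes the OpenCV (cv2) package for video capture; detect_objects, associate, and analyze_tracks are hypothetical placeholders standing in for the methods surveyed below, not a specific published algorithm.

    import cv2  # assumes the opencv-python package is installed

    def detect_objects(frame):
        # Hypothetical detector: return a list of (x, y) object centroids.
        return []

    def associate(tracks, detections):
        # Hypothetical data association: here, append everything to one track.
        tracks.setdefault(0, []).extend(detections)
        return tracks

    def analyze_tracks(tracks):
        # Hypothetical analysis: report the length of each track.
        return {tid: len(pts) for tid, pts in tracks.items()}

    def run_tracker(video_path):
        cap = cv2.VideoCapture(video_path)
        tracks = {}
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            detections = detect_objects(frame)      # step 1: detect moving objects
            tracks = associate(tracks, detections)  # step 2: frame-to-frame correspondence
        cap.release()
        return analyze_tracks(tracks)               # step 3: analyze the tracks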
2. OBJECT REPRESENTATION
In a tracking scenario, an object can be defined as anything that is of interest for further analysis. For instance, boats on the sea, fish inside an aquarium, vehicles on a road, planes in the air, people walking on a road, or bubbles in the water are a set of objects that may be important to track in a specific domain. Objects can be represented by their shapes and appearances.
In this section, we will first describe the object shape representations commonly employed for tracking and then address the joint shape and appearance representations.
1. Points:
The object is represented by a point, that is, the centroid (Figure 2(a)) or by a set of points (Figure 2(b)). In general, the point representation is suitable for tracking objects that occupy small regions in an image.
2. Primitive geometric shapes:
Object shape is represented by a rectangle, ellipse (Figure 2(c), (d)), etc. Object motion for such representations is usually modeled by a translation, affine, or projective (homography) transformation. Though primitive geometric shapes are more suitable for representing simple rigid objects, they are also used for tracking nonrigid objects.
3. Object silhouette and contour:
Contour representation defines the boundary of an object (Figure 2(g), (h)). The region inside the contour is called the silhouette of the object (see Figure 2(i)). Silhouette and contour representations are suitable for tracking complex nonrigid shapes.
4. Articulated shape models:
Articulated objects are composed of body parts that are held together with joints. For example, the human body is an articulated object with torso, legs, hands, head, and feet connected by joints. In order to represent an articulated object, one can model the constituent parts using cylinders or ellipses as shown in Figure 2(e).
5. Skeletal models:
Object skeleton can be extracted by applying the medial axis transform to the object silhouette. This model is commonly used as a shape representation for recognizing objects. Skeleton representation can be used to model both articulated and rigid objects (see Figure 2(f)).
In general, there is a strong relationship between the object representations and the tracking algorithms. Object representations are usually chosen according to the application domain. For tracking objects which appear very small in an image, point representation is usually appropriate. For objects whose shapes can be approximated by rectangles or ellipses, primitive geometric shape representations are more appropriate.
For tracking objects with complex shapes, for example, humans, a contour- or silhouette-based representation is appropriate.
Fig. 2. Object representations. (a) Centroid, (b) multiple points, (c) rectangular patch, (d) elliptical patch, (e) part-based multiple patches, (f) object skeleton, (g) complete object contour, (h) control points on object contour, (i) object silhouette.
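To make the taxonomy concrete, the short Python sketch below encodes a few of these shape representations as plain data types. The class names are illustrative only and do not come from any particular library.

    from dataclasses import dataclass
    from typing import List, Tuple

    Point = Tuple[float, float]

    @dataclass
    class CentroidRep:                 # Fig. 2(a): a single point
        centroid: Point

    @dataclass
    class BoxRep:                      # Fig. 2(c): rectangular patch
        x: float
        y: float
        width: float
        height: float

    @dataclass
    class ContourRep:                  # Fig. 2(g, h): ordered boundary points
        boundary: List[Point]

    @dataclass
    class ArticulatedRep:              # Fig. 2(e): parts held together by joints
        parts: List[BoxRep]
        joints: List[Tuple[int, int]]  # index pairs of connected parts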
There are a number of ways to represent the appearance features of objects. Note that shape representations can also be combined with the appearance representations for tracking. Some common appearance representations in the context of object tracking are:
1. Probability densities of object appearance:
The probability density estimates of the object appearance can be either parametric, such as a Gaussian, or nonparametric, such as Parzen windows and histograms. The probability densities of object appearance features (color, texture) can be computed from the image regions specified by the shape models (the interior region of an ellipse or a contour).
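For instance, a nonparametric color density can be approximated by a normalized histogram over the pixels inside the object's shape model. The sketch below, assuming OpenCV and NumPy, builds a hue-saturation histogram from an elliptical region; all parameter values are illustrative.

    import cv2
    import numpy as np

    def color_histogram(frame_bgr, center, axes, bins=16):
        # center=(cx, cy) and axes=(ax, ay) are integer pixel tuples.
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        mask = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
        cv2.ellipse(mask, center, axes, 0, 0, 360, 255, -1)  # filled ellipse mask
        hist = cv2.calcHist([hsv], [0, 1], mask, [bins, bins], [0, 180, 0, 256])
        return hist / max(hist.sum(), 1e-12)  # normalize to a density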
2. Templates:
Templates are formed using simple geometric shapes or silhouettes. An advantage of a template is that it carries both spatial and appearance information. Templates, however, only encode the object appearance generated from a single view. Thus, they are only suitable for tracking objects whose poses do not vary considerably during the course of tracking.
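A minimal template tracker, assuming OpenCV, simply searches each new frame for the best match to the stored patch via normalized cross-correlation; this is a sketch of the idea, not a robust tracker.

    import cv2

    def track_template(frame_gray, template_gray):
        # Score every placement of the template over the frame.
        result = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)
        return max_loc, max_val  # top-left corner of the best match and its score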
3. Active appearance models:
Active appearance models are generated by simultaneously modeling the object shape and appearance. In general, the object shape is defined by a set of landmarks. Similar to the contour-based representation, the landmarks can reside on the object boundary or, alternatively, they can reside inside the object region. For each landmark, an appearance vector is stored which is in the form of color, texture, or gradient magnitude. Active appearance models require a training phase where both the shape and its associated appearance are learned from a set of samples using, for instance, principal component analysis.
4. Multiview appearance models:
These models encode different views of an object. One approach to represent the different object views is to generate a subspace from the given views. Subspace approaches, for example, Principal Component Analysis (PCA) and Independent Component Analysis (ICA), have been used for both shape and appearance representation.
Another approach to learn the different views of an object is by training a set of classifiers, for example, the support vector machines or Bayesian networks. One limitation of multiview appearance models is that the appearances in all views are required ahead of time.
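As a sketch of the subspace idea, the NumPy code below learns a PCA basis from vectorized object views and scores a new view by its reconstruction error; the function and variable names are illustrative.

    import numpy as np

    def learn_pca_subspace(views, k):
        # views: (n_samples, n_pixels) matrix of vectorized object views.
        mean = views.mean(axis=0)
        centered = views - mean
        # Rows of vt are the principal directions; keep the top k.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return mean, vt[:k]

    def reconstruction_error(view, mean, basis):
        # Project onto the subspace and measure what the basis cannot explain.
        coeffs = basis @ (view - mean)
        recon = mean + basis.T @ coeffs
        return np.linalg.norm(view - recon)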
3. FEATURE SELECTION FOR TRACKING
Selecting the right features plays a critical role in tracking. In general, the most desirable property of a visual feature is its uniqueness so that the objects can be easily distinguished in the feature space. Feature selection is closely related to the object representation. For example, color is used as a feature for histogram-based appearance representations, while for contour-based representation, object edges are usually used as features. In general, many tracking algorithms use a combination of these features. The details of common visual features are as follows.
1. Color: The apparent color of an object is influenced primarily by two physical factors: 1) the spectral power distribution of the illuminant, and 2) the surface reflectance properties of the object. In image processing, the RGB (red, green, blue) color space is usually used to represent color.
2. Edges: Object boundaries usually generate strong changes in image intensities. Edge detection is used to identify these changes. An important property of edges is that they are less sensitive to illumination changes compared to color features. Algorithms that track the boundary of the objects usually use edges as the representative feature.
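A common way to extract such an edge map is the Canny detector; a minimal OpenCV sketch (the threshold values are illustrative):

    import cv2

    def edge_map(frame_gray, low=50, high=150):
        # Binary edge map from intensity gradients, with hysteresis thresholding.
        return cv2.Canny(frame_gray, low, high)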
3. Optical Flow: Optical flow is a dense field of displacement vectors which defines the translation of each pixel in a region. It is computed using the brightness constancy constraint, which assumes that corresponding pixels in consecutive frames have the same brightness. Optical flow is commonly used as a feature in motion-based segmentation and tracking applications.
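The sketch below, assuming OpenCV, computes such a dense displacement field between two consecutive grayscale frames with Farneback's algorithm, one common dense-flow method (the text does not prescribe a specific algorithm); the parameter values are typical defaults.

    import cv2

    def dense_flow(prev_gray, next_gray):
        # Returns an (H, W, 2) array of per-pixel (dx, dy) displacements.
        return cv2.calcOpticalFlowFarneback(
            prev_gray, next_gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)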
4. Texture: Texture is a measure of the intensity variation of a surface which quantifies properties such as smoothness and regularity. Compared to color, texture requires a processing step to generate the descriptors. Similar to edge features, the texture features are less sensitive to illumination changes compared to color.
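One such processing step, sketched below under the assumption that scikit-image is available, is the local binary pattern descriptor; LBP is one choice among many texture descriptors, not one singled out by the text.

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_histogram(patch_gray, points=8, radius=1):
        # Histogram of uniform local binary patterns over a grayscale patch.
        lbp = local_binary_pattern(patch_gray, points, radius, method="uniform")
        n_bins = points + 2  # number of distinct uniform patterns
        hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins))
        return hist / max(hist.sum(), 1)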
4. OBJECT DETECTION
Every tracking method requires an object detection mechanism either in every frame or when the object first appears in the video. A common approach for object detection is to use information in a single frame. However, some object detection methods make use of the temporal information computed from a sequence of frames to reduce the number of false detections. This temporal information is usually in the form of frame differencing, which highlights changing regions in consecutive frames. Given the object regions in the image, it is then the tracker’s task to perform object correspondence from one frame to the next to generate the tracks.
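Frame differencing itself is straightforward to sketch: threshold the absolute difference of consecutive frames (assuming OpenCV; the threshold value is illustrative).

    import cv2

    def frame_difference_mask(prev_gray, curr_gray, thresh=25):
        # Binary mask of pixels that changed between consecutive frames.
        diff = cv2.absdiff(prev_gray, curr_gray)
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        return mask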
4.1. Point Detectors
Point detectors are used to find interest points in images which have an expressive texture in their respective localities. Interest points have long been used in the context of motion, stereo, and tracking problems. A desirable quality of an interest point is its invariance to changes in illumination and camera viewpoint. Commonly used interest point detectors include Moravec’s interest operator, the Harris interest point detector, the KLT detector, and the SIFT detector. To find interest points, Moravec’s operator computes the variation of the image intensities in a 4 × 4 patch in the horizontal, vertical, diagonal, and antidiagonal directions and selects the minimum of the four variations as the representative value for the window. A point is declared interesting if the intensity variation is a local maximum in a 12 × 12 patch.
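A direct, unoptimized NumPy transcription of Moravec's operator as just described (the patch and neighborhood sizes follow the text; everything else is a plain implementation choice):

    import numpy as np

    def moravec(image, patch=4, nms=12):
        # Shifts for the horizontal, vertical, diagonal, and antidiagonal directions.
        shifts = [(0, 1), (1, 0), (1, 1), (1, -1)]
        img = image.astype(np.float64)
        h, w = img.shape
        score = np.zeros_like(img)
        for y in range(patch, h - patch - 1):
            for x in range(patch + 1, w - patch - 1):
                win = img[y:y + patch, x:x + patch]
                variations = []
                for dy, dx in shifts:
                    shifted = img[y + dy:y + dy + patch, x + dx:x + dx + patch]
                    variations.append(np.sum((win - shifted) ** 2))
                score[y, x] = min(variations)  # minimum of the four variations
        # A point is interesting if its score is a local maximum in an nms x nms patch.
        half = nms // 2
        points = []
        for y in range(half, h - half):
            for x in range(half, w - half):
                local = score[y - half:y + half, x - half:x + half]
                if score[y, x] > 0 and score[y, x] == local.max():
                    points.append((x, y))
        return points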
Fig. 3. Interest points detected by applying the Harris detector.

The Harris detector computes the first-order image derivatives, (I_x, I_y), in the x and y directions to highlight the directional intensity variations. A second moment matrix, which encodes this variation, is then evaluated for each pixel in a small neighborhood:

M = \begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix},

where the sums are taken over the neighborhood. An interest point is declared where both eigenvalues of M are large, which is commonly tested with the measure R = \det(M) - k \cdot \operatorname{trace}(M)^2.
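In practice this is rarely implemented by hand; assuming OpenCV, the per-pixel Harris response can be computed as follows (k = 0.04 is a conventional choice):

    import cv2
    import numpy as np

    def harris_response(frame_gray, block_size=3, ksize=3, k=0.04):
        # Response R = det(M) - k * trace(M)^2 evaluated at every pixel.
        return cv2.cornerHarris(np.float32(frame_gray), block_size, ksize, k)

    # Usage sketch: keep pixels whose response exceeds 1% of the maximum.
    # response = harris_response(gray)
    # corners = response > 0.01 * response.max()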
4.2. Background Subtraction
Object detection can be achieved by building a representation of the scene called the background model and then finding deviations from the model for each incoming frame. Any significant change in an image region from the background model signifies a moving object. The pixels constituting the regions undergoing change are marked for further processing. Usually, a connected component algorithm is applied to obtain connected regions corresponding to the objects. This process is referred to as background subtraction.
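The sketch below, assuming OpenCV, pairs a learned background model with connected-component labeling, following the steps described above; the mixture-of-Gaussians model (MOG2) is one common choice of background model, and all thresholds are illustrative.

    import cv2

    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    def moving_regions(frame, min_area=100):
        # Deviations from the background model, as a binary foreground mask.
        mask = subtractor.apply(frame)
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadow labels
        # Connected component analysis groups changed pixels into object regions.
        n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
        boxes = []
        for i in range(1, n):  # label 0 is the background
            x, y, w, h, area = stats[i]
            if area >= min_area:
                boxes.append((x, y, w, h))
        return boxes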

Fig. 4. Background subtraction. (a) Image from a sequence in which a person is walking across the scene. (b) The stationary background. (c) Changes between successive frames. (d) Background subtraction result.

4.3. Segmentation
The aim of image segmentation algorithms is to partition the image into perceptually similar regions. Every segmentation algorithm addresses two problems, the criteria for a good partition and the method for achieving efficient partitioning. Several segmentation techniques are available. One of them is discussed below.

Image Segmentation Using Graph Cuts:
Image segmentation can also be formulated as a graph partitioning problem, where the vertices (pixels), V = {u, v, . . .}, of a graph (image), G, are partitioned into N disjoint subgraphs (regions), Ai, by pruning the weighted edges of the graph. The total weight of the pruned edges between two subgraphs is called a cut. The weight is typically computed from the color, brightness, or texture similarity between the nodes, and the goal is to find the partition that minimizes the cut. One limitation of the minimum cut is its bias toward oversegmenting the image. This effect is due to the increase in the cost of a cut with the number of edges going across the two partitioned segments, which makes it cheap to cut off small, isolated sets of nodes.

The normalized cut can be used to overcome the oversegmentation problem. Under this criterion, the cut depends not only on the sum of the edge weights in the cut, but also on the ratio of the total connection weights of the nodes in each partition to all nodes of the graph.
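A compact sketch of a two-way normalized cut, assuming NumPy and SciPy: it solves the standard generalized eigenproblem and splits on the eigenvector with the second-smallest eigenvalue. This is a didactic version; practical segmenters build the affinity matrix from pixel color and proximity and partition recursively.

    import numpy as np
    from scipy.linalg import eigh

    def normalized_cut_bipartition(W):
        # W: symmetric (n, n) matrix of nonnegative affinities between nodes
        # (e.g., pairwise color similarity); every node must have positive degree.
        d = W.sum(axis=1)
        D = np.diag(d)
        # Solve the generalized eigenproblem (D - W) y = lambda * D * y.
        vals, vecs = eigh(D - W, D)
        # The second-smallest eigenvector approximately minimizes the normalized cut.
        fiedler = vecs[:, 1]
        return fiedler > 0  # a simple zero-threshold split into two partitions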

5. OBJECT TRACKING
The aim of an object tracker is to generate the trajectory of an object over time by locating its position in every frame of the video. The tracker may also provide the complete region in the image that is occupied by the object at every time instant. The tasks of detecting the object and establishing correspondence between the object instances across frames can either be performed separately or jointly. In the former case, possible object regions in every frame are obtained by means of an object detection algorithm, and then the tracker corresponds objects across frames. In the latter case, the object region and correspondence are jointly estimated by iteratively updating object location and region information obtained from previous frames. In either tracking approach, the objects are represented using the shape and/or appearance models described in Section 2.
The model selected to represent object shape limits the type of motion or deformation it can undergo. For example, if an object is represented as a point, then only a translational model can be used. In the case where a geometric shape representation like an ellipse is used for the object, parametric motion models like affine or projective transformations are appropriate. These representations can approximate the motion of rigid objects in the scene. For a nonrigid object, the silhouette or contour is the most descriptive representation, and both parametric and nonparametric models can be used to specify its motion.
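For instance, a point representation under the smooth-motion assumption is often tracked with a constant-velocity Kalman filter; the NumPy sketch below is a generic textbook version (the noise parameters are illustrative), not a method prescribed by the text.

    import numpy as np

    class ConstantVelocityKF:
        # State [x, y, vx, vy]; only the position (x, y) is measured.
        def __init__(self, q=1e-2, r=1.0):
            self.F = np.array([[1., 0., 1., 0.],   # state transition, dt = 1 frame
                               [0., 1., 0., 1.],
                               [0., 0., 1., 0.],
                               [0., 0., 0., 1.]])
            self.H = np.array([[1., 0., 0., 0.],   # measurement model
                               [0., 1., 0., 0.]])
            self.Q = q * np.eye(4)                 # process noise covariance
            self.R = r * np.eye(2)                 # measurement noise covariance
            self.x = np.zeros(4)
            self.P = np.eye(4)

        def step(self, z):
            # Predict under the constant-velocity motion model.
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            # Update with the measured position z = (x, y).
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
            self.P = (np.eye(4) - K @ self.H) @ self.P
            return self.x[:2]  # filtered position estimate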