MPEG-2 Compressed Domain Algorithms for Video Analysis
low-textured areas in the video. In this paper, a camera movement is a generic name encompassing a pan or a tracking shot, where the camera changes its position. The object tracking algorithm is based on the camera motion detection algorithm. Objects like faces can be tracked over several minutes in the ideal case. Finally, we present a cut detection algorithm using macroblock types and motion vectors.

2. MPEG-2 VIDEO COMPRESSION

As mentioned above, MPEG-2 compression is the standard for digital video compression. MPEG-2 is downwardly compatible with MPEG-1, which means that the algorithms presented in this paper are also applicable to MPEG-1. The MPEG-2 algorithm uses two basic methods to compress digital video: frequency-domain compression for spatial redundancy (as with JPEG picture compression) and block-based motion compensation for temporal redundancy. An MPEG-2 video consists of three frame types: I-, P-, and B-frames, which are grouped into GOPs (groups of pictures). The I-frame is encoded similarly to a JPEG image. It does not make use of information from any other frame. P-frames are predicted from the previous I- or P-frame. B-frames use parts of both the previous and the following I- or P-frame. All three frame types use the DCT to transform the picture from the spatial domain to the frequency domain, which is better suited to exploit redundancy. In the I-frames, the DCT data is derived directly from the image, whereas in the P- and B-frames, the DCT data is derived from the residual error after prediction. Each frame is divided into 16 × 16 pixel blocks called macroblocks. In the motion estimation process, the encoder checks for each macroblock of a P-frame whether it can be predicted from a pixel block in the previous reference frame (I- or P-frame) with a possible offset to compensate for motion (the so-called motion vector) and a residual error which is DCT transformed. If the coefficients become too big, the encoder decides to intracode the macroblock. The encoding of B-frames works similarly. The difference is that B-frames can be predicted from the preceding or the following reference frame or from both frames.

The MPEG-2 standard only specifies the bit format and the decoding of the MPEG-2 stream; the algorithms and methods for the encoding, such as those needed for motion estimation, are not defined in the standard. For this reason, how well the motion vectors reflect the motion of the camera or an object depends on the encoder.
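The algorithms in the remainder of this paper consume exactly this per-macroblock information. As a point of reference, the following minimal Python sketch shows the kind of record a compressed-domain parser might emit; the type and field names are our own illustration, not part of the MPEG-2 standard or of any particular parsing library.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class FrameType(Enum):
    I = "I"
    P = "P"
    B = "B"

class MBType(Enum):
    INTRA = "intra"          # no prediction, DCT of the image data itself
    FORWARD = "forward"      # predicted from the previous reference frame
    BACKWARD = "backward"    # predicted from the following reference frame
    BIDIRECTIONAL = "bi"     # predicted from both reference frames

@dataclass
class Macroblock:
    """One 16 x 16 macroblock as a hypothetical parser might deliver it."""
    x: int                                     # macroblock column
    y: int                                     # macroblock row
    mb_type: MBType
    # Motion vectors in half-pixel units; None where not present.
    mv_forward: Optional[Tuple[int, int]] = None
    mv_backward: Optional[Tuple[int, int]] = None

@dataclass
class ParsedFrame:
    frame_type: FrameType
    macroblocks: List[Macroblock] = field(default_factory=list)
```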
2.1. Related Work

Detecting camera motion in the compressed domain was investigated by Wang and Huang [2]. They use a recursive outlier-rejecting least-square algorithm. Smolic et al. [3] use the M-estimator method to underweight outlier motion vectors. Neither of these algorithms takes into account motion vectors of low-textured macroblocks that do not reflect motion. Kuhn [4] uses the DCT coefficients to determine a list of macroblocks with high-texture information and performs an IDCT for them in order to calculate their motion. All these algorithms are not well suited for video sequences with large moving objects. Kobla et al. [5] distinguish eight motion directions and assume a camera motion if the direction receiving the highest number of vectors receives more than twice as many vectors as the second highest direction.

The object tracking algorithms that are commonly used in the uncompressed domain are the mean shift algorithm [6] and the Kalman filter [7]. Also, block-matching algorithms [8] are used where the object is divided into small blocks of constant size that are searched for in the next frame. In the compressed domain, Kobla et al. [9] present a method for determining the flow of a pixel block using motion vectors and frame types. However, this method is too imprecise to use for object tracking. In [10], Khan et al. present experiments with object tracking using compressed-domain MPEG data with P-frames only.

Cut detection algorithms can be divided into compressed and uncompressed algorithms. Uncompressed-domain algorithms use pixelwise difference [11], histograms [12], edge tracking [13], and so forth. Such algorithms are, however, computationally expensive. Some compressed-domain methods use the increasing bit rate as an indicator [14] or approximated DC coefficients [15]. Kobla et al. [5] and Calic and Izquierdo [16] use an approach similar to ours by checking the ratios of forward- and backward-predicted macroblocks. However, experiments show many erroneous detections in sections with little or no motion.

3. DETECTING CAMERA MOTION

The encoder has already done the computationally most complex part of detecting camera motion by calculating a motion vector field. However, the vectors do not always correspond to true motion since the encoder optimizes the MPEG encoding for picture quality and not to reflect the exact motion of the objects covered by the macroblocks. For this reason, numerous outliers are to be expected. Moreover, moving objects will result in motion vectors that do not reflect the camera motion.

3.1. Algorithm

We only use P-frames for the analysis of camera movements since the distances between B-frames and reference frames vary. This means that the camera motion can only be determined for every second or third frame. However, this restriction represents only a minor drawback, since every camera movement that does not last longer than several frames can rather be regarded as camera wobble.

A two-dimensional histogram of motion vectors is used and the peak is interpreted as reflecting camera motion. However, the motion vectors spread widely, especially with a fast movement of either camera or objects. For this reason, the histogram of the motion vectors will usually not result in a steep peak, but rather in a peak region. We solve this problem by using a pyramid filter that includes 2 pixels in each direction to detect the maximum. This filter is represented as
a weight matrix W with elements w_{i,j}:

\[
W = \frac{1}{3}\cdot
\begin{pmatrix}
1 & 1 & 1 & 1 & 1\\
1 & 2 & 2 & 2 & 1\\
1 & 2 & 3 & 2 & 1\\
1 & 2 & 2 & 2 & 1\\
1 & 1 & 1 & 1 & 1
\end{pmatrix}. \tag{1}
\]

h_{x,y} are the accumulated weights of the motion vectors (x, y). For each vector (x, y), the weighted frequency of the peak region called H_{x,y} is calculated as follows:

\[
H_{x,y} = \sum_{i=-2}^{2}\sum_{j=-2}^{2} w_{i+3,j+3}\cdot h_{x+i,y+j}. \tag{2}
\]

The peak region around the vectors (x, y) is then given by the maximum max(H_{x,y}). The H_{x,y} values can be calculated quickly using the integral image algorithm [17].
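To make the procedure concrete, here is a minimal numpy sketch of this peak search. It assumes the motion vectors have already been extracted and reduced to integer histogram coordinates; the per-vector weights anticipate the texture weights s(S_{x,y}) introduced in Section 3.1.1 below (pass 1.0 everywhere to disable texture weighting), and max_mv is an assumed bound on the vector components.

```python
import numpy as np

# 5x5 pyramid filter from (1); the 1/3 scaling does not change the argmax.
W = np.array([[1, 1, 1, 1, 1],
              [1, 2, 2, 2, 1],
              [1, 2, 3, 2, 1],
              [1, 2, 2, 2, 1],
              [1, 1, 1, 1, 1]]) / 3.0

def histogram_peak(vectors, weights, max_mv=64):
    """Return the motion vector at the peak region of the weighted histogram.

    vectors -- iterable of integer (x, y) motion vectors
    weights -- one weight per vector (1.0 everywhere disables texture weighting)
    max_mv  -- assumed bound on the vector components
    """
    size = 2 * max_mv + 1
    h = np.zeros((size, size))
    for (x, y), w in zip(vectors, weights):
        h[x + max_mv, y + max_mv] += w

    # H_{x,y} from (2): correlate h with W; zero padding stands in for empty
    # bins outside the histogram. The paper computes this faster with
    # integral images [17]; a direct correlation is used here for clarity.
    k = W.shape[0] // 2
    hp = np.pad(h, k)
    H = np.zeros_like(h)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            H += W[i, j] * hp[i:i + size, j:j + size]

    x, y = np.unravel_index(np.argmax(H), H.shape)
    return x - max_mv, y - max_mv
```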
Motion vectors are stored with half-pixel accuracy. Thus, the vector of the peak region has to be divided by two in order to obtain pixel accuracy. We chose a computationally less complex method and divided all motion vectors by two and built the histogram with these values. When interlaced motion estimation was used, the two motion vectors for each macroblock were averaged.

3.1.1. Low-textured areas

Large areas in the video that are low textured, that is, that do not have any edges, result in long chaotic motion vectors that do not reflect the camera motion. Since the macroblocks in such areas are similar to each other, the encoder is free to use an arbitrary pixel block as reference, which minimizes the error, but does not reflect the true motion. Figure 1 shows an example.

A motion vector can be regarded as usable for the detection of camera motion if its macroblock is textured, that is, if its brightness varies. This is reflected by the 63 AC coefficients of the four DCT blocks of which the macroblock consists. The DC coefficient only gives information on the average brightness of the DCT block, but does not indicate the texture.

Increased AC coefficients in the upper left area reflect variations in brightness of the 8 × 8 pixel block. Most of the other coefficients are zero or near zero. Experiments showed that it is sufficient to evaluate the nine AC coefficients in the top left triangle of the DCT block. In order to be independent of the scanning pattern used, we select the nine AC coefficients after scanning the coefficients in the zigzag or alternate scan order. The absolute values of the nine AC coefficients of all four DCT blocks comprising the macroblock are added. The sum of these 36 coefficients for the macroblock at the macroblock position (x, y) reflects its texture and is denoted as S_{x,y}. This value is used for the weight function s(S_{x,y}), whose value range is between 0 and 1. Texture analysis is only performed on the luminance component since the encoder usually also uses only the luminance for motion estimation. When the motion vector histogram is built, the count for each element is not increased by one, but rather by the corresponding weight s(S_{x,y}) of the macroblock. In this way, the low-textured macroblocks are underweighted or excluded from the analysis.

Experiments have shown that a simple approach that classifies all macroblocks as either sufficiently or nonsufficiently textured yields adequate performance. So we get

\[
s(S_{x,y}) =
\begin{cases}
1 & \text{for } S_{x,y} \ge S_{\text{Threshold}},\\
0 & \text{for } S_{x,y} < S_{\text{Threshold}}.
\end{cases} \tag{3}
\]

These weights ensure that only the motion vectors of macroblocks that are sufficiently textured are used in the analysis.
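A sketch of this binary texture weight, assuming the four dequantized 8 × 8 luminance DCT blocks of a macroblock are available as numpy matrices; the threshold value is a placeholder that would have to be tuned, since the paper does not state a concrete number:

```python
S_THRESHOLD = 100.0  # placeholder value; the paper does not report one

def texture_measure(dct_blocks):
    """S_{x,y}: sum of the absolute values of the nine AC coefficients in
    the top-left triangle (u + v <= 3, DC excluded) of each of the four
    8 x 8 luminance DCT blocks of a macroblock."""
    s = 0.0
    for block in dct_blocks:
        for u in range(4):
            for v in range(4 - u):
                if u == 0 and v == 0:
                    continue          # skip the DC coefficient
                s += abs(block[u, v])
    return s

def texture_weight(dct_blocks):
    """Binary weight s(S_{x,y}) from (3)."""
    return 1.0 if texture_measure(dct_blocks) >= S_THRESHOLD else 0.0
```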
3.1.2. Approximation of the DCT coefficients

The DCT coefficients of I-frames and intracoded macroblocks of P-frames can be used directly from the MPEG-2 compressed-domain data after extraction. For non-intracoded macroblocks, only the motion vector and the difference with the referenced pixel block are stored as DCT coefficients. For this reason, the DCT coefficients for these macroblocks have to be calculated. There are computationally expensive algorithms that can do a precise calculation [18–20]. Since we only need information concerning whether the macroblock is sufficiently textured, an approximation is adequate. A DCT block of a P-frame can be built up by using up to four adjacent DCT blocks in a reference frame. If all these four DCT blocks are low textured and the error DCT coefficients are small, the composed DCT block will be low textured, too, assuming that there is no edge between the individual blocks in the spatial domain.

To approximate the DCT matrix of the current 8 × 8 pixel block B_Cur, we calculate the weighted sum of the DCT matrices of the pixel blocks B_1–B_4 plus the error DCT as shown below:

\[
\mathrm{DCT}\left(B_{\text{Cur}}\right) = \sum_{i=1}^{4} w_i \cdot \mathrm{DCT}\left(B_i\right) + \mathrm{DCT}_{\text{Err}}. \tag{4}
\]

The weights w_i are given by the fractions of the areas of the DCT block that overlap with the reference DCT blocks. Since we work with half-pixel accuracy, the weights are a multiple of 1/256. Figure 2 sketches the procedure.

Figure 2: DCT approximation: B_Cur uses B_Ref as reference, which is located inside B_1–B_4. The motion vector is (x, y).
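The area weights follow directly from the half-pixel offset of the reference block inside the 2 × 2 grid of aligned DCT blocks. A minimal sketch under that assumption (field/frame DCT coding and clamping at picture borders are ignored here):

```python
import numpy as np

def approximate_dct(ref_blocks, offset_halfpel, err_dct):
    """Weighted-sum approximation of the current block's DCT, as in (4).

    ref_blocks     -- 2 x 2 grid [[B1, B2], [B3, B4]] of 8 x 8 DCT matrices
                      overlapped by the motion-compensated reference block
    offset_halfpel -- (ox, oy) offset of the reference block inside that
                      grid, in half-pixels (0..15 in each direction)
    err_dct        -- 8 x 8 DCT of the residual error
    """
    ox, oy = offset_halfpel
    fx, fy = 16 - ox, 16 - oy   # remaining extent, also in half-pixels
    # Area fractions of the overlap; all weights are multiples of 1/256.
    w = np.array([[fx * fy, ox * fy],
                  [fx * oy, ox * oy]]) / 256.0
    out = np.asarray(err_dct, dtype=float).copy()
    for i in range(2):
        for j in range(2):
            out += w[i, j] * np.asarray(ref_blocks[i][j], dtype=float)
    return out
```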
The approximated DCT coefficients will be used for the calculation of the DCT coefficients of the subsequent P-frame. Errors will therefore propagate.

When interlaced motion compensation with two motion vectors is used, a special treatment is required. In the frame picture mode, the pixel rows of both 16 × 8 pixel blocks are interleaved. This is emulated in the DCT domain by averaging each coefficient of the upper and lower DCT blocks. In the field picture mode, the upper and lower 16 × 8 pixel blocks have their own motion vectors. In the DCT domain,
the upper two DCT blocks can be approximated using the upper motion vector and the lower two DCT blocks can be approximated using the lower motion vector.

3.2. Experiments

Comparisons of the approximated DCT with the forward DCT of the decoded data show larger differences, especially if textured and low-textured areas are adjacent. However, an exact calculation is not important, but only the classification as sufficiently and nonsufficiently textured. Experiments with newscast video material resulted in 94% correct classification of the macroblocks compared with the classification using the forward DCT, which is sufficient.

We used test sequences mainly from newscasts that were captured directly from the satellite in MPEG-2 format using a Linux system with a DVB-S card. The camera motion was then determined by hand and compared with the computed results of our algorithm, the M-estimator approach, and a simple average algorithm. The histogram algorithm with texture analysis was usually able to detect the camera motion precisely, whereas the M-estimator and the average algorithms' vectors were too small, especially with fast camera movements. The algorithm proved to be highly robust against objects moving in the foreground. An object did not have an influence on the peak region unless its size increased to nearly 50% of the frame size within the artificial test sequences. In this case, the object motion was interpreted as camera motion. Many smaller moving objects could wrongly lead to a detection of the 0-vector as camera motion.

3.3. Results

The approximated DCT using a simple weighted-sum approach is adequate to determine if the macroblock is sufficiently textured so that its motion vector will probably reflect the camera motion. The peak of a two-dimensional histogram using a 5 × 5 pyramid filter is suited to determine the camera motion.

4. OBJECT TRACKING USING MOTION VECTORS

Encouraged by the results of the camera motion detection, we performed tests involving the use of motion vectors to track objects. An object motion can be regarded as “camera” motion that is restricted to the object. By using the motion vector information, a speed increase could be achieved compared to pixel-domain tracking algorithms.

4.1. Differences to camera motion detection

Compared to the detection of camera motion for the whole frame, there are additional challenges for the object tracking that is restricted to a subimage.

4.1.1. Problems using MPEG-2 compressed-domain data

One problem when using the MPEG-2 compressed-domain data is that the motion vectors are not available continuously
for all frames. The number of frames between the current frame and the reference frame varies. Moreover, there are no motion vectors for I-frames.

Another problem is that the macroblock grid does not match the object boundaries. An edge-exact tracking is therefore not possible. Instead, a rectangle of constant size containing the object is tracked. The macroblocks and the referenced pixel blocks will partly contain background.

Object tracking using motion vectors has a few requirements concerning the object to be tracked:

(i) the object must be textured but not periodically textured;
(ii) the object must be of sufficient size;
(iii) the object rectangle must contain no or only small portions of background;
(iv) the object must not change its appearance abruptly, and its size should remain constant.

If the object is not textured, the algorithm used to detect low-textured areas, as previously described, can be used. We primarily performed experiments with faces. If faces are sufficiently large, they meet the requirements on the object listed above and it is possible to track them using motion vectors.

4.1.2. The start object rectangle

The first step to find the object rectangle that should be tracked is best performed in the uncompressed domain. For this purpose, the video must be decoded at least for the first frame after a cut. The object rectangle to be tracked will either be determined by an automated system or by a human. After the initial determination of the rectangle, the object tracking algorithm only uses compressed-domain data. The algorithm is passed the coordinates of the top left corner and the size of the object rectangle. The size of the object will remain constant throughout the whole tracking process. Moreover, the edges of the object rectangle will always be located parallel to the image edges.

4.2. Algorithm

The tracking algorithm applies the histogram in a manner similar to the camera motion detection, plus some additional steps to increase the accuracy.

4.2.1. Histogram to determine the object's motion vector

Since the results of the object rectangle's calculations are used again with subsequent frames, a high accuracy is required. As seen with the camera motion detection, the histogram algorithm with texture analysis can usually determine the motion to the exact number of pixels. Since we assume that the object is sufficiently textured, texture analysis is not required. We rely on a two-dimensional histogram which includes only the motion vectors of macroblocks that are at least partly located inside the object rectangle. A slight and slow appearance change of the object, like the mouth movement of a face, does not usually influence the peak of the histogram. The majority of the motion vectors still reflect the object movement.

The object rectangle will in most cases not correspond to the macroblock grid, so that many macroblocks or referenced pixel blocks will partly contain background. We therefore introduce a weight for each macroblock or referenced pixel block, respectively. If all 16 × 16 pixels are located inside the object rectangle, the weight is 1; otherwise, the weight is proportional to the number of overlapping pixels.

The analysis is done with half-pixel accuracy. This means that the histogram is built using half-pixels without rounding to pixels beforehand. The object rectangle's coordinates are also internally calculated with half-pixel accuracy. The two motion vectors of macroblocks using interlaced motion estimation are both used with half weight when building the histogram.

h_{x,y} are the accumulated weights of the motion vectors (x, y). For each vector (x, y), the weighted frequency of the peak region called H_{x,y} is calculated. Because of the half-pixel accuracy, an extended range of three half-pixels (compared to two pixels with the detection of camera motion) is used to determine the peak region. The weight matrix W with elements w_{i,j} is represented as

\[
W = \frac{1}{4}\cdot
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1\\
1 & 2 & 2 & 2 & 2 & 2 & 1\\
1 & 2 & 3 & 3 & 3 & 2 & 1\\
1 & 2 & 3 & 4 & 3 & 2 & 1\\
1 & 2 & 3 & 3 & 3 & 2 & 1\\
1 & 2 & 2 & 2 & 2 & 2 & 1\\
1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix} \tag{5}
\]

and the peak region H_{x,y} is calculated as follows:

\[
H_{x,y} = \sum_{i=-3}^{3}\sum_{j=-3}^{3} w_{i+4,j+4}\cdot h_{x+i,y+j}. \tag{6}
\]
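A sketch of the two ingredients specific to tracking, namely the overlap weight and the larger filter; the peak search itself works exactly as in the camera-motion sketch of Section 3.1, only with this 7 × 7 matrix and a ±3 window. The coordinate units are an assumption of this sketch (use block_size=32 when everything is kept in half-pixels):

```python
import numpy as np

# 7 x 7 pyramid filter from (5); the 1/4 scaling does not change the argmax.
W7 = np.array([[1, 1, 1, 1, 1, 1, 1],
               [1, 2, 2, 2, 2, 2, 1],
               [1, 2, 3, 3, 3, 2, 1],
               [1, 2, 3, 4, 3, 2, 1],
               [1, 2, 3, 3, 3, 2, 1],
               [1, 2, 2, 2, 2, 2, 1],
               [1, 1, 1, 1, 1, 1, 1]]) / 4.0

def overlap_weight(block_x, block_y, rect, block_size=16):
    """Fraction of a block's pixels lying inside the object rectangle.

    rect -- (left, top, width, height); use block_size=32 when all
            coordinates are kept in half-pixel units
    """
    left, top, width, height = rect
    dx = min(block_x + block_size, left + width) - max(block_x, left)
    dy = min(block_y + block_size, top + height) - max(block_y, top)
    if dx <= 0 or dy <= 0:
        return 0.0
    return (dx * dy) / float(block_size * block_size)
```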
The calculation of the object rectangle is dependent on the frame type. Intracoded macroblocks are always ignored. The motion vectors of forward-predicted macroblocks specify the position in the frame from which the macroblock originates, whereas motion vectors of backward-predicted macroblocks specify to which position in the next reference frame the macroblock will move.

4.2.2. Backward search with P-frames

P-frames can only have motion vectors of forward-predicted macroblocks. They specify the position of the pixel block that is referred to in the previous reference frame. The number of pixels by which this pixel block and the object rectangle overlap is calculated. The overlap specifies the weight of the motion vector used to build the motion vector histogram. The vector of the peak region is calculated by applying a method similar to that used in the camera motion approach. Adding this vector to the coordinates of the object rectangle in the reference frame gives the coordinates of the object rectangle in the current P-frame. This rectangle is used as the reference rectangle for the following frames.
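Putting the pieces together, a sketch of one P-frame tracking step, built on the hypothetical Macroblock/ParsedFrame records from Section 2 and the helpers above. The sign convention for motion vectors varies between parsers, so the final subtraction is an assumption of this sketch:

```python
def track_in_p_frame(frame, rect, peak_fn, overlap_fn):
    """One tracking step in a P-frame (sketch of this backward search).

    frame      -- ParsedFrame whose macroblocks carry forward motion vectors
    rect       -- object rectangle (left, top, width, height) in half-pixel
                  units, valid for the previous reference frame
    peak_fn    -- histogram peak search, e.g. the 7 x 7 variant above
    overlap_fn -- weight of a referenced pixel block against the rectangle
    """
    vectors, weights = [], []
    for mb in frame.macroblocks:
        if mb.mv_forward is None:      # intracoded macroblocks are ignored
            continue
        mvx, mvy = mb.mv_forward       # half-pixel units
        # Position of the referenced pixel block in the previous reference
        # frame (16 pixels = 32 half-pixels per macroblock).
        ref_x = 32 * mb.x + mvx
        ref_y = 32 * mb.y + mvy
        w = overlap_fn(ref_x, ref_y, rect)
        if w > 0.0:
            vectors.append((mvx, mvy))
            weights.append(w)
    dx, dy = peak_fn(vectors, weights)
    # With vectors pointing from the current block back to its reference,
    # the rectangle moves against the peak vector; flip the sign if the
    # parser uses the opposite convention.
    left, top, width, height = rect
    return (left - dx, top - dy, width, height)
```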
4.2.3. Backward search and prediction with B-frames

The determination of the current object rectangle for B-frames is done in the same way that it was with P-frames, except that the forward portions of bidirectional motion vectors are included in the histogram as well. The reference rectangle is not updated since B-frames are never used as reference.

No motion vectors are available for I-frames, or for P-frames in which all macroblocks that are part of the object rectangle are intracoded. We therefore use the backward references of B-frames to make a prediction concerning where the object rectangle will move in the next reference frame. This is done again by using a histogram, but this time the histogram is built up using the motion vectors of backward-predicted macroblocks and the backward portions of bidirectional motion vectors. The sum of the peak-region vector and the object rectangle's coordinates of the B-frame gives the object rectangle's coordinates in the next reference frame.

Consequently, for a P-frame in the sequence B1 B2 P, usually three calculations of the object rectangle are available, which are identical in the ideal case: the two predictions of the B-frames and the calculation for the P-frame. We use the latter one because it is most reliable, since it is calculated in one step. In contrast, a prediction of a B-frame requires two steps, the B-frame calculation and the B-frame prediction. However, experiments showed that the predictions that were calculated with the B-frames are more precise than the P-frame calculations if the motion vectors spread, especially in the case of fast object movements. Therefore, an additional pseudomotion vector with weight 2 is added to the P-frame's histogram. It is determined by the difference between the coordinates of the object rectangle in the reference frame and the prediction. This pseudomotion vector does not influence the maximum in cases in which no fast movement occurs.
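A small sketch of how the B-frame prediction could be injected into the P-frame histogram; the sign matches the motion-vector convention assumed in the P-frame sketch above:

```python
def add_pseudomotion_vector(vectors, weights, rect_ref, rect_predicted):
    """Inject the B-frame prediction into the P-frame histogram as a
    pseudomotion vector with weight 2 (sketch of the step above).

    rect_ref       -- object rectangle in the previous reference frame
    rect_predicted -- rectangle predicted for the current P-frame from the
                      backward references of the preceding B-frames
    """
    # Difference between the reference rectangle and the prediction; the
    # sign follows the convention of the P-frame sketch above.
    dx = rect_ref[0] - rect_predicted[0]
    dy = rect_ref[1] - rect_predicted[1]
    vectors.append((dx, dy))
    weights.append(2.0)
```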
4.2.4. Using predicted coordinates with I-frames

For I-frames, the prediction of the object rectangle derived from the previous B-frame is used. This prediction is also used for P-frames in which all macroblocks that are part of the object rectangle are intracoded. The reference rectangle is updated with these coordinates.

Figure 3 shows an example sequence with three consecutive frames of type P, B, and I. The object rectangle in the P-frame (depicted with solid lines in the left picture) is known when the B-frame is analyzed. Each macroblock in the B-frame is checked as to whether a pixel block of the object rectangle in the P-frame is used as reference. The macroblock depicted with dashed lines is located completely inside the object rectangle, so that its weight is 1, whereas the weight of the macroblock depicted with broken lines is smaller since it is located only partly inside the object rectangle. The object motion vector is calculated using the histogram. By subtracting the motion vector from the coordinates of the object rectangle in the reference frame, its position in the B-frame can be determined. To predict the object rectangle's position in the I-frame, a histogram of the motion vectors of backward-predicted macroblocks and the backward portions of bidirectional motion vectors is built.

The algorithm does not support rotating objects. The object rectangle's orientation will always remain the same, so that a 90-degree rotation will have the effect that the object rectangle contains background from that point on. This is also the case if the object size decreases. However, a slow increase of size will usually not result in problems. It is essential that the encoder uses B-frames without preferring the previous or the following reference frame.

4.3. Experiments

We primarily performed experiments with faces. For comparison, the ground-truth data of the position of the rectangle containing the face were determined using the Rowley face detection algorithm [21]. The rectangle position in the first frame of the sequence together with its size was used as the start object rectangle for the tracking algorithm. Another face detection method, which could have been used to establish ground truth, is presented in [22].

Figure 4 shows the tracking of a speaker in the German parliament. The object rectangle in the first frame was determined using the Rowley algorithm. It is the starting point of the tracking. The white dotted rectangle shows the reference tracking by the Rowley algorithm, whereas the solid rectangle depicts our tracked object rectangle. The speaker's head moves several times quickly up and down and sideways. The Rowley network was only trained on frontal faces and cannot therefore always find the face with sufficient confidence. Our tracking algorithm can link these face detections so that it is possible to automatically determine that it is the same face. The differences between the positions of the Rowley algorithm and ours are only a few pixels. Even if the head is slightly tilted, the face can still be tracked. The size of the rectangle that was determined using the Rowley algorithm varies, whereas the size of our rectangle is constant.

Experiments with low-quality video material, such as recordings from security cameras, gave poor results. The motion vectors do not reflect the object motion adequately. This is also the case if the object moves too fast or changes its appearance quickly. For this reason, the quality of the tracking is highly dependent on the quality of the motion vectors. A high-quality video with low noise is required as well as an encoder with good motion estimation.

Table 1 shows the tracking results of five high-quality video sequences.

4.4. Results

The results of the presented object tracking using MPEG-2 motion vectors are very good if the video material is of high quality and the MPEG-2 encoders have good motion estimation, as they do with newscasts and parliament transmissions. Consequently, these are the fields of application of object tracking using motion vectors. If the video material is of low quality, especially noisy and blurry as with security cameras, the motion vectors do not reflect the object motion, so that object tracking using motion vectors is not possible.
Figure 3: Example of object tracking: the object rectangle (with solid lines) is calculated for the B-frame shown in the middle. An object motion vector is calculated relative to the object rectangle's position in the P-frame. After this, an object motion vector is calculated to predict the object rectangle's position in the I-frame. The grid depicts the macroblock boundaries; the dashed and broken squares depict two examples of macroblocks and the referenced pixel blocks, respectively.

Figure 4: Face tracking for 53 seconds of a parliament speaker. The solid rectangle localizes the tracked object. The dotted rectangle reflects the reference data, which were determined using the Rowley algorithm. In several cases, the reference algorithm failed to detect a face with sufficient confidence.

Table 1: Tracking test results of five video sequences. Example frames of test sequence 3 are shown in Figure 4. Rectangle's average overlap means the average overlap of the tracked object rectangle and the face rectangle determined by the Rowley face detector. In test sequence 5, the face is lost when the speaker tilts his head too quickly. The rectangle overlap then falls below 50%, so that the object can be regarded as lost.

Test sequence | Tracked object     | Duration        | Total frames | Rectangle's average overlap
1             | Anchor person      | 50 s            | 1250         | 97.8%
2             | Anchor person      | 52 s            | 1303         | 94.1%
3             | Parliament speaker | 53 s            | 1335         | 89%
4             | Parliament speaker | 140 s           | 3492         | 91%
5             | Parliament speaker | Lost after 17 s | 428          | 82.4%
5. CUT DETECTION

5.1. Algorithm

We use a combined evaluation of B-frame references and motion vectors. Figure 5 shows the three possible cut types. R1 and R2 stand for reference frames; in between are the B-frames B1 and B2. The arrows indicate the references of the majority of the macroblocks. The number of B-frames between the reference frames is arbitrary, but at least one. We use two B-frames in the example, which gives a GOP such as IBBPBBPBBPBB.

Figure 5: The three cut types. R1 and R2 are reference frames, B1 and B2 are the two B-frames, | indicates the cut position. The arrows indicate the references of the majority of the macroblocks. (b) R1 B1 | B2 R2; (c) R1 B1 B2 | R2.

(i) In the first case, most macroblocks from B1 use pixel blocks from R2 as reference. This indicates a cut between R1 and B1.
(ii) In the second and third cases, most macroblocks from B1 use the reference frame R1. At the moment of analyzing B1, we can conclude that the following cut occurs before or at R2. It is necessary to check all of the following B-frames to determine if the macroblock-type ratio has inverted in order to localize the cut more exactly.
(iii) In the second case, the macroblock-type ratio has inverted in B2. Therefore, a cut is detected at that frame.
(iv) If the ratio has not inverted by the time we reach R2, the cut must be at R2. This is shown in the third case.
(v) If there was no significant ratio of forward/backward types, it is assumed that no cut occurred within that sequence.

A schematic of this case analysis is sketched below. Note that after a cut has been detected at a B-frame, no further cut can be detected until the next reference frame.
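The sketch reduces each B-frame to a three-valued majority summary; the significance test behind that summary uses the ratios A_For and A_Back defined in Section 5.1.1 below, so the strings here are a simplification.

```python
from enum import Enum

class CutType(Enum):
    BEFORE_FIRST_B = 1   # case (i): cut between R1 and B1
    AT_B_FRAME = 2       # case (iii): ratio inverts at a later B-frame
    AT_NEXT_REF = 3      # case (iv): ratio never inverts, cut at R2

def classify_segment(majorities):
    """Locate a cut between two reference frames R1 and R2.

    majorities -- for each B-frame between R1 and R2, "forward", "backward",
                  or None if neither direction dominates significantly
    Returns (CutType, index of the first B-frame after the cut), or None.
    """
    if not majorities or majorities[0] is None:
        return None                                  # case (v): no cut
    if majorities[0] == "backward":
        return (CutType.BEFORE_FIRST_B, 0)           # case (i)
    for i in range(1, len(majorities)):
        if majorities[i] == "backward":
            return (CutType.AT_B_FRAME, i)           # case (iii)
    return (CutType.AT_NEXT_REF, len(majorities))    # case (iv)
```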
used because of coding efficiency. Effectively, we only con-
5.1.1. Problem: little motion activity sider the forward/backward predicted ratio of macroblocks
that have moved or will move compared to the previous or
There are general problems with cut detection using mac- following reference frame.
>1
roblock types if there is no or little motion as is the case with BFor is the number of forward-predicted macroblocks
video depicting static graphics. The encoder does not need whose motion vectors are greater than 1 for at least one di-
>1
to use references to both reference frames, but uses a refer- rection, BBack is the analogous value for backward-predicted
ence to the previous or following reference frame only; in macroblocks. The number of bidirectional macroblocks is
denoted as B_Bi and the number of intracoded macroblocks as B_Intra.

The ratios A_For and A_Back are defined as

\[
A_{\text{For}} = \frac{B_{\text{For}}^{>1} + B_{\text{Bi}} + R}{B_{\text{For}}^{>1} + B_{\text{Back}}^{>1} + B_{\text{Bi}} + R},
\qquad
A_{\text{Back}} = \frac{B_{\text{Back}}^{>1} + B_{\text{Bi}} + R}{B_{\text{For}}^{>1} + B_{\text{Back}}^{>1} + B_{\text{Bi}} + R},
\tag{7}
\]

where R is a value that compensates for the noise in the distribution of macroblock types. It increases the ratio if the number of considered macroblocks is low. R depends on the total number of macroblocks B_Total and a constant r as follows:

\[
R = B_{\text{Total}} \cdot r. \tag{8}
\]

A cut of type 1 (see Figure 5) is assumed if

\[
A_{\text{For}} < t_{\text{For}}, \tag{9}
\]

where t_For is a constant. If the ratio is in an expanded range with

\[
A_{\text{For}} < t_{\text{For}}^{+}, \tag{10}
\]

a cut hypothesis with lower probability is assumed. We therefore check the spread of the forward portion of motion vectors of the considered macroblocks using the average deviation from the mean of the motion vectors. This value is denoted as S_ForXY and is compared with the constant S_Deviation. A cut is assumed if

\[
S_{\text{For}XY} > S_{\text{Deviation}}. \tag{11}
\]

The idea is that in case the reference frame is located before the cut, the forward-predicted macroblocks will change their positions drastically without preferring one direction.

A cut of type 2 or 3 (see Figure 5) is assumed if

\[
A_{\text{Back}} < t_{\text{Back}}. \tag{12}
\]

Again, if the ratio is in an extended range with

\[
A_{\text{Back}} < t_{\text{Back}}^{+}, \tag{13}
\]

we check the spread of the considered motion vectors, here the backward-predicted portions. Similarly, if

\[
S_{\text{Back}XY} > S_{\text{Deviation}}, \tag{14}
\]

a cut is assumed. However, in these cases we can only identify a cut up until the next reference frame. For this reason, each following B-frame is checked to see if the number of forward-predicted macroblocks is smaller than the number of backward-predicted macroblocks, and a cut is assumed if this is the case. If no cut position was found by the next reference frame, this reference frame must be the cut position.
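A sketch of these ratio and spread tests; the constants r, t_For, t_For^+, and S_Deviation are placeholders, since the paper does not report the values it uses, and the exact definition of the average deviation is our reading:

```python
import numpy as np

def cut_ratios(b_for_gt1, b_back_gt1, b_bi, b_total, r=0.02):
    """A_For and A_Back from (7) with noise compensation R = B_Total * r (8)."""
    R = b_total * r
    denom = b_for_gt1 + b_back_gt1 + b_bi + R
    return (b_for_gt1 + b_bi + R) / denom, (b_back_gt1 + b_bi + R) / denom

def spread(mvs):
    """Average deviation of the vectors from their mean (our reading of
    S_ForXY / S_BackXY)."""
    a = np.asarray(mvs, dtype=float)
    return float(np.abs(a - a.mean(axis=0)).mean())

def type1_cut(a_for, forward_mvs, t_for=0.3, t_for_plus=0.4, s_deviation=4.0):
    """Cut between R1 and B1 per (9)-(11); the type 2/3 test mirrors this
    with A_Back and the backward-predicted vectors, (12)-(14)."""
    if a_for < t_for:
        return True
    # Expanded range: accept the weaker hypothesis only if the forward
    # vectors spread widely around their mean.
    return a_for < t_for_plus and spread(forward_mvs) > s_deviation
```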
5.1.2. Cut validation using the following P-frame

To lower the number of false detections, such as the ones occurring frequently with static computer-generated graphics, the P-frame after a detected cut can be used to validate the cut. The encoder will probably not be able to take over pixel blocks in the P-frame from the same position in the previous reference frame. The ratio of nonmoving motion vectors is calculated as

\[
A_{\text{For}}^{0/1} = \frac{B_{\text{Total}} - B_{\text{For}}^{>1} - B_{\text{Bi}} - B_{\text{Intra}}}{B_{\text{Total}}} \tag{15}
\]

and compared with a threshold constant t_Valid. A cut hypothesis occurring since the last reference frame can be assumed to be unlikely if

\[
A_{\text{For}}^{0/1} > t_{\text{Valid}}, \tag{16}
\]

and can therefore be rejected. The cuts can only be verified if the reference frame after the cut is a P-frame. In the case of an I-frame, no information can be obtained. Some encoders enforce an I-frame after a cut. This is, however, not a problem since only false cuts need to be identified.
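The validation step then reduces to a single ratio test; t_valid is again a placeholder constant:

```python
def reject_cut_hypothesis(b_total, b_for_gt1, b_bi, b_intra, t_valid=0.5):
    """Validate a detected cut with the following P-frame, (15)-(16).

    A_For^{0/1} is the share of macroblocks that were predicted without
    moving by more than one half-pixel; a large share means the P-frame
    reuses the previous reference frame, so the cut hypothesis is rejected.
    """
    a_for_01 = (b_total - b_for_gt1 - b_bi - b_intra) / b_total
    return a_for_01 > t_valid
```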
5.2. Experiments

The tests were primarily done with newscast video material. The sequences contain a broad test spectrum including, for example, speakers with static background, static computer-generated graphics, camera flashes, fast camera movement, reports with bad picture quality, and so forth. Table 2 shows the results of five test sequences. The cut positions were determined by hand.

The average percentage of detected cuts varies from 94.8% to 100.0% depending on the encoder and the video content. The average percentage of incorrectly detected cuts (false alarms) varies from 0% to 23.0%. Up to 4.1% of the detected cuts differed from the actual positions by one frame. Parliament transmissions generally give better results than newscasts, since they do not contain the typical problems.

The causes for nondetected cuts are, in descending order of importance, as follows.

(i) The encoder did not use the characteristic distribution of macroblock types.
(ii) The frames before and after the cut are too similar.
(iii) The cut occurred between two half-pictures.

The causes for cuts detected where no cut actually occurred are, in descending order of importance, as follows:

(i) the consecutive frames are too similar; there are (almost) no changes, so that it was sufficient for the encoder to use references to one reference frame only;
(ii) camera flashes;
(iii) the encoder did not use the characteristic distribution of macroblock types;
(iv) very quick camera motion.
Table 2: Results of the cut detection test for five videos. "# Ref. cuts" is the number of cuts in the ground truth. "B-frames" is the number of B-frames between two reference frames. Deletion = number of missed cuts divided by number of reference cuts; insertion = number of falsely detected cuts divided by number of detected cuts; inaccuracy = displacement of cut divided by number of detected cuts.

Video type | Duration | Total frames | # Ref. cuts | B-frames | Deletion | Insertion | Inaccuracy
Newscast   | 15 min   | 22500 | 132 | 1 | 1.5% | 2.9%  | 0.0 frames
Newscast   | 15 min   | 22500 | 167 | 1 | 1.2% | 2.3%  | 0.0 frames
Newscast   | 15 min   | 22500 | 96  | 2 | 5.2% | 11.1% | 0.009 frames
Newscast   | 15 min   | 22500 | 97  | 2 | 4.1% | 23%   | 0.032 frames
Parliament | 54.7 min | 82054 | 81  | 2 | 0%   | 0%    | 0.0 frames
No or little motion is still a problem. This especially applies to computer-generated graphics since they do not contain the noise that leads to small differences between the frames. In order to avoid the repeated detection of a cut, only the macroblocks with motion vectors of at least two half-pixels are considered. However, imposing this limitation can lead to a situation in which the number of remaining macroblocks is small, causing the result to be unreliable. In this case, both a nondetected cut and an erroneously detected cut are possible. Camera flashes lead to increased brightness in a single frame or a half-frame only. If a flash occurs in a reference frame, that frame will likely have fewer referential dependencies, which may lead to an erroneously detected cut. Cuts between two half-pictures can occur if the video material was produced with analog equipment. The first half-picture of a frame belongs to one scene and the second half-picture belongs to the next scene. Very quick camera motion can also lead to false detection of a cut since the motion vectors spread.

Generally, the number of nondetected cuts can be reduced by adjusting the parameters so that more cuts are detected, also resulting in more false alarms. All cut hypotheses could be checked using the DC coefficients of the approximated DCT, as was shown with the camera motion detection. Such checking does not compromise the speed improvement over uncompressed-domain techniques.

The performance of the algorithm presented here depends on the encoder. The parameters have to be adjusted to accommodate the particular encoder. It is necessary that the encoder uses B-frames and does not always prefer one reference frame like the next I-frame.

5.3. Results

The cut detection algorithm using compressed-domain data only can detect 94.8%–100% of all cuts in a video. The performance is dependent on the encoder and the video material. Only sequences with no or little motion remain problematic despite special treatment, and in such cases, the detection is not reliable. Also, encoders do not always encode the video as presumed by the proposed algorithms but are rather programmed with coding efficiency in mind.

A typical application of the cut detection method would be integration in distributed online television monitoring systems. The basic units for the processing of video sequences are the shots. These shots are distributed to the different nodes of a computer cluster for fast processing. The detection of the shots must occur before the distribution and is as such the bottleneck. For this reason, an algorithm which can detect shots faster than real time is highly desirable.

6. CONCLUSIONS

In this paper, three different algorithms for low-level metadata extraction were presented. All three algorithms work in the MPEG-2 compressed domain and are therefore efficient, saving time for the further processing steps that create metadata on a higher semantic level. The algorithms operate in real time on a computer with an 866 MHz Pentium III processor under Linux.

Detection of cuts and camera motion can be performed immediately after parsing a frame in bitstream order, without having to wait for the data of the following frames. This procedure allows on-the-fly cut and camera motion detection, for example, with live videos from a satellite. The object tracking algorithm requires the frames to be in display order, which gives a delay of up to three frames (less than 100 milliseconds).

Considering that the encoder's task is to encode the best picture quality with a given bit rate and not with the camera or object motion in mind, the results of the compressed-domain algorithms are convincing. The results show that it is not necessary to expend computing power on decoding in order to perform low-level metadata extraction in the pixel domain.

The algorithms presented in this paper are not fundamentally limited to MPEG-2. They can be extended to all video compression methods that use motion compensation and bidirectional frames, for example, MPEG-4 simple and advanced profile (also known as DivX).

REFERENCES

[1] B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, New York, NY, USA, 2002.
[2] R. Wang and T. Huang, “Fast camera motion analysis in MPEG domain,” in Proceedings of IEEE International Conference on Image Processing (ICIP ’99), vol. 3, pp. 691–694, Kobe, Japan, October 1999.
[3] A. Smolic, M. Hoeynck, and J.-R. Ohm, “Low-complexity global motion estimation from P-frame motion vectors for MPEG-7 applications,” in Proceedings of IEEE International Conference on Image Processing (ICIP ’00), vol. 2, pp. 271–274, Vancouver, BC, Canada, September 2000.
[4] P. M. Kuhn, “Camera motion estimation using feature points in MPEG compressed domain,” in Proceedings of IEEE International Conference on Image Processing (ICIP ’00), vol. 3, pp. 596–599, Vancouver, BC, Canada, September 2000.
[5] V. Kobla, D. Doermann, and A. Rosenfeld, “Compressed domain video segmentation,” Tech. Rep. CAR-TR-839 (CS-TR-3688), University of Maryland, College Park, Md, USA, 1996.
[6] D. Comaniciu and P. Meer, “Mean shift analysis and applications,” in Proceedings of 7th IEEE International Conference on Computer Vision (ICCV ’99), vol. 2, pp. 1197–1203, Kerkyra, Greece, September 1999.
[7] B. Danette Allen and G. Bishop, “Tracking: Beyond 15 Minutes of Thought,” in SIGGRAPH 2001 Course 11 Booklet, Los Angeles, Calif, USA, August 2001.
[8] A. Gyaourova, C. Kamath, and S.-C. Cheung, “Block matching for object tracking,” Tech. Rep. UCRL-TR-200271, Lawrence Livermore National Laboratory, Livermore, Calif, USA, 2003.
[9] V. Kobla, D. Doermann, K.-I. Lin, and C. Faloutsos, “Feature normalization for video indexing and retrieval,” Tech. Rep. CAR-TR-847 (CS-TR-3732), University of Maryland, College Park, Md, USA, 1996.
[10] J. I. Khan, Z. Guo, and W. Oh, “Motion based object tracking in MPEG-2 stream for perceptual region discriminating rate transcoding,” in Proceedings of 9th ACM International Conference on Multimedia (ACM Multimedia ’01), pp. 572–576, Ottawa, Ontario, Canada, September–October 2001.
[11] H. Zhang, A. Kankanhalli, and S. W. Smoliar, “Automatic partitioning of full-motion video,” ACM Multimedia Systems, vol. 1, no. 1, pp. 10–28, 1993.
[12] A. Nagasaka and Y. Tanaka, “Automatic video indexing and full-video search for object appearances,” in Proceedings of IFIP 2nd Working Conference on Visual Database Systems, pp. 113–127, North-Holland, September–October 1991.
[13] R. Zabih, J. Miller, and K. Mai, “A feature-based algorithm for detecting and classifying scene breaks,” in Proceedings of 3rd ACM International Conference on Multimedia (ACM Multimedia ’95), pp. 189–200, San Francisco, Calif, USA, November 1995.
[14] G. Bozdagi and H. T. Sencar, “Preprocessing tool for compressed video editing,” in Proceedings of IEEE 3rd Workshop on Multimedia Signal Processing, pp. 283–288, Copenhagen, Denmark, September 1999.
[15] A. Bovik, Ed., Handbook of Image and Video Processing, Academic Press, New York, NY, USA, 2000.
[16] J. Calic and E. Izquierdo, “Towards real-time shot detection in the MPEG-compressed domain,” in Proceedings of Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS ’01), pp. 95–100, Tampere, Finland, May 2001.
[17] P. Viola and M. J. Jones, “Robust real-time object detection,” Tech. Rep. CRL 2001/01, Cambridge Research Laboratory, Cambridge, Mass, USA, 2001.
[18] S.-F. Chang and D. G. Messerschmitt, “Manipulation and compositing of MC-DCT compressed video,” IEEE Journal on Selected Areas in Communications, vol. 13, no. 1, pp. 1–11, 1995.
[19] K. Shen and E. J. Delp, “A fast algorithm for video parsing using MPEG compressed sequences,” in Proceedings of IEEE International Conference on Image Processing (ICIP ’95), vol. 2, pp. 252–255, Washington, DC, USA, October 1995.
[20] B.-L. Yeo and B. Liu, “On the extraction of DC sequence from MPEG compressed video,” in Proceedings of IEEE International Conference on Image Processing (ICIP ’95), vol. 2, pp. 260–263, Washington, DC, USA, October 1995.
[21] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–38, 1998.
[22] H. Wang and S.-F. Chang, “A highly efficient system for automatic face region detection in MPEG video,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 4, pp. 615–628, 1997.

Wolfgang Hesseler received his Diploma in computer science from the University of Bonn, Germany. The algorithms presented in this paper were developed during his work at the Fraunhofer IMK in Birlinghoven, Germany, where he wrote his Diploma thesis about using MPEG-2 compressed-domain data for computer vision. His areas of interest are video indexing and video analysis.

Stefan Eickeler received his Dipl.-Ing. degree and Dr.-Ing. degree in electrical engineering from the University of Duisburg, Germany, in 1996 and 2001, respectively. He joined the Fraunhofer Institute for Media Communication (Fraunhofer IMK) in the year 2001. Since 2003 he has taught as a lecturer at the Bonn-Aachen International Centre for Information Technology. His research interests are statistical pattern recognition and signal processing with the applications face and gesture recognition and automatic media and document analysis.