Crowd Counting Using Multiple Local Features
Abstract-In public venues, crowd size is a key indicator of crowd safety and stability. Crowding levels can be detected using holistic image features; however, this requires a large amount of training data to capture the wide variations in crowd distribution. If a crowd counting algorithm is to be deployed across a large number of cameras, such a large and burdensome training requirement is far from ideal. In this paper we propose an approach that uses local features to count the number of people in each foreground blob segment, so that the total crowd estimate is the sum of the group sizes. This results in an approach that is scalable to crowd volumes not seen in the training data, and can be trained on a very small data set. As a local approach is used, the proposed algorithm can easily be used to estimate crowd density throughout different regions of the scene and be used in a multi-camera environment. A unique localised approach to ground truth annotation that reduces the required training data is also presented, as a localised approach to crowd counting has different training requirements to a holistic one. Testing on a large pedestrian database compares the proposed technique to existing holistic techniques and demonstrates improved accuracy, and superior performance when test conditions are unseen in the training set, or a minimal training set is used.

Keywords-Crowd Counting, Crowd Density, Local Features, Foreground segmentation

I. INTRODUCTION

In large public places, it is often impossible to monitor every person for suspicious behaviour. The threats posed in crowded environments are of a different nature to those posed by an individual, and arise from the crowd's collective properties: "a crowd is something other than the sum of its parts" [6]. These threats include fighting, rioting, violent protest, mass panic and excitement. The most common indicator of such behaviour is crowd size, which may also be an indicator of congestion, delay or other abnormality. As crowd size is a holistic description of the scene, the majority of crowd counting techniques have utilised holistic features to estimate crowd size. However, due to the wide variability in crowd behaviours, distribution, density and overall size, holistic systems require a very large training set. In a facility containing numerous cameras, it is not practical to supply hundreds of frames of ground truth for potentially hundreds of cameras.

In this paper we propose a novel approach that uses local features, defined here as features which are specific to an individual or small group within an image. While existing techniques have used similar local features such as foreground pixels, they are analysed at a holistic level. Local features are used here to estimate the number of people within each group, so that the total crowd estimate is the sum of all group sizes. As local features are used, training data must also be annotated with local information. To provide appropriate training data, a unique method of localised ground truth annotation is proposed which greatly reduces the required training data.

As well as the reduced training requirement, a localised approach also enables the estimation of crowd densities at different locations within the scene (unlike holistic systems, which can only provide a density for the whole scene), and allows for a simple extension to a multi-camera environment. The ability to determine local crowd densities greatly improves the system's ability to detect abnormalities in a scene. While the overall number of people in a scene may be considered normal, there may be a very high concentration of people in a small area. Holistic systems are unable to detect such an abnormality; however, the proposed local approach can easily detect such an occurrence.

The proposed system is tested on a 2000 frame database [4] featuring crowds of size 11-45 people. The proposed technique is compared to two holistic techniques, and is shown to outperform holistic techniques in terms of accuracy, scalability and practicality. The system is shown to be highly scalable, as it is capable of extrapolating to count crowds which are larger or smaller than those encountered during training; and highly practical, as it is able to count crowds when trained on as few as 10 frames of training data.

The remainder of the paper is structured as follows: Section II provides an overview of existing crowd counting techniques, Section III outlines the proposed algorithm, Section IV describes the proposed ground truth annotation method, Section V presents experimental results and Section VI presents conclusions and possible directions for future work.

II. EXISTING WORK

The task of crowd counting has been approached from a number of angles, but the techniques share a common
model and tracking to estimate group size. While these systems all employ local features, they often rest on specific assumptions, including image quality. When presented with low-quality video and poor segmentation, it is difficult to classify or track the local features unless ground truth is also annotated on a local level.
Local features have been employed to other crowd related problems though, such as crowd detection [2] (detection of human like objects and repeating structures) and analysis of crowd stability [1] (using optical flow over time). However

pixel weighted by its value in the density map.
B_{size} = \sum_{(i,j) \in B} W(i, j)
where (i, j) ∈ B, and B_{size} is the calculated area of blob B.
• Perimeter: The total pixel count for the blob's perimeter, each weighted by the square root of its value in the density map.
• Perimeter-Area Ratio: The ratio of perimeter to area, a measure of shape complexity [4].
Authorized licensed use limited to: University of P.J. Safarik Kosice. Downloaded on June 16,2024 at 21:42:48 UTC from IEEE Xplore. Restrictions apply.
• Edges: The total pixel count of edges within the blob,
extracted from the image using Canny edge detection.
Each pixel was weighted by the square root of its value
in the density map.
• Edge Angle Histogram: The histogram of edge angles,
obtained from the edge detection. Six histogram bins
are used in the range 0° to 180° [9]. Each pixel's
contribution to a histogram bin is the square root of
its value in the density map.
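The density-weighted features listed above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `blob_features`, the data layout (a set of pixel coordinates, a 2D list for the density map W, a dict of edge angles) and the 4-neighbour definition of a perimeter pixel are all assumptions made for the example. Edge pixels and their angles are taken as given, standing in for the output of an edge detector.

```python
import math

def blob_features(blob, density, edge_angles, bins=6):
    """Return density-weighted features for one blob (illustrative sketch).

    blob        -- set of (i, j) pixel coordinates in the blob
    density     -- 2D list; density[i][j] is the density-map value W(i, j)
    edge_angles -- dict mapping edge pixels (i, j) to an angle in degrees
    """
    # Square root of the density value, used by every feature except area.
    root = {p: math.sqrt(density[p[0]][p[1]]) for p in blob}

    # Area: every blob pixel weighted by its density-map value.
    area = sum(density[i][j] for i, j in blob)

    # Perimeter (assumed definition): blob pixels with a 4-neighbour outside.
    def on_perimeter(i, j):
        return any(q not in blob
                   for q in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)))
    perimeter = sum(root[p] for p in blob if on_perimeter(*p))

    # Edges and edge angle histogram: only edge pixels inside the blob count.
    edges = 0.0
    histogram = [0.0] * bins
    for p, angle in edge_angles.items():
        if p in blob:
            edges += root[p]
            k = min(bins - 1, int((angle % 180.0) // (180.0 / bins)))
            histogram[k] += root[p]

    return {"area": area, "perimeter": perimeter,
            "ratio": perimeter / area, "edges": edges,
            "edge_histogram": histogram}
```

The resulting values would form the feature vector for one blob. An empty blob, or one whose density values sum to zero, would need separate handling before the ratio is computed.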
D. Crowd Counting
The features extracted from each blob serve as inputs to a classifier. The output of the classifier is g_i, the group size estimate for the ith blob. A neural network was used to perform classification, as this has proven successful in previous research [12], [9]. In order to test whether local features can be classified using simpler strategies, a basic linear model was also tested:

g_i = w_0 + \sum_{n=1}^{N_F} w_n f_n    (1)

where w_n is the weight assigned to feature f_n, given N_F features. The weights are calculated using least squares regression. The total crowd estimate for a frame containing N_B blobs is then calculated:

C = \sum_{i=1}^{N_B} g_i    (2)

The estimate will vary from frame to frame as pedestrians enter and exit a scene simultaneously. A rapidly fluctuating estimate is not usable or accurate. A median filter provides smoothness and stability to the estimate, as well as making it robust against outlier estimates.
A median filter of length 2n + 1 will select the median estimate from n frames either side of the frame in question. This is a non-causal filter which, implemented in practice, will introduce a delay of n frames. For this application we use a median filter of length 41 (n = 20). At a frame rate of 10 fps, the delay is 2.0 seconds.

Figure 1. A frame from the testing database. (a) Frame 1280. (b) Foreground mask. (c) Region of interest.

Figure 2. Typical errors in foreground extraction. (a) Correct extraction of individuals, with additional noise (i.e. small bar near centre). (b) Person (top, centre) is fragmented into two blobs, one of which is merged with nearby blob(s). (c) Person (left) is fragmented into two blobs. (d) Person (top, centre) blends into background leaving few foreground pixels. This person is barely visible to the human eye.
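The linear model of equation (1), the per-frame sum of equation (2) and the median filter can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the least squares weights are obtained from the normal equations by Gauss-Jordan elimination (one of several ways to solve the regression), and the filter window is simply truncated at the start and end of the sequence, a boundary rule the paper does not specify.

```python
from statistics import median

def fit_linear(features, targets):
    """Least squares weights [w_0, w_1, ..., w_NF] via the normal equations."""
    rows = [[1.0] + list(f) for f in features]      # leading 1 pairs with w_0
    k = len(rows[0])
    # Augmented system (X^T X | X^T y).
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes X^T X is non-singular (features are not collinear).
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def group_size(weights, f):
    """Equation (1): g_i = w_0 + sum over n of w_n * f_n."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], f))

def crowd_estimate(weights, blobs):
    """Equation (2): C is the sum of g_i over the blobs in one frame."""
    return sum(group_size(weights, f) for f in blobs)

def median_filter(estimates, n=20):
    """Median over up to n frames either side of each frame (window 2n + 1)."""
    return [median(estimates[max(0, t - n): t + n + 1])
            for t in range(len(estimates))]
```

With n = 20 the filter needs 20 future frames before it can emit a value, which at 10 fps corresponds to the 2.0 second delay stated above.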
leg respectively). The assignment of these weights is made by the computer according to the blob sizes.
2) Part of a person is split in isolation from the group they are with (Figure 2(b)). In this case, the contribution of the person is split across multiple (n) blobs equally (1/n to each). Proportional contributions would not be suitable, because some fragments are merged with neighbouring blobs.
3) The motion detection fails to detect a person (Figure 2(d)). In this case, no assignment is made because the person has blended completely into the background so that very few, if any, foreground pixels are present. If this is a common occurrence, then the problem must be addressed at the segmentation stage (if possible). (In the database used there are only a small number of instances where this occurs, and these only occur in one part of the scene where the background is dark). Assuming it is a rare occurrence, no contribution is assigned to the faded person. The reason for this is that assigning a large weight to a tiny blob may lead to misclassification at other locations in the scene, where tiny blobs are merely products of noise, such as in Figure 2(a).
The correspondences between pedestrians and foreground blobs are entered via the GUI. The above scenarios and the methods for handling them are used throughout the ground truth process to ensure that labelling is performed in a consistent manner.

V. EVALUATION AND RESULTS

A. Testing Criteria
The performance of the proposed system is assessed using three criteria:
1) Accuracy,
2) Scalability,
3) Practicality.
Accuracy is measured by comparing the detected number of pedestrians with the number annotated in the ground truth. Scalability is evaluated by using training and testing sets such that the types of crowds seen in testing are not present in the training set. Practicality is evaluated through the use of reduced training sets.
1) Accuracy: Although this system is trained on the basis of individual blobs, the testing still takes place on a holistic level. The accuracy of a system can be any measure of how closely the estimate follows the ground truth. The ground truth for the holistic crowd count was taken as the number of (x, y) person coordinates which lay within the region of interest. However, the exact point in time at which a person is deemed to have entered or exited a frame is never clearly defined. It may take several seconds between a pedestrian reaching the border of the region of interest, and being fully inside or outside of it.

Figure 3. Ground truth for frames 1270 to 1310. (Plot of number of people against frame.)

Figure 3 shows the number of people inside the region of interest over 40 frames (4.0 seconds). Based on the number of increments and decrements in this graph, there are at least 13 instances of pedestrians either entering or exiting the scene in this time. An example frame from this sequence is shown in Figure 1(a). The pedestrian at the bottom left in this sequence takes more than 30 frames to fully enter the scene. With groups entering and exiting the scene at this rate, yet taking several frames to do so, it would be difficult even for a human to estimate the exact crowd size, and impossible for them to remain consistent in their definition of what constitutes being 'in' or 'out' of the scene. In a scene such as this, where crowd size varies between 11 and 45 people, it is suggested that an estimate within 3 of the ground truth is acceptable. For testing purposes we consider the following measures of accuracy:
• Error: The mean value of the absolute difference between the crowd estimate and the ground truth.
• MSE: The mean value of the error squared.
• Acceptability: The percentage of frames for which the absolute error may be deemed 'acceptable', that is, less than or equal to 3.
2) Scalability: Ideally, the training data must cover a wide range of scenarios, similar to those which are expected to be found during operation. In the case of crowd counting, however, we may not have access to video footage of all possible scenarios. Excessive levels of over or under crowding may not be present in the training data because these events are abnormal, and this is the reason we wish to detect them. A system which cannot extrapolate in this context is of little practical use. We test the scalability of this system using two methods:
• Downscaling: The system is trained on large crowds, and tested on smaller crowds.
• Upscaling: The system is trained on small crowds, and tested on larger crowds.
3) Practicality: For a crowd counting system to be practical, it must be relatively easy to deploy. For real world deployment where the algorithm may be required to run on several hundred different cameras within a single installation, being able to use a reduced training set is highly desirable. When training crowd counting algorithms, each
training frame requires ground truth to be supplied. If several hundred training frames are needed for each camera ([6] uses 150 frames, each taken 10 seconds apart for training; [4] uses 800 consecutive frames for training), then the process of training becomes very tedious and time consuming. To assess practicality, systems are evaluated using reduced training sets.

B. Systems Tested
Three crowd counting techniques are evaluated:
• Proposed: The system described here, in which local features are extracted for each blob and ground truth annotation is performed on a local level.
• Equivalent Holistic System: This is a system which utilises the same features as the proposed system, taken on a holistic rather than local level. Ground truth is also annotated on a holistic level.
• Kong: Blobs are sorted into six histograms of bin width 1500, as described in [9]. An edge angle histogram is also calculated, for which we use six histogram bins between 0° and 180°. This is also a holistic system.
For each system, two classifiers are tested: a neural network and linear model.
The results provided by [4] for this database cannot be compared, as their estimate was calculated for pedestrians walking in either direction, rather than a total count. If the segmentation algorithm were changed from dynamic textures [5] to background subtraction, then the total count could be calculated. This would somewhat resemble the Equivalent Holistic System above, differentiated by the number of features.

C. Experimental Results
1) Accuracy: The accuracy of each system listed in Section V-B is tested. Frames 605, 610, ..., 1400 were designated for training (160 total) and testing was performed on frames 1-600 and 1401-2000. Those in the training set were annotated with ground truth counts for each blob, which was used to train the classifier. Neural network results differ slightly from test to test, therefore in order to determine a typical result for each system, the networks were retrained five consecutive times. The test which returned the median MSE for the filtered output was taken.
Results are tabulated in Table I. Results across the whole testing data set using the linear classifier are plotted in Figure 4.
By all three measures of accuracy, the proposed system significantly outperforms Kong and the equivalent holistic system. The mean error of the filtered estimate is 1.353 and the estimate is acceptable (within 3 of ground truth) 95.67% of the time (for the linear classifier). The linear classifier performs slightly better than the neural network, though similar performance trends are observed with the proposed system outperforming the other evaluated systems for a neural network classifier. The poorer performance of the neural network classifier can be attributed to the training data used. It is expected that for a larger training set, performance would equal or exceed that of the linear classifier.
2) Scalability: Scalability is tested in two steps, downscaling and upscaling. To test downscaling, frames 1205, 1210, ..., 1600 are designated for training (80 total), featuring crowds of size 30-45. These frames contain a mixture of large and small blobs. Testing is performed on frames 1-1200 and 1601-2000 (crowd sizes 11-40).
Due to the neural network's poor extrapolation capabilities, the holistic methods were unable to provide any meaningful results, as shown in Figure 5. The proposed system, trained on blobs of various sizes, was able to count smaller crowds.
The linear model is capable of superior extrapolation. The results in Table II indicate that all three systems can extrapolate downwards when linear fitting is used; however, the proposed system is most accurate.
To evaluate upscaling, frames 805, 810, ..., 1100 were designated for training (60 total), featuring crowds of size 11-27¹. Testing was performed on frames 1-800 and 1101-2000 (crowds 11-45). The blobs in the test set were larger than those in the training set, therefore all systems were unable to extrapolate when neural network classification was employed. As a result, evaluation results for the neural network classifier are not presented.
The linear model, however, is capable of extrapolation. Table III and Figure 6 illustrate the ability of the system to count crowds that are larger than those seen in the training set. It can be seen that the proposed algorithm is better equipped to deal with conditions that are unseen in the training set.
The superior performance on unseen conditions can be attributed to the manner in which the proposed algorithm counts crowds. As each blob is considered individually, the proposed algorithm only needs to have seen similar blobs in the training data. The holistic approaches, however, need to have seen a similar number of people overall in both training and testing.
3) Practicality: The fewer training frames required of a system, the greater its practicality. While a neural network requires a large range of training data, the linear model can be calculated with very little. Given this, only a linear classifier is used in evaluating the system's practicality. The robustness of the proposed system is evaluated by testing the systems using only 10 training frames (640, 720, ..., 1360). For Kong [9], in order to supply all of the histogram bins with sufficient data, it was necessary to train the algorithm on 40 frames (620, 640, ..., 1400). Testing was performed on frames 1-600 and 1401-2000.

¹ The training range was widened for Kong {805, 810, ..., 1300}, so that the training data contained blobs large enough to contribute to each of the blob size histogram bins.
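The training sets quoted in the experiments above are regular frame sequences, so their sizes can be checked by enumeration. The sketch below is only a bookkeeping aid; the helper `frame_set` and the dictionary keys are hypothetical names introduced for this example.

```python
def frame_set(first, last, step):
    """Frames first, first + step, ..., last (inclusive)."""
    return list(range(first, last + 1, step))

# Training sets as described in Section V-C.
TRAINING_SETS = {
    "accuracy": frame_set(605, 1400, 5),
    "downscaling": frame_set(1205, 1600, 5),
    "upscaling": frame_set(805, 1100, 5),
    "practicality": frame_set(640, 1360, 80),
    "practicality_kong": frame_set(620, 1400, 20),
}

for name, frames in TRAINING_SETS.items():
    print(name, len(frames))
```

The printed sizes are 160, 80, 60, 10 and 40 frames respectively, matching the totals stated in the text.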
Figure 4. Accuracy testing results. Estimate is rounded and median filtered, and is shown for the test set only. (Plot of number of people against frame; series: Ground Truth, Estimate (Proposed), Estimate (Holistic), Estimate (Kong).)
Figure 5. Downscaling testing results using neural network. Estimate has been rounded but not filtered. (Three panels, each plotting number of people against frame; series: Ground Truth, Estimate.)
System    Classifier   Raw Estimate                 Median Filtered
                       Error   MSE      Accept.     Error   MSE     Accept.
Proposed  NN           2.086   6.701    82.56%      1.881   5.532   86.63%
Kong      NN           System failed. See Figure 5.
Holistic  NN           System failed. See Figure 5.
Proposed  Linear       1.635   4.186    86.75%      1.537   3.674   92.81%
Kong      Linear       2.659   10.074   59.31%      2.559   8.839   72.31%
Holistic  Linear       2.341   8.787    71.69%      2.194   7.938   80.44%

Table II
DOWNSCALING TESTING RESULTS USING LINEAR FITTING.
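The Error, MSE and Accept. columns reported in the results tables follow the three accuracy measures defined in Section V-A. A minimal sketch of how they could be computed from per-frame estimates is given below; the function name `accuracy_measures` and the returned dictionary keys are assumptions for illustration, while the tolerance of 3 comes from the text.

```python
def accuracy_measures(estimates, truth, tolerance=3):
    """Mean absolute error, mean squared error and acceptability (%)."""
    errors = [abs(e - t) for e, t in zip(estimates, truth)]
    n = len(errors)
    return {
        "error": sum(errors) / n,
        "mse": sum(d * d for d in errors) / n,
        # Share of frames whose absolute error is at most the tolerance.
        "acceptability": 100.0 * sum(d <= tolerance for d in errors) / n,
    }
```

For example, estimates of 10, 14, 20 and 25 against ground truth of 10, 12, 24 and 25 give absolute errors of 0, 2, 4 and 0, so the error is 1.5, the MSE is 5.0 and the acceptability is 75%.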
(Plot: two panels, each showing number of people against frame.)
Results are shown in Table IV. The proposed system outperforms the holistic systems using a limited training set, and achieves better results than when using a larger training set. The superior generalisation is likely due to the wider spacing of the training frames. These results indicate that the proposed system is highly practical, with accurate results obtained from as few as 10 frames of training data.

System    Training Set       Raw Estimate                Median Filtered
                             Error   MSE     Accept.     Error   MSE      Accept.
Proposed  640,720,...,1360   1.306   2.684   93.17%      1.047   1.902    99.25%
Kong      620,640,...,1400   1.710   4.642   84.25%      1.352   3.200    93.75%
Holistic  640,720,...,1360   4.462   31.24   41.58%      3.538   17.788   57.83%

Table IV
PRACTICALITY TESTING RESULTS.

VI. CONCLUSIONS AND FUTURE WORK

In this paper we have proposed the use of multiple local features for crowd counting. This approach reduces the task of crowd counting to the group level, so that the crowd estimate is the sum of its parts. By three standards (accuracy, scalability and practicality), the proposed system outperforms existing holistic methods of crowd counting. The proposed system is capable of extrapolating outside of the training range, and can also count crowds with minimal training (10 frames), demonstrating practicality. The ability to train the system from as few as 10 frames means it can be easily deployed in a real world setting consisting of a large number (possibly hundreds) of cameras with much greater ease than holistic approaches.
The use of local features also makes estimating local crowd density across the scene, and performing crowd counting across a network of multiple overlapping cameras possible. Analysing crowd densities at specific locations in a scene will enable the detection of local abnormalities. For example, a high-density crowd concentrated at one location may require attention, even if the holistic count for the scene is at a safe level. The use of multiple cameras will enable larger environments to be covered and monitored, as well as increasing accuracy in areas of overlap (due to the observations from multiple view points). Both these extensions will be investigated in the future. In addition,
future work will also focus on capturing additional data for further testing, and evaluating the proposed algorithm in conditions where there is poor segmentation performance, reduced image resolution, and erroneous ground truth labelling.

REFERENCES

[1] S. Ali and M. Shah. A lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–6, 2007.
[2] O. Arandjelović. Crowd detection from still images. In Proc. British Machine Vision Conference, 1:523–532, 2008.
[3] H. Celik, A. Hanjalic, and E. Hendriks. Towards a robust solution to people counting. Image Processing, 2006 IEEE International Conference on, pages 2401–2404, Oct. 2006.
[4] A. Chan, Z.-S. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. CVPR 2008, pages 1–7, June 2008.
[5] A. Chan and N. Vasconcelos. Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):909–926, May 2008.
[6] A. Davies, J. H. Yin, and S. Velastin. Crowd monitoring using image processing. Electronics & Communication Engineering Journal, 7(1):37–47, Feb 1995.
[7] S. Denman, V. Chandran, and S. Sridharan. An adaptive optical flow technique for person tracking systems. Elsevier Pattern Recognition Letters, 28(10):1232–1239, 2007.
[8] P. Kilambi, E. Ribnick, A. J. Joshi, O. Masoud, and N. Papanikolopoulos. Estimating pedestrian counts in groups. Comput. Vis. Image Underst., 110(1):43–59, 2008.
[9] D. Kong, D. Gray, and H. Tao. A viewpoint invariant approach for crowd counting. Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, 3:1187–1190, 2006.
[10] S.-F. Lin, J.-Y. Chen, and H.-X. Chao. Estimation of number of people in crowded scenes using perspective transformation. Systems, Man and Cybernetics, Part A, IEEE Transactions on, 31(6):645–654, Nov 2001.
[11] A. Marana, L. Da Fontoura Costa, R. Lotufo, and S. Velastin. Estimating crowd density with minkowski fractal dimension. ICASSP '99, 6:3521–3524 vol.6, Mar 1999.
[12] A. Marana, S. Velastin, L. Costa, and R. Lotufo. Estimation of crowd density using image processing. Image Processing for Security Applications (Digest No.: 1997/074), IEE Colloquium on, pages 11/1–11/8, Mar 1997.
[13] O. Masoud and N. Papanikolopoulos. A novel method for tracking and counting pedestrians in real-time using a single camera. Vehicular Technology, IEEE Transactions on, 50(5):1267–1278, Sep 2001.
[14] N. Paragios and V. Ramesh. A mrf-based approach for real-time subway monitoring. In 2001 Conference on Computer Vision and Pattern Recognition (CVPR 2001), pages 1034–1040, Dec. 2001.
[15] H. Rahmalan, M. Nixon, and J. Carter. On crowd density estimation for surveillance. Crime and Security, 2006. The Institution of Engineering and Technology Conference on, pages 540–545, June 2006.
[16] T. Zhao and R. Nevatia. Bayesian human segmentation in crowded situations. Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2:II–459–66 vol.2, June 2003.