Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond

arXiv:2201.03176v2 [[Link]] 2 Mar 2022

Abstract—Pedestrian detection is the cornerstone of many vision-based applications, ranging from object tracking to video
surveillance and, more recently, autonomous driving. With the rapid development of deep learning in object detection, pedestrian
detection has achieved very good performance in the traditional single-dataset training and evaluation setting. However, in this study on
generalizable pedestrian detectors, we show that current pedestrian detectors handle even small domain shifts poorly in cross-dataset
evaluation. We attribute the limited generalization to two main factors: the methods and the current sources of data. Regarding the
methods, we illustrate that the bias present in the design choices (e.g., anchor settings) of current pedestrian detectors is the main
contributing factor to the limited generalization. Most modern pedestrian detectors are tailored to a target dataset, where they do
achieve high performance in the traditional single-dataset training and testing pipeline, but suffer a degradation in performance when
evaluated through cross-dataset evaluation. Consequently, a general object detector performs better in cross-dataset evaluation than
state-of-the-art pedestrian detectors, owing to its generic design. As for the data, we show that autonomous driving benchmarks are
monotonous in nature, that is, they are neither diverse in scenarios nor dense in pedestrians. Therefore, benchmarks curated by crawling
the web (which contain diverse and dense scenarios) are an efficient source of pre-training for providing a more robust representation.
Accordingly, we propose a progressive fine-tuning strategy which improves generalization. Additionally, this work also investigates
recent Transformer networks as backbones to test generalization. We demonstrate that, as of now, CNNs outperform Transformer
networks in terms of generalization and in absorbing large-scale datasets for learning robust representations. In conclusion, this paper
suggests a paradigm shift towards cross-dataset evaluation for the next generation of pedestrian detectors. Code and models can be
accessed at [Link]

Index Terms—Pedestrian detection, Object detection, Generalizable pedestrian detection, Autonomous driving, Surveillance
Fig. 1: Left: Pedestrian detection performance over the years for Caltech, CityPersons and EuroCityPersons on the reasonable subset.
EuroCityPersons was released in 2018, but we include results of a few older models on it as well. The dotted line marks human
performance on Caltech. Right: Comparison between traditional single-dataset train-and-test evaluation on Caltech [13] and
cross-dataset evaluation for three pedestrian detectors and one general object detector (Cascade R-CNN). Methods enclosed in
bounding boxes are trained on CityPersons [52] and evaluated on Caltech [13], while the others are trained on Caltech.
in Table 4. Thirdly, these datasets have limited diversity, as they are captured during a small number of sessions by a small team, primarily for dataset creation. These days, dashcam videos are widely available online, e.g., on YouTube and Facebook, enabling the potential curation of much more diverse and realistic datasets. In recent years, a few large and diverse person detection datasets, e.g., CrowdHuman [39], WiderPerson [54] and Wider Pedestrian [31], have been created using images and videos available online and through surveillance cameras. These datasets advance general person detection research significantly, but they are not the most suitable datasets for pedestrian detection, as they contain people in far more diverse scenarios than are relevant for autonomous driving. Nevertheless, they are still beneficial for learning a more general and robust model of pedestrians, as they contain more people per image and are likely to contain more human poses, appearances and occlusion scenarios, which is beneficial for autonomous driving scenarios, provided current pedestrian detectors have the capacity to digest these large-scale data.

In this paper, we investigate the performance characteristics of current pedestrian detection methods in the cross-dataset setting. We show that 1) the existing methods fare poorly compared to general object detectors, without any adaptations, when provided with larger and more diverse datasets, and 2) when carefully trained, state-of-the-art general object detectors, without any pedestrian-specific adaptation on the target data, can significantly outperform pedestrian-specific detection methods on the pedestrian detection task (see Fig. 1 right). In addition, we propose a progressive training pipeline for better utilization of general person datasets to improve pedestrian detection performance in autonomous driving scenarios. We show that by progressively fine-tuning the models, from the dataset furthest from the target domain to the dataset closest to the target domain, large gains in performance can be achieved in terms of MR−2 on the reasonable subset of Caltech (3.7%) and CityPersons (1.5%) without fine-tuning on the target domain. These improvements hold true for models from all pedestrian detection families that we tested, such as Cascade R-CNN [8], Faster R-CNN [38] and embedded-vision-based backbones such as MobileNet [21]. Finally, we also compare the generalization ability of CNNs against the recent Transformer network (Swin Transformer) [30]. We illustrate that, despite the superior performance of Swin Transformer [30], it struggles when the domain shift is large, in comparison with CNNs. To the best of our knowledge, this is the first study to objectively illustrate this.

The paper is organized in the following way. Section 2 reviews the relevant literature. We introduce the datasets and evaluation protocol in Sec. 3. We benchmark our baseline in Sec. 4. We test the generalization capabilities of pedestrian-specific and general object detectors in Sec. 5, along with qualitative results. Subsequently, we compare CNNs with Transformer networks in Sec. 6. We also discuss the effect of fine-tuning on the target set in Sec. 7. Finally, we conclude the paper in Section 8.

2 RELATED WORK

Pedestrian detection. Prior to CNNs, the pioneering work of Viola and Jones [43], which slid windows over all scales and locations, motivated many pedestrian detection methods. To better describe the features of pedestrians, the Histogram of Oriented Gradients (HOG) was presented in the work of Dalal and Triggs [11]. The aggregate channel feature (ACF) detector leveraged features in extended channels to improve the speed of pedestrian detection [12]. In similar ways, the pedestrian detectors in [52], [36] employed spatial pooling with low-level features and filtered channel features, respectively. Nonetheless, their performance and generalization ability were still limited by the hand-crafted features.

With the great progress of Convolutional Neural Networks (CNNs), they came to dominate the research field of generic object detection [38], [19], [42], [26] and considerably improved its accuracy. Pedestrian detectors [1], [20], [7] also benefit from this powerful paradigm. The R-CNN detector [16] was utilized in some of the pioneering efforts for pedestrian detection using CNNs [20], [51] and is still widely employed in this research field. RPN+BF [50] combined a Region Proposal Network and a boosted forest to enhance the performance of pedestrian detection, which overcame the problems of poor resolution and imbalanced classes
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3
in Faster R-CNN [38]. Although the performance of RPN+BF was outstanding, its learning ability was limited by the non-end-to-end trainable architecture. Owing to the strong performance and high extensibility of Faster R-CNN [38], it inspired a broad spectrum of pedestrian detectors [55], [52], [6], [5], [34].

Some pedestrian detection methods designed more sophisticated architectures and leveraged extra information to further boost detection performance. ALF [28] employed several progressive detection heads on the Single Shot MultiBox Detector (SSD) [26] to gradually refine the initial anchors, which inherited the merit of high efficiency from single-stage detectors and further improved detection accuracy. Inspired by blob detection, CSP [29] reformulated the pedestrian detection task in an anchor-free manner, which only needs to locate the center points and regress the scales of pedestrians without relying on complicated anchor-box settings. To improve detection performance on occluded pedestrians, extra information from the visible-area bounding box was utilized as guidance for the attention mask in MGAN [37].

Pedestrian detection benchmarks. Due to the large practical value of pedestrian detection, a lot of work has been devoted to creating benchmarks to promote the development of pedestrian detection, such as Daimler-DB [35], TownCenter [2], USC [47], INRIA [11], ETH [14] and TUDBrussels [46], which were all from surveillance scenarios and not suitable for applications in autonomous driving. Recently, the great progress of pedestrian detection also attracted the attention of the autonomous driving community, and several datasets were created for this context, such as Caltech [13], KITTI [15], CityPersons [52] and ECP [4]. The cameras in these datasets were typically installed on the front windshield of cars to collect images from a similar field of view as human drivers. Caltech [13] and CityPersons [52] are the most popular benchmarks for recent learning-based pedestrian detectors, but their small data sizes and monotonous scenarios limit their usefulness for training more robust methods. To address these limitations, the ECP [4] dataset collected images from diverse scenarios including various cities, all seasons, and day and night times, and it contains almost ten times more images and eight times more persons than CityPersons [52]. Although ECP [4] has a much larger scale, it still suffers from a low density of persons and high similarity of background scenes, which could be the focus of future datasets. Thus, in this work, we argue that the low density and diversity of these datasets constrain the generalization ability of pedestrian detectors, while web-crawled datasets such as CrowdHuman [39], WiderPerson [54] and Wider Pedestrian [31], which include much more diverse scenes and denser crowds, may increase the upper bound of pedestrian detectors' generalization ability.

Cross-dataset evaluation. Some existing works [52], [4], [39] explored the relations between the performance of pedestrian detectors and training datasets, whose purpose was to show how much performance advantage could be obtained on target datasets by pre-training on more diverse and dense datasets. In this work, by contrast, we aim to thoroughly evaluate the generalization abilities of some popular pedestrian detection methods using cross-dataset evaluation.

3 EXPERIMENTS

3.1 Experimental Settings

Datasets. We conduct extensive experiments on three public pedestrian detection datasets collected from the autonomous driving scenario to evaluate and compare with state-of-the-art pedestrian detection algorithms. These three benchmarks, Caltech [13], CityPersons [52] and EuroCity Persons [4], are categorized as the autonomous driving datasets in this work. Caltech [13] is one of the most popular datasets in the research field of pedestrian detection. It recorded 10 hours of video in Los Angeles, USA with a front-view vehicle camera, and contains roughly 43K images and 13K persons extracted from the video. We perform the evaluations on the refined Caltech annotations from [51]. Compared to Caltech, CityPersons [52] is built upon the Cityscapes dataset and contains more diverse scenarios. It was recorded from street scenes of different cities in and close to Germany. CityPersons contains 2,975, 500 and 1,575 images in the training, validation and testing sets, respectively, and provides full bounding boxes and visible bounding boxes for 31k pedestrians. EuroCity Persons (ECP) [4] is a recently released large-scale dataset recorded in 31 different European cities, which contains more diverse scenarios and is more challenging for pedestrian detectors compared to Caltech and CityPersons. It contains two subsets, ECP day-time and ECP night-time, based on the recording time, and has roughly 200K bounding boxes. Similar to the evaluation procedure in ECP [4], we conduct the experiments on the ECP day-time subset for fair comparison with the existing literature. Unless otherwise stated, all experimental results are from the validation sets, since frequent submission to the online testing server is not allowed. In addition to the autonomous driving datasets of Caltech, CityPersons and ECP, we further conduct experiments on two web-crawled datasets, CrowdHuman and Wider Pedestrian1 [31]. We provide more details of the above datasets in Table 2.

Evaluation protocol. We evaluate the performance of pedestrian detectors with the widely used metric of log-average miss rate over False Positives Per Image (FPPI), computed over the range [10−2, 100] (MR−2), on Caltech [13], CityPersons [52] and ECP [4]. Experimental results on different occlusion levels, including Reasonable, Small, Heavy, Heavy*2 and All, are reported unless stated otherwise. Table 1 provides the specific settings of each set.

TABLE 1: Experimental settings.

Setting      Height     Visibility
Reasonable   [50, inf]  [0.65, inf]
Small        [50, 75]   [0.65, inf]
Heavy        [50, inf]  [0.2, 0.65]
Heavy*       [50, inf]  [0.0, 0.65]
All          [20, inf]  [0.2, inf]

Cross-dataset evaluation. To evaluate the generalization ability of pedestrian detectors, we perform cross-dataset evaluation by using only the training set of dataset A to train models and directly testing them on the validation/testing set of dataset B. This training and testing procedure is consistent for all experiments and is denoted as A→B.

Baseline. Because most of the high-performance pedestrian detectors on the Caltech, CityPersons and ECP datasets are built upon the two-stage detectors Faster/Mask R-CNN [38],

1. Wider Pedestrian contains images from the scenarios of autonomous driving and surveillance. The data provided in the 2019 challenge was used in our experiments. Data can be downloaded from: [Link] org/competitions/20132
2. In the case of CityPersons, for fair comparison with some previous methods under the same setting, we also report the numbers under the visibility level between [0.0, 0.65], which is denoted as Heavy* occlusion.
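The MR−2 protocol above can be made concrete with a short sketch. The helpers below are a simplified illustration, not the official Caltech evaluation code: `subset_mask` applies the height/visibility ranges of Table 1 (default: Reasonable), and `log_average_miss_rate` averages a detector's miss rate, in log space, at nine FPPI reference points log-spaced over [10−2, 100]. The miss-rate-vs-FPPI curve is assumed to be given.

```python
import math

def subset_mask(heights, visibilities,
                h_range=(50, float("inf")), v_range=(0.65, float("inf"))):
    """Select ground-truth boxes for one setting of Table 1 (default: Reasonable)."""
    return [h_range[0] <= h <= h_range[1] and v_range[0] <= v <= v_range[1]
            for h, v in zip(heights, visibilities)]

def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1.0, n_points=9):
    """MR^-2: geometric mean of the miss rate sampled at n_points FPPI values
    log-spaced over [lo, hi]. `fppi` must be sorted ascending; (fppi[i],
    miss_rate[i]) is one operating point of the detector's curve."""
    # reference FPPI values, evenly spaced in log-space
    samples = [10 ** (math.log10(lo) + i * (math.log10(hi) - math.log10(lo)) / (n_points - 1))
               for i in range(n_points)]
    rates = []
    for s in samples:
        # miss rate at the last operating point whose FPPI does not exceed s;
        # if the curve never reaches that FPPI, fall back to its first point
        candidates = [m for f, m in zip(fppi, miss_rate) if f <= s]
        rates.append(candidates[-1] if candidates else miss_rate[0])
    # geometric mean, as in the standard Caltech protocol
    return math.exp(sum(math.log(max(r, 1e-10)) for r in rates) / len(rates))
```

For a detector whose miss rate is constant at 0.5 over the whole FPPI range, the log-average is simply 0.5, which is a convenient sanity check for the implementation.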
TABLE 3: Evaluating generalization abilities of different backbones using our baseline detector.
Backbone Training Testing Reasonable
HRNet WiderPedestrian + CrowdHuman CityPersons 12.8
ResNeXt WiderPedestrian + CrowdHuman CityPersons 12.9
ResNet-101 WiderPedestrian + CrowdHuman CityPersons 15.8
ResNet-50 WiderPedestrian + CrowdHuman CityPersons 16.0
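Since MR−2 is an error metric (lower is better), the backbone comparison in Table 3 reduces to sorting the per-backbone numbers. The snippet below simply restates the table's values to make the ranking explicit:

```python
# MR^-2 (%) on CityPersons for each backbone, copied from Table 3 (lower is better)
table3 = {"HRNet": 12.8, "ResNeXt": 12.9, "ResNet-101": 15.8, "ResNet-50": 16.0}

# rank backbones from best (lowest error) to worst
ranking = sorted(table3, key=table3.get)
```

The two top-ranked backbones are HRNet and ResNeXt, matching the discussion of the baseline below.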
[19], the more powerful multi-stage detector Cascade R-CNN [8], which also belongs to the R-CNN family, is chosen as the baseline. In this work, the terms baseline and Cascade R-CNN are used interchangeably, both referring to the same method of Cascade R-CNN [8]. Multiple detection heads are applied step by step in Cascade R-CNN to gradually filter out false-positive anchors and generate increasingly high-quality proposals. We equipped our baseline with different backbones to evaluate its robustness; the details of these experiments are shown in Table 3. Among them, ResNeXt [48] and HRNet [44] show the top-ranked performance. Unless stated otherwise, HRNet [44] is used as the default backbone network of our baseline. HRNet simultaneously processes multi-level feature maps in a parallel way, retaining both low-level details and high-level semantic information, which may greatly benefit pedestrian detection under large scale variations.

4 BENCHMARKING

The benchmarking results of our baseline, Cascade R-CNN [8], on three autonomous driving datasets, Caltech [13], CityPersons [52] and ECP [4], are presented in Table 4. Without any "bells and whistles", our baseline achieves performance comparable to the specially customized pedestrian detectors on Caltech and CityPersons. Interestingly, the performance gap between our baseline and the state-of-the-art algorithms changes as the sizes of the datasets increase. The relative performance of our baseline is lowest on the smallest dataset, Caltech, and improves significantly on the largest dataset, ECP.

TABLE 4: Benchmarking on autonomous driving datasets.

Method             Testing      Reasonable  Small  Heavy
ALFNet [28]        Caltech      6.1         7.9    51.0
RepLoss [45]       Caltech      5.0         5.2    47.9
CSP [29]           Caltech      5.0         6.8    46.6
Cascade R-CNN [8]  Caltech      6.2         7.4    55.3
RepLoss [45]       CityPersons  13.2        -      -
ALFNet [28]        CityPersons  12.0        19.0   48.1
CSP [29]           CityPersons  11.0        16.0   39.4
Cascade R-CNN [8]  CityPersons  11.2        14.0   37.1
Faster R-CNN [4]   ECP          7.3         16.6   52.0
YOLOv3 [4]         ECP          8.5         17.8   37.0
SSD [4]            ECP          10.5        20.5   42.0
Cascade R-CNN [8]  ECP          6.6         13.6   33.3

5 GENERALIZATION CAPABILITIES

As previously mentioned, existing works evaluate pedestrian detectors in the traditional manner, where training and evaluation data are from the same domain, i.e., within-dataset evaluation. However, we argue that this algorithm development pipeline ignores the generalization capability of pedestrian detectors and makes it easy to over-fit to a specific dataset. Thus, in this work, we emphasize the importance of cross-dataset evaluation in the design of pedestrian detectors: cross-dataset evaluation clearly shows how well pedestrian detectors perform on unseen domains. Extensive cross-dataset experiments are therefore conducted in this section to evaluate the robustness of pedestrian detectors.

5.1 Dataset Illustrations

We showcase some examples of datasets related to pedestrian detection in Figure 2. The top row depicts different scenarios in diverse and dense datasets collected by crawling the web. The bottom row illustrates images from traditional autonomous driving datasets. It can be observed that web-crawled datasets provide a more enriched representation of pedestrians, since they cover several scenarios, such as different poses, illumination conditions and different types of occlusion, whereas autonomous driving benchmarks are monotonous in nature, i.e., same background, view-point, etc. Interestingly, ECP [4] and CityPersons [52] show a striking resemblance (where the camera is mounted, image resolution, geographical location, etc.); this further stresses the point that even when the target domains are not drastically different, current pedestrian detectors do not generalize well (cf. Tables 5, 6 and 7 in the paper).

5.2 Cross-Dataset Evaluation of Existing State-of-the-Art

In this section we demonstrate that existing state-of-the-art pedestrian detectors generalize worse than a general object detector. We show that this is mainly due to the biases in the design of methods for the target set, even when other factors, such as the backbone, are kept consistent.

To see how well state-of-the-art pedestrian detectors generalize to different datasets, we performed cross-dataset evaluation of five state-of-the-art pedestrian detectors and our baseline (Cascade R-CNN) on the CityPersons [52] and Caltech [13] datasets. We evaluated the recently proposed BGCNet [23], CSP [29], PRNet [41], ALFNet [28] and FRCNN [52] (tailored for pedestrian detection). Furthermore, alongside the baseline, we added Faster R-CNN [38] without "bells and whistles", but with a more recent backbone, ResNeXt-101 [48] with FPN [24]. Moreover, we implemented a vanilla FRCNN [52] with VGG-16 [40] as a backbone and with none of the pedestrian-specific adaptations proposed in [52] (namely quantized anchors, input scaling, finer feature stride, Adam solver, ignore-region handling, etc.).
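The A→B protocol used throughout these experiments reduces to a simple harness. The sketch below is a hypothetical illustration, not the paper's actual code: `train_on` and `evaluate_on` are placeholder callables standing in for the real training and evaluation routines, and `evaluate_on` is assumed to return MR−2 (lower is better). Each detector is fitted once on the source training set and then scored, unchanged, on every target set.

```python
def cross_dataset_eval(detectors, source, targets, train_on, evaluate_on):
    """Train each detector on `source` only, then test it unchanged on each target.

    detectors:   {name: untrained detector / config}
    train_on:    callable(detector, dataset_name) -> trained model
    evaluate_on: callable(model, dataset_name) -> MR^-2 score
    Returns {(detector_name, "source->target"): mr2}.
    """
    results = {}
    for name, detector in detectors.items():
        model = train_on(detector, source)   # target-domain data is never touched
        for target in targets:
            results[(name, f"{source}->{target}")] = evaluate_on(model, target)
    return results
```

Within-dataset numbers (e.g. CityPersons→CityPersons) fall out of the same loop by including the source itself among the targets, which is how the readability columns of Table 5 can be produced alongside the cross-dataset ones.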
Fig. 2: Illustration of benchmarks. The top row shows images from diverse and dense datasets, such as CrowdHuman [39] and Wider
Pedestrian [31]. The bottom row presents images from autonomous driving benchmarks: ECP [4], CityPersons [52] and Caltech [13].
TABLE 5: Cross dataset evaluation on Caltech and CityPersons. A→B refers to training on A and testing on B.
TABLE 6: Cross-dataset evaluation of Cascade R-CNN and CSP on autonomous driving benchmarks. Both detectors are trained with
HRNet as the backbone.
Method Training Testing Reasonable Small Heavy
Casc. RCNN CityPersons CityPersons 11.2 14.0 37.0
CSP CityPersons CityPersons 9.4 11.4 36.7
Casc. RCNN ECP CityPersons 10.9 11.4 40.9
CSP ECP CityPersons 11.5 16.6 38.2
Casc. RCNN ECP ECP 6.9 12.6 33.1
CSP ECP ECP 19.4 50.4 57.3
Casc. RCNN CityPersons ECP 17.4 40.5 49.3
CSP CityPersons ECP 19.6 51.0 56.4
Casc. RCNN CityPersons Caltech 8.8 9.8 28.8
CSP CityPersons Caltech 10.1 13.3 34.4
Casc. RCNN ECP Caltech 8.1 9.6 29.9
CSP ECP Caltech 10.4 13.7 31.3
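Table 6 above allows a quick consistency check of the CityPersons→ECP comparison. The snippet below recomputes, from the table's values, the per-setting margin by which Cascade R-CNN outperforms CSP:

```python
# MR^-2 (%) from Table 6, CityPersons -> ECP rows (lower is better)
casc = {"reasonable": 17.4, "small": 40.5, "heavy": 49.3}
csp  = {"reasonable": 19.6, "small": 51.0, "heavy": 56.4}

# margin by which Cascade R-CNN beats CSP in each setting
gaps = {k: round(csp[k] - casc[k], 1) for k in casc}
```

The resulting gaps (2.2, 10.5 and 7.1 points in the reasonable, small and heavy settings) agree with the differences quoted in the discussion of Sec. 5.3.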
We present results for Caltech and CityPersons in Table 5. For readability, we also report results when training is done on the target dataset. For the results presented in Table 5 (fourth column, CityPersons→Caltech), we trained each detector on CityPersons and tested it on Caltech. Similarly, in the last column of Table 5, all detectors were trained on Caltech and evaluated on the CityPersons benchmark. As expected, all methods suffer a performance drop when trained on CityPersons and tested on Caltech. In particular, BGCNet [23], CSP [29], ALFNet [28] and FRCNN [52] degraded by more than 100% (in comparison with the fifth column, Caltech→Caltech), whereas in the case of Cascade R-CNN [8], performance remained comparable to the model trained and tested on the target set. Since CityPersons is a relatively diverse and dense dataset in comparison with Caltech, this performance deterioration cannot be linked to dataset scale and crowd density. This illustrates the better generalization ability of
general object detectors over state-of-the-art pedestrian detectors. Moreover, it is noteworthy that BGCNet [23], like Cascade R-CNN [8], also uses HRNet [44] as a backbone, making it directly comparable to Cascade R-CNN [8].

Importantly, the pedestrian-specific FRCNN [52] performs worse in the cross-dataset setting (fourth column only) than its direct variant, vanilla FRCNN. The only difference between the two is the pedestrian-specific adaptations for the target set, highlighting the bias in the design of tailored pedestrian detectors. Similarly, standard Faster R-CNN [38], though it performs worse than FRCNN [52] when trained and tested on the target dataset, performs better than FRCNN [52] when it is evaluated on Caltech without any training on Caltech.

It is noteworthy that Faster R-CNN [38] also outperforms state-of-the-art pedestrian detectors (except for BGCNet [23]) in cross-dataset evaluation, as presented in Table 5. We again attribute this to the bias present in the design of current state-of-the-art pedestrian detectors, which are tailored for specific datasets and therefore limited in their generalization ability. Moreover, a significant performance drop for all methods (though the ranking is preserved, except for vanilla FRCNN), including Cascade R-CNN [8], can be seen in the last column of Table 5. This performance drop, however, is attributed to the lack of diversity and density of the Caltech dataset: Caltech has fewer annotations than CityPersons, and the number of people per frame is less than 1, as reported in Table 2. Still, it is important to highlight that, even when trained on a limited dataset, general object detectors are usually better at generalization than state-of-the-art pedestrian detectors. Interestingly, Faster R-CNN's [38] error is nearly twice as high as that of BGCNet [23] in within-dataset evaluation, whereas it outperforms BGCNet [23] in cross-dataset evaluation.

As discussed previously, most pedestrian detection methods are extensions of general object detectors (FR-CNN, SSD, etc.) adapted to the task of pedestrian detection. These adaptations are often too specific to the dataset or detector/backbone (e.g., anchor settings [52], [28], finer stride [52], additional annotations [55], [37], constrained aspect ratios and fixed body-line annotation [29], [23], etc.). Such adaptations usually limit generalization, as shown in Table 5; likewise, task-specific configurations of anchors limit generalization, as discussed in [27].

5.3 Autonomous Driving Datasets for Generalization

We show that general object detectors outperform existing pedestrian detection methods (such as CSP [29]) at learning a generic feature representation for pedestrians, even when they are trained on a large dataset (such as ECP) and tested on a small dataset (such as Caltech). Furthermore, detectors achieve higher generalization from larger and denser autonomous driving datasets.

As shown in the last section, cross-dataset evaluation can shed light on the generalization capabilities of pedestrian detectors. Moreover, the characteristics of a dataset are also an important determinant of model generalization. The intrinsic nature of the real world can be captured more effectively by diverse datasets [4]. Consequently, such datasets potentially provide the chance for pedestrian detectors to learn a more generic feature representation that robustly tackles domain shifts. Instead of exploring the impact of the dataset on generalization as in existing studies [4], [39], [54], we aim at presenting a detailed comparison of a general object detector and state-of-the-art pedestrian detection methods when the training and testing datasets are varied. For fair comparison, the backbone of CSP [29] is replaced from ResNet-50 with HRNet [44]. As shown in the second row of Table 6, this change improves the performance of CSP from 11.0% MR−2 to 9.4% MR−2.

First, we train Cascade R-CNN and CSP on the ECP pedestrian detection dataset, which contains more countries and cities and is the largest benchmark with regard to pedestrian density and diversity in the context of autonomous driving. CityPersons [52] is chosen as the evaluation benchmark, and the results are shown in the third and fourth rows of Table 6, respectively. It is clear that, given the same backbone, Cascade R-CNN generalizes better to CityPersons than CSP in the reasonable setting. Considering that CSP significantly outperforms Cascade R-CNN by nearly 2% MR−2 when they are evaluated in the within-dataset setting, it is surprising to see the results turned around.

Secondly, we train CSP and Cascade R-CNN on CityPersons and evaluate them on ECP [4] to further study their generalization abilities under training datasets of different diversity. Similarly, even when the training dataset has low diversity, Cascade R-CNN still outperforms CSP. The performance difference is 10.5% MR−2, 7.1% MR−2 and 2.2% MR−2 in the small, heavy and reasonable settings, respectively.

Finally, we combine CityPersons and ECP as the training data and perform the evaluation on Caltech, which is the smallest data source. The results of Cascade R-CNN and CSP in all settings are shown in the last four rows of Table 6. We conclude that when we use a diverse and dense training dataset such as ECP, Cascade R-CNN has more robust performance than CSP on all evaluation subsets.

5.4 Diverse General Person Detection Datasets for Generalization

We study how much performance improvement dense and diverse datasets can bring. When the testing source is a small dataset from the context of autonomous driving, such as Caltech [13], diverse and dense datasets are still beneficial for generalization, even under large domain gaps between the training and evaluation datasets. Moreover, diverse and dense datasets bring more benefits to general object detection methods, such as Cascade R-CNN, than to specially tailored pedestrian detectors, such as CSP.

CrowdHuman [39] and Wider Pedestrian [31] are two diverse and dense pedestrian detection datasets collected from web crawling and surveillance cameras. Unlike the autonomous driving datasets, the crowd density and scenario diversity are large in CrowdHuman [39] and Wider Pedestrian [31], since they include images from diverse sources, such as street views and surveillance, which increases the data diversity in a different way. Thus, they are ideal sources for pre-training models. We pre-train Cascade R-CNN [8] and CSP [29] on the CrowdHuman [39] and Wider Pedestrian [31] datasets, and show the corresponding results in Table 7. It can be seen that pre-training significantly boosts the performance of pedestrian detectors. When tested on the Caltech dataset, Cascade R-CNN outperforms all previous methods that are trained only on Caltech, and the test error is reduced by nearly half. The same trend of performance improvement is observed in the results of CSP [29], although its improvement is smaller than that of Cascade R-CNN. The performance of neither Cascade R-CNN nor CSP improves on CityPersons [52] when they are trained on CrowdHuman [39]. This is reasonable because CityPersons [52]
TABLE 8: Investigating the effect on performance when CrowdHuman, Wider Pedestrian and ECP are merged and Cascade R-CNN
[8] is trained only on the merged dataset.
Method Training Testing Reasonable Small Heavy
Casc. RCNN CrowdHuman → ECP CP 10.3 12.6 40.7
CSP CrowdHuman → ECP CP 10.4 10.0 36.2
Casc. RCNN Wider Pedestrian → ECP CP 9.7 11.8 37.7
CSP Wider Pedestrian → ECP CP 9.8 14.6 35.4
Casc. RCNN CrowdHuman → ECP Caltech 2.9 11.4 30.8
CSP CrowdHuman → ECP Caltech 11.0 14.7 32.2
Casc. RCNN Wider Pedestrian → ECP Caltech 2.5 9.9 31.0
CSP Wider Pedestrian → ECP Caltech 8.6 12.0 30.3
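Each A → B entry in Table 8 denotes sequential training: pre-train on the source further from the target domain, then fine-tune on the closer one, without ever touching the target-domain training set. A minimal sketch of that schedule follows; `fit` is a placeholder for one round of (pre-)training or fine-tuning, since the actual training loop is not shown in this excerpt:

```python
def progressive_training(model, ordered_sources, fit):
    """Progressive pipeline: fit the model on each source dataset in turn,
    ordered from furthest-from-target to closest-to-target, e.g.
    ["CrowdHuman", "ECP"] for the CrowdHuman -> ECP row of Table 8.

    fit: callable(model, dataset_name) -> updated model.
    The target-domain training data never appears in `ordered_sources`.
    """
    for dataset in ordered_sources:
        model = fit(model, dataset)
    return model
```

Merging sources instead (the A + B strategy) would correspond to a single `fit` call on the concatenated data; as discussed below, that variant improves performance but does not match the progressive schedule.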
is more difficult than Caltech [13] with regard to density and diversity. Similar trends are observed in Table 6 when detectors are tested on ECP [4]: training on CityPersons [52] brings better performance than training on CrowdHuman [39]. From the bottom half of Table 7, we can see that the general object detector benefits more from training on Wider Pedestrian [31]. We hypothesize that this is because Wider Pedestrian [31] is larger in scale and more similar to the target domain than CrowdHuman [39]. The domain difference is reflected in the image scenarios: CrowdHuman [39] includes web-crawled persons with diverse poses, while Wider Pedestrian [31] images come mainly from street views and surveillance cameras.

5.5 Progressive Training Pipeline

We propose a progressive training pipeline to take full advantage of multi-source datasets and thus further improve pedestrian detection performance. This pipeline first trains detectors on a dataset that is farther from the target domain but general and diverse enough, and then fine-tunes them on a dataset that is similar to the target domain.

Extensive experiments are conducted to demonstrate the value of progressive training. To be in line with the study described in the previous section, the target-domain dataset is not touched, and only the training subset of each corresponding dataset is used in our pipeline. We use the symbol A → B to denote pre-training a model on dataset A and fine-tuning it on dataset B. Besides, the two datasets A and B can also be directly merged to train the model, which is denoted as A + B. In this section, the Caltech [13] and CityPersons [52] datasets are respectively used as the evaluation benchmark, and the corresponding results are shown in Table 8. The upper part of Table 8 clearly shows that the performance of Cascade RCNN can be significantly improved by the progressive training pipeline. Noticeably, without training on the CityPersons [52] dataset, Cascade R-CNN achieves results comparable to the state-of-the-art detectors through the progressive training pipeline of Wider Pedestrian [31] → ECP [4]. Besides, progressive training also helps Cascade R-CNN achieve new state-of-the-art results on the Caltech [13] dataset. It is worth noting that our performance on Caltech [13] is very close to the human baseline (0.88).

Finally, we show the experimental results of directly merging all datasets in the third and fourth rows of Table 8. This training strategy can also improve the performance, but it still cannot reach the performance of the progressive training pipeline, which demonstrates the value of pre-training on a general dataset and then fine-tuning on the autonomous driving dataset. Without touching the data in the target domain, our progressive training helps to effectively improve the pedestrian detection performance of state-of-the-art detectors. These experiments demonstrate that our training pipeline offers a way to significantly improve the generalization capability of Cascade R-CNN, putting it on a level with state-of-the-art detectors on CityPersons [52] and achieving the best performance on Caltech [13].

5.6 Application Oriented Models

In this section, we conduct experiments to show that pre-training on dense and diverse datasets can help a light-weight neural network architecture, MobileNet [21], achieve results competitive with state-of-the-art detectors, such as CSP, on the CityPersons [52] dataset.

The computational cost and model size of pedestrian detectors are important factors in many real-world applications, such as drones and autonomous driving cars, which require real-time detection and usually run on limited hardware. To study whether the progressive training pipeline is still effective in improving the performance of a light-weight backbone, we conduct experiments with a widely used light-weight backbone, MobileNet [21] v2, proposed for embedded and mobile computer vision tasks.

We replace the backbone of Cascade R-CNN [8] with a MobileNet [21], and present the results on CityPersons [52] in
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 8
Table 9. To establish reference performance, we train and test MobileNet [21] on CityPersons [52], and show its results in the first row of Table 9. Intuitively, the performance of MobileNet [21] is lower than that of HRNet [44]. However, the results demonstrate that progressive training, i.e., pre-training MobileNet [21] on CrowdHuman [39] and then fine-tuning on ECP [4], still effectively improves the detection performance on CityPersons [52]. Moreover, further improvement can be achieved by replacing the pre-training dataset CrowdHuman [39] with Wider Pedestrian [31]. As shown in the first and fourth rows, our progressive pipeline of Wider Pedestrian [31] → ECP [4] improves the performance by 0.6% MR−2 on the reasonable subset of CityPersons [52]. The results in Tables 9 and 7 both demonstrate that pre-training on the Wider Pedestrian [31] dataset brings a larger performance improvement than CrowdHuman [39] when the evaluation benchmark is CityPersons [52]. This is because Wider Pedestrian [31] includes autonomous driving scenarios and shares more common characteristics with the target domain. It is worth noting that our progressive training pipeline makes Cascade R-CNN with a light-weight MobileNet [21] backbone approach the performance of the state-of-the-art method, CSP [29], which is equipped with a much larger ResNet-50 backbone.

TABLE 9: Investigating the performance of an embedded vision model when pre-trained on diverse and dense datasets.

Training Testing Reasonable Small Heavy
CP CP 12.0 15.3 47.8
ECP CP 19.1 19.3 51.3
CrowdHuman → ECP CP 11.9 15.7 48.9
Wider Pedestrian → ECP CP 11.4 14.6 43.4

5.7 Qualitative Results

We show the detection quality of Cascade R-CNN† on different datasets in Figure 3. The top row contains results from Caltech, and the bottom row shows images from CityPersons and ECP. One can observe that Cascade R-CNN† is robust to crowd density, as the presented images show several instances of occlusion, people walking in close vicinity, etc. Furthermore, we extracted images from different regions across the globe under varying conditions, shown in Figure 4. Four different conditions are showcased; for example, the Netherlands shows rainy conditions, where people are wearing hooded jackets and rain coats and holding umbrellas. In Italy, the summer season is illustrated, and people are walking in the city center in close vicinity, often occluding each other. The winter season can be seen in Germany, as snow is visible. Finally, France illustrates a low-illumination scenario, where car headlights can be seen illuminating the scene (also bringing reflections and shadows into play). The aim of this figure is to demonstrate the robustness of Cascade R-CNN†: since pre-training is done on web-crawled datasets with several diverse and dense scenes, Cascade R-CNN† has learnt a representation capable of handling many real-world scenarios efficiently.

Finally, we show qualitatively in Figure 5 how pedestrian detectors such as CSP lack generalization compared with a general object detector such as Cascade RCNN. For Figure 5, we trained both detectors with HRNet on CityPersons and evaluated on ECP. We picked cases from different cities, under varying lighting conditions (afternoon, evening, etc.) and under different weathers, such as sunny, raining or snowing. Common failure cases for CSP include low-illumination/blurry pedestrians (Fig. 5 (a) and (d)), small-scale pedestrians (Fig. 5 (g) to (l)), and occlusion (Fig. 5 (b, c, e, f, k)). On the contrary, Cascade RCNN seems to be robust to such domain shifts and handles the above-mentioned challenging scenarios better than CSP. We argue that the probable reason behind CSP's poor generalization stems from the fact that it is a single-stage detector without feature alignment, in contrast to a two-stage detector like Cascade RCNN. Feature alignment improves generalization [10], and two-stage detectors have a specific module for feature alignment (RoI Align), leading to better-aligned features on unseen domains. This explicit feature alignment is absent in CSP and, on unseen domains, leads to failure cases, such as occlusion and small-scale pedestrians, where feature alignment becomes pivotal for precise localization.

6 HOW TRANSFORMERS FARE WITH RESPECT TO CNNS

In this section, we discuss the results when CNNs are pitted against the recent Transformer networks. Intuitively, transformer networks do outperform CNN-based backbones in direct and cross-dataset evaluation. However, when the domain gap is large and we increase the size of the training dataset along with more sources of pre-training, we quantitatively illustrate that CNNs are more robust to domain shifts and have a better ability to digest data than transformer networks. To the best of our knowledge, this is one of the first studies that objectively illustrates this.

To make the comparison fair, we used the same detector, Cascade R-CNN, with both backbone networks, HRNet and the Swin Transformer. We included the Swin Transformer for evaluation since it has achieved state-of-the-art performance on general object detection benchmarks. We start by benchmarking in Table 10. It can be observed that the Swin Transformer outperforms HRNet in direct and cross-dataset evaluation. This is intuitive, as the Swin Transformer also outperforms HRNet on general object detection, thanks to its shifted-window-based self-attention, which captures a more powerful representation than a CNN-based backbone such as HRNet. However, when the domain shift is large, i.e., the sources of pre-training are fixed to CrowdHuman and Wider Pedestrian and the models are tested on autonomous-driving-oriented datasets, we notice that the Swin Transformer is significantly outperformed by HRNet across all settings, as shown in Table 11. This finding illustrates that CNNs are more tolerant to domain shifts, especially if the shifts are not subtle. In line with the studies in the sections above, we also test the Swin Transformer through our progressive training pipeline. In Table 12, we pre-train on a diverse general person detection dataset and fine-tune on ECP, which is closer to the target domain. Except for the third row, HRNet outperformed the Swin Transformer on all datasets. This trend persists even when we add the target set to the progressive training pipeline, as shown in Table 13. We attribute this partly to the fact that the hyper-parameters of transformer networks are not yet as well optimized as they are for CNNs. Moreover, transformer networks are also more prone to overfitting than CNNs. Nonetheless, it is still relatively early for transformer networks compared with CNNs, which have been used extensively for nearly a decade and whose hyperparameters have been tuned quite thoroughly.
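The progressive pipeline used throughout Sections 5 and 6 (A → B: pre-train on the diverse general dataset, then fine-tune on one closer to the target domain) and the merged A + B baseline can be sketched as follows. This is a framework-agnostic sketch: `train_fn`, the two learning rates, and the list-style dataset merge are our placeholder assumptions, not the paper's actual training code.

```python
def progressive_train(model, general_data, near_target_data, train_fn):
    """A -> B schedule: pre-train on the general, diverse dataset,
    then fine-tune the same weights on data closer to the target
    domain, with a reduced learning rate for the second stage."""
    model = train_fn(model, general_data, lr=0.02)       # stage A: pre-train
    model = train_fn(model, near_target_data, lr=0.002)  # stage B: fine-tune
    return model


def merged_train(model, general_data, near_target_data, train_fn):
    """A + B baseline: a single training run on the merged data,
    which Table 8 shows trails the two-stage schedule."""
    return train_fn(model, general_data + near_target_data, lr=0.02)
```

Here `train_fn(model, data, lr)` stands for any routine that runs one training stage and returns the updated model; the 10× learning-rate drop for fine-tuning is a common heuristic, not a value reported in the paper.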
Fig. 3: Qualitative results of Cascade R-CNN† on different benchmarks. The top row includes examples from Caltech, the bottom left CityPersons, and the bottom right ECP.

Fig. 4: Cascade R-CNN† across different scenarios, such as summer, winter, rain and low illumination, illustrating the robustness of the general object detector.
7 FINETUNING ON THE TARGET SET

Finally, we add the training part of our target set to our progressive training pipeline, as illustrated in Table 14. Although this paper stresses the importance of cross-dataset evaluation, for the sake of completeness, we also include results for target-set fine-tuning. Visible improvement across all splits can be observed for both methods. Interestingly, on CityPersons, CSP becomes better than Cascade RCNN. We attribute this to the fact that CSP's design choices are optimized for CityPersons (at the cost of generalization). Therefore, it benefits more from target-set fine-tuning on CityPersons than a general-design object detector does. However, both methods are still comparable, and in the case of Caltech, Cascade R-CNN outperforms CSP significantly. Moreover, the performance on Caltech is nearly of the same order as that of humans (0.88). This leads to the conclusion that the Caltech dataset is almost solved. Next-generation pedestrian detectors should use Caltech as a reference, but focus on more challenging datasets such as ECP and CityPersons. Importantly, a general object detector such as Cascade RCNN, without fine-tuning on the target set (cf. Table 8), can already achieve results comparable to those with fine-tuning as in Table 14, making it practical, ready to use and more suitable for real-world problems.

Fig. 5: Generalization comparison between CSP [29] with HRNet and Cascade RCNN [8]. Both methods are trained on CityPersons and evaluated on ECP. Images with yellow bounding boxes are CSP's detections, whereas magenta bounding boxes are Cascade RCNN's detections. Dotted green lines illustrate instances where CSP fails to detect.

7.1 Quantitative Results on Leaderboard

We further evaluated the proposed training pipeline along with an ensemble of our two models, one pre-trained on CrowdHuman [39] and the other on Wider Pedestrian [31]. Ensembling is performed by combining the detections, followed by non-maximum suppression using soft-nms [3]. Final results are evaluated on the dedicated servers (see footnote 3; test set annotations are withheld) of CityPersons [52] and ECP [4], which are maintained by the benchmark publishers and constrain the frequency of submissions. Moreover, we have included results only for the published methods (detailed evaluations of all methods can be seen at the URLs provided in footnote 3). Results are presented in Tables 15 and 16. Our submissions (Cascade RCNN) achieve 1st and 2nd place on the two leaderboards, respectively. These results serve as a reference for future methods. However, no other method, to the best of our knowledge, uses extra training data, giving our submissions an unfair advantage. Finally, as stated above, fine-tuning on the target set is not the goal of this paper, and in many cases it is not practical. In this work, we argue in favor of cross-dataset evaluation and its

3. [Link] [Link]
‡: Corresponds to our submissions, which use additional training data.

TABLE 10: Cross-dataset evaluation of HRNet and Swin-Trans. on autonomous driving benchmarks. Both backbones are used with Casc. R-CNN as the detector.

Method Training Testing Reasonable Small Heavy
HRNet CityPersons CityPersons 11.2 14.0 37.0
Swin-Trans CityPersons CityPersons 9.2 10.5 36.9
HRNet ECP CityPersons 10.9 11.4 40.9
Swin-Trans ECP CityPersons 10.2 12.9 39.6
HRNet ECP ECP 6.9 12.6 33.1
Swin-Trans ECP ECP 4.5 9.6 25.2
HRNet CityPersons ECP 17.4 40.5 49.3
Swin-Trans CityPersons ECP 14.8 28.3 50.2

TABLE 11: Cross-dataset evaluation of HRNet and Swin-Trans. when the sources of training are fixed to Wider Pedestrian and CrowdHuman.

Method Training Testing Reasonable Small Heavy
HRNet CrowdHuman Caltech 3.4 11.2 32.3
Swin-Trans CrowdHuman Caltech 10.1 37.2 55.1
HRNet CrowdHuman CityPersons 15.1 21.4 49.8
Swin-Trans CrowdHuman CityPersons 16.7 24.3 55.0
HRNet CrowdHuman ECP 17.9 36.5 56.9
Swin-Trans CrowdHuman ECP 21.1 42.8 56.0
HRNet Wider Pedestrian Caltech 3.2 10.8 31.7
Swin-Trans Wider Pedestrian Caltech 9.8 30.1 55.6
HRNet Wider Pedestrian CityPersons 16.0 21.6 57.4
Swin-Trans Wider Pedestrian CityPersons 13.9 32.5 57.7
HRNet Wider Pedestrian ECP 16.1 32.8 58.0
Swin-Trans Wider Pedestrian ECP 18.1 32.5 65.8

TABLE 12: Performance of HRNet and Swin-Trans. under the progressive training pipeline: pre-training on CrowdHuman or Wider Pedestrian, then fine-tuning on ECP.

Method Training Testing Reasonable Small Heavy
HRNet CrowdHuman → ECP CP 10.3 12.6 40.7
Swin Trans. CrowdHuman → ECP CP 11.0 12.4 43.4
HRNet Wider Pedestrian → ECP CP 9.7 11.8 37.7
Swin Trans. Wider Pedestrian → ECP CP 9.5 10.8 39.7
HRNet CrowdHuman → ECP Caltech 2.9 11.4 30.8
Swin Trans. CrowdHuman → ECP Caltech 8.0 28.0 54.4
HRNet Wider Pedestrian → ECP Caltech 2.5 9.9 31.0
Swin Trans. Wider Pedestrian → ECP Caltech 8.8 28.1 33.9

TABLE 13: Evaluation of HRNet and Swin-Trans. after fine-tuning on the target set.

Method Training Testing Reasonable Small Heavy
HRNet CrowdHuman → ECP → CP CP 8.0 8.5 27.0
SwinTrans. CrowdHuman → ECP → CP CP 9.1 10.0 30.9
HRNet Wider Pedestrian → ECP → CP CP 7.5 8.0 28.0
SwinTrans. Wider Pedestrian → ECP → CP CP 8.9 10.4 33.8
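The ensembling step in Sec. 7.1 concatenates the detections of the two models and then applies soft-NMS [3]. Below is a minimal pure-Python sketch of the Gaussian variant; the `sigma` and `score_thresh` defaults are illustrative, not the values used in our submissions.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_nms(dets, sigma=0.5, score_thresh=0.001):
    """Greedy soft-NMS with Gaussian decay: rather than deleting boxes
    that overlap the current best detection, decay their scores by
    exp(-IoU^2 / sigma). dets is a list of (box, score) pairs; returns
    the rescored detections in the order they were selected."""
    dets = [(tuple(b), float(s)) for b, s in dets]
    kept = []
    while dets:
        best = max(range(len(dets)), key=lambda k: dets[k][1])
        box, score = dets.pop(best)
        kept.append((box, score))
        # decay overlapping neighbours instead of removing them outright
        dets = [(b, s * math.exp(-iou(box, b) ** 2 / sigma)) for b, s in dets]
        dets = [(b, s) for b, s in dets if s > score_thresh]
    return kept
```

To ensemble two detectors as described above, merge both models' (box, score) lists for an image into a single list and pass it to `soft_nms`; duplicate detections of the same pedestrian are then down-weighted rather than double-counted.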
[22] C. Jiang, H. Xu, W. Zhang, X. Liang, and Z. Li. Sp-nas: Serial-to-parallel backbone search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11863–11872, 2020.
[23] J. Li, S. Liao, H. Jiang, and L. Shao. Box guided convolution for pedestrian detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1615–1624, 2020.
[24] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[25] S. Liu, D. Huang, and Y. Wang. Adaptive nms: Refining pedestrian detection in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6459–6468, 2019.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[27] W. Liu, I. Hasan, and S. Liao. Center and scale prediction: A box-free approach for pedestrian and face detection. arXiv preprint arXiv:1904.02948, 2019.
[28] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 618–634, 2018.
[29] W. Liu, S. Liao, W. Ren, W. Hu, and Y. Yu. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[30] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[31] C. C. Loy, D. Lin, W. Ouyang, Y. Xiong, S. Yang, Q. Huang, D. Zhou, W. Xia, Q. Li, P. Luo, et al. Wider face and pedestrian challenge 2018: Methods and results. arXiv preprint arXiv:1902.06854, 2019.
[32] R. Lu and H. Ma. Semantic head enhanced pedestrian detection in a crowd. arXiv preprint arXiv:1911.11985, 2019.
[33] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
[34] J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3127–3136, 2017.
[35] S. Munder and D. M. Gavrila. An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1863–1868, 2006.
[36] S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel. Strengthening the effectiveness of pedestrian detection with spatially pooled features. In European Conference on Computer Vision, pages 546–561. Springer, 2014.
[37] Y. Pang, J. Xie, M. H. Khan, R. M. Anwer, F. S. Khan, and L. Shao. Mask-guided attention network for occluded pedestrian detection, 2019.
[38] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[39] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[41] X. Song, K. Zhao, W.-S. C. H. Zhang, and J. Guo. Progressive refinement network for occluded pedestrian detection. In Proc. European Conference on Computer Vision, volume 7, page 9, 2020.
[42] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. Fishnet: A versatile backbone for image, region, and pixel level prediction. In Advances in Neural Information Processing Systems, pages 760–770, 2018.
[43] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[44] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al. Deep high-resolution representation learning for visual recognition. arXiv preprint arXiv:1908.07919, 2019.
[45] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7774–7783, 2018.
[46] C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 794–801. IEEE, 2009.
[47] B. Wu and R. Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[48] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[49] J. Zhang, L. Lin, J. Zhu, Y. Li, Y.-c. Chen, Y. Hu, and C. S. Hoi. Attribute-aware pedestrian detection in a crowd. IEEE Transactions on Multimedia, 2020.
[50] L. Zhang, L. Lin, X. Liang, and K. He. Is faster r-cnn doing well for pedestrian detection? In European Conference on Computer Vision, pages 443–457. Springer, 2016.
[51] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele. How far are we from solving pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1259–1267, 2016.
[52] S. Zhang, R. Benenson, and B. Schiele. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2017.
[53] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Occlusion-aware r-cnn: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pages 637–653, 2018.
[54] S. Zhang, Y. Xie, J. Wan, H. Xia, S. Z. Li, and G. Guo. Widerperson: A diverse dataset for dense pedestrian detection in the wild. IEEE Transactions on Multimedia, 2019.
[55] C. Zhou and J. Yuan. Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–151, 2018.