
# HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

GitHub repository for referring human action segmentation

Action segmentation is a core challenge in high-level video understanding: it aims to partition untrimmed videos into segments and assign each segment a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences and overlook multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the person to be segmented. We introduce the first dataset for Referring Human Action Segmentation, RHAS133, built from 133 movies and annotated with 137 fine-grained actions over 33 hours of video, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 with VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose HopaDIFF, a holistic-partial aware Fourier-conditioned diffusion framework that leverages a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce finer-grained control over action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 across diverse evaluation settings.
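As a rough intuition for the Fourier condition, the sketch below derives a conditioning vector from the frequency spectrum of frame-wise features. This is a minimal illustration under our own naming (`FourierCondition`, `feat_dim`, and `cond_dim` are all hypothetical), not the paper's implementation:

```python
import torch
import torch.nn as nn

class FourierCondition(nn.Module):
    """Hypothetical sketch of a Fourier-derived conditioning vector;
    not the official HopaDIFF module."""

    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, cond_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-wise features.
        # rfft over time gives time // 2 + 1 frequency bins per channel.
        spec = torch.fft.rfft(x, dim=1)
        # Average the magnitude spectrum over frequency -> (batch, feat_dim).
        mag = spec.abs().mean(dim=1)
        # Project to the conditioning dimension consumed by the denoiser.
        return self.proj(mag)

cond = FourierCondition(feat_dim=256, cond_dim=128)
feats = torch.randn(2, 2000, 256)   # two clips, 2000 frames each
print(cond(feats).shape)            # torch.Size([2, 128])
```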


## More Ablation Experiments

Table: Experimental results with the BLIPv2 cross-modal feature extractor under the cross-movie evaluation setting on the RHAS133 dataset, using a frame length of 2000.

| Method | Val ACC | Val EDIT | Val F1@10 | Val F1@25 | Val F1@50 | Test ACC | Test EDIT | Test F1@10 | Test F1@25 | Test F1@50 |
|--------|---------|----------|-----------|-----------|-----------|----------|-----------|------------|------------|------------|
| FACT [1] | 38.06 | 0.44 | 75.26 | 73.88 | 70.52 | 36.95 | 0.50 | 73.89 | 72.84 | 70.62 |
| ActDiff [2] | 4.59 | 25.48 | 38.96 | 38.59 | 37.66 | 3.00 | 54.96 | 27.41 | 27.08 | 26.14 |
| ASQuery [3] | 36.12 | 0.12 | 31.40 | 29.70 | 25.91 | 34.12 | 0.12 | 35.24 | 33.24 | 28.25 |
| LTContent [4] | 12.80 | 0.55 | 29.12 | 27.88 | 24.40 | 13.65 | 0.59 | 31.18 | 29.18 | 25.15 |
| RefAtomNet [5] | 29.47 | 0.12 | 32.80 | 31.00 | 27.14 | 33.62 | 0.15 | 46.29 | 44.13 | 39.32 |
| Ours | 59.60 | 35.00 | 94.33 | 92.62 | 88.85 | 62.63 | 90.78 | 94.71 | 93.57 | 91.19 |

References
[1] Lu et al., FACT, 2024
[2] Liu et al., ActDiff, 2023
[3] Gan et al., ASQuery, 2024
[4] Bahrami et al., LTContent, 2023
[5] Peng et al., RefAtomNet, 2024
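The metrics follow the standard action segmentation protocol: ACC is frame-wise accuracy, EDIT is the segmental edit score, and F1@{10, 25, 50} are segmental F1 scores at IoU overlap thresholds of 0.10/0.25/0.50. For reference, here is a simplified sketch of frame-wise accuracy and segmental F1@k (our own minimal re-implementation, not this repository's evaluation code; the edit score is omitted for brevity):

```python
import numpy as np

def segments(labels):
    """Split a frame-wise label sequence into (label, start, end) runs."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs

def f1_at_k(pred, gt, k=0.1):
    """Segmental F1 with IoU threshold k (simplified sketch)."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for lab, s, e in p_segs:
        best, best_j = 0.0, -1
        for j, (glab, gs, ge) in enumerate(g_segs):
            if glab != lab or used[j]:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            if inter / union > best:
                best, best_j = inter / union, j
        if best >= k:        # matched ground-truth segment counts once
            tp += 1
            used[best_j] = True
    fp = len(p_segs) - tp
    fn = len(g_segs) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

pred = [0, 0, 1, 1, 1, 2]
gt   = [0, 0, 0, 1, 1, 2]
print(np.mean(np.array(pred) == np.array(gt)))  # frame-wise accuracy
print(f1_at_k(pred, gt, k=0.5))                 # segmental F1@50
```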

## Dataset Download

The dataset is being uploaded to this Google Drive folder: https://drive.google.com/drive/folders/1061cqqvCdx-GC9a7JqNZ0FoovJqzeYbQ?usp=sharing.
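To fetch the folder from a script, the gdown package can mirror a shared Drive folder; a short sketch (the output directory name is our own choice):

```python
# pip install gdown
import gdown

url = "https://drive.google.com/drive/folders/1061cqqvCdx-GC9a7JqNZ0FoovJqzeYbQ?usp=sharing"
# Mirror the shared Drive folder into ./RHAS133 (directory name is arbitrary).
gdown.download_folder(url, output="RHAS133", quiet=False)
```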

## Setup

- Recommended environment: Python 3.9.2, CUDA 11.4, PyTorch 1.10.0 (a sanity-check snippet follows this list)
- Install dependencies: `pip3 install -r requirements.txt`
- The model has roughly 415M parameters
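A quick check that the installed stack matches the recommended environment (a generic snippet, not part of this repository):

```python
import sys
import torch

print(sys.version)                # expect Python 3.9.x
print(torch.__version__)          # expect 1.10.0
print(torch.version.cuda)         # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())  # True if a usable GPU is visible
```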

## Run

- Generate config files: `python3 default_configs.py`
- Run training: `python3 main.py --config configs/some_config.json --device gpu_id`
- Trained models and logs are saved in the `result` folder (a launcher sketch for sweeping several configs follows this list)
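If `default_configs.py` emits several configs, a small launcher can sweep them on one GPU. This is an illustrative sketch; the glob pattern and device id are placeholders:

```python
import glob
import subprocess

# Launch main.py once per generated config on GPU 0 (paths are illustrative).
for cfg in sorted(glob.glob("configs/*.json")):
    subprocess.run(["python3", "main.py", "--config", cfg, "--device", "0"],
                   check=True)
```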

## Action Name Index

| Index | Action |
|-------|--------|
| 1 | arranging |
| 2 | asking |
| 3 | beckoning |
| 4 | bending |
| 5 | bleeding |
| 6 | bowing |
| 7 | breathing |
| 8 | brushing |
| 9 | calling |
| 10 | carrying |
| 11 | catching |
| 12 | clapping |
| 13 | climbing |
| 14 | closing |
| 15 | cooking |
| 16 | coughing |
| 17 | covering |
| 18 | crawling |
| 19 | crossing |
| 20 | crying |
| 21 | cutting |
| 22 | dancing |
| 23 | dodging |
| 24 | dragging |
| 25 | drawing |
| 26 | drinking |
| 27 | driving |
| 28 | dropping |
| 29 | eating |
| 30 | entering |
| 31 | falling |
| 32 | fixing |
| 33 | flipping |
| 34 | flying |
| 35 | frowning |
| 36 | gesturing |
| 37 | getting |
| 38 | gettingdown |
| 39 | gettingon |
| 40 | gettingup |
| 41 | giving |
| 42 | grabbing |
| 43 | handsinpocket |
| 44 | hanging |
| 45 | helping |
| 46 | hitting |
| 47 | holding |
| 48 | hugging |
| 49 | jumping |
| 50 | kicking |
| 51 | kissing |
| 52 | kneeling |
| 53 | knocking |
| 54 | laughing |
| 55 | leaning |
| 56 | leaving |
| 57 | lifting |
| 58 | listening |
| 59 | litting |
| 60 | looking |
| 61 | lookingdown |
| 62 | lying |
| 63 | makinghair |
| 64 | moveeyes |
| 65 | movehand |
| 66 | movehead |
| 67 | movemouth |
| 68 | moving |
| 69 | movingbody |
| 70 | movingonstairs |
| 71 | no |
| 72 | nodding |
| 73 | opening |
| 74 | picking |
| 75 | picturing |
| 76 | playing |
| 77 | playingmusic |
| 78 | pointing |
| 79 | pouring |
| 80 | pulling |
| 81 | pushing |
| 82 | putting |
| 83 | raising |
| 84 | reaching |
| 85 | reading |
| 86 | receiving |
| 87 | removing |
| 88 | riding |
| 89 | rolling |
| 90 | rubbing |
| 91 | running |
| 92 | sawing |
| 93 | seaching |
| 94 | searching |
| 95 | shaking |
| 96 | shakingbody |
| 97 | shocking |
| 98 | shooting |
| 99 | shouting |
| 100 | showing |
| 101 | sighing |
| 102 | singing |
| 103 | sitting |
| 104 | sittingdown |
| 105 | slapping |
| 106 | sleeping |
| 107 | smearing |
| 108 | smelling |
| 109 | smiling |
| 110 | smoking |
| 111 | speaking |
| 112 | squatting |
| 113 | standing |
| 114 | standingup |
| 115 | stopping |
| 116 | straighteningup |
| 117 | stretching |
| 118 | struggling |
| 119 | swimming |
| 120 | swing |
| 121 | taking |
| 122 | takingoff |
| 123 | talking |
| 124 | thinking |
| 125 | throwing |
| 126 | tidying |
| 127 | touching |
| 128 | turning |
| 129 | walking |
| 130 | watching |
| 131 | watering |
| 132 | waving |
| 133 | wearing |
| 134 | wiping |
| 135 | working |
| 136 | writing |
| 137 | yawning |

## Action Clustering

### 🔵 Person-Person Interaction (PPI)

| Index | Action |
|-------|--------|
| 2 | asking |
| 3 | beckoning |
| 9 | calling |
| 12 | clapping |
| 16 | coughing |
| 41 | giving |
| 42 | grabbing |
| 45 | helping |
| 46 | hitting |
| 48 | hugging |
| 51 | kissing |
| 53 | knocking |
| 54 | laughing |
| 58 | listening |
| 72 | nodding |
| 78 | pointing |
| 86 | receiving |
| 99 | shouting |
| 100 | showing |
| 102 | singing |
| 105 | slapping |
| 111 | speaking |
| 123 | talking |
| 132 | waving |
| 137 | yawning |

### 🟢 Person-Object Interaction (POI)

| Index | Action |
|-------|--------|
| 1 | arranging |
| 4 | bending |
| 5 | bleeding |
| 8 | brushing |
| 10 | carrying |
| 11 | catching |
| 14 | closing |
| 15 | cooking |
| 17 | covering |
| 21 | cutting |
| 24 | dragging |
| 25 | drawing |
| 26 | drinking |
| 27 | driving |
| 28 | dropping |
| 29 | eating |
| 30 | entering |
| 32 | fixing |
| 33 | flipping |
| 38 | gettingdown |
| 39 | gettingon |
| 44 | hanging |
| 46 | hitting |
| 47 | holding |
| 50 | kicking |
| 57 | lifting |
| 59 | litting |
| 63 | makinghair |
| 73 | opening |
| 74 | picking |
| 75 | picturing |
| 77 | playingmusic |
| 79 | pouring |
| 80 | pulling |
| 81 | pushing |
| 82 | putting |
| 85 | reading |
| 87 | removing |
| 88 | riding |
| 89 | rolling |
| 90 | rubbing |
| 92 | sawing |
| 94 | searching |
| 95 | shaking |
| 98 | shooting |
| 106 | sleeping |
| 107 | smearing |
| 108 | smelling |
| 110 | smoking |
| 114 | standingup |
| 116 | straighteningup |
| 118 | struggling |
| 120 | swing |
| 121 | taking |
| 122 | takingoff |
| 125 | throwing |
| 126 | tidying |
| 127 | touching |
| 131 | watering |
| 133 | wearing |
| 134 | wiping |
| 135 | working |
| 136 | writing |

### 🟠 Person Physical Movements (PPM)

| Index | Action |
|-------|--------|
| 6 | bowing |
| 7 | breathing |
| 13 | climbing |
| 18 | crawling |
| 19 | crossing |
| 20 | crying |
| 22 | dancing |
| 23 | dodging |
| 31 | falling |
| 34 | flying |
| 35 | frowning |
| 36 | gesturing |
| 37 | getting |
| 40 | gettingup |
| 43 | handsinpocket |
| 49 | jumping |
| 52 | kneeling |
| 55 | leaning |
| 56 | leaving |
| 60 | looking |
| 61 | lookingdown |
| 62 | lying |
| 64 | moveeyes |
| 65 | movehand |
| 66 | movehead |
| 67 | movemouth |
| 68 | moving |
| 69 | movingbody |
| 70 | movingonstairs |
| 71 | no |
| 76 | playing |
| 91 | running |
| 93 | seaching |
| 96 | shakingbody |
| 97 | shocking |
| 101 | sighing |
| 103 | sitting |
| 104 | sittingdown |
| 112 | squatting |
| 113 | standing |
| 115 | stopping |
| 117 | stretching |
| 119 | swimming |
| 124 | thinking |
| 128 | turning |
| 129 | walking |
| 130 | watching |
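
For per-cluster analysis, the grouping above can be encoded as plain index sets. A minimal sketch using a truncated subset of the assignments (extend the sets with the full tables above):

```python
# Representative subset of the cluster assignments above (truncated for brevity).
CLUSTERS = {
    "PPI": {2, 3, 9, 12, 16, 41, 42, 45, 46, 48, 51, 53, 54},
    "POI": {1, 4, 5, 8, 10, 11, 14, 15, 17, 21, 24, 25, 26},
    "PPM": {6, 7, 13, 18, 19, 20, 22, 23, 31, 34, 35, 36, 37},
}

def cluster_of(action_idx: int) -> str:
    """Return the coarse group (PPI/POI/PPM) for an action index."""
    for name, members in CLUSTERS.items():
        if action_idx in members:
            return name
    return "unknown"

print(cluster_of(48))  # 'PPI' (hugging)
print(cluster_of(88))  # 'unknown' here; 'POI' with the full table
```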
