Caption Refinement - Guide

Uploaded by

body.yosf1971

Caption Refinement

⚠️ Please note that the content of this document shall be kept confidential and
shall not be disclosed.

Background
The purpose of this project is to help train multimodal models to extract and
summarize video content. We need to refine the English caption that the multimodal
model generates for each video, so that the refined caption fully represents the
content of the video and the meaning that the camera is meant to convey.

Guidelines

Discard
• If the image/video cannot be played.

• If the content of the video is completely unrecognizable because the video is too
blurry.

￮ Do not discard the video if it is only partially covered and the main content
can still be recognized.

• If any sensitive textual content appears in the video.

• If the subtitles contain any non-alphabetic characters that cannot be typed on an
English keyboard.

• If there is a mosaic watermark in the video.

*If the video has no subtitles but has non-English audio, it should not be
discarded.

Audio consideration: Audio is used only to understand the video content. If an
object or piece of information appears only in the audio and not in the visuals of
the video, do not add it to the caption.
Non-English text in alphabetic characters: If the non-English subtitles are written
entirely in alphabetic characters, do not discard the case. The text can still be
transcribed in the caption and processed as usual.
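The Discard rules and the two notes above can be read as a simple checklist. The following is a minimal Python sketch, not part of the original guideline: the `VideoCase` fields and both function names are hypothetical, and treating "can be typed on an English keyboard" as printable ASCII is an assumption (accented Latin letters, for example, would need a project-level decision). Audio language is deliberately not an input, since non-English audio alone is not a reason to discard.

```python
from dataclasses import dataclass


@dataclass
class VideoCase:
    """Annotator observations for one video (all field names are illustrative)."""
    playable: bool = True
    fully_unrecognizable: bool = False  # too blurry to recognize anything at all
    has_sensitive_text: bool = False
    has_mosaic_watermark: bool = False
    subtitles: str = ""                 # empty string means the video has no subtitles


def typable_on_english_keyboard(text: str) -> bool:
    """Assumption: 'typable' means whitespace or printable ASCII (codes 32 to 126)."""
    return all(ch.isspace() or 32 <= ord(ch) <= 126 for ch in text)


def should_discard(case: VideoCase) -> bool:
    """Return True if at least one Discard rule applies to the case."""
    return (
        not case.playable
        or case.fully_unrecognizable
        or case.has_sensitive_text
        or case.has_mosaic_watermark
        or not typable_on_english_keyboard(case.subtitles)
    )
```

A partially covered video whose main content is still recognizable would be recorded with `fully_unrecognizable=False`, so it is kept.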

Error Category
Error Type Branch Description

Error Type: Spatial Relationship (spatial-structure problems: time, i.e. changes over time; space; relationships)

Branch: Transition
Video comprehension is performed by cutting the video into many small segments, and the model's understanding of neighboring segments can be contradictory, which gives rise to transition problems. As a result, an ongoing event may be described as a new event, or the same content may be described repeatedly.

Branch: Relative position
When a video involves up, down, left, and right orientation, the description is sometimes based on the viewer's perspective when looking at the screen and sometimes on the orientation of a specific object in the video. A camera shift can also change the position of an item in the video. Make sure that the relative positions in the caption are correct and that they do not confuse the viewer.

Branch: Trajectory
When the movement path of an object is involved, the description is sometimes oversimplified.

Error Type: Recognition

Branch: Celebrity Recognition
Because of the AI model's limitations regarding celebrities, it does not recognize public figures or celebrities and fails to refer to them by name, such as Trump and James in the video.

Branch: Object Recognition
Because of the AI model's limitations regarding everyday objects, it sometimes fails to recognize objects correctly, such as a hotdog in the video.

Branch: Special Action Recognition
Because of the AI model's limited commonsense knowledge of human activity, it sometimes misses the underlying meaning of famous, traditional, symbolic, or professional actions. The refined caption should use the established term for such an action.

Branch: Color recognition
Because of visual disturbances or other factors, the AI model sometimes fails to recognize the correct color of objects. Pay attention to color descriptions and correct any errors.

Error Type: Person Reference

Branch: Pronoun incorrect
• Be sure to correct wrong pronouns, such as an incorrect "them/they", which is a common mistake made by the model.
• Instead of a general reference like "the person", specify gender/age if possible: she/he/they/them, man/woman/lady/girl/boy.
• If recognizable in the image/video, make the reference more specific, such as "the bride" or "the high school boys".
• Keep pronouns consistent throughout the caption, and avoid pronoun variation that causes confusion.
• Close-up: the person

Branch: Appearance description
• People who cannot simply be referred to by their gender, etc. should be referred to by their clothing, etc.
• Record people's characteristics correctly. This includes recognizing what a person is holding and clarifying whom the various belongings in the scene belong to.
• Characters' facial expressions can show a wide range of emotions, so it is important to recognize and understand them.
• Do not use discriminatory characterizations.

Branch: Identity match
• After a scene change, the same character may appear in different clothing or from a different camera angle, which can bias the AI's understanding of the participants in the event. The same person may be treated as a new participant, or participants A and B may be matched with the wrong actions.
• When an event involves multiple participants, it is important to distinguish the role that each participant plays in the event.

Error Type: Key Event and Background Description

Branch: Event clip integrity
• Key information is missing, such as characters' key actions or a change of action within a series of movements (e.g. the finishing action of a football/boxing match).
• The key funny points of funny/meme videos (e.g. the tacit understanding between a man and a woman in daily life, where one always anticipates a problem the other may run into and heads it off in time).
• The progress/development/continuous actions demonstrated in the video (e.g. practicing basketball).
• The process of teaching/learning, since tutorials of this kind are portrayed in detail (e.g. teaching how to draw one's eyebrows).

Branch: Timestamp segmentation (narrative order)
The description at a certain timestamp does not match the actual visuals in the video.

Branch: The principle of attention
• Weakly relevant material, such as background information, makes up the bulk of the video description, while depictions related to the key event appear trivial.
• Obvious movement of people in the background is ignored.
• Unnecessary repetitive description attracts too much attention.

Error Type: Logic and reasoning

Branch: Content extension interpretation
• Excessive interpretation of scenes or background characters in the video. For example, a man passes by the main character in the video, and the AI-generated text says that the man is the main character's father (which is not true).
• Insufficient reasoning about the topic of the video, the relationships between subjects, etc. For example, several scenes in the video show a baby growing into an adult, but the AI cannot recognize that the adult and the baby are the same person.

Branch: Virtual things to reality
• Correspondence between virtual characters or objects and reality, such as abstract and exaggerated game or animated characters, or a complete event represented through something minimalist or symbolic.

Branch: Logical deduction recognition
• Some items change form as the video progresses and cannot be identified by ordinary recognition. They have to be identified by deduction from accumulated knowledge about the item and the common-sense logic of the video. For example, a chef holds a shellfish in the first scene of a video and brings out a meat dish in the next scene. The AI cannot identify the meat, but we can deduce from the footage that it is shellfish meat.

Error Type: Text

Branch: Subtitle Recognition and Understanding
• Complete the on-screen text prompts (link fragmented subtitles into complete sentences).
• Multiple subtitles should be understood together.
• Each dialog box should correspond to the correct speaker.

Branch: Text in Video Recognition and Understanding
• Recognize and read text on clothes, printed materials, walls, etc.
• Determine how relevant the text is to the video, and display and interpret highly relevant text.

Error Type: Primary Error

Branch: Grammar Error
Be sure to use the same tense as the original caption.

Branch: Bug (garbled text)
Unintelligible piles of words or things.

Branch: Image display error
Overly blurred or masked images.

Branch: Fictionalized scenario
Things and events that do not appear in the video are included in the description.

Branch: Redundant retelling
• Summarizing the video with too many words, or completely paraphrasing the previous content.
• For videos with an unchanging setting (only the object's/person's actions change), limit repeated descriptions of the background, surroundings, the person's clothing, etc. to no more than two to avoid redundancy.
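The "no more than twice" limit for repeated setting descriptions can be spot-checked mechanically. The sketch below is only an illustration under stated assumptions: `repeated_phrases` is a hypothetical helper, the annotator supplies the phrases to check, and plain case-insensitive substring counting stands in for real judgment about what counts as the same description.

```python
import re


def repeated_phrases(caption: str, phrases: list, limit: int = 2) -> dict:
    """Return the phrases that occur in the caption more than `limit` times."""
    lowered = caption.lower()
    counts = {p: len(re.findall(re.escape(p.lower()), lowered)) for p in phrases}
    # Keep only the phrases that exceed the allowed number of mentions.
    return {p: n for p, n in counts.items() if n > limit}


caption = (
    "A man in a blue shirt chops onions. "
    "The man in a blue shirt stirs the pot. "
    "The man in a blue shirt serves the dish."
)
over_limit = repeated_phrases(caption, ["blue shirt", "kitchen"])
```

Here `over_limit` flags "blue shirt", which appears three times, while "kitchen" is not flagged because it never appears.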

*Audio consideration: Audio is used only to understand the video content. If an
object or piece of information appears only in the audio and not in the visuals of
the video, do not add it to the caption.
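For tooling that records which errors an annotator selected, the error types and branches listed in the Error Category section can be held in a small lookup table. This is a sketch only: the dictionary uses normalized spellings of the category names from this guide, and `ERROR_CATEGORIES` and `invalid_selections` are hypothetical names, not part of any existing tool.

```python
# Error type -> branches, as listed in the Error Category section.
ERROR_CATEGORIES = {
    "Spatial Relationship": ["Transition", "Relative position", "Trajectory"],
    "Recognition": [
        "Celebrity Recognition",
        "Object Recognition",
        "Special Action Recognition",
        "Color recognition",
    ],
    "Person Reference": ["Pronoun incorrect", "Appearance description", "Identity match"],
    "Key Event and Background Description": [
        "Event clip integrity",
        "Timestamp segmentation",
        "The principle of attention",
    ],
    "Logic and reasoning": [
        "Content extension interpretation",
        "Virtual things to reality",
        "Logical deduction recognition",
    ],
    "Text": [
        "Subtitle Recognition and Understanding",
        "Text in Video Recognition and Understanding",
    ],
    "Primary Error": [
        "Grammar Error",
        "Bug",
        "Image display error",
        "Fictionalized scenario",
        "Redundant retelling",
    ],
}


def invalid_selections(selected):
    """Return the (error type, branch) pairs that are not in the taxonomy."""
    return [(t, b) for t, b in selected if b not in ERROR_CATEGORIES.get(t, [])]
```

For example, `invalid_selections([("Text", "Trajectory")])` returns that pair, because Trajectory belongs to Spatial Relationship rather than Text.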

Acceptable Error Range

1. Inference/supposition
2. Omission of background information details
3. Omission of small text or stickers in the video
4. Vague descriptions of unclear objects

Notes:

1. Be sure to evaluate the text against the videos in column B.

2. Determine whether the text should be discarded according to the Discard rules,
and fill in your result in column D.
3. If the text is retained, check which error types it involves according to the
Error Category, and select those error types in column E (multiple selections
allowed).

4. Refine the text in column F.

a. Mark the erroneous parts of the original text in red font.
b. Rewrite the original text according to the rules, and mark the content that is
new compared to the original text in green font.
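The red/green marking in step 4 amounts to a word-level diff between the original and the refined caption. As a rough aid, and assuming whitespace-separated words are an acceptable unit, Python's standard `difflib` can suggest which spans to color. The function name `mark_changes` is hypothetical, and the output is a suggestion for the annotator to review, not a replacement for judgment.

```python
import difflib


def mark_changes(original: str, refined: str):
    """Return (red, green): spans to mark in the original and in the refined text."""
    old_words, new_words = original.split(), refined.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    red, green = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Words removed or replaced are the erroneous parts of the original (red).
        if tag in ("replace", "delete"):
            red.append(" ".join(old_words[i1:i2]))
        # Words inserted or substituted are new content in the refinement (green).
        if tag in ("replace", "insert"):
            green.append(" ".join(new_words[j1:j2]))
    return red, green


red, green = mark_changes(
    "The person holds a red cup",
    "The woman holds a blue cup",
)
```

In this example `red` is `["person", "red"]` and `green` is `["woman", "blue"]`.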
