Caption Refinement
⚠️ Please note that the content of this document is confidential and shall not be disclosed.
Background
The purpose of this project is to help train multimodal models to extract and
summarize video content. We need to refine the English caption that the multimodal
model generates for each video, so that the refined caption fully represents the
content of the video and the meaning that the footage is intended to convey.
Guidelines
Discard
• If the image/video cannot be played.
• If the content of the video is completely unrecognizable because the video is too
blurry.
○ Do not discard the video if it is only partially covered and the main content can
still be recognized.
• If any sensitive textual content appears in the video.
• If the subtitles contain any non-alphabetic characters that cannot be typed on an
English keyboard.
• If there is a mosaic watermark in the video.
* If the video has no subtitles but has non-English audio, do not discard it.
Audio consideration: Audio is used only to understand the video content. If an
object or piece of information appears only in the audio and not in the visuals of
the video, do not add it to the caption.
Non-English text in alphabetic characters: If the non-English subtitles are written
entirely in alphabetic characters, do not discard the case. The text can still be
transcribed in the caption and processed as usual.
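The discard rules above can be sketched as a small check. This is only an illustration, not part of any official tooling: the `VideoCheck` fields and the `discard_reasons` function are hypothetical names, and treating "can be typed on an English keyboard" as "is an ASCII character" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VideoCheck:
    """Hypothetical record of what the annotator observed in one video."""
    playable: bool
    recognizable: bool          # False only if the content is completely unrecognizable
    has_sensitive_text: bool
    has_mosaic_watermark: bool
    subtitles: str = ""         # empty string means the video has no subtitles


def is_typable(ch: str) -> bool:
    # Assumption: "typable on an English keyboard" is approximated as ASCII.
    return ch.isascii() and (ch.isprintable() or ch.isspace())


def discard_reasons(v: VideoCheck) -> list[str]:
    """Return every discard rule the video triggers (empty list = keep)."""
    reasons = []
    if not v.playable:
        reasons.append("cannot be played")
    if not v.recognizable:
        reasons.append("too blurry to recognize")
    if v.has_sensitive_text:
        reasons.append("sensitive text")
    if any(not is_typable(ch) for ch in v.subtitles):
        reasons.append("untypable subtitle characters")
    if v.has_mosaic_watermark:
        reasons.append("mosaic watermark")
    return reasons
```

A video with no subtitles and non-English audio passes this check, because audio language is not a discard criterion and an empty `subtitles` string triggers nothing.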
Error Category
Error Type Branch Description
Error Type: Spatial Relationship
Branch: Transition (time: changes in reaction time)
Description: Video comprehension is performed by cutting the video into many small
segments, and the interpretations of neighboring segments can sometimes contradict
each other, which gives rise to transition problems. This can result in an ongoing
event being described as a new event, or in repetitive description.
Error Type: Spatial Relationship
Branch: Relative position
Description: When the video involves up, down, left, or right orientation, the
description is sometimes based on the viewer's perspective when looking at the
screen and sometimes on the orientation of a specific object in the video. A camera
shift can also change the position of an item in the frame. Make sure that the
relative positions in the caption are correct and that they do not confuse the
reader.
Error Type: Spatial Relationship
Branch: Trajectory
Description: When the caption involves the movement path of an object, the
description is oversimplified.
Error Type: Recognition
Branch: Celebrity Recognition
Description: Due to the AI model's limitations regarding celebrities, it does not
recognize public figures or celebrities in the video and fails to refer to them by
name (e.g. Trump, James).
Error Type: Recognition
Branch: Object Recognition
Description: Due to the AI model's limitations regarding everyday objects, it
sometimes fails to recognize objects correctly, such as a hotdog in the video.
Error Type: Recognition
Branch: Special Action Recognition
Description: Due to the AI model's limited commonsense knowledge of human activity,
it sometimes misses the underlying meaning of famous, traditional, symbolic, or
professional actions. Correct the caption so that it uses the specific established
term for such actions.
Error Type: Recognition
Branch: Color recognition
Description: Due to visual interference or other factors, the AI model sometimes
does not recognize the correct color of objects. Pay attention to color
descriptions and correct any errors.
Error Type: Person Reference
Branch: Pronoun incorrect
Description:
• Be sure to correct wrong pronouns, such as an incorrect "them/they", as this is a
common mistake made by the model.
• Instead of a general reference like "the person", specify gender/age if possible:
she/he/they/them, man/woman/lady/girl/boy.
• If recognizable in the image/video, make the reference more specific, e.g. the
bride, the high school boys.
• Keep pronouns consistent throughout the caption; avoid varying pronouns in a way
that causes confusion.
• Close-up: the person
Error Type: Person Reference
Branch: Appearance description
Description:
• For people who cannot simply be referred to by their gender, etc., refer to them
by their clothing or similar features.
• Record people's characteristics correctly. This includes recognizing what a person
is holding and clarifying who the various belongings in the scene belong to.
• Characters' facial expressions can show a wide range of emotions, so it is
important to recognize and understand them.
• Do not use discriminatory characterizations.
Error Type: Person Reference
Branch: Identity match
Description:
• After a scene change, the same character may appear in different clothing or from
a different camera angle, which can bias the AI's understanding of the participants
in the event: the same person may be treated as a new participant, or participants
A and B may be mismatched with their actions.
• When an event involves multiple participants, it is important to distinguish the
role each participant plays in it.
Error Type: Key Event and Background Description
Branch: Event clip integrity
Description:
• Key information is missing, such as the key actions of characters or a change of
action within a series of movements (e.g. the finishing action of a football or
boxing match).
• The key funny points of funny/meme videos (e.g. the tacit understanding between a
man and a woman in daily life, where one always anticipates a problem the other may
run into and prevents it in time).
• The progress, development, or continuous actions demonstrated in the video (e.g.
practicing basketball).
• The process of teaching or learning, since such tutorials are portrayed in detail
(e.g. teaching how to draw one's eyebrows).
Error Type: Key Event and Background Description
Branch: Timestamp segmentation
Description: The description given for a certain timestamp does not match the actual
visuals of the video at that point.
Error Type: Key Event and Background Description
Branch: The principle of attention
Description:
• Weakly relevant material, such as background information, makes up the bulk of
the description, while depictions of the key event appear trivial.
• Obvious movement of people in the background is ignored.
• Unnecessary repetitive description draws too much attention.
Error Type: Logic and Reasoning
Branch: Content extension interpretation
Description:
• Over-interpretation of scenes or background characters in the video. For example,
a man passes by the main character, and the AI-generated text says that the man is
the main character's father, which is not true.
• Insufficient reasoning about the topic of the video, the relationships between
subjects, etc. For example, several scenes show a baby growing into an adult, but
the AI does not recognize that the adult and the baby are the same person.
Error Type: Logic and Reasoning
Branch: Virtual things to reality
Description:
• Correspondence between virtual characters or objects and reality, for example
abstract and exaggerated game or animated characters, or a complete event
represented through something minimalist or symbolic.
Error Type: Logic and Reasoning
Branch: Logical deduction recognition
Description:
• Some items change form as the video progresses and cannot be identified by
ordinary recognition. They must be identified by common-sense deduction, drawing on
knowledge about the item and the logic of the video.
For example, a chef holds a shellfish in the first scene of a video and brings out
a meat dish in the next scene. The AI cannot identify what the meat is, but we can
deduce from the footage that it is the shellfish meat, taken out of the shell and
fried.
Error Type: Text
Branch: Subtitle Recognition and Understanding
Description:
• Complete the on-screen text prompts: link fragmented subtitles into complete
sentences.
• Interpret multiple subtitles together.
• Match each dialog box to the correct speaker.
Error Type: Text
Branch: Text in Video Recognition and Understanding
Description:
• Recognize and read text on clothes, printed materials, walls, etc.
• Determine how relevant the text is to the video; show and interpret highly
relevant text.
Error Type: Primary Error
Branch: Grammar Error
Description: Be sure to keep the same tense as the original caption.
Error Type: Primary Error
Branch: Bug
Description: Unintelligible piles of words or things.
Error Type: Primary Error
Branch: Image display error
Description: Overly blurred or masked images.
Error Type: Primary Error
Branch: Fictionalized scenario
Description: Things and events that do not appear in the video are included in the
description.
Error Type: Primary Error
Branch: Redundant retelling
Description:
• The video is summarized with too many words, or the summary completely
paraphrases the preceding content.
• For videos with an unchanging setting (only the object's or person's action
changes), limit repeated descriptions of the background, surroundings, clothing,
etc. to no more than two occurrences to avoid redundancy.
Acceptable Error Range
1. Inferences or suppositions
2. Omission of background details
3. Omission of small text or stickers in the video
4. Vague descriptions of unclear objects
Notes:
1. Be sure to evaluate the text against the videos in column B.
2. Determine whether the text should be discarded according to the Discard rules,
and record your result in column D.
3. If the text is kept, check which error types it involves according to the Error
Category, and select them in column E (multiple choices allowed).
4. Refine the text in column F.
a. Mark the erroneous part of the original text in red font.
b. Rewrite the original text according to the rules, and mark the content that is
new compared to the original text in green font.
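The workflow in the notes above can be sketched as a small consistency check on one spreadsheet row. This is a hypothetical illustration only: the `AnnotationRow` class and `validate` function are invented names, the list uses the top-level error types from the Error Category section (whether column E is filled at the type or the branch level is not stated here), and skipping the remaining checks for discarded rows is an assumption.

```python
from dataclasses import dataclass, field

# Top-level error types from the Error Category section.
ERROR_TYPES = {
    "Spatial Relationship",
    "Recognition",
    "Person Reference",
    "Key Event and Background Description",
    "Logic and Reasoning",
    "Text",
    "Primary Error",
}


@dataclass
class AnnotationRow:
    """Hypothetical view of one row of the annotation sheet."""
    video: str                                           # column B
    discard: bool                                        # column D
    error_types: set[str] = field(default_factory=set)   # column E
    refined_text: str = ""                               # column F


def validate(row: AnnotationRow) -> list[str]:
    """Return a list of problems found in the row (empty list = consistent)."""
    problems = []
    if row.discard:
        # Assumption: a discarded row needs no error types or refined text.
        return problems
    unknown = row.error_types - ERROR_TYPES
    if unknown:
        problems.append(f"unknown error types: {sorted(unknown)}")
    if not row.refined_text.strip():
        problems.append("column F is empty for a kept row")
    return problems
```

A check like this only catches missing or mistyped fields; judging whether the refined caption is correct still requires watching the video.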