The selection of start–end times can be regarded as a sequential decision-making process, and with the development of reinforcement learning, SM-RL applied this technique to video grounding.

Due to the huge cost of labeling, it is difficult to apply supervised methods in practical scenarios. To reduce the labeling cost, several weakly supervised methods and even an unsupervised method have been proposed. Weakly supervised methods require paired video–query data without detailed segment annotations, whereas the unsupervised method requires only the video set and the query set for training. One work is the first to outline a weakly supervised method that iteratively obtains a temporal boundary prediction and feeds the segment into an event captioner to generate a sentence. SCN constructs a proposal generation module that aggregates context information to obtain candidate proposals, and a follow-up work added audio features to improve performance. Another approach trained a joint visual–text embedding and located the moment using the latent alignment obtained by text-guided attention (TGA). DSCNet, proposed in 2022, is the first unsupervised method for temporal video grounding; it mines deep semantic features from the query set to compose possible activities in each video. Compared with supervised methods, these approaches achieve worse performance.

CLIP (Contrastive Language-Image Pre-training) was developed by OpenAI in 2021, aiming to learn a joint representation of images and text. Trained on 400 million image–sentence pairs collected from the Internet, CLIP is a powerful model that can be applied to many computer vision tasks, such as image classification, object detection, image generation, and image manipulation. The pre-trained model can estimate the semantic similarity between a sentence and an image. For video-understanding tasks, CLIP is used in video-text retrieval and video classification.
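As an illustration of how a CLIP-style model scores text–image similarity, the sketch below computes cosine similarity between an image embedding and several caption embeddings. The random vectors stand in for the outputs of CLIP's image and text encoders; this is not the paper's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
image_embedding = rng.standard_normal(512)       # stand-in for the image encoder output
text_embeddings = rng.standard_normal((3, 512))  # stand-ins for 3 candidate captions

# Score each caption against the image; the highest score is the best match.
scores = [cosine_similarity(image_embedding, t) for t in text_embeddings]
best = int(np.argmax(scores))
```

In actual CLIP usage the embeddings would come from the pre-trained encoders, and the similarities are typically scaled by a learned temperature before a softmax.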
Most methods for video grounding operate in a supervised manner: for each sentence, the start and end times in the video are needed as annotations. Traditional methods are two-stage solutions following the rule of "propose and rank". In one work, a graph was used to explicitly model temporal relationships among proposals. 2D-TAN introduced a two-dimensional temporal map to model the temporally adjacent relations of video clips, and MS-2D-TAN is its multi-scale version. To achieve more efficient performance on this task, end-to-end (one-stage) methods, which generate the video clips related to the sentence directly, have become popular in recent years. One method uses an attention mechanism to directly predict the coordinates of the queried video clip; another uses the distances between the frames within the ground-truth period and the start–end frames as dense supervision to improve accuracy.
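To make the two-dimensional temporal map concrete, the sketch below builds a score map in the spirit of 2D-TAN: cell (i, j) represents the candidate moment spanning clip i through clip j, so the upper triangle enumerates all valid (start <= end) proposals. The random scores are stand-ins for the query-matching scores a real model would produce.

```python
import numpy as np

num_clips = 8
rng = np.random.default_rng(0)

# Invalid cells (end < start) get -inf so argmax never selects them.
score_map = np.full((num_clips, num_clips), -np.inf)

for i in range(num_clips):
    for j in range(i, num_clips):
        # In 2D-TAN this score comes from fusing the moment feature
        # (pooled over clips i..j) with the sentence feature.
        score_map[i, j] = rng.random()

# The highest-scoring cell gives the predicted (start, end) clip indices.
start, end = np.unravel_index(np.argmax(score_map), score_map.shape)
```

The map layout is what lets the model score all O(n^2) candidate moments jointly and reason about temporally adjacent proposals, rather than ranking each proposal in isolation.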