Video Summarization with Long Short-term Memory

2022-06-12 06:17:00 Caicaicaicaicaicai

Abstract

We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots. Casting the task as a structured prediction problem, our main idea is to use Long Short-Term Memory (LSTM) to model the variable-range temporal dependency among video frames, so as to derive both representative and compact video summaries. The proposed model successfully accounts for the sequential structure crucial to generating meaningful video summaries, leading to state-of-the-art results on two benchmark datasets. In addition to advances in modeling techniques, we introduce a strategy to address the need for a large amount of annotated data for training complex learning approaches to summarization. There, our main idea is to exploit auxiliary annotated video summarization datasets, in spite of their heterogeneity in visual styles and contents. Specifically, we show that domain adaptation techniques can improve learning by reducing the discrepancies in the original datasets’ statistical properties.

Video has rapidly become one of the most common sources of visual information. The amount of video data is daunting — it takes over 82 years to watch all videos uploaded to YouTube per day! Automatic tools for analyzing and understanding video contents are thus essential. In particular, automatic video summarization is a key tool to help human users browse video data. A good video summary would compactly depict the original video, distilling its important events into a short watchable synopsis. Video summarization can shorten a video in several ways. In this paper, we focus on the two most common ones: keyframe selection, where the system identifies a series of defining frames [1,2,3,4,5], and key subshot selection, where the system identifies a series of defining subshots, each of which is a temporally contiguous set of frames spanning a short time interval [6,7,8,9].

There has been a steadily growing interest in studying learning techniques for video summarization. Many approaches are based on unsupervised learning, and define intuitive criteria to pick frames [1,5,6,9,10,11,12,13,14] without explicitly optimizing the evaluation metrics. Recent work has begun to explore supervised learning techniques [2,15,16,17,18]. In contrast to unsupervised ones, supervised methods directly learn from human-created summaries to capture the underlying frame selection criterion as well as to output a subset of those frames that is more aligned with human semantic understanding of the video contents.

Supervised learning for video summarization entails two questions: what type of learning model to use? and how to acquire enough annotated data for fitting those models? Abstractly, video summarization is a structured prediction problem: the input to the summarization algorithm is a sequence of video frames, and the output is a binary vector indicating whether a frame is to be selected or not. This type of sequential prediction task is the underpinning of many popular algorithms for problems in speech recognition, language processing, etc. The most important aspect of this kind of task is that the decision to select cannot be made locally and in isolation — the inter-dependency entails making decisions after considering all data from the original sequence.
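
To make that input/output contract concrete, below is a deliberately naive, purely local baseline in Python (the function name, feature shapes, and threshold are all illustrative, not from the paper). It decides frame by frame in isolation, which is precisely what the argument above says a good summarizer cannot do:

```python
import numpy as np

def local_baseline(frames: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Illustrative local rule: input a (T, D) array of per-frame
    features, output a length-T binary selection vector. A frame is
    kept when it differs enough from its immediate predecessor —
    an isolated, local decision with no view of the full sequence."""
    diffs = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    keep = np.concatenate([[True], diffs > thresh])
    return keep.astype(int)
```

A rule like this has no way to recognize that two visually similar frames far apart in time may both matter to the storyline.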

For video summarization, the inter-dependency across video frames is complex and highly inhomogeneous. This is not entirely surprising as human viewers rely on high-level semantic understanding of the video contents (and keep track of the unfolding of storylines) to decide whether a frame would be valuable to keep for a summary. For example, in deciding what the keyframes are, temporally close video frames are often visually similar and thus convey redundant information such that they should be condensed. However, the converse is not true. That is, visually similar frames do not have to be temporally close. For example, consider summarizing the video “leave home in the morning and come back to lunch at home and leave again and return to home at night.” While the frames related to the “at home” scene can be visually similar, the semantic flow of the video dictates none of them should be eliminated. Thus, a summarization algorithm that relies on examining visual cues only but fails to take into consideration the high-level semantic understanding about the video over a long-range temporal span will erroneously eliminate important frames. Essentially, the nature of making those decisions is largely sequential – any decision including or excluding frames is dependent on other decisions made on a temporal line.

Modeling variable-range dependencies where both short-range and long-range relationships intertwine is a long-standing challenging problem in machine learning. Our work is inspired by the recent success of applying long short-term memory (LSTM) to structured prediction problems such as speech recognition [19,20,21] and image and video captioning [22,23,24,25,26]. LSTM is especially advantageous in modeling long-range structural dependencies where the influence by the distant past on the present and the future must be adjusted in a data-dependent manner. In the context of video summarization, LSTMs explicitly use their memory cells to learn the progression of “storylines”, and thus to know when to forget or incorporate past events in making decisions.
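
For reference, the standard LSTM update equations (the common Graves-style formulation; the exact variant in the paper's Sec. 3.2 may differ in details) make this data-dependent gating explicit. Here σ is the logistic sigmoid and ⊙ is elementwise multiplication:

    i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)                             (input gate)
    f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)                             (forget gate)
    o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)                             (output gate)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t-1} + b_c)    (memory cell)
    h_t = o_t ⊙ tanh(c_t)                                              (hidden state)

Because f_t and i_t are themselves functions of the current input x_t and the previous hidden state h_{t-1}, the model decides at every step, based on the data, how much of the past “storyline” state c_{t-1} to retain and how much of the present to write in.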

In this paper, we investigate how to apply LSTM and its variants to supervised video summarization. We make the following contributions. We propose vsLSTM, an LSTM-based model for video summarization (Sec. 3.3). Fig. 2 illustrates the conceptual design of the model. We demonstrate that the sequential modeling aspect of LSTM is essential; the performance of multi-layer neural networks (MLPs) using neighboring frames as features is inferior. We further show how LSTM’s strength can be enhanced by combining it with the determinantal point process (DPP), a recently introduced probabilistic model for diverse subset selection [2,27]. The resulting model achieves the best results on two recent challenging benchmark datasets (Sec. 4). Besides advances in modeling, we also show how to address the practical challenge of insufficient human-annotated video summarization examples. We show that model fitting can benefit from combining video datasets, despite their heterogeneity in both contents and visual styles. In particular, this benefit can be improved by “domain adaptation” techniques that aim to reduce the discrepancies in statistical characteristics across the diverse datasets.
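
To give a feel for such a model, here is a minimal PyTorch sketch of a bidirectional-LSTM frame scorer in the spirit of vsLSTM. The 1024-d input (e.g., GoogLeNet features), hidden size, MLP shape, and class name are our assumptions rather than the authors' published configuration, and the DPP-enhanced variant is omitted:

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Sketch of an LSTM-based frame scorer: a bidirectional LSTM runs
    over per-frame CNN features, and an MLP maps each frame's hidden
    state to an importance score in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):               # x: (batch, T, feat_dim)
        h, _ = self.lstm(x)             # h: (batch, T, 2 * hidden)
        return self.mlp(h).squeeze(-1)  # (batch, T) per-frame scores

scores = FrameScorer()(torch.randn(1, 120, 1024))  # a 120-frame clip
```

A bidirectional LSTM is a natural fit here because, as argued above, the decision to keep a frame can depend on both earlier and later parts of the video.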

The rest of the paper is organized as follows. Section 2 reviews related work on video summarization, and Section 3 describes the proposed LSTM-based model and its variants. In Section 4, we report empirical results. We examine our approach in several supervised learning settings and contrast it to other existing methods, and we analyze the impact of domain adaptation for merging summarization datasets for training (Section 4.4). We conclude our paper in Section 5.

2 Related Work

Techniques for automatic video summarization fall into two broad categories: unsupervised ones that rely on manually designed criteria to prioritize and select frames or subshots from videos [1,3,5,6,9,10,11,12,14,28,29,30,31,32,33,34,35,36] and supervised ones that leverage human-edited summary examples (or frame importance ratings) to learn how to summarize novel videos [2,15,16,17,18]. Recent results by the latter suggest great promise compared to traditional unsupervised methods.

Informative criteria include relevance [10,13,14,31,36], representativeness or importance [5,6,9,10,11,33,35], and diversity or coverage [1,12,28,30,34]. Several recent methods also exploit auxiliary information such as web images [10,11,33,35] or video categories [31] to facilitate the summarization process.

Because they explicitly learn from human-created summaries, supervised methods are better equipped to align with how humans would summarize the input video. For example, a prior supervised approach learns to combine multiple hand-crafted criteria so that the summaries are consistent with ground truth [15,17]. Alternatively, the determinantal point process (DPP) — a probabilistic model that characterizes how a representative and diverse subset can be sampled from a ground set — is a valuable tool to model summarization in the supervised setting [2,16,18].
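
Concretely, an L-ensemble DPP over a ground set of N frames with an N × N positive semi-definite kernel L assigns a subset S the probability P(Y = S) = det(L_S) / det(L + I), where L_S is the submatrix of L indexed by S; the determinant shrinks when the selected items are similar, which is what favors diversity. A small NumPy sketch (the toy kernel values are made up for illustration):

```python
import numpy as np

def dpp_log_prob(L, subset):
    """Log P(Y = S) = log det(L_S) - log det(L + I) for an L-ensemble
    DPP with PSD kernel L; `subset` is a list of selected indices."""
    L_S = L[np.ix_(subset, subset)]
    _, logdet_S = np.linalg.slogdet(L_S)
    _, logdet_Z = np.linalg.slogdet(L + np.eye(len(L)))
    return logdet_S - logdet_Z

# Toy 4-frame kernel: frames 0 and 1 are near-duplicates (entry 0.9),
# so a diverse pair scores higher than a redundant one.
L = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])
print(dpp_log_prob(L, [0, 2]) > dpp_log_prob(L, [0, 1]))  # True
```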

None of the above work uses LSTMs to model both the short-range and long-range dependencies in the sequential video frames. The sequential DPP proposed in [2] uses pre-defined temporal structures, so the dependencies are “hard-wired”. In contrast, LSTMs can model dependencies with a data-dependent on/off switch, which is extremely powerful for modeling sequential data [20].

LSTMs are used in [37] to model temporal dependencies to identify video highlights, cast as auto-encoder-based outlier detection. LSTMs are also used in modeling an observer’s visual attention in analyzing images [38,39], and to perform natural language video description [23,24,25]. However, to the best of our knowledge, our work is the first to explore LSTMs for video summarization. As our results will demonstrate, their flexibility in capturing sequential structure is quite promising for the task.

3 Approach

In this section, we describe our methods for summarizing videos. We first formally state the problem and the notations, and briefly review LSTM [40,41,42], the building block of our approach. We then introduce our first summarization model vsLSTM. Then we describe how we can enhance vsLSTM by combining it with a determinantal point process (DPP) that further takes the summarization structure (e.g., diversity among selected frames) into consideration.

3.1 Problem Statement

We use x = {x_1, x_2, ..., x_t, ..., x_T} to denote a sequence of frames in a video to be summarized, while x_t is the visual feature vector extracted at the t-th frame.

The output of the algorithm can take one of the following two forms . First select the key frame [2,3,12,28,29,43], The summary result is ( Segregated ) A subset of frames . The second one is the interval based key [15,17,31,35], The summary is a set of along the time axis ( short ) interval . Instead of binary information ( Selected or not selected ), Some datasets provide frame level importance scores calculated from human annotations [17,35]. These scores represent the possibility of selecting the framework as part of the summary . Our model leverages all types of annotations ( Binary keyframe labels , Binary sub snapshot labels or frame level importance ) As a learning signal .1

The output of the summarization algorithm can take one of two forms. The first is selected keyframes [2,3,12,28,29,43], where the summarization result is a subset of (isolated) frames. The second is interval-based keyshots [15,17,31,35], where the summary is a set of (short) intervals along the time axis. Instead of binary information (being selected or not selected), certain datasets provide frame-level importance scores computed from human annotations [17,35]. Those scores represent the likelihood of each frame being selected as part of the summary. Our models make use of all types of annotations — binary keyframe labels, binary subshot labels, or frame-level importance scores — as learning signals.
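
As a hypothetical illustration of these three annotation types for the same short video (the array names, values, and threshold below are ours, not taken from any of the benchmark datasets):

```python
import numpy as np

# Three annotation formats for one 8-frame video (illustrative values).
keyframes  = np.array([0, 0, 1, 0, 0, 1, 0, 0])   # binary keyframe labels
keyshots   = np.array([0, 1, 1, 1, 0, 0, 1, 1])   # binary subshot membership
importance = np.array([0.1, 0.4, 0.9, 0.5, 0.2, 0.8, 0.3, 0.1])  # per-frame scores

# One simple (assumed) way to derive binary keyframe targets from
# importance scores is thresholding; the datasets' own conversion
# protocols are more involved.
derived_keyframes = (importance >= 0.7).astype(int)  # -> [0 0 1 0 0 1 0 0]
```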

Original article by Caicaicaicaicaicai; please keep the source link when reposting: https://yzsam.com/2022/03/202203010611281703.html