In this work we introduce a time- and memory-efficient method for structured prediction that couples neuron decisions across both space and time. We show that we are able to perform exact and efficient inference on a densely-connected spatio-temporal graph by capitalizing on recent advances in deep Gaussian Conditional Random Fields (GCRFs).
Our method, called VideoGCRF, (a) is efficient, (b) has a unique global minimum, and (c) can be trained end-to-end alongside contemporary deep networks for video understanding. We experiment with multiple connectivity patterns in the temporal domain, and present empirical improvements over strong baselines on the tasks of both semantic and instance segmentation of videos.
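To make the claims of exact inference and a unique global minimum concrete, the following is a minimal sketch of the quadratic objective underlying Gaussian CRF inference; the notation (x for the stacked per-pixel class scores, b for the unary terms, A for the pairwise coupling matrix, and λ for the regularizer) is ours and not taken verbatim from the paper.

```latex
% Sketch (our notation): GCRF inference as an unconstrained quadratic program.
% x stacks the per-pixel, per-class scores over the whole spatio-temporal graph,
% b the unary (CNN) terms, and A the pairwise coupling matrix.
E(\mathbf{x}) \;=\; \tfrac{1}{2}\,\mathbf{x}^{\top}\!\left(A + \lambda I\right)\mathbf{x} \;-\; \mathbf{b}^{\top}\mathbf{x}, \qquad \lambda > 0.
% If A is positive semi-definite, then A + \lambda I is positive definite,
% so E is strictly convex and its unique global minimizer solves the linear system
\left(A + \lambda I\right)\mathbf{x}^{*} \;=\; \mathbf{b}.
```

Under these assumptions, exact inference amounts to solving a single linear system, which can be done efficiently, e.g. with conjugate gradients.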
We jointly segment multiple images by first passing them through a fully convolutional network to obtain per-pixel class scores (unary terms U), together with spatial (S) and temporal (T) embeddings. We couple predictions at different spatial and temporal positions through the inner product of their respective embeddings, shown here as arrows pointing to a graph edge. The final prediction is obtained by solving a linear system; this can eliminate spurious responses, e.g. on the left pavement, by diffusing the per-pixel node scores over the whole spatio-temporal graph. The CNN and CRF are trained jointly end-to-end, while CRF inference is exact and particularly efficient.
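The sketch below illustrates the pipeline described in the caption as a toy NumPy implementation: pairwise couplings are formed as inner products of spatial and temporal embeddings, and the final scores come from one linear solve over all frames. The function name, array shapes, and the choice of lam are our own illustrative assumptions, not the authors' code.

```python
# Toy sketch (not the authors' implementation) of dense spatio-temporal
# Gaussian CRF inference from unaries plus spatial/temporal embeddings.
import numpy as np

def videogcrf_inference(unaries, spat_emb, temp_emb, lam=1.0):
    """unaries : (T, P, C) per-pixel class scores for T frames, P pixels, C classes
    spat_emb: (T, P, D) spatial embeddings (couple pixels within a frame)
    temp_emb: (T, P, D) temporal embeddings (couple pixels across frames)
    lam     : regularizer; should be large enough that A + lam*I is positive
              definite, so the quadratic energy has a unique minimizer."""
    T, P, C = unaries.shape
    N = T * P
    b = unaries.reshape(N, C)
    s = spat_emb.reshape(N, -1)
    t = temp_emb.reshape(N, -1)

    # Pairwise couplings as inner products of embeddings (dense, all pixel pairs).
    A_spatial = s @ s.T
    A_temporal = t @ t.T

    # Use spatial couplings within a frame and temporal couplings across frames.
    same_frame = np.kron(np.eye(T, dtype=bool), np.ones((P, P), dtype=bool))
    A = np.where(same_frame, A_spatial, A_temporal)

    # Exact inference: solve (A + lam*I) x = b for the coupled scores.
    x = np.linalg.solve(A + lam * np.eye(N), b)
    return x.reshape(T, P, C)

if __name__ == "__main__":
    # Tiny usage example: 2 frames, 16 pixels each, 3 classes, 8-dim embeddings.
    rng = np.random.default_rng(0)
    T, P, C, D = 2, 16, 3, 8
    scores = videogcrf_inference(rng.normal(size=(T, P, C)),
                                 0.1 * rng.normal(size=(T, P, D)),
                                 0.1 * rng.normal(size=(T, P, D)))
    print(scores.shape)  # (2, 16, 3)
```

In this toy version the embeddings are kept small so that lam = 1 dominates the spectrum of A; in practice the coupling matrix would be applied implicitly (e.g. via conjugate gradients) rather than materialized densely.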
Visualization of instance segmentation through VideoGCRF: In row 1 we focus on a single point of the CRF graph, shown as a cross, and show as a heatmap its spatial (intra-frame) and temporal (inter-frame) affinities to all other graph nodes. In row 2 we show the predictions obtained by frame-by-frame segmentation, relying exclusively on the FCN's unary terms, while in row 3 we show the results obtained by solving the VideoGCRF inference problem. We observe that with frame-by-frame segmentation a second camel is incorrectly detected due to its similar appearance, whereas VideoGCRF inference exploits temporal context and focuses solely on the correct object.
We compare our results with previously published methods, as well as with our own implementation of the ResNet-101 network, which serves as our base network.