Abstract
Understanding and forecasting future scene states is critical for autonomous agents to plan and act effectively in complex environments. Object-centric models with structured latent spaces have shown promise in modeling object dynamics and predicting future scene states, but they often struggle to scale beyond simple synthetic datasets and to integrate external guidance, limiting their applicability in robotic scenarios. To address these limitations, we propose TextOCVP, an object-centric model for video prediction guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and uses a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, enabling accurate and controllable predictions. TextOCVP's structured latent space offers more precise control over the forecasting process, outperforming several video prediction baselines on two datasets. Additionally, we show that structured object-centric representations provide superior robustness to novel scene configurations, as well as improved controllability and interpretability.
a) Overview: TextOCVP parses the reference frame into object slot representations. The text-conditioned object-centric predictor module models the object dynamics and interactions, incorporating information from the textual description to predict future object states, which can be decoded into frames.
b) Predictor: Overview of our proposed text-conditioned object-centric predictor module.
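To make the pipeline concrete, below is a minimal sketch of a text-conditioned object-centric predictor in PyTorch. All names (TextConditionedPredictor, slot_dim, the random placeholder tensors) are hypothetical illustrations, not the released TextOCVP implementation: slots query the embedded caption via cross-attention, interact through self-attention, and are rolled out autoregressively.

```python
# Minimal sketch of a text-conditioned object-centric predictor
# (hypothetical names; not the authors' released implementation).
import torch
import torch.nn as nn

class TextConditionedPredictor(nn.Module):
    """Predicts next-step object slots, guided by embedded caption tokens."""

    def __init__(self, slot_dim=128, text_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # Text-to-slot cross-attention: slots are queries, text tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(
            slot_dim, n_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        # Self-attention layers model interactions between objects.
        self.interaction = nn.ModuleList([
            nn.TransformerEncoderLayer(slot_dim, n_heads, batch_first=True)
            for _ in range(n_layers)])

    def forward(self, slots, text_tokens):
        # slots: (B, num_slots, slot_dim); text_tokens: (B, num_tokens, text_dim)
        guided, attn = self.cross_attn(slots, text_tokens, text_tokens)
        slots = slots + guided            # inject textual guidance
        for layer in self.interaction:
            slots = layer(slots)          # model object dynamics and interactions
        return slots, attn                # next slots + text-to-slot attention weights

# Autoregressive rollout from one reference frame. The object-centric encoder
# and decoder are omitted; random tensors stand in for their outputs.
predictor = TextConditionedPredictor()
slots = torch.randn(1, 6, 128)    # slots parsed from the reference frame
text = torch.randn(1, 12, 128)    # embedded caption tokens
for _ in range(10):               # each step's slots could be decoded to a frame
    slots, attn = predictor(slots, text)
```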






CATER Qualitative Evaluation
the medium green metal sphere is sliding to (2, 1). the small brown metal cube is picked up and placed to (-3, 1).



the large yellow rubber cone is sliding to (2, 3). the small gold metal snitch is picked up and placed to (-3, 1).



the medium green rubber cone is picked up and containing the small gold metal snitch. the large purple rubber cone is picked up and placed to (-1, 3).



the medium brown rubber cone is picked up and containing the small gold metal snitch. the medium gray metal cube is rotating.

the large blue metal cone is picked up and containing the small yellow rubber cone. the medium green metal cylinder is sliding to (-1, 1).

the medium brown metal cone is picked up and placed to (-3, -3). the large brown metal cone is picked up and containing the small gold metal snitch.

CLIPort Qualitative Evaluation
put the gray block in the brown bowl.

put the blue block in the gray bowl.

put the gray block in the brown bowl.
Controllability on CATER
the large purple rubber cone is picked up and placed to (2, 3). the small gold metal snitch is picked up and placed to (-1, 1).
the large purple rubber cone is sliding to (-1, -3). the small gold metal snitch is rotating.
the medium cyan rubber sphere is picked up and placed to (-2, -2). the medium purple metal cone is sliding to (2, 3).
the medium purple metal cone is picked up and placed to (-2, -3).
the large purple rubber cone is sliding to (-1, 1). the small gold metal snitch is rotating. the medium purple metal cone is picked up and placed to (-1, -3).
Controllability on CLIPort
put the green block in the cyan bowl.
put the green block in the red bowl.
put the cyan block in the brown bowl.
put the blue block in the brown bowl.
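These controllability examples can also be probed programmatically: keep the reference frame fixed and vary only the caption, and the rollout should change accordingly. The sketch below reuses the hypothetical predictor from above; embed_text is a stand-in for the model's text encoder, not a real component.

```python
# Hypothetical controllability probe: identical reference slots, different captions.
def embed_text(caption, num_tokens=12, text_dim=128):
    # Placeholder for a learned language encoder; seeded so each caption
    # maps to a fixed (random) embedding.
    torch.manual_seed(hash(caption) % 2**31)
    return torch.randn(1, num_tokens, text_dim)

reference_slots = torch.randn(1, 6, 128)     # slots from one reference frame

for caption in ["put the green block in the cyan bowl.",
                "put the green block in the red bowl."]:
    slots = reference_slots.clone()          # same initial scene for every prompt
    text = embed_text(caption)
    for _ in range(10):
        slots, _ = predictor(slots, text)    # only the caption steers the rollout
```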
An additional advantage of object-centric representations is their improved interpretability. This can be seen in the text-to-slot attention weights, which reveal how the textual information influences and guides the model's predictions.
First, we visualize the text-to-slot attention weights of different cross-attention heads for a single object. We observe that the slot representing the rotating cube attends to relevant text tokens from the input, such as the object's shape and size and the action taking place.

We additionally visualize the text-to-slot attention weights, averaged across attention heads, for different objects in a CATER sequence. We observe that slots representing objects mentioned in the textual description attend to relevant text tokens, such as their target coordinate locations.
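As a rough sketch, such an attention map can be extracted and plotted as below, reusing the hypothetical components above. Note that PyTorch's MultiheadAttention returns head-averaged weights by default; passing average_attn_weights=False in the forward call yields per-head maps like those in the first visualization.

```python
# Sketch: plot head-averaged text-to-slot attention for one prediction step.
import matplotlib.pyplot as plt

_, attn = predictor(reference_slots, embed_text("the cube is rotating."))
weights = attn[0].detach().numpy()           # (num_slots, num_tokens)

plt.imshow(weights, aspect="auto", cmap="viridis")
plt.xlabel("text token")
plt.ylabel("object slot")
plt.colorbar(label="attention weight")
plt.title("text-to-slot attention (head-averaged)")
plt.show()
```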
