TextOCVP: Object-Centric Image to Video Generation with Language Guidance

Abstract

Accurate and flexible world models are crucial for autonomous systems to understand their environment and predict future events. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and interactions, but often face challenges in scaling to complex datasets and incorporating external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for image-to-video generation guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions. Thanks to its structured latent space, our method offers enhanced control over the prediction process and outperforms several image-to-video generative baselines. Additionally, we demonstrate that structured object-centric representations provide superior controllability and interpretability, facilitating the modeling of object dynamics and enabling more precise and understandable predictions.

TextOCVP Architecture
Overview of the TextOCVP model.
a) Overview: TextOCVP parses the reference frame into object slot representations. The text-conditioned object-centric predictor module models the object dynamics and interactions, incorporating information from the textual description to predict future object states, which can be decoded into frames.
b) Predictor: Overview of our proposed text-conditioned object-centric predictor module.
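To make the pipeline concrete, the sketch below outlines how a single reference frame and a caption could flow through the three stages (slot encoder, text-conditioned predictor, slot decoder). It is a minimal PyTorch-style sketch: all module and method names are illustrative placeholders, not the released implementation.

```python
# Minimal sketch of the image-to-video pipeline described above (PyTorch-style).
# All module names, signatures, and shapes are illustrative placeholders.
import torch
import torch.nn as nn


class TextOCVPSketch(nn.Module):
    def __init__(self, slot_encoder, text_encoder, predictor, decoder):
        super().__init__()
        self.slot_encoder = slot_encoder  # reference frame -> object slots
        self.text_encoder = text_encoder  # caption tokens -> text embeddings
        self.predictor = predictor        # text-conditioned transformer over slots
        self.decoder = decoder            # slots -> rendered frame

    @torch.no_grad()
    def generate(self, ref_frame, caption_tokens, num_frames):
        """Autoregressively predict future frames from one frame and a caption."""
        slots = self.slot_encoder(ref_frame)          # (B, num_slots, slot_dim)
        text_emb = self.text_encoder(caption_tokens)  # (B, num_tokens, text_dim)

        slot_history, frames = [slots], []
        for _ in range(num_frames):
            # Predict the next object slots from all past slots, with the caption
            # injected via text-to-slot cross-attention inside the predictor.
            next_slots = self.predictor(torch.stack(slot_history, dim=1), text_emb)
            slot_history.append(next_slots)
            frames.append(self.decoder(next_slots))   # decode slots into a frame
        return torch.stack(frames, dim=1)             # (B, num_frames, C, H, W)
```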

Benchmarking
TextOCVP outperforms all baselines on the CATER dataset, and performs among the best image-to-video generation models on the more challenging CLIPort dataset.

Benchmark on CATER
Benchmark on CLIPort

Qualitative Evaluation
Given a single reference frame and a text caption, TextOCVP generates a sequence that closely aligns with the ground truth. TextOCVP maintains sharp object representations and correctly models the dynamics and interactions between the robot arm and the objects. In contrast, the baseline model exhibits multiple errors and artifacts, such as missing objects, blurry contours, and incorrect robot-arm motion.


CATER Qualitative Evaluation

the medium green metal sphere is sliding to (2, 1). the small brown metal cube is picked up and placed to (-3, 1).

GT
MAGE Baseline
TextOCVP (ours)

the large yellow rubber cone is sliding to (2, 3). the small gold metal snitch is picked up and placed to (-3, 1).

GT
MAGE Baseline
TextOCVP (ours)

the medium green rubber cone is picked up and containing the small gold metal snitch. the large purple rubber cone is picked up and placed to (-1, 3).

GT
MAGE Baseline
TextOCVP (ours)


CLIPort Qualitative Evaluation

put the gray block in the brown bowl.

GT
MAGE-Dino Baseline
TextOCVP (ours)

put the blue block in the gray bowl.

GT
MAGE-Dino Baseline
TextOCVP (ours)

put the gray block in the brown bowl.

GT
MAGE-Dino Baseline
TextOCVP (ours)

Object-Centric Video Generation
TextOCVP represents each object in its corresponding object slot, learning accurate and interpretable object representations.
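The per-object renderings below ("Obj. 1" to "Obj. 4") can be obtained with the usual slot-decoding recipe, in which each slot is decoded into an appearance and an alpha mask before compositing. The following sketch illustrates this assumed recipe; it is not necessarily the exact decoder used in TextOCVP.

```python
# Assumed sketch of how per-object visualizations can be obtained from slots:
# each slot is decoded into RGB + an alpha logit, and objects are composited
# with a softmax over slots. Illustrative only, not the exact TextOCVP decoder.
import torch


def render_per_object(slot_decoder, slots):
    """Decode every slot separately and composite them into a full frame.

    slot_decoder: callable mapping (num_slots, slot_dim) -> (num_slots, 4, H, W),
                  where the 4 channels are RGB plus an alpha logit.
    slots:        (num_slots, slot_dim) object representations for one frame.
    """
    rgba = slot_decoder(slots)                  # (num_slots, 4, H, W)
    rgb, alpha_logits = rgba[:, :3], rgba[:, 3:]
    masks = torch.softmax(alpha_logits, dim=0)  # slots compete for each pixel
    per_object = rgb * masks                    # the individual "Obj. k" renderings
    frame = per_object.sum(dim=0)               # full reconstructed frame
    return frame, per_object


# Tiny smoke test with a dummy decoder that broadcasts slot features into 4 channels.
def dummy_decoder(s):
    return s[:, :4, None, None].expand(-1, -1, 16, 16)


frame, per_object = render_per_object(dummy_decoder, torch.randn(5, 64))
```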

the medium brown rubber cone is picked up and containing the small gold metal snitch. the medium gray metal cube is rotating.

GT
Preds
Obj. 1
Obj. 2
Obj. 3
Obj. 4

the large blue metal cone is picked up and containing the small yellow rubber cone. the medium green metal cylinder is sliding to (-1, 1).

GT
Preds
Obj. 1
Obj. 2
Obj. 3
Obj. 4

the medium brown metal cone is picked up and placed to (-3, -3). the large brown metal cone is picked up and containing the small gold metal snitch.

GT
Preds
Obj. 1
Obj. 2
Obj. 3
Obj. 4

Controllability
We demonstrate that TextOCVP can generate multiple possible sequence continuations conditioned on a single reference frame and different captions.
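As a hypothetical usage example building on the architecture sketch above (model, ref_frame, and tokenizer are placeholder names, not a released API), producing different continuations only requires swapping the caption:

```python
# Hypothetical usage of the TextOCVPSketch interface from the architecture sketch;
# `model`, `ref_frame`, and `tokenizer` are placeholder names, not released APIs.
captions = [
    "the large purple rubber cone is sliding to (-1, -3). the small gold metal snitch is rotating.",
    "the medium purple metal cone is picked up and placed to (-2, -3).",
]
# One predicted rollout per caption, all conditioned on the same reference frame.
rollouts = [model.generate(ref_frame, tokenizer(c), num_frames=20) for c in captions]
```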

Controllability on CATER


Original Caption
the large purple rubber cone is picked up and placed to (2, 3). the small gold metal snitch is picked up and placed to (-1, 1).
Changed Actions
the large purple rubber cone is sliding to (-1, -3). the small gold metal snitch is rotating.
Changed Moving Objects and Actions
the medium cyan rubber sphere is picked up and placed to (-2, -2). the medium purple metal cone is sliding to (2, 3).
Single Action in Caption
the medium purple metal cone is picked up and placed to (-2, -3).
Three Distinct Actions in Caption
the large purple rubber cone is sliding to (-1, 1). the small gold metal snitch is rotating. the medium purple metal cone is picked up and placed to (-1, -3).

Controllability on CLIPort


Changing the Target Bowl
put the green block in the cyan bowl.
put the green block in the red bowl.
Changing the Picked Block
put the cyan block in the brown bowl.
put the blue block in the brown bowl.

Text-to-Slot Attention

An additional advantage of object-centric representations is their improved interpretability. This is reflected in the text-to-slot attention weights, which help us understand how the textual information influences and guides the model's predictions.

First, we visualize the text-to-slot attention weights of different cross-attention heads for a single object. We observe that the slot representing the rotating cube attends to relevant text tokens from the input, such as the object's shape and size and the action taking place.
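For intuition, the sketch below shows a generic multi-head text-to-slot cross-attention step and the per-head attention maps that such a visualization plots. The learned query/key/value projections are omitted and the head count is an assumption, so this is not the exact predictor layer.

```python
# Generic multi-head text-to-slot cross-attention (learned q/k/v projections omitted
# for brevity); the returned attention maps are what the visualizations plot.
import torch


def text_to_slot_attention(slots, text_tokens, num_heads=4):
    """Slots query the caption's text tokens.

    slots:       (num_slots, dim) object queries.
    text_tokens: (num_tokens, dim) encoded caption.
    Returns updated slots and per-head weights of shape (num_heads, num_slots, num_tokens).
    """
    num_slots, dim = slots.shape
    head_dim = dim // num_heads

    def split_heads(x):  # (N, dim) -> (num_heads, N, head_dim)
        return x.view(x.shape[0], num_heads, head_dim).transpose(0, 1)

    q, k, v = split_heads(slots), split_heads(text_tokens), split_heads(text_tokens)
    attn = torch.softmax(q @ k.transpose(-1, -2) / head_dim**0.5, dim=-1)
    updated = (attn @ v).transpose(0, 1).reshape(num_slots, dim)
    return updated, attn  # attn[h, s, t] = weight of text token t for slot s in head h


# Example: 7 slots attending over a 12-token caption embedding of dimension 128.
_, weights = text_to_slot_attention(torch.randn(7, 128), torch.randn(12, 128))
print(weights.shape)  # torch.Size([4, 7, 12])
```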


We additionally visualize the text-to-slot attention weights, averaged across attention heads, for different objects in a CATER sequence. We observe that slots representing objects mentioned in the textual description attend to relevant text tokens, such as their target coordinate locations.