Abstract
Understanding and forecasting future scene states is critical for autonomous agents to plan and act effectively in complex environments. Object-centric models with structured latent spaces have shown promise in modeling object dynamics and predicting future scene states, but they often struggle to scale beyond simple synthetic datasets and to integrate external guidance, limiting their applicability in robotic scenarios. To address these limitations, we propose TextOCVP, an object-centric model for video prediction guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and uses a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, enabling accurate and controllable predictions. TextOCVP's structured latent space offers more precise control over the forecasting process, outperforming several video prediction baselines on two datasets. Additionally, we show that structured object-centric representations provide superior robustness to novel scene configurations, as well as improved controllability and interpretability.
a) Overview: TextOCVP parses the reference frame into object slot representations. The text-conditioned object-centric predictor module models the object dynamics and interactions, incorporating information from the textual description to predict future object states, which can be decoded into frames.
b) Predictor: Overview of our proposed text-conditioned object-centric predictor module.
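To make the pipeline concrete, below is a minimal sketch of a text-conditioned object-centric predictor in PyTorch. All names (TextConditionedPredictor, slot_dim, the random placeholder tensors) are hypothetical illustrations, not the released TextOCVP implementation: slots query the embedded caption via cross-attention, interact through self-attention, and are rolled out autoregressively.

```python
# Minimal sketch of a text-conditioned object-centric predictor
# (hypothetical names; not the authors' released implementation).
import torch
import torch.nn as nn

class TextConditionedPredictor(nn.Module):
    """Predicts next-step object slots, guided by embedded caption tokens."""

    def __init__(self, slot_dim=128, text_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # Text-to-slot cross-attention: slots are queries, text tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(
            slot_dim, n_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        # Self-attention layers model interactions between objects.
        self.interaction = nn.ModuleList([
            nn.TransformerEncoderLayer(slot_dim, n_heads, batch_first=True)
            for _ in range(n_layers)])

    def forward(self, slots, text_tokens):
        # slots: (B, num_slots, slot_dim); text_tokens: (B, num_tokens, text_dim)
        guided, attn = self.cross_attn(slots, text_tokens, text_tokens)
        slots = slots + guided            # inject textual guidance
        for layer in self.interaction:
            slots = layer(slots)          # model object dynamics and interactions
        return slots, attn                # next slots + text-to-slot attention weights

# Autoregressive rollout from one reference frame. The object-centric encoder
# and decoder are omitted; random tensors stand in for their outputs.
predictor = TextConditionedPredictor()
slots = torch.randn(1, 6, 128)    # slots parsed from the reference frame
text = torch.randn(1, 12, 128)    # embedded caption tokens
for _ in range(10):               # each step's slots could be decoded to a frame
    slots, attn = predictor(slots, text)
```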






CATER Qualitative Evaluation
the medium green metal sphere is sliding to (2, 1). the small brown metal cube is picked up and placed to (-3, 1).



the large yellow rubber cone is sliding to (2, 3). the small gold metal snitch is picked up and placed to (-3, 1).



the medium green rubber cone is picked up and containing the small gold metal snitch. the large purple rubber cone is picked up and placed to (-1, 3).



the medium brown rubber cone is picked up and containing the small gold metal snitch. the medium gray metal cube is rotating.

the large blue metal cone is picked up and containing the small yellow rubber cone. the medium green metal cylinder is sliding to (-1, 1).

the medium brown metal cone is picked up and placed to (-3, -3). the large brown metal cone is picked up and containing the small gold metal snitch.

CLIPort Qualitative Evaluation
put the gray block in the brown bowl.

put the blue block in the gray bowl.

put the gray block in the brown bowl.
Controllability on CATER
the large purple rubber cone is picked up and placed to (2, 3). the small gold metal snitch is picked up and placed to (-1, 1).
the large purple rubber cone is sliding to (-1, -3). the small gold metal snitch is rotating.
the medium cyan rubber sphere is picked up and placed to (-2, -2). the medium purple metal cone is sliding to (2, 3).
the medium purple metal cone is picked up and placed to (-2, -3).
the large purple rubber cone is sliding to (-1, 1). the small gold metal snitch is rotating. the medium purple metal cone is picked up and placed to (-1, -3).
Controllability on CLIPort
put the green block in the cyan bowl.
put the green block in the red bowl.
put the cyan block in the brown bowl.
put the blue block in the brown bowl.
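These controllability examples can also be probed programmatically: keep the reference frame fixed and vary only the caption, and the rollout should change accordingly. The sketch below reuses the hypothetical predictor from above; embed_text is a stand-in for the model's text encoder, not a real component.

```python
# Hypothetical controllability probe: identical reference slots, different captions.
def embed_text(caption, num_tokens=12, text_dim=128):
    # Placeholder for a learned language encoder; seeded so each caption
    # maps to a fixed (random) embedding.
    torch.manual_seed(hash(caption) % 2**31)
    return torch.randn(1, num_tokens, text_dim)

reference_slots = torch.randn(1, 6, 128)     # slots from one reference frame

for caption in ["put the green block in the cyan bowl.",
                "put the green block in the red bowl."]:
    slots = reference_slots.clone()          # same initial scene for every prompt
    text = embed_text(caption)
    for _ in range(10):
        slots, _ = predictor(slots, text)    # only the caption steers the rollout
```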
An additional advantage of object-centric representations is their improved interpretability. This can be seen in the text-to-slot attention weights, which reveal how the textual information influences and guides the model's predictions.
First, we visualize the text-to-slot attention weights of different cross-attention heads for a single object. We observe that the slot representing the rotating cube attends to relevant text tokens from the input, such as the object's shape and size and the action taking place.

We additionally visualize the text-to-slot attention weights, averaged across attention heads, for different objects in a CATER sequence. We observe that slots representing objects mentioned in the textual description attend to relevant text tokens, such as their target coordinate locations.
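As a rough sketch, such an attention map can be extracted and plotted as below, reusing the hypothetical components above. Note that PyTorch's MultiheadAttention returns head-averaged weights by default; passing average_attn_weights=False in the forward call yields per-head maps like those in the first visualization.

```python
# Sketch: plot head-averaged text-to-slot attention for one prediction step.
import matplotlib.pyplot as plt

_, attn = predictor(reference_slots, embed_text("the cube is rotating."))
weights = attn[0].detach().numpy()           # (num_slots, num_tokens)

plt.imshow(weights, aspect="auto", cmap="viridis")
plt.xlabel("text token")
plt.ylabel("object slot")
plt.colorbar(label="attention weight")
plt.title("text-to-slot attention (head-averaged)")
plt.show()
```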
