PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning

Abstract

Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations.

PlaySlot Training and Inference
Overview of PlaySlot training and inference processes.
a) Training: PlaySlot is trained on unlabeled video sequences by inferring object representations and latent actions, and using these representations to autoregressively forecast future video frames and object states.
b) Inference: PlaySlot autoregressively forecasts future frames conditioned on a single frame and latent actions, which can be inferred from observations, provided by a user, or output by a learned action policy.
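
The inference loop in b) can be sketched as a plain autoregressive rollout in which the source of latent actions is interchangeable. This is a minimal illustration, not the PlaySlot implementation: `toy_predictor`, `from_user`, `from_policy`, and `rollout` are hypothetical names, slots are plain lists of floats, and the "dynamics" simply shift each slot by the action. Actions inferred from observations would be one more action source with the same signature and are omitted here.

```python
from typing import Callable, List

Slots = List[List[float]]   # one feature vector per object slot
Action = List[float]        # a latent action vector


def toy_predictor(slots: Slots, action: Action) -> Slots:
    """Hypothetical stand-in for learned dynamics: shift every slot by the action."""
    return [[s + a for s, a in zip(slot, action)] for slot in slots]


def from_user(actions: List[Action]) -> Callable[[Slots, int], Action]:
    """Action source that replays a fixed, user-provided action sequence."""
    return lambda slots, t: actions[t]


def from_policy(policy: Callable[[Slots], Action]) -> Callable[[Slots, int], Action]:
    """Action source that queries a policy on the current slots."""
    return lambda slots, t: policy(slots)


def rollout(slots: Slots,
            action_source: Callable[[Slots, int], Action],
            predictor: Callable[[Slots, Action], Slots],
            horizon: int) -> List[Slots]:
    """Forecast `horizon` steps, feeding each prediction back in as the next input."""
    trajectory = []
    for t in range(horizon):
        action = action_source(slots, t)
        slots = predictor(slots, action)
        trajectory.append(slots)
    return trajectory


start = [[0.0, 0.0]]  # slots of the single conditioning frame (toy values)
user_traj = rollout(start, from_user([[1.0, 0.0], [0.0, 2.0]]), toy_predictor, horizon=2)
policy_traj = rollout(start, from_policy(lambda slots: [0.5, 0.5]), toy_predictor, horizon=2)
```

Keeping the action source behind one small interface means the same rollout code produces different futures from the same starting frame, depending only on which actions are supplied.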

PlaySlot Predictions
[Figure: nine example sequences, each with panels for the ground truth (GT), the PlaySlot prediction (Pred), the slot masks, and four per-object panels (Obj. 1 to Obj. 4).]
Benchmarking
PlaySlot achieves the best results on datasets that require modeling object interactions (BlockPush) or feature multiple moving objects (GridShapes), while maintaining competitive performance on ButtonPress.

Our proposed PlaySlot model accurately predicts the scene dynamics, even in the presence of occlusions and interactions between the robot and the objects, whereas the baselines (SVG and CADDY) fail to model object interactions, leading to blurry predictions or disappearing objects.
Learned Actions
We visualize the video frames generated by PlaySlot by repeatedly conditioning the prediction process on a single action prototype.
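
Repeated conditioning on one action prototype can be sketched as follows. The sketch assumes that prototypes are entries of a small codebook and that a continuous latent action is snapped to its nearest entry. That lookup is an assumption made for illustration, not something stated above. `nearest_prototype`, `repeat_prototype`, and the toy `step` function are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]


def nearest_prototype(latent: Vector, prototypes: List[Vector]) -> int:
    """Return the index of the prototype closest to `latent` (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(prototypes)), key=lambda k: sq_dist(latent, prototypes[k]))


def repeat_prototype(state: Vector,
                     prototype: Vector,
                     step: Callable[[Vector, Vector], Vector],
                     horizon: int) -> List[Vector]:
    """Apply the same prototype at every step, feeding each prediction back in."""
    states = []
    for _ in range(horizon):
        state = step(state, prototype)
        states.append(state)
    return states


def step(state: Vector, action: Vector) -> Vector:
    """Toy dynamics (assumption): the action is a displacement added to the state."""
    return [s + a for s, a in zip(state, action)]


prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # toy codebook
index = nearest_prototype([0.9, 0.2], prototypes)     # closest to the first entry
frames = repeat_prototype([0.0, 0.0], prototypes[1], step, horizon=3)
```

With dynamics this simple, repeating a prototype moves the state in a straight line, which mirrors why such visualizations make the effect of each prototype easy to read.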
[Figure: four examples, each with a panel labeled InvDyn followed by one panel per action prototype (Act. 1 to Act. 7; one example shows only Act. 1 to Act. 5).]
Learned Behaviors from Unlabeled Expert Demonstrations
PlaySlot can learn robot behaviors from *unlabeled* expert demonstrations in a sample-efficient manner. Furthermore, we train a shallow model to map PlaySlot's latent actions to the real action space, which makes it possible to execute the learned behaviors. The GIFs below show two things: predicted trajectories within PlaySlot's latent imagination, where the model, starting from a single reference frame, autoregressively generates latent actions using the policy model and predicts future scene states in the latent space; and the simulated execution of the decoded latent actions in the corresponding environment.
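
The two pieces described above, a shallow latent-to-real action decoder and a rollout in latent imagination, can be sketched in a scalar toy setting. Everything here is illustrative: the decoder is assumed to be a one-dimensional affine map fitted by least squares on a few paired (latent action, real action) samples, and `fit_affine_decoder`, `imagine`, the policy, and the dynamics are hypothetical stand-ins rather than PlaySlot components.

```python
from typing import Callable, List, Tuple


def fit_affine_decoder(latents: List[float], actions: List[float]) -> Tuple[float, float]:
    """Closed-form least-squares fit of action = w * latent + b (scalar toy case)."""
    n = len(latents)
    mean_z = sum(latents) / n
    mean_a = sum(actions) / n
    var_z = sum((z - mean_z) ** 2 for z in latents)
    cov = sum((z - mean_z) * (a - mean_a) for z, a in zip(latents, actions))
    w = cov / var_z
    b = mean_a - w * mean_z
    return w, b


def imagine(state: float,
            policy: Callable[[float], float],
            dynamics: Callable[[float, float], float],
            decode: Callable[[float], float],
            horizon: int) -> Tuple[List[float], List[float]]:
    """Roll out in latent space and decode each latent action to a real action."""
    states, real_actions = [], []
    for _ in range(horizon):
        z = policy(state)              # latent action chosen by the policy
        state = dynamics(state, z)     # predicted next latent state
        states.append(state)
        real_actions.append(decode(z)) # action that would be sent to the simulator
    return states, real_actions


# Paired samples (toy values) generated by the relation action = 2 * latent + 1.
w, b = fit_affine_decoder([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])

states, real_actions = imagine(
    state=0.0,
    policy=lambda s: 1.0,              # constant toy policy
    dynamics=lambda s, z: s + z,       # toy latent dynamics
    decode=lambda z: w * z + b,
    horizon=3,
)
```

Separating the decoder from the rollout means the policy and the predictor only ever see latent actions, and ground-truth actions are needed solely for fitting the small decoder.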
[Figure: examples of learned behaviors, each with three panels: Latent Predictions (Pred), Predicted Slot Masks (Masks), and Simulated Execution (Sim).]