PlaySlot: Object-Centric Prediction and Planning

PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning

Abstract

Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on videos and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations.

PlaySlot Training and Inference

Overview of PlaySlot training and inference processes.
a) Training: PlaySlot is trained given unlabeled video sequences by inferring object representations and latent actions, and using these representations to autoregressively forecast future video frames and object states.
b) Inference: PlaySlot autoregressively forecasts future frames conditioned on a single frame and latent actions, which can be inferred from observations, provided by a user, or output by a learned action policy.
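The inference loop in (b) can be illustrated with a minimal sketch. It is not PlaySlot's implementation: `toy_dynamics` is a hypothetical stand-in for the learned predictor, and the slot and action sizes are arbitrary. It only shows the autoregressive pattern, in which each predicted object state is fed back as input for the next step and the sequence of latent actions steers the outcome.

```python
import numpy as np

def toy_dynamics(slots, action):
    """Hypothetical stand-in for a learned predictor: shifts every slot by the action."""
    return slots + action  # broadcasts (num_slots, dim) + (dim,)

def rollout(initial_slots, latent_actions, dynamics=toy_dynamics):
    """Autoregressively predict one set of object slots per latent action."""
    slots = initial_slots
    trajectory = []
    for action in latent_actions:
        # Each prediction becomes the input for the next step.
        slots = dynamics(slots, action)
        trajectory.append(slots)
    return np.stack(trajectory)

# Assumed sizes for illustration: 4 object slots of dimension 8, 5 future steps.
initial_slots = np.zeros((4, 8))
latent_actions = np.full((5, 8), 0.1)
future = rollout(initial_slots, latent_actions)
print(future.shape)
```

The latent actions passed to `rollout` could come from any of the three sources named above; the loop itself does not change.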

PlaySlot Predictions

[Figure: PlaySlot predictions on BlockPush, MetaWorld Button-Press, and Sketchy. Panels: predicted frames (Pred), slot masks, and individual objects (Obj. 1 to Obj. 4).]

Benchmarking

PlaySlot achieves the best results on datasets that require modeling object interactions (BlockPush) or that feature multiple moving objects (GridShapes), while remaining competitive on ButtonPress and on the real-world robotics dataset Sketchy.

PlaySlot accurately predicts the scene dynamics, even in the presence of occlusions and interactions between the robot and the objects. In contrast, the baselines (SVG and CADDY) fail to model object interactions, which leads to blurry predictions or disappearing objects.

Comparison with Baselines

[Figure: side-by-side predictions from PlaySlot, CADDY, and SVG on BlockPush, MetaWorld Button-Press, GridShapes, and Sketchy, with two sequences per dataset.]

Learned Actions

We visualize the video frames generated by PlaySlot by repeatedly conditioning the prediction process on a single action prototype.

[Figure: generated frames per dataset, with one InvDyn panel followed by one panel per action prototype. BlockPush, MetaWorld Button-Press, and Sketchy: Act. 1 to Act. 7. GridShapes: Act. 1 to Act. 5.]

Learned Behaviors from Unlabelled Expert Demonstrations

PlaySlot can learn robot behaviors from *unlabelled* expert demonstrations in a sample-efficient manner. Furthermore, we train a shallow model to map PlaySlot's latent actions to the real action space, which allows the learned behaviors to be executed.
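As a rough illustration of mapping latent actions to real actions, the sketch below fits a linear decoder with least squares. This is an assumption made for illustration: the text only says the model is shallow, so the linear form, the availability of paired latent and real actions, the synthetic data, and the names `fit_linear_decoder` and `decode` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired data (assumption): 200 latent actions of dimension 6
# and matching real actions of dimension 4, related by an unknown linear map.
latent_actions = rng.normal(size=(200, 6))
true_map = rng.normal(size=(6, 4))
real_actions = latent_actions @ true_map

def add_bias(latents):
    """Append a constant column so the decoder can learn an offset."""
    return np.hstack([latents, np.ones((len(latents), 1))])

def fit_linear_decoder(latents, targets):
    """Least-squares fit of a linear map from latent to real actions."""
    weights, *_ = np.linalg.lstsq(add_bias(latents), targets, rcond=None)
    return weights

def decode(weights, latents):
    """Map latent actions to the real action space."""
    return add_bias(latents) @ weights

weights = fit_linear_decoder(latent_actions, real_actions)
predicted = decode(weights, latent_actions)
print(weights.shape, np.abs(predicted - real_actions).max())
```

A small neural network could replace the linear map if the relationship between latent and real actions is not linear; the fitting and decoding steps would keep the same roles.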

[Figure: learned behaviors on ButtonPress and BlockPush.]

The GIFs below show trajectories predicted within PlaySlot’s latent imagination. Starting from a single reference frame, the model autoregressively generates latent actions with the policy model and predicts future scene states in the latent space. We also show the simulated execution of the decoded latent actions in the corresponding environment.
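The imagination procedure can be sketched as a loop in which a policy proposes a latent action from the current object slots and a dynamics model predicts the next slots. Both `toy_policy` and `toy_dynamics` are hypothetical placeholders chosen so the example runs; they are not the learned models.

```python
import numpy as np

def toy_policy(slots):
    """Hypothetical policy: proposes a latent action that pulls the mean slot toward zero."""
    return -0.5 * slots.mean(axis=0)

def toy_dynamics(slots, action):
    """Hypothetical dynamics: shifts every slot by the latent action."""
    return slots + action

def imagine(initial_slots, policy, dynamics, horizon):
    """Alternate between choosing a latent action and predicting the next slots."""
    slots = initial_slots
    actions, states = [], []
    for _ in range(horizon):
        action = policy(slots)           # latent action from the current state
        slots = dynamics(slots, action)  # predicted next state in latent space
        actions.append(action)
        states.append(slots)
    return np.stack(actions), np.stack(states)

# Assumed sizes for illustration: 3 object slots of dimension 4, horizon of 6 steps.
initial_slots = np.ones((3, 4))
actions, states = imagine(initial_slots, toy_policy, toy_dynamics, horizon=6)
print(actions.shape, states.shape)
```

The returned latent actions are what a decoder would translate into real actions for execution in the environment.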

[Figure: latent predictions, predicted slot masks, and simulated execution for MetaWorld Button-Press and BlockPush.]
