Abstract
Predicting future scene representations is a crucial task for enabling robots to understand and interact with their environment. However, most existing methods rely on videos and simulations with precise action annotations, which limits their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations.
a) Training: PlaySlot is trained on unlabeled video sequences by inferring object representations and latent actions, and by using these representations to autoregressively forecast future video frames and object states.
b) Inference: PlaySlot autoregressively forecasts future frames conditioned on a single frame and latent actions, which can be inferred from observations, provided by a user, or output by a learned action policy.
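
To make the training and inference procedures above concrete, the following is a minimal, hedged sketch in PyTorch. It assumes an object-centric encoder/decoder, an inverse-dynamics module that infers latent actions from consecutive observations, and an autoregressive predictor; the module names, the linear placeholders, the image resolution, and the simple reconstruction loss are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of PlaySlot-style training and inference.
# All sub-modules are linear placeholders for illustration only.
import torch
import torch.nn as nn


class PlaySlotSketch(nn.Module):
    def __init__(self, num_slots=8, slot_dim=64, action_dim=16):
        super().__init__()
        self.encoder = nn.Linear(3 * 64 * 64, num_slots * slot_dim)      # frame -> object slots
        self.inv_dyn = nn.Linear(2 * num_slots * slot_dim, action_dim)   # (s_t, s_{t+1}) -> latent action
        self.predictor = nn.Linear(num_slots * slot_dim + action_dim,
                                   num_slots * slot_dim)                 # (s_t, a_t) -> s_{t+1}
        self.decoder = nn.Linear(num_slots * slot_dim, 3 * 64 * 64)      # slots -> frame

    def encode(self, frame):                          # frame: (B, 3, 64, 64)
        return self.encoder(frame.flatten(1))         # flattened slots: (B, N * D)

    def training_step(self, video):                   # video: (B, T, 3, 64, 64), no action labels
        B, T = video.shape[:2]
        slots = [self.encode(video[:, t]) for t in range(T)]
        loss, state = 0.0, slots[0]
        for t in range(T - 1):
            # Latent action inferred from consecutive observations (unsupervised).
            action = self.inv_dyn(torch.cat([slots[t], slots[t + 1]], dim=-1))
            # Autoregressive forecast of the next object state and frame.
            state = self.predictor(torch.cat([state, action], dim=-1))
            recon = self.decoder(state).view(B, 3, 64, 64)
            loss = loss + nn.functional.mse_loss(recon, video[:, t + 1])
        return loss / (T - 1)

    @torch.no_grad()
    def rollout(self, frame, actions):                # frame: (B, 3, 64, 64); actions: (B, T, A)
        # Latent actions may be inferred from observations, provided by a user,
        # or produced by a learned policy.
        state, frames = self.encode(frame), []
        for t in range(actions.shape[1]):
            state = self.predictor(torch.cat([state, actions[:, t]], dim=-1))
            frames.append(self.decoder(state).view(-1, 3, 64, 64))
        return torch.stack(frames, dim=1)
```

The key point illustrated here is that latent actions are inferred from the videos themselves during training, so no action annotations are required, while the same rollout at inference time accepts latent actions from any of the three sources listed above.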

The GIFs below show predicted trajectories within PlaySlot's latent imagination: starting from a single reference frame, the model autoregressively generates latent actions with the learned policy and predicts future scene states in latent space. We also show the simulated execution of the decoded latent actions in the corresponding environment.
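
As a rough illustration of such a latent-imagination rollout, the snippet below reuses the PlaySlotSketch modules from the sketch above and adds a hypothetical policy head that maps the current latent state to the next latent action. The names, dimensions, and horizon are assumptions for illustration, and the mapping from decoded latent actions to robot commands is not shown.

```python
# Hedged sketch of a latent-imagination rollout driven by a learned policy.
import torch
import torch.nn as nn


@torch.no_grad()
def imagine(model, policy, frame, horizon=20):
    """Roll out `horizon` steps purely in latent space from one reference frame."""
    state = model.encode(frame)                       # initial object-centric state
    decoded_frames, latent_actions = [], []
    for _ in range(horizon):
        action = policy(state)                        # policy proposes the next latent action
        state = model.predictor(torch.cat([state, action], dim=-1))
        decoded_frames.append(model.decoder(state).view(-1, 3, 64, 64))
        latent_actions.append(action)
    return torch.stack(decoded_frames, dim=1), torch.stack(latent_actions, dim=1)


# Example usage with the illustrative dimensions from the sketch above.
model = PlaySlotSketch()
policy = nn.Linear(8 * 64, 16)                        # hypothetical policy head: state -> latent action
frames, actions = imagine(model, policy, torch.rand(1, 3, 64, 64))
```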