Title: Decoupling Exploitation and Intrinsically Motivated Exploration in Reinforcement Learning
Abstract: Intrinsic rewards can improve exploration in reinforcement learning, but often suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of the exploration and exploitation policies.
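The core idea of the abstract, keeping the exploitation objective stationary by feeding the intrinsic bonus only to a separate exploration policy, can be illustrated with a toy sketch. The count-based bonus, the `beta` coefficient, and all names below are illustrative assumptions, not details taken from the paper:

```python
from collections import defaultdict


class CountBasedIntrinsicReward:
    """Toy novelty bonus: r_int(s) = beta / sqrt(N(s)).

    This is one common family of intrinsic reward; the paper evaluates
    multiple types, and `beta` here is an arbitrary illustrative value.
    """

    def __init__(self, beta=0.5):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / self.counts[state] ** 0.5


# Decoupled reward streams: the exploration policy is trained on
# extrinsic + intrinsic reward (non-stationary, since the bonus decays
# with visitation), while the exploitation policy sees only the sparse
# extrinsic reward, so its learning target stays stationary.
intrinsic = CountBasedIntrinsicReward(beta=0.5)
extrinsic_reward = 0.0                           # sparse task reward
r_explore = extrinsic_reward + intrinsic("s0")   # update exploration policy
r_exploit = extrinsic_reward                     # update exploitation policy
```

Because the bonus shrinks as states are revisited, only the exploration policy's objective drifts over time; the exploitation policy is insulated from that non-stationarity, which is the robustness argument the abstract makes.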
Title: Generalisation in Deep Reinforcement Learning using Causal Inference
Abstract: Reinforcement Learning (RL) and Causal Inference have evolved as independent disciplines and have both shown promising progress in reasoning and decision-making. RL algorithms have been shown to be successful on the specific tasks on which they were trained, but do not generalise well to new tasks and environments. In practice, it is common to train RL algorithms with random initialisations of environment variables to maximise the variations of the environment seen during training. We explore how techniques from Causal Inference could be used to encourage an RL agent to learn causal relationships without the need to randomise all environment variables. The causal knowledge can then be used for decision-making in unseen tasks where the causal relationships remain unchanged, improving generalisation.
Shangmin (Shawn) Guo
Title: Better Supervisory Signals by Observing Learning Paths
Abstract: Better supervision might lead to better generalization performance. In this paper, we first clarify what makes supervision good for a classification problem, and then explain two existing label refining methods, label smoothing and knowledge distillation (KD), in terms of our proposed criterion. To further answer why and how better supervision emerges, we look deeper into the learning path of the network's predicted distribution for training samples of different difficulty. A "zig-zag pattern" for hard samples is the crux of KD's success. Observing the learning path not only provides a new perspective for understanding KD, overfitting, and learning dynamics, but also points out the high-variance issue of KD on real tasks. Inspired by this, we propose Filter-KD to further enhance classification performance, as verified by experiments in various settings.
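For reference, label smoothing, one of the two label-refining methods the abstract discusses, mixes the one-hot target with a uniform distribution over classes. This is a minimal sketch of that standard technique; the smoothing factor `eps=0.1` is a common illustrative choice, not a value from the paper:

```python
import numpy as np


def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: (1 - eps) * one_hot + eps / num_classes.

    Softens the hard target so the network is not pushed toward
    infinitely confident predictions on the true class.
    """
    num_classes = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / num_classes


# A hard one-hot target for class 2 out of 3 classes.
y = np.array([0.0, 0.0, 1.0])
y_smooth = smooth_labels(y)  # roughly [0.033, 0.033, 0.933]
```

KD plays a similar role but replaces the uniform mixture with a teacher network's predicted distribution, which is sample-dependent; that distinction is what the paper's learning-path analysis builds on.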