Cesar Salcedo

Learnable Reward Functions for Reinforcement Learning

Aug 9, 2022

This project started with the goal of improving learning speed for long-horizon, sparse-reward tasks in Reinforcement Learning (RL). The approach consists of learning a smooth reward function from a sparse reward function, effectively increasing the amount of feedback the RL agent receives during learning. The idea was to improve learning stability and convergence speed while keeping the reward structure coherent with the original task.

The learnable reward function is modeled by a neural network. The training loop simultaneously trains a REINFORCE agent alongside the reward function: gradients from the agent's loss backpropagate through the reward function, providing the signal it needs to learn. Since feedback is sparse at first, we use curriculum learning to gradually increase the difficulty of the task as training progresses. After testing on the CartPole environment, the system learned a meaningful reward function, as shown in the image below.
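To make the setup concrete, here is a minimal sketch of how such a joint training loop could look, assuming PyTorch and Gymnasium. The network architectures, the grounding loss that ties the learned return to the sparse environment return, and the hyperparameters are illustrative assumptions rather than the project's exact code, and the curriculum schedule is omitted for brevity.

```python
import torch
import torch.nn as nn
import gymnasium as gym

class PolicyNet(nn.Module):
    """REINFORCE policy: maps a CartPole state to action logits (hypothetical architecture)."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return self.net(obs)

class RewardNet(nn.Module):
    """Learnable reward: maps a state to a scalar dense reward (hypothetical architecture)."""
    def __init__(self, obs_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

env = gym.make("CartPole-v1")
policy, reward_fn = PolicyNet(), RewardNet()
optimizer = torch.optim.Adam(list(policy.parameters()) + list(reward_fn.parameters()), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, states, env_rewards = [], [], []
    done = False
    while not done:
        state = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        obs, r, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        states.append(state)
        env_rewards.append(r)

    learned_r = reward_fn(torch.stack(states))  # dense, learned per-step rewards

    # Discounted returns from the learned reward, kept differentiable so the
    # policy-gradient loss below also sends gradients into the reward network.
    returns, G = [], torch.tensor(0.0)
    for t in reversed(range(len(learned_r))):
        G = learned_r[t] + gamma * G
        returns.append(G)
    returns = torch.stack(returns[::-1])

    policy_loss = -(torch.stack(log_probs) * returns).sum()

    # Grounding term (an assumption): tie the total learned return to the sparse
    # environment return so the learned reward stays coherent with the task.
    env_return = torch.tensor(sum(env_rewards), dtype=torch.float32)
    grounding_loss = (learned_r.sum() - env_return) ** 2

    loss = policy_loss + grounding_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the returns are not detached when computing the policy-gradient loss, a single backward pass updates both the policy and the reward network, which is one way to realize the "gradients backpropagate through the reward function" idea described above.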
Heatmap of learnt reward on CartPole environment.
In the CartPole environment it is ideal to keep the cart at the center of the screen and the pole upright. The heatmap shows that the reward function captures the essence of the task, with the highest rewards at the center and decreasing rewards as the state moves away from that configuration. However, even so, the reward function is still not ideal for learning because all rewards are negative, including those at the center. Further improvements, such as normalizing the rewards, are left for future work; a rough sketch of that idea follows below.
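As a hypothetical illustration of what that normalization might look like (reusing `learned_r` from the sketch above, not something implemented in the project), the learned rewards could be standardized within each update so that the best states receive positive values:

```python
# Hypothetical normalization: standardize the learned rewards per update so they
# are zero-mean, letting states near the ideal configuration receive positive reward.
normalized_r = (learned_r - learned_r.mean()) / (learned_r.std() + 1e-8)
```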
