Theoretical Machine Learning Seminar

Curiosity, Intrinsic Motivation, and Provably Efficient Maximum Entropy Exploration

Suppose an agent is in an unknown Markov environment in the absence of a reward signal; what might we hope the agent can efficiently learn to do? One natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space that is as uniform as possible, as measured by its entropy. More broadly, the class of objectives defined solely as functions of the state-visitation frequencies can encode a rich set of preferences over the agent's behavior.
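
Concretely, one standard way to formalize this objective (a sketch; the horizon-averaged visitation distribution below is an illustrative choice, and a discounted analogue works equally well):

```latex
% d_\pi is the average state-visitation distribution of policy \pi over
% a horizon T; the objective is the entropy of this distribution.
\[
  d_\pi(s) \;=\; \frac{1}{T} \sum_{t=1}^{T} \Pr\!\left(s_t = s \mid \pi\right),
  \qquad
  \max_\pi \; H(d_\pi) \;=\; -\sum_{s} d_\pi(s) \log d_\pi(s).
\]
```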

We provide an efficient algorithm to optimize such intrinsically defined objectives when given access to a black-box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting, where we have sample-based access to the MDP, our proposed algorithm is provably efficient in both its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), instantiated with an approximate MDP solver.
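
As a rough illustration of the Frank-Wolfe scheme in the tabular setting (a minimal sketch, not the authors' implementation: the random MDP, the value-iteration planning oracle, and the step-size schedule are all illustrative choices):

```python
# Maximum-entropy exploration via the conditional gradient (Frank-Wolfe)
# method on a small tabular MDP. At each iteration we plan against the
# gradient of the entropy of the current mixture's state distribution,
# then mix the resulting policy in with a decaying step size.
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 10, 3, 50                       # states, actions, horizon

# Random tabular transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))

def state_distribution(policy):
    """Average state-visitation distribution of `policy` over horizon T."""
    d, mu = np.zeros(S), np.zeros(S)
    mu[0] = 1.0                           # fixed start state
    for _ in range(T):
        d += mu / T
        M = np.einsum('sa,san->sn', policy, P)   # policy-induced Markov chain
        mu = mu @ M
    return d

def planning_oracle(reward):
    """Greedy policy for a state-based reward via finite-horizon value iteration."""
    V = np.zeros(S)
    for _ in range(T):
        Q = reward[:, None] + P @ V       # Q[s, a]
        V = Q.max(axis=1)
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

# Frank-Wolfe loop over mixture policies.
policies, weights = [np.full((S, A), 1.0 / A)], [1.0]
for k in range(1, 31):
    d_mix = sum(w * state_distribution(pi) for w, pi in zip(weights, policies))
    grad = -np.log(d_mix + 1e-8) - 1.0    # gradient of H(d) = -sum d log d
    pi_new = planning_oracle(grad)        # linear-optimization step
    eta = 2.0 / (k + 2)                   # standard Frank-Wolfe step size
    weights = [w * (1 - eta) for w in weights] + [eta]
    policies.append(pi_new)
    print(f"iter {k:2d}  entropy = {-(d_mix * np.log(d_mix + 1e-8)).sum():.3f}")
```

Note the design point the abstract emphasizes: the entropy objective is nonlinear in the visitation distribution, but each Frank-Wolfe step only requires maximizing a *linear* (reward-based) objective, which is exactly what the black-box planning oracle provides.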

(With Elad Hazan, Sham Kakade, Abby Van Soest.)

Date & Time

February 18, 2019 | 12:15pm – 1:45pm

Location

Princeton University, CS 302

Speakers

Karan Singh

Affiliation

Princeton University