Some summer paper summaries

High-level analysis of reinforcement learning

Building machines that learn and think like people. Lake et al. think that human learners are doing something fundamentally different from machine learners: we carry out tasks in the context of many years of related tasks, whereas ML systems typically start each new task from scratch. So, they ask, "how do we bring to bear rich prior knowledge to learn new tasks and solve problems so quickly?" The core ingredients they identify are intuitive models of physics and psychology, present from infancy, and the capacity for rapidly building mental models which can be used for classification, prediction, communication, explanation, and composition. The skill of model-building can be distinguished from pattern recognition, which the authors suggest makes up much of the recent progress in deep learning. In particular, the notions of composition, causation and hierarchy which are central to models may be weak or non-existent in pattern recognition. Further, they note that progress in this area may require searching over many structural variations to find new architectures - work which is currently done by researchers and seems difficult to automate.

Building machines that learn and think for themselves. DeepMind respond to Lake et al. by arguing for AIs with the autonomy to design their own internal models. They note that there are many domains which, unlike physics, we can't describe in enough detail to hardwire into AIs. Because of this, systems which don't require inbuilt knowledge seem more promising, especially if they can be applied to many different domains. However, Botvinick et al. also defend knowledge being inbuilt in a very generic form, e.g. the translational invariance of CNNs' convolutional layers. Such autonomy allows models to be tailored to both the agent's tasks and its control structures.

A real world reinforcement learning research program. Langford argues against the approach, led by DeepMind and OpenAI, of starting with reinforcement learning for simulated environments like games. He notes that algorithms based on Q-learning do badly on certain classes of problems - for example, cases where rewards near the start state favour staying near the start state, or where most transitions lead back to the start state. He also points out that when training in simulation, there's less need to focus on algorithms with good sample complexity - but that simulations are still a long way from being faithful representations of the real world. He worries that even if these approaches work out eventually, less value is being created in the meantime compared with his approach of working on real-world problems informed by fundamental theories of RL. In a comment, Dietterich disagrees and cites the example of SAT, which is theoretically intractable but has still seen exponential speedups.

A roadmap towards machine intelligence. Mikolov et al. identify two characteristics - communication and learning - which they consider crucial to machine intelligence. They argue that we can make progress towards these by training agents in an artificial environment with a teacher who assigns them tasks using natural language. Notably, they also think the agent's only interactions with the environment should be language-based, as in classic text-based adventure games. As the agent progresses through levels, it is assigned more abstract tasks which require previously-learned skills; the authors envisage this culminating in interactions with real people. However, they seem to be overoptimistic about the extent to which an agent can learn from language alone.

Improving RL agents' goals

Intrinsically motivated learning of hierarchical collections of skills and Intrinsically motivated reinforcement learning. This pair of papers introduces the idea of agents which learn for themselves a "knowledge base" of skills, where possessing easier skills can make it quicker to learn harder ones. This approach is inspired by neuroscience - in particular the release of dopamine in response to novelty. Theories of child development also suggest that children are most attracted to moderate novelty, especially when they caused it. In Barto et al.'s toy simulation of this, the RL agent needs hardwired notions of what counts as an interesting event (e.g. changes in light or sound intensity). When one occurs, the agent notes it and from then on is intrinsically rewarded for recreating it (with reward proportional to its error in predicting that event - analogous to how children get bored once they've understood how something works).
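A minimal sketch of the intrinsic reward scheme as I understand it (this is my own illustration, not Barto et al.'s code): reward for a salient event is proportional to the agent's prediction error, so the reward fades as the agent's model of the event improves.

```python
# Illustrative sketch: intrinsic reward for recreating a salient event is
# proportional to prediction error, so it fades as the event becomes
# predictable (analogous to a child getting bored of an understood toy).

def intrinsic_reward(predicted_prob, event_occurred):
    """Reward = error in predicting whether the salient event occurs."""
    actual = 1.0 if event_occurred else 0.0
    return abs(actual - predicted_prob)

# Early on the event is surprising (high reward); once the agent's model
# predicts it reliably, the reward approaches zero.
surprising = intrinsic_reward(0.1, True)
boring = intrinsic_reward(0.95, True)
```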

Intrinsic motivation and automatic curricula via asymmetric self-play. This paper introduces a new form of unsupervised training for RL agents: Alice proposes a task (by doing it) and then Bob has to either reverse the task (in reversible environments) or repeat it (in resettable environments). Alice's optimal behaviour is to find the simplest task that Bob can't reverse/repeat.
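The incentive structure can be sketched as follows (the time-cost scale gamma and exact functional forms here are my illustrative reading of the setup, not a faithful reproduction of the paper):

```python
# Sketch of the self-play incentives: Bob is penalised per step taken to
# reverse/repeat the task; Alice is paid when Bob takes longer than she did.

def bob_reward(t_bob, gamma=0.01):
    # Bob wants to finish the proposed task as quickly as possible.
    return -gamma * t_bob

def alice_reward(t_alice, t_bob, gamma=0.01):
    # Alice profits only when Bob is slower than her, so her optimal move
    # is the simplest task that Bob still struggles with - yielding an
    # automatic curriculum at the frontier of Bob's ability.
    return gamma * max(0.0, t_bob - t_alice)
```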

Automatic goal generation for RL agents. Florensa et al. use a generative adversarial network to decide on goals which are at the appropriate level of difficulty for a reinforcement learner. The generator is initialised using labelled examples, and then learns to identify goals of intermediate difficulty, where the agent's expected reward falls within some range.
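The "intermediate difficulty" criterion itself is simple; here is a sketch (the thresholds are illustrative):

```python
# A goal is kept for training if the agent's current success rate on it
# falls inside a band (r_min, r_max): neither trivially easy nor hopeless.

def is_intermediate_difficulty(success_rate, r_min=0.1, r_max=0.9):
    return r_min <= success_rate <= r_max

# Hypothetical success rates for three candidate goals:
goals = {"easy": 0.97, "medium": 0.5, "hard": 0.02}
frontier = [g for g, rate in goals.items() if is_intermediate_difficulty(rate)]
```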

Hierarchical Deep RL. The authors propose a scheme which simultaneously learns how to choose subgoals and how to achieve subgoals. A controller is trained to achieve its current subgoal using "intrinsic rewards" for doing so; a meta-controller which chooses new subgoals for the controller is learned using deep reinforcement learning, based on the rewards the controller receives from the environment. Note that subgoals are defined in terms of objects which need to be identified by a separate system.
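The two-level structure can be sketched as a toy (the environment, policies and binary intrinsic reward below are stand-ins of my own, not the paper's implementation):

```python
# Toy sketch of the controller/meta-controller split: the meta-controller
# picks subgoals; the controller acts until each subgoal is reached and
# collects a binary intrinsic reward for doing so.

def intrinsic_reward(state, subgoal):
    return 1.0 if state == subgoal else 0.0

def run_episode(subgoals, policy, start=0, max_steps=20):
    """Pursue each subgoal in turn; return total intrinsic reward."""
    state, total = start, 0.0
    for subgoal in subgoals:         # the meta-controller's choices
        for _ in range(max_steps):   # the controller acts until done
            if state == subgoal:
                break
            state = policy(state, subgoal)
            total += intrinsic_reward(state, subgoal)
    return total

# A trivially competent controller on a 1-D chain: step toward the subgoal.
step_toward = lambda s, g: s + (1 if g > s else -1)
```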

Programmable agents. Denil et al. build networks that execute a simple declarative language. A goal is specified as a state of the world that satisfies a relation between two objects, which are associated with sets of properties. They make their boolean operators differentiable by assigning not(x) = 1-x, and(x,y) = xy, or(x,y) = x+y-xy. The particular goal used is for a simulated robotic arm to reach towards certain objects. This is an easy task, but standard deep learning methods totally fail to generalise to unseen objects, whereas the programmable agents apparently can. The paper is somewhat confusing, but it seems like objects are baked in rather than learned.
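Concretely, those soft operators agree with ordinary boolean logic at the endpoints while remaining differentiable in between:

```python
# The paper's soft boolean operators: exact on {0, 1}, smooth on [0, 1].

def not_(x):
    return 1 - x

def and_(x, y):
    return x * y

def or_(x, y):
    return x + y - x * y

# On soft truth values the operators interpolate, e.g. or_(0.5, 0.5) = 0.75,
# so gradients can flow through logical expressions.
```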

Improving RL agents' object representations

Interaction networks for learning about objects, relations and physics. Interaction networks are models which can reason about how objects interact in a way analogous to simulation. Objects and the relations between them are stored in the format of a directed multigraph. At each timestep, the effect of each relation on its target object is computed by an MLP. Then another MLP determines how all the relations applied to each object influence its next state. These two neural networks can be learned efficiently because they are shared over all relations and objects respectively. The authors tested interaction networks on their ability to predict simulated physical interactions over many rollout steps. While they end up diverging considerably from the physics engine, the predictions seem fairly realistic.
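One step of this computation can be sketched as follows (simple linear maps stand in for the two MLPs; shapes and the aggregation-by-receiver step follow the description above, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, obj_dim, eff_dim = 3, 4, 2
objects = rng.normal(size=(n_objects, obj_dim))
relations = [(0, 1), (2, 1), (1, 0)]   # directed (sender -> receiver) edges

W_rel = rng.normal(size=(2 * obj_dim, eff_dim))        # stand-in relation "MLP"
W_obj = rng.normal(size=(obj_dim + eff_dim, obj_dim))  # stand-in object "MLP"

# 1. Compute the effect of each relation from its sender and receiver,
#    aggregating effects per receiving object.
effects = np.zeros((n_objects, eff_dim))
for sender, receiver in relations:
    pair = np.concatenate([objects[sender], objects[receiver]])
    effects[receiver] += pair @ W_rel

# 2. Update each object from its own state plus its aggregated effects.
#    Both maps are shared across all relations/objects, which is what makes
#    learning efficient.
next_objects = np.concatenate([objects, effects], axis=1) @ W_obj
```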

Discovering objects and their relations from entangled scene representations. Inspired by the above paper, Raposo et al. introduce relation networks, which can be used to reason about objects and their relations. They use an MLP which operates on all pairs of objects, and whose output is then aggregated in an order-invariant way (e.g. addition). Experiments were done with scene descriptions as inputs, as well as inputs pre-processed by variational autoencoders. I didn't quite understand the setup of the experiments, but the results seem promising.
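The core computation is compact enough to sketch (linear maps again stand in for the MLPs):

```python
import numpy as np
from itertools import combinations

# Relation network sketch: one shared function g over object pairs, summed
# (order-invariant), then a final function f on the aggregate.

rng = np.random.default_rng(1)
objects = rng.normal(size=(4, 3))   # 4 objects, 3 features each
W_g = rng.normal(size=(6, 5))       # stand-in for the pairwise MLP g
W_f = rng.normal(size=(5, 2))       # stand-in for the output MLP f

pair_sum = sum(np.concatenate([objects[i], objects[j]]) @ W_g
               for i, j in combinations(range(len(objects)), 2))
output = pair_sum @ W_f             # RN(O) = f(sum over pairs of g(o_i, o_j))
```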

A simple neural network module for relational reasoning. This paper applies relation networks to the CLEVR dataset, both from pixels (which were processed first by a CNN) and from image descriptions. This improves significantly on the state of the art, and also beats human performance - an achievement which seems to be driven by improvements on relational questions in particular.

Schema networks. Researchers at Vicarious created the Schema network, an object-oriented generative physics simulator. It was able to generalise to variations of Breakout with offset or resized paddles, a wall in front of the bricks, and even three balls which needed to be juggled. Their approach is motivated by the gestalt principle: that the ability to perceive objects as a bounded figure in front of an unbounded background is fundamental to all perception. Each schema in the network is a local cause-effect relationship involving one or more objects (which each can possess some attributes); unfortunately, schema networks don't themselves learn how to recognise objects, nor what types of properties and relationships objects can have.

Towards deep symbolic reinforcement learning. Garnelo et al. propose an architecture with a neural back end and a symbolic front end. The neural network is a convolutional autoencoder which is trained to recognise basic patterns of circles and crosses. In a somewhat hacky way which wouldn't scale to any real images, they extract from the autoencoder the locations and types of these objects, and then process them using symbolic logic.

An object-oriented representation for efficient reinforcement learning. Diuk et al. extend the MDP formalism to object-oriented MDPs. When two objects interact, they define a relation between them, which may produce an effect - a change in value of one or multiple attributes in either or both interacting objects. But again, there's no good way of learning what the objects and attributes are...

Other deep learning

Neural Turing machines and Differentiable neural computers. These two papers present neural networks which can write to and read from external memory, in a fully differentiable way. A controller decides at each timestep whether to read or write, and if so where to. It does so using an attention mechanism over all memory locations, determined by the cosine similarity between those locations and a key vector emitted by the controller. In NTMs, there is also a way to shift the focus to the next location. However, since write operations were often not performed at sequential locations, this was of limited use. In DNCs, that mechanism was replaced by a way to shift focus to the location which had been written to immediately after the current location was, and an additional variable was added to keep track of the "usage" of each memory location. Both architectures were tested on question-answering tasks including navigating the London Underground based on a graph of the stations. In addition, they were trained via reinforcement learning on a block puzzle game. DNCs did significantly better than both NTMs and the best previous results.
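The shared content-based addressing step can be sketched like this (a simplified version of the mechanism described above; the key-strength parameter beta sharpens the attention):

```python
import numpy as np

# Content-based addressing: attention weights over memory rows come from
# the cosine similarity between each row and a controller-emitted key,
# sharpened by a key strength beta and normalised with a softmax.

def content_addressing(memory, key, beta=1.0):
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = np.exp(beta * sims)
    return weights / weights.sum()

memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
w = content_addressing(memory, np.array([1.0, 0.0]), beta=5.0)
# The first row matches the key best, so it receives the most attention.
```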

Neural programmer-interpreters. Another DeepMind paper which merges symbolic and neural processing - in this case, by using a controller to decide which program to run and when to stop it. It is applied to tasks like addition and sorting, being trained to maximise the probability of the correct execution trace. The NPI trains faster than an LSTM, and also generalises much better to inputs longer than any seen in training.

Episodic exploration for deep deterministic policies. Usunier et al. work on micromanagement in StarCraft by learning a greedy MDP: an action is chosen for each unit in turn, conditioning on the actions of previous units. Their exploration algorithm (zero-order gradient estimation) samples a vector on the unit sphere and then adds this to the parameters which determine the policy; the estimated gradient is then the cumulative reward times that vector.
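One update of that estimator can be sketched as follows (the scalar "episode" function, learning rate and perturbation scale are stand-ins of my own):

```python
import numpy as np

# Zero-order gradient estimate: perturb the policy parameters along a
# random unit vector, run an episode, and use (reward * direction) as the
# gradient estimate for an ascent step.

def zero_order_step(theta, episode_reward, lr=0.05, delta=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    u = rng.normal(size=theta.shape)
    u /= np.linalg.norm(u)              # sample from the unit sphere
    reward = episode_reward(theta + delta * u)
    return theta + lr * reward * u      # ascend the estimated gradient

# Toy stand-in for an episode's cumulative reward as a function of params:
reward_fn = lambda th: -np.sum(th ** 2)
theta = np.array([2.0, -1.5])
new_theta = zero_order_step(theta, reward_fn, rng=np.random.default_rng(0))
```

In expectation the random-direction term averages out, leaving a term proportional to the true gradient, though a single sample is very noisy.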

Progressive neural networks. Progressive networks instantiate a new neural network for each task being solved, and then connect them laterally to allow transfer learning. It works on variants of Pong, but overall seems like a hacky solution which wouldn't scale.

Taking the human out of the loop: a review of bayesian optimisation. Bayesian optimisation is a way of optimising expensive black-box functions - such as hyperparameter selection - by placing a prior over the function and then using an acquisition function to choose where to sample next, balancing exploration against exploitation. This can be used for A/B testing, reinforcement learning, combinatorial optimisation, etc.
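As a sketch of the exploration/exploitation trade-off, here is the upper confidence bound, one standard acquisition function covered in the review (the posterior means and standard deviations below are illustrative numbers):

```python
import numpy as np

# Upper confidence bound acquisition: choose the candidate maximising
# posterior mean plus kappa times posterior standard deviation.

def ucb_choice(mean, std, kappa=2.0):
    return int(np.argmax(mean + kappa * std))

# Hypothetical posterior over three candidate hyperparameter settings:
mean = np.array([0.80, 0.75, 0.60])
std = np.array([0.01, 0.10, 0.02])

# With kappa = 2, the uncertain second candidate wins (0.75 + 0.20 = 0.95)
# even though the first has the higher mean - that's exploration at work.
```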

Neural architecture search with reinforcement learning. In this paper, a controller RNN is used to generate neural network hyperparameters. After an architecture is generated, it is built and trained, and the controller is updated based on how well the architecture did on a test set. As well as variables like number of filters, filter size and stride sizes, the controller can propose skip connections and branching layers using an attention mechanism. A variant of this algorithm generates recurrent cell architectures, including one which outperforms LSTMs (although its method of modifying the computation graph is rather ad-hoc).
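The controller update is essentially policy gradient with accuracy as reward; a heavily simplified sketch for a single categorical choice (the real controller is an RNN emitting many such choices in sequence; the numbers and choice set here are illustrative):

```python
import numpy as np

# REINFORCE on one categorical architecture choice (e.g. filter size),
# with validation accuracy as the reward and a moving baseline.

def reinforce_update(logits, choice, reward, baseline, lr=0.5):
    probs = np.exp(logits) / np.exp(logits).sum()
    grad = -probs
    grad[choice] += 1.0                  # d log p(choice) / d logits
    return logits + lr * (reward - baseline) * grad

logits = np.zeros(3)                     # choices: filter size 3, 5, or 7
# Suppose the sampled architecture with filter size 5 (index 1) scored well:
logits = reinforce_update(logits, choice=1, reward=0.9, baseline=0.7)
# The controller now favours that choice on the next sample.
```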

Natural language processing

Supervised learning of universal sentence representations from natural language inference data. The authors train sentence representations using the Stanford Natural Language Inference dataset, which has pairs of sentences labelled as entailment, contradiction or neutral, and find that their representations outperform SkipThought, FastSent, etc.

A structured self-attentive sentence embedding. This paper tries to make sentence embeddings more interpretable by creating them as a linear combination of hidden states of a bidirectional LSTM, which the authors call a "self-attention" model. This allows identification of which parts of the sentence are being focused on to create the overall embedding.
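The mechanism can be sketched as follows (the paper uses several attention "hops" yielding an embedding matrix; one hop is shown, with random stand-ins for the learned weights and LSTM states):

```python
import numpy as np

# Self-attentive sentence embedding sketch: a small two-layer scoring
# network assigns a weight to each of the LSTM's hidden states, and the
# embedding is the resulting weighted combination. The weights expose
# which timesteps the embedding focuses on.

rng = np.random.default_rng(2)
H = rng.normal(size=(6, 8))       # 6 timesteps, hidden size 8
W1 = rng.normal(size=(8, 4))      # learned attention parameters (stand-ins)
w2 = rng.normal(size=(4,))

scores = np.tanh(H @ W1) @ w2                 # one score per timestep
a = np.exp(scores) / np.exp(scores).sum()     # softmax over timesteps
embedding = a @ H                             # weighted sum of hidden states
```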

Just for fun

The Evolution of Ethnocentrism. A cool paper showing that, in a simple model of an evolving population of agents facing prisoner's dilemmas, the strategy of ethnocentrism (cooperating with your own tribe, but defecting against others) comes to dominate. The population is set in a 2-dimensional space, with agents interacting locally with others around them; the dominance of ethnocentrism is robust to changes in hyperparameters.
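The strategy space in the model is tiny; a sketch of my reading of it:

```python
# Each agent carries a visible tag plus two inherited bits: how to act
# toward its own tag and how to act toward strangers. Ethnocentrism is
# (cooperate in-group, defect out-group).

def action(agent, partner):
    same_tribe = agent["tag"] == partner["tag"]
    return agent["in_group"] if same_tribe else agent["out_group"]

ethnocentrist = {"tag": "red", "in_group": "C", "out_group": "D"}
humanitarian = {"tag": "blue", "in_group": "C", "out_group": "C"}
```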

