State-of-the-art approaches to traffic flow control fall into one of two categories:
1) Distributed reinforcement learning approaches that take advantage of rich multi-agent models of traffic and learn distributed policies for traffic flow optimization, able to incorporate complex traffic patterns including disruptive events, pedestrians and vehicles (Aslani, 2017);
2) Single-agent deep reinforcement learning approaches that take advantage of the ability of deep learning architectures (namely convolutional networks) to extract high-level information from complex input data; these approaches have focused mostly on learning policies that react to current traffic conditions.

Task 7 will combine the best of both categories above. In particular, this task aims to train a network of reinforcement learning agents that are able to optimize traffic flow and adapt to context information. The proposed approach will build on existing architectures (Aslani, 2017) and extend them to accommodate richer reinforcement learning models based on recent advances in deep reinforcement learning. Our control agents will build on the predictive models developed in the context of Tasks 5 and 6.
We divide Task 7 into three subtasks:

1) Single-agent deep reinforcement learning (RL) approach to traffic light control. Classical RL approaches tend to exhibit unstable behavior when combined with non-linear representations due, among other things, to the temporal correlation between the samples used in learning and to the overestimation of action values. Recent algorithmic advances from the deep RL literature (such as experience replay and redundant value estimates) provide powerful tools to handle such limitations of classical RL. The main goal of this subtask is thus to leverage these algorithmic advances to build well-established RL algorithms (such as Q-learning and actor-critic algorithms) on top of the predictive models developed in Tasks 5 and 6. The resulting RL algorithm is expected to provide locally improved traffic control that takes into account the context information provided by the predictive models.
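As an illustration of the kind of agent envisaged in this subtask, the sketch below combines Q-learning with experience replay and a second (target) value estimate. It is a minimal sketch only, assuming a discrete set of signal phases and a fixed-size state vector supplied by the Task 5/6 predictive models; the names (QNetwork, DQNAgent) and hyperparameters are illustrative placeholders rather than the final design.

```python
# Minimal DQN-style sketch for a single traffic-light controller.
# Assumptions (illustrative only): the state is a fixed-size feature vector
# produced by the Task 5/6 predictive models, and actions index signal phases.
import random
from collections import deque

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)


class DQNAgent:
    def __init__(self, state_dim, n_actions, gamma=0.99, lr=1e-3, buffer_size=50_000):
        self.q = QNetwork(state_dim, n_actions)
        self.q_target = QNetwork(state_dim, n_actions)   # redundant value estimate
        self.q_target.load_state_dict(self.q.state_dict())
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)
        self.buffer = deque(maxlen=buffer_size)          # experience replay memory
        self.gamma, self.n_actions = gamma, n_actions

    def act(self, state, eps=0.1):
        # Epsilon-greedy selection of the next signal phase.
        if random.random() < eps:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q(torch.as_tensor(state, dtype=torch.float32)).argmax())

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def update(self, batch_size=64):
        if len(self.buffer) < batch_size:
            return
        # Sampling uniformly from the buffer breaks temporal correlation between samples.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = map(lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
        q_sa = self.q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Bootstrapped target computed with the slowly updated target network.
            target = r + self.gamma * (1 - d) * self.q_target(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q_sa, target)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

    def sync_target(self):
        # Periodically refresh the target network to stabilize learning.
        self.q_target.load_state_dict(self.q.state_dict())
```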

2) Metrics for agent performance. This subtask addresses the very challenging problem of assessing the performance of the learning agents. In particular, it will consider two distinct problems: (i) what are the most relevant metrics to assess the performance of the control agents? and (ii) how can we reliably assess such performance? Concerning (i), we will explore different metrics from the literature and carefully design adequate reward functions to drive the RL agents, for which the domain expertise and knowledge of LNEC and CML will play a central role. Concerning (ii), the main challenge is to devise a robust test methodology that ensures, with some level of confidence, that the performance obtained from the collected and simulated data reflects the expected performance of the agents in real-world operation.
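For concreteness, the sketch below shows how candidate metrics could be computed from aggregated intersection measurements and combined into a reward signal. The quantities (average delay, queue length, throughput), the data schema and the weights are assumptions for illustration only; the metrics and reward functions to be used will be selected with LNEC and CML in the course of this subtask.

```python
# Illustrative candidate metrics and reward shaping for a traffic-light agent.
# The schema, quantities and weights below are placeholders for discussion,
# not the metrics that will ultimately be chosen in this subtask.
from dataclasses import dataclass
from typing import List


@dataclass
class IntersectionSnapshot:
    """Aggregated measurements for one control interval (hypothetical schema)."""
    total_delay_s: float        # summed vehicle delay over the interval, in seconds
    n_vehicles: int             # vehicles observed at the intersection
    queue_lengths: List[int]    # queue length (in vehicles) per approach
    departures: int             # vehicles that cleared the intersection


def average_delay(snap: IntersectionSnapshot) -> float:
    return snap.total_delay_s / max(snap.n_vehicles, 1)


def average_queue(snap: IntersectionSnapshot) -> float:
    return sum(snap.queue_lengths) / max(len(snap.queue_lengths), 1)


def reward(snap: IntersectionSnapshot,
           w_delay: float = 1.0, w_queue: float = 0.5, w_throughput: float = 0.1) -> float:
    """Weighted combination of candidate metrics (weights are placeholders)."""
    return (w_throughput * snap.departures
            - w_delay * average_delay(snap)
            - w_queue * average_queue(snap))


# Example usage with made-up numbers:
snap = IntersectionSnapshot(total_delay_s=420.0, n_vehicles=35,
                            queue_lengths=[4, 7, 2, 5], departures=28)
print(reward(snap))
```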

3) Hierarchical multi-agent traffic control. This subtask will build on the single-agent RL approaches developed in Subtask 1 and develop a multi-agent, hierarchical approach (in the spirit of (Choy, 2003)) that is able to address the traffic control problem at different levels of granularity. The validation of this approach will also contribute to recent work on the risks of overfitting in deep RL.
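The two-level structure below sketches one possible realization of such a hierarchy, with a regional coordinator adjusting the objectives of local intersection-level agents. The class names and the coordination signal are illustrative assumptions, not the final Task 7 architecture.

```python
# Sketch of a two-level hierarchical control architecture: a regional
# coordinator adjusts the objectives of local intersection agents.
# Class names and the coordination signal are illustrative assumptions.
from typing import Dict, List


class IntersectionAgent:
    """Local (single-intersection) RL controller, e.g. the DQN agent sketched above."""

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.coordination_bias = 0.0   # adjustment received from the coordinator

    def receive_coordination(self, bias: float) -> None:
        # E.g. shifts the local reward toward network-level goals.
        self.coordination_bias = bias

    def act(self, local_state: List[float]) -> int:
        # Placeholder policy; in practice this would query the learned value network.
        return 0


class RegionalCoordinator:
    """Higher-level agent operating at a coarser granularity (a set of intersections)."""

    def __init__(self, agents: Dict[str, IntersectionAgent]):
        self.agents = agents

    def coordinate(self, congestion_by_agent: Dict[str, float]) -> None:
        # Push agents on congested corridors to weigh throughput more heavily.
        mean_congestion = sum(congestion_by_agent.values()) / len(congestion_by_agent)
        for agent_id, congestion in congestion_by_agent.items():
            self.agents[agent_id].receive_coordination(congestion - mean_congestion)


# Example: three intersections coordinated within one region.
agents = {i: IntersectionAgent(i) for i in ("A1", "A2", "A3")}
RegionalCoordinator(agents).coordinate({"A1": 0.8, "A2": 0.3, "A3": 0.4})
```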

The subtasks above are expected to yield the following outputs:
1) A deep RL approach to traffic light control
2) A methodology for robust validation of deep RL agents in the traffic control domain
3) A hierarchical multi-agent architecture for traffic control

The results from our control agents are expected to inform broader actuation policies for traffic optimization.
The reinforcement learning expertise necessary for the work in Task 7 is concentrated mostly in INESC-ID, which will be responsible for this task. The work will require close collaboration with LNEC and CML to ensure that task interdependencies are properly handled.