(19)
(11) EP 3 525 136 A1

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication:
14.08.2019 Bulletin 2019/33

(21) Application number: 18275016.6

(22) Date of filing: 08.02.2018
(51) International Patent Classification (IPC): 
G06N 3/00(2006.01)
G06N 99/00(2019.01)
G06N 7/00(2006.01)
(84) Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States:
BA ME
Designated Validation States:
MA MD TN

(71) Applicant: Prowler.io Limited
Cambridge, Cambridgeshire CB2 1LA (GB)

(72) Inventors:
  • BOU-AMMAR, Haitham
    Cambridge, Cambridgeshire CB2 1LA (GB)
  • TUTUNOV, Rasul
    Cambridge, Cambridgeshire CB2 1LA (GB)
  • KIM, Dongho
    Cambridge, Cambridgeshire CB2 1LA (GB)
  • TOMCZAK, Marcin
    Cambridge, Cambridgeshire CB2 1LA (GB)

(74) Representative: EIP 
EIP Europe LLP Fairfax House 15 Fulwood Place
London WC1V 6HU (GB)

   


(54) DISTRIBUTED MACHINE LEARNING SYSTEM


(57) There is provided a computer-implemented method of determining policy parameters for multiple reinforcement learning tasks. The policy parameters for each task depend on a common set of parameters shared by the tasks and a task-specific set of parameters. The method includes each of a plurality of processors receiving trajectory data from a subset of the tasks, and determining a local version of the common set of parameters as well as the task-specific set of parameters for each task in the subset. Determining the local version includes iteratively determining partial input data and taking part in a distributed computation with the other processors to determine, using the partial input data, a first set of intermediate variables. The method then includes updating the local version of the common set of parameters using a subset of the first set of intermediate variables. After at least one iteration, the local versions of the common set of parameters determined by the processors converge.




Description

Technical Field



[0001] This invention is in the field of machine learning systems, and has particular applicability to distributed reinforcement learning systems.

Background



[0002] Reinforcement learning involves a computer system learning how to perform tasks by analysing data corresponding to previously-experienced instances of the same or similar tasks. Multi-task reinforcement learning (MTRL) has been proposed as a means of increasing the efficiency of reinforcement learning in cases where data is analysed from several different tasks, allowing for knowledge transfer between the different tasks.

[0003] The application of MTRL to particular problems is often computationally expensive, especially for tasks with high-dimensional state and/or action spaces, or in cases where extensive volumes of experience data need to be analysed. Accordingly, there is a need for an efficient and accurate distributed implementation of MTRL such that the computational burden can be shared by multiple processors. Implementing a distributed routine for MTRL is not straightforward due to the need for knowledge transfer between different tasks, requiring sharing of data between processor nodes. Existing attempts to implement MTRL using multiple processors, for example the method discussed in "Scalable Multitask Policy Gradient Reinforcement Learning" by S. El Bsat, H. Bou Ammar, and M. E. Taylor (Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp 1847-1853, February 4-9, 2017), result in relatively slow convergence rates, which are only achieved after making restrictive assumptions about an objective function.

Summary



[0004] According to one aspect of the present invention, there is provided a computer-implemented method of determining policy parameters for each of a plurality of reinforcement learning tasks. The policy parameters depend on a common set of parameters shared by the plurality of tasks and a task-specific set of parameters. The method includes each of a plurality of processors receiving trajectory data from a subset of the plurality of tasks, and determining using the received trajectory data: a local version of the common set of parameters; and the task-specific set of parameters for each of the subset of the plurality of tasks. Determining the local version of the common set of parameters includes iteratively determining partial input data from the received trajectory data for determining a first set of intermediate variables, and taking part in a distributed computation with the other processors to determine the first set of intermediate variables. Once the first set of intermediate variables has been determined, the method then includes updating the local version of the common set of parameters using a subset of the first set of intermediate variables. After at least one iteration, the local versions of the common set of parameters determined by the plurality of processors converge.

[0005] Further aspects of the invention will become apparent from the following description of an embodiment of the invention, given by way of example only, which is made with reference to the accompanying drawings.

Brief Description of the Drawings



[0006] 

Figure 1 shows an example of a task for a reinforcement learning problem.

Figure 2 is a schematic diagram showing a network of processors being used to solve a reinforcement learning problem involving multiple tasks.

Figure 3 is a schematic diagram of a network of processors configured in accordance with the present invention.

Figure 4 is a flow diagram representing a routine for solving a multi-task reinforcement learning problem.

Figure 5 is a flow diagram representing a subroutine for updating a shared knowledge base.

Figure 6 shows examples of benchmark tasks used to test the performance of a multi-task reinforcement learning routine.

Figure 7 shows bar charts illustrating times taken by processor networks to reach solutions to multi-task reinforcement learning problems for different task domains.

Figure 8 shows bar charts illustrating times taken by processor networks to reach solutions to multi-task reinforcement learning problems for different network topologies.

Figure 9 shows plots illustrating the convergence of consensus errors in multi-task reinforcement learning problems.

Figure 10 shows plots illustrating the convergence of objective values in multi-task reinforcement learning problems.

Figure 11 is a schematic diagram of a computing device configured for use in a network according to the present invention.


Detailed Description


Reinforcement learning: overview



[0007] For the purposes of the following description and accompanying drawings, a reinforcement learning problem is definable by specifying the characteristics of one or more agents and an environment. The methods and systems described herein are applicable to a wide range of reinforcement learning problems.

[0008] A software agent, referred to hereafter as an agent, is a computer program component that makes decisions based on a set of input signals and performs actions based on these decisions. In some applications of reinforcement learning, each agent is associated with a real-world entity (for example an autonomous robot). In other applications of reinforcement learning, an agent is associated with a virtual entity (for example, a non-playable character (NPC) in a video game). In some examples, an agent is implemented in software or hardware that is part of a real world entity (for example, within an autonomous robot). In other examples, an agent is implemented by a computer system that is remote from a real world entity.

[0009] An environment is a system with which agents interact, and a complete specification of an environment is referred to as a task. It is assumed that interactions between an agent and an environment occur at discrete time steps labelled h = 0,1, 2, 3, .... The discrete time steps do not necessarily correspond to times separated by fixed intervals. At each time step, the agent receives data corresponding to an observation of the environment and data corresponding to a reward. The data corresponding to an observation of the environment is referred to as a state signal and the observation of the environment is referred to as a state. The state perceived by the agent at time step h is labelled sh. The state observed by the agent may depend on variables associated with the agent-controlled entity, for example in the case that the entity is a robot, the position and velocity of the robot within a given spatial domain. A state sh is generally represented by a k-dimensional state vector xh, whose components are variables that represent various aspects of the corresponding state.

[0010] In response to receiving a state signal indicating a state sh at a time step h, an agent is able to select and perform an action ah from a set of available actions in accordance with a Markov Decision Process (MDP). In some examples, the state signal does not convey sufficient information to ascertain the true state of the environment, in which case the agent selects and performs the action ah in accordance with a Partially-Observable Markov Decision Process (PO-MDP). Performing a selected action generally has an effect on the environment. Data sent from an agent to the environment as an agent performs an action is referred to as an action signal. At the next time step h + 1 , the agent receives a new state signal from the environment indicating a new state sh+1. In some examples, the new state signal is initiated by the agent completing the action ah. In other examples the new state signal is sent in response to a change in the environment.

[0011] Depending on the configuration of the agents and the environment, the set of states, as well as the set of actions available in each state, may be finite or infinite. The methods and systems described herein are applicable in any of these cases.

[0012] Having performed an action ah, an agent receives a reward signal corresponding to a numerical reward rh+1, where the reward rh+1 depends on the state sh, the action ah and the state sh+1. The agent is thereby associated with a sequence of states, actions and rewards (sh, ah, rh+1, sh+1, ...) referred to as a trajectory τ. The reward is a real number that may be positive, negative, or zero.

[0013] In response to an agent receiving a state signal, the agent selects an action to perform based on a policy. A policy is a stochastic mapping from states to actions. If an agent follows a policy π, and receives a state signal at time step h indicating a specific state sh, the probability of the agent selecting a specific action ah is denoted by π(ah|sh). A policy for which π(ah|sh) takes values of either 0 or 1 for all possible combinations of ah and sh is a deterministic policy. Reinforcement learning algorithms specify how the policy of an agent is altered in response to sequences of states, actions, and rewards that the agent experiences.

[0014] Generally, the objective of a reinforcement learning algorithm is to find a policy that maximises the expectation value of a return, where the value of a return at any time depends on the rewards received by the agent at future times. For some reinforcement learning problems, the trajectory τ is finite, indicating a finite sequence of time steps, and the agent eventually encounters a terminal state sH from which no further actions are available. In a problem for which τ is finite, the finite sequence of time steps is referred to as an episode and the associated task is referred to as an episodic task. For other reinforcement learning problems, the trajectory τ is infinite, and there are no terminal states. A problem for which τ is infinite is referred to as an infinite horizon task. As an example, a possible definition of a return for an episodic task is given by Equation (1) below:

ℛ(τ) = (1/H) ∑_{h=0}^{H-1} r_{h+1}        (1)

In this example, the return ℛ(τ) is equal to the average reward received per time step for the trajectory τ. A skilled person will appreciate that this is not the only possible definition of a return. For example, in R-learning algorithms, the return given by Equation (1) is replaced with an infinite sum over rewards minus an average expected reward.
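As a minimal illustration of Equation (1) (not forming part of the description; the function name and example rewards are chosen purely for this sketch), the return of a finite trajectory may be computed as the average of its rewards:

```python
import numpy as np

def episodic_return(rewards):
    """Return R(tau) of Equation (1): the average reward per time step
    over a finite trajectory with rewards r_1, ..., r_H."""
    return float(np.mean(rewards))

# Example: a 4-step episode with rewards -3, -2, -1, 0 has return -1.5
print(episodic_return([-3.0, -2.0, -1.0, 0.0]))
```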

Policy search reinforcement learning



[0015] In policy search reinforcement learning, a policy is parameterised by a k-dimensional vector θ ∈ ℝ^k, where k is the length of the state vector xh encoding relevant information about the state sh. An example of such a policy is a stochastic Gaussian policy given by Equation (2):

π_θ(a_h | x_h) = 𝒩(a_h; θ^T x_h, σ²)        (2)

Equation (2) states that in response to experiencing a state at time step h represented by a state vector xh, a computer-controlled entity following the Gaussian policy selects an action ah according to a Gaussian (normal) probability distribution with mean θTxh and variance σ2. Figure 1a shows a specific example of a task for a one-dimensional dynamical system in which a uniform rigid pole 101 is connected at one end by a hinge to a cart 103. The position and velocity of the cart are given by x and ẋ respectively, and the angular position and velocity with respect to the vertical are given by φ and φ̇ respectively. The state at any time step h can be represented by a four-dimensional state vector xh = (x, ẋ, φ, φ̇). The aim of the reinforcement learning problem is to learn a policy that maps each possible state vector xh within a given domain to a force Fh to be applied to the cart, such that given any initial state x0 within the domain, a sequence of forces will be applied such that the cart ends up in a target state xH = (0, 0, 0, 0), as shown in Figure 1b. In this example, the reward is chosen to be rh+1 = -∥xh - xH∥2, so that the system is penalised at each time step according to the L2 norm of the difference between the state vector at that time step and the target state vector. The task is parameterised by the mass of the cart, the mass and length of the pole, and a damping constant governing the torque exerted by the hinge. In this example, the force is assumed to be normally distributed about a mean value that depends linearly on the state vector, such that Fh = θTxh + ε, where ε ∼ 𝒩(0, σ2), which corresponds to the policy of Equation (2). The stochastic term ε is included to ensure that a range of states are explored during the learning process (a larger value of σ2 leading to a greater degree of exploration). In this example, the dynamical system is simulated, but the methods and systems described herein are equally applicable to systems of real world entities, as well as in a wide range of contexts other than that of dynamical systems.
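As a concrete illustration of the Gaussian policy of Equation (2) applied to the cart-pole state vector, the following sketch (illustrative only; the parameter values, function names and random seed are assumptions, not taken from the description) samples a force whose mean is linear in the state and evaluates the reward rh+1 = -∥xh - xH∥2:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(theta, x, sigma=0.1):
    """Sample F_h ~ N(theta^T x_h, sigma^2), i.e. the Gaussian policy of Equation (2)."""
    return rng.normal(loc=theta @ x, scale=sigma)

def reward(x, x_target=np.zeros(4)):
    """r_{h+1} = -||x_h - x_H||_2: penalise the distance from the target state."""
    return -np.linalg.norm(x - x_target)

theta = np.array([-1.0, -0.5, -20.0, -2.0])   # illustrative policy parameters
x_h = np.array([0.2, 0.0, 0.1, 0.0])          # state vector (x, x_dot, phi, phi_dot)
print(sample_action(theta, x_h), reward(x_h))
```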

[0016] The example task of Figure 1 is characterised by a four-dimensional state space and a one-dimensional action space. The methods and systems described herein equally apply to tasks having higher-dimensional state and action spaces. For example, the policy of Equation (2) is extendable to tasks having a multi-dimensional action space by replacing the normal distribution with a multivariate normal distribution. For tasks having many degrees of freedom, other types of policy may be suitable, for example policies based on neural network function approximators, in which the vector of parameters represents the connection weights within the network. The methods and systems described herein are applicable to any reinforcement learning problem in which a policy is expressible in terms of a vector of parameters.

[0017] Given a task such as that described above, an agent is faced with the problem of searching for a parameter θ* that maximises the objective given by Equation (3) below, which represents the expected return for the policy πθ:

where τ is a trajectory consisting of a sequence of state-action pairs [x0:H, a0:H] and

(τ) is defined as in Equation (1). The probability pθ(τ) of acquiring a trajectory τ whilst following a policy parameterised by a vector θ is given by Equation (4):

in which P0(x0) is a distribution of initial states x0, and p(xh+1|xh,ah) denotes the probability density for transitioning to a state xh+1 if the action ah is performed from a state xh.

Policy gradient methods



[0018] In order to search for a parameter vector θ* that maximises the objective of Equation (3), a policy gradient method is adopted, in which trajectories are generated using a fixed policy πθ, and a new policy πθ̃ is determined by searching for a parameter vector θ̃ that maximises a lower bound on the corresponding new value of the objective. An appropriate lower bound for the logarithm of the objective is given by Equation (5):

log J(θ̃) ≥ ∫ p_θ(τ) ℛ(τ) log [ p_θ̃(τ) / p_θ(τ) ] dτ ∝ −D_KL( p_θ(τ) ℛ(τ) ∥ p_θ̃(τ) )        (5)

in which DKL denotes the KL divergence between two probability distributions. The method proceeds with the aim of maximising this lower bound, or equivalently solving the following minimisation problem:

min_θ̃  D_KL( p_θ(τ) ℛ(τ) ∥ p_θ̃(τ) )        (6)

Multi-task reinforcement learning



[0019] Reinforcement learning algorithms often require large numbers of trajectories in order to determine policies which are successful with regard to a particular task. Acquiring a large number of trajectories may be time-consuming, and, particularly in the case of physical systems such as those in robotics applications, may lead to damage to, or wear on, the entities being controlled during the learning process. Furthermore, a policy learned using a method such as that described above is only likely to be successful for the specific task used for the learning phase. In particular, if parameters of the task are varied (for example, the mass of the cart, the mass or length of the pole, and/or the damping constant, in the example of Figure 1), the agent will be required to learn an entirely new policy, requiring another extensive phase of collecting data from multiple trajectories.

[0020] Multi-task reinforcement learning (MTRL) methods have been developed to overcome the problems described above. In MTRL, an agent receives trajectories from multiple tasks, and the objective of the learning algorithm is to determine policies for each of these tasks, taking into account the possibility of knowledge transferability between tasks. MTRL problems can be divided into two cases, the first case being single domain MTRL and the second case being cross-domain MTRL. In the single domain case, the tasks vary with respect to each other only by varying parameter values, and accordingly have state and action spaces with common dimensionality. Figure 2 shows an example of a single domain MTRL problem having multiple tasks 201, of which three tasks 201a-c are shown. Each of the tasks 201 is a cart-pole system similar to the example of Figure 1, but the mass of the cart and the length of the pole differs between tasks. The aim of the MTRL problem is to learn a policy for each of the tasks analogous to that in the example of Figure 1, making use of knowledge transferability between the tasks to reduce the number of trajectories required from each task for successful behaviour.

[0021] In cross-domain MTRL, the dimensionality of the state and action spaces may differ between tasks, and the tasks may correspond to different types of system altogether.

[0022] Given a set of T tasks, an MTRL algorithm searches for a set of optimal policies Π* = {π(1)*, ..., π(T)*} with corresponding parameters Θ* = {θ(1)*, ..., θ(T)*}. The method is formulated analogously to the policy search method outlined above, but instead of posing T minimisation problems of the form given in (6), a single multi-task learning objective is formed:

min_{θ̃_1, ..., θ̃_T}  ∑_{t=1}^{T} ℓ_t(θ̃_t) + Reg(θ̃_1, ..., θ̃_T)        (7)

where ℓt(θ̃t) is the minimisation objective of (6) associated with the lower bound on the expected return for task t, and Reg(θ̃1, ..., θ̃T) is a regularising term used to impose a common structure needed for knowledge transfer. Knowledge transfer is encoded into the objective by representing the parameter vector associated with each task as a linear combination of shared latent components such that θ̃t = Lst, where st ∈ ℝ^d is a task-specific vector, and L ∈ ℝ^{k×d} is a matrix that serves as a shared knowledge base for all of the tasks. The regularising term in (7) is chosen such that the components of L remain bounded, and the components of st for each task are encouraged to be sparse, resulting in a high degree of knowledge transfer between tasks. The learning objective is thus given by:

min_{L, s_1, ..., s_T}  ∑_{t=1}^{T} [ ℓ_t(L s_t) + µ_1 ∥s_t∥_1 ] + µ_2 ∥L∥_F²        (8)

where ∥·∥1 is the L1 norm, ∥·∥F is the Frobenius norm, and µ1, µ2 are parameters that control the degree of knowledge transfer.
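For illustration only, the following sketch evaluates an objective of the form (8), with each task parameter vector factorised through the shared knowledge base as θt = Lst. The quadratic surrogate used for each ℓt, and all variable names, are assumptions introduced for this example; the description does not prescribe a particular form for ℓt.

```python
import numpy as np

k, d, T = 4, 3, 5                      # parameter dimension, latent dimension, number of tasks
rng = np.random.default_rng(1)

L = rng.normal(size=(k, d))            # shared knowledge base
S = rng.normal(size=(d, T))            # task-specific vectors s_1, ..., s_T (columns)
mu1, mu2 = 0.1, 0.01

# Illustrative quadratic surrogate for each task loss: l_t(theta) = ||theta - alpha_t||^2
alphas = rng.normal(size=(k, T))

def task_loss(theta, t):
    return float(np.sum((theta - alphas[:, t]) ** 2))

def mtrl_objective(L, S):
    """Evaluate the multi-task learning objective of Equation (8)."""
    total = 0.0
    for t in range(T):
        theta_t = L @ S[:, t]                          # theta_t = L s_t
        total += task_loss(theta_t, t) + mu1 * np.abs(S[:, t]).sum()
    return total + mu2 * np.linalg.norm(L, "fro") ** 2

print(mtrl_objective(L, S))
```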

Distributed MTRL



[0023] A straightforward approach to solving the minimisation problem (8) is to construct a concatenated data set of trajectories from all of the tasks, and to solve the minimisation problem in a batch fashion using central processing. However, the computational cost of this approach increases rapidly with both the number of tasks and the number of trajectories per task, potentially rendering the memory requirements and computing time prohibitive for all but the simplest MTRL problems.

[0024] The present invention provides an efficient distributed implementation of MTRL in the case that multiple processors are available, thereby overcoming the scalability issues associated with central processing.

[0025] The present invention has significant advantages over known MTRL methods. For example, the present method achieves quadratic convergence in solving for the coefficients of the shared knowledge base L. By contrast, state-of-the-art known methods achieve only linear convergence. Furthermore, existing methods require restrictive assumptions to be made about the learning objective, whereas the present method requires no such assumptions, and is hence more general.

Laplacian-based distributed MTRL



[0026] In one embodiment of the present invention, a network of processors is provided, in which the processors are connected by an undirected graph 𝒢 = (v, ε), with v denoting the processors (nodes) and ε denoting the connections between processors (edges). The total number of processors is n, and no restrictions are made on the graph topology. As shown in Figure 4, each of T learning tasks is assigned, at S401, to one of the n processors, such that the total number of tasks assigned to the ith processor is given by Ti, and therefore ∑iTi = T. Each processor receives, at S403, trajectory data for each task it has been assigned, the trajectory data for each task corresponding to sequences of states, actions and rewards generated by an agent following the policy πθt.
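A minimal sketch of this set-up is given below: an undirected processor graph is generated and T tasks are assigned evenly across the n nodes, so that the per-node counts Ti sum to T. The random edge construction and the round-robin assignment are illustrative assumptions; the description places no restriction on the graph topology or on how tasks are allocated.

```python
import numpy as np

def random_connected_edges(n, n_edges, rng):
    """Sample an undirected edge set: a spanning chain plus random extra edges."""
    edges = {(i, i + 1) for i in range(n - 1)}          # chain guarantees connectivity
    while len(edges) < n_edges:
        i, j = rng.choice(n, size=2, replace=False)
        edges.add((min(int(i), int(j)), max(int(i), int(j))))
    return sorted(edges)

def assign_tasks(n_tasks, n_nodes):
    """Assign task indices 0..T-1 to nodes as evenly as possible (round robin)."""
    return {i: list(range(i, n_tasks, n_nodes)) for i in range(n_nodes)}

rng = np.random.default_rng(2)
n, T = 10, 40
edges = random_connected_edges(n, 25, rng)
tasks_per_node = assign_tasks(T, n)
print(len(edges), [len(tasks) for tasks in tasks_per_node.values()])   # the T_i sum to T
```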

[0027] The present method proceeds to minimise the objective (8) in two stages. During the first stage the network updates, at S405, the coefficients of the shared knowledge matrix L. During this stage, the task-specific st values are held constant. This update is not straightforward, and is generally expensive in terms of computational resources due to the dependence of L on all of the training tasks. One of the contributions of the present invention is to provide an efficient distributed method for updating L.

[0028] Having updated the shared knowledge matrix L, each processor updates, at S407, the task-specific vectors st for each task it has been assigned. During this stage, the ith processor solves Ti task-specific problems of the form

min_{s_t}  ℓ_t(L s_t) + µ_1 ∥s_t∥_1        (9)

which are immediately recognisable as separate Lasso problems, and can thus be solved straightforwardly using a conventional Lasso solver. The application of Lasso solvers to problems such as (9) is well known in the art, and therefore will not be described in further detail. Having updated L and st, the processors update the policy parameters θ̃t = Lst for each task. If necessary, further trajectory data is generated for each task using the updated policy parameters θ̃t in place of θt, and the iterative reinforcement learning process begins again from S403.
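By way of illustration, the sketch below solves a problem of the form (9) with a proximal-gradient (ISTA) scheme, which is one conventional choice of Lasso solver. The quadratic surrogate used for ℓt, and the names alpha_t, Gamma_t and lasso_update_s, are assumptions made for this example only; the description does not specify the form of ℓt here.

```python
import numpy as np

def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_update_s(L, alpha_t, Gamma_t, mu1, n_iter=500):
    """Solve a problem of the form (9) for one task, assuming the quadratic
    surrogate l_t(L s) = (L s - alpha_t)^T Gamma_t (L s - alpha_t), using
    proximal gradient descent (ISTA)."""
    d = L.shape[1]
    s = np.zeros(d)
    A = L.T @ Gamma_t @ L                             # curvature of the smooth part
    b = L.T @ Gamma_t @ alpha_t
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) + 1e-12) # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * (A @ s - b)                      # gradient of the smooth part
        s = soft_threshold(s - step * grad, step * mu1)
    return s

rng = np.random.default_rng(3)
L = rng.normal(size=(4, 3))
alpha_t = rng.normal(size=4)                          # illustrative task data
Gamma_t = np.eye(4)                                   # illustrative positive-definite weighting
print(lasso_update_s(L, alpha_t, Gamma_t, mu1=0.1))
```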

[0029] A method for updating L in accordance with the present invention is described hereafter. Holding st fixed as described above leads to the following reduced minimisation problem being posed:

min_L  ∑_{t=1}^{T} ℓ_t(L s_t) + µ_2 ∥L∥_F²        (10)

In order to solve (10) using a distributed implementation, the product Lst is rewritten in terms of the column-wise vectorisation vec(L) of the shared knowledge matrix L:

L s_t = (s_t^T ⊗ I_k) vec(L)        (11)

Using this notation, the minimisation problem can be rewritten for distributed processing as

min_{L_1, ..., L_n}  ∑_{i=1}^{n} [ ∑_{t ∈ 𝒯_i} ℓ_t((s_t^T ⊗ I_k) vec(L_i)) + µ̃_2 ∥L_i∥_F² ]        (12)

such that

L_1 = L_2 = ... = L_n        (13)

where 𝒯i denotes the set of Ti tasks assigned to the ith processor, µ̃2 = µ2/n, and Li for i = 1, ..., n denotes a local version of L stored by the ith processor. Equation (13) imposes a consensus condition between the local copies. For convenience, a set of dk vectors y1, ..., ydk is introduced, defined in terms of the components of Li for i = 1, ..., n as yr = [vec(L1)(r), ..., vec(Ln)(r)]^T, allowing (12) to be rewritten as

min_{y_1, ..., y_dk}  ∑_{i=1}^{n} f_i(y_1(i), ..., y_dk(i))        (14)

in which

f_i(y_1(i), ..., y_dk(i)) = ∑_{t ∈ 𝒯_i} ℓ_t((s_t^T ⊗ I_k) vec(L_i)) + µ̃_2 ∥L_i∥_F²        (15)

is the cost function associated with the ith node, the components of vec(Li) being identified with the ith components of the vectors y1, ..., ydk through vec(Li)(r) = yr(i). The consensus condition of Equation (13) is replaced by the following condition:

ℒ y_r = 0   for r = 1, ..., dk        (16)

in which ℒ ∈ ℝ^{n×n} is the Laplacian matrix of the graph 𝒢, defined by:

ℒ(i, j) = deg(i) if i = j;   ℒ(i, j) = −1 if (i, j) ∈ ε;   ℒ(i, j) = 0 otherwise

where deg(i) denotes the degree of the ith node. It is well known that the Laplacian matrix of any graph is symmetric diagonally dominant (SDD), and it is this property that allows the constrained minimisation problem defined by (14) and (16) to be solved efficiently using a distributed implementation as described hereafter. Equation (16) can be written concisely as My = 0, where

M = I_dk ⊗ ℒ ∈ ℝ^{ndk×ndk}

and

y = [(y_1)^T, ..., (y_dk)^T]^T ∈ ℝ^{ndk}.
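As a small illustration (the example graph is arbitrary and not part of the description), the sketch below builds the Laplacian ℒ of a graph from its edge list and checks the consensus characterisation underlying Equation (16): for a connected graph, ℒyr = 0 holds exactly when all components of yr agree.

```python
import numpy as np

def graph_laplacian(n, edges):
    """Build the graph Laplacian: deg(i) on the diagonal, -1 for each edge (i, j).
    By construction the result is symmetric diagonally dominant (SDD)."""
    Lap = np.zeros((n, n))
    for i, j in edges:
        Lap[i, j] -= 1.0
        Lap[j, i] -= 1.0
        Lap[i, i] += 1.0
        Lap[j, j] += 1.0
    return Lap

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # a small connected example graph
Lap = graph_laplacian(4, edges)

y_consensus = np.full(4, 3.7)                 # all local copies equal
y_disagree = np.array([1.0, 2.0, 3.0, 4.0])   # local copies differ
print(np.allclose(Lap @ y_consensus, 0.0))    # True: consensus satisfies L y = 0
print(np.allclose(Lap @ y_disagree, 0.0))     # False: disagreement violates it
```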


Solving Laplacian-based distributed MTRL



[0030] According to a well-known method in the field of optimisation theory, the constrained optimisation problem defined by (14) and (16) is solved by reference to the dual function q(λ):

q(λ) = min_y { f(y) + λ^T M y }        (17)

in which λ = [(λ_1)^T, ..., (λ_dk)^T]^T ∈ ℝ^{ndk} is a vector of Lagrange multipliers, otherwise referred to as dual variables. The dual variables are initialised to arbitrary values, and starting with an initial set of values for y, the method proceeds to update λ such that q(λ) moves closer to its maximum. Having updated λ, the optimisation routine then updates y. The method continues to update λ and y in turn until convergence is achieved, and the matrix L is constructed from the converged coefficients of y.

[0031] During the lth update step, the dual variables λ are updated using the rule

λ_r[l+1] = λ_r[l] + ṽ_r[l],   for r = 1, ..., dk        (19)

where ṽ[l] = [(ṽ_1[l])^T, ..., (ṽ_dk[l])^T]^T is an approximation of the Newton direction. The Newton direction v[l] is the solution to the equation H[l]v[l] = -g[l], where H[l] is the Hessian matrix of the dual function and g[l] = ∇q(λ[l]) is the gradient of the dual function. For the dual function given by Equation (17), the Hessian and gradient are given by H(λ) = -M(∇²f(y(λ)))⁻¹M and g(λ) = My(λ), where f = ∑ifi is the cumulative cost function summed over all the nodes. An approximate Newton direction is thus given by a solution to the following system of equations:

M (∇²f(y(λ[l])))⁻¹ M ṽ[l] = M y(λ[l])        (20)

which reduces to dk SDD systems, one for each r = 1, ..., dk, whose ith row (i = 1, ..., n) is associated with the ith node (Equation (21)), where the Newton direction vector v[l] is split into sub-vectors such that v[l] = [(v_1[l])^T, ..., (v_dk[l])^T]^T. The right hand side of Equation (21) can be computed locally without the need for the ith processor to communicate with any other nodes, and hence Equation (21) poses a set of simultaneous linear equations to be solved by the network. Those skilled in the art will be aware of known methods for computing approximate solutions to SDD systems of equations up to any given accuracy ε. In this example, the distributed method described in "Distributed SDDM Solvers: Theory and Applications" by R. Tutunov, H. Bou-Ammar, and A. Jadbabaie (ArXiv e-prints, 2015) is used to determine an approximate solution to the SDD system (21).
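The description relies on the distributed SDDM solver cited above for systems of this type. Purely as a centralised stand-in, the sketch below applies a Jacobi iteration to a small strictly diagonally dominant system; the matrix, right hand side and function name are assumptions for this example. The Jacobi scheme is shown only because each component update uses just that component's neighbours, mirroring the locality that distributed SDD solvers exploit.

```python
import numpy as np

def jacobi_sdd_solve(A, b, n_iter=200):
    """Jacobi iteration for a strictly diagonally dominant system A x = b.
    Each update of x[i] uses only the entries x[j] with A[i, j] != 0, i.e.
    information from the neighbours of node i."""
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.zeros_like(b)
    for _ in range(n_iter):
        x = (b - R @ x) / D
    return x

# Illustrative SDD matrix: Laplacian of a 3-node path plus a positive diagonal shift
Lap = np.array([[ 1.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  1.0]])
A = Lap + np.eye(3)
b = np.array([1.0, 0.0, -1.0])
x = jacobi_sdd_solve(A, b)
print(np.allclose(A @ x, b, atol=1e-8))       # True: the iteration has converged
```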

[0032] Having determined, in collaboration with the other processors in the network, the components ṽ_r[l](i) for r = 1, ..., dk as a solution of Equation (21), the ith processor locally updates the ith components λ_r[l+1](i) of the dual variable vectors according to Equation (19), without the ith processor needing to communicate with any other nodes of the network. The ith processor then updates the corresponding ith components y_r[l+1](i) of the primal variable vectors as solutions of the following system of equations:

∂f_i(y_1[l+1](i), ..., y_dk[l+1](i)) / ∂y_r(i) = −[ℒ λ_r[l+1]](i)   for r = 1, ..., dk        (22)

The ith processor is only required to receive data from nodes in its neighbourhood N(i) in order to solve Equation (22). In other words, the ith processor is only required to receive data from memory that is directly accessible (accessible without the need to communicate with any of the other processors in the network) by processors in its neighbourhood N(i). Figure 3 shows an example of a node 301, labelled i, in a network 303 according to the present invention. The node 301 has a processor 305 and an associated memory 307, which the processor 305 is able to access directly without the need to communicate with any of the other processors in the network, and which in preparation for the lth update has stored: the cost function fi; the ith components λ_r[l](i) of the dual variable vectors; the ith components y_r[l](i) of the primal variable vectors; and the ith components ṽ_r[l](i) of the approximate Newton direction vectors. Due to the structure of the Laplacian matrix of the network graph, in solving Equation (22) the processor 305 of node 301 is required to receive the components λ_r[l+1](j) for j ∈ N(i), corresponding to components stored by nodes 307a-c. Explicitly, the right hand side of Equation (22) is given by:

−[ℒ λ_r[l+1]](i) = ∑_{j ∈ N(i)} λ_r[l+1](j) − deg(i) λ_r[l+1](i)

The procedure of updating λ and y continues until convergence is achieved, as determined by any suitable convergence criteria known in the art. When convergence has been achieved on all of the processors in the network, each processor has stored an updated version Li of the shared knowledge matrix L, where the copies are (approximately) equal to each other according to the consensus condition (13).

[0033] The routine for updating the shared knowledge base, as described above, is summarised by Figure 5. The network first determines, at S501, an approximate Newton direction in dual space. This is achieved by the ith processor computing the right hand side of Equation (21) locally without the need for the ith processor to communicate with any other nodes, and the network solving the resulting linear system of simultaneous equations (which requires communication between the nodes). The network next updates, at S503, the dual variables λ. In order to update the dual variables λ, the ith processor updates the ith components of λr for r = 1, ..., dk using the corresponding components of the update rule (19). The network next updates, at S505, the primal variables y. In order to update the primal variables y, the ith processor solves Equation (22) locally, thereby determining the ith component of each of the sub-vectors yr[l+1] for r = 1, ..., dk. The procedure of updating λ and y continues until convergence is achieved. Finally, the ith processor reconstructs, at S507, the local version Li of the shared knowledge matrix L from the corresponding converged components of y.
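By way of illustration only, the following sketch simulates the alternating dual and primal updates of Figure 5 in a centralised fashion, for a toy problem with dk = 1 and quadratic node costs fi(yi) = ½ ai (yi - ci)². The quadratic costs, the use of an exact pseudo-inverse in place of the distributed SDD solver, and all variable names are assumptions made for this example and do not form part of the description.

```python
import numpy as np

# Centralised simulation of the loop of Figure 5 (S501-S505) for dk = 1 and
# quadratic node costs f_i(y_i) = 0.5 * a[i] * (y_i - c[i])**2.
Lap = np.array([[ 2.0, -1.0, -1.0,  0.0],
                [-1.0,  2.0,  0.0, -1.0],
                [-1.0,  0.0,  2.0, -1.0],
                [ 0.0, -1.0, -1.0,  2.0]])      # Laplacian of a 4-node cycle
a = np.array([1.0, 2.0, 3.0, 4.0])              # curvature of each node cost
c = np.array([4.0, 1.0, -2.0, 3.0])             # minimiser of each node cost

lam = np.zeros(4)                               # dual variables, initialised arbitrarily
for _ in range(5):
    y = c - (Lap @ lam) / a                     # primal values minimising the Lagrangian
    g = Lap @ y                                 # dual gradient g = L y (consensus violation)
    H_neg = Lap @ np.diag(1.0 / a) @ Lap        # negative of the dual Hessian
    v = np.linalg.pinv(H_neg) @ g               # Newton direction (S501)
    lam = lam + v                               # dual update (S503)
y = c - (Lap @ lam) / a                         # final primal update (S505)

print(y)                                        # all components agree: consensus reached
print(np.sum(a * c) / np.sum(a))                # the consensus value (weighted mean of c)
```

Because the toy costs are quadratic the dual function is itself quadratic, so a single exact Newton step already yields consensus; in the general case the updates of Figure 5 are repeated until a suitable convergence criterion is met.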

[0034] Having calculated an updated version Li of the shared knowledge matrix L, the ith processor then calculates updated task-specific vectors st for each task it has been assigned by solving the corresponding minimisation problems (9). Finally, the ith processor uses the updated values of Li and st to calculate the updated policy parameters θ̃t = Li st for t = 1, ..., Ti.

[0035] Formulating the MTRL problem in accordance with the present invention leads to substantial advantages when compared with existing MTRL methods. In particular, the present method performs computations in a distributed manner, allowing the necessary computations for updating the policy parameters to be performed simultaneously by the n processors in the network. By contrast, the existing state of the art method described in "Scalable Multitask Policy Gradient Reinforcement Learning" by S. El Bsat, H. Bou Ammar, and M. E. Taylor (Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp 1847-1853, February 4-9, 2017), relies on computations being performed in a sequential manner, rather than a distributed manner, by the processors. Furthermore, the SDD property of the graph Laplacian employed in the present method ensures that the sequence {y[l],λ[l]}l≥0 of primal variables and dual variables calculated during the iterative loop of Figure 5 exhibits quadratic convergence, as opposed to linear convergence achieved by the most efficient existing methods.

Performance



[0036] In order to assess the performance of the method described herein, experiments were run in which the performance of a network implementing the present distributed MTRL method was compared against the performance of equivalent networks implementing existing methods. Specifically, solutions to the minimisation problem defined by (12) and (13) were computed using the present method and five known distributed optimisation methods, referred to in the results below as ADD, ADMM, NN, DA, and DS.

[0037] In order to compare performances, convergence of the minimisation objective of (12) and convergence of the consensus error were separately investigated, where the consensus error is defined as the mean difference between the norms of the local copies Li computed by the network.

[0038] Initially, experiments were conducted using the three benchmark virtual task domains shown in Figure 6. Figure 6a shows a simple mass-spring (SM) system in which a cart 601 is connected by means of a linear spring 603 to a fixed wall 605. The objective is to control a force Fh applied to the cart 601 such that the cart 601 reaches a specific state, for example (x, ẋ) = (0, 0). Specific tasks are parameterised by the mass of the cart 601, the spring constant of the spring 603, and the damping constant of the spring 603.

[0039] Figure 6b shows a double mass-spring (DM) system in which a first cart 607 is connected by means of a first linear spring 609 to a second cart 611, which is connected by means of a second linear spring 613 to a fixed wall 615. The objective is to control a force Fh applied to the first cart 607 such that the second cart 611 reaches a specific state, for example (x, ẋ) = (0, 0). Specific tasks are parameterised by the mass of the first cart 607, the spring constant of the first spring 609, the damping constant of the first spring 609, the mass of the second cart 611, the spring constant of the second spring 613, and the damping constant of the second spring 613.

[0040] Figure 6c shows a cart-pole (CP) system in which a uniform rigid pole 615 is connected at one end by a hinge 617 to a cart 619. The objective is to control a force Fh applied to the cart 619 such that the cart and pole end up in a specific state, for example (x, ẋ, φ, φ̇) = (0, 0, 0, 0). Specific tasks are parameterised by the mass of the cart 619, the mass and length of the pole 615, and a damping constant governing the torque exerted by the hinge 617.

[0041] For the experiments, 5000 SM tasks, 500 DM tasks, and 1000 CP tasks were generated by varying the task parameters. For the SM and DM domains, the tasks were distributed over a network of 10 nodes connected by 25 randomly-generated edges (the same network being used for all of the SM and DM experiments). For the CP domain, the tasks were distributed over a network of 50 nodes (5 nodes being associated with each of 10 processors) connected by 150 randomly-generated edges (the same network being used for all of the CP experiments). In each case, tasks were assigned evenly between the nodes.

[0042] Table 1 shows the number of iterations required to reduce the consensus error to 10^-8 for several distributed methods. For the present method - Symmetric Diagonally Dominant Newton (SDDN) - a single iteration includes steps S501-S505 of Figure 5.

Table 1: order of magnitude of the number of iterations required to achieve a consensus error of 10^-8.

       SDDN    ADD     NN      ADMM    DA      DS
  SM   10^1    10^2    10^2    10^4    10^4    10^5
  DM   10^2    10^3    10^5    10^4    10^5    10^5
  CP   10^3    10^4    10^5    10^5    10^5    10^5


[0043] The results of Table 1 demonstrate that the present method yields an order of magnitude improvement in the number of iterations required to achieve consensus between nodes.

[0044] Experiments were conducted for two higher-dimensional task domains. The first task domain is a linear model of a CH-47 tandem-rotor "Chinook" helicopter (HC) flying horizontally at 40 knots. The objective is to control the collective and differential rotor thrusts in order to stabilise the flight. 500 different HC tasks were generated by varying the separation of the rotors and damping coefficients of the rotors. The HC task domain has a 12-dimensional state space.
The second higher-dimensional task domain is a dynamic model of a humanoid robot (HR) based on the model proposed in "The eMOSAIC model for humanoid robot control" by N. Sugimoto, J. Morimoto, S. H. Hyon, and M. Kawato (Neural Networks, 13 January 2012). The task is to learn control policies that ensure stable walking. 100 different HR tasks were generated by varying the length of the legs and the body. The HR task domain has a 36-dimensional state space.

[0045] The bar charts of Figure 7 show the running time required to reach a consensus error of 10^-5 for the five task domains (SM, DM, CP, HC, and HR) using the six methods mentioned above, for equivalent processor networks. The bars are numbered according to the method used as follows: 1 = SDDN; 2 = ADD; 3 = ADMM; 4 = DA; 5 = NN; 6 = DS. For each of the experiments in Figure 7, the tasks were distributed over a network of 20 nodes connected by 50 randomly-generated edges (the same network being used for all of the experiments). For each experiment, tasks were assigned evenly between the nodes. In every experiment, convergence was achieved in a shorter time for the present method (SDDN) than for any of the other methods.

[0046] The bar chart of Figure 8 shows the effect of network topology on the time taken to reach a consensus error of 10^-5 using the six methods mentioned above, for equivalent processor networks, for the HC task domain. Four network topologies were used: small random - 10 nodes connected by 25 randomly-generated edges; medium random - 50 nodes connected by 150 randomly-generated edges; large random - 150 nodes connected by 250 randomly-generated edges; and barbell - two cliques of 10 nodes each connected by a 10 node line graph. The bars are numbered in accordance with Figure 7. For each experiment, tasks were assigned evenly between the nodes. The experiments demonstrate that the increase in speed achieved by the present method is robust under changes of network topology.

[0047] The plots of Figure 9 demonstrate the convergence of the consensus error with respect to the number of iterations for the experiments of Figure 7b. Figure 9a corresponds to the HC task domain, and Figure 9b corresponds to the HR task domain. For each of these task domains, the consensus error converges in fewer iterations for the present method (SDDN) than for any of the other methods.

[0048] The plots of Figure 10 demonstrate the convergence of the objective (12) with respect to the number of iterations for the experiments of Figure 7b. Figure 10a corresponds to the HC task domain, and Figure 10b corresponds to the HR task domain. In Figure 10a, the number of iterations required to reach convergence appears indistinguishable for the present method (SDDN) and ADD, and the objective (12) appears to converge in fewer iterations for SDDN and ADD than for the other methods. However, convergence is only achieved when the consensus error has also converged, and therefore Figure 9a shows that convergence is achieved in fewer iterations for SDDN than for ADD. For the HR task domain of Figure 10b, the objective (12) converges in fewer iterations for SDDN than for any of the other methods.

Example computing device



[0049] Figure 11 shows an example of a computing device 1101 configured for use in a network in accordance with the present invention in order to implement the methods described above. Computing device 1101 includes power supply 1103 and system bus 1105. System bus 1105 is connected to: CPU 1107; communication module 1109; memory 1111; and storage 1113. Memory 1111 stores: program code 1115; trajectory data 1117; policy data 1119; update data 1121; and environment data 1123. Program code 1115 includes agent code 1125. Storage 1113 includes policy store 1127 and trajectory store 1129. Communication module 1109 is communicatively connected to communication modules of neighbouring computing devices in the network.

[0050] In this example, agent code 1125 is run to implement a policy based on policy data 1119, and accordingly generates trajectory data 1117 through interaction with simulated tasks (the tasks being defined according to environment data 1123). Having generated trajectory data 1117, computing device 1101 communicates with neighbouring devices in the network to generate update data 1121, which is then used to update the policy data 1119. Optionally, policy data 1119 and trajectory data 1117 are stored in the policy store 1127 and the trajectory store 1129 respectively.

[0051] In other examples, a computing device may receive trajectory data generated by simulated tasks run on other devices, or alternatively by agents associated with real-life entities such as autonomous robots.

[0052] The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. In particular, the computations described in the method above may be distributed between one or more processors having multiple processor cores.

[0053] It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.


Claims

1. A computer-implemented method of determining policy parameters for each of a plurality of reinforcement learning tasks, the policy parameters depending on:

i) a common set of parameters shared by the plurality of tasks; and

ii) a task-specific set of parameters,

the method comprising each of a plurality of processors:

receiving trajectory data from a subset of the plurality of tasks; and

determining, using the received trajectory data:

i) a local version of the common set of parameters; and

ii) the task-specific set of parameters for each of the subset of the plurality of tasks,

wherein determining the local version of the common set of parameters comprises iteratively:

locally determining, from the received trajectory data, partial input data for determining a first set of intermediate variables;

taking part in a distributed computation with the other processors of the plurality of processors, whereby to determine, using the partial input data, the first set of intermediate variables; and

updating, using a subset of the first set of intermediate variables, the local version of the common set of parameters,

wherein, after at least one iteration, the local versions of the common set of parameters determined by the plurality of processors converge.
 
2. The method of claim 1, wherein the first set of intermediate variables comprises components of an approximate Newton direction vector.
 
3. The method of claim 2, wherein taking part in a distributed computation comprises determining an approximate solution to one or more symmetric diagonally-dominant linear systems of equations.
 
4. The method of any previous claim, wherein updating the local version of the common set of parameters comprises locally determining a second set of intermediate variables.
 
5. The method of any previous claim, wherein each processor is communicatively coupled with a subset of the plurality of processors, and updating the local version of the common set of parameters comprises receiving data that is directly accessible by the subset of the plurality of processors.
 
6. The method of any previous claim, further comprising determining, after at least one iteration, that the local versions of the common set of parameters determined by the plurality of processors have converged.
 
7. The method of any previous claim, further comprising generating control signals for one or more entities in a real world system.
 
8. A computer processor configured to perform the method of any previous claim.
 
9. A computer program product comprising instructions which, when executed by a processor, cause the processor to carry out the method of any of claims 1 to 6.
 
10. A distributed computing system operable to determine policy parameters for each of a plurality of reinforcement learning tasks, the policy parameters depending on:

i) a common set of parameters shared by the plurality of tasks; and

ii) a task-specific set of parameters,

the system comprising a plurality of processors connected to form a network, each processor operable to:

receive trajectory data from a subset of the plurality of tasks; and

determine, using the received trajectory data:

i) a local copy of the common set of parameters;

ii) the task-specific set of parameters for each of the subset of the plurality of tasks,

wherein determining the local copy of the common set of parameters comprises iteratively:

locally determining, from the received trajectory data, partial input data for determining a first set of intermediate variables;

taking part in a distributed computation with the other processors of the plurality of processors, whereby to determine, using the partial input data, the first set of intermediate variables; and

updating, using a subset of the first set of intermediate variables, the local version of the common set of parameters,

wherein, after at least one iteration, the local versions of the common set of parameters determined by the plurality of processors converge.
 
11. The system of claim 10, wherein the first set of intermediate variables comprises components of an approximate Newton direction vector.
 
12. The system of claim 11, wherein taking part in a distributed computation comprises determining an approximate solution to one or more symmetric diagonally-dominant linear systems of equations.
 
13. The system of any of claims 10 to 12, wherein updating the local version of the common set of parameters comprises receiving data that is directly accessible by a neighbouring subset of processors.
 
14. The system of any of claims 10 to 13, further operable to determine, after at least one iteration, that the local versions of the common set of parameters determined by the plurality of processors have converged.
 
15. The system of any of claims 10 to 14, further operable to generate control signals for one or more entities in a real world system.
 




Drawing


































Search report













Cited references

REFERENCES CITED IN THE DESCRIPTION



This list of references cited by the applicant is for the reader's convenience only. It does not form part of the European patent document. Even though great care has been taken in compiling the references, errors or omissions cannot be excluded and the EPO disclaims all liability in this regard.

Non-patent literature cited in the description

• S. EL BSAT; H. BOU AMMAR; M. E. TAYLOR. Scalable Multitask Policy Gradient Reinforcement Learning. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 4-9 February 2017, 1847-1853

• R. TUTUNOV; H. BOU-AMMAR; A. JADBABAIE. Distributed SDDM Solvers: Theory and Applications. ArXiv e-prints, 2015

• N. SUGIMOTO; J. MORIMOTO; S. H. HYON; M. KAWATO. The eMOSAIC model for humanoid robot control. Neural Networks, 13 January 2012