Technical Field
[0001] This invention is in the field of machine learning systems, and has particular applicability
to distributed reinforcement learning systems.
Background
[0002] Reinforcement learning involves a computer system learning how to perform tasks by
analysing data corresponding to previously-experienced instances of the same or similar
tasks. Multi-task reinforcement learning (MTRL) has been proposed as a means of increasing
the efficiency of reinforcement learning in cases where data is analysed from several
different tasks, allowing for knowledge transfer between the different tasks.
[0003] The application of MTRL to particular problems is often computationally expensive,
especially for tasks with high-dimensional state and/or action spaces, or in cases
where extensive volumes of experience data need to be analysed. Accordingly, there
is a need for an efficient and accurate distributed implementation of MTRL such that
the computational burden can be shared by multiple processors. Implementing a distributed
routine for MTRL is not straightforward due to the need for knowledge transfer between
different tasks, requiring sharing of data between processor nodes. Existing attempts
to implement MTRL using multiple processors, for example the method discussed in "
Scalable Multitask Policy Gradient Reinforcement Learning" by S. El Bsat, H. Bou Ammar,
and M. E. Taylor (Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence,
pp 1847-1853, February 4-9, 2017), result in relatively slow convergence rates, which are only achieved after making
restrictive assumptions about an objective function.
Summary
[0004] According to one aspect of the present invention, there is provided a computer-implemented
method of determining policy parameters for each of a plurality of reinforcement learning
tasks. The policy parameters depend on a common set of parameters shared by the plurality
of tasks and a task-specific set of parameters. The method includes each of a plurality
of processors receiving trajectory data from a subset of the plurality of tasks, and
determining using the received trajectory data: a local version of the common set
of parameters; and the task-specific set of parameters for each of the subset of the
plurality of tasks. Determining the local version of the common set of parameters
includes iteratively determining partial input data from the received trajectory data
for determining a first set of intermediate variables, and taking part in a distributed
computation with the other processors to determine the first set of intermediate variables.
Once the first set of intermediate variables has been determined, the method then
includes updating the local version of the common set of parameters using a subset
of the first set of intermediate variables. After at least one iteration, the local
versions of the common set of parameters determined by the plurality of processors
converge.
[0005] Further aspects of the invention will become apparent from the following description
of an embodiment of the invention, given by way of example only, which is made with
reference to the accompanying drawings.
Brief Description of the Drawings
[0006]
Figure 1 shows an example of a task for a reinforcement learning problem.
Figure 2 is a schematic diagram showing a network of processors being used to solve
a reinforcement learning problem involving multiple tasks.
Figure 3 is a schematic diagram of a network of processors configured in accordance
with the present invention.
Figure 4 is a flow diagram representing a routine for solving a multi-task reinforcement
learning problem.
Figure 5 is a flow diagram representing a subroutine for updating a shared knowledge
base.
Figure 6 shows examples of benchmark tasks used to test the performance of a multi-task
reinforcement learning routine.
Figure 7 shows bar charts illustrating times taken by processor networks to reach
solutions to multi-task reinforcement learning problems for different task domains.
Figure 8 shows bar charts illustrating times taken by processor networks to reach
solutions to multi-task reinforcement learning problems for different network topologies.
Figure 9 shows plots illustrating the convergence of consensus errors in multi-task
reinforcement learning problems.
Figure 10 shows plots illustrating the convergence of objective values in multi-task
reinforcement learning problems.
Figure 11 is a schematic diagram of a computing device configured for use in a network
according to the present invention.
Detailed Description
Reinforcement learning: overview
[0007] For the purposes of the following description and accompanying drawings, a reinforcement learning problem is definable by specifying the characteristics of one or more agents and an environment. The methods and systems described herein are applicable to a wide range of reinforcement learning problems.
[0008] A software agent, referred to hereafter as an agent, is a computer program component that makes decisions based on a set of input signals and performs actions based on these decisions. In some applications of reinforcement learning, each agent is associated with a real-world entity (for example, an autonomous robot). In other applications of reinforcement learning, an agent is associated with a virtual entity (for example, a non-playable character (NPC) in a video game). In some examples, an agent is implemented in software or hardware that is part of a real-world entity (for example, within an autonomous robot). In other examples, an agent is implemented by a computer system that is remote from a real-world entity.
[0009] An environment is a system with which agents interact, and a complete specification of an environment is referred to as a task. It is assumed that interactions between an agent and an environment occur at discrete time steps labelled h = 0, 1, 2, 3, .... The discrete time steps do not necessarily correspond to times separated by fixed intervals. At each time step, the agent receives data corresponding to an observation of the environment and data corresponding to a reward. The data corresponding to an observation of the environment is referred to as a state signal, and the observation of the environment is referred to as a state. The state perceived by the agent at time step h is labelled s_h. The state observed by the agent may depend on variables associated with the agent-controlled entity, for example, in the case that the entity is a robot, the position and velocity of the robot within a given spatial domain. A state s_h is generally represented by a k-dimensional state vector x_h, whose components are variables that represent various aspects of the corresponding state.
[0010] In response to receiving a state signal indicating a state s_h at a time step h, an agent is able to select and perform an action a_h from a set of available actions in accordance with a Markov Decision Process (MDP). In some examples, the state signal does not convey sufficient information to ascertain the true state of the environment, in which case the agent selects and performs the action a_h in accordance with a Partially-Observable Markov Decision Process (PO-MDP). Performing a selected action generally has an effect on the environment. Data sent from an agent to the environment as an agent performs an action is referred to as an action signal. At the next time step h + 1, the agent receives a new state signal from the environment indicating a new state s_{h+1}. In some examples, the new state signal is initiated by the agent completing the action a_h. In other examples, the new state signal is sent in response to a change in the environment.
[0011] Depending on the configuration of the agents and the environment, the set of states,
as well as the set of actions available in each state, may be finite or infinite.
The methods and systems described herein are applicable in any of these cases.
[0012] Having performed an action a_h, an agent receives a reward signal corresponding to a numerical reward r_{h+1}, where the reward r_{h+1} depends on the state s_h, the action a_h and the state s_{h+1}. The agent is thereby associated with a sequence of states, actions and rewards (s_h, a_h, r_{h+1}, s_{h+1}, ...) referred to as a trajectory τ. The reward is a real number that may be positive, negative, or zero.
[0013] In response to an agent receiving a state signal, the agent selects an action to perform based on a policy. A policy is a stochastic mapping from states to actions. If an agent follows a policy π, and receives a state signal at time step h indicating a specific state s_h, the probability of the agent selecting a specific action a_h is denoted by π(a_h | s_h). A policy for which π(a_h | s_h) takes values of either 0 or 1 for all possible combinations of a_h and s_h is a deterministic policy. Reinforcement learning algorithms specify how the policy of an agent is altered in response to sequences of states, actions, and rewards that the agent experiences.
[0014] Generally, the objective of a reinforcement learning algorithm is to find a policy that maximises the expectation value of a return, where the value of a return at any time depends on the rewards received by the agent at future times. For some reinforcement learning problems, the trajectory τ is finite, indicating a finite sequence of time steps, and the agent eventually encounters a terminal state s_H from which no further actions are available. In a problem for which τ is finite, the finite sequence of time steps is referred to as an episode, and the associated task is referred to as an episodic task. For other reinforcement learning problems, the trajectory τ is infinite, and there are no terminal states. A problem for which τ is infinite is referred to as an infinite horizon task. As an example, a possible definition of a return for an episodic task is given by Equation (1) below:

ℛ(τ) = (1/H) Σ_{h=1}^{H} r_h        (1)

In this example, the return ℛ(τ) is equal to the average reward received per time step for the trajectory τ. A skilled person will appreciate that this is not the only possible definition of a return. For example, in R-learning algorithms, the return given by Equation (1) is replaced with an infinite sum over rewards minus an average expected reward.
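As an illustration only (and not forming part of the claimed method), the average-reward return of Equation (1) can be computed from the list of rewards collected during one episode; the function name and example values below are arbitrary:

```python
# Illustrative sketch: the average-reward return of Equation (1) for a finite
# trajectory, i.e. the mean of the rewards collected over the episode.
def average_reward_return(rewards):
    """Return the average reward per time step for one episode."""
    if not rewards:
        raise ValueError("trajectory contains no rewards")
    return sum(rewards) / len(rewards)

# Example: an episode yielding rewards r_1, ..., r_4
print(average_reward_return([1.0, 0.0, -1.0, 2.0]))  # 0.5
```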
Policy search reinforcement learning
[0015] In policy search reinforcement learning, a policy is parameterised by a k-dimensional vector θ, where k is the length of the state vector x_h encoding relevant information about the state s_h. An example of such a policy is the stochastic Gaussian policy given by Equation (2):

π_θ(a_h | x_h) = N(a_h; θ^T x_h, σ²)        (2)

Equation (2) states that, in response to experiencing a state at time step h represented by a state vector x_h, a computer-controlled entity following the Gaussian policy selects an action a_h according to a Gaussian (normal) probability distribution with mean θ^T x_h and variance σ². Figure 1a shows a specific example of a task for a one-dimensional dynamical system in which a uniform rigid pole 101 is connected at one end by a hinge to a cart 103. The position and velocity of the cart are given by x and ẋ respectively, and the angular position and velocity with respect to the vertical are given by φ and φ̇ respectively. The state at any time step h can be represented by a four-dimensional state vector x_h = (x, ẋ, φ, φ̇). The aim of the reinforcement learning problem is to learn a policy that maps each possible state vector x_h within a given domain to a force F_h to be applied to the cart, such that given any initial state x_0 in the domain, a sequence of forces will be applied such that the cart ends up in a target state x_H = (0, 0, 0, 0), as shown in Figure 1b. In this example, the reward is chosen to be r_{h+1} = -∥x_h - x_H∥_2, so that the system is penalised at each time step according to the L_2 norm of the difference between the state vector at that time step and the target state vector. The task is parameterised by the mass of the cart, the mass and length of the pole, and a damping constant governing the torque exerted by the hinge. In this example, the force is assumed to be normally distributed about a mean value that depends linearly on the state vector, such that F_h = θ^T x_h + ε, where ε ∼ N(0, σ²), which corresponds to the policy of Equation (2). The stochastic term ε is included to ensure that a range of states are explored during the learning process (a larger value of σ² leading to a greater degree of exploration). In this example, the dynamical system is simulated, but the methods and systems described herein are equally applicable to systems of real-world entities, as well as in a wide range of contexts other than that of dynamical systems.
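By way of illustration only, sampling an action from the Gaussian policy of Equation (2) amounts to drawing from a normal distribution whose mean is the linear combination θ^T x_h; the parameter values in the following sketch are arbitrary assumptions, not values taken from the invention:

```python
import numpy as np

# Illustrative sketch: sampling an action a_h from the Gaussian policy of
# Equation (2), a_h ~ N(theta^T x_h, sigma^2).
def gaussian_policy_action(theta, x_h, sigma, rng):
    mean = float(theta @ x_h)                    # linear mean theta^T x_h
    return mean + sigma * rng.standard_normal()  # add exploration noise of scale sigma

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 2.0, 0.1])   # hypothetical policy parameters
x_h = np.array([0.2, 0.0, -0.1, 0.0])     # hypothetical state vector (x, xdot, phi, phidot)
a_h = gaussian_policy_action(theta, x_h, sigma=0.1, rng=rng)
```

Setting sigma to zero recovers the deterministic linear controller F_h = θ^T x_h, illustrating the role of the stochastic term ε.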
[0016] The example task of Figure 1 is characterised by a four-dimensional state space and a one-dimensional action space. The methods and systems described herein equally apply to tasks having higher-dimensional state and action spaces. For example, the policy of Equation (2) is extendable to tasks having a multi-dimensional action space by replacing the normal distribution with a multivariate normal distribution. For tasks having many degrees of freedom, other types of policy may be suitable, for example policies based on neural network function approximators, in which the vector of parameters represents the connection weights within the network. The methods and systems described herein are applicable to any reinforcement learning problem in which a policy is expressible in terms of a vector of parameters.
[0017] Given a task such as that described above, an agent is faced with the problem of searching for a parameter vector θ* that maximises the objective given by Equation (3) below, which represents the expected return for the policy π_θ:

J(θ) = E_{p_θ(τ)}[ℛ(τ)]        (3)

where τ is a trajectory consisting of a sequence of state-action pairs [x_{0:H}, a_{0:H}], and ℛ(τ) is defined as in Equation (1). The probability p_θ(τ) of acquiring a trajectory τ whilst following a policy parameterised by a vector θ is given by Equation (4):

p_θ(τ) = P_0(x_0) Π_{h=0}^{H-1} π_θ(a_h | x_h) p(x_{h+1} | x_h, a_h)        (4)

in which P_0(x_0) is a distribution of initial states x_0, and p(x_{h+1} | x_h, a_h) denotes the probability density for transitioning to a state x_{h+1} if the action a_h is performed from a state x_h.
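Purely for illustration, the expectation in Equation (3) can be approximated empirically by averaging the returns of a batch of sampled trajectories; in the sketch below a trajectory is represented simply by its list of rewards, which is a simplifying assumption and not part of the claimed method:

```python
# Illustrative sketch: Monte Carlo estimate of the objective of Equation (3),
# the expected return under p_theta, from a batch of sampled trajectories.
def estimate_objective(trajectories):
    # Average reward per time step (Equation (1)) for each trajectory,
    # then the empirical mean over trajectories approximates the expectation.
    returns = [sum(r) / len(r) for r in trajectories]
    return sum(returns) / len(returns)

print(estimate_objective([[1.0, 1.0], [0.0, 2.0]]))  # 1.0
```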
Policy gradient methods
[0018] In order to search for a parameter vector θ* that maximises the objective of Equation (3), a policy gradient method is adopted, in which trajectories are generated using a fixed policy π_θ, and a new policy π_θ̃ is determined by searching for a parameter vector θ̃ that maximises a lower bound on the corresponding new value of the objective. An appropriate lower bound for the logarithm of the objective is given by Equation (5):

log J(θ̃) ≥ -D_KL(p_θ(τ) ℛ(τ) ∥ p_θ̃(τ)) + const        (5)

in which D_KL denotes the KL divergence between two probability distributions. The method proceeds with the aim of maximising this lower bound, or equivalently solving the following minimisation problem:

min_θ̃ D_KL(p_θ(τ) ℛ(τ) ∥ p_θ̃(τ))        (6)
Multi-task reinforcement learning
[0019] Reinforcement learning algorithms often require large numbers of trajectories in order to determine policies which are successful with regard to a particular task. Acquiring a large number of trajectories may be time-consuming and, particularly in the case of physical systems such as those in robotics applications, may lead to damage to, or wear on, the entities being controlled during the learning process. Furthermore, a policy learned using a method such as that described above is only likely to be successful for the specific task used for the learning phase. In particular, if parameters of the task are varied (for example, the mass of the cart, the mass or length of the pole, and/or the damping constant, in the example of Figure 1), the agent will be required to learn an entirely new policy, requiring another extensive phase of collecting data from multiple trajectories.
[0020] Multi-task reinforcement learning (MTRL) methods have been developed to overcome
the problems described above. In MTRL, an agent receives trajectories from multiple
tasks, and the objective of the learning algorithm is to determine policies for each
of these tasks, taking into account the possibility of knowledge transferability between
tasks. MTRL problems can be divided into two cases, the first case being single domain
MTRL and the second case being cross-domain MTRL. In the single domain case, the tasks
vary with respect to each other only by varying parameter values, and accordingly
have state and action spaces with common dimensionality. Figure 2 shows an example
of a single domain MTRL problem having multiple tasks 201, of which three tasks 201a-c
are shown. Each of the tasks 201 is a cart-pole system similar to the example of Figure
1, but the mass of the cart and the length of the pole differs between tasks. The
aim of the MTRL problem is to learn a policy for each of the tasks analogous to that
in the example of Figure 1, making use of knowledge transferability between the tasks
to reduce the number of trajectories required from each task for successful behaviour.
[0021] In cross-domain MTRL, the dimensionality of the state and action spaces may differ between tasks, and the tasks may correspond to different types of system altogether.
[0022] Given a set of T tasks, an MTRL algorithm searches for a set of optimal policies Π* = {π^{(1)*}, ..., π^{(T)*}} with corresponding parameters θ^{(1)*}, ..., θ^{(T)*}. The method is formulated analogously to the policy search method outlined above, but instead of posing T minimisation problems of the form given in (6), a single multi-task learning objective is formed:

min_{θ̃_1, ..., θ̃_T} (1/T) Σ_{t=1}^{T} ℓ_t(θ̃_t) + Reg(θ̃_1, ..., θ̃_T)        (7)

where ℓ_t(θ̃_t) is the lower bound on the expected return for task t, and Reg(θ̃_1, ..., θ̃_T) is a regularising term used to impose a common structure needed for knowledge transfer. Knowledge transfer is encoded into the objective by representing the parameter vector associated with each task as a linear combination of shared latent components, such that θ̃_t = L s_t, where s_t is a task-specific vector, and L ∈ ℝ^{d×k} is a matrix that serves as a shared knowledge base for all of the tasks. The regularising term in (7) is chosen such that the components of L are ensured to be bounded, and the components of s_t for each task are encouraged to be sparse, resulting in a high degree of knowledge transfer between tasks. The learning objective is thus given by:

min_{L, s_1, ..., s_T} (1/T) Σ_{t=1}^{T} [ℓ_t(L s_t) + µ_1 ∥s_t∥_1] + µ_2 ∥L∥_F²        (8)

where ∥·∥_1 is the L_1 norm, ∥·∥_F is the Frobenius norm, and µ_1, µ_2 are parameters that control the degree of knowledge transfer.
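For illustration only, the factorisation θ̃_t = L s_t and the regularising term described above (an L_1 penalty on each task vector plus a squared Frobenius penalty on the knowledge base) can be sketched as follows; the shapes and numerical values are arbitrary assumptions:

```python
import numpy as np

# Illustrative sketch: policy parameters as a linear combination of shared
# latent components (theta_t = L s_t), and the regularising term
# mu1 * sum_t ||s_t||_1 + mu2 * ||L||_F^2 from the learning objective.
def task_parameters(L, s_t):
    return L @ s_t  # each task's parameters built from the shared knowledge base

def regulariser(L, S, mu1, mu2):
    l1 = sum(np.abs(s).sum() for s in S)  # encourages sparse task-specific vectors
    frob = np.sum(L ** 2)                 # keeps the knowledge base entries bounded
    return mu1 * l1 + mu2 * frob

L = np.array([[1.0, 0.0], [0.0, 2.0]])               # hypothetical knowledge base
S = [np.array([1.0, -1.0]), np.array([0.5, 0.0])]    # hypothetical task vectors
print(regulariser(L, S, mu1=0.1, mu2=0.01))
```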
Distributed MTRL
[0023] A straightforward approach to solving the minimisation problem (8) is to construct
a concatenated data set of trajectories from all of the tasks, and to solve the minimisation
problem in a batch fashion using central processing. However, the computational cost
of this approach increases rapidly with both the number of tasks and the number of
trajectories per task, potentially rendering the memory requirements and computing
time prohibitive for all but the simplest MTRL problems.
[0024] The present invention provides an efficient distributed implementation of MTRL in the case that multiple processors are available, thereby overcoming the scalability issues associated with central processing.
[0025] The present invention has significant advantages over known MTRL methods. For example,
the present method achieves quadratic convergence in solving for the coefficients
of the shared knowledge base
L. By contrast, state-of-the-art known methods achieve only linear convergence. Furthermore,
existing methods require restrictive assumptions to be made about the learning objective,
whereas the present method requires no such assumptions, and is hence more general.
Laplacian-based distributed MTRL
[0026] In one embodiment of the present invention, a network of processors is provided, in which the processors are connected by an undirected graph G(v, ε), with v denoting the processors (nodes) and ε denoting the connections between processors (edges). The total number of processors is n, and no restrictions are made on the graph topology. As shown in Figure 4, each of T learning tasks is assigned, at S401, to one of the n processors, such that the total number of tasks assigned to the ith processor is given by T_i, and therefore Σ_i T_i = T. Each processor receives, at S403, trajectory data for each task it has been assigned, the trajectory data for each task corresponding to sequences of states, actions and rewards generated by an agent following the policy π_{θ_t}.
[0027] The present method proceeds to minimise the objective (8) in two stages. During the first stage, the network updates, at S405, the coefficients of the shared knowledge matrix L. During this stage, the task-specific vectors s_t are held constant. This update is not straightforward, and is generally expensive in terms of computational resources due to the dependence of L on all of the training tasks. One of the contributions of the present invention is to provide an efficient distributed method for updating L.
[0028] Having updated the shared knowledge matrix L, each processor updates, at S407, the task-specific vectors s_t for each task it has been assigned. During this stage, the ith processor solves T_i task-specific problems of the form

min_{s_t} ℓ_t(L s_t) + µ_1 ∥s_t∥_1        (9)

which are immediately recognisable as separate Lasso problems, and can thus be solved straightforwardly using a conventional Lasso solver. The application of Lasso solvers to problems such as (9) is well-known in the art, and therefore will not be described in further detail. Having updated L and s_t, the processors update the policy parameters θ̃_t = L s_t for each task. If necessary, further trajectory data is generated for each task using the updated policy parameters θ̃_t in place of θ_t, and the iterative reinforcement learning process begins again from S403.
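Purely as an illustration of the kind of conventional Lasso solver referred to above (and not the solver used by the invention), a problem of the form min_s (1/2)∥A s - b∥² + µ_1 ∥s∥_1 can be solved by iterative soft-thresholding (ISTA); the matrix A and vector b stand in for the quadraticised task objective and are assumptions of this sketch:

```python
import numpy as np

# Illustrative sketch: solving a Lasso problem of the form
#     min_s (1/2) ||A s - b||^2 + mu1 ||s||_1
# with plain iterative soft-thresholding (ISTA), as a stand-in for any
# conventional Lasso solver applied to the task-specific problems (9).
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, b, mu1, n_iters=500):
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ s - b)             # gradient of the smooth quadratic part
        s = soft_threshold(s - step * grad, step * mu1)  # proximal step for the L1 term
    return s

# Example: with A = I the solution is simply soft-thresholding of b
s_t = lasso_ista(np.eye(2), np.array([1.0, 0.1]), mu1=0.5)
```

Note how the L_1 penalty drives the second coefficient exactly to zero, which is the sparsity behaviour that the regulariser in (8) is designed to encourage.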
[0029] A method for updating L in accordance with the present invention is described hereafter. Holding s_t fixed as described above leads to the following reduced minimisation problem being posed:

min_L (1/T) Σ_{t=1}^{T} ℓ_t(L s_t) + µ_2 ∥L∥_F²        (10)

In order to solve (10) using a distributed implementation, the product L s_t is rewritten in terms of the column-wise vectorisation vec(L) of the shared knowledge matrix L, using the identity:

L s_t = (s_t^T ⊗ I_d) vec(L)        (11)

Using this notation, the minimisation problem can be rewritten for distributed processing as problem (12), subject to the consensus condition:

vec(L_1) = vec(L_2) = ... = vec(L_n)        (13)

where µ̃_2 = T µ_2, and L_i for i = 1, ..., n denotes a local version of L stored by the ith processor. Equation (13) imposes a consensus condition between the local copies. For convenience, a set of dk vectors y_1, ..., y_{dk} is introduced, defined in terms of the components of L_i for i = 1, ..., n as y_r = [vec(L_1)_(r), ..., vec(L_n)_(r)]^T, allowing (12) to be rewritten as problem (14), in which f_i is the cost function associated with the ith node. The consensus condition of Equation (13) is replaced by the following condition:

ℒ y_r = 0  for r = 1, ..., dk        (16)

in which ℒ is the Laplacian matrix of the graph, defined such that ℒ_{ij} = deg(i) if i = j, ℒ_{ij} = -1 if nodes i and j are connected by an edge, and ℒ_{ij} = 0 otherwise, where deg(i) denotes the degree of the ith node. It is well known that the Laplacian matrix of any graph is symmetric diagonally dominant (SDD), and it is this property that allows the constrained minimisation problem defined by (14) and (16) to be solved efficiently using a distributed implementation as described hereafter. Equation (16) can be written concisely as My = 0, where M = I_{dk} ⊗ ℒ and y = [y_1^T, ..., y_{dk}^T]^T.
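For illustration only, the role of the graph Laplacian in encoding consensus can be sketched as follows: for a connected graph, applying the Laplacian to a vector of local values yields zero exactly when every node holds the same value, which is why the condition ℒ y_r = 0 reproduces the consensus condition (13). The graph and values below are arbitrary assumptions:

```python
import numpy as np

# Illustrative sketch: the graph Laplacian (deg(i) on the diagonal, -1 for each
# edge) and the consensus condition. For a connected graph, Laplacian @ y = 0
# holds exactly when all nodes agree on their local value.
def laplacian(n, edges):
    Lap = np.zeros((n, n))
    for i, j in edges:
        Lap[i, j] -= 1.0
        Lap[j, i] -= 1.0
        Lap[i, i] += 1.0   # diagonal accumulates deg(i)
        Lap[j, j] += 1.0
    return Lap

Lap = laplacian(3, [(0, 1), (1, 2)])    # a path graph on 3 processor nodes
consensus = np.array([2.0, 2.0, 2.0])   # all local copies agree
print(Lap @ consensus)                  # the zero vector
```

A disagreeing vector such as (1, 2, 3) is mapped to a non-zero vector, so violations of consensus are detected by the same linear operator.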
Solving Laplacian-based distributed MTRL
[0030] According to a well-known method in the field of optimisation theory, the constrained optimisation problem defined by (14) and (16) is solved by reference to the dual function q(λ):

q(λ) = min_y { f(y) + λ^T M y }

in which λ is a vector of Lagrange multipliers, otherwise referred to as dual variables. The dual variables are initialised to arbitrary values and, starting with an initial set of values for y, the method proceeds to update λ such that q(λ) moves closer to its maximum. Having updated λ, the optimisation routine then updates y. The method continues to update λ and y in turn until convergence is achieved, and the matrix L is constructed from the converged coefficients of y.
[0031] During the lth update step, the dual variables λ are updated using the rule given by Equation (19), in which v^{[l]} is an approximation of the Newton direction. The Newton direction v^{[l]} is the solution to the equation H^{[l]} v^{[l]} = -g^{[l]}, where H^{[l]} is the Hessian matrix of the dual function and g^{[l]} = ∇q(λ^{[l]}) is the gradient of the dual function. For the dual function q(λ), the Hessian and gradient are given by H(λ) = -M(∇²f(y(λ)))^{-1}M and g(λ) = My(λ), where f = Σ_i f_i is the cumulative cost function summed over all the nodes. An approximate Newton direction is thus given by a solution to the system of equations given by Equation (20), which reduces to dk SDD systems per node, given by Equation (21), for i = 1, ..., n and r = 1, ..., dk, where the Newton direction vector v^{[l]} is split into sub-vectors v_r^{[l]}. The right hand side of Equation (21) can be computed locally, without the need for the ith processor to communicate with any other nodes, and hence Equation (21) poses a set of simultaneous linear equations to be solved by the network. Those skilled in the art will be aware of known methods for computing approximate solutions to SDD systems of equations up to any given accuracy ε. In this example, the distributed method described in "Distributed SDDM Solvers: Theory and Applications" by R. Tutunov, H. Bou-Ammar, and A. Jadbabaie (ArXiv e-prints, 2015) is used to determine an approximate solution to the SDD system (21).
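As a purely illustrative stand-in for the distributed SDD solver cited above (and not a description of that solver), the kind of symmetric positive systems arising in Equation (21) can be solved iteratively, for example by conjugate gradients; the matrix below is an arbitrary small SDD example:

```python
import numpy as np

# Illustrative sketch: iteratively solving an SDD linear system A v = b of the
# kind arising in the per-node Newton-direction systems (21). Conjugate
# gradients is used here as a generic stand-in for the distributed SDDM solver.
def conjugate_gradient(A, b, tol=1e-10, max_iters=1000):
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    for _ in range(max_iters):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        p = r + (r @ r / rr) * p   # update direction with ratio of residual norms
    return x

# Example: a small SDD matrix (a path-graph Laplacian plus the identity)
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
v = conjugate_gradient(A, np.array([1.0, 0.0]))
```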
[0032] Having determined, in collaboration with the other processors in the network, the components of the vectors v_r^{[l]} as a solution of Equation (21), the ith processor locally updates the ith components of the dual variable vectors according to Equation (19), without the ith processor needing to communicate with any other nodes of the network. The ith processor then updates the corresponding ith components of the primal variable vectors y_r as solutions of the system of equations given by Equation (22). The ith processor is only required to receive data from nodes in its neighbourhood N(i) in order to solve Equation (22). In other words, the ith processor is only required to receive data from memory that is directly accessible (accessible without the need to communicate with any of the other processors in the network) by processors in its neighbourhood N(i). Figure 3 shows an example of a node 301, labelled i, in a network 303 according to the present invention. The node 301 has a processor 305 and an associated memory 307, which the processor 305 is able to access directly without the need to communicate with any of the other processors in the network, and which in preparation for the lth update has stored: the cost function f_i; the ith components of the dual variable vectors; the ith components of the primal variable vectors; and the ith components of the approximate Newton direction vectors. Due to the structure of the Laplacian matrix of the network graph, in solving Equation (22) the processor 305 of node 301 is required to receive primal variable components for j ∈ N(i), corresponding to components stored by nodes 307a-c; explicitly, the right hand side of Equation (22) is given in terms of these neighbouring components. The procedure of updating λ and y continues until convergence is achieved, as determined by any suitable convergence criteria known in the art. When convergence has been achieved on all of the processors in the network, each processor has stored an updated version L_i of the shared knowledge matrix L, where the copies are (approximately) equal to each other according to the consensus condition (13).
[0033] The routine for updating the shared knowledge base, as described above, is summarised by Figure 5. The network first determines, at S501, an approximate Newton direction in dual space. This is achieved by the ith processor computing the right hand side of Equation (21) locally, without the need for the ith processor to communicate with any other nodes, and the network solving the resulting linear system of simultaneous equations (which requires communication between the nodes). The network next updates, at S503, the dual variables λ. In order to update the dual variables λ, the ith processor updates the ith components of λ_r for r = 1, ..., dk using the corresponding components of the update rule (19). The network next updates, at S505, the primal variables y. In order to update the primal variables y, the ith processor solves Equation (22) locally, thereby determining the ith component of each of the sub-vectors y_r for r = 1, ..., dk. The procedure of updating λ and y continues until convergence is achieved. Finally, the ith processor reconstructs, at S507, the local version L_i of the shared knowledge matrix L from the corresponding converged components of y.
[0034] Having calculated an updated version L_i of the shared knowledge matrix L, the ith processor then calculates updated task-specific vectors s_t for each task it has been assigned by solving the corresponding minimisation problems (9). Finally, the ith processor uses the updated values of L_i and s_t to calculate the updated policy parameters θ̃_t = L_i s_t for t = 1, ..., T_i.
[0035] Formulating the MTRL problem in accordance with the present invention leads to substantial advantages when compared with existing MTRL methods. In particular, the present method performs computations in a distributed manner, allowing the necessary computations for updating the policy parameters to be performed simultaneously by the n processors in the network. By contrast, the existing state-of-the-art method described in "Scalable Multitask Policy Gradient Reinforcement Learning" by S. El Bsat, H. Bou Ammar, and M. E. Taylor (Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp 1847-1853, February 4-9, 2017) relies on computations being performed in a sequential manner, rather than a distributed manner, by the processors. Furthermore, the SDD property of the graph Laplacian employed in the present method ensures that the sequence {y^{[l]}, λ^{[l]}}_{l≥0} of primal variables and dual variables calculated during the iterative loop of Figure 5 exhibits quadratic convergence, as opposed to the linear convergence achieved by the most efficient existing methods.
Performance
[0036] In order to assess the performance of the method described herein, experiments were run in which the performance of a network implementing the present distributed MTRL method was compared against the performance of equivalent networks implementing existing methods. Specifically, the minimisation problem defined by (12) and (13) was solved using the present method and several known methods, namely:
- Accelerated Dual Descent (ADD) as described in "Accelerated Dual Descent for Network Flow Optimisation" by M. Zargham, A. Ribeiro,
A. E. Ozdaglar, and A. Jadbabaie (IEEE Transactions on Automatic Control, 2014).
- Network Newton (NN) as described in "Network Newton-Part II: Convergence Rate and Implementation" by A. Mokhtari, Q. Ling,
and A. Ribeiro (ArXiv e-prints, 2015).
- Alternating Direction Method of Multipliers (ADMM) as described in "Scalable Multitask Policy Gradient Reinforcement Learning" by S. El Bsat, H. Bou Ammar,
and M. E. Taylor (Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence,
pp 1847-1853, February 4-9, 2017).
- Distributed Averaging (DA) as described in "Linear Time Average Consensus on Fixed Graphs and Implications for Decentralised Optimization
and Multi-Agent Control" by A. Olshevsky (ArXiv e-prints, 2014).
- Distributed Subgradients (DS), as described in "Distributed Subgradient Methods for Multi-Agent Optimization" by A. Nedic and A. E.
Ozdaglar (IEEE Trans. Automat. Contr. (1) pp 48-61, 2009).
[0037] In order to compare performances, convergence of the minimisation objective of (12) and convergence of the consensus error were separately investigated, where the consensus error is defined as the mean difference between the norms of the local copies L_i computed by the network.
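By way of illustration only, the consensus-error metric described above can be computed from the local copies as follows; the exact definition used in the experiments may differ in detail, and the example matrices are arbitrary:

```python
import numpy as np

# Illustrative sketch: the consensus error as the mean pairwise difference
# between the (Frobenius) norms of the local copies L_i held by the nodes.
def consensus_error(local_copies):
    norms = [np.linalg.norm(L) for L in local_copies]
    diffs = [abs(a - b) for idx, a in enumerate(norms) for b in norms[idx + 1:]]
    return sum(diffs) / len(diffs)

# Two nodes in agreement, one node holding a different local copy
copies = [np.eye(2), np.eye(2), 2 * np.eye(2)]
print(consensus_error(copies))
```

The metric is zero exactly when all local copies have equal norm, which is the state the consensus condition drives the network towards.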
[0038] Initially, experiments were conducted using the three benchmark virtual task domains
shown in Figure 6. Figure 6a shows a simple mass-spring (SM) system in which a cart
601 is connected by means of a linear spring 603 to a fixed wall 605. The objective
is to control a force F_h applied to the cart 601 such that the cart 601 reaches a
specific state, for example (x, ẋ) = (0, 0). Specific tasks are parameterised by the
mass of the cart 601, the spring constant of the spring 603, and the damping constant
of the spring 603.
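The SM domain of paragraph [0038] can be illustrated with a simple Euler integrator. The equation of motion m·ẍ = F_h − kx − cẋ is an assumption consistent with the stated task parameters (mass, spring constant, damping constant); the patent does not state the dynamics explicitly.

```python
def sm_step(x, x_dot, force, mass, k, c, dt=0.01):
    """One Euler step of the simple mass-spring system of Figure 6a,
    assuming dynamics m * x_ddot = F_h - k * x - c * x_dot."""
    x_ddot = (force - k * x - c * x_dot) / mass
    return x + dt * x_dot, x_dot + dt * x_ddot
```

The goal state (x, ẋ) = (0, 0) is a fixed point of this update when F_h = 0.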
[0039] Figure 6b shows a double mass-spring (DM) system in which a first cart 607 is connected
by means of a first linear spring 609 to a second cart 611, which is connected by
means of a second linear spring 613 to a fixed wall 615. The objective is to control
a force F_h applied to the first cart 607 such that the second cart 611 reaches a
specific state, for example (x, ẋ) = (0, 0). Specific tasks are parameterised by the
mass of the first cart 607, the spring constant of the first spring 609, the damping
constant of the first spring 609, the mass of the second cart 611, the spring constant
of the second spring 613, and the damping constant of the second spring 613.
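The coupled DM dynamics of paragraph [0039] may similarly be sketched as below. The equations of motion are a plausible model under the stated parameterisation (spring 609 coupling the carts, spring 613 tying cart 611 to the wall); they are not given explicitly in the patent.

```python
def dm_step(state, force, m1, k1, c1, m2, k2, c2, dt=0.01):
    """One Euler step of the double mass-spring system of Figure 6b.
    state = (x1, v1, x2, v2): cart 607 at x1 (driven by F_h) and
    cart 611 at x2 (attached to the wall). Assumed dynamics."""
    x1, v1, x2, v2 = state
    # Spring 609 couples the carts; spring 613 ties cart 611 to the wall.
    f_couple = k1 * (x2 - x1) + c1 * (v2 - v1)
    a1 = (force + f_couple) / m1
    a2 = (-f_couple - k2 * x2 - c2 * v2) / m2
    return (x1 + dt * v1, v1 + dt * a1, x2 + dt * v2, v2 + dt * a2)
```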
[0040] Figure 6c shows a cart-pole (CP) system in which a uniform rigid pole 615 is connected
at one end by a hinge 617 to a cart 619. The objective is to control a force F_h
applied to the cart 619 such that the cart and pole end up in a specific state, for
example (x, ẋ, φ, φ̇) = (0, 0, 0, 0). Specific tasks are parameterised by the mass
of the cart 619, the mass and length of the pole 615, and a damping constant governing
the torque exerted by the hinge 617.
[0041] For the experiments, 5000 SM tasks, 500 DM tasks, and 1000 CP tasks were generated
by varying the task parameters. For the SM and DM domains, the tasks were distributed
over a network of 10 nodes connected by 25 randomly-generated edges (the same network
being used for all of the SM and DM experiments). For the CP domain, the tasks were
distributed over a network of 50 nodes (5 nodes being associated with each of 10 processors)
connected by 150 randomly-generated edges (the same network being used for all of
the CP experiments). In each case, tasks were assigned evenly between the nodes.
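The task distribution of paragraph [0041] can be sketched as follows. The edge-generation procedure is hypothetical, as the patent does not specify how the random edges were drawn; the even assignment of tasks to nodes follows the text directly.

```python
import random

def make_random_network(num_nodes, num_edges, seed=0):
    """Draw a random edge set such as those used in the experiments
    (e.g. 10 nodes / 25 edges for SM and DM). Hypothetical helper:
    the generation procedure is not specified in the patent."""
    rng = random.Random(seed)
    possible = [(i, j) for i in range(num_nodes) for j in range(i + 1, num_nodes)]
    return rng.sample(possible, num_edges)

def assign_tasks(num_tasks, num_nodes):
    """Assign tasks evenly between nodes, as in paragraph [0041]."""
    return [list(range(n, num_tasks, num_nodes)) for n in range(num_nodes)]
```

With 5000 SM tasks and 10 nodes, each node receives exactly 500 tasks.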
[0042] Table 1 shows the number of iterations required to reduce the consensus error to
10⁻⁸ for several distributed methods. For the present method - Symmetric Diagonally Dominant
Newton (SDDN) - a single iteration includes steps S501-S505 of Figure 5.
Table 1: order of magnitude of number of iterations required to achieve a consensus
error of 10⁻⁸.
|    | SDDN | ADD | NN  | ADMM | DA  | DS  |
| SM | 10¹  | 10² | 10² | 10⁴  | 10⁴ | 10⁵ |
| DM | 10²  | 10³ | 10⁵ | 10⁴  | 10⁵ | 10⁵ |
| CP | 10³  | 10⁴ | 10⁵ | 10⁵  | 10⁵ | 10⁵ |
[0043] The results of Table 1 demonstrate that the present method yields an order of magnitude
improvement in the number of iterations required to achieve consensus between nodes.
[0044] Experiments were conducted for two higher-dimensional task domains. The first task
domain is a linear model of a CH-47 tandem-rotor "Chinook" helicopter (HC) flying
horizontally at 40 knots. The objective is to control the collective and differential
rotor thrusts in order to stabilise the flight. 500 different HC tasks were generated
by varying the separation of the rotors and the damping coefficients of the rotors.
The HC task domain has a 12-dimensional state space.
The second higher-dimensional task domain is a dynamic model of a humanoid robot (HR)
based on the model proposed in "The eMOSAIC model for humanoid robot control" by N.
Sugimoto, J. Morimoto, S. H. Hyon, and M. Kawato (Neural Networks, 13 January 2012).
The objective is to learn control policies that ensure stable walking. 100 different
HR tasks were generated by varying the lengths of the legs and the body. The HR task
domain has a 36-dimensional state space.
[0045] The bar charts of Figure 7 show the running time required to reach a consensus error
of 10⁻⁵ for the five task domains (SM, DM, CP, HC, and HR) using the six methods mentioned
above, for equivalent processor networks. The bars are numbered according to the method
used as follows: 1 = SDDN; 2 = ADD; 3 = ADMM; 4 = DA; 5 = NN; 6 = DS. For each of
the experiments in Figure 7, the tasks were distributed over a network of 20 nodes
connected by 50 randomly-generated edges (the same network being used for all of the
experiments). For each experiment, tasks were assigned evenly between the nodes. In
every experiment, convergence was achieved in a shorter time for the present method
(SDDN) than for any of the other methods.
[0046] The bar chart of Figure 8 shows the effect of network topology on the time taken
to reach a consensus error of 10⁻⁵ using the six methods mentioned above, for equivalent
processor networks, for the
HC task domain. Four network topologies were used: small random - 10 nodes connected
by 25 randomly-generated edges; medium random - 50 nodes connected by 150 randomly-generated
edges; large random - 150 nodes connected by 250 randomly-generated edges; and barbell
- two cliques of 10 nodes each connected by a 10 node line graph. The bars are numbered
in accordance with Figure 7. For each experiment, tasks were assigned evenly between
the nodes. The experiments demonstrate that the increase in speed achieved by the
present method is robust under changes of network topology.
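The barbell topology of paragraph [0046] can be constructed as an edge list as sketched below; the node-numbering convention is a hypothetical choice made for illustration.

```python
def barbell_edges(clique_size=10, path_len=10):
    """Edge list for the barbell topology of paragraph [0046]: two
    cliques of `clique_size` nodes joined by a line graph of
    `path_len` nodes, attached to one node of each clique."""
    edges = []
    # First clique: nodes 0 .. clique_size - 1.
    for i in range(clique_size):
        for j in range(i + 1, clique_size):
            edges.append((i, j))
    # Second clique: nodes offset .. offset + clique_size - 1.
    offset = clique_size + path_len
    for i in range(offset, offset + clique_size):
        for j in range(i + 1, offset + clique_size):
            edges.append((i, j))
    # Line graph: nodes clique_size .. clique_size + path_len - 1,
    # chained between the two cliques.
    path = list(range(clique_size, clique_size + path_len))
    chain = [clique_size - 1] + path + [offset]
    edges.extend(zip(chain, chain[1:]))
    return edges
```

For the default sizes this yields 30 nodes, matching the "two cliques of 10 nodes each connected by a 10 node line graph" described above.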
[0047] The plots of Figure 9 demonstrate the convergence of the consensus error with respect
to the number of iterations for the experiments of Figure 7b. Figure 9a corresponds
to the HC task domain, and Figure 9b corresponds to the HR task domain. For each of
these task domains, the consensus error converges in fewer iterations for the present
method (SDDN) than for any of the other methods.
[0048] The plots of Figure 10 demonstrate the convergence of the objective (12) with respect
to the number of iterations for the experiments of Figure 7b. Figure 10a corresponds
to the HC task domain, and Figure 10b corresponds to the HR task domain. In Figure
10a, the number of iterations required to reach convergence appears indistinguishable
for the present method (SDDN) and ADD, and the objective (12) appears to converge
in fewer iterations for SDDN and ADD than for the other methods. However, convergence
is only achieved when the consensus error has also converged, and therefore Figure
9a shows that convergence is achieved in fewer iterations for SDDN than for ADD. For
the HR task domain of Figure 10b, the objective (12) converges in fewer iterations
for SDDN than for any of the other methods.
Example computing device
[0049] Figure 11 shows an example of a computing device 1101 configured for use in a network
in accordance with the present invention in order to implement the methods described
above. Computing device 1101 includes power supply 1103 and system bus 1105. System
bus 1105 is connected to: CPU 1107; communication module 1109; memory 1111; and storage
1113. Memory 1111 stores: program code 1115; trajectory data 1117; policy data 1119;
update data 1121; and environment data 1123. Program code 1115 includes agent code
1125. Storage 1113 includes policy store 1127 and trajectory store 1129. Communication
module 1109 is communicatively connected to communication modules of neighbouring
computing devices in the network.
[0050] In this example, agent code 1125 is run to implement a policy based on policy data
1119, and accordingly generates trajectory data 1117 through interaction with simulated
tasks (the tasks being defined according to environment data 1123). Having generated
trajectory data 1117, computing device 1101 communicates with neighbouring devices
in the network to generate update data 1121, which is then used to update the policy
data 1119. Optionally, policy data 1119 and trajectory data 1117 are stored in the
policy store 1127 and the trajectory store 1129 respectively.
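The communication round in which a device exchanges local parameter copies with its neighbours can be illustrated with a plain pairwise-averaging step. This stands in for, and is much simpler than, the SDDN update of the invention; it merely shows how neighbour exchanges drive the local copies towards consensus.

```python
import numpy as np

def consensus_round(local_params, edges, step=0.5):
    """One illustrative communication round: for each network edge,
    the two endpoint devices move their local copies of the common
    parameters towards each other. Simple gossip averaging, not the
    SDD-Newton update described in the patent."""
    updated = [p.copy() for p in local_params]
    for i, j in edges:
        diff = local_params[j] - local_params[i]
        updated[i] += step * diff / 2
        updated[j] -= step * diff / 2
    return updated
```

Repeated rounds on a connected network shrink the spread of the local copies, which is the behaviour the consensus-error plots of Figures 9 and 10 quantify for the actual methods.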
[0051] In other examples, a computing device may receive trajectory data generated by simulated
tasks run on other devices, or alternatively by agents associated with real-life entities
such as autonomous robots.
[0052] The above embodiments are to be understood as illustrative examples of the invention.
Further embodiments of the invention are envisaged. In particular, the computations
described in the method above may be distributed between one or more processors having
multiple processor cores.
[0053] It is to be understood that any feature described in relation to any one embodiment
may be used alone, or in combination with other features described, and may also be
used in combination with one or more features of any other of the embodiments, or
any combination of any other of the embodiments. Furthermore, equivalents and modifications
not described above may also be employed without departing from the scope of the invention,
which is defined in the accompanying claims.
1. A computer-implemented method of determining policy parameters for each of a plurality
of reinforcement learning tasks, the policy parameters depending on:
i) a common set of parameters shared by the plurality of tasks; and
ii) a task-specific set of parameters,
the method comprising each of a plurality of processors:
receiving trajectory data from a subset of the plurality of tasks; and
determining, using the received trajectory data:
i) a local version of the common set of parameters; and
ii) the task-specific set of parameters for each of the subset of the plurality of
tasks,
wherein determining the local version of the common set of parameters comprises iteratively:
locally determining, from the received trajectory data, partial input data for determining
a first set of intermediate variables;
taking part in a distributed computation with the other processors of the plurality
of processors, whereby to determine, using the partial input data, the first set of
intermediate variables; and
updating, using a subset of the first set of intermediate variables, the local version
of the common set of parameters,
wherein, after at least one iteration, the local versions of the common set of parameters
determined by the plurality of processors converge.
2. The method of claim 1, wherein the first set of intermediate variables comprises components
of an approximate Newton direction vector.
3. The method of claim 2, wherein taking part in a distributed computation comprises
determining an approximate solution to one or more symmetric diagonally-dominant linear
systems of equations.
4. The method of any previous claim, wherein updating the local version of the common
set of parameters comprises locally determining a second set of intermediate variables.
5. The method of any previous claim, wherein each processor is communicatively coupled
with a subset of the plurality of processors, and updating the local version of the
common set of parameters comprises receiving data that is directly accessible by the
subset of the plurality of processors.
6. The method of any previous claim, further comprising determining, after at least one
iteration, that the local versions of the common set of parameters determined by the
plurality of processors have converged.
7. The method of any previous claim, further comprising generating control signals for
one or more entities in a real world system.
8. A computer processor configured to perform the method of any previous claim.
9. A computer program product comprising instructions which, when executed by a processor,
cause the processor to carry out the method of any of claims 1 to 6.
10. A distributed computing system operable to determine policy parameters for each of
a plurality of reinforcement learning tasks, the policy parameters depending on:
i) a common set of parameters shared by the plurality of tasks; and
ii) a task-specific set of parameters,
the system comprising a plurality of processors connected to form a network, each
processor operable to:
receive trajectory data from a subset of the plurality of tasks; and
determine, using the received trajectory data:
i) a local copy of the common set of parameters; and
ii) the task-specific set of parameters for each of the subset of the plurality of
tasks,
wherein determining the local copy of the common set of parameters comprises iteratively:
locally determining, from the received trajectory data, partial input data for determining
a first set of intermediate variables;
taking part in a distributed computation with the other processors of the plurality
of processors, whereby to determine, using the partial input data, the first set of
intermediate variables; and
updating, using a subset of the first set of intermediate variables, the local copy
of the common set of parameters,
wherein, after at least one iteration, the local copies of the common set of parameters
determined by the plurality of processors converge.
11. The system of claim 10, wherein the first set of intermediate variables comprises
components of an approximate Newton direction vector.
12. The system of claim 11, wherein taking part in a distributed computation comprises
determining an approximate solution to one or more symmetric diagonally-dominant linear
systems of equations.
13. The system of any of claims 10 to 12, wherein updating the local copy of the common
set of parameters comprises receiving data that is directly accessible by a neighbouring
subset of processors.
14. The system of any of claims 10 to 13, further operable to determine, after at least
one iteration, that the local copies of the common set of parameters determined
by the plurality of processors have converged.
15. The system of any of claims 10 to 14, further operable to generate control signals
for one or more entities in a real world system.