CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY
TECHNICAL FIELD
[0002] The disclosure herein generally relates to the field of robotics and, more particularly,
to a method and system for imitation learning in robotics for complex task learning.
BACKGROUND
[0003] Humans and chimpanzees have the ability to learn object manipulation actions
from their environment by mere observation, i.e., without any explicit teaching. This
has inspired the robotics and AI community to use concepts from cognitive psychology
to create intelligent robots that could match human dexterity through a similar learning
paradigm that avoids explicit task-specific programming. In this pursuit, imitation
learning (IL) has gained traction in robotic control in recent years.
[0004] Traditionally, IL methodologies require explicit robot states and actions from the
demonstration, more commonly known under the umbrella of Learning from Demonstration
(LfD). The algorithms directly map robot states to actions through a nonlinear function.
In this paradigm, dynamic movement primitives (DMPs) have gained popularity. DMPs
follow the procedure of kinesthetic demonstration of a movement trajectory which is
then learnt using a set of differential equations. The trajectory can then be reproduced
with different start and goal positions for different objects. DMPs have shown success
in short-horizon manipulation tasks, but their use in long-horizon tasks still remains
a challenge. Also, these approaches fail to accommodate skill learning from expert
demonstrations that are available in other forms, such as videos.
[0005] Visual imitation learning accommodates skill transfer from video demonstrations to
robots. These approaches are more similar to the learning capabilities of humans as
they can learn by mere scene observation. Current technologies use an end-to-end approach
where the image space is directly mapped to robot actions, which reduces efficiency
and the scope of task customization. End-to-end approaches are also harder to train.
[0006] A key capability that an intelligent robot must possess is the ability to perform
complex and long-horizon object manipulation tasks without human intervention. For
example, consider a service robot that is expected to learn and perform a task such
as reaching for a water container, grasping it, moving it to a glass, pouring water
into the glass, returning the container to its original place, and releasing it. The
positions of the glass and the container can vary greatly in different situations.
Humans, on the other hand, have the unique ability to learn and mimic such complex
tasks with ease by joining task primitives (that they have learnt during the course
of life) in the correct order. However, the existing IL methods are not equipped
to handle such challenges.
SUMMARY
[0007] Embodiments of the present disclosure present technological improvements as solutions
to one or more of the above-mentioned technical problems recognized by the inventors
in conventional systems. For example, in one embodiment, a method for imitation learning
in robotics for complex task learning is provided. The processor-implemented method
includes collecting a plurality of videos of long-horizon task demonstrations and a
plurality of kinesthetic robot demonstrations, sampling a plurality of frames at a
plurality of time steps from each of the plurality of videos using a uniform temporal
sampling to identify a sequence of one or more sub-tasks in the plurality of videos
of task demonstration, and labeling the sampled plurality of frames with an associated
sub-task of the one or more sub-tasks to generate a training dataset. Further, the
processor-implemented method includes training a Task Sequencing Network (TSNet) using
the generated training dataset comprising long-horizon task demonstrations to predict
the sequence of one or more sub-tasks associated with the plurality of frames, obtaining
the one or more sub-tasks predicted with highest probability at each of the plurality
of time steps and removing one or more duplicate sub-tasks and extra blank tokens
to predict the sequence of one or more sub-tasks using the trained TSNet and building
a manipulation graph from the predicted sequence of one or more sub-tasks to generate
a parameterized trajectory based on the manipulation graph. Furthermore, the processor-implemented
method includes training one or more Dynamic Movement Primitive (DMP) models using
the plurality of kinesthetic robot task demonstrations to build a Task Agnostic DMP
Library (TADL) comprising DMPs associated with the one or more sub-tasks, and selecting
a relevant movement for the one or more sub-tasks from the TADL to execute an intended
movement comprising one or more sub-tasks on a manipulator robotic arm.
[0008] In another embodiment, a system for imitation learning in robotics for complex task
learning is provided. The system comprises a memory storing a plurality of instructions,
one or more Input/Output (I/O) interfaces, and one or more hardware processors coupled
to the memory via the one or more I/O interfaces. The one or more hardware processors
are configured by the instructions to collect a plurality of videos of long-horizon
task demonstrations and a plurality of kinesthetic robot demonstrations, and sample
a plurality of frames at a plurality of time steps from each of the plurality of videos
using a uniform temporal sampling to identify a sequence of one or more sub-tasks
in the plurality of videos of task demonstration. Further, the one or more hardware
processors are configured by the instructions to label the sampled plurality of frames
with an associated sub-task of the one or more sub-tasks to generate a training dataset,
and train a Task Sequencing Network (TSNet) using the generated training dataset
comprising long-horizon task demonstrations to predict the sequence of one or more
sub-tasks associated with the plurality of frames. Furthermore, the one or more hardware
processors are configured by the instructions to obtain the one or more sub-tasks
predicted with the highest probability at each of the plurality of time steps and remove
one or more duplicate sub-tasks and extra blank tokens to predict the sequence of
one or more sub-tasks using the trained TSNet and build a manipulation graph from
the predicted sequence of one or more sub-tasks to generate a parameterized trajectory
based on the manipulation graph. Further, the one or more hardware processors are
configured by the instructions to train one or more Dynamic Movement Primitive (DMP)
models using the plurality of kinesthetic robot task demonstrations to build a
Task Agnostic DMP Library (TADL) comprising DMPs associated with the one or more sub-tasks,
and select a relevant movement for the one or more sub-tasks from the TADL to execute
an intended movement comprising one or more sub-tasks on a manipulator robotic
arm.
[0009] In yet another embodiment, there are provided one or more non-transitory machine-readable
information storage mediums comprising one or more instructions, which when executed
by one or more hardware processors cause a method for imitation learning in robotics
for complex task learning. The processor-implemented method includes collecting
a plurality of videos of long-horizon task demonstrations and a plurality of kinesthetic
robot demonstrations, sampling a plurality of frames at a plurality of time steps from
each of the plurality of videos using a uniform temporal sampling to identify a sequence
of one or more sub-tasks in the plurality of videos of demonstration, and labeling
the sampled plurality of frames with an associated sub-task of the one or more sub-tasks
to generate a training dataset. Further, the processor-implemented method includes
training a Task Sequencing Network (TSNet) using the generated training dataset of
long-horizon task demonstration to predict the sequence of one or more sub-tasks associated
with the plurality of frames, obtaining the one or more sub-tasks predicted with highest
probability at each of the plurality of time steps and removing one or more duplicate
sub-tasks and extra blank tokens to predict the sequence of one or more sub-tasks
using the trained TSNet and building a manipulation graph from the predicted sequence
of one or more sub-tasks to generate a parameterized trajectory based on the manipulation
graph. Furthermore, the processor-implemented method includes training one or more
Dynamic Movement Primitive (DMP) models using the plurality of kinesthetic robot
task demonstrations to build a Task Agnostic DMP Library (TADL) comprising DMPs associated
with the one or more sub-tasks and selecting a relevant movement for the one or more
sub-tasks from the TADL to execute an intended movement comprising one or more
sub-tasks on a manipulator robotic arm.
[0010] It is to be understood that both the foregoing general description and the following
detailed description are exemplary and explanatory only and are not restrictive of
the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The accompanying drawings, which are incorporated in and constitute a part of this
disclosure, illustrate exemplary embodiments and, together with the description, serve
to explain the disclosed principles:
FIG. 1 illustrates a system for an imitation learning in robotics for a complex task
learning, according to some embodiments of the present disclosure.
FIGS. 2A and 2B (collectively referred to as FIG. 2) is a functional block diagram to
illustrate a system for an imitation learning in robotics for a complex task learning,
according to some embodiments of the present disclosure.
FIGS. 3A and 3B (collectively referred to as FIG. 3) is an exemplary flow diagram illustrating
a processor-implemented method for an imitation learning in robotics for a complex
task learning, according to some embodiments of the present disclosure.
FIG. 4 is a functional block diagram to illustrate Task Agnostic DMP Library (TADL)
based low level controller, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[0012] Exemplary embodiments are described with reference to the accompanying drawings.
In the figures, the left-most digit(s) of a reference number identifies the figure
in which the reference number first appears. Wherever convenient, the same reference
numbers are used throughout the drawings to refer to the same or like parts. While
examples and features of disclosed principles are described herein, modifications,
adaptations, and other implementations are possible without departing from the scope
of the disclosed embodiments.
[0013] Traditionally, IL methodologies require explicit robot states and actions from the
demonstration, more commonly known under the umbrella of Learning from Demonstration
(LfD). The algorithms directly map robot states to actions through a nonlinear function.
In this paradigm, dynamic movement primitives (DMPs) have gained popularity. DMPs
follow the procedure of kinesthetic demonstration of a movement trajectory which is
then learnt using a set of differential equations. The trajectory can then be reproduced
with different start and goal positions for different objects. DMPs have shown success
in short-horizon manipulation tasks, but their use in long-horizon tasks still remains
a challenge. Also, these approaches fail to accommodate skill learning from expert
demonstrations that are available in other forms, such as videos.
[0014] Visual imitation learning accommodates skill transfer from video demonstrations to
robots. These approaches are more similar to the learning capabilities of humans as
they can learn by mere scene observation. Current technologies use an end-to-end approach
where the image space is directly mapped to robot actions, which reduces efficiency
and the scope of task customization. End-to-end approaches are also harder to train.
[0015] A key capability that an intelligent robot must possess is the ability to perform
complex and long-horizon object manipulation tasks without human intervention. For
example, let us consider a service robot that is expected to learn and perform a task
such as reaching for a water container, grasping it, moving it to a glass, pouring
water into the glass, returning the container to its original place, and releasing
it. The positions of the glass and the container can vary greatly in different situations.
Humans, on the other hand, have the unique ability to learn and mimic such complex
tasks with ease by joining task primitives (that they have learnt during the course
of life) in the correct order. However, the existing IL methods are not equipped
to handle such challenges.
[0016] Embodiments herein provide a method and system for imitation learning in robotics
for complex task learning. The present disclosure overcomes the above-mentioned challenges
by decoupling the imitation learning task into two parts using a decoupled neural network
architecture comprising two modules: a high-level module that segments out the sequence
of primitive actions from a video of a human demonstration of a complex task, and a
low-level module that executes those primitives sequentially on the robot. The high-level
module (referred to as the Task Sequencing Network (TSNet)) is a hybrid neural network
composed of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN)
with a Connectionist Temporal Classification (CTC) loss, while the low-level controller
is built using dynamic movement primitives (DMPs). The network is trained in a supervised
learning setting. The CNN-RNN-CTC combination helps in efficiently segmenting
the task primitives in sequence for a given input video demonstration.
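As an illustration of this decoupled flow, consider the following minimal Python sketch of one imitation cycle. The names tsnet, tadl, and robot and their methods are hypothetical placeholders for this sketch, not the API of the disclosed system:

```python
# A hedged sketch of the decoupled pipeline: the high-level module infers the
# primitive sequence from video; the low-level module replays matching DMPs.
def imitate(video_frames, tsnet, tadl, robot):
    primitive_sequence = tsnet.predict(video_frames)  # e.g. ['reach', 'grasp', ...]
    for primitive in primitive_sequence:
        dmp = tadl.lookup(primitive)                  # task-agnostic DMP library
        goal = robot.goal_for(primitive)              # goal from a perception module
        robot.follow(dmp.rollout(robot.current_pose(), goal))
```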
[0017] Referring now to the drawings, and more particularly to FIG. 1 through FIG. 4, where
similar reference characters denote corresponding features consistently throughout
the figures, there are shown preferred embodiments and these embodiments are described
in the context of the following exemplary system and/or method.
[0018] FIG. 1 illustrates a block diagram of a system for imitation learning in robotics
for complex task learning,
in accordance with an example embodiment. Although the present disclosure is explained
considering that the system 100 is implemented on a server, it may be understood that
the system 100 may comprise one or more computing devices 102, such as a laptop computer,
a desktop computer, a notebook, a workstation, a cloud-based computing environment
and the like. It will be understood that the system 100 may be accessed through one
or more input/output interfaces 104-1, 104-2... 104-N, collectively referred to as
I/O interface 104. Examples of the I/O interface 104 may include, but are not limited
to, a user interface, a portable computer, a personal digital assistant, a handheld
device, a smartphone, a tablet computer, a workstation, and the like. The I/O interface
104 are communicatively coupled to the system 100 through a network 106.
[0019] In an embodiment, the network 106 may be a wireless or a wired network, or a combination
thereof. In an example, the network 106 can be implemented as a computer network,
as one of the different types of networks, such as virtual private network (VPN),
intranet, local area network (LAN), wide area network (WAN), the internet, and such.
The network 106 may either be a dedicated network or a shared network, which represents
an association of the different types of networks that use a variety of protocols,
for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet
Protocol (TCP/IP), and Wireless Application Protocol (WAP), to communicate with each
other. Further, the network 106 may include a variety of network devices, including
routers, bridges, servers, computing devices, and storage devices. The network devices
within the network 106 may interact with the system 100 through communication links.
[0020] The system 100 supports various connectivity options such as BLUETOOTH
®, USB, ZigBee, and other cellular services. The network environment enables connection
of various components of the system 100 using any communication link including Internet,
WAN, MAN, and so on. In an exemplary embodiment, the system 100 is implemented to
operate as a stand-alone device. In another embodiment, the system 100 may be implemented
to work as a loosely coupled device to a smart computing environment. Further, the
system 100 comprises at least one memory 110 with a plurality of instructions, one
or more databases 112, and one or more hardware processors 108 which are communicatively
coupled with the at least one memory to execute a plurality of modules 114 therein.
The components and functionalities of the system 100 are described further in detail.
[0021] FIGS. 2A and 2B (collectively referred to as FIG. 2) is a functional block diagram 200
to illustrate the system 100 for an imitation learning in robotics for a complex task
learning, according to some embodiments of the present disclosure. In one embodiment,
the system 100 is configured to decouple task segmentation and skill encoding during
the learning cycle by proposing a Task Sequencing Network (TSNet) associated with
the Task Agnostic DMP Library (TADL). The TSNet, a high-level controller, segments
out one or more primitive actions present in the video of a human demonstration and
the TADL executes the one or more primitives in a predefined sequence on a robot.
The TSNet gauges the intent (to be imitated) of the human in the demonstration and
the TADL is responsible for imitating the intent by the robot.
[0022] In one embodiment, firstly the TADL is designed. One or more kinesthetic Robot Demonstrations
(RD) are collected by manually actuating the manipulator robotic arm for a plurality
of primitive actions. During RD, the joint angle data of the manipulator is recorded
and is used to train the Dynamic Movement Primitives (DMPs) model for each of the
one or more primitive actions. Building the TADL is theoretically a one-time activity
because the primitive and short-horizon actions defined in the TADL are not specific
to any task (task agnostic) and are found universally in various object manipulation tasks.
[0023] In another embodiment, the TSNet is a hybrid deep neural network composed of three
components: layers of CNN followed by RNN coupled with a Connectionist Temporal Classification
(CTC) loss function. The TSNet is trained in a supervised learning setting using
video data of human demonstrations for the one or more tasks outlined in Table 1. The supervision
helps the TSNet to identify the one or more task primitives involved in a demonstration
in the correct sequence. During the test phase, the trained TSNet predicts task primitives
in their sequential order given a video of a human demonstration. The parameters of
segmented primitives are then searched in the TADL and executed on a real robot as
per the sequence predicted by the TSNet.
Table 1
Task | Label
Liquid Pouring | Reach - Grasp - Reach - Tilt - Reach - Release - Retract
Stack | Reach - Grasp - Reach - Release - Retract
Push | Reach - Reach - Retract
Pick | Reach - Grasp - Reach - Reach - Release - Retract
Liquid Mixing | Reach - Grasp - Reach - Stir - Reach - Release - Retract
Table Cleaning | Reach - Grasp - Wipe - Release - Retract
Place in Basket | Reach - Grasp - Reach - Release - Retract
Object Rotation | Reach - Grasp - Rotate - Release - Retract
Poke | Reach - Press - Retract
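For illustration, Table 1 can be expressed as a simple Python mapping from task name to its ordered primitive labels; the dictionary name and key spellings below are assumptions for this sketch:

```python
# Table 1 as a Python mapping: each task label is an ordered list of primitives.
TASK_LABELS = {
    "liquid_pouring":  ["reach", "grasp", "reach", "tilt", "reach", "release", "retract"],
    "stack":           ["reach", "grasp", "reach", "release", "retract"],
    "push":            ["reach", "reach", "retract"],
    "pick":            ["reach", "grasp", "reach", "reach", "release", "retract"],
    "liquid_mixing":   ["reach", "grasp", "reach", "stir", "reach", "release", "retract"],
    "table_cleaning":  ["reach", "grasp", "wipe", "release", "retract"],
    "place_in_basket": ["reach", "grasp", "reach", "release", "retract"],
    "object_rotation": ["reach", "grasp", "rotate", "release", "retract"],
    "poke":            ["reach", "press", "retract"],
}
```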
[0024] It would be appreciated that the use of the CTC loss function has been successful in
sequence modelling of natural languages. It allows an RNN to be used directly for sequence
modelling while removing the need for pre-segmented training data and post-processed
outputs. Herein, the CTC loss function is used for subtask/task primitive sequencing.
[0025] FIGS. 3A and 3B (collectively referred to as FIG. 3) is a flow diagram illustrating a
processor-implemented method 300 for an imitation learning in robotics for a complex
task learning implemented by the system 100 of FIG. 1. Functions of the components
of the system 100 are now explained through steps of flow diagram in FIG. 3, according
to some embodiments of the present disclosure.
[0026] Initially, at step 302 of the processor-implemented method 300, the one or more hardware
processors 108 are configured by the programmed instructions to collect a plurality
of videos of a long-horizon task demonstration and a plurality of
kinesthetic robot demonstrations, wherein during the kinesthetic robot demonstration
a manipulator robotic arm is manually actuated to perform a plurality of primitive
actions and joint angle data of the manipulator is recorded.
[0027] At the next step 304 of the processor-implemented method 300, the one or more hardware
processors 108 are configured by the programmed instructions to sample a plurality
of frames at a plurality of time steps from each of the plurality of videos using
a uniform temporal sampling to identify a sequence of one or more sub-tasks in the
plurality of videos of the task demonstration.
[0028] In one example, consider D, a set of human demonstrations drawn from a task distribution
𝒫. Here 𝒳 is the n-dimensional feature space of all the sequences and S = L is the
set of all associated subtask/task primitives, which take labels from L. Each element
in D is a tuple {F, s}, with F = {X_1, X_2, ..., X_V} being a particular demonstration
of length V, and the associated subtask sequence is s = {s_1, s_2, ..., s_T} of length
T, where T ≤ V. Generally, the number of subtasks involved in a demonstration is lower
than the number of time frames in the demonstration. The idea here is to learn a temporal
map g: F → S using the demonstrations D to predict the sequencing of subtask/task
primitives involved in novel inputs sampled from the distribution 𝒫, while minimizing
an appropriate loss function.
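A minimal data structure for one element {F, s} of D might look as follows; the class name and field names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Demonstration:
    """One element of D: a tuple {F, s} with T <= V."""
    frames: List[np.ndarray]  # F = {X_1, ..., X_V}, the V sampled video frames
    subtasks: List[int]       # s = {s_1, ..., s_T}, primitive labels from L
```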
[0029] At the next step 306 of the processor-implemented method 300, the one or more hardware
processors 108 are configured by the programmed instructions to label the sampled
plurality of frames with an associated sub-task of the one or more sub-tasks to generate
a training dataset.
[0030] In one example, for an input video, 10 frames are sampled with uniform temporal spacing
from start to end. The frames are concatenated to form a single image of size 30x800x3
and subsequently converted to a grayscale image. This is then followed by four pairs of
convolutional and max-pool layers. All convolutional layers have filters of size 3x3,
stride 1, a rectified linear unit (ReLU) activation function, and padding to preserve
shape. The multiple layers of CNN and MaxPool help to bring down the input size to the
RNN layers.
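A minimal sketch of this preprocessing follows, assuming NumPy video arrays of shape (V, H, W, 3) and a side-by-side concatenation axis; the disclosure only gives the final 30x800x3 size, so the axis choice and the luminance weights are assumptions:

```python
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int = 10) -> np.ndarray:
    """Uniform temporal sampling: pick num_frames evenly spaced indices
    from the first to the last frame of a (V, H, W, 3) video array."""
    idx = np.linspace(0, len(video) - 1, num_frames).round().astype(int)
    return video[idx]

def to_network_input(video: np.ndarray) -> np.ndarray:
    """Concatenate the sampled frames into one image and convert to grayscale."""
    strip = np.concatenate(list(sample_frames(video)), axis=1)  # (H, 10*W, 3)
    gray = strip @ np.array([0.299, 0.587, 0.114])              # luminance mix
    return gray.astype(np.float32)
```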
[0031] In order to have a probability distribution over label sequences, the network uses
a softmax output layer with two more units than the number of primitives. The extra
labels signify the probability of occurrence of the blank label and of the padding.
Further, a new set of labels L_aug is formed by including the extra labels in L. This
essentially helps to compute the probability of all possible sequences of stitching
the task primitives together to form a complex and long-horizon task. Let Γ be a continuous
map that maps an n-dimensional input sequence to a v-dimensional output sequence of
length T. The weights of Γ are trained using the demonstrations D such that the output is

u = Γ(x)    (1)

wherein u_i^t, the i-th element of u at time t, represents the activation of the i-th
output node at the t-th time instant. It gives the probability of occurrence of the
i-th label at time t. Then the probability distribution over (L_aug)^T of T-length
sequences can be found using:

p(π | x) = ∏_{t=1}^{T} u_{π_t}^t,  ∀ π ∈ (L_aug)^T    (2)

wherein the elements π of (L_aug)^T represent sequences of task primitives and are
denoted as subtask-groups. It should be noted that the outputs of the network at different
time instants have no conditional dependence on each other, as there are no interconnections
in the output layer. This allows demonstrations of different durations and different
kinds to be provided. Before moving to the next step, the additional labels need to
be removed from the predicted labels (this process is explained later) to get the
possible set of sequences, denoted L_final. The conditional probability p(l | x) of
each l ∈ L_final can then be obtained by summing up the probabilities of all the
subtask-groups associated with it. We are now ready to define g(x), which predicts
the most probable sequence of task primitives for a given input:

g(x) = argmax_{l ∈ L_final} p(l | x)    (3)

wherein g(x) is essentially a classifier that categorizes the segments of the input
video into different task primitives. Further, the network weights are updated in
a direction where the likelihood of predicting the correct sequence of task primitives
for all the input video demonstrations increases. This maximization problem is converted
to a minimization problem by defining the loss function as follows:

O = −∑_{(x,l) ∈ D} ln p(l | x)    (4)
[0032] The minimization problem is solved using a gradient descent technique. The network
is updated by taking the gradient of (4) with respect to the output. Considering the
training examples independent of each other, the gradient can be written as:

∂O/∂u_i^t = −∑_{(x,l) ∈ D} ∂ ln p(l | x)/∂u_i^t    (5)

wherein the computation of the gradient requires the CTC forward-backward algorithm.
Equation (5) is then used for network training.
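In practice the loss (4) and gradient (5) are available off the shelf; the sketch below uses PyTorch's nn.CTCLoss, which runs the CTC forward-backward algorithm internally. The blank index (11) and the (T, N, C) output shape of tsnet are assumptions carried over from the architecture described in the next paragraph:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=11, zero_infinity=True)  # blank assumed at index 11

def training_step(tsnet, optimizer, images, targets, target_lengths):
    """One gradient-descent update of equations (4)-(5)."""
    log_probs = tsnet(images).log_softmax(dim=-1)     # (T, N, C) for CTC
    input_lengths = torch.full((images.size(0),), log_probs.size(0),
                               dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()                                   # CTC forward-backward pass
    optimizer.step()
    return loss.item()
```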
[0033] At the next step 308 of the processor-implemented method 300, the one or more hardware
processors 108 are configured by the programmed instructions to train a Task Sequencing
Network (TSNet) using the generated training dataset of long-horizon task demonstrations
to predict the sequence of one or more sub-tasks associated with the plurality of
frames. The TSNet comprises a Convolutional Neural Network (CNN) and a Recurrent Neural
Network (RNN). The RNN portion comprises two bi-directional LSTM layers with 64 and
128 units, respectively. The final layer is fully connected with 12 (10+2) units:
10 for the total primitive count in the TADL, 1 for the extra blank label, and 1 for
the padding. The network output is of dimension 50 x 12, i.e., the TSNet can predict
at most a sequence of 50 primitives in a video, with 10 probable primitive options
at each step. The output of the network needs processing before it computes the
probabilities p(l | x) of all l ∈ L_final. At this stage the extra nodes are discarded
and l ∈ L_final is constructed. The process essentially follows two stages: selection
of task primitives, and removal of the duplicate labels and the extra blank tokens.
In the first stage, the task primitive with the highest probability at every time
step is chosen. This results in a sequence of 50 primitives.
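A minimal PyTorch sketch of this layout follows; the convolutional channel counts and the CNN-to-RNN reshape are assumptions, since the disclosure only fixes the LSTM sizes (64 and 128), the 12-unit output, and the 50-step sequence length:

```python
import torch.nn as nn

class TSNet(nn.Module):
    def __init__(self, n_classes: int = 12, seq_len: int = 50):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (8, 16, 32, 64):                 # four conv/max-pool pairs
            blocks += [nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
                       nn.ReLU(), nn.MaxPool2d(2)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*blocks)
        self.rnn1 = nn.LSTM(64, 64, bidirectional=True, batch_first=True)
        self.rnn2 = nn.LSTM(128, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, n_classes)            # 10 primitives + blank + pad
        self.seq_len = seq_len

    def forward(self, x):                              # x: (N, 1, H, W) grayscale
        feat = self.cnn(x).flatten(2).transpose(1, 2)  # (N, h*w, 64)
        feat = feat[:, :self.seq_len]                  # keep 50 time steps
        out, _ = self.rnn1(feat)
        out, _ = self.rnn2(out)
        return self.fc(out).transpose(0, 1)            # (T=50, N, 12)
```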
[0034] In another example, the one or more task primitives are labelled as 1-reach,
3-retract, 4-grasp, 5-release:
- a) 1 1 1 ε 4 4 1 1 ε 1 1 . . . 5 3 3 0 0 0 [length = 50]
- b) 1 ε 4 1 ε 1 5 3 0
- c) 1 4 1 1 5 3 0
- d) 1 4 1 1 5 3
wherein ε represents the blank token in the output layer. The adjacent duplicate task
primitives are merged into one (step a to b). The blank token is removed (step b to
c), and then the padded label '0' is removed (step c to d). The blank token is essential
in scenarios where repetition of labels is required in concatenation. It should be
noted that without the ε, such repetition would not be plausible and the sequence
would have merged to 1 4 1 5 3 instead. Since the primitive sequences are of different
lengths, the labels are padded with '0' to equalize their lengths before training.
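The two-stage decoding (a to d above) reduces to a short greedy-decode routine; the token ids below (blank = 11, padding = 0, matching the assumed 12-unit output layer) are illustrative for this sketch:

```python
BLANK, PAD = 11, 0  # assumed ids for the blank token and the padding label

def greedy_decode(best_path):
    """Collapse per-time-step predictions into a primitive sequence:
    merge adjacent duplicates (a -> b), then drop blanks and padding
    (b -> c -> d), mirroring the example in paragraph [0034]."""
    merged = [t for i, t in enumerate(best_path)
              if i == 0 or t != best_path[i - 1]]
    return [t for t in merged if t not in (BLANK, PAD)]

# With 1=reach, 3=retract, 4=grasp, 5=release:
# greedy_decode([1, 1, 1, 11, 4, 4, 1, 1, 11, 1, 1, 5, 3, 3, 0, 0, 0])
# == [1, 4, 1, 1, 5, 3]
```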
[0035] At the next step 310 of the processor-implemented method 300, the one or more hardware
processors 108 are configured by the programmed instructions to obtain the one or
more sub-tasks predicted with the highest probability at each of the plurality of time
steps and remove one or more duplicate sub-tasks and extra blank tokens to predict
the sequence of one or more sub-tasks using the trained TSNet.
[0036] In a Learning from Demonstration (LfD) framework, the significant advantage is that
a non-expert user, without any understanding of the underlying robotic system, can
demonstrate a set of actions that need to be learned for a given state to achieve
a desired goal. However, the following three concerns need to be accounted for: 1)
the correspondence problem - the difference in human and robot morphology introduces
a systematic domain shift; 2) generalization to a new context - the intractability of
teaching the robot every possible scenario in the environment requires spatio-temporal
scaling properties for a learned demonstration, so that it generalizes to unseen scenarios
in the robot workspace; and 3) robustness against perturbation - a robust and accurate
replication of learned movements in a dynamic environment, e.g., in the presence
of obstacles.
[0037] In the present disclosure, a demonstrated movement is modelled using a Cartesian
DMP formulation as given below:

τż = α_z(β_z(g − y) − z) + f(x)
τẏ = z

where x is the phase variable, z is the scaled velocity of the movement, g is the
desired final position of the movement, and f is a nonlinear forcing term. With the
choice τ > 0 and α_z = 4β_z, the linear part of the equation system becomes critically
damped and y, z monotonically converge to a unique attractor point at y = g, z = 0.
f(x) is defined as a linear combination of N nonlinear radial basis functions:

f(x) = (∑_{i=1}^{N} w_i ψ_i(x) / ∑_{i=1}^{N} ψ_i(x)) · x(g − y_0),  ψ_i(x) = exp(−h_i(x − c_i)²)

wherein c_i are the centers of the Gaussians distributed along the phase of the movement
and h_i their widths.
[0038] The phase x has been introduced to avoid explicit time dependency. Its evolution
follows the first-order linear dynamics:

τẋ = −α_x x

wherein α_x > 0 and x(0) = 1.
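A minimal Euler integration of the DMP equations above, for a one-dimensional movement, might look as follows; the gain values, step size, and the small denominator guard are illustrative assumptions:

```python
import numpy as np

def dmp_rollout(w, c, h, y0, g, tau=1.0, alpha_z=25.0, alpha_x=8.0,
                dt=0.001, steps=1000):
    """Integrate tau*dz = alpha_z*(beta_z*(g - y) - z) + f(x), tau*dy = z,
    tau*dx = -alpha_x*x with alpha_z = 4*beta_z (critically damped)."""
    beta_z = alpha_z / 4.0
    y, z, x = float(y0), 0.0, 1.0
    traj = [y]
    for _ in range(steps):
        psi = np.exp(-h * (x - c) ** 2)                      # basis functions
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)   # forcing term f(x)
        z += dt / tau * (alpha_z * (beta_z * (g - y) - z) + f)
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)                       # phase dynamics
        traj.append(y)
    return np.asarray(traj)
```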
[0039] At the next step 312 of the processor-implemented method 300, the one or more hardware
processors 108 are configured by the programmed instructions to build a manipulation
graph from the predicted sequence of one or more sub-tasks to generate a parameterized
trajectory based on the manipulation graph.
[0040] FIG. 4 is a functional block diagram to illustrate a Task Agnostic DMP Library (TADL)
based low-level controller, according to some embodiments of the present disclosure.
Herein, the TADL is built using basic movement primitives that form the basis of a
universal set of robotic manipulations. The plurality of kinesthetic robot demonstrations
is collected for each of the one or more primitive actions, using a predefined manipulator
arm in gravity-compensated mode. The collected kinesthetic robot demonstrations are
used to train DMP models for each of the one or more movement primitives, which are
consequently stored in the TADL. It would be appreciated that the design of the TADL
can be altered and augmented as further research builds.
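A minimal sketch of the TADL as a keyed store of trained DMP parameters; the class and method names are assumptions for illustration:

```python
class TaskAgnosticDMPLibrary:
    """Stores trained DMP parameters (e.g. w, c, h) per movement primitive
    and serves them to the low-level controller at execution time."""
    def __init__(self):
        self._store = {}

    def add(self, primitive: str, params: dict) -> None:
        self._store[primitive] = params        # one-time activity per primitive

    def lookup(self, primitive: str) -> dict:
        return self._store[primitive]          # raises KeyError if unknown
```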
[0041] At the next step 314 of the processor-implemented method 300, the one or more hardware
processors 108 are configured by the programmed instructions to train one or more
Dynamic Movement Primitive (DMP) models using the plurality of kinesthetic robot
task demonstrations to build a Task Agnostic DMP Library (TADL) comprising
DMPs associated with the one or more sub-tasks.
[0042] Finally, at the last step 316 of the processor-implemented method 300, the one or
more hardware processors 108 are configured by the programmed instructions to select
a relevant movement for the one or more sub-tasks
from the TADL to execute an intended movement comprising one or more sub-tasks
on a manipulator robotic arm.
Experiments:
[0043] A 6-DOF UR10 robot manipulator is used with an RGBD RealSense D435i sensor mounted
in an eye-in-hand configuration. A customized two-finger gripper, built using a Grabber
Reacher with a rotating jaw, is mounted on the robot end-effector. A table with 10 objects
for object manipulation is placed in the robot workspace. The proposed imitation learning
framework is tested on three different robotic manipulation tasks: liquid pouring,
table cleaning, and object placing. To test the entire framework on a real robot, the
realistic human demonstration dataset was used for the three tasks. Human demonstration
videos are provided for each of the three tasks. The TSNet predicts the sequence of
primitives (manipulation graph) for each task:
- (a) table cleaning: reach - grasp - wipe - release - retract,
- (b) object placing: reach - grasp - reach - release - retract, and
- (c) liquid pouring: reach - grasp - reach - tilt - reach - release - retract.
[0044] The primitives in the manipulation graph are, in each case, attached with semantic
goals using a perception module to generate a semantic manipulation graph. Finally,
the semantic manipulation graph is executed by the low-level controller using the
TADL. The execution steps are shown in Fig. 14. Object and scene configurations differed
between the human demonstrations and the robot workspace during the experiments. This
validation demonstrates the usability of the generalized decoupled imitation learning
framework for complex robotic tasks.
[0045] The written description describes the subject matter herein to enable any person
skilled in the art to make and use the embodiments. The scope of the subject matter
embodiments is defined by the claims and may include other modifications that occur
to those skilled in the art. Such other modifications are intended to be within the
scope of the claims if they have similar elements that do not differ from the literal
language of the claims or if they include equivalent elements with insubstantial differences
from the literal language of the claims.
[0046] The embodiments of the present disclosure herein address the unresolved problem of
long-horizon object manipulation tasks performed by an intelligent robot without human
intervention. Embodiments herein provide a decoupled architecture for imitation learning.
A high-level module performs primitive action segmentation, while the low-level controller
is responsible for sequential action execution on the robot. The high-level controller
is a novel hybrid neural network with a CTC loss function that helps avoid the tedious
and intractable task of frame-wise labelling by requiring only the final sequence of
primitives as the label. A universal set of DMPs is proposed for the low-level control.
[0047] It is to be understood that the scope of the protection is extended to such a program
and in addition to a computer-readable means having a message therein; such computer-readable
storage means contain program-code means for implementation of one or more steps of
the method, when the program runs on a server or mobile device or any suitable programmable
device. The hardware device can be any kind of device which can be programmed including
e.g., any kind of computer like a server or a personal computer, or the like, or any
combination thereof. The device may also include means which could be e.g., hardware
means like e.g., an application-specific integrated circuit (ASIC), a field-programmable
gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC
and an FPGA, or at least one microprocessor and at least one memory with software
processing components located therein. Thus, the means can include both hardware means,
and software means. The method embodiments described herein could be implemented in
hardware and software. The device may also include software means. Alternatively,
the embodiments may be implemented on different hardware devices, e.g., using a plurality
of CPUs.
[0048] The embodiments herein can comprise hardware and software elements. The embodiments
that are implemented in software include but are not limited to, firmware, resident
software, microcode, etc. The functions performed by various components described
herein may be implemented in other components or combinations of other components.
For the purposes of this description, a computer-usable or computer readable medium
can be any apparatus that can comprise, store, communicate, propagate, or transport
the program for use by or in connection with the instruction execution system, apparatus,
or device.
[0049] The illustrated steps are set out to explain the exemplary embodiments shown, and
it should be anticipated that ongoing technological development will change the manner
in which particular functions are performed. These examples are presented herein for
purposes of illustration, and not limitation. Further, the boundaries of the functional
building blocks have been arbitrarily defined herein for the convenience of the description.
Alternative boundaries can be defined so long as the specified functions and relationships
thereof are appropriately performed. Alternatives (including equivalents, extensions,
variations, deviations, etc., of those described herein) will be apparent to persons
skilled in the relevant art(s) based on the teachings contained herein. Such alternatives
fall within the scope of the disclosed embodiments. Also, the words "comprising,"
"having," "containing," and "including," and other similar forms are intended to be
equivalent in meaning and be open ended in that an item or items following any one
of these words is not meant to be an exhaustive listing of such item or items or meant
to be limited to only the listed item or items. It must also be noted that as used
herein and in the appended claims, the singular forms "a," "an," and "the" include
plural references unless the context clearly dictates otherwise.
[0050] Furthermore, one or more computer-readable storage media may be utilized in implementing
embodiments consistent with the present disclosure. A computer-readable storage medium
refers to any type of physical memory on which information or data readable by a processor
may be stored. Thus, a computer-readable storage medium may store instructions for
execution by one or more processors, including instructions for causing the processor(s)
to perform steps or stages consistent with the embodiments described herein. The term
"computer-readable medium" should be understood to include tangible items and exclude
carrier waves and transient signals, i.e., be non-transitory. Examples include random
access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory,
hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage
media.
[0051] It is intended that the disclosure and examples be considered as exemplary only,
with a true scope of disclosed embodiments being indicated by the following claims.
1. A processor-implemented method (300), comprising:
collecting (302), via an input/output interface, a plurality of videos of a long-horizon
task demonstration and a plurality of kinesthetic robot demonstrations, wherein during
the kinesthetic robot demonstration a manipulator robotic arm is manually actuated
to perform a plurality of primitive actions and joint angle data of the manipulator
is recorded;
sampling (304), via one or more hardware processors, a plurality of frames at a plurality
of time steps from each of the plurality of videos using a uniform temporal sampling
to identify a sequence of one or more sub-tasks in the plurality of videos of a task
demonstration;
labeling (306), via the one or more hardware processors, the sampled plurality of
frames with an associated sub-task of the one or more sub-tasks to generate a training
dataset;
training (308), via the one or more hardware processors, a Task Sequencing Network
(TSNet) using the generated training dataset comprising the long-horizon task demonstration
to predict the sequence of one or more sub-tasks associated with the plurality of
frames;
obtaining (310), via the one or more hardware processors, the one or more sub-tasks
predicted with highest probability at each of the plurality of time steps, and removing
one or more duplicate sub-tasks and extra blank tokens to predict the sequence of
one or more sub-tasks using the trained TSNet;
building (312), via the one or more hardware processors, a manipulation graph from
the predicted sequence of one or more sub-tasks to generate a parameterized trajectory
based on the manipulation graph;
training (314), via the one or more hardware processors, one or more Dynamic Movement
Primitive (DMP) models using the plurality of kinesthetic robot task demonstrations
to build a Task Agnostic DMP Library (TADL) comprising DMPs associated with the one
or more sub-tasks; and
selecting (316), via the one or more hardware processors, a relevant movement for
the one or more sub-tasks from the TADL to execute an intended movement comprising
one or more sub-tasks on a manipulator robotic arm.
2. The processor-implemented method (300) as claimed in claim 1, wherein the TSNet comprises
a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
3. The processor-implemented method (300) as claimed in claim 2, wherein the CNN is trained
to classify the plurality of frames.
4. The processor-implemented method (300) as claimed in claim 2, wherein the RNN is trained
to learn the temporal relationship between the plurality of frames.
5. The processor-implemented method (300) as claimed in claim 2, wherein the CNN and
the RNN are trained using a Connectionist Temporal Classification (CTC) loss function.
6. A system (100) comprising:
a memory (110) storing instructions;
one or more Input/Output (I/O) interfaces (104); and
one or more hardware processors (108) coupled to the memory (110) via the one or more
I/O interfaces (104), wherein the one or more hardware processors are configured by
the instructions to:
collect a plurality of videos of a long-horizon task demonstration and a plurality
of kinesthetic robot demonstrations, wherein during the kinesthetic robot demonstration
a manipulator robotic arm is manually actuated to perform a plurality of primitive
actions and joint angle data of the manipulator is recorded;
sample a plurality of frames at a plurality of time steps from each of the plurality
of videos using a uniform temporal sampling to identify a sequence of one or more
sub-tasks in the plurality of videos of a task demonstration;
label the sampled plurality of frames with an associated sub-task of the one or more
sub-tasks to generate a training dataset;
train a Task Sequencing Network (TSNet) using the generated training dataset comprising
the long-horizon task demonstration to predict the sequence of one or more sub-tasks
associated with the plurality of frames;
obtain the one or more sub-tasks predicted with highest probability at each of the
plurality of time steps, and remove one or more duplicate sub-tasks and extra blank
tokens to predict the sequence of one or more sub-tasks using the trained TSNet;
build a manipulation graph from the predicted sequence of one or more sub-tasks to
generate a parameterized trajectory based on the manipulation graph;
train one or more Dynamic Movement Primitive (DMP) models using the plurality of
kinesthetic robot task demonstrations to build a Task Agnostic DMP Library (TADL) comprising
DMPs associated with the one or more sub-tasks; and
select a relevant movement for the one or more sub-tasks from the TADL to execute
an intended movement comprising one or more sub-tasks on a manipulator robotic
arm.
7. The system (100) as claimed in claim 6, wherein the TSNet comprises a Convolutional
Neural Network (CNN) and a Recurrent Neural Network (RNN).
8. The system (100) as claimed in claim 7, wherein the CNN is trained to classify the
plurality of frames.
9. The system (100) as claimed in claim 7, wherein the RNN is trained to learn the temporal
relationship between the plurality of frames.
10. The system (100) as claimed in claim 7, wherein the CNN and the RNN are trained using
a Connectionist Temporal Classification (CTC) loss function.
11. One or more non-transitory machine-readable information storage mediums comprising
one or more instructions which when executed by one or more hardware processors cause:
collecting, via an input/output interface, a plurality of videos of a long-horizon
task demonstration and a plurality of kinesthetic robot demonstrations, wherein during
the kinesthetic robot demonstration a manipulator robotic arm is manually actuated
to perform a plurality of primitive actions and joint angle data of the manipulator
is recorded;
sampling a plurality of frames at a plurality of time steps from each of the plurality
of videos using a uniform temporal sampling to identify a sequence of one or more
sub-tasks in the plurality of videos of a task demonstration;
labeling the sampled plurality of frames with an associated sub-task of the one or
more sub-tasks to generate a training dataset;
training a Task Sequencing Network (TSNet) using the generated training dataset comprising
the long-horizon task demonstration to predict the sequence of one or more sub-tasks
associated with the plurality of frames;
obtaining the one or more sub-tasks predicted with highest probability at each of
the plurality of time steps, and removing one or more duplicate sub-tasks and extra
blank tokens to predict the sequence of one or more sub-tasks using the trained TSNet;
building a manipulation graph from the predicted sequence of one or more sub-tasks
to generate a parameterized trajectory based on the manipulation graph;
training one or more Dynamic Movement Primitive (DMP) models using the plurality of
kinesthetic robot task demonstrations to build a Task Agnostic DMP Library (TADL)
comprising DMPs associated with the one or more sub-tasks; and
selecting a relevant movement for the one or more sub-tasks from the TADL to execute
an intended movement comprising one or more sub-tasks on a manipulator robotic
arm.
12. The one or more non-transitory machine-readable information storage mediums of claim
11, wherein the TSNet comprises a Convolutional Neural Network (CNN) and a Recurrent
Neural Network (RNN).
13. The one or more non-transitory machine-readable information storage mediums of claim
12, wherein the CNN is trained to classify the plurality of frames.
14. The one or more non-transitory machine-readable information storage mediums of claim
12, wherein the RNN is trained to learn the temporal relationship between the plurality
of frames.
15. The one or more non-transitory machine-readable information storage mediums of claim
12, wherein the CNN and the RNN are trained using a Connectionist Temporal Classification
(CTC) loss function.