(19)
(11) EP 2 118 749 B9

(12) CORRECTED EUROPEAN PATENT SPECIFICATION
Note: Bibliography reflects the latest situation

(15) Correction information:
Corrected version no 1 (W1 B1)
Corrections, see
Claims EN

(48) Corrigendum issued on:
03.11.2010 Bulletin 2010/44

(45) Mention of the grant of the patent:
07.07.2010 Bulletin 2010/27

(21) Application number: 07847538.1

(22) Date of filing: 29.11.2007
(51) International Patent Classification (IPC): 
G06F 11/20(2006.01)
(86) International application number:
PCT/EP2007/063022
(87) International publication number:
WO 2008/071555 (19.06.2008 Gazette 2008/25)

(54)

FAST BACKUP OF COMPUTE NODES IN A MASSIVELY PARALLEL COMPUTER SYSTEM

SCHNELLES BACKUP VON DATENVERARBEITUNGSKNOTEN IN EINEM MASSIV PARALLELEN COMPUTERSYSTEM

SAUVEGARDE RAPIDE DE NOEUDS DE CALCUL DANS UN SYSTÈME INFORMATIQUE MASSIVEMENT PARALLÈLE


(84) Designated Contracting States:
AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR

(30) Priority: 11.12.2006 US 608891

(43) Date of publication of application:
18.11.2009 Bulletin 2009/47

(73) Proprietor: International Business Machines Corporation
Armonk, NY 10504 (US)

(72) Inventors:
  • DARRINGTON, David
    Rochester, Minnesota 55906 (US)
  • MCCARTHY, Patrick Joseph
    Rochester, Minnesota 55901 (US)
  • PETERS, Amanda
    Winchester, Hampshire SO21 2JN (GB)
  • SIDELNIK, Albert
    Winchester, Hampshire SO21 2JN (GB)

(74) Representative: Robertson, Tracey 
IBM United Kingdom Limited Intellectual Property Law Hursley Park
Winchester, Hampshire SO21 2JN (GB)


(56) References cited:
WO-A-2005/106668
US-A1- 2004 153 754
US-A- 5 689 646
   
  • GARA A: "Overview of the Blue Gene/L system architecture", IBM Journal of Research and Development, IBM, USA, [Online] vol. 49, no. 2-3, 1 March 2005 (2005-03-01), pages 195-212, XP002469210, ISSN: 0018-8646 [retrieved on 2008-02-15]
   
Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 99(1) European Patent Convention).


Description

Technical field of the invention



[0001] This invention generally relates to backing up and fault recovery in a computing system, and more specifically relates to an apparatus for fast backup of compute nodes in a massively parallel supercomputer.

Background of the invention



[0002] Efficient fault recovery is important to decrease down time and repair costs for sophisticated computer systems. On parallel computer systems with a large number of compute nodes, a failure of a single component may cause a large portion of the computer to be taken off line for repair.

[0003] Massively parallel computer systems are one type of parallel computer system that have a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack.

[0004] The Blue Gene/L supercomputer communicates over several communication networks. The 65,536 computational nodes are arranged into both a logical tree network and a 3-dimensional torus network. The logical tree network connects the computational nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice like structure that allows each compute node to communicate with its closest 6 neighbours in a section of the computer. Since the compute nodes are arranged in a torus and tree network that require communication with adjacent nodes, a hardware failure of a single node can bring a large portion of the system to a standstill until the faulty hardware can be repaired. For example, a single node failure could render inoperable a complete section of the torus network, where a section of the torus network in the Blue Gene/L system is a half a rack or 512 nodes. Further, all the hardware assigned to the partition of the failure may also need to be taken off line until the failure is corrected.

[0005] On large parallel computer systems in the prior art, a failure of a single node during execution often requires that the data of an entire partition of the computer be saved to an external file system so the partition can be taken off line. The data must then be reloaded to a backup partition for the job to resume. When a failure event occurs, it is advantageous to be able to save the data of the software application quickly so that the application can resume on the backup hardware with minimal delay, to increase the overall system efficiency. Without a way to more effectively save the software state and data, parallel computer systems will continue to waste potential computer processing time and increase operating and maintenance costs.

[0006] US 2004/0153754 A1 discloses a multiprocessor parallel computer which is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine fully functioning system.

Disclosure of the invention



[0007] According to the preferred embodiments, a method is described for a fast backup of a set of compute nodes to save the state of the software in a parallel computer system. A fast backup mechanism in the service node of the computer system configures a set of nodes to be used for a backup and, when needed, uses the network hardware to perform a fast node-to-node backup from an original set of nodes to the backup set of nodes. The fast backup mechanism takes advantage of the high speed data transfer capability of the torus network to copy, node to node, all the node data for the job executing on the nodes. In the preferred embodiments the fast backup copies a midplane or rack of nodes to a backup rack of nodes.

[0008] The disclosed embodiments are directed to the Blue Gene architecture but can be implemented on any parallel computer system with multiple processors arranged in a network structure. The preferred embodiments are particularly advantageous for massively parallel computer systems.

[0009] Viewed from a first aspect, the present invention provides a parallel computer system comprising: a plurality of midplanes, each midplane comprising a plurality of interconnected compute nodes with node data and a plurality of link chips that connect the plurality of midplanes to a set of cables that interconnect the plurality of midplanes; a fast backup mechanism in a service node of the computer system that sets up a midplane as a backup midplane by setting the plurality of link chips in the backup midplane into a pass through mode that passes data coming to a first link chip in the backup midplane to a second link chip in an adjacent midplane; and wherein the fast backup mechanism instructs all the nodes in an original midplane to copy all their data to corresponding nodes in the backup midplane using temporary coordinates for the corresponding nodes in the backup midplane.

[0010] Preferably, the present invention provides a parallel computer system wherein the fast backup mechanism sets node coordinates in the backup midplane to a set of coordinates corresponding to the original midplane and puts the plurality of link chips in the original midplane into the pass through mode.

[0011] Preferably, the present invention provides a parallel computer system wherein the compute nodes in the midplane are interconnected with a torus network to connect each node with its six nearest neighbours.

[0012] Preferably, the present invention provides a parallel computer system further comprising cables to connect the midplanes to their six nearest neighbours.

[0013] Preferably, the present invention provides a parallel computer system wherein the computer system is a massively parallel computer system.

[0014] Preferably, the present invention provides a parallel computer system further comprising copy code in static random access memory (SRAM) to copy all the data in the compute nodes of the original midplane to corresponding nodes in the backup midplane.

[0015] Viewed from a second aspect, the present invention provides a method for fast backup of compute nodes in a parallel computer system where the method comprises the steps of:

setting up one of a plurality of midplanes, each midplane comprising a plurality of interconnected compute nodes with node data and a plurality of link chips that connect the plurality of midplanes to a set of cables that interconnect the plurality of midplanes;

setting up a midplane as a backup midplane and setting the plurality of link chips in the backup midplane into a pass through mode that passes data coming to a first link chip in the backup midplane to a second link chip in an adjacent midplane; and

instructing all of the nodes in an original midplane to copy all their data to corresponding nodes in the backup midplane using temporary coordinates for the corresponding nodes in the backup midplane.



[0016] Preferably, the present invention provides a method wherein the step of setting up one of a plurality of midplanes as a backup midplane comprises the steps of: programming a plurality of link chips on the backup midplane to the pass through mode; programming link chips on a plurality of remaining midplanes into a partition to pass node data to adjacent midplanes; and scheduling a job to execute on the partition.

[0017] Preferably, the present invention provides a method wherein the step of performing a fast backup further comprises the steps of: programming the plurality of link chips on the backup midplane to the normal mode to accept data; assigning temporary coordinates to the nodes in the backup midplane; and notifying all the nodes in the original midplane to send all data on the node to the corresponding node in the backup midplane.

[0018] Preferably, the present invention provides a method further comprising the steps of: programming the link chips in the original midplane to the pass through mode; and switching the coordinates in the backup midplane to match the coordinates of the original midplane to configure the backup midplane to take the place of the original midplane.

[0019] Preferably, the present invention provides a method further comprising the step of: terminating any jobs running on the backup midplane if there are any.

[0020] Preferably, the present invention provides a method wherein the step of copying data from the original midplane to the backup midplane is accomplished with copy code located in SRAM of the compute nodes.

[0021] Viewed from a third aspect, the invention provides a computer-readable program product comprising: a fast backup mechanism in a service node of the computer system that sets up a midplane as a backup midplane by setting a plurality of link chips in the backup midplane into a pass through mode that passes data coming to a first link chip to a second link chip in an adjacent midplane and the fast backup mechanism instructs all compute nodes in an original midplane to copy all their data to corresponding compute nodes in the backup midplane using temporary coordinates for the corresponding nodes in the backup midplane; and computer recordable media bearing the fast backup mechanism.

[0022] Preferably, the present invention provides a computer-readable program product wherein the fast backup mechanism sets node coordinates in the backup midplane to a set of coordinates corresponding to the original midplane and puts the plurality of link chips in the original midplane into the pass through mode.

[0023] Preferably, the present invention provides a computer-readable program product wherein the compute nodes in the midplane are interconnected with a torus network to connect each node with its six nearest neighbours.

[0024] Preferably, the present invention provides a computer-readable program product wherein the computer system is a massively parallel computer system.

[0025] Preferably, the present invention provides a computer-readable program product wherein the backup mechanism places copy code into SRAM in the compute nodes and the copy code copies all the data in the compute nodes in an original midplane to corresponding compute nodes in the backup midplane.

BRIEF DESCRIPTION OF DRAWINGS



[0026] Embodiments of the invention are described below in detail, by way of example only, with reference to the accompanying drawings in which:

Figure 1 is a block diagram of a massively parallel computer system in accordance with a preferred embodiment of the present invention;

Figure 2 is a block diagram of a compute node in a massively parallel computer system according to the prior art;

Figure 3 is a block diagram of a compute node in a massively parallel computer system according to the prior art;

Figure 4 is a block diagram of a midplane in a massively parallel computer system according to the prior art;

Figure 5 is a block diagram of a link card in a massively parallel computer system according to the prior art;

Figure 6 is a block diagram that shows the different modes of operation of a link chip in a massively parallel computer system according to the prior art;

Figure 7 is a block diagram representing a partition of a highly interconnected computer system such as a massively parallel computer system to illustrate an example according to preferred embodiments of the present invention;

Figure 8 is a block diagram of the partition shown in Figure 7 configured with a set of backup racks in accordance with a preferred embodiment of the present invention;

Figure 9 is a highly simplified block diagram representing a partition operating in the normal mode in accordance with a preferred embodiment of the present invention;

Figure 10 is a block diagram representing the partition in Figure 9 in the copy mode of operation in accordance with a preferred embodiment of the present invention;

Figure 11 is a block diagram representing the partition in Figure 10 now in the backup mode of operation in accordance with a preferred embodiment of the present invention;

Figure 12 is a method flow diagram for fast backup of compute nodes in a parallel computer system according to a preferred embodiment of the present invention;

Figure 13 is a method flow diagram that illustrates one possible method for implementing step 1210 of the method shown in Figure 12 in accordance with a preferred embodiment of the present invention; and

Figure 14 is another method flow diagram that illustrates one possible method for implementing step 1230 of the method shown in Figure 12 in accordance with a preferred embodiment of the present invention.


Detailed description of the invention



[0027] The present invention relates to an apparatus and method for fast backup of compute nodes in a highly interconnected computer system such as a massively parallel supercomputer system. When a rack of nodes has a failure, the application software is suspended while the data on all the nodes is copied to a backup rack and the torus network is rerouted to include the backup rack in place of the failing rack. The preferred embodiments will be described with respect to the Blue Gene/L massively parallel computer being developed by International Business Machines Corporation (IBM).

[0028] Figure 1 shows a block diagram that represents a massively parallel computer system 100 such as the Blue Gene/L computer system. The Blue Gene/L system is a scalable system in which the maximum number of compute nodes is 65,536. Each node 110 has an application specific integrated circuit (ASIC) 112, also called a Blue Gene/L compute chip 112. The compute chip incorporates two processors or central processor units (CPUs) and is mounted on a node daughter card 114. The node also typically has 512 megabytes of local memory (not shown). A node board 120 accommodates 32 node daughter cards 114, each having a node 110. Thus, each node board has 32 nodes, with 2 processors for each node, and the associated memory for each processor. A rack 130 is a housing that contains 32 node boards 120. Each of the node boards 120 connects into a midplane printed circuit board 132 with a midplane connector 134. The midplane 132 is inside the rack and not shown in Figure 1. The full Blue Gene/L computer system would be housed in 64 racks 130 or cabinets with 32 node boards 120 in each. The full system would then have 65,536 nodes and 131,072 CPUs (64 racks x 32 node boards x 32 nodes x 2 CPUs).

[0029] The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to 1024 compute nodes 110 is handled by each I/O node that has an I/O processor 170 connected to the service node 140. The I/O nodes have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide area network capabilities through a gigabit Ethernet network (not shown). The gigabit Ethernet network is connected to an I/O processor (or Blue Gene/L link chip) 170 located on a node board 120 that handles communication from the service node 140 to a number of nodes. The Blue Gene/L system has one or more I/O processors 170 on an I/O board (not shown) connected to the node board 120. The I/O processors can be configured to communicate with 8, 32 or 64 nodes. The service node uses the gigabit network to control connectivity by communicating to link cards on the compute nodes. The connections to the I/O nodes are similar to the connections to the compute node except the I/O nodes are not connected to the torus network.

[0030] Again referring to Figure 1, the computer system 100 includes a service node 140 that handles the loading of the nodes with software and controls the operation of the whole system. The service node 140 is typically a mini computer system such as an IBM pSeries server running Linux with a control console (not shown). The service node 140 is connected to the racks 130 of compute nodes 110 with a control system network 150. The control system network provides control, test, and bring-up infrastructure for the Blue Gene/L system. The control system network 150 includes various network interfaces that provide the necessary communication for the massively parallel computer system. The network interfaces are described further below.

[0031] The service node 140 manages the control system network 150 dedicated to system management. The control system network 150 includes a private 100-Mb/s Ethernet connected to an Ido chip 180 located on a node board 120 that handles communication from the service node 140 to a number of nodes. This network is sometimes referred to as the JTAG network since it communicates using the JTAG protocol. All control, test, and bring-up of the compute nodes 110 on the node board 120 is governed through the JTAG port communicating with the service node. In addition, the service node 140 includes a fast backup mechanism 142. The fast backup mechanism comprises software in the service node 140 that operates to copy from one midplane to another according to preferred embodiments claimed herein.

[0032] The Blue Gene/L supercomputer communicates over several communication networks. Figure 2 is a block diagram that shows the I/O connections of a compute node on the Blue Gene/L computer system. The 65,536 computational nodes and 1024 I/O processors 170 are arranged into both a logical tree network and a logical 3-dimensional torus network. The torus network logically connects the compute nodes in a lattice-like structure that allows each compute node 110 to communicate with its closest 6 neighbours. In Figure 2, the torus network is illustrated by the X+, X-, Y+, Y-, Z+ and Z- network connections that connect the node to six respective adjacent nodes. The tree network is represented in Figure 2 by the tree0, tree1 and tree2 connections. Other communication networks connected to the node include a JTAG network and the global interrupt network. The JTAG network provides communication for testing and control from the service node 140 over the control system network 150 shown in Figure 1. The global interrupt network is used to implement software barriers for synchronization of similar processes on the compute nodes to move to a different phase of processing upon completion of some task. Further, there are clock and power signals to each compute node 110.
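
The wrap-around neighbour relationship of the torus network described above can be illustrated with a short sketch. The following Python fragment is an illustration only and is not code from the Blue Gene/L system; the function name and the example dimensions are assumptions made for the sketch.

# Illustrative sketch: the six nearest neighbours of a node on a 3-D torus
# with wrap-around, corresponding to the X+, X-, Y+, Y-, Z+ and Z-
# connections described above.  The dimensions are example values.
def torus_neighbours(x, y, z, dim_x, dim_y, dim_z):
    """Return the (X+, X-, Y+, Y-, Z+, Z-) neighbour coordinates."""
    return [
        ((x + 1) % dim_x, y, z),  # X+
        ((x - 1) % dim_x, y, z),  # X-
        (x, (y + 1) % dim_y, z),  # Y+
        (x, (y - 1) % dim_y, z),  # Y-
        (x, y, (z + 1) % dim_z),  # Z+
        (x, y, (z - 1) % dim_z),  # Z-
    ]

# Example: the neighbours of node (0, 0, 0) in a single 8 x 8 x 8 midplane.
print(torus_neighbours(0, 0, 0, 8, 8, 8))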

[0033] Figure 3 illustrates a block diagram of a compute node 110 in the Blue Gene/L computer system according to the prior art. The compute node 110 has a node compute chip 112 that has two processing units 310A, 310B. Each processing unit 310 has a processing core 312. The processing units 310 are connected to a level three memory cache (L3 cache) 320, and to a static random access memory (SRAM) memory bank 330. Data from the L3 cache 320 is loaded to a bank of dual data rate (DDR) synchronous dynamic random access memory (SDRAM) 340 by means of a DDR memory controller 350.

[0034] Again referring to Figure 3, the SRAM memory 330 is connected to a JTAG interface 360 that communicates off the compute chip 112 to an Ido chip 180. The service node communicates with the compute node through the Ido chip 180 over an Ethernet link that is part of the control system network 150 (described above with reference to Figure 1). In the Blue Gene/L system there is one Ido chip per node board 120, and others on boards in each midplane 132 (Figure 1). The Ido chips receive commands from the service node using raw UDP packets over a trusted private 100 Mbit/s Ethernet control network. The Ido chips support a variety of serial protocols for communication with the compute nodes. The JTAG protocol is used for reading and writing from the service node 140 (Figure 1) to any address of the SRAMs 330 in the compute nodes 110 and is used for the system initialization and booting process.

[0035] As illustrated in Figure 3, the SRAM 330 includes a personality 335. During the boot process, the service node stores information that is specific to an individual node in the personality. The personality includes the X, Y, Z coordinates 336 for the local node as assigned by the service node. When the node is initialized, initialization software uses the X, Y, Z coordinates 336 in the personality 335 to configure this node to the coordinates as assigned. The service node can change the X, Y, Z coordinates and direct the node to change its assigned coordinates. This prior art feature is used by the fast backup mechanism as described further below.
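
By way of illustration, the coordinate fields of the personality and their reassignment by the service node can be modelled as follows. This is a minimal sketch assuming a simplified personality containing only the X, Y, Z coordinates; the class and function names are hypothetical and the real Blue Gene/L personality layout is not reproduced.

# Minimal model of the per-node personality described above, reduced to
# the X, Y, Z coordinate fields used by the fast backup mechanism.
from dataclasses import dataclass

@dataclass
class Personality:
    x: int
    y: int
    z: int

def reassign_coordinates(personality, new_x, new_y, new_z):
    # Models the service node rewriting a node's assigned coordinates,
    # e.g. to give a backup node a temporary coordinate.
    personality.x, personality.y, personality.z = new_x, new_y, new_z

p = Personality(x=0, y=0, z=0)
reassign_coordinates(p, 8, 0, 0)
print(p)   # Personality(x=8, y=0, z=0)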

[0036] Again referring to Figure 3, in preferred embodiments, the SRAM 330 further includes fast copy code 337. The fast copy code 337 is used to copy the contents of the node's SDRAM memory 340 during the backup as directed by the fast backup mechanism 142 (Figure 1) in the service node. In the prior art, copying from node to node over the torus network was typically done with code executing from the DDR SDRAM 340. Further, in the prior art approach to node backup, the original node would copy the contents of memory to an external device (a file in the file system) and the target node would read it from the file system. In contrast, the fast copy code in the SRAM 330 supports copying the contents of the entire DDR SDRAM 340 from an original midplane to a backup midplane over the torus network without writing the contents to an external file. Utilizing the torus network for backup is much faster than copying to a file. Also, the prior procedure to write to a file needed to execute out of DDR memory because it was a much larger procedure that would not fit in SRAM. In contrast, the fast copy code can be a small amount of code since it does not involve file system access and can therefore be deployed in the smaller SRAM 330 memory.

[0037] During the backup process, the service node suspends all code execution from SDRAM 340 and directs the fast copy code 337 to perform the fast copy of the SDRAM 340 memory to the backup node. On the receiving end of the fast copy, the backup node may also use the fast copy code 337 in the SRAM 330 in receiving the backup data.
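
The node-to-node copy directed by the fast copy code can be sketched as follows. This is a purely illustrative model, not the Blue Gene/L SRAM copy code: the send_over_torus() primitive, the chunk size and the use of a bytes object for the SDRAM image are all assumptions made for the sketch.

# Illustrative sketch: stream the whole SDRAM image of a node to its
# backup node in fixed-size chunks over the torus network, without ever
# writing the contents to an external file.
CHUNK_SIZE = 4096          # assumed chunk size for the sketch

sent_chunks = []           # records what the stand-in network send was given

def send_over_torus(dest_coords, payload):
    # Stand-in for the torus network send primitive.
    sent_chunks.append((dest_coords, payload))

def fast_copy_sdram(sdram_image, backup_coords):
    for offset in range(0, len(sdram_image), CHUNK_SIZE):
        send_over_torus(backup_coords, sdram_image[offset:offset + CHUNK_SIZE])

fast_copy_sdram(b"\x00" * 10000, backup_coords=(8, 0, 0))
print(len(sent_chunks), "chunks sent")   # 3 chunks for a 10000-byte image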

[0038] The node compute chip 112, illustrated in Figure 3, further includes network hardware 390. The network hardware 390 includes hardware for the Torus 392, Tree 394 and Global interrupt 396 networks. These networks of the Blue Gene/L are used for a compute node 110 to communicate with the other nodes in the system as described briefly above. The network hardware 390 allows the compute node to receive and pass along data packets over the torus network. The network hardware 390 handles network data traffic independently so the compute node's processors are not burdened by the amount of data flowing on the torus network.

[0039] Figure 4 illustrates a midplane 132 of the BG/L computer system. As stated above, each rack of nodes is divided into two midplanes. Each of the midplanes is connected to its six adjacent neighbours in the torus network as indicated by the arrows from each face of the midplane 132. Besides the 16 node cards, each with 32 BG/L compute nodes, each midplane contains four link cards 410 with six link chips 510 (shown in Figure 5) on each link card for a total of 24 link chips per midplane. At the midplane boundaries, all the BG/L networks pass through a link chip. The link chip serves two functions. First, it re-drives signals over the cables between the midplanes, restoring the high-speed signal shape and amplitude in the middle of a long lossy trace-cable-trace connection between compute ASICs on different midplanes. Second, the link chip can redirect signals between its different ports. This redirection function allows BG/L to be partitioned into multiple logically separate systems.

[0040] Again referring to Figure 4, each midplane communicates with its 6 neighbouring midplanes on the torus network. The connections to the 6 neighbouring midplanes are designated by their Cartesian coordinates with respect to the midplane and therefore lie in the X+, X-, Y+, Y-, Z+ and Z- directions as shown. In addition, there is an additional set of connections in the X axis called X split cables. The X split cables include an X+ split cable 420 and an X- split cable 422. The X split cables 420, 422 provide a way to enhance partition functionality by providing an additional route for connecting the torus network in the X dimension. When some midplanes are used as a backup as described herein, the X split cables can also be used to group backup midplanes or racks into a partition for use by other applications when the backup racks are not needed as a backup.

[0041] Figure 5 illustrates a block diagram of a link card 410 with six link chips 510. Each link chip 510 has six ports (A, B, C, D, E, and F). Ports A and B are connected directly to nodes in a midplane through midplane connections 512. The other four ports are connected to cables or are unused. In the BG/L system, the link card only has 16 cable connectors, each attached to a link chip driving or receiving port; therefore 8 ports of the link chips are unused. The logic inside the link chip supports arbitrary static routing of any port to any other port. This routing is set by the host at the time the partition is created and is static until another partition is created or reconfigured. The chip contains three send ports (B, C, D) and three receive ports (A, E, F); signals received at each input port can be routed to any of the output ports. The A and B ports are connected to the midplane. The F and C ports are connected to a cable in the X, Y or Z dimension. The E and D ports that are used are connected to an X split cable (420, 422 in Figure 4). Each link chip port supports 21 differential pairs (16 data signals, a sense signal to prevent an unpowered chip from being driven by driver outputs from the other end of the cable, a spare signal, a parity signal, and two asynchronous global interrupt signals).

[0042] The BG/L torus interconnect requires a node to be connected to its six nearest neighbours (X+, X-, Y+, Y-, Z+, Z-) in a logical 3D Cartesian array. The connections to the six neighbours are made at the node level and at the midplane level. Each midplane is an 8 x 8 x 8 array of nodes. The six faces (X+, X-, Y+, Y-, Z+, Z-) of the node array in the midplane are each 8 × 8 = 64 nodes in size. Each torus network signal from the 64 nodes on each of the six faces is communicated through the link chips to the corresponding nodes in adjacent midplanes. The signals of each face may also be routed back to the inputs of the same midplane on the opposite face when the midplane is used in a partition with a depth of one midplane in any dimension. Each link chip port serves 16 unidirectional torus links entering and exiting the midplane using the data signals of the 21 pairs through each port. Each midplane is served by 24 link chips, with two ports of each link chip carrying 16 data signals on each port. Thus the six faces, each with 64 nodes, require 384 input and 384 output data signals, supplied by 2 ports on the 24 link chips with each port supporting 16 data signals (16 x 24 = 384 for input and 384 for output).
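
The signal counts quoted above can be verified with a short calculation; the following lines simply restate the arithmetic of the paragraph and introduce no values beyond those given in the text.

# Check of the link-signal arithmetic given above.
nodes_per_face = 8 * 8                    # 64 nodes on each face of a midplane
faces = 6                                 # X+, X-, Y+, Y-, Z+, Z-
torus_links = faces * nodes_per_face      # 384 links enter and 384 leave

link_chips = 24
signals_per_port = 16
per_direction = link_chips * signals_per_port   # one port per chip per direction

assert torus_links == 384
assert per_direction == 384               # matches the 16 x 24 = 384 figure
print(torus_links, per_direction)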

[0043] Figure 6 illustrates the different modes of operation for the link chip 510 introduced above. When the link chip 510 is in the normal mode 610, the link chip 510 connects Port A to Port F and Port B to Port C. The normal mode 610 connects the midplane to the regular cables in the X, Y and Z dimensions. When the link chip 510 is connected in pass through mode 612, Port C is connected to Port F to bypass the midplane and send all signals to the next midplane in the torus network. In the Blue Gene/L system there are split cables connected in the X dimension as introduced above. In split cable mode 614, Port A is connected to Port E and Port B is connected to Port D to connect the midplane to the X split cables.
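
The three port pairings described above can be summarised in a small routing table. The sketch below is illustrative only; the port pairings follow the text (normal: A-F and B-C; pass through: C-F; split cable: A-E and B-D) while the data structure and function are assumptions made for the illustration.

# Illustrative model of the three link chip modes described above.
LINK_CHIP_MODES = {
    "normal":       [("A", "F"), ("B", "C")],   # midplane joined to the X/Y/Z cables
    "pass_through": [("C", "F")],               # midplane bypassed entirely
    "split_cable":  [("A", "E"), ("B", "D")],   # midplane joined to the X split cables
}

def routed_port(mode, port):
    # Return the port a signal arriving on `port` is routed to, if any.
    for a, b in LINK_CHIP_MODES[mode]:
        if port == a:
            return b
        if port == b:
            return a
    return None   # port unused in this mode

print(routed_port("pass_through", "C"))   # 'F': data skips over the backup midplane
print(routed_port("normal", "A"))         # 'F'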

[0044] Figure 7 illustrates a set 700 of racks 130 of a massively parallel computer system such as the Blue Gene/L computer system that are arranged into a partition in the X dimension. Each midplane 132 in each rack 130 is an 8x8x8 torus, where the coordinates of the torus are X, Y, and Z. Each rack is arranged as an 8x8x16 torus since the two midplanes of each rack are arranged in the Z dimension. The first rack 710 is rack 0 and has two midplanes R00 712 and R01 714. The remaining racks are similarly numbered R10 through R71. In the illustrated partition, the X cables 720 connect the 8 racks in the X dimension while the Y and Z dimensions wrap around within each rack. The X split cables 730 are shown on the right hand side of the drawing but are not used to configure the partition in this example. The partition shown in Figure 7 is therefore a 64x8x16 torus. The X dimension cables 720 can be seen to connect the racks in the order R0, R1, R3, R5, R7, R6, R4, R2 by following an X cable into a rack and then out of that rack to the next rack. The coordinates of the nodes in the racks shown in Figure 7 would then be assigned as shown in Table 1; an illustrative sketch of this assignment follows the table.
Table 1
Rack Node coordinates (X,Y,Z)
R0 (0,0,0) - (7,7,16)
R1 (8,0,0) - (15,7,16)
R2 (56,0,0) - (63,7,16)
R3 (16,0,0) - (23,7,16)
R4 (48,0,0) - (55,7,16)
R5 (24,0,0) - (31,7,16)
R6 (40,0,0) - (47,7,16)
R7 (32,0,0) - (39,7,16)
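
The X coordinate ranges in Table 1 follow directly from the cable order R0, R1, R3, R5, R7, R6, R4, R2 and the 8-node width of each rack in the X dimension. The short sketch below reproduces them; it is an illustration only and not part of the Blue Gene/L control software.

# Reproduce the X coordinate ranges of Table 1 from the cable order above.
cable_order = ["R0", "R1", "R3", "R5", "R7", "R6", "R4", "R2"]
rack_width_x = 8

for position, rack in enumerate(cable_order):
    x_start = position * rack_width_x
    x_end = x_start + rack_width_x - 1
    print(f"{rack}: X {x_start}-{x_end}")
# R0: X 0-7, R1: X 8-15, R3: X 16-23, R5: X 24-31,
# R7: X 32-39, R6: X 40-47, R4: X 48-55, R2: X 56-63  (matches Table 1)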


[0045] Figure 8 illustrates a set of racks 130 of a massively parallel computer system such as the Blue Gene/L computer system similar to the set shown in Figure 7, except now racks R4 810 and R5 812 are configured as a set of backup racks. The backup racks 810, 812 are configured to the pass through mode 612 shown in Figure 6. The coordinates of the nodes in the racks shown in Figure 8 would then be assigned as shown in Table 2.
Table 2
Rack Node coordinates (X,Y,Z)
R0 (0,0,0) - (7,7,16)
R1 (8,0,0) - (15,7,16)
R2 (40,0,0) - (47,7,16)
R3 (16,0,0) - (23,7,16)
R4 Pass through mode
R5 Pass through mode
R6 (32,0,0) - (39,7,16)
R7 (24,0,0) - (31,7,16)


[0046] Again referring to Figure 8, the backup racks 810, 812 can be configured as a separate partition to be used when the racks are not needed as a backup. This can be done using the X split cables 814 shown in Figure 8. The link chips in R4 810 and R5 812 are configured to be in pass through mode as discussed above; in addition, these link chips are set to the split cable mode 614 shown in Figure 6 and discussed above. The coordinates of the nodes of racks R4 and R5 would then be assigned as shown in Table 3.
Table 3
Rack Node coordinates (X,Y,Z)
R4 (0,0,0) - (7,7,16)
R5 (8,0,0) - (15,7,16)


[0047] When the copy mode is commenced as described further below, the backup racks 810, 812 are configured to the normal mode 610 shown in Figure 6. The nodes in the racks shown in Figure 8 are then assigned temporary coordinates that are not in the range of the original coordinates. The temporary coordinates allow the data in the original midplane to be copied into the backup midplane. An example of the temporary coordinates is shown in Table 4.
Table 4
Rack Node coordinates (X,Y,Z)
R0 Pass through mode
R1 Pass through mode
R2 (0,0,0) - (7,7,16)
R3 (24,0,0) - (31,7,16)
R4 (8,0,0) - (15,7,16)
R5 (16,0,0) - (23,7,16)
R6 Pass through mode
R7 Pass through mode


[0048] When the copy mode is complete, the failed racks R2 and R3 are configured to the pass through mode 612 shown in Figure 6 and the coordinates of the backup racks are then assigned the coordinates that were originally assigned to racks R2 and R3. Therefore, the backup racks now take the place of the original racks R2 and R3 as shown in Table 5.
Table 5
Rack Node coordinates (X,Y,Z)
R0 (0,0,0) - (7,7,16)
R1 (8,0,0) - (15,7,16)
R2 Pass through mode
R3 Pass through mode
R4 (40,0,0) - (47,7,16)
R5 (16,0,0) - (23,7,16)
R6 (32,0,0) - (39,7,16)
R7 (24,0,0) - (31,7,16)


[0049] An example of a fast backup of compute nodes will now be described with reference to Figures 9 through 11. Figure 9 illustrates a simplified representation of a set of midplanes similar to those shown in Figure 8. Figure 9 illustrates a normal mode of operation 900 for a partition 910 set up for fast backup. Four midplanes are shown connected in the X dimension with a single midplane 912 serving as a backup in the manner described above with reference to Figure 8. The backup midplane 912 has its link cards 914 configured in the pass through mode 612 (Figure 6). The normal mode of operation 900 is established by the service node configuring the partition for backup operation, setting the backup rack or racks into the pass through mode as described above. Application jobs can then be scheduled on the partition until there is a need for a fast backup of the nodes in a midplane by copying the data in all the nodes to the backup midplane.

[0050] Figure 10 shows a block diagram that represents the partition 1010 of a massively parallel computer system shown in Figure 9 during the copy mode 1000. When the service node detects a failure in a midplane during execution of a job, or for any other reason wants to back up a midplane, the fast backup mechanism 142 begins a fast backup of all the nodes on the midplane with the copy mode 1000. In copy mode 1000, the application or job running on the partition is suspended. The link card 914 for the backup midplane 912 is placed in the normal mode 610 (Figure 6). The link cards 1018 for the other midplanes in the partition can be placed in the pass through mode to simplify the assignment of temporary coordinates. The midplane nodes are then assigned temporary coordinates that do not match the values of the midplanes in the original partition so that the nodes in the failed midplane 1012 can be copied into the backup midplane. The nodes in the failed midplane 1012 are then instructed to copy their contents to the respective nodes in the backup midplane using the temporary coordinates as shown by the data arrow 1014.

[0051] The nodes copy the entire contents of their memory to the corresponding node in the backup midplane in the manner known in the prior art. The copy can be accomplished by software running from the SRAM that receives the destination node from the fast backup mechanism. The node then sends the data over the torus network. The network hardware (390 Figure 3) on each node receives data over the network and passes the data to the next node if the data is intended for a distant node. The data has a hop count that indicates how many positions in the torus to move. The hop count is decremented by each node and the last node accepts the data. After all the data in the nodes is copied, the partition can then be placed into the backup mode as shown in Figure 11.
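
The hop-count forwarding just described can be illustrated with a minimal sketch, here reduced to a one-dimensional chain of nodes for simplicity; it is not the Blue Gene/L network hardware, and the packet format is an assumption made for the illustration.

# Minimal sketch of hop-count forwarding: each hop moves the packet one
# position along the chain and decrements the count; the node at which
# the count reaches zero accepts the data.
def deliver(packet, chain, start_index):
    index = start_index
    while packet["hops"] > 0:
        index += 1                        # move one position along the torus
        packet["hops"] -= 1
    chain[index].append(packet["data"])   # the last node accepts the data

nodes = [[] for _ in range(8)]            # eight nodes, each with a receive buffer
deliver({"hops": 3, "data": "SDRAM chunk"}, nodes, start_index=2)
print(nodes[5])                           # ['SDRAM chunk'] -- three hops from node 2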

[0052] Figure 11 shows a block diagram that represents the partition of a massively parallel computer system shown in Figure 10 during the backup mode 1100. The partition is placed into the backup mode as shown in Figure 11 by configuring the link cards 1112 of the failed midplane 1114 to the pass through mode 612 (Figure 6). Further, the coordinates 1116 of the failed midplane 1114 are copied to the coordinates 1010 of the backup midplane 912. The job that was suspended on the partition 910 can now be resumed and continue processing without reloading or restarting the job.

[0053] Figure 12 shows a method 1200 for fast backup copying of nodes in a parallel computer system by the fast backup mechanism 142. The method operates to fast backup an original job operating on an original midplane to a backup midplane. First, set up a midplane or rack as a backup (step 1210). Next, suspend all traffic between all nodes executing a job (step 1220). Then perform a fast backup by copying data from nodes in an original midplane to corresponding nodes in a backup midplane (step 1230). Then notify all the nodes in the partition executing the job that they can resume the suspended job and start network traffic again (step 1240).

[0054] Figure 13 shows a method 1210 for setting up one or more midplanes as a backup and represents one possible implementation for step 1210 in Figure 12. First, program the link chips on the backup midplane to the pass through mode (step 1310). Next, program the link chips on the remaining midplanes into a partition to pass node data to adjacent midplanes (step 1320). Then schedule a job or software application to execute on the partition (step 1330). The method is then done.

[0055] Figure 14 shows a method 1230 as one possible implementation for step 1230 in Figure 12. Method 1230 illustrates performing a fast backup of compute nodes by copying all the node data in a parallel computer system from an original or failed midplane or rack of nodes to a backup midplane or rack of nodes. The method would be executed by the fast backup mechanism 142 on the service node 140. First, terminate any jobs running on the backup midplane if there are any (step 1410). Next, program the link chips on the backup midplane to the normal mode to accept data (step 1420). Then assign temporary coordinates to the nodes in the backup midplane (step 1430). Then notify all the nodes in the original or failed midplane to send all data on the node to the corresponding node in the backup midplane (step 1440). Next, program the link chips in the original midplane to the pass through mode (step 1450). Then switch the coordinates in the backup midplane to match the coordinates of the original midplane to configure the backup midplane to take the place of the original midplane (step 1460), and the method is then done.
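
The order of the steps in Figure 14 can be summarised in a short orchestration sketch. The ServiceNode class and its methods below are hypothetical stand-ins that only log what a real service node would do; only the sequence of steps is taken from the method described above.

# Illustrative sketch of the fast backup sequence of Figure 14.
class ServiceNode:
    def __init__(self):
        self.log = []

    def terminate_jobs(self, midplane):                       # step 1410
        self.log.append(f"terminate jobs on {midplane}")

    def set_link_chips(self, midplane, mode):                 # steps 1420 and 1450
        self.log.append(f"set link chips on {midplane} to {mode}")

    def assign_temporary_coordinates(self, midplane):         # step 1430
        self.log.append(f"assign temporary coordinates to {midplane}")

    def copy_all_node_data(self, source, destination):        # step 1440
        self.log.append(f"copy node data {source} -> {destination}")

    def copy_coordinates(self, source, destination):          # step 1460
        self.log.append(f"copy coordinates of {source} to {destination}")

def fast_backup(service_node, original, backup):
    service_node.terminate_jobs(backup)                       # step 1410
    service_node.set_link_chips(backup, "normal")             # step 1420
    service_node.assign_temporary_coordinates(backup)         # step 1430
    service_node.copy_all_node_data(original, backup)         # step 1440
    service_node.set_link_chips(original, "pass_through")     # step 1450
    service_node.copy_coordinates(original, backup)           # step 1460

sn = ServiceNode()
fast_backup(sn, original="R2", backup="R4")
print("\n".join(sn.log))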

[0056] As described above, embodiments provide a method and apparatus for fast backup of a set of nodes in a computer system such as a massively parallel supercomputer system. Embodiments herein can significantly decrease the amount of down time for increased efficiency of the computer system.


Claims

1. A parallel computer system comprising:

a plurality of midplanes, each midplane comprising a plurality of interconnected compute nodes with node data and a plurality of link chips that connect the plurality of midplanes to a set of cables that interconnect the plurality of midplanes;

a fast backup mechanism in a service node of the computer system that sets up a midplane as a backup midplane by setting the plurality of link chips in the backup midplane into a pass through mode that passes data coming to a first link chip in the backup midplane to a second link chip in an adjacent midplane; and

wherein the fast backup mechanism instructs all the nodes in an original midplane to copy all their data to corresponding nodes in the backup midplane using temporary coordinates for the corresponding nodes in the backup midplane.


 
2. A parallel computer system as claimed in claim 1 wherein the fast backup mechanism sets node coordinates in the backup midplane to a set of coordinates corresponding to the original midplane and puts the plurality of link chips in the original midplane into the pass through mode.
 
3. A parallel computer system as claimed in claim 1 wherein the compute nodes in the midplane are interconnected with a torus network to connect each node with its six nearest neighbours.
 
4. A parallel computer system as claimed in claim 3 further comprising cables to connect the midplanes to their six nearest neighbours.
 
5. A parallel computer system as claimed in claim 1 further comprising copy code in static random access memory (SRAM) to copy all the data in the compute nodes of the original midplane to corresponding nodes in the backup midplane.
 
6. A method for fast backup of compute nodes in a parallel computer system, the method comprising the steps of:

setting up one of a plurality of midplanes, each midplane comprising a plurality of interconnected compute nodes with node data and a plurality of link chips that connect the plurality of midplanes to a set of cables that interconnect the plurality of midplanes;

setting up a midplane as a backup midplane and setting the plurality of link chips in the backup midplane into a pass through mode that passes data coming to a first link chip in the backup midplane to a second link chip in an adjacent midplane; and

instructing all of the nodes in an original midplane to copy all their data to corresponding nodes in the backup midplane using temporary coordinates for the corresponding nodes in the backup midplane.


 
7. A method as claimed in claim 6 wherein the step of setting up one of a plurality of midplanes as a backup midplane further comprises the steps of:

programming a plurality of link chips on the backup midplane to the pass through mode;

programming link chips on a plurality of remaining midplanes into a partition to pass node data to adjacent midplanes; and

scheduling a job to execute on the partition.


 
8. A method as claimed in claim 6 wherein the step of instructing further comprises the steps of:

programming the plurality of link chips on the backup midplane to the normal mode to accept data;

assigning temporary coordinates to the nodes in the backup midplane; and

notifying all the nodes in the original midplane to send all data on the node to the corresponding node in the backup midplane.


 
9. A method as claimed in claim 8 further comprising the steps of: programming the link chips in the original midplane to the pass through mode; and switching the coordinates in the backup midplane to match the coordinates of the original midplane to configure the backup midplane to take the place of the original midplane.
 
10. A method as claimed in claim 8 further comprising the step of:

terminating any jobs running on the backup midplane if there are any.


 
11. A method as claimed in claim 8 wherein the step of copying data from the original midplane to the backup midplane is accomplished with copy code located in SRAM of the compute nodes.
 
12. A computer program product loadable into the internal memory of a digital computer, comprising software code portions for performing, when said product is run on a computer, the method as claimed in claims 6 to 11.
 


Ansprüche

1. Parallelcomputersystem, das Folgendes umfasst:

eine Vielzahl von Mittelplatten, wobei jede Mittelplatte eine Vielzahl von miteinander verbundenen Rechenknoten mit Knotendaten und eine Vielzahl von Verbindungs-Chips umfasst, welche die Vielzahl von Mittelplatten mit einer Gruppe von Kabeln verbinden, die wiederum die Vielzahl von Mittelplatten miteinander verbinden;

einen Schnellsicherungsmechanismus in einem Dienstknoten des Computersystems, der eine Mittelplatte als eine Sicherungsmittelplatte einrichtet, indem die Vielzahl von Verbindungs-Chips auf der Sicherungsmittelplatte in eine Durchgangsbetriebsart gesetzt werden, die Daten, welche von einem ersten Verbindungs-Chips auf der Sicherungsmittelplatte kommen, an einen zweiten Verbindungs-Chip auf einer benachbarten Mittelplatte weiterleitet; und

wobei der Schnellsicherungsmechanismus alle Knoten auf einer ursprünglichen Mittelplatte anweist, alle ihre Daten in entsprechende Knoten auf der Sicherungsmittelplatte zu kopieren, wobei vorübergehende Koordinaten für die entsprechenden Knoten auf der Sicherungsmittelplatte verwendet werden.


 
2. Parallelcomputersystem nach Anspruch 1, wobei der Schnellsicherungsmechanismus Knotenkoordinaten auf der Sicherungsmittelplatte auf eine Gruppe von Koordinaten setzt, die der ursprünglichen Mittelplatte entsprechen, und die Vielzahl von Verbindungs-Chips auf der ursprünglichen Mittelplatte in die Durchgangsbetriebsart setzt.
 
3. Parallelcomputersystem nach Anspruch 1, wobei die Rechenknoten auf der Mittelplatte mit einem Torusnetzwerk miteinander verbunden sind, um jeden Knoten mit seinen sechs am nächsten gelegenen Nachbarn zu verbinden.
 
4. Parallelcomputersystem nach Anspruch 3, das des Weiteren Kabel umfasst, um die Mittelplatten mit ihren sechs am nächsten gelegenen Nachbarn zu verbinden.
 
5. Parallelcomputersystem nach Anspruch 1, das des Weiteren Kopiercode im statischen Direktzugriffspeicher (SRAM) umfasst, um alle Daten in den Rechenknoten der ursprünglichen Mittelplatte in die entsprechenden Knoten auf der Sicherungsmittelplatte zu kopieren.
 
6. Verfahren für die schnelle Sicherung von Computerknoten in einem Parallelcomputersystem, wobei das Verfahren die folgenden Schritte umfasst:

Einrichten einer aus einer Vielzahl von Mittelplatten, wobei jede Mittelplatte eine Vielzahl von miteinander verbundenen Rechenknoten mit Knotendaten und eine Vielzahl von Verbindungs-Chips umfasst, welche die Vielzahl von Mittelplatten mit einer Gruppe von Kabeln verbinden, die wiederum die Vielzahl von Mittelplatten miteinander verbinden;

Einrichten einer Mittelplatte als eine Sicherungsmittelplatte und Setzen der Vielzahl von Verbindungs-Chips auf der Sicherungsmittelplatte in eine Durchgangsbetriebsart, die Daten, welche von einem ersten Verbindungs-Chip auf der Sicherungsmittelplatte kommen, an einen zweiten Verbindungs-Chip auf einer benachbarten Mittelplatte weiterleitet; und

Anweisen aller Knoten auf einer ursprünglichen Mittelplatte, alle ihre Daten in entsprechende Knoten auf der Sicherungsmittelplatte zu kopieren, wobei vorübergehende Koordinaten für die entsprechenden Knoten auf der Sicherungsmittelplatte verwendet werden.


 
7. Verfahren nach Anspruch 6, wobei der Schritt des Einrichtens einer aus einer Vielzahl von Mittelplatten als eine Sicherungsmittelplatte weiter die folgenden Schritte umfasst:

Programmieren einer Vielzahl von Verbindungs-Chips auf der Sicherungsmittelplatte für die Durchgangsbetriebsart;

Programmieren von Verbindungs-Chips auf einer Vielzahl von verbleibenden Mittelplatten als eine Partition, um Knotendaten an benachbarte Mittelplatten weiterzuleiten; und

Einplanen eines Auftrags, der auf der Partition ausgeführt werden soll.


 
8. Verfahren nach Anspruch 6, wobei der Schritt des Anweisens weiter die folgenden Schritte umfasst:

Programmieren der Vielzahl von Verbindungs-Chips auf der Sicherungsmittelplatte für die Normalbetriebsart, um Daten anzunehmen;

Zuweisen vorübergehender Koordinaten zu den Knoten auf der Sicherungsmittelplatte; und

Benachrichtigen aller Knoten auf der ursprünglichen Mittelplatte, alle Daten in dem Knoten an den entsprechenden Knoten auf der Sicherungsmittelplatte zu senden.


 
9. Verfahren nach Anspruch 8, das des Weiteren die folgenden Schritte umfasst: Programmieren der Verbindungs-Chips auf der ursprünglichen Mittelplatte für die Durchgangsbetriebsart; und Wechseln der Koordinaten auf der Sicherungsmittelplatte, sodass sie mit den Koordinaten der ursprünglichen Mittelplatte übereinstimmen, um die Sicherungsmittelplatte so zu konfigurieren, dass sie an die Stelle der ursprünglichen Mittelplatte tritt.
 
10. Verfahren nach Anspruch 8, das ferner den folgenden Schritt umfasst:

Beenden aller etwaig vorhandenen Aufträge, die auf der Sicherungsmittelplatte ausgeführt werden.


 
11. Verfahren nach Anspruch 8, wobei der Schritt des Kopierens von Daten von der ursprünglichen Mittelplatte auf die Sicherungsmittelplatte mit Kopiercode erfolgt, der sich im SRAM der Rechenknoten befindet.
 
12. Computerprogrammprodukt, das in den internen Speicher eines digitalen Computers geladen werden kann, welches Softwarecodeteile umfasst, um bei Ausführung des Produkts auf einem Computer die Erfindung gemäß den Ansprüchen 6 bis 11 auszuführen.
 


Revendications

1. Système informatique parallèle, comprenant :

une pluralité de fonds de paniers centraux, chaque fond de panier central comprenant une pluralité de noeuds de calcul reliés mutuellement avec des données de noeuds et une pluralité de puces de liaison qui relient la pluralité de fonds de paniers centraux à un ensemble de câbles qui relient mutuellement la pluralité de fonds de paniers centraux,

un mécanisme de secours rapide dans un noeud de service du système informatique qui initialise un fond de panier central en tant que fond de panier central de secours en établissant la pluralité de puces de liaison dans le fond de panier central de secours dans un mode de transfert qui transmet des données arrivant au niveau d'une première puce de liaison dans le fond de panier central à une deuxième puce de liaison dans un fond de panier central adjacent, et

dans lequel le mécanisme de secours rapide ordonne à tous les noeuds dans un fond de panier central d'origine de copier toutes leurs données vers des noeuds correspondants dans le fond de panier central de secours en utilisant des coordonnées temporaires pour les noeuds correspondants dans le fond de panier central de secours.


 
2. Système informatique parallèle selon la revendication 1, dans lequel le mécanisme de secours rapide établit les coordonnées des noeuds dans le fond de panier central de secours à un ensemble de coordonnées correspondant au fond de panier central d'origine et place la pluralité de puces de liaison dans le fond de panier central d'origine dans un mode de transfert.
 
3. Système informatique parallèle selon la revendication 1, dans lequel les noeuds de calcul dans le fond de panier central sont reliés mutuellement avec un réseau de type tore pour relier chaque noeud à ses six voisins les plus proches.
 
4. Système informatique parallèle selon la revendication 3, comprenant en outre des câbles pour relier les fonds de paniers centraux à leurs six voisins les plus proches.
 
5. Système informatique parallèle selon la revendication 1, comprenant en outre un code de copie dans une mémoire vive statique (SRAM) pour copier toutes les données dans les noeuds de calcul du fond de panier d'origine vers des noeuds correspondants dans le fond de panier central de secours.
 
6. Procédé de secours rapide de noeuds de calcul dans un système informatique parallèle, le procédé comprenant les étapes consistant à :

initialiser un fond de panier central parmi une pluralité de fonds de paniers centraux, chaque fond de panier central comprenant une pluralité de noeuds de calcul reliés mutuellement avec des données de noeuds et une pluralité de puces de liaison qui relient la pluralité de fonds de paniers centraux à un ensemble de câbles qui relient mutuellement la pluralité de fonds de paniers centraux,

initialiser un fond de panier central en tant que fond de panier central de secours et établir la pluralité de puces de liaison dans le fond de panier central de secours dans un mode de transfert qui transmet des données arrivant au niveau d'une première puce de liaison dans le fond de panier central à une deuxième puce de liaison dans un fond de panier central adjacent, et

ordonner à tous les noeuds dans un fond de panier central d'origine de copier toutes leurs données vers des noeuds correspondants dans le fond de panier central de secours en utilisant des coordonnées temporaires pour les noeuds correspondants dans le fond de panier central de secours.


 
7. Procédé selon la revendication 6, dans lequel l'étape consistant à initialiser un fond de panier central parmi une pluralité de fonds de paniers centraux en tant que fond de panier central de secours comprend en outre les étapes consistant à :

programmer une pluralité de puces de liaison sur le fond de panier central de secours dans le mode de transfert,

programmer des puces de liaison sur une pluralité de fonds de paniers centraux restants dans une partition pour transmettre des données de noeuds à des fonds de paniers centraux adjacents, et

planifier un travail à exécuter sur la partition.


 
8. Procédé selon la revendication 6, dans lequel l'étape consistant à ordonner comprend en outre les étapes consistant à :

programmer la pluralité de puces de liaison sur le fond de panier central de secours dans le mode normal pour accepter des données,

attribuer des coordonnées temporaires aux noeuds dans le fond de panier central de secours, et

notifier à tous les noeuds dans le fond de panier central d'origine d'envoyer toutes les données sur le noeud au noeud correspondant dans le fond de panier central de secours.


 
9. Procédé selon la revendication 8, comprenant en outre les étapes consistant à:

programmer les puces de liaison dans le fond de panier central d'origine dans le mode de transfert, et basculer les coordonnées dans le fond de panier central de secours pour correspondre aux coordonnées du fond de panier central d'origine afin de configurer le fond de panier central de secours pour prendre la place du fond de panier central d'origine.


 
10. Procédé selon la revendication 8, comprenant en outre l'étape consistant à :

mettre fin à tous les travaux en cours d'exécution sur le fond de panier central s'il en existe.


 
11. Procédé selon la revendication 8, dans lequel l'étape consistant à copier des données du fond de panier central d'origine vers le fond de panier central de secours est accomplie avec un code de copie situé dans la mémoire SRAM des noeuds de calcul.
 
12. Produit de programme informatique chargeable dans la mémoire interne d'un ordinateur numérique, comprenant des parties de code logiciel destinées à fonctionner, lorsque ledit produit est exécuté sur un ordinateur, pour réaliser l'invention selon les revendications 6 à 11.
 




Drawing





























Cited references

REFERENCES CITED IN THE DESCRIPTION



This list of references cited by the applicant is for the reader's convenience only. It does not form part of the European patent document. Even though great care has been taken in compiling the references, errors or omissions cannot be excluded and the EPO disclaims all liability in this regard.

Patent documents cited in the description

• US 20040153754 A1 [0006]