


Title: Interface internet protocol fragmentation of large broadcast packets in an environment with an unaccommodating maximum transfer unit

Abstract: In a multinode data processing system in which the nodes communicate with one another through communication adapters coupled to a switch or network, a method is provided for using the Internet Protocol (IP) for transmitting large broadcast data packets without incurring the overhead normally associated with packet fragmentation. By adding an Internet Protocol (IP) header as the first header in every packet fragment in the fragmentation process, fragmented packets are able to be assembled in the IP layer without intervention at the adapter interface layer.

Patent Number: 7,522,597 Issued on 04/21/2009 to Chang,   et al.


Inventors: Chang; Fu Chung (Rhinebeck, NY), Chaudhary; Piyush (Fishkill, NY), Doxtader; Jennifer M. (Poughkeepsie, NY)
Assignee: International Business Machines Corporation (Armonk, NY)
Appl. No.: 10/981,097
Filed: November 4, 2004


Related U.S. Patent Documents

Application Number    Filing Date    Patent Number    Issue Date
60605659              Aug., 2004

Current U.S. Class: 370/390 ; 370/432
Current International Class: H04L 12/56 (20060101)
Field of Search: 370/389,390,432


References Cited

U.S. Patent Documents
5491802 February 1996 Thompson et al.
6272518 August 2001 Blazo et al.
6298428 October 2001 Munroe et al.
6564267 May 2003 Lindsay
6654818 November 2003 Thurber
6721806 April 2004 Boyd et al.
6735647 May 2004 Boyd et al.
6748499 June 2004 Beukema et al.
6788704 September 2004 Lindsay
2002/0087720 July 2002 Davis et al.
2002/0099879 July 2002 Bayer et al.
2002/0152328 October 2002 Kagan et al.
2003/0018787 January 2003 Neal et al.
2003/0043805 March 2003 Graham et al.
2003/0061379 March 2003 Craddock et al.
2003/0061417 March 2003 Craddock et al.
2003/0065775 April 2003 Aggarwal et al.
2003/0195983 October 2003 Krause
2004/0003141 January 2004 Matters et al.
2004/0030806 February 2004 Pandya
2004/0034718 February 2004 Goldenberg et al.
2004/0049580 March 2004 Boyd et al.
2004/0049600 March 2004 Boyd et al.
2004/0049601 March 2004 Boyd et al.
2004/0049603 March 2004 Boyd et al.
2004/0202189 October 2004 Arndt et al.
2005/0147126 July 2005 Qiu et al.
2006/0034176 February 2006 Lindsay
Primary Examiner: Nguyen; Brian D
Attorney, Agent or Firm: Monteleone, Esq.; Geraldine D.; Cutter, Esq.; Lawrence D.; Heslin Rothenberg Farley & Mesiti P.C.

Parent Case Text



This application claims priority based upon Provisional patent application having Provisional Ser. No. 60/605,659 filed on Aug. 30, 2004.
Claims



The invention claimed is:

1. In a multinode data processing system in which the nodes communicate with one another through communication adapters coupled to a switch or network, a method for transmitting a large data packet comprises the steps of: dividing said large packet into a plurality of fragments and providing each fragment with a first header, a second header and a third header, said headers providing packet handling information for a plurality of transmission protocols; modifying said first header associated with each of said fragments to indicate that said large data packet has been modified for transmission into fragments which mimic the transmission of a series of packets segmented at an upper layer protocol level; transmitting said fragments with said modified headers through said switch or network to at least two receiving adapters which employ said third header to ensure that the large data packet is reassembled.

2. The method of claim 1 in which said modifying alters fields in said first header which indicate that the fragment is part of a larger message and then sets an offset into said large data packet, whereby the large data packet can be reassembled.

3. The method of claim 1 in which said first header is an Internet protocol header.

4. The method of claim 1 in which said second header is a transaction protocol header.

5. The method of claim 1 in which said third header is an adapter interface header.

6. A method for transferring data packets in a multinode data processing system in which the nodes communicate with one another through communication adapters coupled to a switch or network, said method comprising the steps of: fragmenting at least one of said data packets into a plurality of smaller data packets; providing adjusted packet header information to each of said plurality of smaller packets sufficient for the transfer of said smaller data packets through at least one of said communications adapters; and transferring said plurality of smaller data packets through at least one of said communications adapters having a maximum transfer packet size that is smaller than a packet size specifiable at an operating system level to at least two other communication adapters.
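The fragmentation recited in claims 1 through 3 can be illustrated with a minimal sketch. It assumes that the "first header" is a standard IPv4 header, that the field marking a fragment as part of a larger message is the IPv4 More Fragments flag, and that the offset is the IPv4 fragment offset counted in 8-byte units. The transaction protocol header and the adapter interface header of claims 4 and 5 are omitted, and the names fragment, reassemble and build_ip_header are hypothetical and do not appear in the patent.

```python
import struct

IP_HEADER_LEN = 20        # IPv4 header without options, in bytes
MORE_FRAGMENTS = 0x2000   # MF bit within the 16-bit flags/offset field
OFFSET_MASK = 0x1FFF      # low 13 bits hold the offset in 8-byte units
HEADER_FORMAT = "!BBHHHBBH4s4s"


def build_ip_header(total_len, ident, flags_offset, src, dst, proto=17):
    """Pack a minimal IPv4 header (checksum left at zero for brevity)."""
    return struct.pack(HEADER_FORMAT, 0x45, 0, total_len, ident,
                       flags_offset, 64, proto, 0, src, dst)


def fragment(payload, mtu, ident, src, dst):
    """Split payload into fragments that each begin with an IP header."""
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - IP_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for offset in range(0, len(payload), max_data):
        chunk = payload[offset:offset + max_data]
        more = offset + len(chunk) < len(payload)
        flags_offset = (MORE_FRAGMENTS if more else 0) | (offset // 8)
        header = build_ip_header(IP_HEADER_LEN + len(chunk), ident,
                                 flags_offset, src, dst)
        fragments.append(header + chunk)
    return fragments


def reassemble(fragments):
    """Rebuild the payload from the IP headers alone, in any arrival order."""
    pieces = {}
    for frag in fragments:
        _, _, total_len, _, flags_offset, *_ = struct.unpack(
            HEADER_FORMAT, frag[:IP_HEADER_LEN])
        pieces[(flags_offset & OFFSET_MASK) * 8] = frag[IP_HEADER_LEN:total_len]
    return b"".join(pieces[off] for off in sorted(pieces))
```

Because each fragment carries its own offset, the receiver in this sketch needs no state from the adapter interface layer to put the pieces back together, which is the property the abstract attributes to placing the IP header first.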
Description



BACKGROUND OF THE INVENTION

The present invention is generally directed to the transfer of information residing in one computer system or on one data processing node to another data processing node. The present invention is more particularly directed to data transfers in which data is transferred by the network adapter directly into the target user buffer in the address space of the receiving system or node from the address space of the sending system or node. This is referred to as remote direct memory access (RDMA). Even more particularly, the present invention is directed to systems and methods for carrying out RDMA without the automatic assumption that data which has been sent is data which has also been received. This assumption is referred to as reliable transport, but it should really be thought of as the "reliable transport assumption." As used herein, the concept of reliable transport refers to a communication modality which is based upon a "send and forget" model for Upper Layer Protocols (ULPs) running on the data processing nodes themselves, as opposed to adapter operations. Correspondingly, the concept of unreliable transport refers to a communication modality which is not "send and forget" with respect to the ULP. Also, as used herein, the term "datagram" refers to a message packet that is self-contained as to both content and essential heading descriptions and which is not guaranteed to arrive at any given time.

For a proper understanding of the contributions made to the data communication arts by the present invention it should be fully appreciated that the present invention is designed to operate not only in an environment which employs DMA data transfers, but that this data transfer occurs across a network, that is, remotely. Accordingly, the context of RDMA data transfer is an important aspect for understanding the operation and benefits of the present invention. In the RDMA environment, the programming model allows the end user or middleware user to issue a read or write command (or request) directed to specific virtual memory locations defined at both a sending node and at a remote data processing node. The node issuing the command is called (for the purposes of the RDMA transaction) the master node; the other node is referred to as the slave node. For purposes of better understanding the advantages offered by the present invention, it is noted here that the existing RDMA state of the art paradigm includes no functionality for referencing more than one remote node. It is also understood that the RDMA model assumes that there is no software in the host processor at the slave end of the transaction which operates to affect RDMA data transfers. There are no intermediate packet arrival interrupts, nor is there any opportunity for even notifying the master side that a certain portion of the RDMA data sent has now been received by the target. There is no mechanism in existing RDMA transport mechanisms to accept out-of-order packet delivery.

An example of the existing state of the art in RDMA technology is seen in the Infiniband architecture (also referred to as IB).

The state of the art RDMA paradigm also includes the limitation that data packets sent over the communications fabric are received in the same order as they were sent since they assume the underlying network transport to be reliable for RDMA to function correctly. This means that transmitted packets can easily accumulate in the sending side network communications adapters waiting for acknowledgment. This behavior has the tendency to create situations in which, at any given time, there are a large number of "packets in flight" that are buffered at the sending side network adapter waiting to be acknowledged. This tends to bog down adapter operation and produces its own form of bandwidth limiting effect in addition to the bandwidth limiting effect caused by the fact that the source and destination nodes are constrained by having all of the packets pass in order through a single communications path. In addition, adapter design itself is unnecessarily complicated since this paradigm requires the buffering of unacknowledged in-flight packets.

The DMA and RDMA environments are essentially hardware environments. This provides advantages but it also entails some risk and limitations. Since the RDMA function is provided in hardware, RDMA data transfers possess significant advantages in terms of data transfer rates and, as with any DMA operation, data transfer workload is offloaded from central processing units (CPUs) at both ends of the transfer. RDMA also helps reduce the load on the memory subsystem. Furthermore, the conventional RDMA model is based upon the assumption that data packets are received in the order that they are sent. Just as importantly, the "send and forget" RDMA model (RDMA with the underlying reliable network transport assumption) unnecessarily limits bandwidth and precludes the use of many other features and functions such as the efficient striping of multiple packets across a plurality of paths. These features also include data striping, broadcasting, multicasting, third party RDMA operations, conditional RDMA operations, half RDMA and half FIFO operations, safe and efficient failover operations, and "lazy" deregistration. None of these functions can be carried out as efficiently within the existing state of the art RDMA "send and forget" paradigm as they are herein.

The RDMA feature is also referred to in the art as "memory semantics" for communication across a cluster network, or as "hardware put/get" or as "remote read/write" or as "Remote Direct Memory Access (RDMA)."

It should also be understood that the typical environment in which the present invention is employed is one in which a plurality of data processing nodes communicate with one another through a switch, across a network, or through some other form of communication fabric. In the present description, these terms are used essentially synonymously since the only requirement imposed on these devices is the ability to transfer data from source to destination, as defined in a data packet passing through the switch. Additionally the typical environment for the operation of the present invention includes communication adapters coupled between the data processing nodes and the switch (network, fabric). It is also noted that while a node contains at least one central processing unit (CPU), it may contain a plurality of such units. In data processing systems in the pSeries line of products currently offered by the assignee of the present invention a node possibly contains up to thirty-two CPUs on Power4 based systems and up to sixty-four CPUs on Power5 based systems. (Power4 and Power5 are microprocessor systems offered by the assignee of the present invention). To ensure good balance between computational and communication capacity, nodes are typically equipped with one RDMA capable network adapter for every four CPUs. Each node, however, possesses its own address space. That is, no global shared memory is assumed to exist for access from across the entire cluster. This address space includes random access memory (RAM) and larger scale, but slower external direct access storage devices (DASD) typically deployed in the form of rotating magnetic disk media which works with the CPUs to provide a virtual address space in accordance with well known memory management principles. Other nonvolatile storage mechanisms such as tape are also typically employed in data processing systems as well.

The use of Direct Memory Access (DMA) technology provides an extremely useful mechanism for reducing CPU (processor) workload in the management of memory operations. Workload that would normally have to be processed by the CPU is handled instead by the DMA engine. However, the use of DMA technology has been limited by the need for tight hardware controls and coordination of memory operations. The tight coupling between memory operations and CPU operations poses some challenges, however, when the data processing system comprises a plurality of processing nodes that communicate with one another over a network. These challenges include the need for the sending side to have awareness of remote address spaces, multiple protection domains, locked down memory requirements (also called pinning), notification, striping and recovery models. The present invention is directed to a mechanism for reliable RDMA protocol over a possibly unreliable network transport model.

If one wishes to provide the ability to perform reliable RDMA transport operations over a possibly unreliable underlying network transport path, there are many important issues that should be addressed. For example, how does one accomplish efficient data striping over multiple network interfaces available on a node by using RDMA? How does one provide an efficient notification mechanism on either end (master and slave) regarding the completion of RDMA operations? How would one define an RDMA interface that lends itself to efficient implementation? How does one design a recovery model for RDMA operations (in the event when a single network interface exists and in the event when multiple network interfaces exist)? How does one implement an efficient third party transfer model using RDMA for DLMs (Distributed Lock Managers) and other parallel subsystems? How does one implement an efficient resource management model for RDMA resources? How does one design a lazy deregistration model for efficient implementation of the management of the registered memory for RDMA? The answers to these questions and to other related problems, that should be addressed as part of a complete, overall RDMA model, are presented herein.

As pointed out above, prior art RDMA models (such as Infiniband referred to above) do not tolerate receipt of packets in other than their order of transmission. In such systems, an RDMA message containing data written to or read from one node to another node is segmented into multiple packets and transmitted across a network between the two nodes. The size of data blocks which are being transferred, together with the packet size supported by the network or fabric, are the driving force for the partitioning of the data into multiple packets. In short, the need to transmit multiple packets as the result of a single read or write request is important given the constraints and state of the art of existing communication network fabrics. Furthermore, at another level, it is advantageous in the method of the present invention to divide the transfer into several independent multi-packet segments to enable striping across multiple paths in the communication fabric. At the node receiving the message (the receiving node), the packets are then placed in a buffer in the order received and the data payload is extracted from the packets and is assembled directly into the memory of the receiving node. The existing state of the art mechanisms are built on the assumption that the receipt of packets occurs in the same order in which they were transmitted. If this assumption is not true, then the communication transport protocols could mistake the earlier arriving packets as being the earlier transmitted packets, even though earlier arriving packets might actually have been transmitted relatively late in the cycle. If a packet was received in a different order than it was transmitted, serious data integrity problems could result. This occurs, for example, if a packet containing data that is intended to be written to a higher range of addresses of a memory, is received prior to another packet containing data that is intended to be written to a lower range of addresses. 
If the reversed order of delivery went undetected, the data intended for the higher range of addresses could be written to the lower range of addresses, and vice versa, as well. In addition, in existing RDMA schemes, a packet belonging to a current more recently initiated operation could be mistaken for one belonging to an earlier operation that is about to finish.
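The data integrity problem described above can be made concrete with a short sketch. It assumes, purely for illustration, that each packet is a pair of a byte offset into the target buffer and a payload; the function names place_by_arrival and place_by_offset are hypothetical. The first function mimics a receiver that trusts arrival order, and the second writes each payload at the address its own offset names.

```python
def place_by_arrival(packets):
    """Assemble payloads in the order they arrived (the in-order assumption)."""
    return b"".join(payload for _, payload in packets)


def place_by_offset(packets, total_len):
    """Write each payload at its own offset, independent of arrival order."""
    buf = bytearray(total_len)
    for offset, payload in packets:
        buf[offset:offset + len(payload)] = payload
    return bytes(buf)


# The packet for the higher address range arrives first.
arrived = [(4, b"EFGH"), (0, b"ABCD"), (8, b"IJ")]
print(place_by_arrival(arrived))       # higher-range data lands at the low addresses
print(place_by_offset(arrived, 10))    # data lands where it was intended
```

In this sketch the offset carried by each packet is what removes the dependence on delivery order; it says nothing about how lost or duplicated packets would be detected.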

Accordingly, prior art RDMA schemes focused on enhancing network transport function to guarantee reliable delivery of packets across the network. With reliable datagram and reliably connected transport mechanisms such as this, the packets of a message are assured of arriving in the same order in which they are transmitted, thus avoiding the serious data integrity problems which could otherwise result. The present invention provides a method to overcome this dependence on reliable transport and on the requirement for in-order delivery of packets, and is implemented over an unreliable datagram network transport.

The prior art "reliably connected and reliable datagram" RDMA model also has many other drawbacks. Transport of message packets or "datagrams" between the sending and receiving nodes is limited to a single communication path over the network that is selected prior to beginning data transmission from one node to the other. In addition, the reliable delivery model requires that no more than a few packets (equal to the buffering capability on the sending side network adapter) be outstanding at any one time. Thus, in order to prevent packets from being received out of transmission order, transactions in existing RDMA technologies have to be assigned small time-out values, so that a time-out is forced to occur unless the expected action (namely, the receipt of an acknowledgment of the packet from the receiver to the sender) occurs within an undesirably short period of time. All of these restrictions impact the effective bandwidth that is apparent to a node for the transmission of RDMA messages across the network. The present invention provides solutions to all of these problems.

SUMMARY OF THE INVENTION

In accordance with a preferred embodiment of the present invention a mechanism is provided for the transfer of data from the memory space of one data processing node to the memory space of one or more other data processing nodes. In particular, the present invention provides a data transfer structure and mechanism in which the data is transferred in at least one and typically in many packets which are not constrained to arrive at the destination node in any given order. The presence of the potentially out-of-order aspect provides the ability to structure a number of other transfer modalities and to provide a number of ancillary advantages all of which are described below under their respective headings.

In a first example of these additional data transfer modalities, it is possible to provide transfer modalities which are not processed symmetrically on both sides of the transfer. For example, one side may operate in a standard mode where data is transferred out of a FIFO queue while the other side operates in a remote DMA fashion.

In accordance with this first example there is provided a method for performing a write operation from a source node to a destination node, said method comprising the steps of: transferring said data via a DMA operation from said source node to a first communications adapter, coupled to said source node; transferring said data via a network from said first communications adapter to a second communications adapter coupled to said destination node; and transferring said data into a storage queue in said destination node wherein said data is subject to subsequent transfer to specific target memory locations within said destination node under program control in said second node.

In further accordance with this first example there is provided a method for performing a write operation from a source node to a destination node, said method comprising the steps of: transferring said data into a storage queue in said source node wherein said data is subject to subsequent transfer to a first communications adapter coupled to said source node under program control in said source node; transferring said data via a network from said first communications adapter to a second communications adapter coupled to said destination node; and transferring said data via a DMA operation from said second communications adapter to specific target memory locations within said destination node.

In accordance with this first example there is provided a method for performing a read operation initiated by a destination node for data residing on a source node, said method comprising the steps of: transferring said data via a DMA operation from said source node to a first communications adapter, coupled to said source node; transferring said data via a network from said first communications adapter to a second communications adapter coupled to said destination node; and transferring said data into a storage queue in said destination node wherein said data is subject to subsequent transfer to specific target memory locations within said destination node under program control in said second node.

In still further accordance with this first example there is provided a method for performing a read operation initiated by a destination node for data residing on a source node, said method comprising the steps of: transferring said data into a storage queue in said source node wherein said data is subject to subsequent transfer to a first communications adapter coupled to said source node under program control in said source node; transferring said data via a network from said first communications adapter to a second communications adapter coupled to said destination node; and transferring said data via a DMA operation from said second communications adapter to specific target memory locations within said destination node.

In a second example of the additional transfer modalities provided, it is noted that insensitivity to out-of-order data arrival makes it possible to transfer multiple data packets over a multiplicity of paths thus rendering it possible to engage in the rapid transfer of data over parallel paths.

In accordance with this second example there is provided a method for data transport from a source node to at least one destination node, said method comprising the step of: transferring said data, in the form of a plurality of packets, from said source node to said at least one destination node wherein said transfer is via remote direct memory access from specific locations within said source memory to specific target locations within destination node memory locations and wherein said packets traverse multiple paths from said source node to said destination node.
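As a rough illustration of this second example, the sketch below distributes the packets of one transfer across several paths. Round-robin assignment is an assumption made here for simplicity and is not stated in the text; the function name stripe and the use of plain list indices to stand for paths are likewise hypothetical.

```python
def stripe(packets, num_paths):
    """Assign packets to paths in round-robin order; returns one list per path."""
    if num_paths < 1:
        raise ValueError("at least one path is required")
    paths = [[] for _ in range(num_paths)]
    for index, packet in enumerate(packets):
        paths[index % num_paths].append(packet)
    return paths


# Seven packets, identified here only by sequence number, over three paths.
for path_id, queue in enumerate(stripe(list(range(7)), 3)):
    print(path_id, queue)
```

Since each path drains at its own rate, packets striped this way may reach the destination in a different order than they were sent, which is why this modality depends on the insensitivity to out-of-order arrival noted above.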

In a third example, out-of-order DMA transfers render it possible to provide RDMA operations in which initiation and control of the transfer is provided by a third party data processing node which is neither the data source nor the data sink. Another feature provided by the underlying structure herein is the ability to transfer data from a source node to a plurality of other nodes in either a broadcast or multicast fashion. Yet another feature along these same lines is the ability to condition the transfer of data on the occurrence of subsequent events.

In accordance with a broadcast example there is provided a method for data transport, in a network of at least three data processing nodes, from a source node to multiple destination nodes, said method comprising the step of: transferring said data from said source node to a plurality of destination nodes wherein said transfer is via remote direct memory access operation from specific locations within source node memory to specific target locations within destination node memory locations.

In accordance with a multicast example there is provided a method for data transport, in a network of at least three data processing nodes, from a source node to multiple destination nodes, said method comprising the step of: transferring said data from said source node to preselected ones of a plurality of destination nodes wherein said transfer is via remote direct memory access operation from specific locations within source node memory to specific target locations within destination node memory locations.

In accordance with a third party transfer example there is provided a method for data transport, in a network of at least three data processing nodes, from a source node to at least one destination node, said method comprising the step of: transferring said data from said source node to at least one destination node wherein said transfer is via remote direct memory access operation from specific locations within source node memory to specific target locations within destination memory locations and wherein said transfer is initiated at a node which is neither said source node nor said at least one destination node.

In accordance with a conditional transfer multicast example there is provided a method for data transport, in a network of at least three data processing nodes, from a source node to at least one destination node, said method comprising the step of: transferring said data from said source node to at least one destination node wherein said transfer is via remote direct memory access operation from specific locations within said source node memory to specific target locations within destination node memory locations and wherein said transfer is conditioned upon one or more events occurring in either said source node or in said destination node.

In a fourth example, the structure of the remote DMA provided herein permits the earlier processing of interrupts thus allowing the CPU to operate more efficiently by focusing on other tasks.

In accordance with the fourth example embodiment there is provided a method for data transport from a source node to at least one destination node, said method comprising the steps of: transferring said data, in the form of a plurality of packets, from said source node to said at least one destination node wherein said transfer is via remote direct memory access from specific locations within said source memory to specific target locations within destination node memory locations and wherein said transfer path includes communication adapters coupled to said source and destination nodes and wherein said destination side adapter issues an interrupt indicating completion prior to transfer of data into said specific target locations within said destination node memory locations.

In a fifth example, a process and system provide a snapshot interface in RDMA Operations.

In a sixth example, a process and system are provided for dealing with failover mechanisms in RDMA Operations.

In a seventh example, a process and system are provided for structuring and handling RDMA server global TCE tables.

In an eighth example, a process and system are provided for the interface Internet Protocol fragmentation of large broadcast packets.

In a ninth example, a process and system are provided for "lazy" deregistration of user virtual machine to adapter protocol virtual offsets.

Accordingly, it is an object of the present invention to provide a model for RDMA in which the transfer of messages avoids CPU copies on the send and receive side and which reduces protocol processing overhead.

It is also an object of the present invention to permit jobs running on one node to use the maximum possible portion of the available physical memory for RDMA purposes.

It is a further object of the present invention to provide zero-copy replacement functionality.

It is a still further object of the present invention to provide RDMA functionality in those circumstances where it is particularly appropriate in terms of system resources and packet size.

It is a still further object of the present invention to allow users the ability to disable RDMA functionality through the use of job execution environment parameters.

It is another object of the present invention to keep the design for the adapter as simple as possible.

It is yet another object of the present invention to provide a mechanism in which almost all of the error handling functionality is outside the mainline performance critical path.

It is still another object of the present invention to provide a protocol which guarantees "at most once" delivery of an RDMA message.

It is yet another object of the present invention to minimize the performance and design impact on the other transport models that coexist with RDMA.

It is yet another object of the present invention to provide additional flexibility in the transfer of data packets within the RDMA paradigm.

It is still another object of the present invention to provide a mechanism for RDMA transfer of data packets in which packets are broadcast to a plurality of destinations.

It is also another object of the present invention to provide a mechanism for RDMA transfer of data packets in a multicast modality.

It is a further object of the present invention to provide a mechanism for third party transfer of data packets via RDMA.

It is a still further object of the present invention to provide a mechanism for the conditional transfer of data packets via RDMA.

It is a further object of the present invention to provide a mechanism in which it is possible to improve transmission bandwidth by taking advantage of the fact that the transport protocol now permits data packets to be transmitted across multiple paths at the same time.

It is another object of the present invention to provide efficient striping across multiple interfaces and failover mechanisms for use in RDMA data transfer operations.

It is yet another object of the present invention to provide improved optimistic methods for deallocating RDMA enabled memory resources following the end of a data transfer.

It is a still further object of the present invention to provide a mechanism for the transfer of data packets to receiving hardware without the need for software intervention or processing intermediate packet arrival interrupts on either the slave side or on the master side of the transaction.

Lastly, but not limited hereto, it is an object of the present invention to improve the flexibility, efficiency and speed of data packet transfers made directly from the memory address space of one data processing unit to the memory address space of one or more other data processing units.

The recitation herein of a list of desirable objects which are met by various embodiments of the present invention is not meant to imply or suggest that any or all of these objects are present as essential features, either individually or collectively, in the most general embodiment of the present invention or in any of its more specific embodiments.

DESCRIPTION OF THE DRAWINGS

The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of practice, together with the further objects and advantages thereof, may best be understood by reference to the following description taken in connection with the accompanying drawings in which:

FIG. 1 is a block diagram illustrating the overall concept of Remote Direct Memory Access between nodes in a data processing system;

FIG. 2 is a block diagram illustrating a software layering architecture usable in conjunction with the present invention;

FIG. 3 is a block diagram illustrating the steps in a process for a RDMA write operation over a possibly unreliable network;

FIG. 4 is a block diagram illustrating the steps in a process for a RDMA read operation over a possibly unreliable network;

FIG. 5 is a block diagram illustrating the problem that packets sent in a certain order may not actually arrive in the same order and that they may in fact travel via multiple paths;

FIG. 6 is a block diagram similar to FIG. 3 but more particularly illustrating a half-send RDMA write operation over a possibly unreliable network;

FIG. 7 is a block diagram similar to FIG. 3 but more particularly illustrating a half-receive RDMA write operation over a possibly unreliable network;

FIG. 8 is a block diagram similar to FIG. 3 but more particularly illustrating a half-send RDMA read operation over a possibly unreliable network;

FIG. 9 is a block diagram similar to FIG. 3 but more particularly illustrating a half-receive RDMA read operation over a possibly unreliable network;

FIG. 10 is a block diagram illustrating the use of broadcast and/or multicast operational modes;

FIG. 11 is a block diagram illustrating address mapping used for RDMA operations in the present invention;

FIG. 12 is a flow chart illustrating possible adapter operations involved in handling a request from the CPU to either transmit a data packet or begin an RDMA transfer;

FIG. 13 is a timing diagram illustrating striping across multiple paths;

FIG. 14 is a block diagram illustrating the exchange of information that occurs in a third party RDMA operation;

FIG. 15 is a block diagram illustrating the organization of Translation Control Entry (TCE) tables for RDMA and the protection domains on each node of a system;

FIG. 16 is a block diagram illustrating key protection structures in the adapter and the important fields in each RDMA packet;

FIG. 17 is a block diagram illustrating how the setup of shared tables across multiple adapters on a node allows for simple striping models;

FIG. 18 is a block diagram illustrating how the shared translation setup per job enables a task in a parallel job;

FIG. 19 is a block diagram illustrating the structure and use of the snapshot capabilities of the present invention;

FIG. 20 is a block diagram illustrating the fragmentation of a large broadcast packet into smaller packets for transmission via the FIFO mode and in which the IP header is adjusted for reassembly upon receipt;

FIG. 21 illustrates receive side processing which removes the interface header and delivers smaller packets to the TCP Layer, where they are reassembled;

FIG. 22 illustrates the comparison between a process in which multiple threads are carrying out copy operations and a process in which a single thread is carrying out copy operations in a pipelined RDMA model;

FIG. 23 illustrates the process steps involved in the performance of RDMA operations in which packets are broadcast to different locations.

DETAILED DESCRIPTION OF THE INVENTION

In order to provide a more understandable description of the structure and operation of the present invention, it is useful to first describe some of the components that exist in the environment in which the invention is typically embodied. This environment includes at least two data processing systems capable of Remote Direct Memory Access (RDMA) operations. Each data processing system communicates with any other coupled data processing system through a switch (also termed herein a network, fabric or communications fabric). Each data processing system includes one or more nodes which in turn may include one or more independently operating Central Processing Units (CPUs). The switch connects to the nodes via switch adapters. These adapters, such as those present in the pSeries of products offered by the assignee of the present invention, include their own memory for storage and queuing operations. These switch adapters may also include their own microcode driven processing units for handling requests, commands and data that flow via the adapter through the network to corresponding communication adapters. The corresponding communication adapters are associated in the same way with other data processing systems and may have similar capabilities.

The Concept of a Window

An adapter window is an abstraction of a receive FIFO queue, a send FIFO queue and some set of adapter resources and state information that are mapped to a user process that is part of a parallel job. The FIFO queues are used for packet-mode messages, as well as for posting RDMA command and notification requests that help the ULP handshake with the adapter.

The Receive Side

Each receive side FIFO queue is a structure in the form of one or more large pages. An even easier alternative is to always deploy the FIFO queue as a 64 MB memory page. The memory for the Receive FIFO, regardless of its size, is expected to be in contiguous real memory, and the real memory address of the start of the queue is stored in the Local Mapping Table (LMT) for the given adapter window in adapter SRAM. Having the receive FIFO queues mapped to contiguous real memory eliminates the need for the network adapter to deal with Translation Control Entry (TCE) tables and for the driver to set these tables up during job startup. The contiguous real memory hence simplifies the adapter design considerably because the adapter does not need to worry about TCE caching and its management in the critical data transfer paths. Regardless of the FIFO size, in the preferred implementation herein, the queue is comprised of fixed length (2 KB) data packet frames, a size dictated by the maximum packet size handled by the switch. The concepts explained herein naturally extend to other possible packet sizes.

Packet arrival notification for packet mode operation is accomplished as follows. The microcode DMAs all but the first cache line of an arriving packet to system memory. It waits for that data to reach the point of coherence, and then DMAs the first cache line (the so-called header) into the appropriate packet header slot in the packet buffer in system memory. The ULP polls on the first cache line of the next packet frame to determine if a new packet has arrived. Upon consuming the packet, the ULP zeroes out the first line (or a word thereof) of the packet, to prepare the FIFO slot for its next use. This zeroing out by the ULP allows the ULP to easily distinguish new packets (which never have the line zeroed out) from older packets already consumed by the ULP. For RDMA mode, the entry that is put into the FIFO is an RDMA completion packet which, as a header-only entity, is transferred as a single cache line DMA.
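The notification scheme just described can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the 128-byte cache line is inferred from the 2 KB frames of up to 16 cache lines mentioned in this description, and ordinary Python assignments stand in for the adapter's DMA operations and coherence wait.

```python
CACHE_LINE = 128   # assumed: 2 KB frame / 16 cache lines
FRAME_SIZE = 2048  # fixed frame size given in the text


class ReceiveFifo:
    """Toy model of a receive FIFO shared by the adapter and the ULP."""

    def __init__(self, slots: int):
        self.frames = [bytearray(FRAME_SIZE) for _ in range(slots)]
        self.head = 0  # next frame the ULP polls
        self.tail = 0  # next frame the adapter fills

    def adapter_deliver(self, packet: bytes) -> None:
        """Adapter side: store the body first and the header cache line last."""
        if not 0 < len(packet) <= FRAME_SIZE:
            raise ValueError("packet must fit in one frame")
        if not any(packet[:CACHE_LINE]):
            raise ValueError("an all-zero header could not be detected by polling")
        frame = self.frames[self.tail % len(self.frames)]
        # Step 1: everything beyond the first cache line.
        frame[CACHE_LINE:len(packet)] = packet[CACHE_LINE:]
        # Step 2: the header line, written last, is the arrival signal.
        frame[:CACHE_LINE] = packet[:CACHE_LINE].ljust(CACHE_LINE, b"\0")
        self.tail += 1

    def ulp_poll(self):
        """ULP side: return the next frame if it has arrived, else None."""
        frame = self.frames[self.head % len(self.frames)]
        if not any(frame[:CACHE_LINE]):
            return None  # header line still zero: nothing new
        packet = bytes(frame)
        frame[:CACHE_LINE] = bytes(CACHE_LINE)  # zero the header for reuse
        self.head += 1
        return packet
```

Because the header line is written last, a nonzero header implies that the rest of the frame is already in place, so the ULP needs no separate notification array.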

This FIFO queue structure is simple and it minimizes short-packet latency. Short packets (that is, data packets of less than 128 bytes) suffer only one system memory latency hit, as opposed to other mechanisms involving a separate notification array or descriptor. The present mechanism also enhances compatibility with the send-side interface, and is readily amenable to other variations based on the use of hardware as opposed to software as a design component of the FIFO queue model.

When the receive-side FIFO queue is full, incoming packet mode packets and RDMA completion packets are silently discarded. Interrupts are based on the FIFO queue count threshold for the given adapter window. For example, an interrupt is generated when the microcode writes the n-th receive FIFO entry, where n is an integer previously provided by the higher level protocol as the next item of interest. Note that interrupts can be disabled by the user space ULP by setting the count to some selected special value such as zero or all ones. Interrupt signals are generated upon triggering conditions, as perceived by the adapter. Incoming packets are validated on the basis of the protection key stamped in the packet header.
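A minimal sketch of this accounting follows, assuming that zero is the special value that disables interrupts (the text offers zero or all ones as possibilities). The class and attribute names are hypothetical and only loosely mirror the LMT receive fields.

```python
class WindowReceiveState:
    """Toy per-window receive accounting; names are illustrative only."""

    INTERRUPTS_DISABLED = 0  # assumed special value

    def __init__(self, slots: int, int_threshold: int = INTERRUPTS_DISABLED):
        self.avail_slots = slots            # free receive FIFO frames
        self.current_cnt = 0                # entries written so far
        self.int_threshold = int_threshold  # n supplied by the higher level protocol
        self.discarded = 0                  # packets dropped because the FIFO was full

    def on_packet(self) -> bool:
        """Account for one arriving packet; return True if an interrupt fires."""
        if self.avail_slots == 0:
            self.discarded += 1  # full FIFO: the packet is silently dropped
            return False
        self.avail_slots -= 1
        self.current_cnt += 1
        return (self.int_threshold != self.INTERRUPTS_DISABLED
                and self.current_cnt == self.int_threshold)
```

With two free slots and a threshold of 2, the second packet raises the interrupt and a third packet is counted as discarded.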

In the present invention, packets are potentially delivered to the FIFO queue system memory in an out-of-order sequence (that is, it is possible for completions to be marked out of order). The FIFO queue tail pointer in the Local Mapping Table is incremented as each entry is written to the receive FIFO. Multiple receive thread engines on the adapter, even if acting on behalf of the same window, require no additional synchronization with respect to each other, allowing for significant concurrency in the receipt of a message. Packet frames are returned to the adapter by the ULP by means of an MMIO command (see below). The total number of adapter receive tasks is preferably limited to the minimum number that keeps all pipelines full. In the presently preferred embodiments of the present invention this number is four, and it can be tuned to different settings for improved performance.

The current and preferred implementation of the present adapter uses MMIO for a number of commands from the ULPs to the adapter microcode and/or hardware. The host code can access the registers and facilities of the adapter by reading or writing to specific addresses on the system bus. In the case of writes, depending upon the specific address, the data that is written can be either forwarded by the hardware interface to the microcode running on the adapter, or processed directly by the adapter hardware. When dealing with a command that is destined for the microcode, both the data written and the address used serve as input to the microcode to determine the exact operation requested. Among the operations utilized in the receive processing are:

a. Update Slot Counts command--This is used to return Receive FIFO slots to the adapter for reuse;

b. Update Interrupt Threshold command--This is used to set a new mark to indicate when the adapter microcode should generate the next interrupt. Note that it is possible for this command to cause an immediate interrupt.

Adapter SRAM accesses are also achieved using Memory Mapped I/O (MMIO), but this is handled directly by the hardware without microcode support. The host side code may access, with proper authority which is set up during parallel job startup, the data stored in the adapter SRAM. This includes the LMTs for all of the windows and all of the RDMA Contexts (rCxt). Read LMT is an example of a specific device driver/hypervisor implemented command that uses the adapter SRAM access MMIO to retrieve data from adapter SRAM. It works by passing the specific address at which the LMT is stored within SRAM as a part of the Adapter SRAM access MMIO. It is important to point out, though, that the LMT stored in the SRAM may not always be current. In the preferred implementation, the working copy of the LMT is cached closer to the adapter processor for improved performance in the mainline data transfer path.

The adapter microcode performs a number of different operations related to processing packets received from the network or switch. Below is a brief overview of this process. The microcode executes steps either as the result of receiving an MMIO command, passed through by the hardware, or an incoming packet from the network. It is noted that in the preferred implementation herein, there is provided a microcode engine running on the adapter. Other implementations which do not use a microcode engine are possible (for example, a complete state machine, an FPGA based engine, etc.). The concepts explained in this preferred embodiment extend to other possible implementations as well. The steps involved in processing a packet received from the network or switch are now considered:

a. Task allocation--The hardware pre-allocates tasks to be used for received packets (this spawns a new thread of execution on the adapter).

b. Packet header arrival (thread wakeup)--When the hardware starts to receive a packet header, it allocates the appropriate Channel Buffer (CB) (or window resources) based upon the adapter channel that this packet is destined for (information within the packet indicates this). The channel buffer is an array of memory in the adapter that is available to the microcode. This is where the LMTs (the window state) are cached for all adapter windows. By design there are as many channel buffers as there are tasks, and there is enough space within the channel buffers to store LMTs for all of the possible adapter windows. All tasks working on the same group of windows reference the same channel buffer. In addition to allocating the channel buffer, the hardware also copies the received packet header to task registers and schedules the task to be executed. The role of the microcode during this time is to validate the packet as something of interest and to prepare for the arrival of the payload (if any). The microcode then checks to determine whether or not the payload has arrived. If so, it proceeds directly to the next step. Otherwise, it suspends this task waiting for the payload to arrive. Such suspension allows for other activity to be overlapped with waiting for payload arrival, thus ensuring maximum concurrency in adapter operations.

c. Packet data arrival--When the packet payload, if any, arrives, the hardware allocates a Data Transfer Buffer (DTB). The DTB is an array of memory in the adapter which is not directly accessible to the adapter processor. The DTB is a staging area for the packet payload in the adapter before it is pushed into system memory for the ULP or application to absorb into the ongoing computation. If adapter microcode had suspended processing awaiting the payload arrival, the task is then added to the dispatch queue by the hardware. Assuming that the packet is valid, the microcode initiates a data transfer of the payload to system memory (the Receive FIFO for FIFO mode). For the FIFO mode of operation, if the payload is greater than a single cache line, then only data following the first cache line is moved initially. (If the payload is less than or equal to a single cache line, then the entire payload is transferred at once.) For RDMA mode, the entire packet is transferred to system memory at one time. Once any data move has been started, the microcode suspends the task waiting for the data transfer to complete.

d. When the hardware completes moving the data to system memory, the task is again awakened. In FIFO mode, if the payload is greater than a single cache line, then the first cache line is written to system memory. Writing this data after the rest of the data is already in system memory ensures data coherence. As explained above, the DMA-ing of the first cache line is a signal to the ULP that new data has arrived. The task again suspends after initiating this transfer.

e. Once all data has been transferred to system memory, the task is again dispatched. At this point it determines whether or not an interrupt is required. If so, then it initiates the interrupt process prior to releasing the CB and DTB buffers and deallocating the task.

It is noted that items c, d, and e in the list above are similar to items a and b, but that they can be driven by a different hardware event.
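The transfer ordering in steps c and d can be summarized as a small planning function. This is a hedged sketch rather than adapter microcode: the function name is hypothetical, and the 128-byte cache line is an assumption inferred from the 2 KB frame of up to 16 cache lines.

```python
CACHE_LINE = 128  # assumed cache line size in bytes


def receive_dma_plan(mode: str, payload_len: int) -> list:
    """Return the (offset, length) transfers to system memory, in issue order."""
    if mode not in ("fifo", "rdma"):
        raise ValueError("mode must be 'fifo' or 'rdma'")
    if payload_len <= 0:
        return []  # nothing to move
    if mode == "rdma" or payload_len <= CACHE_LINE:
        # RDMA mode, or a payload of at most one cache line: one transfer.
        return [(0, payload_len)]
    # FIFO mode with a longer payload: the body goes first and the first
    # cache line last, so its arrival signals that the whole packet is ready.
    return [(CACHE_LINE, payload_len - CACHE_LINE), (0, CACHE_LINE)]
```

For a 300-byte FIFO-mode payload the plan is the 172 trailing bytes followed by the first 128 bytes, with the task suspending after each transfer is started.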

The Send Side

The transmit-side FIFO queue, like the receive side FIFO queue, occupies a contiguous section of real system memory whose real address is stored in the LMT for the adapter window. This queue also employs fixed length packet frames. In presently preferred embodiments of the present invention, each packet is up to 16 cache lines long. The destination node id (that is, the id for the destination node adapter), the actual length of the data in the packet, and the destination window are specified in a 16-byte header field (in the presently preferred embodiment) within the first cache line. The header is formatted so as to minimize the amount of shuffling required by the microcode. Note that the destination node id need not be checked by microcode; this protection is provided by the above-mentioned protection key.

For each outgoing packet, the adapter fetches the packet header into the Header Buffer (an array of memory in the adapter; there are 16 of these header buffers in the presently preferred embodiment). The microcode then modifies the packet header to prepare it for injecting the packet into the network, including adding the protection key associated with this adapter window from the LMT. The adapter also fetches the data payload from the send FIFO queue (for FIFO mode) or from the identified system memory (for RDMA mode) into the Data Transfer Buffer, via local DMA (system memory into adapter). Once the header is prepared and the payload is in the DTB, the adapter can inject the packet into the network.

After transmitting the packet, the adapter marks completion by updating the header cache line of the frame in the send FIFO queue. This is done for every packet that the microcode processes so that the ULP can clearly see which packets have or have not been processed by the adapter.

This completion marking of the send FIFO queue entry is performed using a sync-and-reschedule DMA operation in the presently preferred embodiment. When this DMA operation completes, the task is ready to process a new send request.

The adapter maintains a bit vector of active adapter windows (hence the number of adapter windows is restricted). The bit vector is contained in either one of two global adapter registers. A bit is set by a system MMIO StartTransmit command; the command also includes a packet count which is added to the readyToSend packet count in the LMT. Each time a packet is processed, readyToSend is decremented. The ready bit is cleared when the adapter processes the last readyToSend packet.
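The bookkeeping described above might be modeled as follows. This is a simplified sketch: a single Python integer stands in for the global adapter registers that hold the bit vector, a list stands in for the per-window readyToSend counts kept in the LMTs, and all names are hypothetical.

```python
class SendBookkeeping:
    """Toy model of the active-window bit vector and readyToSend counts."""

    def __init__(self, num_windows: int):
        self.active = 0                         # bit i set: window i has work
        self.ready_to_send = [0] * num_windows  # per-window packet counts

    def start_transmit(self, window: int, packet_count: int) -> None:
        """StartTransmit: add to the window's count and set its ready bit."""
        if packet_count < 1:
            raise ValueError("packet_count must be positive")
        self.ready_to_send[window] += packet_count
        self.active |= 1 << window

    def packet_processed(self, window: int) -> None:
        """One packet sent: decrement, and clear the bit after the last one."""
        if self.ready_to_send[window] == 0:
            raise ValueError("window has no packets ready to send")
        self.ready_to_send[window] -= 1
        if self.ready_to_send[window] == 0:
            self.active &= ~(1 << window)
```

Because the ready bits live in a fixed-width register, the number of windows is bounded by the register width, which is the restriction noted in the text.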

Transmit threads proceed through the active adapter windows in a round robin fashion. The adapter switches windows every k packets, including the case k=1. Some efficiency (for example, fewer LMT fetches) is gained for k>1, for instance by reusing the register state. But that tends to optimize the send side better than the receive side (which helps exchange bandwidth, but may cause congestion to increase). The actual selection of the value of k can be tuned based on the requirements for performance, fairness among all windows, and other policy controls selectable by the user or system administrator.
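To make the effect of k concrete, the following sketch computes the order in which windows would be served for a given set of ready counts. It is an illustration under simplifying assumptions (a single transmit thread, no new work arriving mid-run, windows visited in ascending id order); the function name is hypothetical.

```python
from collections import deque


def round_robin_order(ready: dict, k: int) -> list:
    """Order in which window ids are served, switching windows every k packets."""
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = dict(ready)  # window id -> packets still to send
    queue = deque(w for w in sorted(remaining) if remaining[w] > 0)
    order = []
    while queue:
        window = queue.popleft()
        burst = min(k, remaining[window])  # send up to k packets for this window
        order.extend([window] * burst)
        remaining[window] -= burst
        if remaining[window] > 0:
            queue.append(window)  # still active: rejoin at the back
    return order
```

With ready counts {0: 3, 1: 1, 2: 2}, k=1 yields 0, 1, 2, 0, 2, 0 while k=2 yields 0, 0, 1, 2, 2, 0: larger k means fewer window switches but longer waits for the other windows.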

The total number of transmit tasks is limited to the minimum that keeps all pipes flowing (typically there are four pipes in the presently preferred embodiment). No attempt is made to maintain transmission order, which avoids the overhead associated with doing so. This is, in fact, a key feature of the present invention in almost all of its embodiments.

Upon the issuance of a StartTransmit MMIO command, the adapter attempts to assign the work to a pre-allocated Transmit task. If no Transmit tasks are currently available, then the readyToSend count in the appropriate LMT is updated, the bit that is associated with this adapter window is set in the bit vector and a work manager task is notified. The work manager task distributes transmit work to Transmit tasks as they become available. It is the work manager task's responsibility to update the readyToSend count in the LMT and the bit vector in the Global Registers (referred to also as "GR"; these are registers accessible to all adapter tasks) as the work is distributed.

On the send side, the transmit threshold interrupt behaves exactly like its receive side counterpart. The use of send and receive interrupts is optional; in fact, interrupts are not typically used for send side operation. MMIO commands used for send side operations include:

a. StartTransmit--An MMIO that identifies how many send FIFO queue slots have just been prepared and are now ready to send. This count is used by the adapter to process the correct number of packets.

b. Set Interrupt Threshold--As with the receive side operation, this optional command allows the host code to identify when it would like an interrupt generated.

In addition, the host code may issue Adapter SRAM access MMIOs. However, they are of little value during normal operation.

LMT Contents

Presented below are the fields in the LMT data structure (in conceptual form) that are preferably employed.

//***********************************************************
// LMT Definition
//***********************************************************
typedef struct {
  Recv_fifo_real_address - Real address of the base of the Receive FIFO
  Window_state - Window state (invalid, valid, etc)
  Recv_fifo_size - Size of the Receive FIFO (encoded)
  Window_id - Window id
  Recv_mask - Mask for current slot from the recv_current_cnt
  Recv_current_cnt - Receive current count for interrupt threshold
  Recv_fifo_avail_slots - Receive FIFO number of slots available
  Recv_int_threshold - Receive interrupt threshold
  Window_key - Window key (used for protection)
  Rcxt_id - rCxt id associated with Window Fatal error
  Int_vect_data - Interrupt vector entry with Window Fatal error
  send_fifo_real_address - Real address of the base of the Send FIFO
  Config_parm_bcast - Config parm - Enable sending broadcast pkts.
  Send_fifo_size - Size of the Send FIFO (encoded)
  Send_quanta_value - Count of send actions remaining in quanta
  Send_mask - Mask to get current slot from send_current_cnt
  Send_current_cnt - Send current count for interrupt threshold
  Send_fifo_ready_slots - Send FIFO number of slots ready to process
  Send_int_threshold_hi - Send interrupt threshold
  Rcxt_head - Head rCxt of RDMA send queue for window
  Rcxt_tail - Tail rCxt of RDMA send queue for window
  Rcxt_count - Count of rCxts on RDMA send queue
} lmt_t;
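One plausible reading of the mask fields is that the current slot is obtained by AND-ing the running count with the mask, which works when the FIFO holds a power-of-two number of frames. The sketch below illustrates that reading only; it is an assumption, not something the listing states. The class is a hypothetical subset of the LMT with assumed integer types, and the 2 KB frame size is taken from the receive side description.

```python
from dataclasses import dataclass


@dataclass
class LmtReceiveView:
    """Hypothetical subset of the LMT receive fields; types are assumptions."""
    recv_fifo_real_address: int  # base real address of the Receive FIFO
    recv_mask: int               # assumed to be (number of slots - 1)
    recv_current_cnt: int = 0    # running count of entries written

    def current_slot(self) -> int:
        """Slot index derived from the running count (assumed interpretation)."""
        return self.recv_current_cnt & self.recv_mask

    def current_frame_address(self, frame_size: int = 2048) -> int:
        """Real address of the frame that the current slot refers to."""
        return self.recv_fifo_real_address + self.current_slot() * frame_size
```

Under this reading, a count of 10 with a mask of 7 (eight slots) selects slot 2, so the count can grow without an explicit wrap-around test.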

What follows now is a description of the RDMA architecture.

Memory Protection Model

Memory regions are registered to a particular RDMA job. Protection granularity is per page. RDMA memory accesses, both local and remote, are validated by RDMA job id and buffer protection key. The job id is verified by comparing the job id (or window key) for the window with the job id assigned to the Translation Control Entry (TCE) Table (see below for a more detailed description of their use in the discussion for FIGS. 15, 16, 17 and 18). The memory protection key, which ensures that the request uses a current view of the memory usage, is validated by comparing the key in the Protocol Virtual Offset (PVO; see below) with the key for the particular TCE Table entry being referenced. This ensures that only the authorized job accesses its data and also provides protection from stale packets in the network.
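The two checks can be sketched as below. The data layout is hypothetical: the text does not say at this point how a PVO is decoded or how a TCE Table is organized, so the page index and key are passed in directly and the table is modeled as a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class TceEntry:
    real_page: int  # real page backing this entry (illustrative)
    key: int        # protection key currently assigned to the page


@dataclass
class TceTable:
    job_id: int                                  # job the table is registered to
    entries: dict = field(default_factory=dict)  # page index -> TceEntry


def validate_rdma_access(window_job_id: int, pvo_key: int,
                         page_index: int, table: TceTable) -> bool:
    """Return True only if both the job id and the per-page key match."""
    if window_job_id != table.job_id:
        return False  # the window belongs to a different job
    entry = table.entries.get(page_index)
    if entry is None:
        return False  # the page is not registered for RDMA
    return entry.key == pvo_key  # a stale key means a stale view of memory
```

A packet carrying an old key fails the last comparison even when the job id matches, which is how stale packets in the network are rejected.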

For a better understanding of the present invention, it is desirable to consider the general operation of RDMA systems and methods. FIG. 1 seeks to answer the question: "What is RDMA?". In RDMA a master task running on one node is able to update or read data from a slave task running on another node. The exchange of information occu

