Title: Channel-based late race resolution mechanism for a computer system
Abstract: A channel-based mechanism resolves race conditions in a computer system between a first processor writing modified data back to memory and a second processor trying to obtain a copy of the modified data. In addition to a Q0 channel for carrying requests for data, a Q1 channel for carrying probes in response to Q0 requests, and a Q2 channel for carrying responses to Q0 requests, a new channel, the QWB channel, which has a higher priority than Q1 but lower than Q2, is also defined. When a forwarded Read command from the second processor results in a miss at the first processor's cache, because the requested memory block was written back to memory, a Loop command is issued to memory by the first processor on the QWB virtual channel. In response to the Loop command, memory sends the written back version of the memory block to the second processor.
Patent Number: 7,000,080 Issued on 02/14/2006 to Van Doren, et al.
Inventors: Van Doren; Stephen R. (Northborough, MA); Tierney; Gregory E. (Chelmsford, MA)
Assignee: Hewlett-Packard Development Company, L.P. (Houston, TX)
Appl. No.: 263836
Filed: October 3, 2002
Current U.S. Class: 711/143; 711/145; 711/158
Current Intern'l Class: G06F 12/00 (20060101)
Field of Search: 711/141,143,144,145,147,158
References Cited
U.S. Patent Documents
4847804 | Jul., 1989 | Shaffer et al.
5222224 | Jun., 1993 | Flynn et al.
5233616 | Aug., 1993 | Callander
5297269 | Mar., 1994 | Donaldson et al.
5303362 | Apr., 1994 | Butts, Jr. et al.
5313609 | May, 1994 | Baylor et al.
5490261 | Feb., 1996 | Bean et al.
5530933 | Jun., 1996 | Frink et al.
5537575 | Jul., 1996 | Foley et al.
5551005 | Aug., 1996 | Sarangdhar et al.
5579504 | Nov., 1996 | Callander et al.
5608893 | Mar., 1997 | Slingwine et al.
5737757 | Apr., 1998 | Hassoun et al.
5761731 | Jun., 1998 | Van Doren et al.
5905998 | May, 1999 | Ebrahim et al.
6014690 | Jan., 2000 | VanDoren et al.
6055605 | Apr., 2000 | Sharma et al.
6061765 | May, 2000 | Van Doren et al.
6088771 | Jul., 2000 | Steely, Jr. et al.
6094686 | Jul., 2000 | Sharma
6101420 | Aug., 2000 | VanDoren et al.
6105108 | Aug., 2000 | Steely, Jr. et al.
6108737 | Aug., 2000 | Sharma et al.
6108752 | Aug., 2000 | VanDoren et al.
6125429 | Sep., 2000 | Goodwin et al.
6154816 | Nov., 2000 | Steely et al.
6202126 | Mar., 2001 | Van Doren et al.
6249520 | Jun., 2001 | Steely, Jr. et al.
6249846 | Jun., 2001 | Van Doren et al.
6279085 | Aug., 2001 | Carpenter et al.
6944719 | Sep., 2005 | Rowlands et al.
Foreign Patent Documents
0 817 074 | Jul., 1998 | EP
Other References
Scales, D. and Gharachorloo, K., Design and Performance of the Shasta Distributed
Shared Memory Protocol, XP-000755264, Jul. 7, 1997, pp. 245-252.
Scales, D., Gharachorloo, K. and Thekkath, C., Shasta: A Low Overhead, Software-Only
Approach for Supporting Fine-Grain Shared Memory, XP-002173083, Jan. 10, 1996,
pp. 174-185.
Scales, D. and Gharachorloo, K., Towards Transparent and Efficient Software Distributed
Shared Memory, XP-000771029, Dec. 1997, pp. 157-169.
Scales, D., Gharachorloo, K. and Aggarwal, A., Fine-Grain Software Distributed
Shared Memory on SMP Clusters, WRL Research Report 97/3, Feb. 1997, pp. i and 1-28.
Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A. and Hennessy,
J., Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors,
(c) 1990 IEEE, pp. 15-26.
Jouppi, N., Improving Direct-Mapped Cache Performance by the Addition of a Small
Fully-Associative Cache and Prefetch Buffers, (c) 1990 IEEE, pp. 364-373.
Agarwal, A., Simoni, R., Hennessy, J. and Horowitz, M., An Evaluation of Directory
Schemes for Cache Coherence, (c)1988 IEEE, pp. 353-362.
Papamarcos, M. and Patel, J., A Low-Overhead Coherence Solution for Multiprocessors
with Private Cache Memories, (c) 1984 IEEE, pp. 284-290.
UltraSPARC Ultra Port Architecture (UPA): The New-Media System Architecture,
http://www.sun.com/processors/whitepapers/wp95-023.html, Copyright 1994-2002 Sun
Microsystems, pp. 1-4.
Porting OpenVMS Applications to Intel Itanium Architecture, Compaq Computer Corporation,
Apr. 2002, pp. 1-17.
Adve, S., Hill, M., Miller, B. and Nester, R., Detecting Data Races on Weak Memory
Systems, (c) 1991 ACM, pp. 234-243.
Gharachorloo, K., Sharma, M., Steely, S. and Van Doren, S., Architecture and
Design of AlphaServer GS320, Nov. 2000, pp. 1-12.
IEEE Standard for Scalable Coherent Interface (SCI), (c) 1993 IEEE, pp. Table
of Contents, 30-34 and 141-188.
Primary Examiner: Anderson; Matthew D.
Assistant Examiner: Barton; Jonathan
Claims
What is claimed is:
1. In a computer system having a plurality of processors and a main memory organized
into a plurality of memory blocks, the processors having one or more caches, a
method for resolving a late race condition between a first processor and a second
processor for a selected memory block, the method comprising:
defining a plurality of channels within the computer system for exchanging command
packets among the processors and main memory, the channels including a Q0
channel for carrying requests for memory blocks, a Q1 channel, having a
higher priority than the Q0 channel, for carrying probes in response to
Q0 requests, a Q2 channel, having a higher priority than the Q1
channel, for carrying responses to Q0 requests, and QWB channel having a
higher priority than the Q1 channel but lower than Q2 channel;
issuing a Write_Back (WB) command from the first processor to main memory, the
WB command including a modified version of the selected memory block taken from
the first processor's cache;
forwarding from main memory to the first processor a memory reference request
specifying the selected memory block, the memory reference request initiated by
the second processor;
in response to the memory reference request, issuing a Loop command from the
first processor to main memory on the QWB channel;
in response to the WB command, writing the modified data back to main memory; and
in response to the Loop command, issuing a memory reference response from main
memory to the second processor.
2. The method of claim 1 wherein the channels are implemented as ordered channels.
3. The method of claim 2 wherein the computer system further includes at least
one directory for maintaining status information regarding the memory blocks configured
at main memory, the directory having, for each memory block, an owner field specifying
the owner of the respective memory block, a sharer list specifying one or more
processors, if any, that have a shared copy of the respective memory block, and
a writer field specifying the last processor to have written the respective memory
block back to main memory, the method further comprising entering an identifier
(ID) assigned to the first processor in the writer field of the directory entry
for the selected memory block in response to the WB command.
4. The method of claim 3 further wherein the issuing the memory reference response
to the second processor depends on the respective writer field matching the source
of the Loop command.
5. The method of claim 1 wherein the WB command is issued on the QWB channel.
6. The method of claim 2 wherein the memory reference request is a request for
a shared copy of the selected memory block, and the Loop command is a Loop_Forwarded
Read (LFRead) command requesting main memory to send the selected memory block
to the second processor.
7. The method of claim 2 wherein the memory reference request is a request for
write access to the selected memory block, and the Loop command is a Loop_Forwarded_Read_Modify
(LFReadMod) command requesting main memory to send the selected memory block to
the second processor and to grant the second processor write access to the selected
memory block.
8. The method of claim 1 wherein
the computer system has physical interconnect links and buffering resources coupling
the processors and main memory, and
each channel is an independently flow-controlled virtual channel of commands
that shares the physical interconnect link and buffering resources.
9. The method of claim 3 wherein the directory is free from maintaining transient
states for any memory block.
10. The method of claim 1 wherein the forwarded memory reference request results
in a cache miss at the first processor as the selected memory block was victimized
from the first processor's cache in response to the WB command.
11. A computer system configured to resolve late race conditions, the computer
system comprising:
a plurality of interconnected processors, each processor having a cache;
a main memory in communicating relationship with the plurality of processors,
the main memory organized into a plurality of memory blocks; and
a plurality of channels for carrying command packets among the processors and
main memory, wherein
the channels include a Q0 channel for carrying requests for memory blocks,
a Q1 channel for carrying probes in response to Q0 requests, a Q2
channel for carrying responses to Q0 requests, and a QWB channel, having
a higher priority than the Q1 channel but lower than Q2 channel,
for carrying Loop commands from a processor to main memory in response to a forwarded
memory reference request received at the processor that specifies a selected memory
block that was written back to main memory.
12. The computer system of claim 11 wherein the processor is configured to write
the selected memory block back to main memory by issuing a Write_Back (WB) command
packet on the QWB channel to main memory, the WB command including a copy of the
modified version of the selected memory block.
13. The computer system of claim 12 wherein each channel is implemented as an
ordered channel.
14. The computer system of claim 13 further comprising physical interconnect
links and buffering resources coupling the processors and main memory, wherein
each channel is an independently flow-controlled virtual channel of commands that
shares the physical interconnect link and buffering resources.
15. The computer system of claim 12 further comprising at least one directory
for maintaining status information regarding the memory blocks of main memory,
the directory having, for each memory block, an owner field specifying the owner
of the respective memory block, a sharer list specifying zero, one or more processors
that have a shared copy of the respective memory block, and a writer field specifying
the last owner processor to write the respective memory block back to main memory,
wherein, in response to the WB command, an identifier (ID) assigned to the first
processor is entered in the writer field of the directory entry for the selected
memory block.
16. In a computer system having a plurality of processors and a main memory organized
into a plurality of memory blocks, the processors having one or more caches, a
method for resolving a late race condition between a first processor and a second
processor for a selected memory block, the method comprising:
defining a plurality of channels within the computer system for exchanging command
packets among the processors and main memory, the channels including a Q0
channel for carrying requests for memory blocks, a Q1 channel, having a
higher priority than the Q0 channel, for carrying probes in response to
Q0 requests, a Q2 channel, having a higher priority than the Q1
channel, for carrying responses to Q0 requests, and QWB channel having a
higher priority than the Q1 channel but lower than Q2 channel;
issuing a Write_Back (WB) command from the first processor, the WB command including
a modified version of the selected memory block taken from the first processor's cache;
forwarding to the first processor a memory reference request specifying the selected
memory block, the memory reference request initiated by the second processor;
in response to the memory reference request, issuing a Loop command from the
first processor on the QWB channel;
in response to the WB command, writing the modified data back to main memory; and
in response to the Loop command, issuing a memory reference response to the second processor.
17. The method of claim 16 wherein the computer system further includes a directory,
and the WB command and Loop command are received at the directory.
18. The method of claim 17 wherein the memory reference request and the Loop
command are issued from the directory.
19. The method of claim 16 wherein at least part of the directory is located
in the main memory of the computer system.
20. The method of claim 16 wherein the WB command is issued on the QWB channel.
Description
CROSS-REFERENCE TO RELATED APPLICATION
This application is related to the following co-pending, commonly owned U.S.
patent applications:
U.S. patent application Ser. No. 10/263,739 entitled DIRECTORY STRUCTURE PERMITTING
EFFICIENT WRITE-BACKS IN A SHARED MEMORY COMPUTER SYSTEM, filed Oct. 3, 2002.
U.S. patent application Ser. No. 10/263,743 entitled RETRY-BASED LATE RACE RESOLUTION
MECHANISM FOR A COMPUTER SYSTEM, filed Oct. 3, 2002.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to shared memory computer architectures and, more
specifically, to cache coherency protocols for use in shared memory computer systems.
2. Background Information
A computer system typically comprises one or more processors linked to a main
memory by a bus or other interconnect. In most computer systems, main memory organizes
the instructions and data being stored into units typically referred to as "blocks"
each of which is separately addressable and may be of a fixed size. Instructions
and data are typically moved about the computer system in terms of one or more blocks.
Ordinarily, a processor will retrieve data, e.g., one or more blocks,
from main memory, perform some operation on it, and eventually return the results
back to main memory. Retrieving data from main memory and providing it to a processor
can take significant time especially in terms of the high operating speeds of today's
processors. To reduce such latencies as well as to reduce the number of times a
processor must access main memory, modern processors and/or processor chipsets include
one or more cache memories or caches. A cache is a small, fast memory module that
is placed in close proximity to the processor. Many caches are static random access
memories (SRAMs), which are faster, but more expensive, than dynamic random access
memories (DRAMs), which are often used for main memory. The cache is used to store
information, e.g., data or instructions, which the processor is currently using
or is likely to use in the near future. There are two basic types of caches: "write-through"
caches and "write-back" caches.
With a write-through cache, whenever a processor modifies or updates a piece
of data in the processor's cache, main memory's copy of that data is automatically
updated. This is accomplished by having the processor write the data back to memory
whenever the data is modified or updated. A write-back cache, in contrast, does
not automatically send modified or updated data to main memory. Instead, the updated
data remains in the cache until some more convenient time, e.g., when the processor
is idle, at which point the modified data is written back to memory. The utilization
of write-back caches typically improves system performance. In some systems, a
write-back or victim buffer is provided in addition to the cache. "Victim data"
refers to modified data that is being removed from the processor's cache in order
to make room for new data received at the processor. Typically, the data selected
for removal from the cache is data the processor is no longer using. The victim
buffer stores this modified data which is waiting to be written back to main memory.
The use of a victim buffer frees up space in the cache for other data. Modified
data in the victim buffer is eventually "victimized", i.e., written back to main
memory, at some convenient time.
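The difference between the two cache types can be sketched in a few lines. This is a minimal illustration only, not a model of any particular processor: the class names, the dictionary standing in for main memory, and the `victimize` method are all hypothetical.

```python
# Illustrative model only; all names here are hypothetical.
class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory       # dict: address -> value, stands in for main memory
        self.lines = {}

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value  # main memory is updated on every write


class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()         # addresses modified but not yet written back

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)       # main memory is left stale for now

    def victimize(self, addr):
        """Evict a line, writing it back to memory if it was modified."""
        value = self.lines.pop(addr)
        if addr in self.dirty:
            self.dirty.discard(addr)
            self.memory[addr] = value


memory_wt = {0x40: 0}
wt = WriteThroughCache(memory_wt)
wt.write(0x40, 7)                  # memory_wt[0x40] is updated immediately

memory_wb = {0x40: 0}
wb = WriteBackCache(memory_wb)
wb.write(0x40, 7)
stale_value = memory_wb[0x40]      # memory still holds the old value here
wb.victimize(0x40)                 # the write-back happens only now
```

In the write-back case, memory holds a stale value between the write and the eviction, which is the window in which the races discussed later can arise.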
Although the implementation of write-back or victim buffers has increased
the performance of computer systems, there are some drawbacks. For example, the
addition of a victim buffer requires additional logic and storage or memory space
at the processor chipset increasing cost, complexity and size of the processor chipset.
Symmetrical Multiprocessor (SMP) Systems
Multiprocessor computing systems, such as symmetrical multiprocessor
(SMP) systems, provide a computer environment in which software applications may
run on a plurality of processors using a single address space or shared memory
abstraction. In a shared memory system, each processor can access any data item
without a programmer having to worry about where the data is or how to obtain its
value. This frees the programmer to focus on program development rather than on
managing partitioned data sets and communicating values.
Cache Coherency
Because more than one processor of the SMP system may request a copy of the
same memory block from main memory, cache coherency protocols have been developed
to ensure that no processor relies on a memory block that has become stale, typically
due to a modification or update performed to the block by some other processor.
Many cache coherency protocols associate a state with each cache line. A given
memory block, for example, may be in a shared state in which copies of the block
may be present in the caches associated with multiple processors. When a memory
block is in the shared state, a processor may read from, but not write to, the
respective block. To support write operations, a memory block may be in an exclusive
state. In this case, the block is owned by a single processor which may write to
the cache line. When the processor updates or modifies the block, its copy becomes
the most up-to-date version, while corresponding copies of the block at main memory
and/or other processor caches become stale.
When a processor wishes to obtain exclusive ownership over a memory block that
is currently in the shared state (i.e., copies of the block are present in the
caches of other processors) invalidate requests are typically issued to those other
processors. When an invalidate request is received by a given processor, its cache
is searched for the specified memory block. If the block is found, it is transitioned
to an invalid state. Many caches assign or associate a valid bit with each memory
block or cache line stored in the cache. If the bit is asserted, then the cache
line is considered to be valid and may be accessed and utilized by the processor.
When a memory block is initially received from main memory, the valid bit is asserted
and the memory block is stored in the cache. When an invalidate request is received,
the valid bit of the respective cache line is de-asserted, thereby indicating that
the cache line is no longer valid.
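The valid-bit handling described above might be modeled as follows. This is a simplified sketch with hypothetical names (`CacheLine`, `Cache`, `fill`, `invalidate`); real caches track considerably more state per line.

```python
# Hypothetical, simplified model of valid-bit handling on invalidate.
class CacheLine:
    def __init__(self, data):
        self.data = data
        self.valid = True          # asserted when the block arrives from memory


class Cache:
    def __init__(self):
        self.lines = {}            # address -> CacheLine

    def fill(self, addr, data):
        self.lines[addr] = CacheLine(data)

    def invalidate(self, addr):
        # Search the cache for the block; de-assert the valid bit if found.
        line = self.lines.get(addr)
        if line is not None:
            line.valid = False

    def read(self, addr):
        line = self.lines.get(addr)
        if line is None or not line.valid:
            return None            # treated as a miss
        return line.data


cache = Cache()
cache.fill(0x10, "block data")
before = cache.read(0x10)
cache.invalidate(0x10)
after = cache.read(0x10)
```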
There are two classes of cache coherency protocols: snooping and directory
based. With snooping, the caches monitor or snoop all transactions traversing the
shared memory bus, looking for transactions that reference a memory block stored
at the cache. If such a transaction is detected, the cache updates the status information
for its copy of the memory block based on the snoop transaction. In this way, every
cache that has a copy of a given memory block also has a copy of the status information
of that block. With a directory based protocol, the state of each block is kept
in a single, centralized location in the system, called a directory. Status information
is not maintained in the individual caches.
FIG. 1 is a highly schematic illustration of a prior art directory 100.
Directory 100 has a plurality of entries 102a-d each of which
corresponds to a respective memory block. The directory 100 is organized,
moreover, such that each entry 102a-d has a plurality of fields or
cells for storing state and/or status information for the respective block. In
particular, the directory 100 has an address column 103 that stores
the address of the memory block, an owner column 104 that stores the identity
of the entity, e.g., a processor or main memory itself, that is considered to be
the owner of the memory block, and a sharer column 106 that stores the identity
of those processors or other system entities that have a shared copy of the block.
The sharer column 106 may have a plurality of sub-columns 106a-c,
each of which may contain the identity of a particular processor that has a shared
copy of the respective memory block. If a request for shared access to a memory
block is received from a first processor, P1, main memory examines the directory
entry, e.g., entry 102c, for the block to determine its owner. As
memory is itself the owner of the block, memory sends its copy of the block to
P1 and enters P1's identifier (ID) into one of the sharer fields,
e.g., field 106b, of the respective directory entry, e.g., entry 102c,
thereby noting that P1 has a shared copy of the block. Since P1 only
requested shared access to the memory block, the contents of the entry's owner
field 104 are not modified.
If P1 issues a request for exclusive or write access to some other memory
block, e.g., the block corresponding to entry 102d, main memory again
examines the contents of entry 102d. Suppose that, at the time the
request is received, the owner field reflected that memory was the owner of the
memory block as shown in parentheses. In this case, memory sends the block to P1,
and replaces the contents of the owner field 104 with P1's ID to
reflect that P1, rather than memory, is now the owner of the memory block.
P1 may then modify or update the memory block. If a request from a second
processor, P2, is subsequently received for a shared copy of this memory
block, main memory examines entry 102d of the directory 100
and determines that P1 is the owner of the memory block. Because its copy
of the block, i.e., the copy stored at main memory, may be stale, memory does not
forward its copy to P2. Instead, memory may be configured to forward the
request to P1 and add P2's ID to one of the sharer fields, e.g.,
field 106a. In response to the forwarded request, P1 may then
supply P2 with a copy of the modified memory block from P1's cache.
Alternatively, main memory may be configured to force P1 to relinquish ownership
of the memory block and return the modified version to memory so that memory can
send a copy of the up-to-date version to P2.
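The directory lookups walked through above can be sketched as follows. This is an illustration under stated assumptions: the names `PriorArtEntry`, `handle_read` and `handle_read_mod`, the `"MEM"` marker for memory ownership, and the returned action tuples are all hypothetical, and only the cases described in the text are modeled (the forwarding variant, not the forced-relinquish alternative).

```python
MEMORY = "MEM"                     # hypothetical marker meaning "memory is the owner"


class PriorArtEntry:
    """One row of the prior art directory: an owner field and sharer fields."""
    def __init__(self):
        self.owner = MEMORY
        self.sharers = []


def handle_read(entry, requester):
    """Request for a shared copy of the block."""
    entry.sharers.append(requester)            # note that requester holds a shared copy
    if entry.owner == MEMORY:
        return ("memory_sends_block", requester)
    # Memory's copy may be stale, so the request goes to the owning processor.
    return ("forward_to_owner", entry.owner)


def handle_read_mod(entry, requester):
    """Request for exclusive (write) access to the block."""
    if entry.owner != MEMORY:
        raise NotImplementedError("case not covered by the passage above")
    entry.owner = requester                    # requester becomes the owner
    return ("memory_sends_block", requester)


entry_c = PriorArtEntry()
first = handle_read(entry_c, "P1")             # P1 asks for a shared copy

entry_d = PriorArtEntry()
second = handle_read_mod(entry_d, "P1")        # P1 asks for write access
third = handle_read(entry_d, "P2")             # P2 then asks for a shared copy
```

Here `third` is a forward to P1, mirroring the case where memory declines to send its possibly stale copy.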
It has been recognized that a computer system's cache coherency protocol is a
key factor in the system's ultimate performance. Poorly designed cache coherency
protocols can result in latencies, bottlenecks, other inefficiencies and/or higher
complexity, each of which may reduce performance and/or increase cost. Bottlenecks,
for example, often arise in high occupancy controllers, such as directory controllers.
"Occupancy" is a term of art and refers to the amount of time a controller is unavailable,
e.g., for the servicing of requests, following receipt of an earlier request.
In some cache coherency protocols, when a directory controller receives a request
corresponding to a memory block, it thereafter becomes unavailable to service other
requests for that memory block until certain acknowledgements to the earlier request
are received back at the directory controller. The stalling of requests or references
until the directory controller is once again available may degrade system performance.
Thus, efforts have been made to design low occupancy cache coherency protocols,
which allow multiple requests to the same memory block to be executing substantially
simultaneously within the computer system.
Low occupancy cache coherency protocols can nonetheless result in the creation
of coherency races that, in turn, can cause system deadlock and/or starvation.
Accordingly, a need exists for a low occupancy cache coherency protocol that avoids
deadlock and/or starvation in the face of coherency races.
SUMMARY OF THE INVENTION
Briefly, the invention relates to a mechanism for resolving late races involving
write-backs to memory by creating a new virtual channel and a new message to be
transmitted in the new virtual channel. The channel-based late race resolution
mechanism of the present invention is designed for use in a shared memory computer
system, such as a symmetrical multiprocessor (SMP) computer system. The SMP system
may comprise one or more nodes each having a plurality of processors and a plurality
of shared memory subsystems coupled together by an interconnect fabric. The memory
subsystems are configured to store data in terms of memory blocks, and each processor
preferably has a cache for storing copies of memory blocks being used by the processor.
Each processor further includes a miss address file (MAF) that keeps track of requests
issued to a memory subsystem for a memory block not currently stored in the processor's
cache. Each memory subsystem, moreover, has a memory controller and a directory
for maintaining owner and sharer status information for the memory blocks for which
the memory subsystem is responsible, i.e., those memory blocks for which the memory
subsystem is the "home" memory.
In the illustrative embodiment, the directory has a plurality of entries each
of which is assigned to a respective memory block, and is organized into a main
directory region and a write-back directory region. In the main directory region,
each entry has a single owner/sharer field and a sharer list. The owner/sharer
field indicates which entity, e.g., processor, is considered to be the owner of
the block. The sharer list indicates which entities, e.g., processors, have a copy
of the memory block in their caches. In the write-back directory region, each entry
has a writer field identifying the last owner to have written the memory block
back to the memory subsystem.
The processors and memory subsystems of the SMP system communicate with each
other by exchanging command packets that are carried by the SMP system within a
plurality of virtual channels. The virtual channels are utilized to avoid deadlock
and prevent starvation. They include a Q0 virtual channel for carrying memory
reference requests, a Q1 virtual channel, which has a higher priority than
Q0, for carrying probes in response to Q0 requests, and a Q2
virtual channel, which has a higher priority than Q1, for carrying responses
to Q0 requests. In accordance with the present invention, there is also
a new virtual channel, the QWB virtual channel, which has a higher priority than
Q1 but lower than Q2. In the illustrative embodiment, each of the
virtual channels is an ordered communication channel.
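The relative priorities of the four channels can be illustrated with a small arbiter. This is a sketch under assumptions: the numeric values, the strict-priority selection in `next_packet`, and the packet strings are hypothetical, and the text specifies only the relative order Q0 < Q1 < QWB < Q2 and that each channel is ordered.

```python
from collections import deque
from enum import IntEnum


class Channel(IntEnum):
    # Only the relative order comes from the text; the values are arbitrary.
    Q0 = 0     # memory reference requests
    Q1 = 1     # probes issued in response to Q0 requests
    QWB = 2    # the new channel
    Q2 = 3     # responses to Q0 requests


def next_packet(queues):
    """Pick from the highest-priority non-empty channel, FIFO within a channel."""
    for channel in sorted(Channel, reverse=True):
        if queues[channel]:
            return channel, queues[channel].popleft()
    return None


queues = {channel: deque() for channel in Channel}
queues[Channel.Q0].append("Read")
queues[Channel.Q1].append("FRead")
queues[Channel.QWB].append("WB")
queues[Channel.QWB].append("LFRead")
queues[Channel.Q2].append("Fill")

order = []
while True:
    picked = next_packet(queues)
    if picked is None:
        break
    order.append(picked)
```

Strict priority is used here only to make the ordering visible; an actual interconnect would also need per-channel flow control.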
In operation, when a first processor requests write access over a given memory
block, the owner/sharer field of the respective directory entry is loaded with
an identifier (ID) assigned to the first processor, thereby reflecting that the
first processor is the owner of the memory block and has the most up-to-date copy.
When the first processor completes its modification of the memory block, it issues
a Write_Back (WB) command on the new QWB virtual channel to the memory subsystem.
Here, the writer field of the respective directory entry is loaded with the first
processor's ID, the owner/sharer field is left unchanged, and the modified data
is written back to memory. Preferably, the processors do not have victim caches
and thus do not buffer a copy of modified data pending completion of a WB command.
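The directory updates for a write-access grant and a subsequent write-back, as summarized above, might look like this. It is a sketch only: the field names follow the summary, but `DirectoryEntry`, `grant_write_access`, `write_back`, the `"MEM"` default and the dictionary used as memory are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MEMORY = "MEM"                     # hypothetical marker for "memory is the owner"


@dataclass
class DirectoryEntry:
    # Main directory region
    owner_sharer: str = MEMORY
    sharer_list: List[str] = field(default_factory=list)
    # Write-back directory region
    writer: Optional[str] = None


def grant_write_access(entry, requester):
    """Load the owner/sharer field with the requesting processor's ID."""
    entry.owner_sharer = requester


def write_back(entry, memory, addr, source, data):
    """Process a WB command: record the writer and store the modified data."""
    entry.writer = source
    memory[addr] = data
    # The owner/sharer field is deliberately left unchanged.


memory = {0x80: "old"}
entry = DirectoryEntry()
grant_write_access(entry, "P1")
write_back(entry, memory, 0x80, "P1", "new")
```

After the write-back, both the owner/sharer field and the writer field name P1, and memory holds the modified data.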
Suppose a Read command is issued for the memory block by a second processor
before the WB command from the first processor is received at the directory. As
the first processor is still considered to be the owner of the memory block, a
probe, such as a Forwarded_Read (FRead) command, is preferably sent to the first
processor on the Q1 virtual channel directing it to service the Read command
out of the first processor's cache. At the first processor, however, a miss will
occur as the first processor sent the modified data back to main memory in the
WB command. This condition is known as a late race condition.
To resolve the late race, the first processor issues a new command, called a
Loop_Forwarded_Read (LFRead) command, to main memory also on the QWB virtual channel.
Because the QWB virtual channel is an ordered channel, the WB command arrives at the
home memory before the LFRead. The WB command is processed by the memory subsystem as described
above. That is, the writer field is updated with the first processor's ID and the
modified data is written back to memory. When the LFRead is received, the memory
subsystem compares the directory entry's writer field with the ID of the entity
that sourced the LFRead command. As the two values match, the memory subsystem
responds by issuing a Fill command to the second processor on the Q2 virtual
channel that includes a copy of the requested memory block from memory. The second
processor thus receives the requested data, thereby completing the memory reference
operation. Notably, the LFRead command does not cause any change to the directory state.
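The late race and its resolution can be traced with a small simulation. This is a minimal sketch assuming a single memory block, one in-order QWB queue modeled as a `deque`, and hypothetical names throughout (`HomeMemory`, `receive_qwb`, the packet tuples). The summary does not say what happens when the writer field does not match, so that case is simply ignored here.

```python
from collections import deque


class HomeMemory:
    """Home memory plus a one-block directory (illustrative only)."""
    def __init__(self, data):
        self.data = data
        self.owner = "MEM"         # hypothetical marker for memory ownership
        self.writer = None
        self.q2_sent = []          # responses issued on the Q2 channel

    def read_mod(self, src):
        self.owner = src           # src becomes owner of the block
        return self.data

    def read(self, src):
        if self.owner == "MEM":
            self.q2_sent.append(("Fill", src, self.data))
            return None
        # A processor is still recorded as owner: send it a probe on Q1.
        return ("FRead", self.owner, src)

    def receive_qwb(self, packet):
        kind, src, payload = packet
        if kind == "WB":
            self.writer = src      # the owner field is left unchanged
            self.data = payload
        elif kind == "LFRead" and self.writer == src:
            # Writer matches the LFRead source: serve the requester from memory.
            self.q2_sent.append(("Fill", payload, self.data))


home = HomeMemory(data="old")
qwb = deque()                      # ordered channel: first in, first out

p1_cache = {"block": home.read_mod("P1")}
p1_cache["block"] = "new"                        # P1 modifies the block
qwb.append(("WB", "P1", p1_cache.pop("block")))  # WB in flight, cache copy gone

probe = home.read("P2")            # P2's Read reaches memory before the WB
_, target, requester = probe
if "block" not in p1_cache:        # the FRead misses at P1: the late race
    qwb.append(("LFRead", target, requester))

while qwb:                         # WB is processed before LFRead
    home.receive_qwb(qwb.popleft())
```

Because the LFRead is queued behind the WB on the same ordered channel, memory already holds the modified data when it builds the Fill for P2, and the LFRead itself changes no directory field.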
In an alternative embodiment, the channels are unordered and another new channel,
the Q3 virtual channel, is added. The Q3 virtual channel has a higher
priority than the Q2 virtual channel. In this embodiment, WB commands are
issued on the Q2 virtual channel as opposed to the QWB virtual channel while
the loop commands are still issued on the QWB virtual channel. The Q3 virtual
channel is used for WB_Acknowledgments (WBAcks) from the memory subsystems to the
processors confirming receipt of WB commands from the processors.
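The two embodiments differ in which channel carries each command, which can be tabulated as below. The dictionary layout and the `outranks` helper are hypothetical; the priority lists and channel assignments are taken from the summary.

```python
# Channel priority (lowest first) and command-to-channel assignment per embodiment.
EMBODIMENTS = {
    "ordered": {
        "priority": ["Q0", "Q1", "QWB", "Q2"],
        "commands": {"WB": "QWB", "Loop": "QWB"},
    },
    "unordered": {
        "priority": ["Q0", "Q1", "QWB", "Q2", "Q3"],
        "commands": {"WB": "Q2", "Loop": "QWB", "WBAck": "Q3"},
    },
}


def outranks(embodiment, first, second):
    """True if the channel carrying `first` has priority over that of `second`."""
    config = EMBODIMENTS[embodiment]
    order = config["priority"]
    commands = config["commands"]
    return order.index(commands[first]) > order.index(commands[second])


wb_over_loop = outranks("unordered", "WB", "Loop")
ack_over_wb = outranks("unordered", "WBAck", "WB")
same_channel = outranks("ordered", "WB", "Loop")
```

In the unordered embodiment the WB command travels on a higher-priority channel than the loop command, whereas in the ordered embodiment both share the QWB channel and rely on its ordering.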
BRIEF DESCRIPTION OF THE DRAWINGS
The invention description below refers to the accompanying drawings, of which:
FIG. 1, previously discussed, is a highly schematic diagram of a conventional directory;
FIG. 2 is a highly schematic functional block diagram of a multi-processor node;
FIG. 3 is a highly schematic functional block diagram of a symmetrical multiprocessor
(SMP) computer system formed from a plurality of multi-processor nodes;
FIG. 4 is a highly schematic block diagram of a processor socket and memory
subsystem of the SMP computer system of FIG. 3;
FIG. 5 is a highly schematic block diagram of a miss address file (MAF) entry;
FIG. 6 is a highly schematic block diagram of a cache tag entry;
FIG. 7 is a highly schematic block diagram of the directory of the present invention;
FIG. 8 is a highly schematic, function block diagram of interconnect logic between
two sockets; and
FIGS. 9A-C and 10A-C illustrate an exemplary exchange of command
packets between a plurality of processors and a memory subsystem.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
FIG. 2 is a highly schematic illustration of a preferred multiprocessor node
200 for use with the present invention. The node 200 comprises a
plurality of, e.g., eight, sockets, S0-S7, which are designated by
reference numerals 202a-h. The eight sockets 202a-h are
logically located at the corners of a cube, and are interconnected by a plurality
of inter-processor links 204a-p. Thus, each socket can communicate
with any other socket of the node 200. In the illustrative embodiment, sockets
forming two opposing sides of the node 200 are fully interconnected, while
the two sides are connected only along the edges of the cube. That is, sockets
S0-S3, which form one side of the cube, and S4-S7,
which form the opposing side of the cube, are fully interconnected with each other,
while the two opposing sides are connected by four inter-socket links 204g-j.
As described herein, each socket includes one or more processors and has or is
coupled to two main memory subsystems.
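The link count implied by this topology can be checked with a short script. It is a sketch under one labeled assumption: the text does not say which socket on one side pairs with which socket on the other, so the pairing S0-S4, S1-S5, S2-S6, S3-S7 used here is hypothetical.

```python
from collections import deque
from itertools import combinations

side_a = ["S0", "S1", "S2", "S3"]
side_b = ["S4", "S5", "S6", "S7"]

links = set()
for side in (side_a, side_b):
    # Each side is fully interconnected: one link per pair of sockets.
    links.update(frozenset(pair) for pair in combinations(side, 2))
# Assumption: the four inter-side links join S0-S4, S1-S5, S2-S6 and S3-S7.
links.update(frozenset(pair) for pair in zip(side_a, side_b))


def hop_count(src, dst):
    """Breadth-first search for the fewest links between two sockets."""
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other not in seen:
                    seen.add(other)
                    frontier.append((other, dist + 1))
    return None


total_links = len(links)
worst_case = max(hop_count(a, b) for a in side_a + side_b for b in side_a + side_b)
```

Six links per side plus four between the sides gives sixteen links in total, which is consistent with the sixteen reference letters 204a-p.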
FIG. 3 is a highly schematic illustration of a symmetrical multiprocessing (SMP)
computer system 300 formed from a plurality of nodes. In particular, system 300
comprises four nodes 200a-d, each of which corresponds to node 200 (FIG. 2). The
inter-processor links have been omitted for clarity. As described above, each
node, such as nodes 200a and 200c, has eight sockets, such as sockets 202a-h and
202i-p, respectively. Each node also includes a plurality of main memory
subsystems (M0-M15). In the preferred embodiment, each node has sixteen memory
subsystems, two for each socket. The sixteen memory subsystems M0-M15 of node
200a are designated by reference numerals 302a-p. Each socket is coupled to a
pair of memory subsystems by a corresponding pair of processor/memory links.
Socket 202a, for example, is coupled to memory subsystems 302a and 302b by
processor/memory links 304a and 304b, respectively.
The four nodes 200a-d, moreover, are fully interconnected with each other
through an interconnect fabric 306. Specifically, each memory subsystem, such as
subsystems 302a and 302b, is connected to the interconnect fabric 306 by fabric
links 308. In the preferred embodiment, each memory subsystem at a given node is
coupled to its corresponding memory subsystem at the other three nodes. That is,
memory subsystem M0 at node 200a is coupled by four fabric links to the M0
memory subsystem at the three other nodes 200b-d, memory subsystem M1 at node
200a is coupled by four fabric links to the M1 memory subsystem at the other
three nodes 200b-d, and so on.
FIG. 4 is a highly schematic illustration of socket (S0) 202a, and one of its
associated memory subsystems (M0) 302a. Socket 202a includes two processor
modules 402a and 402b. Each processor module, such as module 402a, has a
processor or central processing unit (CPU) 404, a cache tags storage device 406,
a miss address file (MAF) entity 408 and a probe/response queue 410. The CPU 404
includes one or more processor caches (not shown) that are in close proximity to
the CPU for storing data that the CPU 404 is currently using or is likely to use
in the near future. Information regarding the status of the data stored in the
processor cache(s), such as the address and validity of that data, is maintained
in the cache tags storage device 406. The MAF entity 408, which keeps track of
commands, such as memory reference requests, issued to the system, has a MAF
engine 412 and a MAF table 414. MAF entity 408 may also include one or more
buffers, such as MAF buffer 416.
Processor module 402b similarly includes a CPU, a cache tags storage device, a
MAF entity and a probe/response queue. Socket (S0) 202a is coupled to the other
sockets (S1-S7) of node 200a by inter-socket links and to memory subsystems (M0)
302a and (M1) 302b (FIG. 3) by processor/memory links 304a and 304b,
respectively.
It should be understood that each processor module 402 may also include other
components, such as a write back or victim buffer, a register file, a
translation look-aside buffer (TLB), load/store (L/S) queues, etc.
The memory subsystem (M0) 302a has a memory controller 418, a directory 420 and
one or more memory modules or banks, such as memory unit 422. Memory unit 422
may be and/or may include one or more conventional or commercially available
dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate
SDRAM (DDR-SDRAM) or Rambus DRAM (RDRAM) memory devices.
The memory subsystems of nodes 200a-d combine to form the main memory of the SMP
system 300, some or all of which may be shared among the processors. Each socket
202, moreover, includes a portion of main memory by virtue of its respective
memory subsystems 302. Data stored at the memories 422 of each subsystem 302,
moreover, is organized into separately addressable memory blocks that are
equivalent in size to the amount of data stored in a processor cache line. The
memory blocks or cache lines are of uniform, fixed size, and represent the
smallest unit of data that can be moved around the SMP system 300. In the
preferred embodiment, each cache line contains 128 bytes of data, although other
fixed sizes, such as 64 bytes, could be utilized. Each memory address, moreover,
maps to and thus identifies one and only one memory block. In addition, a
plurality of address bits, such as the upper three address bits, are preferably
employed to identify the "home" memory subsystem of the respective memory block.
That is, each memory block, which is separately addressable by the SMP system
300, has a pre-determined home memory subsystem that does not change. Each
directory, moreover, maintains status information for the cache lines for which
its memory subsystem is the home memory. In other words, rather than having a
single, centralized directory, the "directory" for the SMP system 300 is
distributed across all of the memory subsystems.
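As a rough illustration of the address-to-home mapping just described, the
following sketch extracts the upper three bits of an address and aligns an
address to its 128-byte block. The address width is not given in the text, so
ADDRESS_BITS is an assumed value, and the function names are hypothetical.

```python
ADDRESS_BITS = 44   # assumption: the text does not state the address width
HOME_BITS = 3       # "the upper three address bits"
BLOCK_BYTES = 128   # cache line size in the preferred embodiment


def home_subsystem(address):
    """Return the home identifier encoded in the upper address bits."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address outside the assumed address space")
    return address >> (ADDRESS_BITS - HOME_BITS)


def block_base(address):
    """Return the address of the first byte of the enclosing memory block."""
    return address & ~(BLOCK_BYTES - 1)
```

Because the home identifier comes only from the high-order bits, every byte of
a given memory block resolves to the same home, which is consistent with the
statement that a block's home does not change.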
CPU 404 may be and/or include any one of the processors from the Itanium
architecture from Intel Corp. of Santa Clara, Calif., such as the Itanium® 1 or
Itanium® 2 processors. Nonetheless, those skilled in the art will understand
that other processors, such as the Hammer series of 64-bit processors from
Advanced Micro Devices, Inc. (AMD) of Sunnyvale, Calif., may also be used.
The processors 404 and memory subsystems 302 interact with each other by sending
"command packets" or simply "commands" to each other. Commands
may be classified generally into three types: Requests, Probes and Responses. Requests
are commands that are issued by a processor when, as a result of executing a load
or store operation, it must obtain a copy of data. Requests are also used to gain
exclusive ownership or write access to a piece of data, e.g., a memory block. Requests
include Read commands, Read_Modify (ReadMod) commands, Change_to_Dirty (CTD) commands,
and Write_Back (WB) commands, among others. Probes are commands issued to one or
more processors requesting data and/or cache tag status updates. Probe commands
include Forwarded_Read (FRead) commands, Forwarded_Read_Modify (FReadMod) commands,
and Invalidate (Inval) commands, among others. Responses are commands which carry
requested data to a processor or acknowledge some request. For Read and ReadMod
commands, the responses are Fill and Fill_Modify (FillMod) commands, respectively.
For CTD commands, the responses are CTD_Success or CTD_Failure commands. For WB
commands, the response may be a WB_Acknowledgement command.
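The classification above can be captured in a small lookup table. This is a
sketch for illustration: the command names come from the text, while the enum,
dictionaries and helper function are hypothetical, and the lists are not
exhaustive because the text says "among others".

```python
from enum import Enum


class CommandType(Enum):
    REQUEST = "request"
    PROBE = "probe"
    RESPONSE = "response"


# Commands named in the text, grouped by type.
COMMAND_TYPES = {
    "Read": CommandType.REQUEST,
    "ReadMod": CommandType.REQUEST,
    "CTD": CommandType.REQUEST,
    "WB": CommandType.REQUEST,
    "FRead": CommandType.PROBE,
    "FReadMod": CommandType.PROBE,
    "Inval": CommandType.PROBE,
    "Fill": CommandType.RESPONSE,
    "FillMod": CommandType.RESPONSE,
    "CTD_Success": CommandType.RESPONSE,
    "CTD_Failure": CommandType.RESPONSE,
    "WB_Acknowledgement": CommandType.RESPONSE,
}

# Responses the text associates with each request.
RESPONSES_FOR = {
    "Read": ("Fill",),
    "ReadMod": ("FillMod",),
    "CTD": ("CTD_Success", "CTD_Failure"),
    "WB": ("WB_Acknowledgement",),
}


def is_valid_response(request, response):
    """Check whether a response name is one the text pairs with a request."""
    return response in RESPONSES_FOR.get(request, ())
```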
The MAF table 414 is organized at least logically as a table or array having a
plurality of rows and columns whose intersections define cells for storing
information. FIG. 5 is a highly schematic block diagram of an exemplary row or
entry 500 of MAF table 414 (FIG. 4). Entry 500 has a plurality of fields
including a 1-bit active field or flag 502, which indicates whether the
respective entry 500 is active or inactive, i.e., whether the outstanding
request represented by entry 500 is complete or not. A request that is not yet
complete is considered active. Entry 500 further includes a command field 504
that specifies the particular command that is outstanding, and an address field
506 that specifies the memory address corresponding to the command. Entry 500
additionally includes an invalid count (Inval Cnt.) field 508, an
acknowledgement count (Ack Cnt.) field 510, a read pointer (ptr.) field 512, a
read chain field 514, a write pointer field 516, a write chain field 518, a
fill/marker state field 520 and a write done field 522.
MAF engine 412, among other things, operates one or more state machines for each
entry of the MAF table 414. Specifically, the read chain field 514, the write
chain field 518 and the fill/marker state field 520 each store a current state
associated with the entry. In the illustrative embodiment, a MAF entry
transitions between two fill/marker states: idle and active, and the current
fill/marker state is recorded at field 520.
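The MAF entry layout can be pictured as a record with one attribute per field.
The sketch below is illustrative only: the field names follow FIG. 5 as
described, but the Python types, the default values, the placeholder chain
states and the allocate_entry helper are assumptions not taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FillMarkerState(Enum):
    IDLE = "idle"
    ACTIVE = "active"


@dataclass
class MafEntry:
    active: bool = False                    # field 502: request still outstanding
    command: str = ""                       # field 504: outstanding command
    address: int = 0                        # field 506: memory address
    inval_count: int = 0                    # field 508
    ack_count: int = 0                      # field 510
    read_pointer: Optional[int] = None      # field 512
    read_chain: str = "idle"                # field 514 (state names assumed)
    write_pointer: Optional[int] = None     # field 516
    write_chain: str = "idle"               # field 518 (state names assumed)
    fill_marker_state: FillMarkerState = FillMarkerState.IDLE  # field 520
    write_done: bool = False                # field 522


def allocate_entry(command, address):
    """Hypothetical helper: record a newly issued, not yet complete request."""
    return MafEntry(active=True, command=command, address=address)


entry = allocate_entry("Read", 0x1200)
```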
The cache tags storage device 406 (FIG. 4) is also organized at least logically
as a table or array having a plurality of rows and columns whose intersections
define cells for storing information. FIG. 6 is a highly schematic block diagram
of an exemplary row or entry 600 of the cache tags storage device 406. As
mentioned above, each entry of the cache tags storage device 406, including
entry 600, corresponds to a particular cache line stored at the processor's
cache(s). Cache tag entry 600 includes a tag field 602 that specifies the memory
address of the respective cache line, and a series of status flags or fields,
including a shared flag 604, a dirty flag 606 and a valid flag 608.
In the illustrative embodiment, the processors and memory subsystems of the SMP
system 300 cooperate to execute a write-invalidate, ownership-based cache
coherency protocol. "Write-invalidate" implies that when a processor wishes to
modify a cache line, it causes copies of the cache line that may be located in
other processors' caches to be invalidated, rather than updating them with the
new value. "Ownership-based" implies there is always an identifiable owner for
a cache line, whether it is memory or one of the processors of the SMP system
300. The owner of a cache line, moreover, is responsible for supplying the most
up-to-date value upon request. A processor may own a cache line "exclusively" or
"shared". If a processor has exclusive ownership over a cache line, it may
modify or update the cache line without informing the system. Otherwise, it must
inform the system and potentially invalidate copies located in other processors'
caches.
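The write rule in the preceding paragraph can be sketched against the cache tag
flags of FIG. 6. This is a simplified illustration: the text does not define how
exclusive ownership is encoded in the flags, so treating "valid and not shared"
as exclusive is an assumption, and inform_system is a hypothetical callback
standing in for whatever request the processor would issue.

```python
from dataclasses import dataclass


@dataclass
class CacheTagEntry:
    tag: int               # field 602: memory address of the cache line
    shared: bool = False   # flag 604
    dirty: bool = False    # flag 606
    valid: bool = False    # flag 608


def owns_exclusively(entry):
    """Assumed encoding: a valid line that is not shared is held exclusively."""
    return entry.valid and not entry.shared


def write_line(entry, inform_system):
    """Modify a line, informing the system first unless it is held exclusively."""
    if not owns_exclusively(entry):
        # Other copies may exist, so the system must be told before the write.
        inform_system(entry.tag)
        entry.valid = True
        entry.shared = False
    entry.dirty = True


notified = []
exclusive_line = CacheTagEntry(tag=0x1200, valid=True)
shared_line = CacheTagEntry(tag=0x2400, valid=True, shared=True)
write_line(exclusive_line, notified.append)
write_line(shared_line, notified.append)
```

Only the write to the shared line triggers the callback; the exclusively held
line is modified silently, as the protocol description allows.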
Directory 420 is similarly organized at least logically as a table or array
having a plurality of rows and columns whose intersections define cells for
storing information. FIG. 7 is a highly schematic block diagram of directory
420. In accordance with the present invention, directory 420 is organized into
two regions or areas, a main directory region 702 and a write-back directory
region 704. A plurality of rows 706-710 span both regions 702 and 704 of the
directory 420. Several versions of row 706, which are described below, are
shown. Within each region 702 and 704, a plurality of columns are defined for
specifying the type of information stored in the directory's entries. The main
directory region 702, for example, has an owner/sharer column 714 for storing
the identifier (ID) assigned to the entity that owns the cache line, and a
sharer list column 716 for indicating which entities, if any, have a shared copy
of the cache line.
The sharer list column 716 is preferably configured to operate in one of two
different modes. In a first mode, sharer list column 716 is organized into two
sharer columns 716a and 716b, each of which can store the identifier (ID)
assigned to a single entity, such as a processor, of the SMP system 300 that has
a shared copy of the respective cache line. If a third entity is to be added as
a sharer, the sharer list column 716 converts from two sharer columns 716a and
716b to a single coarse sharer vector column 716c. Each bit of the sharer vector
column 716c corresponds to and thus identifies a set of one or more sockets 202
of system 300. If a bit is asserted, then at least one processor located within
the set of sockets associated with the asserted bit has a copy of the respective
cache line. Entries 707 and 709 illustrate the first mode, and entries 708 and
710 illustrate the second mode.
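The two sharer list modes, and the conversion that occurs when a third sharer
arrives, can be sketched as follows. The class and method names are
hypothetical, and the mapping from an entity ID to a bit of the coarse vector
(here, the ID modulo 16) is an assumption made only so that the example runs;
the text does not specify how sockets are grouped per bit.

```python
COARSE_VECTOR_BITS = 16  # sharer list field width given later in the text


def coarse_bit(entity_id):
    """Assumed grouping: map an entity ID onto one bit of the coarse vector."""
    return entity_id % COARSE_VECTOR_BITS


class SharerList:
    def __init__(self):
        self.sharer_ids = []   # first mode: up to two individual sharer IDs
        self.vector = None     # second mode: coarse sharer bit vector

    def add(self, entity_id):
        if self.vector is not None:
            # Already in coarse mode: assert the bit for this entity's group.
            self.vector |= 1 << coarse_bit(entity_id)
        elif entity_id in self.sharer_ids:
            return
        elif len(self.sharer_ids) < 2:
            self.sharer_ids.append(entity_id)
        else:
            # Third sharer: convert both stored IDs and the new one to a vector.
            self.vector = 0
            for sharer in self.sharer_ids + [entity_id]:
                self.vector |= 1 << coarse_bit(sharer)
            self.sharer_ids = []


sharers = SharerList()
sharers.add(3)
sharers.add(20)
two_id_snapshot = (list(sharers.sharer_ids), sharers.vector)
sharers.add(5)
```

The coarse vector trades precision for capacity: after conversion the directory
knows only which groups may hold a copy, not which individual processor does.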
Main region 702 further includes an unused column 718 and an error correction
code (ECC) column 720 for storing an ECC value calculated for the data in fields
714-718.
The write-back region 704 has a writer column 722, an unused column 724 and an
ECC column 726. As explained herein, the contents of the owner/sharer column 714
of the main region 702 together with the contents of the writer column 722 of
the write-back region 704 determine who owns the respective cache line and thus
where the most up-to-date version is located within the SMP system 300. The ECC
column 726 stores an ECC value calculated for the data in fields 722 and 724.
The unused fields 718 and 724 are provided in order to support modifications to
the protocol and/or increases in the size of the address or other fields. It
should be understood that one or more bits of unused column 718 may be used to
signify whether the corresponding entry's sharer list 716 is in individual
sharer mode, i.e., fields 716a and 716b, or in coarse sharer vector mode, i.e.,
sharer vector field 716c.
In the preferred embodiment, directory 420 is actually located within the memory
unit 422 itself along with the memory blocks, and is not a separate memory
component. That is, each memory address indexes to an area of the memory device
422 that is preferably divided into three regions. The first region corresponds
to the main directory region for the block specified by the memory address. The
second region corresponds to the write-back region for the memory block, and the
third region corresponds to the data contents of the memory block.
In the illustrative embodiment, the owner/sharer field 714 is 10 bits, the
sharer list field 716 is 16 bits, thereby supporting either two 8-bit sharer-IDs
or one 16-bit coarse sharer vector, and the unused and ECC fields 718 and 720
are each 7 bits. The main directory region 702 of a memory area is thus 5 bytes.
For the write-back region 704, the writer field is 9 bits, the unused field is
1 bit and the ECC field is 6 bits, thereby making the write-back region 2 bytes.
The third region includes the cache line, which may be 128 bytes, and a 9-byte
ECC field (not shown) for a total of 137 bytes. The ECC field associated with
the cache line contains an ECC value computed for the cache line itself.
Accordingly, for each cache line, the memory area comprises 144 bytes of
information in total.
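The byte counts in the preceding paragraph follow directly from the stated field
widths, as the short check below shows. The field widths are taken from the
text; the dictionary and function names are hypothetical.

```python
MAIN_REGION_BITS = {"owner_sharer": 10, "sharer_list": 16, "unused": 7, "ecc": 7}
WRITE_BACK_REGION_BITS = {"writer": 9, "unused": 1, "ecc": 6}
CACHE_LINE_BYTES = 128
CACHE_LINE_ECC_BYTES = 9


def region_bytes(field_bits):
    """Sum the field widths and convert to whole bytes."""
    total_bits = sum(field_bits.values())
    if total_bits % 8 != 0:
        raise ValueError("region does not fill a whole number of bytes")
    return total_bits // 8


main_bytes = region_bytes(MAIN_REGION_BITS)               # 40 bits
write_back_bytes = region_bytes(WRITE_BACK_REGION_BITS)   # 16 bits
data_bytes = CACHE_LINE_BYTES + CACHE_LINE_ECC_BYTES
memory_area_bytes = main_bytes + write_back_bytes + data_bytes
```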
As mentioned above, each CPU 404 of the SMP system 300 may access portions of
memory stored at the two memory subsystems 302 coupled to its socket, i.e., a
"local" memory access, or at the memory subsystems coupled to any other socket
of the SMP system 300, i.e., a "remote" memory access. Because the latency of a
local memory access will differ from the latency of a remote memory access, the
SMP system 300 is said to have a non-uniform memory access (NUMA) architecture.
Further, since the system 300 provides coherent caches, the system is often
called a cache-coherent NUMA (CC-NUMA) system. In the illustrative embodiment of
the invention, the SMP system 300 is preferably referred to as a distributed
shared memory system, although it may also be considered equivalent to the above
classes of systems.
Virtual Channels
Memory reference operations, such as reads, from a processor are preferably
executed by the SMP system 300 through a series of steps where each step
involves the exchange of a particular command packet or more simply command
among the processors and shared memory subsystems. The cache coherency protocol
of the present invention avoids deadlock through the creation of a plurality of
channels. Preferably, the channels share physical resources and are thus
"virtual" channels. Each virtual channel, moreover, is assigned a specific
priority relative to the other virtual channels so that, by appropriately
assigning the different types of commands to different virtual channels, the SMP
system 300 can also eliminate flow dependence. In general, commands
corresponding to later steps in the series for a given operation are assigned to
higher priority virtual channels than the commands corresponding to earlier
steps.
In accordance with the present invention, the SMP system 300 maps commands into
at least four (4) different virtual channels. A Q0 channel carries processor
command packet requests for memory space read and write transactions. A Q1
channel accommodates probe command packets to Q0 requests and has a higher
priority than Q0. A new virtual channel, which is referred to as the QWB virtual
channel, carries write-backs and other commands and has a higher priority than
Q1. A Q2 channel carries response command packets to Q0 requests and has the
highest priority. Each of the virtual channels, moreover, is implemented as an
ordered virtual channel. That is, the physical components that implement the
virtual channels are configured such that the commands in any given virtual
channel are received in the same order in which they are sent.
A suitable mechanism for implementing ordered virtual channels in a large SMP
system is described in U.S. Pat. No. 6,014,690, issued Jan. 11, 2000 for
EMPLOYING MULTIPLE CHANNELS FOR DEADLOCK AVOIDANCE IN A CACHE COHERENCY
PROTOCOL, which is hereby incorporated by reference in its entirety.
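The relative priorities and per-channel ordering of the four virtual channels
can be modeled in a few lines. This is a toy model under stated assumptions: it
treats priority as "service the highest-priority non-empty channel first", which
is one simple reading and not necessarily how the hardware or the referenced
patent arbitrates, and the class and method names are hypothetical.

```python
from collections import deque
from enum import IntEnum


class Channel(IntEnum):
    # Larger value means higher priority: Q0 < Q1 < QWB < Q2.
    Q0 = 0    # requests
    Q1 = 1    # probes
    QWB = 2   # write-backs and other commands
    Q2 = 3    # responses


class VirtualChannels:
    def __init__(self):
        # One FIFO per channel keeps commands within a channel in send order.
        self.queues = {channel: deque() for channel in Channel}

    def send(self, channel, command):
        self.queues[channel].append(command)

    def receive(self):
        """Return (channel, command) from the highest-priority non-empty queue."""
        for channel in sorted(Channel, reverse=True):
            if self.queues[channel]:
                return channel, self.queues[channel].popleft()
        return None


channels = VirtualChannels()
channels.send(Channel.Q0, "Read")
channels.send(Channel.Q1, "FRead")
channels.send(Channel.QWB, "WB")
channels.send(Channel.Q2, "Fill")
channels.send(Channel.Q0, "ReadMod")
delivered = [channels.receive() for _ in range(5)]
```

In this model the two Q0 commands are still delivered in the order they were
sent, even though commands on higher-priority channels overtake them.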
Those skilled in the art will recognize that other and/or additional virtual
channels could be defined. The virtual channels, moreover, can be configured to
carry other types of command packets. The Q0 virtual channel, for example, may
also accommodate processor command request packets for programmed input/output
(PIO) read and write transactions, including control status register (CSR)
transactions, to input/output (I/O) address space. Alternatively, a QIO virtual
channel having a priority below the Q0 virtual channel can be defined to
accommodate PIO read and write transactions.
Operation of the Distributed Directory
Each memory subsystem preferably includes a built-in self-test (BIST) engine
(not shown) that is used during initialization of the subsystem. The BIST engine
initializes the contents of the memory device 422, including the directory
contents and ECC values, by setting them to predetermined values as one of the
final steps of the self test. It should be understood that firmware, rather than
or in addition to a BIST engine, may be used for initialization purposes.
As data is brought into the SMP system 300, it is loaded into the memory devices
422 of the memory subsystems 302 in units of memory blocks or cache lines. As
each memory block is stored at a memory subsystem 302, the memory controller 418
computes a first error correction code (ECC) value for the block which is stored
along with the cache line as described above. Data may be brought into the
memory subsystems 302 from any number of sources, such as floppy disk drives,
hard disk drives, tape drives, optical or magneto-optical drives, scanners,
sound cards, etc. The memory controller 418 also