Channel-based late race resolution mechanism for a computer system. Number: 7,000,080, from the United States Patent and Trademark Office (PTO)


Title: Channel-based late race resolution mechanism for a computer system

Abstract: A channel-based mechanism resolves race conditions in a computer system between a first processor writing modified data back to memory and a second processor trying to obtain a copy of the modified data. In addition to a Q0 channel for carrying requests for data, a Q1 channel for carrying probes in response to Q0 requests, and a Q2 channel for carrying responses to Q0 requests, a new channel, the QWB channel, which has a higher priority than Q1 but lower than Q2, is also defined. When a forwarded Read command from the second processor results in a miss at the first processor's cache, because the requested memory block was written back to memory, a Loop command is issued to memory by the first processor on the QWB virtual channel. In response to the Loop command, memory sends the written back version of the memory block to the second processor.

Patent Number: 7,000,080 Issued on 02/14/2006 to Van Doren et al.


Inventors: Van Doren; Stephen R. (Northborough, MA); Tierney; Gregory E. (Chelmsford, MA)
Assignee: Hewlett-Packard Development Company, L.P. (Houston, TX)
Appl. No.: 263836
Filed: October 3, 2002

Current U.S. Class: 711/143; 711/145; 711/158
Current Intern'l Class: G06F 12/00    (20060101)
Field of Search: 711/141,143,144,145,147,158


References Cited [Referenced By]

U.S. Patent Documents
4847804    Jul., 1989    Shaffer et al.
5222224    Jun., 1993    Flynn et al.
5233616    Aug., 1993    Callander
5297269    Mar., 1994    Donaldson et al.
5303362    Apr., 1994    Butts, Jr. et al.
5313609    May, 1994    Baylor et al.
5490261    Feb., 1996    Bean et al.
5530933    Jun., 1996    Frink et al.
5537575    Jul., 1996    Foley et al.
5551005    Aug., 1996    Sarangdhar et al.
5579504    Nov., 1996    Callander et al.
5608893    Mar., 1997    Slingwine et al.
5737757    Apr., 1998    Hassoun et al.
5761731    Jun., 1998    Van Doren et al.
5905998    May, 1999    Ebrahim et al.
6014690    Jan., 2000    VanDoren et al.
6055605    Apr., 2000    Sharma et al.
6061765    May, 2000    Van Doren et al.
6088771    Jul., 2000    Steely, Jr. et al.
6094686    Jul., 2000    Sharma
6101420    Aug., 2000    VanDoren et al.
6105108    Aug., 2000    Steely, Jr. et al.
6108737    Aug., 2000    Sharma et al.
6108752    Aug., 2000    VanDoren et al.
6125429    Sep., 2000    Goodwin et al.
6154816    Nov., 2000    Steely et al.
6202126    Mar., 2001    Van Doren et al.
6249520    Jun., 2001    Steely, Jr. et al.
6249846    Jun., 2001    Van Doren et al.
6279085    Aug., 2001    Carpenter et al.
6944719    Sep., 2005    Rowlands et al.

Foreign Patent Documents
0 817 074    Jul., 1998    EP


Other References

Scales, D. and Gharachorloo, K., Design and Performance of the Shasta Distributed Shared Memory Protocol, XP-000755264, Jul. 7, 1997, pp. 245-252.
Scales, D., and Gharachorloo, K. and Thekkath, C., Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory, XP-002173083, Jan. 10, 1996, pp. 174-185.
Scales, D. and Gharachorloo, K., Towards Transparent and Efficient Software Distributed Shared Memory, XP-000771029, Dec. 1997, pp. 157-169.
Scales, D., Gharachorloo, K. and Aggarwal, A., Fine-Grain Software Distributed Shared Memory on SMP Clusters, WRL Research Report 97/3, Feb. 1997, pp. i and 1-28.
Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A. and Hennessy, J., Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, (c) 1990 IEEE, pp. 15-26.
Jouppi, N., Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, (c) 1990 IEEE, pp. 364-373.
Agarwal, A., Simoni, R., Hennessy, J. and Horowitz, M., An Evaluation of Directory Schemes for Cache Coherence, (c) 1988 IEEE, pp. 353-362.
Papamarcos, M. and Patel, J., A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories, (c) 1984 IEEE, pp. 284-290.
UltraSPARC Ultra Port Architecture (UPA): The New-Media System Architecture, http://www.sun.com/processors/whitepapers/wp95-023.html, Copyright 1994-2002 Sun Microsystems, pp. 1-4.
Porting OpenVMS Applications to Intel Itanium Architecture, Compaq Computer Corporation, Apr. 2002, pp. 1-17.
Adve, S., Hill, M., Miller, B. and Netzer, R., Detecting Data Races on Weak Memory Systems, (c) 1991 ACM, pp. 234-243.
Gharachorloo, K., Sharma, M., Steely, S. and Van Doren, S., Architecture and Design of AlphaServer GS320, Nov. 2000, pp. 1-12.
IEEE Standard for Scalable Coherent Interface (SCI), (c) 1993 IEEE, pp. Table of Contents, 30-34 and 141-188.

Primary Examiner: Anderson; Matthew D.
Assistant Examiner: Barton; Jonathan

Claims



What is claimed is:

1. In a computer system having a plurality of processors and a main memory organized into a plurality of memory blocks, the processors having one or more caches, a method for resolving a late race condition between a first processor and a second processor for a selected memory block, the method comprising:

defining a plurality of channels within the computer system for exchanging command packets among the processors and main memory, the channels including a Q0 channel for carrying requests for memory blocks, a Q1 channel, having a higher priority than the Q0 channel, for carrying probes in response to Q0 requests, a Q2 channel, having a higher priority than the Q1 channel, for carrying responses to Q0 requests, and QWB channel having a higher priority than the Q1 channel but lower than Q2 channel;

issuing a Write_Back (WB) command from the first processor to main memory, the WB command including a modified version of the selected memory block taken from the first processor's cache;

forwarding from main memory to the first processor a memory reference request specifying the selected memory block, the memory reference request initiated by the second processor;

in response to the memory reference request, issuing a Loop command from the first processor to main memory on the QWB channel;

in response to the WB command, writing the modified data back to main memory; and

in response to the Loop command, issuing a memory reference response from main memory to the second processor.

2. The method of claim 1 wherein the channels are implemented as ordered channels.

3. The method of claim 2 wherein the computer system further includes at least one directory for maintaining status information regarding the memory blocks configured at main memory, the directory having, for each memory block, an owner field specifying the owner of the respective memory block, a sharer list specifying one or more processors, if any, that have a shared copy of the respective memory block, and a writer field specifying the last processor to have written the respective memory block back to main memory, the method further comprising entering an identifier (ID) assigned to the first processor in the writer field of the directory entry for the selected memory block in response to the WB command.

4. The method of claim 3 further wherein the issuing the memory reference response to the second processor depends on the respective writer field matching the source of the Loop command.

5. The method of claim 1 wherein the WB command is issued on the QWB channel.

6. The method of claim 2 wherein the memory reference request is a request for a shared copy of the selected memory block, and the Loop command is a Loop_Forwarded Read (LFRead) command requesting main memory to send the selected memory block to the second processor.

7. The method of claim 2 wherein the memory reference request is a request for write access to the selected memory block, and the Loop command is a Loop_Forwarded_Read_Modify (LFReadMod) command requesting main memory to send the selected memory block to the second processor and to grant the second processor write access to the selected memory block.

8. The method of claim 1 wherein

the computer system has physical interconnect links and buffering resources coupling the processors and main memory, and

each channel is an independently flow-controlled virtual channel of commands that shares the physical interconnect link and buffering resources.

9. The method of claim 3 wherein the directory is free from maintaining transient states for any memory block.

10. The method of claim 1 wherein the forwarded memory reference request results in a cache miss at the first processor as the selected memory block was victimized from the first processor's cache in response to the WB command.

11. A computer system configured to resolve late race conditions, the computer system comprising:

a plurality of interconnected processors, each processor having a cache;

a main memory in communicating relationship with the plurality of processors, the main memory organized into a plurality of memory blocks; and

a plurality of channels for carrying command packets among the processors and main memory, wherein

the channels include a Q0 channel for carrying requests for memory blocks, a Q1 channel for carrying probes in response to Q0 requests, a Q2 channel for carrying responses to Q0 requests, and a QWB channel, having a higher priority than the Q1 channel but lower than Q2 channel, for carrying Loop commands from a processor to main memory in response to a forwarded memory reference request received at the processor that specifies a selected memory block that was written back to main memory.

12. The computer system of claim 11 wherein the processor is configured to write the selected memory block back to main memory by issuing a Write_Back (WB) command packet on the QWB channel to main memory, the WB command including a copy of the modified version of the selected memory block.

13. The computer system of claim 12 wherein each channel is implemented as an ordered channel.

14. The computer system of claim 13 further comprising physical interconnect links and buffering resources coupling the processors and main memory, wherein each channel is an independently flow-controlled virtual channel of commands that shares the physical interconnect link and buffering resources.

15. The computer system of claim 12 further comprising at least one directory for maintaining status information regarding the memory blocks of main memory, the directory having, for each memory block, an owner field specifying the owner of the respective memory block, a sharer list specifying zero, one or more processors that have a shared copy of the respective memory block, and a writer field specifying the last owner processor to write the respective memory block back to main memory, wherein, in response to the WB command, an identifier (ID) assigned to the first processor is entered in the writer field of the directory entry for the selected memory block.

16. In a computer system having a plurality of processors and a main memory organized into a plurality of memory blocks, the processors having one or more caches, a method for resolving a late race condition between a first processor and a second processor for a selected memory block, the method comprising:

defining a plurality of channels within the computer system for exchanging command packets among the processors and main memory, the channels including a Q0 channel for carrying requests for memory blocks, a Q1 channel, having a higher priority than the Q0 channel, for carrying probes in response to Q0 requests, a Q2 channel, having a higher priority than the Q1 channel, for carrying responses to Q0 requests, and QWB channel having a higher priority than the Q1 channel but lower than Q2 channel;

issuing a Write_Back (WB) command from the first processor, the WB command including a modified version of the selected memory block taken from the first processor's cache;

forwarding to the first processor a memory reference request specifying the selected memory block, the memory reference request initiated by the second processor;

in response to the memory reference request, issuing a Loop command from the first processor on the QWB channel;

in response to the WB command, writing the modified data back to main memory; and

in response to the Loop command, issuing a memory reference response to the second processor.

17. The method of claim 16 wherein the computer system further includes a directory, and the WB command and Loop command are received at the directory.

18. The method of claim 17 wherein the memory reference request and the Loop command are issued from the directory.

19. The method of claim 16 wherein at least part of the directory is located in the main memory of the computer system.

20. The method of claim 16 wherein the WB command is issued on the QWB channel.
Description



CROSS-REFERENCE TO RELATED APPLICATION

This application is related to the following co-pending, commonly owned U.S. patent applications:

U.S. patent application Ser. No. 10/263,739 entitled DIRECTORY STRUCTURE PERMITTING EFFICIENT WRITE-BACKS IN A SHARED MEMORY COMPUTER SYSTEM, filed Oct. 3, 2002.

U.S. patent application Ser. No. 10/263,743 entitled RETRY-BASED LATE RACE RESOLUTION MECHANISM FOR A COMPUTER SYSTEM, filed Oct. 3, 2002.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to shared memory computer architectures and, more specifically, to cache coherency protocols for use in shared memory computer systems.

2. Background Information

A computer system typically comprises one or more processors linked to a main memory by a bus or other interconnect. In most computer systems, main memory organizes the instructions and data being stored into units typically referred to as "blocks" each of which is separately addressable and may be of a fixed size. Instructions and data are typically moved about the computer system in terms of one or more blocks.

Ordinarily, a processor will retrieve data, e.g., one or more blocks, from main memory, perform some operation on it, and eventually return the results back to main memory. Retrieving data from main memory and providing it to a processor can take significant time, especially relative to the high operating speeds of today's processors. To reduce such latencies as well as to reduce the number of times a processor must access main memory, modern processors and/or processor chipsets include one or more cache memories or caches. A cache is a small, fast memory module that is placed in close proximity to the processor. Many caches are static random access memories (SRAMs), which are faster, but more expensive, than dynamic random access memories (DRAMs), which are often used for main memory. The cache is used to store information, e.g., data or instructions, which the processor is currently using or is likely to use in the near future. There are two basic types of caches: "write-through" caches and "write-back" caches.

With a write-through cache, whenever a processor modifies or updates a piece of data in the processor's cache, main memory's copy of that data is automatically updated. This is accomplished by having the processor write the data back to memory whenever the data is modified or updated. A write-back cache, in contrast, does not automatically send modified or updated data to main memory. Instead, the updated data remains in the cache until some more convenient time, e.g., when the processor is idle, at which point the modified data is written back to memory. The utilization of write-back caches typically improves system performance. In some systems, a write-back or victim buffer is provided in addition to the cache. "Victim data" refers to modified data that is being removed from the processor's cache in order to make room for new data received at the processor. Typically, the data selected for removal from the cache is data the processor is no longer using. The victim buffer stores this modified data which is waiting to be written back to main memory. The use of a victim buffer frees up space in the cache for other data. Modified data in the victim buffer is eventually "victimized", i.e., written back to main memory, at some convenient time.
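The difference between the two cache types can be illustrated with a minimal sketch. The class and method names below (Memory, WriteThroughCache, WriteBackCache, victimize) are hypothetical and chosen only for illustration; the sketch simply counts how often main memory is updated under each policy.

```python
class Memory:
    def __init__(self):
        self.blocks = {}
        self.writes = 0  # number of times main memory was updated

    def write(self, addr, value):
        self.blocks[addr] = value
        self.writes += 1


class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory.write(addr, value)  # memory is updated on every store


class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()  # blocks modified since they were last written back

    def store(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)  # memory's copy is now stale

    def victimize(self, addr):
        # Remove the block from the cache; modified data goes back to memory.
        value = self.lines.pop(addr)
        if addr in self.dirty:
            self.dirty.discard(addr)
            self.memory.write(addr, value)


wt_mem, wb_mem = Memory(), Memory()
wt, wb = WriteThroughCache(wt_mem), WriteBackCache(wb_mem)
for v in range(3):
    wt.store(0x40, v)
    wb.store(0x40, v)
wb.victimize(0x40)
print(wt_mem.writes, wb_mem.writes)  # prints "3 1"
```

In this sketch the write-back cache reaches memory once for three stores, which is the kind of traffic reduction the paragraph above attributes to write-back caches.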

Although the implementation of write-back or victim buffers has increased the performance of computer systems, there are some drawbacks. For example, the addition of a victim buffer requires additional logic and storage or memory space at the processor chipset, increasing the cost, complexity and size of the processor chipset.

Symmetrical Multiprocessor (SMP) Systems

Multiprocessor computing systems, such as symmetrical multiprocessor (SMP) systems, provide a computer environment in which software applications may run on a plurality of processors using a single address space or shared memory abstraction. In a shared memory system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value. This frees the programmer to focus on program development rather than on managing partitioned data sets and communicating values.

Cache Coherency

Because more than one processor of the SMP system may request a copy of the same memory block from main memory, cache coherency protocols have been developed to ensure that no processor relies on a memory block that has become stale, typically due to a modification or update performed to the block by some other processor. Many cache coherency protocols associate a state with each cache line. A given memory block, for example, may be in a shared state in which copies of the block may be present in the caches associated with multiple processors. When a memory block is in the shared state, a processor may read from, but not write to, the respective block. To support write operations, a memory block may be in an exclusive state. In this case, the block is owned by a single processor which may write to the cache line. When the processor updates or modifies the block, its copy becomes the most up-to-date version, while corresponding copies of the block at main memory and/or other processor caches become stale.

When a processor wishes to obtain exclusive ownership over a memory block that is currently in the shared state (i.e., copies of the block are present in the caches of other processors), invalidate requests are typically issued to those other processors. When an invalidate request is received by a given processor, its cache is searched for the specified memory block. If the block is found, it is transitioned to an invalid state. Many caches assign or associate a valid bit with each memory block or cache line stored in the cache. If the bit is asserted, then the cache line is considered to be valid and may be accessed and utilized by the processor. When a memory block is initially received from main memory, the valid bit is asserted and the memory block is stored in the cache. When an invalidate request is received, the valid bit of the respective cache line is de-asserted, thereby indicating that the cache line is no longer valid.
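The valid-bit handling described above might look roughly like the following sketch. The names CacheLine, Cache, fill, invalidate and read are illustrative assumptions, not terms from the patent.

```python
class CacheLine:
    def __init__(self, data):
        self.data = data
        self.valid = True  # asserted when the block arrives from main memory


class Cache:
    def __init__(self):
        self.lines = {}

    def fill(self, addr, data):
        self.lines[addr] = CacheLine(data)

    def invalidate(self, addr):
        # Search the cache for the block; if found, de-assert its valid bit.
        line = self.lines.get(addr)
        if line is not None:
            line.valid = False

    def read(self, addr):
        line = self.lines.get(addr)
        if line is None or not line.valid:
            return None  # treated as a miss
        return line.data


cache = Cache()
cache.fill(0x80, "block data")
before = cache.read(0x80)
cache.invalidate(0x80)
after = cache.read(0x80)
print(before, after)  # prints "block data None"
```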

There are two classes of cache coherency protocols: snooping and directory based. With snooping, the caches monitor or snoop all transactions traversing the shared memory bus, looking for transactions that reference a memory block stored at the cache. If such a transaction is detected, the cache updates the status information for its copy of the memory block based on the snoop transaction. In this way, every cache that has a copy of a given memory block also has a copy of the status information of that block. With a directory based protocol, the state of each block is kept in a single, centralized location in the system, called a directory. Status information is not maintained in the individual caches.
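As a rough illustration of the snooping approach, the sketch below has every cache observe each bus transaction and update the status of its own copy. The Bus and SnoopingCache names are hypothetical, and the specific rule used here (a write observed from another cache marks the local copy invalid) is an assumption made for the example; real snooping protocols define richer state transitions.

```python
class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.state = {}  # addr -> "shared" or "invalid"

    def snoop(self, source, kind, addr):
        # Ignore our own transactions and blocks we do not hold.
        if source is self or addr not in self.state:
            return
        if kind == "write":
            self.state[addr] = "invalid"  # assumed rule for this sketch


class Bus:
    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast(self, source, kind, addr):
        # Every attached cache sees every transaction on the shared bus.
        for cache in self.caches:
            cache.snoop(source, kind, addr)


bus = Bus()
c1, c2 = SnoopingCache("P1"), SnoopingCache("P2")
bus.attach(c1)
bus.attach(c2)
c1.state[0x100] = "shared"
c2.state[0x100] = "shared"
bus.broadcast(c1, "write", 0x100)
print(c1.state[0x100], c2.state[0x100])  # prints "shared invalid"
```

A directory based protocol, by contrast, would keep this status in one place rather than in each cache.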

FIG. 1 is a highly schematic illustration of a prior art directory 100. Directory 100 has a plurality of entries 102a-d each of which corresponds to a respective memory block. The directory 100 is organized, moreover, such that each entry 102a-d has a plurality of fields or cells for storing state and/or status information for the respective block. In particular, the directory 100 has an address column 103 that stores the address of the memory block, an owner column 104 that stores the identity of the entity, e.g., a processor or main memory itself, that is considered to be the owner of the memory block, and a sharer column 106 that stores the identity of those processors or other system entities that have a shared copy of the block.

The sharer column 106 may have a plurality of sub-columns 106a-c, each of which may contain the identity of a particular processor that has a shared copy of the respective memory block. If a request for shared access to a memory block is received from a first processor, P1, main memory examines the directory entry, e.g., entry 102c, for the block to determine its owner. As memory is itself the owner of the block, memory sends its copy of the block to P1 and enters P1's identifier (ID) into one of the sharer fields, e.g. field 106b, of the respective directory entry, e.g., entry 102c, thereby noting that P1 has a shared copy of the block. Since P1 only requested shared access to the memory block, the contents of the entry's owner field 104 are not modified.

If P1 issues a request for exclusive or write access to some other memory block, e.g., the block corresponding to entry 102d, main memory again examines the contents of entry 102d. Suppose that, at the time the request is received, the owner field reflected that memory was the owner of the memory block as shown in parentheses. In this case, memory sends the block to P1, and replaces the contents of the owner field 104 with P1's ID to reflect that P1, rather than memory, is now the owner of the memory block. P1 may then modify or update the memory block. If a request from a second processor, P2, is subsequently received for a shared copy of this memory block, main memory examines entry 102d of the directory 100 and determines that P1 is the owner of the memory block. Because its copy of the block, i.e., the copy stored at main memory, may be stale, memory does not forward its copy to P2. Instead, memory may be configured to forward the request to P1 and add P2's ID to one of the sharer fields, e.g., field 106a. In response to the forwarded request, P1 may then supply P2 with a copy of the modified memory block from P1's cache. Alternatively, main memory may be configured to force P1 to relinquish ownership of the memory block and return the modified version to memory so that memory can send a copy of the up-to-date version to P2.
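The P1 and P2 walk-through for the prior art directory of FIG. 1 can be sketched as follows. The function names and the returned action tuples are assumptions for illustration, and only the cases described above are covered: a shared request, and an exclusive request for a block that memory owns.

```python
MEMORY = "memory"  # marker meaning main memory itself owns the block


class DirectoryEntry:
    def __init__(self):
        self.owner = MEMORY
        self.sharers = []


def request_shared(entry, requester):
    """Handle a request for a shared copy; the owner field is not modified."""
    entry.sharers.append(requester)
    if entry.owner == MEMORY:
        return ("memory_sends_block", requester)
    # Memory's copy may be stale, so the request is forwarded to the owner.
    return ("forward_to_owner", entry.owner)


def request_exclusive(entry, requester):
    """Handle a request for write access when memory owns the block."""
    if entry.owner != MEMORY:
        raise NotImplementedError("only the memory-owned case is sketched")
    entry.owner = requester
    return ("memory_sends_block", requester)


entry_102c, entry_102d = DirectoryEntry(), DirectoryEntry()
r1 = request_shared(entry_102c, "P1")     # P1 reads the block of entry 102c
r2 = request_exclusive(entry_102d, "P1")  # P1 takes ownership of 102d's block
r3 = request_shared(entry_102d, "P2")     # P2 then asks for a shared copy
print(r1, r2, r3)
```

The alternative mentioned above, in which memory forces P1 to relinquish ownership and return the modified block, is not modeled here.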

It has been recognized that a computer system's cache coherency protocol is a key factor in the system's ultimate performance. Poorly designed cache coherency protocols can result in latencies, bottlenecks, other inefficiencies and/or higher complexity, each of which may reduce performance and/or increase cost. Bottlenecks, for example, often arise in high occupancy controllers, such as directory controllers. "Occupancy" is a term of art and refers to the amount of time a controller is unavailable, e.g., for the servicing of requests, following receipt of an earlier request.

In some cache coherency protocols, when a directory controller receives a request corresponding to a memory block, it thereafter becomes unavailable to service other requests for that memory block until certain acknowledgements to the earlier request are received back at the directory controller. The stalling of requests or references until the directory controller is once again available may degrade system performance. Thus, efforts have been made to design low occupancy cache coherency protocols, which allow multiple requests to the same memory block to be executing substantially simultaneously within the computer system.

Low occupancy cache coherency protocols can nonetheless result in the creation of coherency races that, in turn, can cause system deadlock and/or starvation. Accordingly, a need exists for a low occupancy cache coherency protocol that avoids deadlock and/or starvation in the face of coherency races.

SUMMARY OF THE INVENTION

Briefly, the invention relates to a mechanism for resolving late races involving write-backs to memory by creating a new virtual channel and a new message to be transmitted in the new virtual channel. The channel-based late race resolution mechanism of the present invention is designed for use in a shared memory computer system, such as a symmetrical multiprocessor (SMP) computer system. The SMP system may comprise one or more nodes each having a plurality of processors and a plurality of shared memory subsystems coupled together by an interconnect fabric. The memory subsystems are configured to store data in terms of memory blocks, and each processor preferably has a cache for storing copies of memory blocks being used by the processor. Each processor further includes a miss address file (MAF) that keeps track of requests issued to a memory subsystem for a memory block not currently stored in the processor's cache. Each memory subsystem, moreover, has a memory controller and a directory for maintaining owner and sharer status information for the memory blocks for which the memory subsystem is responsible, i.e., those memory blocks for which the memory subsystem is the "home" memory.

In the illustrative embodiment, the directory has a plurality of entries each of which is assigned to a respective memory block, and is organized into a main directory region and a write-back directory region. In the main directory region, each entry has a single owner/sharer field and a sharer list. The owner/sharer field indicates which entity, e.g., processor, is considered to be the owner of the block. The sharer list indicates which entities, e.g., processors, have a copy of the memory block in their caches. In the write-back directory region, each entry has a writer field identifying the last owner to have written the memory block back to the memory subsystem.

The processors and memory subsystems of the SMP system communicate with each other by exchanging command packets that are carried by the SMP system within a plurality of virtual channels. The virtual channels are utilized to avoid deadlock and prevent starvation. They include a Q0 virtual channel for carrying memory reference requests, a Q1 virtual channel, which has a higher priority than Q0, for carrying probes in response to Q0 requests, and a Q2 virtual channel, which has a higher priority than Q1, for carrying responses to Q0 requests. In accordance with the present invention, there is also a new virtual channel, the QWB virtual channel, which has a higher priority than Q1 but lower than Q2. In the illustrative embodiment, each of the virtual channels is an ordered communication channel.

In operation, when a first processor requests write access over a given memory block, the owner/sharer field of the respective directory entry is loaded with an identifier (ID) assigned to the first processor, thereby reflecting that the first processor is the owner of the memory block and has the most up-to-date copy. When the first processor completes its modification of the memory block, it issues a Write_Back (WB) command on the new QWB virtual channel to the memory subsystem. Here, the writer field of the respective directory entry is loaded with the first processor's ID, the owner/sharer field is left unchanged, and the modified data is written back to memory. Preferably, the processors do not have victim caches and thus do not buffer a copy of modified data pending completion of a WB command.

Suppose a Read command is issued for the memory block by a second processor before the WB command from the first processor is received at the directory. As the first processor is still considered to be the owner of the memory block, a probe, such as a Forwarded_Read (FRead) command, is preferably sent to the first processor on the Q1 virtual channel directing it to service the Read command out of the first processor's cache. At the first processor, however, a miss will occur as the first processor sent the modified data back to main memory in the WB command. This condition is known as a late race condition.

To resolve the late race, the first processor issues a new command, called a Loop_Forwarded_Read (LFRead) command, to main memory, also on the QWB virtual channel. Because the QWB virtual channel is an ordered channel, the WB command arrives at the home memory before the LFRead. The WB command is processed by the memory subsystem as described above. That is, the writer field is updated with the first processor's ID and the modified data is written back to memory. When the LFRead is received, the memory subsystem compares the directory entry's writer field with the ID of the entity that sourced the LFRead command. As the two values match, the memory subsystem responds by issuing a Fill command to the second processor on the Q2 virtual channel that includes a copy of the requested memory block from memory. The second processor thus receives the requested data, thereby completing the memory reference operation. Notably, the LFRead command does not cause any change to the directory state.
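By way of illustration only, the late race and its resolution can be sketched in Python. The class `Home`, the tuple layout of the commands and the processor names "P1" and "P2" are hypothetical and are not taken from the specification. The sketch models only the case described above, in which the writer field matches the source of the LFRead; the handling of a mismatch is not described in this section and is therefore omitted.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Home:
    """Hypothetical model of one directory entry plus its memory block."""
    owner: str                 # owner/sharer field
    writer: str = ""           # writer field of the write-back region
    data: str = "stale"        # block contents held in memory
    fills: list = field(default_factory=list)  # Fill commands issued on Q2

    def handle_read(self, requester):
        # Models only the case in the text: a processor is still recorded
        # as owner, so the Read is forwarded to it as an FRead probe on Q1.
        return ("FRead", self.owner, requester)

    def handle_qwb(self, command):
        kind, source, payload = command
        if kind == "WB":
            self.writer = source   # owner/sharer field is left unchanged
            self.data = payload    # modified data is written back
        elif kind == "LFRead":
            # payload is the original requester; the directory is not modified.
            if self.writer == source:
                self.fills.append(("Fill", payload, self.data))


home = Home(owner="P1")
qwb = deque()  # ordered QWB channel from P1 to the home memory

# P1 writes the block back; the data leaves P1's cache immediately.
qwb.append(("WB", "P1", "modified"))

# P2's Read reaches the directory before the WB does.
_, probed_owner, requester = home.handle_read("P2")

# P1 misses on the FRead and loops it back on the same ordered channel.
qwb.append(("LFRead", probed_owner, requester))

# Ordered delivery: the WB is processed before the LFRead.
while qwb:
    home.handle_qwb(qwb.popleft())
```

Because both commands travel in the same first-in, first-out queue, the writer field has already been updated by the time the LFRead is examined, which is the property the comparison relies upon.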

In an alternative embodiment, the channels are unordered and another new channel, the Q3 virtual channel, is added. The Q3 virtual channel has a higher priority than the Q2 virtual channel. In this embodiment, WB commands are issued on the Q2 virtual channel as opposed to the QWB virtual channel, while the loop commands are still issued on the QWB virtual channel. The Q3 virtual channel is used for WB_Acknowledgments (WBAcks) from the memory subsystems to the processors confirming receipt of WB commands from the processors.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention description below refers to the accompanying drawings, of which:

FIG. 1, previously discussed, is a highly schematic diagram of a conventional directory;

FIG. 2 is a highly schematic functional block diagram of a multi-processor node;

FIG. 3 is a highly schematic functional block diagram of a symmetrical multiprocessor (SMP) computer system formed from a plurality of multi-processor nodes;

FIG. 4 is a highly schematic block diagram of a processor socket and memory subsystem of the SMP computer system of FIG. 3;

FIG. 5 is a highly schematic block diagram of a miss address file (MAF) entry;

FIG. 6 is a highly schematic block diagram of a cache tag entry;

FIG. 7 is a highly schematic block diagram of the directory of the present invention;

FIG. 8 is a highly schematic, functional block diagram of interconnect logic between two sockets; and

FIGS. 9A-C and 10A-C illustrate an exemplary exchange of command packets between a plurality of processors and a memory subsystem.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

FIG. 2 is a highly schematic illustration of a preferred multiprocessor node 200 for use with the present invention. The node 200 comprises a plurality of, e.g., eight, sockets, S0-S7, which are designated by reference numerals 202a-h. The eight sockets 202a-h are logically located at the corners of a cube, and are interconnected by a plurality of inter-processor links 204a-p. Thus, each socket can communicate with any other socket of the node 200. In the illustrative embodiment, sockets forming two opposing sides of the node 200 are fully interconnected, while the two sides are connected only along the edges of the cube. That is, sockets S0-S3, which form one side of the cube, and S4-S7, which form the opposing side of the cube, are fully interconnected with each other, while the two opposing sides are connected by four inter-socket links 204g-j. As described herein, each socket includes one or more processors and has or is coupled to two main memory subsystems.
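By way of illustration only, the link topology just described can be checked with a short Python sketch. The pairing of socket S0 with S4, S1 with S5, and so on for the four links between the two sides is an assumption made for the example; the text states only that four links join the sides along the edges of the cube. The names `SIDE_A`, `SIDE_B`, `build_links` and `hops` are hypothetical.

```python
from collections import deque
from itertools import combinations

SIDE_A = ["S0", "S1", "S2", "S3"]
SIDE_B = ["S4", "S5", "S6", "S7"]


def build_links():
    """Return the set of links as unordered socket pairs."""
    links = set()
    # Each side of the cube is fully interconnected.
    for side in (SIDE_A, SIDE_B):
        for a, b in combinations(side, 2):
            links.add(frozenset((a, b)))
    # Four links join the two sides (the pairing is assumed).
    for a, b in zip(SIDE_A, SIDE_B):
        links.add(frozenset((a, b)))
    return links


def hops(links, src, dst):
    """Breadth-first search for the minimum number of links from src to dst."""
    neighbors = {}
    for link in links:
        a, b = tuple(link)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    distance = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return distance[node]
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None


LINKS = build_links()
```

Under these assumptions the sketch yields sixteen links, which agrees with the sixteen reference numerals 204a-p, and every socket reaches every other socket in at most two hops.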

FIG. 3 is a highly schematic illustration of a symmetrical multiprocessing (SMP) computer system 300 formed from a plurality of nodes. In particular, system 300 comprises four nodes 200a-d, each of which corresponds to node 200 (FIG. 2). The inter-processor links have been omitted for clarity. As described above, each node, such as nodes 200a and 200c, has eight sockets, such as sockets 202a-h and 202i-p, respectively. Each node also includes a plurality of main memory subsystems (M0-M15). In the preferred embodiment, each node has sixteen memory subsystems, two for each socket. The sixteen memory subsystems M0-M15 of node 200a are designated by reference numerals 302a-p. Each socket is coupled to a pair of memory subsystems by a corresponding pair of processor/memory links. Socket 202a, for example, is coupled to memory subsystems 302a and 302b by processor/memory links 304a and 304b, respectively.

The four nodes 200a-d, moreover, are fully interconnected with each other through an interconnect fabric 306. Specifically, each memory subsystem, such as subsystems 302a and 302b, is connected to the interconnect fabric 306 by fabric links 308. In the preferred embodiment, each memory subsystem at a given node is coupled to its corresponding memory subsystem at the other three nodes. That is, memory subsystem M0 at node 200a is coupled by four fabric links to the M0 memory subsystem at the three other nodes 200b-d, memory subsystem M1 at node 200a is coupled by four fabric links to the M1 memory subsystem at the other three nodes 200b-d, and so on.

FIG. 4 is a highly schematic illustration of socket (S0) 202a, and one of its associated memory subsystems (M0) 302a. Socket 202a includes two processor modules 402a and 402b. Each processor module, such as module 402a, has a processor or central processing unit (CPU) 404, a cache tags storage device 406, a miss address file (MAF) entity 408 and a probe/response queue 410. The CPU 404 includes one or more processor caches (not shown) that are in close proximity to the CPU for storing data that the CPU 404 is currently using or is likely to use in the near future. Information regarding the status of the data stored in the processor cache(s), such as the address and validity of that data, is maintained in the cache tags storage device 406. The MAF entity 408, which keeps track of commands, such as memory reference requests, issued to the system, has a MAF engine 412 and a MAF table 414. MAF entity 408 may also include one or more buffers, such as MAF buffer 416.

Processor module 402b similarly includes a CPU, a cache tags storage device, a MAF entity and a probe/response queue. Socket (S0) 202a is coupled to the other sockets (S1-S7) of node 200a by inter-socket links and to memory subsystems (M0) 302a and (M1) 302b (FIG. 3) by processor/memory links 304a and 304b, respectively.

It should be understood that each processor module 402 may also include other components, such as a write back or victim buffer, a register file, a translation look-aside buffer (TLB), load/store (L/S) queues, etc.

The memory subsystem (M0) 302a has a memory controller 418, a directory 420 and one or more memory modules or banks, such as memory unit 422. Memory unit 422 may be and/or may include one or more conventional or commercially available dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR-SDRAM) or Rambus DRAM (RDRAM) memory devices.

The memory subsystems of nodes 200a-d combine to form the main memory of the SMP system 300, some or all of which may be shared among the processors. Each socket 202, moreover, includes a portion of main memory by virtue of its respective memory subsystems 302. Data stored at the memories 422 of each subsystem 302, moreover, is organized into separately addressable memory blocks that are equivalent in size to the amount of data stored in a processor cache line. The memory blocks or cache lines are of uniform, fixed size, and represent the smallest unit of data that can be moved around the SMP system 300. In the preferred embodiment, each cache line contains 128-bytes of data, although other fixed sizes, such as 64-bytes, could be utilized. Each memory address, moreover, maps to and thus identifies one and only one memory block. In addition, a plurality of address bits, such as the upper three address bits, are preferably employed to identify the "home" memory subsystem of the respective memory block. That is, each memory block, which is separately addressable by the SMP system 300, has a pre-determined home memory subsystem that does not change. Each directory, moreover, maintains status information for the cache lines for which its memory subsystem is the home memory. In other words, rather than having a single, centralized directory, the "directory" for the SMP system 300 is distributed across all of the memory subsystems.
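By way of illustration only, the mapping from a memory address to its block and home memory subsystem may be sketched in Python as follows. The address width `ADDR_BITS` is an assumption chosen for the example and is not given in the text, as are the function names. The 128-byte line size and the use of the upper three address bits follow the examples given above.

```python
LINE_BYTES = 128   # cache line size of the preferred embodiment
ADDR_BITS = 43     # assumed physical address width (not stated in the text)
HOME_BITS = 3      # "such as the upper three address bits"


def home_subsystem(address):
    """Select the home memory subsystem from the upper address bits."""
    return address >> (ADDR_BITS - HOME_BITS)


def block_address(address):
    """Address of the first byte of the memory block containing address."""
    return address & ~(LINE_BYTES - 1)


def byte_offset(address):
    """Position of the addressed byte within its memory block."""
    return address & (LINE_BYTES - 1)
```

Because the home is a pure function of the address, every address within a block yields the same home, which is consistent with the statement that the home memory subsystem of a block does not change.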

CPU 404 may be and/or include any one of the processors from the Itanium architecture from Intel Corp. of Santa Clara, Calif., such as the Itanium® 1 or Itanium® 2 processors. Nonetheless, those skilled in the art will understand that other processors, such as the Hammer series of 64-bit processors from Advanced Micro Devices, Inc. (AMD) of Sunnyvale, Calif., may also be used.

The processors 404 and memory subsystems 302 interact with each other by sending "command packets" or simply "commands" to each other. Commands may be classified generally into three types: Requests, Probes and Responses. Requests are commands that are issued by a processor when, as a result of executing a load or store operation, it must obtain a copy of data. Requests are also used to gain exclusive ownership or write access to a piece of data, e.g., a memory block. Requests include Read commands, Read_Modify (ReadMod) commands, Change_to_Dirty (CTD) commands, and Write_Back (WB) commands, among others. Probes are commands issued to one or more processors requesting data and/or cache tag status updates. Probe commands include Forwarded_Read (FRead) commands, Forwarded_Read_Modify (FReadMod) commands, and Invalidate (Inval) commands, among others. Responses are commands which carry requested data to a processor or acknowledge some request. For Read and ReadMod commands, the responses are Fill and Fill_Modify (FillMod) commands, respectively. For CTD commands, the responses are CTD_Success or CTD_Failure commands. For WB commands, the response may be a WB_Acknowledgement command.
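By way of illustration only, the three command types and the request/response pairings named above can be tabulated in Python. The names `Kind`, `COMMAND_KIND`, `RESPONSES_FOR` and `is_valid_response` are hypothetical. Only commands named in the text are listed, although the text notes that others exist.

```python
from enum import Enum


class Kind(Enum):
    REQUEST = "request"
    PROBE = "probe"
    RESPONSE = "response"


# Classification of the commands named in the text.
COMMAND_KIND = {
    "Read": Kind.REQUEST,
    "ReadMod": Kind.REQUEST,
    "CTD": Kind.REQUEST,
    "WB": Kind.REQUEST,
    "FRead": Kind.PROBE,
    "FReadMod": Kind.PROBE,
    "Inval": Kind.PROBE,
    "Fill": Kind.RESPONSE,
    "FillMod": Kind.RESPONSE,
    "CTD_Success": Kind.RESPONSE,
    "CTD_Failure": Kind.RESPONSE,
    "WB_Acknowledgement": Kind.RESPONSE,
}

# Responses that may answer each request, per the text.
RESPONSES_FOR = {
    "Read": ("Fill",),
    "ReadMod": ("FillMod",),
    "CTD": ("CTD_Success", "CTD_Failure"),
    "WB": ("WB_Acknowledgement",),
}


def is_valid_response(request, response):
    """True if the response is one the text pairs with the request."""
    return response in RESPONSES_FOR.get(request, ())
```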

The MAF table 414 is organized at least logically as a table or array having a plurality of rows and columns whose intersections define cells for storing information. FIG. 5 is a highly schematic block diagram of an exemplary row or entry 500 of MAF table 414 (FIG. 4). Entry 500 has a plurality of fields including a 1-bit active field or flag 502, which indicates whether the respective entry 500 is active or inactive, i.e., whether the outstanding request represented by entry 500 is complete or not. A request that is not yet complete is considered active. Entry 500 further includes a command field 504 that specifies the particular command that is outstanding, and an address field 506 that specifies the memory address corresponding to the command. Entry 500 additionally includes an invalid count (Inval Cnt.) field 508, an acknowledgement count (Ack Cnt.) field 510, a read pointer (ptr.) field 512, a read chain field 514, a write pointer field 516, a write chain field 518, a fill/marker state field 520 and a write done field 522.

MAF engine 412, among other things, operates one or more state machines for each entry of the MAF table 414. Specifically, the read chain field 514, the write chain field 518 and the fill/marker field 520 each store a current state associated with the entry. In the illustrative embodiment, a MAF entry transitions between two fill/marker states: idle and active, and the current fill/marker state is recorded at field 520.
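By way of illustration only, a MAF entry and a simple MAF table may be sketched in Python. The field names follow FIG. 5 as described above. The Python types, the default values and the `allocate`, `find` and `retire` helpers are assumptions made for the example, and the state values of the read chain and write chain fields are not listed in this section and are therefore left unspecified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FillMarkerState(Enum):
    IDLE = "idle"
    ACTIVE = "active"


@dataclass
class MafEntry:
    active: bool = False                 # field 502
    command: str = ""                    # field 504
    address: int = 0                     # field 506
    inval_cnt: int = 0                   # field 508
    ack_cnt: int = 0                     # field 510
    read_ptr: Optional[int] = None       # field 512 (type assumed)
    read_chain: Optional[str] = None     # field 514 (states not listed here)
    write_ptr: Optional[int] = None      # field 516 (type assumed)
    write_chain: Optional[str] = None    # field 518 (states not listed here)
    fill_marker_state: FillMarkerState = FillMarkerState.IDLE  # field 520
    write_done: bool = False             # field 522


class MafTable:
    """Hypothetical fixed-size table of MAF entries."""

    def __init__(self, size):
        self.entries = [MafEntry() for _ in range(size)]

    def allocate(self, command, address):
        """Claim the first inactive entry for an outstanding command."""
        for index, entry in enumerate(self.entries):
            if not entry.active:
                self.entries[index] = MafEntry(
                    active=True, command=command, address=address)
                return index
        return None  # table full

    def find(self, address):
        """Index of the active entry for address, if any."""
        for index, entry in enumerate(self.entries):
            if entry.active and entry.address == address:
                return index
        return None

    def retire(self, index):
        """Mark the request represented by the entry as complete."""
        self.entries[index].active = False
```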

The cache tags storage device 406 (FIG. 4) is also organized at least logically as a table or array having a plurality of rows and columns whose intersections define cells for storing information. FIG. 6 is a highly schematic block diagram of an exemplary row or entry 600 of the cache tags storage device 406. As mentioned above, each entry of the cache tags storage device 406, including entry 600, corresponds to a particular cache line stored at the processor's cache(s). Cache tag entry 600 includes a tag field 602 that specifies the memory address of the respective cache line, and a series of status flags or fields, including a shared flag 604, a dirty flag 606 and a valid flag 608.
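By way of illustration only, a cache tag entry and a tag lookup may be sketched in Python. The `lookup` helper and the representation of the flags as booleans are assumptions made for the example. Modeling a write-back by clearing the valid flag is likewise an assumption, chosen to reflect a processor that does not retain a copy of the block once it has been written back.

```python
from dataclasses import dataclass


@dataclass
class CacheTagEntry:
    tag: int               # field 602: memory address of the cache line
    shared: bool = False   # flag 604
    dirty: bool = False    # flag 606
    valid: bool = False    # flag 608


def lookup(tags, address):
    """Return the valid entry whose tag matches address, or None on a miss."""
    for entry in tags:
        if entry.valid and entry.tag == address:
            return entry
    return None


# A modified line is present, so a probe for it hits.
tags = [CacheTagEntry(tag=0x1000, dirty=True, valid=True)]
hit_before = lookup(tags, 0x1000)

# After the line is written back (modeled here by clearing valid),
# the same probe misses.
tags[0].valid = False
hit_after = lookup(tags, 0x1000)
```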

In the illustrative embodiment, the processors and memory subsystems of the SMP system 300 cooperate to execute a write-invalidate, ownership-based cache coherency protocol. "Write-invalidate" implies that when a processor wishes to modify a cache line, it causes copies of the cache line that may be located in other processors' caches to be invalidated, rather than updating them with the new value. "Ownership-based" implies there is always an identifiable owner for a cache line, whether it is memory or one of the processors of the SMP system 300. The owner of a cache line, moreover, is responsible for supplying the most up-to-date value upon request. A processor may own a cache line "exclusively" or "shared". If a processor has exclusive ownership over a cache line, it may modify or update the cache line without informing the system. Otherwise, it must inform the system and potentially invalidate copies located in other processors' caches.

Directory 420 is similarly organized at least logically as a table or array having a plurality of rows and columns whose intersections define cells for storing information. FIG. 7 is a highly schematic block diagram of directory 420. In accordance with the present invention, directory 420 is organized into two regions or areas, a main directory region 702 and a write-back directory region 704. A plurality of rows 706-710 span both regions 702 and 704 of the directory 420. Several versions of row 706, which are described below, are shown. Within each region 702 and 704, a plurality of columns are defined for specifying the type of information stored in the directory's entries. The main directory region 702, for example, has an owner/sharer column 714 for storing the identifier (ID) assigned to the entity that owns the cache line, and a sharer list column 716 for indicating which entities, if any, have a shared copy of the cache line.

The sharer list column 716 is preferably configured to operate in one of two different modes. In a first mode, sharer list column 716 is organized into two sharer columns 716a and 716b each of which can store the identifier (ID) assigned to a single entity, such as a processor, of the SMP system 300 that has a shared copy of the respective cache line. If a third entity is to be added as a sharer, the sharer list column 716 converts from two sharer columns 716a and 716b to a single coarse sharer vector column 716c. Each bit of the sharer vector column 716c corresponds to and thus identifies a set of one or more sockets 202 of system 300. If a bit is asserted, then at least one processor located within the set of sockets associated with the asserted bit has a copy of the respective cache line. Entries 707 and 709 illustrate the first mode, and entries 708 and 710 illustrate the second mode. Main region 702 further includes an unused column 718 and an error correction code (ECC) column 720 for storing an ECC value calculated for the data in fields 714-718.
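By way of illustration only, the two modes of the sharer list may be sketched in Python. The 16-bit vector width is taken from the field sizes given elsewhere in this description. The mapping of a processor ID to a socket (two processors per socket) and of a socket to a vector bit (two sockets per bit) is an assumption made for the example, as are the names `coarse_bit` and `SharerList`.

```python
from dataclasses import dataclass, field

VECTOR_BITS = 16        # width of sharer vector field 716c
PROCS_PER_SOCKET = 2    # two processor modules per socket
SOCKETS_PER_BIT = 2     # assumed: 32 sockets spread over 16 bits


def coarse_bit(proc_id):
    """Bit mask for the set of sockets that contains the given processor."""
    socket = proc_id // PROCS_PER_SOCKET
    return 1 << (socket // SOCKETS_PER_BIT)


@dataclass
class SharerList:
    ids: list = field(default_factory=list)  # first mode: fields 716a, 716b
    vector: int = 0                          # second mode: field 716c
    coarse: bool = False

    def add(self, proc_id):
        if self.coarse:
            self.vector |= coarse_bit(proc_id)
        elif proc_id in self.ids:
            return
        elif len(self.ids) < 2:
            self.ids.append(proc_id)
        else:
            # A third sharer forces conversion to the coarse vector.
            self.coarse = True
            for existing in self.ids + [proc_id]:
                self.vector |= coarse_bit(existing)
            self.ids = []

    def may_share(self, proc_id):
        """True if the processor may hold a copy (exact only in first mode)."""
        if self.coarse:
            return bool(self.vector & coarse_bit(proc_id))
        return proc_id in self.ids
```

In the second mode an asserted bit indicates only that at least one processor in the associated set of sockets has a copy, so `may_share` can report a processor that never requested the line. That imprecision is the cost of tracking more than two sharers in a field of the same width.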

The write-back region 704 has a writer column 722, an unused column 724 and an ECC column 726. As explained herein, the contents of the owner/sharer column 714 of the main region 702 together with the contents of the writer column 722 of the write-back region 704 determine who owns the respective cache line and thus where the most up-to-date version is located within the SMP system 300. The ECC column 726 stores an ECC value calculated for the data in fields 722 and 724.
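By way of illustration only, one way of combining the two fields can be sketched in Python. The rule shown is an inference from the write-back behavior described in the Summary, in which a write-back loads the writer field with the processor's ID and leaves the owner/sharer field unchanged; it is not quoted from the text. The function name and the string "memory" are hypothetical.

```python
def current_owner(owner_sharer, writer):
    """Infer where the most up-to-date copy of a cache line resides.

    Assumed rule: if the entity recorded as owner is also the last
    writer, its write-back has reached memory, so memory holds the
    most up-to-date copy. Otherwise the recorded owner still holds it.
    """
    if owner_sharer == writer:
        return "memory"
    return owner_sharer
```

For example, with processor P1 recorded as owner and no write-back yet received, the owner is P1; once the write-back from P1 has set the writer field to P1, the same entry resolves to memory without the owner/sharer field having been modified.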

The unused fields 718 and 724 are provided in order to support modifications to the protocol and/or increases in the size of the address or other fields. It should be understood that one or more bits of unused column 718 may be used to signify whether the corresponding entry's sharer list 716 is in individual sharer mode, i.e., fields 716a and 716b, or in coarse sharer vector mode, i.e., sharer vector field 716c.

In the preferred embodiment, directory 420 is actually located within the memory unit 422 itself along with the memory blocks, and is not a separate memory component. That is, each memory address indexes to an area of the memory device 422 that is preferably divided into three regions. The first region corresponds to the main directory region for the block specified by the memory address. The second region corresponds to the write-back region for the memory block, and the third region corresponds to the data contents of the memory block.

In the illustrative embodiment, the owner/sharer field 714 is 10-bits, the sharer list field 716 is 16-bits, thereby supporting either two 8-bit sharer-IDs or one 16-bit coarse sharer vector, and the unused and ECC fields 718 and 720 are each 7-bits. The main directory region 702 of a memory area is thus 5-bytes. For the write-back region 704, the writer field is 9-bits, the unused field is 1-bit and the ECC field is 6-bits, thereby making the write-back region 2-bytes. The third region includes the cache line, which may be 128-bytes, and a 9-byte ECC field (not shown) for a total of 137-bytes. The ECC field associated with the cache line contains an ECC value computed for the cache line itself.

Accordingly, for each cache line, the memory area comprises 144-bytes of information in total.
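By way of illustration only, the byte counts stated above can be reproduced from the individual field widths with a few lines of Python. The dictionary and variable names are hypothetical; the widths are those given in the text.

```python
# Field widths in bits, as stated in the text.
MAIN_REGION_BITS = {"owner_sharer": 10, "sharer_list": 16, "unused": 7, "ecc": 7}
WRITE_BACK_REGION_BITS = {"writer": 9, "unused": 1, "ecc": 6}

# Third region, in bytes: the cache line and its ECC field.
CACHE_LINE_BYTES = 128
CACHE_LINE_ECC_BYTES = 9

main_bytes = sum(MAIN_REGION_BITS.values()) // 8               # 40 bits
write_back_bytes = sum(WRITE_BACK_REGION_BITS.values()) // 8   # 16 bits
data_bytes = CACHE_LINE_BYTES + CACHE_LINE_ECC_BYTES

total_bytes = main_bytes + write_back_bytes + data_bytes
```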

As mentioned above, each CPU 404 of the SMP system 300 may access portions of memory stored at the two memory subsystems 302 coupled to its socket, i.e., a "local" memory access, or at the memory subsystems coupled to any other socket of the SMP system 300, i.e., a "remote" memory access. Because the latency of a local memory access will differ from the latency of a remote memory access, the SMP system 300 is said to have a non-uniform memory access (NUMA) architecture. Further, since the system 300 provides coherent caches, the system is often called a cache-coherent NUMA (CC-NUMA) system. In the illustrative embodiment of the invention, the SMP system 300 is preferably referred to as a distributed shared memory system, although it may also be considered equivalent to the above classes of systems.

Virtual Channels

Memory reference operations, such as reads, from a processor are preferably executed by the SMP system 300 through a series of steps where each step involves the exchange of a particular command packet or more simply command among the processors and shared memory subsystems. The cache coherency protocol of the present invention avoids deadlock through the creation of a plurality of channels. Preferably, the channels share physical resources and are thus "virtual" channels. Each virtual channel, moreover, is assigned a specific priority relative to the other virtual channels so that, by appropriately assigning the different types of commands to different virtual channels, the SMP system 300 can also eliminate flow dependence. In general, commands corresponding to later steps in the series for a given operation are assigned to higher priority virtual channels than the commands corresponding to earlier steps.

In accordance with the present invention, the SMP system 300 maps commands into at least four (4) different virtual channels. A Q0 channel carries processor command packet requests for memory space read and write transactions. A Q1 channel accommodates probe command packets to Q0 requests and has a higher priority than Q0. A new virtual channel, which is referred to as the QWB virtual channel, carries write-backs and other commands and has a higher priority than Q1. A Q2 channel carries response command packets to Q0 requests and has the highest priority. Each of the virtual channels, moreover, is implemented as an ordered virtual channel. That is, the physical components that implement the virtual channels are configured such that the commands in any given virtual channel are received in the same order in which they are sent.
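By way of illustration only, the relative priorities and the in-order delivery within each virtual channel may be sketched in Python. The `Link` class and its strict-priority arbitration are simplifying assumptions made for the example; they are not the mechanism of the referenced hardware, which is not described in this section.

```python
from collections import deque
from enum import IntEnum


class Channel(IntEnum):
    # A larger value denotes a higher priority, per the ordering in the text.
    Q0 = 0    # requests
    Q1 = 1    # probes to Q0 requests
    QWB = 2   # write-backs and other commands
    Q2 = 3    # responses to Q0 requests


class Link:
    """Hypothetical link that multiplexes ordered virtual channels."""

    def __init__(self):
        self.queues = {channel: deque() for channel in Channel}

    def send(self, channel, command):
        self.queues[channel].append(command)

    def deliver_next(self):
        # Serve the highest-priority non-empty channel; within a channel,
        # commands leave in the order in which they were sent.
        for channel in sorted(Channel, reverse=True):
            if self.queues[channel]:
                return channel, self.queues[channel].popleft()
        return None
```

With this arrangement a command on a higher priority channel is never held behind a command on a lower priority channel, while two commands placed on the QWB channel are always received in the order sent.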

A suitable mechanism for implementing ordered virtual channels in a large SMP system is described in U.S. Pat. No. 6,014,690, issued Jan. 11, 2000 for EMPLOYING MULTIPLE CHANNELS FOR DEADLOCK AVOIDANCE IN A CACHE COHERENCY PROTOCOL, which is hereby incorporated by reference in its entirety.

Those skilled in the art will recognize that other and/or additional virtual channels could be defined. The virtual channels, moreover, can be configured to carry other types of command packets. The Q0 virtual channel, for example, may also accommodate processor command request packets for programmed input/output (PIO) read and write transactions, including control status register (CSR) transactions, to input/output (I/O) address space. Alternatively, a QIO virtual channel having a priority below the Q0 virtual channel can be defined to accommodate PIO read and write transactions.

Operation of the Distributed Directory

Each memory subsystem preferably includes a built-in, self test (BIST) engine (not shown) that is used during initialization of the subsystem. The BIST engine initializes the contents of the memory device 422, including the directory contents and ECC values, by setting them to predetermined values as one of the final steps of the self test. It should be understood that firmware, rather than or in addition to a BIST engine, may be used for initialization purposes.

As data is brought into the SMP system 300, it is loaded into the memory devices 422 of the memory subsystems 302 in units of memory blocks or cache lines. As each memory block is stored at a memory subsystem 302, the memory controller 418 computes a first error correction code (ECC) value for the block which is stored along with the cache line as described above. Data may be brought into the memory subsystems 302 from any number of sources, such as floppy disk drives, hard disk drives, tape drives, optical or magneto-optical drives, scanners, sound cards, etc. The memory controller 418 also

