Processor with multiple-thread, vertically-threaded pipeline. Number: 6,938,147, from the United States Patent and Trademark Office (PTO)


Title: Processor with multiple-thread, vertically-threaded pipeline

Abstract: A processor reduces wasted cycle time resulting from stalling and idling, and increases the proportion of execution time, by supporting and implementing both vertical multithreading and horizontal multithreading. Vertical multithreading permits overlapping or "hiding" of cache miss wait times. In vertical multithreading, multiple hardware threads share the same processor pipeline. A hardware thread is typically a process, a lightweight process, a native thread, or the like in an operating system that supports multithreading. Horizontal multithreading increases parallelism within the processor circuit structure, for example within a single integrated circuit die that makes up a single-chip processor. To further increase system parallelism in some processor embodiments, multiple processor cores are formed in a single die. Advances in on-chip multiprocessor horizontal threading are gained as processor core sizes are reduced through technological advancements.

Patent Number: 6,938,147 Issued on 08/30/2005 to Joy, et al.


Inventors: Joy; William N. (Aspen, CO); Tremblay; Marc (Menlo Park, CA); Lauterbach; Gary (Los Altos, CA); Chamdani; Joseph I. (Santa Clara, CA)
Assignee: Sun Microsystems, Inc. (Santa Clara, CA)
Appl. No.: 309732
Filed: May 11, 1999

Current U.S. Class: 712/28
Intern'l Class: G06F 009/00
Field of Search: 712/23,28,239 711/117


References Cited

U.S. Patent Documents
5361337    Nov., 1994    Okin.
5404469    Apr., 1995    Chung et al.
5452452    Sep., 1995    Gaetner et al.
5513130    Apr., 1996    Redmond.
5584023    Dec., 1996    Hsu.
5692193    Nov., 1997    Jagannathan et al.
5704054    Dec., 1997    Bhattacharya.
5721868    Feb., 1998    Yung et al.
5724565    Mar., 1998    Dubey et al.
5742806    Apr., 1998    Reiner et al.
5752027    May., 1998    Familiar.
5761285    Jun., 1998    Stent.
5778247    Jul., 1998    Tremblay.
5809415    Sep., 1998    Rossmann.
5828880    Oct., 1998    Hanko.
5861761    Jan., 1999    Kean.
5875461    Feb., 1999    Lindholm.
5881277    Mar., 1999    Bondi et al.
5890008    Mar., 1999    Panwar et al.
5909695    Jun., 1999    Wong et al.
5913925    Jun., 1999    Kahle et al.
5933627    Aug., 1999    Parady.
5953530    Sep., 1999    Rishi et al.
6463527    Oct., 2002    Vishkin.


Other References

Berekovic, M. et al.: "An Algorithm-Hardware-System Approach to VLIW Multimedia Processors" Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, Kluwer Academic Publishers, Dordrecht, NL, vol. 20, no 1/02, Oct. 1, 1998, pp. 163-179, XP000786735, ISSN: 0922-5773.
Byrd, G. et al.: "Multithreaded Processor Architectures" IEEE Spectrum, IEEE Inc. New York, US, vol. 32, no. 8, Aug. 1, 1995, pp. 38-46, XP000524885, ISSN: 0018-9235.
Fillo, M. et al.: "The M-Machine Multicomputer" Ann Arbor, Nov. 29-Dec. 1, 1995, Los Alamitos, IEEE Comp. Soc. Press, US, vol. SYMP. 28, Nov. 29, 1995, pp. 146-156, XP000585356, ISBN: 0-8186-7349-4.
Gulati, M. et al.: "Performance Study of a Multithreaded Superscalar Microprocessor" Proceedings. International Symposium on High-Performance Computer Architecture, 1996, XP000572068.
Horel, T. et al.: "UltraSPARC-III: Designing Third-Generation 64-Bit Performance" IEEE Micro, US, IEEE Inc. New York, vol. 19, no. 3, May 1999, pp. 73-85, XP000832022, ISSN: 0272-1732.
Tullsen, D. M. et al.: "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor" Computer Architecture News, Association for Computing Machinery, New York, US, vol. 24, no. 2, May 1, 1996, pp. 191-202, XP000592185, ISSN: 0163-5964.
Tremblay et al., "A Three Dimensional Register File for Superscalar Processors", Jan. 1995, pp. 191-201, Proceedings of the 28th Annual Hawaii International Conference on Systems Sciences.

Primary Examiner: Eng; David Y.
Attorney, Agent or Firm: Zagorin O'Brien Graham LLP

Claims



1. A processor comprising:

a shared processor pipeline that includes therein a plurality of multiple-bit flip-flops, each multiple-bit flip-flop capable of concurrently holding in the shared processor pipeline, at least a portion of thread state for a plurality of execution threads, one of the execution threads being actively executed in the shared processor pipeline at a given time; and

thread control logic coupled to the shared processor pipeline and capable of controlling the shared processor pipeline to select thread state for an active one of the execution threads, including the portion of the thread state represented in the multiple-bit flip-flops of the shared processor pipeline.

2. A processor according to claim 1 wherein the plurality of multiple-bit flip-flops are capable of generating multiple thread paths without using multiplexers to select from among the plurality of execution threads.

3. A processor according to claim 1 wherein the plurality of multiple-bit flip-flops are fabricated in circuits the size of single-bit flip-flops so that multiple threads are supported without increasing the surface area of the integrated circuit, maintaining the footprint of single-thread circuits so that integrated circuit die size is maintained.

4. A processor according to claim 1, wherein the multiple-bit flip-flops include a plurality of latches controlled by thread switches.

5. A processor according to claim 4 wherein the latches are removed from a direct path of signal propagation so that signal speed is not degraded.

6. A processor according to claim 1 further comprising:

a register file coupled to the shared processor pipeline and capable of supplying data to the processor pipeline for execution.

7. A processor according to claim 1 further comprising:

a register file coupled to the shared processor pipeline and capable of supplying data to the processor pipeline for execution, the register file being a multiple-dimensional register file that includes a plurality of two-dimensional storage planes.

8. A processor comprising:

a shared processor pipeline including a plurality of pulse-based multiple-bit high-speed flip-flops, each pulse-based multiple-bit high-speed flip-flop including a plurality of latches, wherein the shared processor pipeline includes a plurality of processing units capable of executing a plurality of instructions in parallel, the shared processor pipeline being capable of concurrently holding a plurality of execution threads, one of the plurality of execution threads being actively executed; and

a thread control logic coupled to the shared processor pipeline that is capable of controlling the shared processor pipeline to select a thread machine state of the plurality of execution threads to be in either an actively executed state or a held state.

9. A processor according to claim 8 further comprising:

the plurality of processing units including one or more integer arithmetic logic units and one or more floating point arithmetic logic units.

10. A processor according to claim 8 further comprising:

the plurality of processing units including one or more graphic units.

11. A processor according to claim 8 wherein an individual processing unit of the plurality of processing units further includes:

a plurality of load/store units coupled to the shared processor pipeline, the plurality of load/store units being respectively allocated to the plurality of execution threads.

12. A processor according to claim 8 further comprising:

a plurality of load/store units coupled to the shared processor pipeline, the plurality of load/store units being respectively allocated to the plurality of execution threads; and

an external cache control unit coupled to the plurality of load/store units, the load/store units being shared among the plurality of execution threads and being shared among the plurality of processing units.

13. A processor according to claim 8 wherein an individual processing unit of the plurality of processing units further includes:

a data storage unit coupled to the execution pipeline and shared among the plurality of execution threads.

14. A processor according to claim 13 wherein the data storage unit further includes:

a data cache coupled to the execution pipeline and shared among the plurality of execution threads; and

a data memory management unit coupled to the data cache.

15. A processor according to claim 8 wherein an individual processing unit of the plurality of processing units further includes:

an instruction control logic coupled to the shared processor pipeline and shared among the plurality of execution threads.

16. A processor according to claim 15 wherein the instruction control logic further includes:

an instruction cache coupled to the shared processor pipeline;

a branch predict logic coupled to the instruction cache; and

an instruction memory management unit coupled to the instruction cache and coupled to the branch predict logic.

17. A processor according to claim 8, further comprising:

a tag RAM integrated into the single integrated circuit, the tag RAM supporting a two-way external cache, the tag RAM being shared among the plurality of processing units.

18. A processor comprising:

a plurality of processing units that operate in a shared processor pipeline, the shared processor pipeline including a plurality of pulse-based multiple-bit high-speed flip-flops, each pulse-based multiple-bit high-speed flip-flop including a plurality of latches, wherein the plurality of processing units are capable of concurrently holding a plurality of execution threads as a plurality of shadow states, the individual shadow states being respectively allocated to an execution thread of a plurality of execution threads; and

a thread control logic coupled to the shared processor pipeline that is capable of controlling the shared processor pipeline to select a thread machine state of the plurality of execution threads, the thread machine state of the individual execution threads being an actively executed state or a held state.

19. A processor according to claim 18 further comprising:

an external cache control unit coupled to the plurality of processing units and shared among the plurality of execution threads, the external cache control unit being coupled to an external cache RAM.

20. A processor according to claim 18 further comprising:

a memory control unit coupled to the external cache control unit, the memory control unit including a cache miss processing and interfacing logic for supplying a plurality of execution threads to the plurality of processing units in thread processing.

21. A processor according to claim 18 further comprising:

a Peripheral Component Interconnect (PCI) interface coupled to the external cache control unit.

22. A processor according to claim 18 wherein the processor is integrated into a single integrated circuit.

23. A processor comprising:

a plurality of processing units in a single integrated circuit that are each capable of executing respective pluralities of execution threads in respective pipelines thereof, the pipelines each being capable of concurrently representing therein at least a portion of thread state for plural execution threads, the pipelines including a plurality of multiple-bit flip-flops, each multiple-bit flip-flop including a plurality of latches for representing respective ones of the thread states;

thread control logic coupled to at least one of the pipelines and capable of controlling the pipeline to select an active one of the represented thread states; and

an external cache control unit coupled to the pipelines and shared thereamongst.

24. A processor according to claim 23 further comprising:

an external cache arbiter integrated into the single integrated circuit and coupled to the external cache control units of the plurality of processing units.

25. A processor according to claim 23 further comprising:

an external cache arbiter integrated into the single integrated circuit and coupled to the external cache control units of the plurality of processing units; and

a cache integrated into the single integrated circuit and coupled to the external cache arbiter.

26. A processor according to claim 23 wherein the multiple-bit flip-flops have a latch structure coupled to a plurality of select-bus lines, the select-bus lines selecting an active thread from among the plurality of execution threads.

27. A processor according to claim 23 wherein an individual processing unit of the plurality of processing units further includes:

a memory control unit coupled to the external cache control unit.

28. A processor according to claim 23 wherein an individual processing unit of the plurality of processing units further includes:

a Peripheral Component Interconnect (PCI) interface coupled to the external cache control unit.

29. A processor according to claim 23, further comprising:

a multiple-dimension register file coupled to the pipelines, the multiple-dimension register file including register instances replicated for storage of register state for respective execution threads concurrently represented in the pipelines.

30. A processor according to claim 23, wherein the multiple-bit flip-flops are pulse-based.

31. A method of operating a processor comprising:

concurrently representing thread states for a plurality of execution threads in multiple-bit flip-flops of a shared processor pipeline, each multiple-bit flip-flop including a plurality of latches for representing a portion of the respective thread states;

actively executing one of the plurality of execution threads; and

controlling the shared processor pipeline to select a respective one of the concurrently represented thread states.

32. A method according to claim 31, further comprising:

representing, using replicated register instances, register state for respective execution threads for which thread state is concurrently represented in the shared processor pipeline.

33. A processor comprising:

a shared processor pipeline including a plurality of pulse-based multiple-bit flip-flops, each pulse-based multiple-bit flip-flop including a plurality of latches, wherein the shared processor pipeline is capable of concurrently representing at least a portion of thread state for a plurality of execution threads, one of the plurality of execution threads being actively executed, at least a portion of thread state for others of the plurality of execution threads being held within the shared processor pipeline pending selection for execution; and

thread control logic coupled to the shared processor pipeline that is capable of controlling activation and deactivation of the plurality of execution threads in the shared processor pipeline.

34. A processor according to claim 33, further comprising:

a multiple-dimension register file coupled to the shared processor pipeline, the multiple-dimension register file including register instances replicated for storage of register state for respective execution threads concurrently represented in the shared processor pipeline.

35. A processor comprising:

a vertically multithreaded processor pipeline including a plurality of pipeline registers defined therein that include, for respective storage positions thereof, multiple-bit flip-flops wherein respective ones of the multiple-bits encode at least a portion of thread state for respective execution threads concurrently represented in the processor pipeline; and

thread control logic coupled to the processor pipeline and selective for the respective bits of the multiple-bit flip-flops, which correspond to an active one of the execution threads.

36. A processor according to claim 35, further comprising:

a multiple-dimension register file coupled to the processor pipeline, the multiple-dimension register file including register instances replicated for storage of register state for respective execution threads concurrently represented in the processor pipeline.

37. A processor according to claim 35, wherein the multiple-bit flip-flops are pulse-based.
Description



CROSS-REFERENCE

The present invention is related to subject matter disclosed in the following co-pending patent applications which are incorporated by reference herein in their entirety:
    • 1. United States patent application entitled, "Vertically-Threaded Processor with Multi-Dimensional Storage",
    • 2. United States patent application entitled, "Multi-Threaded Processor By Multiple-Bit Flip-Flop Global Substitution",
    • 3. United States patent application entitled, "Switching Method in a Multi-Threaded Processor", atty. docket no.: SP-3878 US, naming William Joy, Marc Tremblay, Gary Lauterbach, and Joseph Chamdani as inventors and filed on even date herewith;
    • 4. United States patent application entitled, "Multiple-Thread Processor with Single-Thread Interface Shared among Threads", atty. docket no.: SP3877 US, naming William Joy, Marc Tremblay, Gary Lauterbach, and Joseph Chamdani as inventors and filed on even date herewith; and
    • 5. United States patent application entitled, "Thread Switch Logic in a Multiple-Thread Processor", atty. docket no.: SP-3879 US, naming William Joy, Marc Tremblay, Gary Lauterbach, and Joseph Chamdani as inventors and filed on even date herewith.


    BACKGROUND OF THE INVENTION

    1. Field of the Invention

    The present invention relates to processor or computer architecture. More specifically, the present invention relates to multiple-threading processor architectures and methods of operation and execution.

    2. Description of the Related Art

    In many commercial computing applications, a large percentage of time elapses during pipeline stalling and idling, rather than in productive execution, due to cache misses and latency in accessing external caches or external memory following the cache misses. Stalling and idling are most detrimental, due to frequent cache misses, in database handling operations such as OLTP, DSS, data mining, financial forecasting, mechanical and electronic computer-aided design (MCAD/ECAD), web servers, data servers, and the like. Thus, although a processor may execute at high speed, much time is wasted while idly awaiting data.

    One technique for reducing stalling and idling is hardware multithreading to achieve processor execution during otherwise idle cycles. Hardware multithreading involves replication of some processor resources, for example replication of architected registers, for each thread. Replication is not required for most processor resources, including instruction and data caches, translation look-aside buffers (TLB), instruction fetch and dispatch elements, branch units, execution units, and the like.

    Unfortunately, duplication of resources is costly in terms of integrated circuit consumption and performance.

    Accordingly, improved multithreading circuits and operating methods are needed that are economical in resources and avoid costly overhead which reduces processor performance.

    SUMMARY OF THE INVENTION

    A processor reduces wasted cycle time resulting from stalling and idling, and increases the proportion of execution time, by supporting and implementing both vertical multithreading and horizontal multithreading. Vertical multithreading permits overlapping or "hiding" of cache miss wait times. In vertical multithreading, multiple hardware threads share the same processor pipeline. A hardware thread is typically a process, a lightweight process, a native thread, or the like in an operating system that supports multithreading. Horizontal multithreading increases parallelism within the processor circuit structure, for example within a single integrated circuit die that makes up a single-chip processor. To further increase system parallelism in some processor embodiments, multiple processor cores are formed in a single die.

    Advances in on-chip multiprocessor horizontal threading are gained as processor core sizes are reduced through technological advancements.

    The described processor structure and operating method may be implemented in many structural variations. For example, two processor cores are combined with an on-chip set-associative L2 cache in one system. In another example, four processor cores are combined with a direct RAMBUS interface with no external L2 cache. Countless other variations are possible. In some systems, each processor core is a vertically-threaded pipeline.

    In a further aspect of some multithreading system and method embodiments, a computing system may be configured in many different processor variations that allocate execution among a plurality of execution threads. For example, in a "1C2T" configuration, a single processor die includes two vertical threads. In a "4C4T" configuration, a four-processor multiprocessor is formed on a single die with each of the four processors being four-way vertically threaded. Countless other "nCkT" structures and combinations may be implemented on one or more integrated circuit dies depending on the fabrication process employed and the applications envisioned for the processor. Various systems may include caches that are selectively configured, for example as segregated L1 caches and segregated L2 caches, or segregated L1 caches and shared L2 caches, or shared L1 caches and shared L2 caches.
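
    As a small illustration of the "nCkT" naming, the sketch below parses such a label into a core count n and a per-core vertical thread count k, and multiplies them to get the number of hardware threads on the die. The names `ThreadConfig` and `parse_nckt` are hypothetical and are not part of the patent; the product n x k is an inference from the "4C4T" example, not a figure the text states.

```python
import re
from typing import NamedTuple


class ThreadConfig(NamedTuple):
    """Hypothetical container for an "nCkT" configuration."""
    cores: int             # n: processor cores on the die
    threads_per_core: int  # k: vertical threads per core

    @property
    def hardware_threads(self) -> int:
        # Assumption: each core carries k vertical threads, so n * k in total.
        return self.cores * self.threads_per_core


def parse_nckt(label: str) -> ThreadConfig:
    """Parse a label such as "4C4T"; stray spaces inside the label are tolerated."""
    match = re.fullmatch(r"\s*(\d+)\s*C\s*(\d+)\s*T\s*", label)
    if match is None:
        raise ValueError(f"not an nCkT label: {label!r}")
    return ThreadConfig(int(match.group(1)), int(match.group(2)))
```

    Under these assumptions, `parse_nckt("4C4T").hardware_threads` evaluates to 16, and a single-core, two-thread label parses to `ThreadConfig(cores=1, threads_per_core=2)`.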

    In an aspect of some multithreading system and method embodiments, in response to a cache miss stall a processor freezes the entire pipeline state of an executing thread. The processor executes instructions and manages the machine state of each thread separately and independently. The functional properties of an independent thread state are stored throughout the pipeline extending to the pipeline registers to enable the processor to postpone execution of a stalling thread, relinquish the pipeline to a previously idle thread, later resuming execution of the postponed stalling thread at the precise state of the stalling thread immediately prior to the thread switch.
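
    The freeze-and-resume behavior described above can be sketched as a toy software model. This is only an illustration under stated assumptions: the three-entry `pipeline_regs` list, the `ThreadState` and `VerticalPipeline` classes, and the round-robin choice of the next thread are all hypothetical simplifications, not the circuit the patent describes.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """Per-thread machine state held inside the pipeline (toy model)."""
    pc: int = 0
    # Hypothetical: three pipeline stage registers per thread.
    pipeline_regs: list = field(default_factory=lambda: [None] * 3)


class VerticalPipeline:
    """One shared pipeline holding the state of several vertical threads."""

    def __init__(self, num_threads: int) -> None:
        self.states = [ThreadState() for _ in range(num_threads)]
        self.active = 0

    def step(self) -> None:
        """Advance only the active thread by one cycle."""
        st = self.states[self.active]
        # Shift the stage registers and fetch a new instruction.
        st.pipeline_regs = [f"insn@{st.pc}"] + st.pipeline_regs[:-1]
        st.pc += 1

    def switch_on_stall(self) -> None:
        """On a cache-miss stall, leave the stalled thread's state frozen
        in place and hand the pipeline to the next thread (assumed round-robin)."""
        self.active = (self.active + 1) % len(self.states)
```

    Because `step` touches only the active thread's state, a thread that was switched out resumes with exactly the program counter and stage registers it had immediately before the switch.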

    In another aspect of some multithreading system and method embodiments, a processor includes a "four-dimensional" register structure in which register file structures are replicated by N for vertical threading in combination with a three-dimensional storage circuit. The multi-dimensional storage is formed by constructing a storage, such as a register file or memory, as a plurality of two-dimensional storage planes.

    In another aspect of some multithreading system and method embodiments, a processor implements N-bit flip-flop global substitution. To implement multiple machine states, the processor converts 1-bit flip-flops in storage cells of the stalling vertical thread to an N-bit global flip-flop where N is the number of vertical threads.
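
    A behavioral sketch of the N-bit flip-flop idea follows: one storage position holds one bit per vertical thread, and a thread select picks which bit is read and written. The class name `NBitFlipFlop` and its methods are hypothetical; the model ignores timing, pulse-based clocking, and layout, which the patent treats at the circuit level.

```python
class NBitFlipFlop:
    """Toy model: one storage position with one latch per vertical thread."""

    def __init__(self, num_threads: int) -> None:
        self.latches = [0] * num_threads  # one bit per thread
        self.tid = 0                      # currently selected thread

    def select(self, tid: int) -> None:
        """Apply the thread select signal."""
        if not 0 <= tid < len(self.latches):
            raise ValueError(f"thread id out of range: {tid}")
        self.tid = tid

    def clock(self, d: int) -> None:
        """Capture input bit d into the selected thread's latch only."""
        self.latches[self.tid] = d & 1

    @property
    def q(self) -> int:
        """Output of the selected thread's latch."""
        return self.latches[self.tid]
```

    Writing through one thread's selection leaves the bits of the other threads untouched, which is the property that lets a stalled thread's state stay in place while another thread runs.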

    In one aspect of some processor and processing method embodiments, the processor improves throughput efficiency and exploits increased parallelism by introducing multithreading to an existing and mature processor core. The multithreading is implemented in two steps including vertical multithreading and horizontal multithreading. The processor core is retrofitted to support multiple machine states. System embodiments that exploit retrofitting of an existing processor core advantageously leverage hundreds of man-years of hardware and software development by extending the lifetime of a proven processor pipeline generation.

    In another aspect of some multithreading system and method embodiments, a processor includes logic for tagging a thread identifier (TID) for usage with processor blocks that are not stalled. Pertinent non-stalling blocks include caches, translation look-aside buffers (TLB), a load buffer asynchronous interface, an external memory management unit (MMU) interface, and others.

    In a further aspect of some multithreading system and method embodiments, a processor includes a cache that is segregated into a plurality of N cache parts. Cache segregation avoids interference, "pollution", or "cross-talk" between threads. One technique for cache segregation utilizes logic for storing and communicating thread identification (TID) bits. The cache utilizes cache indexing logic. For example, the TID bits can be inserted at the most significant bits of the cache index.
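
    Placing the TID bits at the most significant bits of the cache index can be sketched as follows. The function name `segregated_cache_index` and its parameters are hypothetical, and the sketch assumes power-of-two sizes; the patent does not give specific cache dimensions.

```python
def _is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0


def segregated_cache_index(address: int, tid: int, *, num_sets: int,
                           line_bytes: int, num_threads: int) -> int:
    """Return a set index whose top bits are the thread identifier (TID).

    Assumes num_sets, line_bytes and num_threads are powers of two and that
    num_threads <= num_sets, so each thread owns num_sets // num_threads sets.
    """
    if not all(map(_is_power_of_two, (num_sets, line_bytes, num_threads))):
        raise ValueError("sizes must be powers of two")
    if num_threads > num_sets or not 0 <= tid < num_threads:
        raise ValueError("invalid thread count or thread id")

    offset_bits = (line_bytes - 1).bit_length()   # byte offset within a line
    index_bits = (num_sets - 1).bit_length()      # total index width
    tid_bits = (num_threads - 1).bit_length()     # index bits taken by the TID
    addr_index_bits = index_bits - tid_bits       # index bits left for the address

    addr_part = (address >> offset_bits) & ((1 << addr_index_bits) - 1)
    return (tid << addr_index_bits) | addr_part   # TID occupies the MSBs
```

    With 256 sets, 32-byte lines and four threads, address 0x1234 maps to set 17 for thread 0 and to set 209 for thread 3, so the two threads can never evict each other's lines.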

    In another aspect of some multithreading system and method embodiments, a processor includes a thread switching control logic that performs a fast thread-switching operation in response to an L1 cache miss stall. The fast thread-switching operation implements one or more of several thread-switching methods. A first thread-switching operation is "oblivious" thread-switching every N cycles, in which the individual flip-flops locally determine a thread-switch without notification of stalling. The oblivious technique avoids usage of an extra global interconnection between threads for thread selection. A second thread-switching operation is "semi-oblivious" thread-switching for use with an existing "pipeline stall" signal (if any). The pipeline stall signal operates in two capacities, first as a notification of a pipeline stall, and second as a thread select signal between threads so that, again, usage of an extra global interconnection between threads for thread selection is avoided. A third thread-switching operation is an "intelligent global scheduler" thread-switching in which a thread switch decision is based on a plurality of signals including: (1) an L1 data cache miss stall signal, (2) an instruction buffer empty signal, (3) an L2 cache miss signal, (4) a thread priority signal, (5) a thread timer signal, or (6) other sources of triggering. In some embodiments, the thread select signal is broadcast as fast as possible, similar to a clock tree distribution. In some systems, a processor derives a thread select signal that is applied to the flip-flops by overloading a scan enable (SE) signal of a scannable flip-flop.
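
    The three thread-switching methods can be contrasted with a short behavioral sketch. All function and field names below are hypothetical. In particular, the way `intelligent_select` combines the listed signals (switch when the current thread is blocked or its timer expired, then pick the highest-priority unblocked thread) is an assumed policy for illustration; the text lists the input signals but does not specify how they are combined.

```python
from dataclasses import dataclass


def oblivious_select(cycle: int, num_threads: int, period: int) -> int:
    """Oblivious: every flip-flop can derive the active thread locally
    from a cycle count, switching every `period` cycles."""
    return (cycle // period) % num_threads


def semi_oblivious_select(current: int, stall: bool, num_threads: int) -> int:
    """Semi-oblivious: the existing pipeline stall signal doubles as the
    thread select, so a stall advances to the next thread."""
    return (current + 1) % num_threads if stall else current


@dataclass
class ThreadSignals:
    """Per-thread inputs to the scheduler (signal names taken from the list above)."""
    l1_dcache_miss_stall: bool = False
    instruction_buffer_empty: bool = False
    l2_cache_miss: bool = False
    timer_expired: bool = False
    priority: int = 0


def _blocked(s: ThreadSignals) -> bool:
    return s.l1_dcache_miss_stall or s.instruction_buffer_empty or s.l2_cache_miss


def intelligent_select(current: int, signals: list) -> int:
    """Intelligent global scheduler under an assumed combining policy."""
    cur = signals[current]
    if not (_blocked(cur) or cur.timer_expired):
        return current                      # no trigger: keep running
    candidates = [t for t, s in enumerate(signals)
                  if t != current and not _blocked(s)]
    if not candidates:
        return current                      # nothing runnable to switch to
    return max(candidates, key=lambda t: signals[t].priority)
```

    The first two functions need no information beyond a counter or the stall line, which mirrors the point that they avoid an extra global interconnection; the third needs a view of every thread's signals.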

    In an additional aspect of some multithreading system and method embodiments, a processor includes anti-aliasing logic coupled to an L1 cache so that the L1 cache is shared among threads via anti-aliasing. The L1 cache is a virtually-indexed, physically-tagged cache that is shared among threads. The anti-aliasing logic avoids hazards that result from multiple virtual addresses mapping to one physical address. The anti-aliasing logic selectively invalidates or updates duplicate L1 cache entries.
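
    The aliasing hazard and the invalidate option can be illustrated with a small Python model. The class name `VIPTCache`, the geometry, and the full scan over all entries are simplifying assumptions. Actual hardware would examine only the few index positions at which an alias can reside, and the sketch models invalidation only, not the update alternative.

```python
class VIPTCache:
    """Toy direct-mapped, virtually-indexed, physically-tagged cache."""

    def __init__(self, num_lines: int = 512, line_bytes: int = 32):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        # One physical line number per index; None means invalid.
        self.tags = [None] * num_lines

    def index_of(self, vaddr: int) -> int:
        # The index is taken from the virtual address.
        return (vaddr // self.line_bytes) % self.num_lines

    def fill(self, vaddr: int, paddr: int) -> int:
        idx = self.index_of(vaddr)
        pline = paddr // self.line_bytes
        # Anti-aliasing: invalidate any other entry holding the same
        # physical line so that only one copy remains in the cache.
        for i, tag in enumerate(self.tags):
            if i != idx and tag == pline:
                self.tags[i] = None
        self.tags[idx] = pline
        return idx

    def lookup(self, vaddr: int, paddr: int) -> bool:
        # Hit when the entry at the virtual index carries the physical tag.
        return self.tags[self.index_of(vaddr)] == paddr // self.line_bytes
```

    Filling the same physical line through two different virtual addresses leaves exactly one valid copy, so a later store through one virtual address cannot leave stale data visible through the other.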

    In another aspect of some multithreading system and method embodiments, a processor includes logic for attaining a very fast exception handling functionality while executing non-threaded programs by invoking a multithreaded-type functionality in response to an exception condition. The processor, while operating in multithreaded conditions or while executing non-threaded programs, progresses through multiple machine states during execution. The very fast exception handling logic includes connection of an exception signal line to thread select logic, causing an exception signal to evoke a switch in thread and machine state. The switch in thread and machine state causes the processor to enter and to exit the exception handler immediately, without waiting to drain the pipeline or queues and without the inherent timing penalty of the operating system's software saving and restoring of registers.

    An additional aspect of some multithreading systems and methods is a thread reservation system or thread locking system in which a thread pathway is reserved for usage by a selected thread. A thread control logic may select a particular thread that is to execute with priority in comparison to other threads. A high priority thread may be associated with an operation with strict time constraints, an operation that is frequently and predominantly executed in comparison to other threads. The thread control logic controls thread-switching operation so that a particular hardware thread is reserved for usage by the selected thread.

    In another aspect of some multithreading system and method embodiments, a processor includes logic supporting lightweight processes and native threads. The logic includes a block that disables thread ID tagging and disables cache segregation since lightweight processes and native threads share the same virtual tag space.

    In a further additional aspect of some embodiments of the multithreading system and method, some processors include a thread reservation functionality.

    BRIEF DESCRIPTION OF THE DRAWINGS

    The features of the described embodiments are specifically set forth in the appended claims. However, embodiments of the invention relating to both structure and method of operation may best be understood by referring to the following description and accompanying drawings.

    FIGS. 1A and 1B are timing diagrams respectively illustrating execution flow of a single-thread processor and a vertical multithread processor.

    FIGS. 2A, 2B, and 2C are timing diagrams respectively illustrating execution flow of a single-thread processor, a vertical multithread processor, and a vertical and horizontal multithread processor.

    FIG. 3 is a schematic functional block diagram depicting a design configuration for a single-processor vertically-threaded processor that is suitable for implementing various multithreading techniques and system implementations that improve multithreading performance and functionality.

    FIGS. 4A, 4B, and 4C are diagrams showing an embodiment of a pulse-based high-speed flip-flop that is advantageously used to attain multithreading in an integrated circuit. FIG. 4A is a schematic block diagram illustrating control and storage blocks of a circuit employing high-speed multiple-bit flip-flops. FIG. 4B is a schematic circuit diagram that shows a multiple-bit bistable multivibrator (flip-flop) circuit. FIG. 4C is a timing diagram illustrating timing of the multiple-bit flip-flop.

    FIG. 5 is a schematic block diagram illustrating an N-bit "thread selectable" flip-flop substitution logic that is used to create vertically multithreaded functionality in a processor pipeline while maintaining the same circuit size as a single-threaded pipeline.

    FIG. 6 is a schematic block diagram illustrating a thread switch logic which rapidly generates a thread identifier (TID) signal identifying an active thread of a plurality of threads.

    FIGS. 7A and 7B are, respectively, a schematic block diagram showing an example of a segregated cache and a pictorial diagram showing an example of an addressing technique for the segregated cache.

    FIG. 8 is a schematic block diagram showing a suitable anti-aliasing logic for usage in various processor implementations including a cache, such as an L1 cache, an L2 cache, or others.

    FIG. 9 is a schematic functional block diagram depicting a design configuration for a single-chip dual-processor vertically-threaded processor that is suitable for implementing various multithreading techniques and system implementations that improve multithreading performance and functionality.

    FIG. 10 is a schematic functional block diagram depicting an alternative design configuration for a single-processor vertically-threaded processor that is suitable for implementing various multithreading techniques and system implementations that improve multithreading performance and functionality.

    FIG. 11 is a schematic functional block diagram depicting an alternative design configuration for a single-chip dual-processor vertically-threaded processor that is suitable for implementing various multithreading techniques and system implementations that improve multithreading performance and functionality.

    FIG. 12 is a schematic block diagram illustrating a processor and processor architecture that are suitable for implementing various multithreading techniques and system implementations that improve multithreading performance and functionality.

    FIG. 13 is a schematic perspective diagram showing a multi-dimensional register file.

    FIG. 14 is a schematic circuit diagram showing a conventional implementation of register windows.

    FIG. 15 is a schematic circuit diagram showing a plurality of bit cells of the register windows of the multi-dimensional register file that avoids waste of integrated circuit area by exploiting the condition that only one window is read and only one window is written at one time.

    FIG. 16 is a schematic circuit diagram illustrating a suitable bit storage circuit storing one bit of the local registers for the multi-dimensional register file with eight windows.

    FIGS. 17A and 17B are, respectively, a schematic pictorial diagram and a schematic block diagram illustrating sharing of registers among adjacent windows.

    FIG. 18 is a schematic circuit diagram illustrating an implementation of a multi-dimensional register file for registers shared across a plurality of windows.

    The use of the same reference symbols in different drawings indicates similar or identical items.

    DESCRIPTION OF THE EMBODIMENT(S)

    Referring to FIGS. 1A and 1B, two timing diagrams respectively illustrate execution flow 110 in a single-thread processor and execution flow 120 in a vertical multithread processor. Processing applications such as database applications spend a significant portion of execution time stalled awaiting memory servicing. FIG. 1A is a highly schematic timing diagram showing execution flow 110 of a single-thread processor executing a database application. In an illustrative example, the single-thread processor is a four-way superscalar processor. Shaded areas 112 correspond to periods of execution in which the single-thread processor core issues instructions. Blank areas 114 correspond to time periods in which the single-thread processor core is stalled waiting for data or instructions from memory or an external cache. A typical single-thread processor executing a typical database application executes instructions about 30% of the time, with the remaining 70% of the time elapsed in a stalled condition. The 30% utilization rate exemplifies the inefficient usage of resources by a single-thread processor.

    FIG. 1B is a highly schematic timing diagram showing execution flow 120 of similar database operations by a multithread processor. Applications such as database applications have a large amount of inherent parallelism due to the heavy throughput orientation of database applications and the common database functionality of processing several independent transactions at one time. The basic concept of exploiting multithread functionality involves utilizing processor resources efficiently when a thread is stalled by executing other threads while the stalled thread remains stalled. The execution flow 120 depicts a first thread 122, a second thread 124, a third thread 126 and a fourth thread 128, all of which are shown with shading in the timing diagram. As one thread stalls, for example first thread 122, another thread, such as second thread 124, switches into execution on the otherwise unused or idle pipeline. Blank areas 130 correspond to idle times when all threads are stalled. Overall processor utilization is significantly improved by multithreading. The illustrative technique of multithreading employs replication of architected registers for each thread and is called "vertical multithreading".
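
    The utilization gain can be estimated with a rough back-of-the-envelope model. The model is an assumption introduced here for illustration and is not part of the described embodiments: each thread is taken to be ready to issue about 30% of the time, independently of the other threads, so the pipeline is idle only when every thread is stalled at once. The function name `pipeline_utilization` is hypothetical.

```python
def pipeline_utilization(busy_fraction: float, num_threads: int) -> float:
    """Estimated fraction of cycles in which at least one thread can issue.

    Assumes threads stall independently, which real workloads only
    approximate; treat the result as an optimistic rule of thumb.
    """
    all_stalled = (1.0 - busy_fraction) ** num_threads
    return 1.0 - all_stalled


if __name__ == "__main__":
    for threads in (1, 2, 4):
        print(threads, round(pipeline_utilization(0.30, threads), 4))
```

    Under this assumption, one thread gives the 30% figure cited for the single-thread case, two threads give about 51%, and four threads give about 76%. The remaining idle fraction corresponds to the blank areas 130 in which all threads are stalled.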

    Vertical multithreading is advantageous in processing applications in which frequent cache misses result in heavy clock penalties. When cache misses cause a first thread to stall, vertical multithreading permits a second thread to execute when the processor would otherwise remain idle. The second thread thus takes over execution of the pipeline. A context switch from the first thread to the second thread involves saving the useful states of the first thread and assigning new states to the second thread. When the first thread restarts after stalling, the saved states are returned and the first thread proceeds in execution. Vertical multithreading imposes costs on a processor in resources used for saving and restoring thread states.

    Referring to FIGS. 2A, 2B, and 2C, three highly schematic timing diagrams respectively illustrate execution flow 210 of a single-thread processor, execution flow 230 of a vertical multithread processor, and execution flow 250 of a combined vertical and horizontal multithread processor. In FIG. 2A, shaded areas 212 showing periods of execution and blank areas 214 showing time periods in which the single-thread processor core is idle due to stall illustrate the inefficiency of a single-thread processor.

    In FIG. 2B, execution flow 230 in a vertical threaded processor includes execution of a first thread 232 and a second thread 234, both shaded in the timing diagram, and an idle time shown in a blank area 240. Efficient instruction execution proceeds as one thread stalls and, in response to the stall, another thread switches into execution on the otherwise unused or idle pipeline. In the blank areas 240, an idle time occurs when all threads are stalled. A vertical multithread processor maintains a separate processing state for T executing threads. Only one of the threads is active at one time. The vertical multithreaded processor switches execution to another thread on a cache miss, for example an L1 cache miss.

    A horizontal threaded processor, using a technique called chip-multiple processing, combines multiple processors on a single integrated circuit die. The multiple processors are vertically threaded to form a processor with both vertical and horizontal threading, augmenting execution efficiency and decreasing latency in a multiplicative fashion. In FIG. 2C, execution flow 250 in a vertical and horizontal threaded processor includes execution of a first thread 252 executing on a first processor, a second thread 254 executing on the first processor, a first thread 256 executing on a second processor and a second thread 258 executing on the second processor. An idle time is shown in a blank area 260 for both the first and second processors. Execution of the first thread 252 and the second thread 254 on the first processor illustrates vertical threading. Similarly, execution of the first thread 256 and the second thread 258 on the second processor illustrates vertical threading. In the illustrative embodiment, a single integrated circuit includes both the first processor and the second processor, the multiple processors executing in parallel so that the multithreading operation is a horizontal multiple-threading or integrated-circuit chip multiprocessing (CMP) in combination with the vertical multithreading of the first processor and the second processor. The combination of vertical multithreading and horizontal multithreading increases processor parallelism and performance, and attains an execution efficiency that exceeds the efficiency of a processor with only vertical multithreading. The combination of vertical multithreading and horizontal multithreading also advantageously reduces communication latency among local (on-chip) multi-processor tasks by eliminating much signaling on high-latency communication lines between integrated circuit chips. Horizontal multithreading further advantageously exploits processor speed and power improvements that inherently result from reduced circuit sizes in the evolution of silicon processing.

    For each vertical threaded processor, efficient instruction execution proceeds as one thread stalls and, in response to the stall, another thread switches into execution on the otherwise unused or idle pipeline. In the blank areas 260, an idle time occurs when all threads are stalled.

    Vertical multithreading is advantageously used to overcome or hide cache miss stalls, thereby continuing execution of the processor despite stalls. Vertical multithreading thus improves performance in commercial multiprocessor and multithreading applications. Vertical multithreading advantageously accelerates context switching time from millisecond ranges to nanosecond ranges. Vertical multithreading is highly advantageous in all processing environments including embedded, desktop, and server applications, and the like.

    Horizontal multithreading or circuit chip multiprocessing further increases on-chip parallelism by exploiting increasingly smaller processor core sizes.

    Although the illustrative example shows execution of two concurrent vertical multithreading processors with each concurrent vertical multithreading processor executing two threads, in other examples various numbers of concurrently executing processors may execute various numbers of threads. The number of threads that execute on one processor may be the same or different from the number of threads executing concurrently and in parallel on another processor.

    In some processor designs, vertical and horizontal multithreading is incorporated into the fundamental design of the processors, advantageously creating modular and flexible structures that promote scalability of design. In other processor designs, multithreading is incorporated into existing and mature processor designs to leverage existing technological bases and increase performance of multiprocessing and multithreading applications. One highly suitable example of processor design for retrofitting with multithreading functionality is an UltraSPARC processor. In some designs, vertical and horizontal multithreading are achieved with minimal retrofitting of an existing processor core, advantageously reducing logic and physical design changes and avoiding global chip re-routing, recomposing, and the expense of heavy redesign of integrated circuits.

    Referring to FIG. 3, a schematic functional block diagram depicts a design configuration for a single-processor vertically-threaded processor 300 that is suitable for implementing various multithreading techniques and system implementations that improve multithreading performance and functionality. The single-processor vertically-threaded processor 300 has a single pipeline shared among a plurality of machine states or threads, holding a plurality of machine states concurrently. A thread that is currently active, not stalled, is selected and supplies data to functional blocks connected to the pipeline. When the active thread is stalled, the pipeline immediately switches to a non-stalled thread, if any, and begins executing the non-stalled thread.

    The single-processor vertically-threaded processor 300 includes a thread 0 machine state block 310 that defines a machine state of a first thread (thread 0). The single-processor vertically-threaded processor 300 also includes a thread 1 machine state block 312 that defines a machine state of a second thread (thread 1) that "shadows" the machine state of thread 0. The thread 0 machine state block 310 and the thread 1 machine state block 312 are fabricated in a single integrated circuit logic structure using a high-speed multi-bit flip-flop design and a "four-dimensional" register file structure and supply instructions from thread 0 and thread 1 to a shared processor pipeline 314 using vertical threading. The multiple-dimensional register file employs register file structures that are replicated by N for vertical threading in combination with a three-dimensional storage circuit. The three-dimensional storage is formed by constructing a storage, such as a register file or memory, as a plurality of two-dimensional storage planes.

    In response to a cache miss stall, the processor 300 freezes the entire pipeline state of an executing thread in the shared processor pipeline 314. The processor 300 issues instructions and manages the machine state of each thread separately and independently. The functional properties of an independent thread state are stored throughout the pipeline extending to the pipeline registers to allow the processor 300 to postpone execution of a stalling thread by freezing the active state in the pipeline, relinquish the pipeline 314 to a previously idle thread by activating the previously idle thread in the pipeline while holding the state of the newly idle thread in the pipeline, and later resume execution of the postponed stalling thread at the precise state of the stalling thread immediately prior to the thread switch.
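
    The postpone, relinquish, and resume sequence can be illustrated with a cycle-level toy simulation in Python. Everything in the sketch is an assumption made for illustration: the function name `simulate`, the representation of a thread as a list of `"op"` and `"miss"` strings, the fixed miss latency, and the rule that the pipeline stays on the active thread until that thread stalls or finishes.

```python
from typing import List, Optional


def simulate(threads: List[List[str]], miss_latency: int) -> List[Optional[int]]:
    """Return, per cycle, the thread that issued (None when all are stalled)."""
    n = len(threads)
    pc = [0] * n         # next instruction per thread (frozen while stalled)
    ready_at = [0] * n   # first cycle at which a stalled thread may resume
    active = 0
    timeline: List[Optional[int]] = []
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        runnable = [t for t in range(n)
                    if pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if active not in runnable:
            if not runnable:
                timeline.append(None)   # every unfinished thread is stalled
                cycle += 1
                continue
            active = runnable[0]        # relinquish the pipeline to another thread
        op = threads[active][pc[active]]
        pc[active] += 1                 # state advances only for the active thread
        timeline.append(active)
        if op == "miss":
            # Freeze this thread; it resumes exactly where it left off.
            ready_at[active] = cycle + 1 + miss_latency
        cycle += 1
    return timeline


if __name__ == "__main__":
    print(simulate([["op", "miss", "op"], ["op", "op"]], miss_latency=3))
```

    In the example run, thread 0 issues two instructions and misses, thread 1 takes over for two cycles, one cycle is idle because thread 1 has finished while thread 0 is still waiting, and thread 0 then resumes at its saved position.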

    The shared processor pipeline 314 is coupled to a dual load/store unit including a thread 0 load/store unit 316 and a thread 1 load/store unit 318 that execute load and store data accesses for instruction threads 0 and 1, respectively. The load/store units generate a virtual address of all load and store operations for accessing a data cache, decoupling load misses from the pipeline through a load buffer (not shown), and decoupling the stores through a store buffer. Up to one load or store is issued per cycle.

    The shared processor pipeline 314 and the dual load/store unit are connected to a data memory interface 320 including a shared data cache and a shared data memory management unit (DMMU). The shared data cache is used to cache data for both thread 0 and thread 1 instruction sequences. In an illustrative processor 300, the data cache is a write-through non-allocating 16-kilobyte direct-mapped 32-byte line cache.

    The data cache is virtually-indexed and physically-tagged using a tag array that is dual-ported so that tag updates resulting from line fills do not collide with tag reads for incoming loads. Snoops to the data cache use the second tag port so that an incoming load is processed without being delayed by the snoop. The shared data memory management unit (DMMU) manages virtual to physical address translation.

    The dual load/store units are also connected to an external cache control unit (ECU) 322, which is connected to an external cache bus 324. The external cache control unit 322 is also connected to an UltraPort Architecture Interconnect (UPA) bus 326 via a memory interface unit (MIU) 328. The external cache control unit 322 and the memory interface unit (MIU) 328 are unified between thread 0 and thread 1 to perform functions of cache miss processing and interfacing with external devices to supply, in combination, a plurality of execution threads to the thread 0 machine state block 310 and the thread 1 machine state block 312 via a shared instruction control block 330. The unified external cache control unit 322 and memory interface unit (MIU) 328 include thread identifier (TID) tagging to specify and identify a transaction that is accessed via the external cache bus 324 and the UPA bus 326. In the processor 300, TID logging is only internal to the processor 300 (integrated circuit chip). Outside the integrated circuit chip, hardware interacts with the processor 300 in the manner of an interaction with a single CPU with one UPA bus, and one external cache bus interface. In contrast, software outside the integrated circuit chip interacts with the processor 300 in the manner of an interaction with two logical CPUs.

    The instruction control block 330 includes an instruction (L1) cache, a branch prediction unit, NFRAM, and an instruction memory management unit (IMMU), all of which are shared between the multiple threads, thread 0 and thread 1. In an illustrative processor, the instruction cache is a 16 kilobyte two-way set-associative cache with 32-byte blocks. The instruction cache is physically indexed and physically tagged. The set is predicted as part of a "next field" so that only index bits of an address are needed to address the cache. The instruction memory management unit (IMMU) supports virtual to physical address translation of instruction program counters (PCs). To prefetch across conditional branches, dynamic branch prediction is implemented in hardware based on a two-bit history of a branch. In an illustrative processor, a next-field associated with every four instructions in the instruction cache points to the next cache line to be fetched. Up to twelve instructions are stored in an instruction buffer and issued to the pipeline.
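
    One common way to realize prediction from a two-bit history is a two-bit saturating counter, sketched below in Python. Whether the illustrative processor uses exactly this state machine is not stated in the text, so the class `TwoBitPredictor` and its state encoding should be read as an assumed, textbook-style example.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state: int = 1):
        self.state = state  # 0 = strongly not taken ... 3 = strongly taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

    A counter of this kind tolerates a single atypical outcome, such as a loop exit, without flipping its prediction, which is the usual motivation for keeping two bits of history rather than one.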

    The external cache control unit 322 manages instruction (L1) cache and data cache misses, and permits up to one access every other cycle to the external cache. Load operations that miss in the data cache are remedied by multiple-byte data cache fills on two consecutive accesses to the external cache. Store operations are fully pipelined and write-through to the external cache. Instruction prefetches that miss the instruction cache are remedied by multiple-byte instruction cache fills using four consecutive accesses to the parity-protected external cache.

    The external cache control unit 322 supports DMA accesses which hit in the external cache and maintains data coherence between the external cache and the main memory (not shown).

    The memory interface unit (MIU) 328 controls transactions to the UPA bus 326. The UPA bus 326 runs at a fraction (for example, ⅓) of the processor clock.

    Vertical multithreading advantageously improves processor performance in commercial application workloads which have high cache miss rates with a high miss penalty, low processor utilization (30%-50% on OLTP), and latency periods that present an opportunity to overlap execution to utilize cache miss wait times.

    Vertical multithreading is also highly advantageous in sequential and parallel processing applications with frequent context switches.

    Vertical multithreading does impose some costs on a processor in terms of resources used to save and restore thread states. The costs vary depending on the implementation of multithreading resources. For example, a software implementation typically incurs a time expense that negates any gain in latency. In another example, pipeline stages may be duplicated while attempting to share as many resources as possible, disadvantageously resulting in a high cost in silicon area.

    An advantageous technique for implementing vertical multithreading, called a high-speed multi-bit flip-flop design, involves designing pipeline registers (flops) with multiple storage bits. The individual bits of a flip-flop are allocated to a separate thread. When a first thread stalls, typically due to a cache miss, the active bit of a flip-flop is removed from the pipeline pathway and another bit of the flip-flop becomes active. The states of the stalled thread are preserved in a temporarily inactive bit of the individual flip-flops in a pipeline stage. The high-speed multi-bit flip-flop design utilizes placement of a multiple-bit flip-flop at the end of the individual pipeline stages. The individual bits of the multiple-bit flip-flop are individually accessible and controllable to allow switching from a first thread to a second thread when the first thread stalls.
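
    The behavior of one such pipeline register bit position can be modeled in a few lines of Python. This is a functional sketch only, with hypothetical names (`MultiBitFlipFlop`, `clock`, `select`, `q`); it says nothing about the circuit-level implementation described with reference to FIGS. 4A through 4C.

```python
class MultiBitFlipFlop:
    """One pipeline register position holding one storage bit per thread."""

    def __init__(self, num_threads: int = 2):
        self.bits = [0] * num_threads  # one stored bit per thread
        self.tid = 0                   # currently active thread

    def clock(self, d: int) -> None:
        # Only the active thread's bit samples the input.
        self.bits[self.tid] = d & 1

    def select(self, tid: int) -> None:
        # Thread switch: the previously active bit simply stops being driven.
        self.tid = tid

    @property
    def q(self) -> int:
        # The output reflects the active thread's stored bit.
        return self.bits[self.tid]
```

    Switching threads changes which bit is visible at the output but never copies state in or out, so the stalled thread's value is still in place when that thread is selected again.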

    Referring to FIG. 4A, a schematic block diagram illustrates control and storage blocks of a circuit employing high-speed multiple-bit flip-flops. A multiple-bit flip-flop storage block 410 includes a storage header block 412 and a multiple-bit flip-flop block 414. The storage header block 412 supplies timing signals and thread select signals to the multiple-bit flip-flop block 414. Input signals to the storage header block 412 include a clock signal 14clk that is supplied from external to the multiple-bit flip-flop storage block 410, a combined scan enable and clock enable signal se_ce_l, and a thread identifier (TID) signal tid_g that is supplied from thread select circuitry external to the multiple-bit flip-flop storage block 410. The storage header block 412 derives an internal flip-flop clock signal clk, the inverse of the internal flip-flop clock signal clk_l, and a scan clock signal sclk from the external clock 14clk and the scan enable and clock enable signal se_ce_l. The storage header block 412 asserts an internal thread ID signal tid based on the thread identifier (TID) signal tid_g. The storage header block 412 drives one or more flip-flop cells in the multiple-bit flip-flop block 414. Typically, the multiple-bit flip-flop block 414 includes from one to 32 bistable multivibrator cells, although more cells may be used. The internal flip-flop clock signal clk, the inverse of the internal flip-flop clock signal clk_l, the scan clock signal sclk, and the internal thread ID signal tid are supplied from the storage header block 412 to the multiple-bit flip-flop block 414.

    In addition to the internal flip-flop clock signal clk, the inverse of the internal flip-flop clock signal clk_l, the scan clock signal sclk, and the internal thread ID signal tid, the multiple-bit flip-flop block 414 also receives an input signal d and a scan chain input signal si.

    Referring to FIG. 4B, a schematic circuit diagram shows a multiple-bit bistable multivibrator (flip-flop) circuit. A conventional flip-flop is a single-bit storage structure and is commonly used to reliably sample and store data. A flip-flop is typically a fundamental component of a semiconductor chip with a single phase clock and a major determinant of the overall clocking speed of a microcontroller or microprocessor. A novel pulse-based multiple-bit high-speed flip-flop 400 is used to accelerate the functionality and performance of a processor.

    An individual cell of the pulse-based multiple-bit high-speed flip-flop 400 includes an input stage with a push-pull gate driver 402. The push-pull gate driver 402 operates as a push-pull circuit for driving short-duration pulses to a multiple-bit storage circuit 428 and an output line q via an inverter 438. The push-pull gate driver 402 has four MOSFETs connected in series in a source-drain pathway between VDD and VCC references including a p-channel MOSFET 418, a p-channel MOSFET 420, an n-channel MOSFET 422, and an n-channel MOSFET 424. P-channel MOSFET 418 and n-channel MOSFET 424 have gate terminals connected to the input signal d. The p-channel MOSFET 420 has a source-drai

