Title: Processor with multiple-thread, vertically-threaded pipeline
Abstract: A processor reduces wasted cycle time resulting from stalling and idling, and increases the proportion of execution time, by supporting and implementing both vertical multithreading and horizontal multithreading. Vertical multithreading permits overlapping or "hiding" of cache miss wait times. In vertical multithreading, multiple hardware threads share the same processor pipeline. A hardware thread is typically a process, a lightweight process, a native thread, or the like in an operating system that supports multithreading. Horizontal multithreading increases parallelism within the processor circuit structure, for example within a single integrated circuit die that makes up a single-chip processor. To further increase system parallelism in some processor embodiments, multiple processor cores are formed in a single die. Advances in on-chip multiprocessor horizontal threading are gained as processor core sizes are reduced through technological advancements.
Patent Number: 6,938,147 Issued on 08/30/2005 to Joy, et al.
Inventors: Joy; William N. (Aspen, CO); Tremblay; Marc (Menlo Park, CA); Lauterbach; Gary (Los Altos, CA); Chamdani; Joseph I. (Santa Clara, CA)
Assignee: Sun Microsystems, Inc. (Santa Clara, CA)
Appl. No.: 309732
Filed: May 11, 1999
Current U.S. Class: 712/28
Intern'l Class: G06F 009/00
Field of Search: 712/23,28,239; 711/117
Primary Examiner: Eng; David Y.
Attorney, Agent or Firm: Zagorin O'Brien Graham LLP
Claims
1. A processor comprising:
a shared processor pipeline that includes therein a plurality of multiple-bit
flip-flops, each multiple-bit flip-flop capable of concurrently holding in the
shared processor pipeline, at least a portion of thread state for a plurality of
execution threads, one of the execution threads being actively executed in the shared
processor pipeline at a given time; and
thread control logic coupled to the shared processor pipeline and capable of
controlling the shared processor pipeline to select thread state for an active
one of the execution threads, including the portion of the thread state represented
in the multiple-bit flip-flops of the shared processor pipeline.
2. A processor according to claim 1 wherein the plurality of multiple-bit flip-flops
are capable of generating multiple thread paths without using multiplexers to select
from among the plurality of execution threads.
3. A processor according to claim 1 wherein the plurality of multiple-bit flip-flops
are fabricated in circuits the size of single-bit flip-flops so that multiple threads
are supported without increasing the surface area of the integrated circuit, maintaining
the footprint of single-thread circuits so that integrated circuit die size is maintained.
4. A processor according to claim 1, wherein the multiple-bit flip-flops include
a plurality of latches controlled by thread switches.
5. A processor according to claim 4 wherein the latches are removed from a direct
path of signal propagation so that signal speed is not degraded.
6. A processor according to claim 1 further comprising:
a register file coupled to the shared processor pipeline and capable of supplying
data to the processor pipeline for execution.
7. A processor according to claim 1 further comprising:
a register file coupled to the shared processor pipeline and capable of supplying
data to the processor pipeline for execution, the register file being a multiple-dimensional
register file that includes a plurality of two-dimensional storage planes.
8. A processor comprising:
a shared processor pipeline including a plurality of pulse-based multiple-bit
high-speed flip-flops, each pulse-based multiple-bit high-speed flip-flop including
a plurality of latches, wherein the shared processor pipeline includes a plurality
of processing units capable of executing a plurality of instructions in parallel,
the shared processor pipeline being capable of concurrently holding a plurality
of execution threads, one of the plurality of execution threads being actively
executed; and
a thread control logic coupled to the shared processor pipeline that is capable
of controlling the shared processor pipeline to select a thread machine state of
the plurality of execution threads to be in either an actively executed state or
a held state.
9. A processor according to claim 8 further comprising:
the plurality of processing units including one or more integer arithmetic logic
units and one or more floating point arithmetic logic units.
10. A processor according to claim 8 further comprising:
the plurality of processing units including one or more graphic units.
11. A processor according to claim 8 wherein an individual processing unit of
the plurality of processing units further includes:
a plurality of load/store units coupled to the shared processor pipeline, the plurality
of load/store units being respectively allocated to the plurality of execution
threads.
12. A processor according to claim 8 further comprising:
a plurality of load/store units coupled to the shared processor pipeline, the
plurality of load/store units being respectively allocated to the plurality of
execution threads; and
an external cache control unit coupled to the plurality of load/store units,
the load/store units being shared among the plurality of execution threads and
being shared among the plurality of processing units.
13. A processor according to claim 8 wherein an individual processing unit of
the plurality of processing units further includes:
a data storage unit coupled to the execution pipeline and shared among the plurality
of execution threads.
14. A processor according to claim 13 wherein the data storage unit further includes:
a data cache coupled to the execution pipeline and shared among the plurality
of execution threads; and
a data memory management unit coupled to the data cache.
15. A processor according to claim 8 wherein an individual processing unit of
the plurality of processing units further includes:
an instruction control logic coupled to the shared processor pipeline and shared
among the plurality of execution threads.
16. A processor according to claim 15 wherein the instruction control logic further includes:
an instruction cache coupled to the shared processor pipeline;
a branch predict logic coupled to the instruction cache; and
an instruction memory management unit coupled to the instruction cache and coupled
to the branch predict logic.
17. A processor according to claim 8, further comprising:
a tag RAM integrated into the single integrated circuit, the tag RAM supporting
a two-way external cache, the tag RAM being shared among the plurality of processing
units.
18. A processor comprising:
a plurality of processing units that operate in a shared processor pipeline,
the shared processor pipeline including a plurality of pulse-based multiple-bit
high-speed flip-flops, each pulse-based multiple-bit high-speed flip-flop including
a plurality of latches, wherein the plurality of processing units are capable of
concurrently holding a plurality of execution threads as a plurality of shadow
states, the individual shadow states being respectively allocated to an execution
thread of a plurality of execution threads; and
a thread control logic coupled to the shared processor pipeline that is capable
of controlling the shared processor pipeline to select a thread machine state of
the plurality of execution threads, the thread machine state of the individual
execution threads being an actively executed state or a held state.
19. A processor according to claim 18 further comprising:
an external cache control unit coupled to the plurality of processing units and
shared among the plurality of execution threads, the external cache control unit
being coupled to an external cache RAM.
20. A processor according to claim 18 further comprising:
a memory control unit coupled to the external cache control unit, the memory
control unit including a cache miss processing and interfacing logic for supplying
a plurality of execution threads to the plurality of processing units in thread
processing.
21. A processor according to claim 18 further comprising:
a Peripheral Component Interconnect (PCI) interface coupled to the external cache
control unit.
22. A processor according to claim 18 wherein the processor is integrated into
a single integrated circuit.
23. A processor comprising:
a plurality of processing units in a single integrated circuit that are each
capable of executing respective pluralities of execution threads in respective
pipelines thereof, the pipelines each being capable of concurrently representing
therein at least a portion of thread state for plural execution threads, the pipelines
including a plurality of multiple-bit flip-flops, each multiple-bit flip-flop including
a plurality of latches for representing respective ones of the thread states;
thread control logic coupled to at least one of the pipelines and capable of
controlling the pipeline to select an active one of the represented thread states;
and
an external cache control unit coupled to the pipelines and shared thereamongst.
24. A processor according to claim 23 further comprising:
an external cache arbiter integrated into the single integrated circuit and coupled
to the external cache control units of the plurality of processing units.
25. A processor according to claim 23 further comprising:
an external cache arbiter integrated into the single integrated circuit and coupled
to the external cache control units of the plurality of processing units; and
a cache integrated into the single integrated circuit and coupled to the external
cache arbiter.
26. A processor according to claim 23 wherein the multiple-bit flip-flops have
a latch structure coupled to a plurality of select-bus lines, the select-bus lines
selecting an active thread from among the plurality of execution threads.
27. A processor according to claim 23 wherein an individual processing unit of
the plurality of processing units further includes:
a memory control unit coupled to the external cache control unit.
28. A processor according to claim 23 wherein an individual processing unit
of the plurality of processing units further includes:
a Peripheral Component Interconnect (PCI) interface coupled to the external cache
control unit.
29. A processor according to claim 23, further comprising:
a multiple-dimension register file coupled to the pipelines, the multiple-dimension
register file including register instances replicated for storage of register state
for respective execution threads concurrently represented in the pipelines.
30. A processor according to claim 23, wherein the multiple-bit flip-flops are pulse-based.
31. A method of operating a processor comprising:
concurrently representing thread states for a plurality of execution threads
in multiple-bit flip-flops of a shared processor pipeline, each multiple-bit flip-flop
including a plurality of latches for representing a portion of the respective thread
states;
actively executing one of the plurality of execution threads; and
controlling the shared processor pipeline to select a respective one of the concurrently
represented thread states.
32. A method according to claim 31, further comprising:
representing, using replicated register instances, register state for respective
execution threads for which thread state is concurrently represented in the shared
processor pipeline.
33. A processor comprising:
a shared processor pipeline including a plurality of pulse-based multiple-bit
flip-flops, each pulse-based multiple-bit flip-flop including a plurality of latches,
wherein the shared processor pipeline is capable of concurrently representing at
least a portion of thread state for a plurality of execution threads, one of the
plurality of execution threads being actively executed, at least a portion of thread
state for others of the plurality of execution threads being held within the shared
processor pipeline pending selection for execution; and
thread control logic coupled to the shared processor pipeline that is capable
of controlling activation and deactivation of the plurality of execution threads
in the shared processor pipeline.
34. A processor according to claim 33, further comprising:
a multiple-dimension register file coupled to the shared processor pipeline,
the multiple-dimension register file including register instances replicated for
storage of register state for respective execution threads concurrently represented
in the shared processor pipeline.
35. A processor comprising:
a vertically multithreaded processor pipeline including a plurality of pipeline
registers defined therein that include, for respective storage positions thereof,
multiple-bit flip-flops wherein respective ones of the multiple-bits encode at
least a portion of thread state for respective execution threads concurrently represented
in the processor pipeline; and
thread control logic coupled to the processor pipeline and selective for the
respective bits of the multiple-bit flip-flops, which correspond to an active one
of the execution threads.
36. A processor according to claim 35, further comprising:
a multiple-dimension register file coupled to the processor pipeline, the multiple-dimension
register file including register instances replicated for storage of register state
for respective execution threads concurrently represented in the processor pipeline.
37. A processor according to claim 35, wherein the multiple-bit flip-flops are pulse-based.
Description
CROSS-REFERENCE
The present invention is related to subject matter disclosed in the following
co-pending patent applications which are incorporated by reference herein in their entirety:
- 1. United States patent application entitled, "Vertically-Threaded Processor
with Multi-Dimensional Storage",
- 2. United States patent application entitled, "Multi-Threaded Processor
By Multiple-Bit Flip-Flop Global Substitution",
- 3. United States patent application entitled, "Switching Method in a
Multi-Threaded Processor", atty. docket no.: SP-3878 US, naming William Joy,
Marc Tremblay, Gary Lauterbach, and Joseph Chamdani as inventors and filed on even
date herewith;
- 4. United States patent application entitled, "Multiple-Thread Processor
with Single-Thread Interface Shared among Threads", atty. docket no.: SP3877 US, naming
William Joy, Marc Tremblay, Gary Lauterbach, and Joseph Chamdani as inventors and
filed on even date herewith; and
- 5. United States patent application entitled, "Thread Switch Logic in
a Multiple-Thread Processor", atty. docket no.: SP-3879 US, naming William
Joy, Marc Tremblay, Gary Lauterbach, and Joseph Chamdani as inventors and filed
on even date herewith.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to processor or computer architecture. More specifically,
the present invention relates to multiple-threading processor architectures and
methods of operation and execution.
2. Description of the Related Art
In many commercial computing applications, a large percentage of time elapses
during pipeline stalling and idling, rather than in productive execution, due to
cache misses and latency in accessing external caches or external memory following
the cache misses. Stalling and idling are most detrimental, due to frequent cache
misses, in database handling operations such as OLTP, DSS, data mining, financial
forecasting, mechanical and electronic computer-aided design (MCAD/ECAD), web servers,
data servers, and the like. Thus, although a processor may execute at high speed,
much time is wasted while idly awaiting data.
One technique for reducing stalling and idling is hardware multithreading to
achieve processor execution during otherwise idle cycles. Hardware multithreading
involves replication of some processor resources, for example replication of architected
registers, for each thread. Replication is not required for most processor resources,
including instruction and data caches, translation look-aside buffers (TLB), instruction
fetch and dispatch elements, branch units, execution units, and the like.
Unfortunately, duplication of resources is costly in terms of integrated
circuit consumption and performance.
Accordingly, improved multithreading circuits and operating methods
are needed that are economical in resources and avoid costly overhead which reduces
processor performance.
SUMMARY OF THE INVENTION
A processor reduces wasted cycle time resulting from stalling and idling, and
increases the proportion of execution time, by supporting and implementing both vertical
multithreading and horizontal multithreading. Vertical multithreading permits overlapping
or "hiding" of cache miss wait times. In vertical multithreading, multiple hardware
threads share the same processor pipeline. A hardware thread is typically a process,
a lightweight process, a native thread, or the like in an operating system that
supports multithreading. Horizontal multithreading increases parallelism within
the processor circuit structure, for example within a single integrated circuit
die that makes up a single-chip processor. To further increase system parallelism
in some processor embodiments, multiple processor cores are formed in a single die.
Advances in on-chip multiprocessor horizontal threading are gained as processor
core sizes are reduced through technological advancements.
The described processor structure and operating method may be implemented in
many structural variations. For example, two processor cores are combined with an
on-chip set-associative L2 cache in one system. In another example, four processor
cores are combined with a direct RAMBUS interface with no external L2 cache.
Countless variations are possible. In some systems, each processor core
is a vertically-threaded pipeline.
In a further aspect of some multithreading system and method embodiments, a computing
system may be configured in many different processor variations that allocate execution
among a plurality of execution threads. For example, in a "1C2T" configuration,
a single processor die includes two vertical threads. In a "4C4T" configuration,
a four-processor multiprocessor is formed on a single die with each of the four
processors being four-way vertically threaded. Countless other "nCkT" structures
and combinations may be implemented on one or more integrated circuit dies depending
on the fabrication process employed and the applications envisioned for the processor.
Various systems may include caches that are selectively configured, for example
as segregated L1 caches and segregated L2 caches, or segregated L1 caches and shared
L2 caches, or shared L1 caches and shared L2 caches.
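As an illustration of the "nCkT" naming used above, the following minimal sketch parses such a label and computes the total number of hardware thread contexts on a die. The names `NCkTConfig` and `parse_nckt` are hypothetical and serve only this example; the sketch assumes the label has the form "<cores>C<threads per core>T" with no embedded spaces.

```python
import re
from typing import NamedTuple


class NCkTConfig(NamedTuple):
    """Hypothetical container for an "nCkT" processor configuration."""
    cores: int              # n: processor cores on the die
    threads_per_core: int   # k: vertical threads sharing each core's pipeline

    @property
    def hardware_threads(self) -> int:
        # Each of the n cores holds k vertical thread contexts.
        return self.cores * self.threads_per_core


def parse_nckt(label: str) -> NCkTConfig:
    """Parse a label such as "4C4T" into its core and thread counts."""
    match = re.fullmatch(r"(\d+)C(\d+)T", label.strip())
    if match is None:
        raise ValueError(f"not an nCkT label: {label!r}")
    return NCkTConfig(int(match.group(1)), int(match.group(2)))


if __name__ == "__main__":
    for label in ("1C2T", "4C4T"):
        cfg = parse_nckt(label)
        print(label, cfg.cores, cfg.threads_per_core, cfg.hardware_threads)
```

Under this reading, a "4C4T" die exposes sixteen thread contexts, while a "1C2T" die exposes two.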
In an aspect of some multithreading system and method embodiments, in response
to a cache miss stall a processor freezes the entire pipeline state of an executing
thread. The processor executes instructions and manages the machine state of each
thread separately and independently. The functional properties of an independent
thread state are stored throughout the pipeline extending to the pipeline registers
to enable the processor to postpone execution of a stalling thread, relinquish
the pipeline to a previously idle thread, later resuming execution of the postponed
stalling thread at the precise state of the stalling thread immediately prior to
the thread switch.
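The freeze-and-resume behavior described above can be sketched in software as follows. This is a behavioral model under stated assumptions, not a description of the actual circuit: the class names `ThreadState` and `VerticalPipeline`, the three-stage depth, and the round-robin choice of the next thread are all hypothetical. The point illustrated is that each thread keeps its own copy of the pipeline-register contents, so a stalled thread resumes at exactly the state it had before the switch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ThreadState:
    """Per-thread machine state held inside the pipeline (hypothetical model)."""
    pc: int = 0
    stage_latches: List[Optional[str]] = field(default_factory=list)


class VerticalPipeline:
    """One shared pipeline holding state for several threads, one active at a time."""

    def __init__(self, num_threads: int = 2, depth: int = 3) -> None:
        self.threads = [ThreadState(stage_latches=[None] * depth)
                        for _ in range(num_threads)]
        self.active = 0

    def cycle(self, fetch: Callable[[int], str]) -> Optional[str]:
        """Advance only the active thread by one cycle; other threads stay frozen."""
        t = self.threads[self.active]
        retired = t.stage_latches[-1]
        # Shift the active thread's pipeline registers and fetch a new instruction.
        t.stage_latches = [fetch(t.pc)] + t.stage_latches[:-1]
        t.pc += 1
        return retired

    def switch_on_miss(self) -> None:
        """On a cache-miss stall, hand the pipeline to the next thread (assumed round-robin)."""
        self.active = (self.active + 1) % len(self.threads)


if __name__ == "__main__":
    pipe = VerticalPipeline()
    fetch = lambda pc: f"insn{pc}"
    pipe.cycle(fetch)
    pipe.cycle(fetch)
    before = (pipe.threads[0].pc, list(pipe.threads[0].stage_latches))
    pipe.switch_on_miss()          # thread 0 stalls; thread 1 takes the pipeline
    for _ in range(5):
        pipe.cycle(fetch)
    pipe.switch_on_miss()          # thread 0 resumes
    print(before == (pipe.threads[0].pc, pipe.threads[0].stage_latches))
```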
In another aspect of some multithreading system and method embodiments, a processor
includes a "four-dimensional" register structure in which register file structures
are replicated by N for vertical threading in combination with a three-dimensional
storage circuit. The multi-dimensional storage is formed by constructing a storage,
such as a register file or memory, as a plurality of two-dimensional storage planes.
In another aspect of some multithreading system and method embodiments, a processor
implements N-bit flip-flop global substitution. To implement multiple machine states,
the processor converts 1-bit flip-flops in storage cells of the stalling vertical
thread to an N-bit global flip-flop where N is the number of vertical threads.
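The N-bit flip-flop substitution can be pictured with a small behavioral model in which each storage position holds one latch per vertical thread and a thread identifier selects which latch is read or written. The class name `NBitFlipFlop` and its methods are hypothetical; the sketch models logical behavior only and says nothing about circuit area or timing.

```python
class NBitFlipFlop:
    """Behavioral model of an N-bit "thread selectable" flip-flop (hypothetical).

    One 1-bit latch is kept per vertical thread. Only the latch selected by
    the thread identifier (TID) is updated or observed; the others hold state.
    """

    def __init__(self, num_threads: int) -> None:
        if num_threads < 1:
            raise ValueError("num_threads must be at least 1")
        self._latches = [0] * num_threads

    def clock(self, tid: int, d: int) -> None:
        """Capture input bit d into the latch of the active thread."""
        self._latches[tid] = d & 1

    def q(self, tid: int) -> int:
        """Return the output bit seen while thread tid is active."""
        return self._latches[tid]


if __name__ == "__main__":
    ff = NBitFlipFlop(num_threads=2)
    ff.clock(tid=0, d=1)    # thread 0 writes a 1
    ff.clock(tid=1, d=0)    # after a thread switch, thread 1 writes a 0
    print(ff.q(0), ff.q(1)) # thread 0's bit is preserved across the switch
```

With N equal to 1 the model reduces to an ordinary single-bit flip-flop, which is the sense in which the N-bit cell substitutes for the original storage cell.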
In one aspect of some processor and processing method embodiments, the processor
improves throughput efficiency and exploits increased parallelism by introducing
multithreading to an existing and mature processor core. The multithreading is
implemented in two steps including vertical multithreading and horizontal multithreading.
The processor core is retrofitted to support multiple machine states. System embodiments
that exploit retrofitting of an existing processor core advantageously leverage
hundreds of man-years of hardware and software development by extending the lifetime
of a proven processor pipeline generation.
In another aspect of some multithreading system and method embodiments, a processor
includes logic for tagging a thread identifier (TID) for usage with processor blocks
that are not stalled. Pertinent non-stalling blocks include caches, translation
look-aside buffers (TLB), a load buffer asynchronous interface, an external memory
management unit (MMU) interface, and others.
In a further aspect of some multithreading system and method embodiments, a processor
includes a cache that is segregated into a plurality of N cache parts. Cache segregation
avoids interference, "pollution", or "cross-talk" between threads. One technique
for cache segregation utilizes logic for storing and communicating thread identification
(TID) bits. The cache utilizes cache indexing logic. For example, the TID bits
can be inserted at the most significant bits of the cache index.
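The following sketch shows one way the TID bits might be placed at the most significant bits of the cache index so that each thread is confined to its own region of the cache. The function name `segregated_index`, the 32-byte line size, and the 512-set cache are assumptions chosen for illustration; none of these values comes from the text above.

```python
def _log2_exact(value: int, name: str) -> int:
    """Return log2(value), requiring value to be a power of two."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{name} must be a power of two, got {value}")
    return value.bit_length() - 1


def segregated_index(addr: int, tid: int, line_bytes: int = 32,
                     num_sets: int = 512, num_threads: int = 2) -> int:
    """Form a cache set index whose top bits are the thread identifier.

    The low index bits come from the address; the TID replaces the most
    significant index bits, splitting the cache into num_threads equal parts.
    """
    offset_bits = _log2_exact(line_bytes, "line_bytes")
    index_bits = _log2_exact(num_sets, "num_sets")
    tid_bits = _log2_exact(num_threads, "num_threads")
    if tid_bits > index_bits:
        raise ValueError("more threads than cache sets")
    if not 0 <= tid < num_threads:
        raise ValueError(f"tid {tid} out of range")
    addr_bits = index_bits - tid_bits
    low = (addr >> offset_bits) & ((1 << addr_bits) - 1)
    return (tid << addr_bits) | low


if __name__ == "__main__":
    # The same address lands in different cache parts for different threads.
    print(segregated_index(0x1234, tid=0), segregated_index(0x1234, tid=1))
```

In this sketch, setting `num_threads` to 1 leaves the index unsegregated, which corresponds to an ordinary shared cache.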
In another aspect of some multithreading system and method embodiments, a processor
includes a thread switching control logic that performs a fast thread-switching
operation in response to an L1 cache miss stall. The fast thread-switching operation
implements one or more of several thread-switching methods. A first thread-switching
operation is "oblivious" thread-switching every N cycles, in which the individual
flip-flops locally determine a thread-switch without notification of stalling.
The oblivious technique avoids usage of an extra global interconnection between
threads for thread selection. A second thread-switching operation is "semi-oblivious"
thread-switching for use with an existing "pipeline stall" signal (if any). The
pipeline stall signal operates in two capacities, first as a notification of a
pipeline stall, and second as a thread select signal between threads so that, again,
usage of an extra global interconnection between threads for thread selection is
avoided. A third thread-switching operation is an "intelligent global scheduler"
thread-switching in which a thread switch decision is based on a plurality of signals
including: (1) an L1 data cache miss stall signal, (2) an instruction buffer empty
signal, (3) an L2 cache miss signal, (4) a thread priority signal, (5) a thread
timer signal, or (6) other sources of triggering. In some embodiments, the thread
select signal is broadcast as fast as possible, similar to a clock tree distribution.
In some systems, a processor derives a thread select signal that is applied to
the flip-flops by overloading a scan enable (SE) signal of a scannable flip-flop.
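The three thread-switching operations can be contrasted with the following behavioral sketch. All names (`oblivious_tid`, `semi_oblivious_tid`, `ThreadSignals`, `scheduler_tid`) are hypothetical, and the selection rule used for the "intelligent global scheduler" (switch away from a stalled or timed-out thread to the highest-priority thread that is not stalled) is an assumption made for illustration; the text above lists the input signals but does not specify how they are combined.

```python
from dataclasses import dataclass
from typing import Sequence


def oblivious_tid(cycle: int, n_cycles: int, num_threads: int) -> int:
    """Oblivious switching: rotate threads every n_cycles, with no stall input."""
    return (cycle // n_cycles) % num_threads


def semi_oblivious_tid(current_tid: int, pipeline_stall: bool,
                       num_threads: int) -> int:
    """Semi-oblivious switching: the stall signal doubles as the thread select."""
    return (current_tid + 1) % num_threads if pipeline_stall else current_tid


@dataclass
class ThreadSignals:
    """Per-thread inputs to the assumed global scheduler."""
    l1_dcache_miss: bool = False
    ibuf_empty: bool = False
    l2_miss: bool = False
    priority: int = 0
    timer_expired: bool = False

    def stalled(self) -> bool:
        return self.l1_dcache_miss or self.ibuf_empty or self.l2_miss


def scheduler_tid(current_tid: int, signals: Sequence[ThreadSignals]) -> int:
    """Assumed policy: leave a stalled or timed-out thread for the best ready one."""
    cur = signals[current_tid]
    if not (cur.stalled() or cur.timer_expired):
        return current_tid
    ready = [tid for tid, s in enumerate(signals)
             if tid != current_tid and not s.stalled()]
    if not ready:
        return current_tid   # nothing better to run; keep the current thread
    return max(ready, key=lambda tid: signals[tid].priority)


if __name__ == "__main__":
    print([oblivious_tid(c, 4, 2) for c in range(0, 12, 4)])
    print(semi_oblivious_tid(0, pipeline_stall=True, num_threads=2))
    sigs = [ThreadSignals(l1_dcache_miss=True),
            ThreadSignals(priority=1),
            ThreadSignals(priority=3)]
    print(scheduler_tid(0, sigs))
```

The first two functions need no information beyond a cycle count or the existing stall signal, which reflects why those operations avoid an extra global interconnection for thread selection.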
In an additional aspect of some multithreading system and method embodiments,
a processor includes anti-aliasing logic coupled to an L1 cache so that the L1
cache is shared among threads via anti-aliasing. The L1 cache is a virtually-indexed,
physically-tagged cache that is shared among threads. The anti-aliasing logic avoids
hazards that result from multiple virtual addresses mapping to one physical address.
The anti-aliasing logic selectively invalidates or updates duplicate L1 cache entries.
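A minimal software model of the invalidation case is sketched below. It assumes a direct-mapped cache in which each entry is keyed by a virtual index and records the physical line address it holds; the class name `SharedL1Model` and its methods are hypothetical. When a line is filled under one virtual index, any entry holding the same physical line under a different virtual index is invalidated, so at most one copy of a physical line exists at a time. The alternative of updating the duplicate rather than invalidating it is not modeled.

```python
from typing import Dict, List


class SharedL1Model:
    """Toy virtually-indexed, physically-tagged cache with anti-aliasing (hypothetical)."""

    def __init__(self) -> None:
        # virtual index -> physical line address currently cached there
        self.lines: Dict[int, int] = {}

    def fill(self, vindex: int, pline: int) -> List[int]:
        """Install a line and invalidate aliases; return the invalidated indices."""
        aliases = [i for i, p in self.lines.items()
                   if p == pline and i != vindex]
        for i in aliases:
            del self.lines[i]   # duplicate entry for the same physical line
        self.lines[vindex] = pline
        return aliases

    def hit(self, vindex: int, pline: int) -> bool:
        """A hit requires the physical tag stored at this virtual index to match."""
        return self.lines.get(vindex) == pline


if __name__ == "__main__":
    cache = SharedL1Model()
    cache.fill(vindex=3, pline=0x100)           # first thread maps the line at index 3
    removed = cache.fill(vindex=7, pline=0x100) # second thread aliases it at index 7
    print(removed, cache.hit(3, 0x100), cache.hit(7, 0x100))
```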
In another aspect of some multithreading system and method embodiments, a processor
includes logic for attaining a very fast exception handling functionality while
executing non-threaded programs by invoking a multithreaded-type functionality
in response to an exception condition. The processor, while operating in multithreaded
conditions or while executing non-threaded programs, progresses through multiple
machine states during execution. The very fast exception handling logic includes
connection of an exception signal line to thread select logic, causing an exception
signal to evoke a switch in thread and machine state. The switch in thread and
machine state causes the processor to enter and to exit the exception handler immediately,
without waiting to drain the pipeline or queues and without the inherent timing
penalty of the operating system's software saving and restoring of registers.
An additional aspect of some multithreading systems and methods is a thread reservation
system or thread locking system in which a thread pathway is reserved for usage
by a selected thread. A thread control logic may select a particular thread that
is to execute with priority in comparison to other threads. A high priority thread
may be associated with an operation with strict time constraints, an operation
that is frequently and predominantly executed in comparison to other threads. The
thread control logic controls thread-switching operation so that a particular hardware
thread is reserved for usage by the selected thread.
In another aspect of some multithreading system and method embodiments, a processor
includes logic supporting lightweight processes and native threads. The logic includes
a block that disables thread ID tagging and disables cache segregation since lightweight
processes and native threads share the same virtual tag space.
In a further additional aspect of some embodiments of the multithreading system
and method, some processors include a thread reservation functionality.
BRIEF DESCRIPTION OF THE DRAWINGS
The features of the described embodiments are specifically set forth in the appended
claims. However, embodiments of the invention relating to both structure and method
of operation may best be understood by referring to the following description
and accompanying drawings.
FIGS. 1A and 1B are timing diagrams respectively illustrating execution flow
of a single-thread processor and a vertical multithread processor.
FIGS. 2A, 2B, and 2C are timing diagrams respectively illustrating
execution flow of a single-thread processor, a vertical multithread processor,
and a vertical and horizontal multithread processor.
FIG. 3 is a schematic functional block diagram depicting a design configuration
for a single-processor vertically-threaded processor that is suitable for implementing
various multithreading techniques and system implementations that improve multithreading
performance and functionality.
FIGS. 4A, 4B, and 4C are diagrams showing an embodiment of a
pulse-based high-speed flip-flop that is advantageously used to attain multithreading
in an integrated circuit. FIG. 4A is a schematic block diagram illustrating control
and storage blocks of a circuit employing high-speed multiple-bit flip-flops. FIG.
4B is a schematic circuit diagram that shows a multiple-bit bistable multivibrator
(flip-flop) circuit. FIG. 4C is a timing diagram illustrating timing of the multiple-bit flip-flop.
FIG. 5 is a schematic block diagram illustrating an N-bit "thread selectable"
flip-flop substitution logic that is used to create vertically multithreaded functionality
in a processor pipeline while maintaining the same circuit size as a single-threaded pipeline.
FIG. 6 is a schematic block diagram illustrating a thread switch logic which
rapidly generates a thread identifier (TID) signal identifying an active thread
of a plurality of threads.
FIGS. 7A and 7B are, respectively, a schematic block diagram showing an example
of a segregated cache and a pictorial diagram showing an example of an addressing
technique for the segregated cache.
FIG. 8 is a schematic block diagram showing a suitable anti-aliasing logic for
usage in various processor implementations including a cache, such as an L1 cache,
an L2 cache, or others.
FIG. 9 is a schematic functional block diagram depicting a design configuration
for a single-chip dual-processor vertically-threaded processor that is suitable
for implementing various multithreading techniques and system implementations that
improve multithreading performance and functionality.
FIG. 10 is a schematic functional block diagram depicting an alternative design
configuration for a single-processor vertically-threaded processor that is suitable
for implementing various multithreading techniques and system implementations that
improve multithreading performance and functionality.
FIG. 11 is a schematic functional block diagram depicting an alternative design
configuration for a single-chip dual-processor vertically-threaded processor that
is suitable for implementing various multithreading techniques and system implementations
that improve multithreading performance and functionality.
FIG. 12 is a schematic block diagram illustrating a processor and processor
architecture that are suitable for implementing various multithreading techniques
and system implementations that improve multithreading performance and functionality.
FIG. 13 is a schematic perspective diagram showing a multi-dimensional register file.
FIG. 14 is a schematic circuit diagram showing a conventional implementation
of register windows.
FIG. 15 is a schematic circuit diagram showing a plurality of bit cells of the
register windows of the multi-dimensional register file that avoids waste of integrated
circuit area by exploiting the condition that only one window is read and only
one window is written at one time.
FIG. 16 is a schematic circuit diagram illustrating a suitable bit storage circuit
storing one bit of the local registers for the multi-dimensional register file
with eight windows.
FIGS. 17A and 17B are, respectively, a schematic pictorial diagram and a schematic
block diagram illustrating sharing of registers among adjacent windows.
FIG. 18 is a schematic circuit diagram illustrating an implementation of a multi-dimensional
register file for registers shared across a plurality of windows.
The use of the same reference symbols in different drawings indicates similar
or identical items.
DESCRIPTION OF THE EMBODIMENT(S)
Referring to FIGS. 1A and 1B, two timing diagrams respectively illustrate
execution flow 110 in a single-thread processor and instruction flow 120
in a vertical multithread processor. Processing applications such as database applications
spend a significant portion of execution time stalled awaiting memory servicing.
FIG. 1A is a highly schematic timing diagram showing execution flow 110
of a single-thread processor executing a database application. In an illustrative
example, the single-thread processor is a four-way superscalar processor. Shaded
areas 112 correspond to periods of execution in which the single-thread
processor core issues instructions. Blank areas 114 correspond to time periods
in which the single-thread processor core is stalled waiting for data or instructions
from memory or an external cache. A typical single-thread processor executing a
typical database application executes instructions about 30% of the time, with
the remaining 70% of the time elapsed in a stalled condition. The 30% utilization
rate exemplifies the inefficient usage of resources by a single-thread processor.
FIG. 1B is a highly schematic timing diagram showing execution flow 120
of similar database operations by a multithread processor. Applications such as
database applications have a large amount of inherent parallelism due to the heavy
throughput orientation of database applications and the common database functionality
of processing several independent transactions at one time. The basic concept of
exploiting multithread functionality involves utilizing processor resources efficiently
when a thread is stalled by executing other threads while the stalled thread remains
stalled. The execution flow 120 depicts a first thread 122, a second
thread 124, a third thread 126 and a fourth thread 128, all
of which are shown with shading in the timing diagram. As one thread stalls, for
example first thread 122, another thread, such as second thread 124,
switches into execution on the otherwise unused or idle pipeline. Blank areas 130
correspond to idle times when all threads are stalled. Overall processor utilization
is significantly improved by multithreading. The illustrative technique of multithreading
employs replication of architected registers for each thread and is called "vertical multithreading".
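The utilization gain described above can be illustrated with a minimal simulation sketch in Python. The burst length of 3 cycles and the stall length of 7 cycles are assumed values, chosen only so that a single thread issues about 30% of the time. The function name simulate, the fixed burst and stall lengths, and the zero-cost thread switch are likewise assumptions made for illustration and are not taken from the described processor.

```python
def simulate(num_threads, run_cycles=3, stall_cycles=7, total_cycles=10_000):
    """Return the fraction of cycles in which the shared pipeline issues.

    Assumed model: every thread alternates a burst of `run_cycles` issuing
    cycles with a memory stall of `stall_cycles` cycles, and switching
    threads costs no cycles.
    """
    ready_at = [0] * num_threads              # first cycle each thread can run
    remaining = [run_cycles] * num_threads    # cycles left in the current burst
    active = 0
    busy = 0
    for cycle in range(total_cycles):
        if ready_at[active] > cycle:
            # Active thread is stalled: switch to any thread that is ready.
            candidates = [t for t in range(num_threads) if ready_at[t] <= cycle]
            if not candidates:
                continue                      # all threads stalled: idle cycle
            active = candidates[0]
        busy += 1
        remaining[active] -= 1
        if remaining[active] == 0:
            # Burst ends in a cache miss; the thread stalls for stall_cycles.
            ready_at[active] = cycle + 1 + stall_cycles
            remaining[active] = run_cycles
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "thread(s):", simulate(n))
```

Under these assumed parameters, one thread keeps the pipeline busy 30% of the time, two threads 60%, and four threads fill every cycle. Real workloads have irregular stalls, so idle periods such as blank areas 130 still occur.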
Vertical multithreading is advantageous in processing applications in which
frequent cache misses result in heavy clock penalties. When cache misses cause
a first thread to stall, vertical multithreading permits a second thread to execute
when the processor would otherwise remain idle. The second thread thus takes over
execution of the pipeline. A context switch from the first thread to the second
thread involves saving the useful states of the first thread and assigning new
states to the second thread. When the first thread restarts after stalling, the
saved states are returned and the first thread proceeds in execution. Vertical
multithreading imposes costs on a processor in resources used for saving and restoring
thread states.
Referring to FIGS. 2A, 2B, and 2C, three highly schematic
timing diagrams respectively illustrate execution flow 210 of a single-thread
processor, execution flow 230 of a vertical multithread processor, and execution
flow 250 of a combined vertical and horizontal multithread processor. In FIG.
2A, shaded areas 212 showing periods of execution and blank areas 214
showing time periods in which the single-thread processor core is idle due to stall
illustrate the inefficiency of a single-thread processor.
In FIG. 2B, execution flow 230 in a vertical threaded processor includes
execution of a first thread 232, and a second thread 234, both shaded
in the timing diagram, and an idle time shown in a blank area 240. Efficient
instruction execution proceeds as one thread stalls and, in response to the stall,
another thread switches into execution on the otherwise unused or idle pipeline.
In the blank areas 240, an idle time occurs when all threads are stalled.
A vertical multithread processor maintains a separate processing state for T
executing threads. Only one of the threads is active at one time. The vertical
multithreaded processor switches execution to another thread on a cache miss, for
example an L1 cache miss.
A horizontal threaded processor, using a technique called chip-multiple processing,
combines multiple processors on a single integrated circuit die. The multiple processors
are vertically threaded to form a processor with both vertical and horizontal
threading, augmenting execution efficiency and decreasing latency in a multiplicative
fashion. In FIG. 2C execution flow 250 in a vertical and horizontal threaded
processor includes execution of a first thread 252 executing on a first
processor, a second thread 254 executing on the first processor, a first
thread 256 executing on a second processor and a second thread 258
executing on the second processor. An idle time is shown in a blank area 260
for both the first and second processors. Execution of the first thread 252
and the second thread 254 on the first processor illustrates vertical threading.
Similarly, execution of the first thread 256 and the second thread 258
on the second processor illustrates vertical threading. In the illustrative embodiment,
a single integrated circuit includes both the first processor and the second processor,
the multiple processors executing in parallel so that the multithreading operation
is a horizontal multiple-threading or integrated-circuit chip multiprocessing (CMP)
in combination with the vertical multithreading of the first processor and the
second processor. The combination of vertical multithreading and horizontal multithreading
increases processor parallelism and performance, and attains an execution efficiency
that exceeds the efficiency of a processor with only vertical multithreading. The
combination of vertical multithreading and horizontal multithreading also advantageously
reduces communication latency among local (on-chip) multi-processor tasks by eliminating
much signaling on high-latency communication lines between integrated circuit chips.
Horizontal multithreading further advantageously exploits processor speed and power
improvements that inherently result from reduced circuit sizes in the evolution
of silicon processing.
For each vertical threaded processor, efficient instruction execution proceeds
as one thread stalls and, in response to the stall, another thread switches into
execution on the otherwise unused or idle pipeline. In the blank areas 260,
an idle time occurs when all threads are stalled.
Vertical multithreading is advantageously used to overcome or hide cache
miss stalls, thereby continuing execution of the processor despite stalls. Vertical
multithreading thus improves performance in commercial multiprocessor and multithreading
applications. Vertical multithreading advantageously accelerates context switching
time from millisecond ranges to nanosecond ranges. Vertical multithreading is highly
advantageous in all processing environments including embedded, desktop, and server
applications, and the like.
Horizontal multithreading or circuit chip multiprocessing further increases
on-chip parallelism by exploiting increasingly smaller processor core sizes.
Although the illustrative example shows execution of two concurrent vertical
multithreading processors with each concurrent vertical multithreading processor
executing two threads, in other examples various numbers of concurrently executing
processors may execute various numbers of threads. The number of threads that execute
on one processor may be the same or different from the number of threads executing
concurrently and in parallel on another processor.
In some processor designs, vertical and horizontal multithreading are incorporated
into the fundamental design of the processors, advantageously creating modular
and flexible structures that promote scalability of design. In other processor
designs, multithreading is incorporated into existing and mature processor designs
to leverage existing technological bases and increase performance of multiprocessing
and multithreading applications. One highly suitable example of a processor design
for retrofitting with multithreading functionality is an UltraSPARC processor.
In some designs, vertical and horizontal multithreading are achieved with minimal
retrofitting of an existing processor core, advantageously reducing logic and physical
design changes and avoiding global chip re-routing, recomposing, and the expense
of heavy redesign of integrated circuits.
Referring to FIG. 3, a schematic functional block diagram depicts a design
configuration for a single-processor vertically-threaded processor 300 that
is suitable for implementing various multithreading techniques and system implementations
that improve multithreading performance and functionality. The single-processor
vertically-threaded processor 300 has a single pipeline shared among a plurality
of machine states or threads, holding a plurality of machine states concurrently.
A thread that is currently active, not stalled, is selected and supplies data to
functional blocks connected to the pipeline. When the active thread is stalled,
the pipeline immediately switches to a non-stalled thread, if any, and begins executing
the non-stalled thread.
The single-processor vertically-threaded processor 300 includes a thread
0 machine state block 310 that defines a machine state of a first
thread (thread 0). The single-processor vertically-threaded processor 300
also includes a thread 1 machine state block 312 that defines a machine
state of a second thread (thread 1) that "shadows" the machine state of
thread 0. The thread 0 machine state block 310 and the thread
1 machine state block 312 are fabricated in a single integrated circuit
logic structure using a high-speed multi-bit flip-flop design and a "four-dimensional"
register file structure and supply instructions from thread 0 and thread
1 to a shared processor pipeline 314 using vertical threading. The
multiple-dimensional register file employs register file structures that are replicated
by N for vertical threading in combination with a three-dimensional storage circuit.
The three-dimensional storage is formed by constructing a storage, such as a register
file or memory, as a plurality of two-dimensional storage planes.
In response to a cache miss stall the processor 300 freezes the entire
pipeline state of an executing thread in the shared processor pipeline 314.
The processor 300 issues instructions and manages the machine state of each
thread separately and independently. The functional properties of an independent
thread state are stored throughout the pipeline extending to the pipeline registers
to allow the processor 300 to postpone execution of a stalling thread by
freezing the active state in the pipeline, relinquish the pipeline 314 to
a previously idle thread by activating the previously idle thread in the pipeline
while holding the state of the newly idle thread in the pipeline, and later resume
execution of the postponed stalling thread at the precise state of the stalling
thread immediately prior to the thread switch.
The shared processor pipeline 314 is coupled to a dual load/store unit
including a thread 0 load/store unit 316 and a thread 1 load/store
unit 318 that execute load and store data accesses for instruction threads
0 and 1, respectively. The load/store units generate a virtual address
of all load and store operations for accessing a data cache, decoupling load misses
from the pipeline through a load buffer (not shown), and decoupling the stores
through a store buffer. Up to one load or store is issued per cycle.
The shared processor pipeline 314 and the dual load/store unit are connected
to a data memory interface 320 including a shared data cache and a shared
data memory management unit (DMMU). The shared data cache is used to cache data
for both thread 0 and thread 1 instruction sequences. In an illustrative
processor 300, the data cache is a write-through non-allocating 16-kilobyte
direct-mapped 32-byte line cache.
The data cache is virtually-indexed and physically-tagged using a tag array that
is dual-ported so that tag updates resulting from line fills do not collide with
tag reads for incoming loads. Snoops to the data cache use the second tag port
so that an incoming load is processed without being delayed by the snoop. The shared data
memory management unit (DMMU) manages virtual to physical address translation.
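The geometry of the illustrative data cache can be worked through in a short Python sketch. A 16-kilobyte direct-mapped cache with 32-byte lines has 512 lines, so an address splits into a 5-bit line offset, a 9-bit index, and a tag. The names split_address and DataCacheModel are hypothetical. The model tracks tags only and assumes an identity virtual-to-physical translation, so it does not represent the virtually-indexed, physically-tagged distinction or the dual-ported tag array.

```python
CACHE_BYTES = 16 * 1024
LINE_BYTES = 32
NUM_LINES = CACHE_BYTES // LINE_BYTES        # 512 lines
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 5 bits select a byte in a line
INDEX_BITS = NUM_LINES.bit_length() - 1      # 9 bits select a line


def split_address(addr):
    """Split an address into (tag, index, offset) for this cache geometry."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DataCacheModel:
    """Tag-only model of a direct-mapped, write-through, non-allocating cache.

    Assumption: virtual and physical addresses are identical here.
    """

    def __init__(self):
        self.tags = [None] * NUM_LINES

    def load(self, addr):
        """Return True on a hit; a load miss fills (allocates) the line."""
        tag, index, _ = split_address(addr)
        hit = self.tags[index] == tag
        if not hit:
            self.tags[index] = tag
        return hit

    def store(self, addr):
        """Return True on a hit; a store miss writes through, no allocation."""
        tag, index, _ = split_address(addr)
        return self.tags[index] == tag
```

In this sketch, two addresses that differ by exactly 16 kilobytes map to the same index and evict one another, which is the conflict behavior expected of a direct-mapped organization.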
The dual load/store units are also connected to an external cache control unit
(ECU) 322, which is connected to an external cache bus 324. The external
cache control unit 322 is also connected to an UltraPort Architecture Interconnect
(UPA) bus 326 via a memory interface unit (MIU) 328. The external
cache control unit 322 and the memory interface unit (MIU) 328 are
unified between thread 0 and thread 1 to perform functions of cache
miss processing and interfacing with external devices to supply, in combination,
a plurality of execution threads to the thread 0 machine state block 310
and the thread 1 machine state block 312 via a shared instruction
control block 330. The unified external cache control unit 322 and
memory interface unit (MIU) 328 include thread identifier (TID) tagging
to specify and identify a transaction that is accessed via the external cache bus
324 and the UPA bus 326. In the processor 300, TID logging
is only internal to the processor 300 (integrated circuit chip). Outside
the integrated circuit chip, hardware interacts with the processor 300 in
the manner of an interaction with a single CPU with one UPA bus, and one external
cache bus interface. In contrast, software outside the integrated circuit chip
interacts with the processor 300 in the manner of an interaction with two
logical CPUs.
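The role of TID tagging in the unified external cache control unit and memory interface unit can be sketched as follows. The class TaggedTransactionTable and its methods are hypothetical bookkeeping written in Python for illustration; the actual tagging logic is not detailed in this passage. The sketch shows only that a transaction tagged with its thread identifier can complete in any order and still be returned to the thread that issued it.

```python
class TaggedTransactionTable:
    """Hypothetical record of outstanding misses sent through shared units."""

    def __init__(self):
        self._next_id = 0
        self._pending = {}   # transaction id -> (tid, address)

    def issue(self, tid, address):
        """Record a miss for thread `tid` and return a transaction id."""
        txn = self._next_id
        self._next_id += 1
        self._pending[txn] = (tid, address)
        return txn

    def complete(self, txn):
        """Retire a transaction and return (tid, address) of its owner."""
        return self._pending.pop(txn)

    def outstanding(self):
        """Number of transactions still in flight."""
        return len(self._pending)
```

Because the tag lives only in this internal table, an external bus in such a model would see ordinary single-CPU transactions, consistent with the statement that the TID is internal to the integrated circuit chip.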
The instruction control block 330 includes an instruction (L1) cache,
a branch prediction unit, NFRAM, and an instruction memory management unit (IMMU)
all of which are shared between the multiple threads, thread 0 and thread
1. In an illustrative processor, the instruction cache is a 16 kilobyte
two-way set-associative cache with 32-byte blocks. The instruction cache is physically
indexed and physically tagged. The set is predicted as part of a "next field" so
that only index bits of an address are needed to address the cache. The instruction
memory management unit (IMMU) supports virtual to physical address translation
of instruction program counters (PCs). To prefetch across conditional branches,
dynamic branch prediction is implemented in hardware based on a two-bit history
of a branch. In an illustrative processor, a next-field associated with every
four instructions in the instruction cache points to the next cache line to be
fetched. Up to twelve instructions are stored in an instruction buffer and issued
to the pipeline.
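Prediction from a two-bit history can be illustrated with a two-bit saturating counter, sketched below in Python. The description states only that prediction is based on a two-bit history of a branch; the saturating-counter scheme, the initial state, and the names TwoBitPredictor and count_mispredictions are assumptions chosen because the scheme is a common way to use two bits of history.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=1):
        self.state = state   # assumed start: weakly not taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses
```

For a loop branch that is taken nine times and then falls through, the counter mispredicts the loop exit once per pass but still predicts taken on re-entry, because a single not-taken outcome moves it only one step away from the strongly-taken state.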
The external cache control unit 322 manages instruction (L1) cache and
data cache misses, and permits up to one access every other cycle to the external
cache. Load operations that miss in the data cache are remedied by multiple-byte
data cache fills on two consecutive accesses to the external cache. Store operations
are fully pipelined and write-through to the external cache. Instruction prefetches
that miss the instruction cache are remedied by multiple-byte instruction cache
fills using four consecutive accesses to the parity-protected external cache.
The external cache control unit 322 supports DMA accesses which hit in
the external cache and maintains data coherence between the external cache and
the main memory (not shown).
The memory interface unit (MIU) 328 controls transactions to the UPA bus
326. The UPA bus 326 runs at a fraction (for example, ⅓) of
the processor clock.
Vertical multithreading advantageously improves processor performance in
commercial application workloads which have high cache miss rates with a high miss
penalty, low processor utilization (30%-50% on OLTP), and latency periods that
present an opportunity to overlap execution to utilize cache miss wait times.
Vertical multithreading is also highly advantageous in sequential and parallel
processing applications with frequent context switches.
Vertical multithreading does impose some costs on a processor in terms of
resources used to save and restore thread states. The costs vary depending on the
implementation of multithreading resources. For example, a software implementation
typically incurs a time expense that negates any gain in latency. In another example,
pipeline stages may be duplicated while attempting to share as many resources as
possible, disadvantageously resulting in a high cost in silicon area.
An advantageous technique for implementing vertical multithreading, called a
high-speed multi-bit flip-flop design, involves designing pipeline registers (flops) with
multiple storage bits. The individual bits of a flip-flop are each allocated to a separate
thread. When a first thread stalls, typically due to a cache miss, the active bit
of a flip-flop is removed from the pipeline pathway and another bit of the flip-flop
becomes active. The states of the stalled thread are preserved in a temporarily
inactive bit of the individual flip-flops in a pipeline stage. The high-speed multi-bit
flip-flop design utilizes placement of a multiple-bit flip-flop at the end of the
individual pipeline stages. The individual bits of the multiple-bit flip-flop are
individually accessible and controllable to allow switching from a first thread
to a second thread when the first thread stalls.
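The behavior of a multiple-bit flip-flop can be modeled in a few lines of Python. This is a behavioral sketch only: the class names MultiBitFlipFlop and PipelineRegister are hypothetical, two threads are assumed by default, and no circuit timing, scan logic, or pulse generation is represented. The sketch shows that clocking data for the active thread leaves the bit held for the stalled thread unchanged, so the stalled thread resumes with its state intact.

```python
class MultiBitFlipFlop:
    """One pipeline flip-flop cell holding one storage bit per thread."""

    def __init__(self, num_threads=2):
        self.bits = [0] * num_threads

    def clock(self, d, tid):
        """Capture d into the bit owned by thread tid; other bits hold."""
        self.bits[tid] = d & 1

    def q(self, tid):
        """Output seen on the pipeline when thread tid is active."""
        return self.bits[tid]


class PipelineRegister:
    """A stage-boundary register built from multiple-bit flip-flop cells."""

    def __init__(self, width, num_threads=2):
        self.cells = [MultiBitFlipFlop(num_threads) for _ in range(width)]

    def clock(self, value, tid):
        for i, cell in enumerate(self.cells):
            cell.clock((value >> i) & 1, tid)

    def read(self, tid):
        return sum(cell.q(tid) << i for i, cell in enumerate(self.cells))


if __name__ == "__main__":
    reg = PipelineRegister(width=8)
    reg.clock(0xA5, tid=0)      # thread 0 runs, then stalls on a cache miss
    reg.clock(0x3C, tid=1)      # thread 1 takes over the pipeline
    reg.clock(0x7E, tid=1)
    print(hex(reg.read(0)), hex(reg.read(1)))
```

Switching threads in this model requires no copying of state, only a change of the thread identifier used to select the active bit.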
Referring to FIG. 4A, a schematic block diagram illustrates control and
storage blocks of a circuit employing high-speed multiple-bit flip-flops. A multiple-bit
flip-flop storage block 410 includes a storage header block 412 and
a multiple-bit flip-flop block 414. The storage header block 412
supplies timing signals and thread select signals to the multiple-bit flip-flop
block 414. Input signals to the storage header block 412 include
a clock signal 14clk that is supplied from external to the multiple-bit
flip-flop storage block 410, a combined scan enable and clock enable signal
se_ce_l, and a thread identifier (TID) signal tid_g that is supplied from thread
select circuitry external to the multiple-bit flip-flop storage block 410.
The storage header block 412 derives an internal flip-flop clock signal
clk, the inverse of the internal flip-flop clock signal clk_l, and a scan clock
signal sclk from the external clock 14clk and the scan enable and clock
enable signal se_ce_l. The storage header block 412 asserts an internal
thread ID signal tid based on the thread identifier (TID) signal tid_g. The storage
header block 412 drives one or more flip-flop cells in the multiple-bit
flip-flop block 414. Typically, the multiple-bit flip-flop block 414
includes from one to 32 bistable multivibrator cells, although more cells may be
used. The internal flip-flop clock signal clk, the inverse of the internal flip-flop
clock signal clk_l, the scan clock signal sclk, and the internal thread ID signal
tid are supplied from the storage header block 412 to the multiple-bit flip-flop
block 414.
In addition to the internal flip-flop clock signal clk, the inverse of the internal
flip-flop clock signal clk_l, the scan clock signal sclk, and the internal thread
ID signal tid, the multiple-bit flip-flop block 414 also receives an input
signal d and a scan chain input signal si.
Referring to FIG. 4B, a schematic circuit diagram shows a multiple-bit
bistable multivibrator (flip-flop) circuit. A conventional flip-flop is a single-bit
storage structure and is commonly used to reliably sample and store data. A flip-flop
is typically a fundamental component of a semiconductor chip with a single phase
clock and a major determinant of the overall clocking speed of a microcontroller
or microprocessor. A novel pulse-based multiple-bit high-speed flip-flop 400
is used to accelerate the functionality and performance of a processor.
An individual cell of the pulse-based multiple-bit high-speed flip-flop 400
includes an input stage with a push-pull gate driver 402. The push-pull
gate driver 402 operates as a push-pull circuit for driving short-duration
pulses to a multiple-bit storage circuit 428 and an output line q via an
inverter 438. The push-pull gate driver 402 has four MOSFETs connected
in series in a source-drain pathway between VDD and VCC references including a
p-channel MOSFET 418, a p-channel MOSFET 420, an n-channel MOSFET
422, and an n-channel MOSFET 424. P-channel MOSFET 418 and
n-channel MOSFET 424 have gate terminals connected to the input signal d.
The p-channel MOSFET 420 has a source-drai