PC Processor Microarchitecture
Posted by: EzDoum, posted on: April 21, 2002, 08:21 PM
June 9, 2001
By: J. Scott Gardner
Isn't it interesting that new high-tech products seem so complicated, yet only a
few years later we talk about how much simpler the old stuff was? This is
certainly true for microprocessors. As soon as we finally figure out all the new
features and feel comfortable giving advice to our family and friends, we're
confronted with details about a brand-new processor that promises to obsolete
our expertise on the "old" generation. Gone are the simple and familiar diagrams
of the past, replaced by arcane drawings and cryptic buzzwords. For a PC
technology enthusiast, this is like discovering a new world to be explored and
conquered. While many areas will seem strange and unusual, much of the landscape
resembles places we've traveled before. This article is meant to serve as a
faithful companion for this journey, providing a guidebook of the many wondrous
new discoveries we're sure to encounter.
An Objective Tutorial and Analysis of PC Microarchitecture
The goal of this article is to give the reader some tools for understanding the
internals of modern PC microprocessors. In the article "PC Motherboard
Technology", we developed some tools for analyzing a modern PC motherboard. This
article takes us one step deeper, zooming into the complex world inside the PC
processor itself. The internal design of a processor is called the
"microarchitecture". Each CPU vendor uses slightly different techniques for
getting the most out of their design, while meeting their unique performance,
power, and cost goals. The marketing departments from these companies will often
highlight microarchitectural features when promoting their newest CPUs, but it's
often difficult for us PC technology enthusiasts to figure out what it really
means.
What is needed is an objective comparison of the design features for all the CPU
vendors, and that's the goal of this article. We'll walk through the features of
the latest x86 32-bit desktop CPUs from Intel, AMD, and VIA (Centaur). Since the
Transmeta "Crusoe" processor is mostly targeted at the mobile market, we'll
analyze their microarchitecture in another article. It will also be the task for
another article to thoroughly explore Apple's PowerPC G4 microprocessor, and
many of the analytical tools learned here will apply to all high-end processors.
Building a Framework for Comparison
Before we can dive right into the block diagram of a modern CPU, we need to
develop some analytical tools for understanding how these features affect the
operation of the PC system. We also need to develop a common framework for
comparison. As you'll soon see, that is no easy task. There are some radical
differences in architecture between these vendors, and it's difficult to make
direct comparisons. As it turns out, the best way to understand and compare
these new CPUs is to go back to basic computer architectural concepts and show
how each vendor has solved the common problems faced in modern computer design.
In our last section, we'll gaze into the future of PC microarchitecture and make
a few predictions.
Let's Not Lose Sight of What Really Matters
There is one issue that should be stressed right up front. We should never lose
sight of the real objective in computer design. All that really matters is how
well the CPU helps the PC run your software. A PC is a computer system, and
subtle differences in CPU microarchitecture may not be noticeable when you're
running your favorite computer program. We learned this in our article on
motherboard technology, since a well-balanced PC needs to remove all the
bottlenecks (and meet the cost goals of the user). The CPU designers are turning
to more and more elaborate techniques to squeeze extra performance out of these
machines, so it's still really interesting to peek in on the raging battle for
even a few percent better system performance.
For a PC technology enthusiast, it's just downright fascinating how these CPU
architects mix clever engineering tricks with brute-force design techniques to
take advantage of the enormous number of transistors available on the latest
chips.
What Does a Computer Really Do?
It's easy to get buried too deeply in the complexities of these modern machines,
but to really understand the design choices, let's think again about the
fundamental operation of a computer. A computer is nothing more than a machine
that reads a coded instruction, decodes the instruction, and executes it. If the
instruction needs to load or store some data, the computer figures out the
location for the data and moves it. That's it; that's all a computer does. We
can break this operation into a series of stages:
The 5 Computer Operation Stages
Stage 1 Instruction Access (IA)
Stage 2 Instruction Decode (ID)
Stage 3 Execution (EX)
Stage 4 Data Access (DA)
Stage 5 Store (write back) Results (WB)
Some computer architects may re-arrange, combine, or break up the stages, but
every computer microarchitecture does these five things. We can use this
framework to build on as we work our way up to even the most complicated CPUs.
For those of you who eat this stuff for breakfast and are anxious to jump ahead,
remember that we haven't yet talked about pipelines. These stages could all be
completely processed for a single instruction before starting the next one. If
you think about that idea for a moment, you'll realize that almost all the
complexity comes when we start improving on that limitation. Don't worry; the
discussion will quickly ramp up in complexity, and some readers might appreciate
a quick refresher. Let's see what happens in each of these stages:
Instruction Access
A coded instruction is read from the memory subsystem at an address that is
determined by a program counter (PC). In our analysis, we'll treat memory as
something that hangs off to the side of our CPU "execution core", as we show in
the figure below. Some architects like to view memory and the system bus as an
integral part of the microarchitecture, and we'll show how the memory subsystem
interacts with the rest of the machine.
Figure 1: under 'Instruction Access'
URL : http://www.ezdoum.com/upload/ooo/ooo1.gif
Instruction Decode
The coded instruction is converted into control information for the logic
circuits of the machine. Each "operation code (Opcode)" represents a different
instruction and causes the machine to behave in different ways. Embedded in the
Opcode (or stored in later bytes of the instruction) can be address information
or "immediate" data to be processed. The address information can represent a new
address that might need to be loaded into the PC (a branch address) or the
address can represent a memory location for data (loads and stores). If the
instruction needs data from a register, it is usually brought in during this
stage.
Execution
This is the stage where the machine does whatever operation was directed by the
instruction. This could be a math operation (multiply, add, etc.) or it might be
a data movement operation. If the instruction deals with data in memory, the
processor must calculate an "Effective Address (EA)". This is the actual
location of the data in the memory subsystem (ignoring virtual memory issues for
now), based on calculating address offsets or resolving indirect memory
references (A simple example of indirection would be registers that house an
address, rather than data).
Data Access
In this stage, instructions that need data from memory will present the
Effective Address to the memory subsystem and receive back the data. If the
instruction was a store, then the data will be saved in memory. Our simple model
for comparison gets a bit frayed in this stage, and we'll explain in a moment
what we mean.
Store (Write Back) Results
Once the processor has executed the instruction, perhaps having been forced to
wait for a data load to complete, any new data is written back to the
destination register (if the instruction type requires it).
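The five stages described above can be sketched as a toy, non-pipelined machine. This is only an illustration: the tuple instruction format, the four-register file, and the one-cycle-per-stage cost are assumptions made for the sketch, not details of any real processor.

```python
# Toy non-pipelined machine: every instruction finishes all five stages
# before the next one starts. The instruction format is hypothetical.
def run(program, memory):
    regs = [0] * 4                          # small register file
    pc = 0                                  # program counter
    cycles = 0
    while pc < len(program):
        instr = program[pc]                 # Stage 1: Instruction Access
        op, dst, a, b = instr               # Stage 2: Instruction Decode
        result = None
        ea = None
        if op == "add":                     # Stage 3: Execution
            result = regs[a] + regs[b]
        elif op in ("load", "store"):
            ea = regs[a] + b                # Effective Address = base + offset
        if op == "load":                    # Stage 4: Data Access
            result = memory[ea]
        elif op == "store":
            memory[ea] = regs[dst]
        if result is not None:              # Stage 5: Write Back
            regs[dst] = result
        pc += 1
        cycles += 5                         # one cycle per stage, no overlap
    return regs, cycles

memory = [10, 32, 0]
program = [
    ("load", 1, 0, 0),     # r1 = memory[r0 + 0]
    ("load", 2, 0, 1),     # r2 = memory[r0 + 1]
    ("add", 3, 1, 2),      # r3 = r1 + r2
    ("store", 3, 0, 2),    # memory[r0 + 2] = r3
]
regs, cycles = run(program, memory)
print(regs[3], memory[2], cycles)
```

Four instructions cost twenty cycles here, which is the limitation that pipelining sets out to remove.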
Was There a Question From the Back of the Room?
Some of the x86 experts in the audience are going to point out the numerous
special cases for the way a processor must deal with an instruction set designed
in the 1970s. Our five-stage model isn't so simple when it must deal with all
the addressing modes of an x86. A big issue is the fact that the x86 is what is
called a "register-memory" architecture where even ALU (Arithmetic Logic Unit)
instructions can access memory. This is contrasted with RISC (Reduced
Instruction Set Computing) architectures that only allow Load and Store
instructions to move data to and from memory (register-register, or more commonly
Load/Store, architectures).
The reason we can focus on the Load/Store architecture to describe what happens
in each stage of a computer is that modern x86 processors translate their native
CISC (Complex Instruction Set Computing) instructions into RISC instructions
(with some exceptions). By translating the instructions, most of the special
cases are turned into extra RISC instructions and can be more efficiently
processed. RISC instructions are much easier for the hardware to optimize and
run at higher clock rates. This internal translation to RISC is one of the ways
that x86 processors were able to deal with the threat that higher-performance
RISC chips would take over the desktop in the early 1990s. We'll talk about
instruction translation more when we dig into the details of some specific
processors, at which point we'll also show several ways in which our model is
simplified.
To the questioner in the back of the room, there will be several things we're
going to have to gloss over (and simplify) in order to keep this article from
getting as long as a computer textbook. If you really want to dig into details,
check out the list of references at the end of this article.
The Memory Subsystem
The memory subsystem plays a big part in the microarchitecture of a CPU. Notice
that both the Instruction Access stage and the Data Access stage of our simple
processor must get to memory. This memory can be split into separate sections
for instructions and data, allowing each stage to have a dedicated (hence
faster) port to memory.
This is called a "Harvard Architecture", a term from work at Harvard University
in the 1940s that has been extended to also refer to architectures with separate
instruction and data caches--even though main memory (and sometimes L2 cache) is
"unified". For some background on cache design, you can refer to the memory
hierarchy discussion in the article, "PC Motherboard Technology". That article
also covers the system bus interface, an important part of the PC CPU design
that is tailored to support the internal microarchitecture.
Virtual Memory: Making Life Easier for the Programmer and Tougher for the
Hardware Designer
To make life simpler for the programmer, most addresses are "virtual addresses"
that allow the software designer to pretend to have a large, linear block of
memory. These virtual addresses are translated into "physical addresses" that
refer to the actual addresses of the memory in the computer. In almost all x86
chips, the caches contain memory data that is addressed with physical addresses.
Before the cache is accessed, any virtual addresses are translated in a
"Translation Look-aside Buffer (TLB)". A TLB is like a cache of recently-used
virtual address blocks (pages), responding back with the physical address page
that corresponds to the virtual address presented by the CPU core. If the
virtual address isn't in one of the pages stored by the TLB (a TLB miss), then
the TLB must be updated from a bigger table stored in main memory--a huge
performance hit (especially if the page isn't in main memory and must be loaded
from disk). Some CPUs have multiple levels of TLBs, similar to the notion of
cache memory hierarchy. The size and structure of the TLBs and caches will be
important during our CPU comparisons later, but we'll focus mainly on the CPU
core for our analysis.
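As a rough sketch of the translation step just described, the following models a TLB as a small cache in front of a full page table. The class name, the four-entry capacity, the first-in-first-out eviction, and the example page mappings are all assumptions chosen for illustration; real TLBs are set-associative hardware structures with their own replacement policies.

```python
PAGE_SIZE = 4096  # 4 KB pages, the common x86 page size

class TLB:
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # full table "in main memory": virtual page -> physical page
        self.capacity = capacity
        self.entries = {}              # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.entries:
            self.hits += 1
        else:
            self.misses += 1                       # slow path: consult the big table
            if len(self.entries) >= self.capacity:
                oldest = next(iter(self.entries))
                del self.entries[oldest]           # evict the oldest entry (FIFO)
            self.entries[vpage] = self.page_table[vpage]
        return self.entries[vpage] * PAGE_SIZE + offset

tlb = TLB({0: 7, 1: 3, 2: 9})
addrs = [0x0010, 0x0020, 0x1004, 0x0030, 0x2000]
phys = [tlb.translate(a) for a in addrs]
print(phys, tlb.hits, tlb.misses)
```

Only the page number is translated; the offset within the page passes through unchanged, which is why nearby accesses tend to hit in the TLB.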
Exploiting ILP Through Pipelining
Figure 2: under 'Exploiting Instruction-Level Parallelism through Pipelining'
URL : http://www.ezdoum.com/upload/ooo/ooo2.gif
Instead of waiting until an instruction has completed all five stages of our
model machine, we could start a new instruction as soon as the first instruction
has cleared stage 1. Notice that we can now have five instructions progressing
through our "pipeline" at the same time. Essentially, we're processing five
instructions in parallel, referred to as "Instruction-Level Parallelism (ILP)".
If it took five clock cycles to completely execute an instruction before we
pipelined the machine, we're now able to execute a new instruction every single
clock. We made our computer five times faster, just with this "simple" change.
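The "five times faster" figure is a peak number, and a little arithmetic shows where it comes from. This sketch assumes an ideal pipeline with no hazards and equal-length stages; the function names are just for illustration.

```python
def cycles_unpipelined(n_instructions, n_stages=5):
    # Each instruction occupies the whole machine for n_stages cycles.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages=5):
    # The first instruction needs n_stages cycles to fill the pipe;
    # after that, one instruction completes every cycle.
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

for n in (1, 5, 1000):
    speedup = cycles_unpipelined(n) / cycles_pipelined(n)
    print(n, cycles_unpipelined(n), cycles_pipelined(n), round(speedup, 2))
```

The speedup only approaches five for long instruction streams, since the pipeline has to fill before it delivers one result per clock.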
Let's Just Think About This a Minute
We'll use a bunch of computer engineering terms in a moment, since we've got to
keep that person in the back of the room happy. Before doing that, take a step
back and think about what we did to the machine. (Even experienced engineers
forget to do that sometimes.) Suddenly, memory fetches have to occur five times
faster than before. This implies that the memory system and caches must now run
five times as fast, even though each instruction still takes five cycles to
completely execute.
We've also made a huge assumption that each stage was taking exactly the same
amount of time, since that's the rule that our pipeline clock is enforcing. What
about the assumption that the processor was even going to run the next four
instructions in that order? We (usually) won't even know until the execute stage
whether we need to branch to some other instruction address. Hey, what would
happen if the sequence of instructions called for the processor to load some
data from memory and then try to perform a math operation using that data in the
next instruction? The math operation would likely be delayed, due to memory
latency slowing down the process.
They're Called Pipeline Hazards
What we're describing are called "pipeline hazards", and their effects can get
really ugly. There are three types of hazards that can cause our pipeline to
come to a screeching halt--or cause nasty errors if we don't put in extra
hardware to detect them. The first hazard is a "data hazard", such as the
problem of trying to use data before it's available (a "data dependency").
Another type is a "control hazard" where the pipeline contains instructions that
come after a branch. A "structural hazard" is caused by resource conflicts where
an instruction sequence can cause multiple instructions to need the same
processor resource during a given clock cycle. We'd have a structural hazard if
we tried to use the same memory port for both instructions and data.
Modern Pipelines Can Have a Lot of Stall Cycles
There are ways to reduce the chances of a pipeline hazard occurring, and we'll
discuss some of the ways that CPU architects deal with the various cases. In a
practical sense, there will always be some hazards that will cause the pipeline
to stall. One way to describe the situation is to say that an instruction will
"block" part of the pipe (something modern implementations help minimize). When
the pipe stalls, every (blocked) instruction behind the stalled stage will have
to wait, while the instructions fetched earlier can continue on their way. This
opens up a gap (a "pipeline bubble") between blocked instructions and the
instructions proceeding down the pipeline in front of the blocked instructions.
When the blocked instruction restarts, the bubble will continue down the
pipeline. For some hazards, like the control hazard caused by a (mispredicted)
branch instruction, the following instructions in the pipeline need to be
killed, since they aren't supposed to execute. If the branch target address
isn't in the instruction cache, the pipeline can stall for a large number of
clock cycles. The stall would be extended by the latency of accesses to the L2
cache or, worse, accesses to main memory. Stalls due to branches are a serious
problem, and this is one of the two major areas where designers have focused
their energy (and transistor budget). The other major area, not surprisingly, is
when the pipeline goes to memory to load data. Most of our analysis will focus
in on these 2 latency-induced problems.
Design Tricks To Reduce Data Hazards
For some data hazards, one commonly-used solution is to forward result data from
a completed instruction straight to another instruction yet to execute in the
pipeline (data "forwarding", though sometimes called "bypassing"). This is much
faster than writing out the data and forcing the other instruction to read it
back in. Our case of a math operation needing data from a previous memory load
instruction would seem to be a good candidate for this technique. The data
loaded from memory into a register can also be forwarded straight to the ALU
execute stage, instead of going all the way through the register write-back
stage. An instruction in the write-back stage could forward data straight to an
instruction in the execute stage.
Why wait 2 cycles? Why not forward straight from the data access stage? In
reality, the data load stage is far from instantaneous and suffers from the same
memory latency risk as instruction fetches. The figure below shows how this can
occur. What if the data is not in the cache? There would be a huge pipeline
bubble. As it turns out, data access is even more challenging than an
instruction fetch, since we don't know the memory address until we've calculated
the Effective Address. While instructions are usually accessed sequentially,
allowing several cache lines to be prefetched from the instruction cache (and
main memory) into a fast local buffer near the execution core, data accesses
don't always have such nice "locality of reference".
Figure 3: Clock #
URL : http://www.ezdoum.com/upload/ooo/ooo3.gif
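To make the cost of a data dependency concrete, here is a small stall counter for an in-order, single-issue pipeline. The latency tables are assumptions for the sketch: with forwarding, an ALU result is usable by the very next instruction and a load (hitting in the cache) one cycle later; without forwarding, every result is assumed to be usable three cycles after its producer issues.

```python
def count_stalls(program, latency):
    """Count bubble cycles for an in-order, single-issue pipeline.

    program: list of (op, dest_register, source_registers)
    latency: op -> cycles after issue before the result can be consumed
    """
    ready = {}      # register -> first cycle in which a consumer may issue
    issue = -1      # cycle in which the previous instruction issued
    for op, dst, srcs in program:
        earliest = issue + 1
        for s in srcs:
            earliest = max(earliest, ready.get(s, 0))
        issue = earliest
        ready[dst] = issue + latency[op]
    return issue - (len(program) - 1)

program = [
    ("load", "r1", []),            # r1 = memory[...]
    ("add", "r2", ["r1", "r1"]),   # needs the loaded value immediately
    ("add", "r3", ["r2", "r1"]),   # needs the previous add's result
]
with_forwarding = count_stalls(program, {"add": 1, "load": 2})
without_forwarding = count_stalls(program, {"add": 3, "load": 3})
print(with_forwarding, without_forwarding)
```

Under these assumptions, forwarding removes most of the bubbles but the load-then-use pair still costs a cycle, and a cache miss would make the load latency far larger than the value used here.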
The Limits of Pipelining
If five stages made us run up to five times faster, why not chop up the work
into a bunch more stages? Who cares about pipeline hazards when it gives the
marketing folks some really high peak performance numbers to brag about? Well,
every x86 processor we'll analyze has a lot more than five stages. Originally
called "super-pipelining" until Intel (for no obvious reason) decided to rename
it "hyper-pipelining" in their Pentium 4 design, this technique breaks up
various processing stages into multiple clock cycles.
This also has the architectural benefit of giving better granularity to
operations, so there should be fewer cases where a fast operation waits around
while slow operations throttle the clock rate. With some of the clever design
techniques we'll examine, the pipeline hazards can be managed, and clock rates
can be cranked into the stratosphere. The real limit isn't an architectural
issue, but is related to the way digital circuits clock data between pipeline
stages.
To pipeline an operation, each new stage of the pipeline must store information
passed to it from a prior stage, since each stage will (usually) contain
information for a different instruction. This staged data is held in a storage
device (usually a "latch"). As you chop up a task into smaller and smaller
pipeline stages, the overhead time it takes to clock data into the latch
("set-up and hold" times and allowance for clock "skew" between circuits)
becomes a significant percentage of the entire clock period. At some point,
there is no time left in the clock cycle to do any real work. There are some
exotic circuit tricks that can help, but it would burn a lot of power - not a
good trade-off for chips that already exceed 70 watts in some cases.
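The diminishing return from ever-deeper pipelines can be shown with a simple model: the clock period is the logic delay per stage plus a fixed latch overhead. The 10 ns of total logic and 0.2 ns of overhead are made-up numbers chosen only to show the trend.

```python
def clock_period_ns(total_logic_ns, n_stages, latch_overhead_ns):
    # Each stage does an equal share of the logic, then pays the latch overhead.
    return total_logic_ns / n_stages + latch_overhead_ns

def useful_fraction(total_logic_ns, n_stages, latch_overhead_ns):
    # Share of every clock cycle that is spent doing real work.
    period = clock_period_ns(total_logic_ns, n_stages, latch_overhead_ns)
    return (total_logic_ns / n_stages) / period

for stages in (5, 20, 50):
    period = clock_period_ns(10.0, stages, 0.2)
    print(stages, round(period, 3), round(useful_fraction(10.0, stages, 0.2), 3))
```

Going from 5 to 50 stages in this model shrinks the period from 2.2 ns to 0.4 ns, but by then half of every cycle is latch overhead.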
Exploiting ILP Via Superscalar Processing
While our simple machine doesn't have any serious structural hazards, that's
only because it is a "single-issue" architecture. Only a single instruction can
be executed during a clock cycle. In a "superscalar" architecture, extra compute
resources are added to achieve another dimension of instruction-level
parallelism. The original Pentium provided 2 separate pipelines that Intel
called the U and V pipelines. In theory, each pipeline could be working
simultaneously on 2 different sets of instructions.
With a multi-issue processor (where multiple instructions can be dispatched each
clock cycle to multiple pipelines in the single processor), we can have even
more data hazards, since an operation in one pipeline could depend on data that
is in another pipeline. The control hazards can get worse, since our
"instruction fetch bandwidth" rises (doubled in a 2-issue machine, for example).
A (mispredicted) branch instruction could cause both pipelines to need
flushing.
Issue Restrictions Limit How Often Parallelism Can Be Achieved
In practice, a superscalar machine has lots of "issue restrictions" that limit
what each pipeline is capable of processing. This structural hazard limited how
often both the U and V pipe of the Pentium could simultaneously execute 2
instructions. The limitations are caused by the cost of duplicating all the
hardware for each pipeline, so the designers focus instead on exploiting
parallelism in as many cases as practical.
Combining Superscalar with Super-Pipelining to Get the Best of Both
Another approach to superscalar is to duplicate portions of the pipeline. This
becomes much easier in the new architectures that don't require instructions to
proceed at the same rate through the pipeline (or even in the original program
order). An obvious stage for exploiting superscalar design techniques is the
execute stage, since PCs process three different types of data. There are
integer operations, floating-point operations and now "media" operations. We
know all about integer and floating-point. A media instruction processes
graphics, sound or video data (as well as communications data). The instruction
sets now include MMX, 3DNow!, Enhanced 3DNow!, SSE, and SSE2 media instructions.
The execute stage could attempt to simultaneously process all three types of
instructions, as long as there is enough hardware to avoid structural hazards.
In practice, there are several structural hazards that require issue
restrictions. Each new execution resource could also have its own pipeline. Many
floating-point instructions and media instructions require multiple clocks and
aren't fully pipelined in some implementations. We'll clear up any confusion
when we analyze some real processors later. For now, it's only important to
understand the fundamentals of superscalar design and realize that modern
architectures include combinations of multiple pipelines running simultaneously.
Exploiting Data-Level Parallelism Via SIMD
We'll talk more about this later, but the new focus on media instructions has
allowed CPU designers to recognize the inherent parallelism in the way data is
processed. The same operation is often performed on independent data sets, such
as multiplying data stored in a vector or a matrix. A single instruction is
repeated over and over for multiple pieces of data. We can design special
hardware to do this more efficiently, and we call this a "Single Instruction
Multiple Data (SIMD)" computing model.
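A software model of the SIMD idea: one operation applied across several data lanes at once. The sketch mimics a saturating add on four signed 16-bit lanes, in the spirit of the packed media instructions; the function name and the list-based "registers" are illustrative stand-ins, and real hardware performs all lanes in parallel rather than in a loop.

```python
def packed_add_saturate(a, b, lo=-32768, hi=32767):
    """One 'instruction': add every lane of a and b, clamping to signed 16 bits."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [max(lo, min(hi, x + y)) for x, y in zip(a, b)]

left = [1000, 30000, -20000, 5]
right = [2000, 10000, -20000, -5]
print(packed_add_saturate(left, right))
```

Saturation (clamping instead of wrapping around) is typical for media data, where an overflowed pixel or audio sample should stay at full scale rather than flip sign.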
More Pressure on the Memory System
Once again, take a step back and think about the implications before that person
in the back of the room gets us to dive into implementation details. With some
intuitive analysis, we can observe that we've once again put tremendous pressure
on our memory subsystem. A single instruction coming down our pipeline(s) could
force multiple data load and store operations. Thinking a bit further about the
nature of media processing, some of the streaming media types (like video) have
critical timing constraints, and the streams can last for a long time (i.e. as a
viewer of video, you expect a continuous flow of the video stream over time,
preferably without choppiness or interruptions). Our data caches may not do us
much good, since the data may only get processed once before the next chunk of
data wants to replace it (data caches are most effective when the same data is
accessed over and over). Thus the CPU architects have some new challenges to
solve.
Where Should Designers Focus The Effort?
By now, you've likely come to realize that every CPU vendor is trying to solve
similar problems. They're all trying to take a 1970s instruction set and do as
much parallel processing as possible, but they're forced to deal with the
limitations of both the instruction set and the nature of memory systems. There
is a practical limit to how many instructions can be processed in parallel, and
it gets more and more difficult for the hardware to "dynamically" schedule
instructions around any possible instruction blockage. The compilers are getting
better at "statically" scheduling, based on the limited information available at
compile time. However, the hardware is being pushed to the limits in an attempt
to look as far ahead in the instruction stream as possible in the search for
parallelism.
It's All About Memory Latency
As we've shown, there are 2 stages of our computer model where the designers can
get the most return on their efforts. These are Instruction Fetch and Data
Access, and both can cause an enormous performance loss if not handled properly.
The problem is caused by the fact that our pipelines are now running at over one
GHz, and it can take over 100 pipeline cycles to get something from main memory.
The key to solving the problem is to make sure that the required instructions or
data aren't sitting in main memory when you need them, but instead, are already
in a buffer inside your pipeline--or at least sitting in an upper level of your
memory hierarchy.
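The effect of memory latency can be estimated with the usual average-access-time formula: hit time plus miss rate times miss penalty, applied at each cache level. Only the 100-cycle main memory figure comes from the text above; the cache latencies and miss rates are assumed values for illustration.

```python
def average_access_cycles(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory_cycles):
    # A miss in L1 pays the L2 access; a miss in L2 additionally pays main memory.
    l2_average = l2_hit + l2_miss_rate * memory_cycles
    return l1_hit + l1_miss_rate * l2_average

with_caches = average_access_cycles(
    l1_hit=2, l1_miss_rate=0.05, l2_hit=10, l2_miss_rate=0.20, memory_cycles=100
)
print(round(with_caches, 2))
```

With these assumed numbers the average access costs a few cycles rather than 100, and every point of miss rate removed from L1 is worth about 0.3 cycles per access.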
Branch Prediction Can Solve the Problem With I-Fetch Latency
If we could predict with 100% certainty which direction a program branch is
going (forward or backward in the instruction stream), then we could make sure
that the instructions following the branch instruction are in the correct
sequence in the pipeline. That's not possible, but improvement in the branch
predictor can have a dramatic performance gain for these modern,
deeply-pipelined architectures. We'll analyze some branch prediction approaches
shortly.
Data Memory Latency is Much Tougher to Handle
One way to deal with data latency is to have "non-blocking loads" so that other
memory operations can proceed while we're waiting for the data for a specific
instruction to come back from the memory system. Every x86 architecture does
this now. Still, if the data is sitting in main memory when the load is being
executed, the chip's performance will take a severe hit. The key is to pre-fetch
blocks of data before they're needed, and special instructions have been added
to directly allow the software to deal with the limited locality of data.
There are also some ways that the pipeline can help by buffering up load
requests and using intelligent data pre-fetching techniques based on the
processor's knowledge of the instruction stream. We'll analyze some of the
vendor solutions to the problem of data access.
A Closer Look At Branch Prediction
The person in the back of the room will be happy to hear that things are about
to get more complicated. We're now going to explore some of the recent
innovations in CPU microarchitecture, starting with branch prediction. All the
easy techniques have already been implemented. To get better prediction
accuracy, microprocessor designers are combining multiple predictors and
inventing clever new algorithms.
There really are three different kinds of branches:
Forward conditional branches - based on a run-time condition, the PC (Program
Counter) is changed to point to an address forward in the instruction stream.
Backward conditional branches - the PC is changed to point backward in the
instruction stream. The branch is based on some condition, such as branching
backwards to the beginning of a program loop when a test at the end of the loop
states the loop should be executed again.
Unconditional branches - this includes jumps, procedure calls and returns that
have no specific condition. For example, an unconditional jump instruction might
be coded in assembly language as simply "jmp", and the instruction stream must
immediately be directed to the target location pointed to by the jump
instruction, whereas a conditional jump that might be coded as "jmpne" would
redirect the instruction stream only if the result of a comparison of two values
in a previous "compare" instruction shows the values to not be equal. (The
segmented addressing scheme used by the x86 architecture adds extra complexity,
since jumps can be either "near" (within a segment) or "far" (outside the
segment). Each type has different effects on branch prediction algorithms.)
Using Branch Statistics for Static Prediction
Forward branches dominate backward branches by about 4 to 1 (whether conditional
or not). About 60% of the forward conditional branches are taken, while
approximately 85% of the backward conditional branches are taken (because of the
prevalence of program loops). Just knowing this data about average code
behavior, we could optimize our architecture for the common cases. A "Static
Predictor" can just look at the offset (distance forward or backward from
current PC) for conditional branches as soon as the instruction is decoded.
Backward branches will be predicted to be taken, since that is the most common
case. The accuracy of the static predictor will depend on the type of code being
executed, as well as the coding style used by the programmer. These statistics
were derived from the SPEC suite of benchmarks, and many PC software workloads
will favor slightly different static behavior.
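The averages quoted above are enough to estimate how well a static predictor would do. The sketch assumes the 4-to-1 forward/backward ratio applies to conditional branches (the text gives it for all branches) and compares two fixed policies for forward branches; the function names are hypothetical.

```python
FORWARD_SHARE = 0.8      # forward : backward is about 4 : 1
FORWARD_TAKEN = 0.60     # fraction of forward conditional branches taken
BACKWARD_TAKEN = 0.85    # fraction of backward conditional branches taken

def static_predict(offset, forward_prediction):
    """Backward branches (negative offset) are predicted taken;
    forward branches get one fixed prediction."""
    if offset < 0:
        return True
    return forward_prediction

def expected_accuracy(forward_prediction):
    backward_correct = BACKWARD_TAKEN   # backward is always predicted taken
    forward_correct = FORWARD_TAKEN if forward_prediction else 1 - FORWARD_TAKEN
    return (1 - FORWARD_SHARE) * backward_correct + FORWARD_SHARE * forward_correct

print(round(expected_accuracy(True), 2))    # forward branches predicted taken
print(round(expected_accuracy(False), 2))   # forward branches predicted not taken
```

On these averages, predicting forward branches as taken comes out ahead (about 65% versus 49%), but as the text notes, the right choice depends heavily on the workload the statistics were drawn from.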
Dynamic Branch Prediction with a Branch History Buffer (BHB)
To refine our branch prediction, we could create a buffer that is indexed by the
low-order address bits of recent branch instructions. In this BHB (sometimes
called a "Branch History Table (BHT)"), for each branch instruction, we'd store
a bit that indicates whether the branch was recently taken. A simple way to
implement a dynamic branch predictor would be to check the BHB for every branch
instruction. If the BHB's prediction bit indicates the branch should be taken,
then the pipeline can go ahead and start fetching instructions from the new
address (once it computes the target address).
By the time the branch instruction works its way down the pipeline and actually
causes a branch, then the correct instructions are already in the pipeline. If
the BHB was wrong, a "misprediction" occurred, and we'll have to flush out the
incorrectly fetched instructions and invert the BHB prediction bit.
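A one-bit BHB of the kind just described might look like the sketch below. The 12 index bits, the initial "not taken" state, and the example branch address are assumptions for illustration; for a single bit, storing the actual outcome is the same as inverting the bit on a misprediction.

```python
class OneBitBHB:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)    # False = predict not taken

    def predict(self, branch_addr):
        # Index with the low-order address bits of the branch.
        return self.table[branch_addr & self.mask]

    def update(self, branch_addr, taken):
        # Remember what the branch did last time.
        self.table[branch_addr & self.mask] = taken

def count_mispredictions(predictor, branch_addr, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            wrong += 1
        predictor.update(branch_addr, taken)
    return wrong

loop = [True] * 9 + [False]     # loop-closing branch: taken 9 times, then falls through
bhb = OneBitBHB()
first_run = count_mispredictions(bhb, 0x4010, loop)
second_run = count_mispredictions(bhb, 0x4010, loop)
print(first_run, second_run)
```

Each execution of the loop costs two mispredictions, one on entry and one on exit, no matter how many times the loop has been seen before.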
Refining Our BHB by Storing More Bits
It turns out that a single bit in the BHB will be wrong twice for a loop--once
on the first pass of the loop and once at the end of the loop. We can get better
prediction accuracy by using more bits to create a "saturating counter" that is
incremented on a taken branch and decremented on an untaken branch. It turns out
that a 2-bit predictor does about as well as you could get with more bits,
achieving anywhere from 82% to 99% prediction accuracy with a table of 4096
entries. This size of table is at the point of diminishing returns for 2 bit
entries, so there isn't much point in storing more. Since we're only indexing by
the lower address bits, notice that 2 different branch addresses might have the
same low-order bits and could point to the same place in our table--one reason
not to let the table get too small.
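The two-bit saturating counter can be sketched as follows. The initial counter value of 1 ("weakly not taken"), the 12 index bits, and the branch addresses are assumptions made for the example.

```python
class TwoBitPredictor:
    def __init__(self, index_bits=12, initial=1):
        self.mask = (1 << index_bits) - 1
        self.table = [initial] * (1 << index_bits)   # counters in the range 0..3

    def predict(self, branch_addr):
        # 2 or 3 means "taken", 0 or 1 means "not taken".
        return self.table[branch_addr & self.mask] >= 2

    def update(self, branch_addr, taken):
        i = branch_addr & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate at 0

def count_mispredictions(predictor, branch_addr, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            wrong += 1
        predictor.update(branch_addr, taken)
    return wrong

loop = [True] * 9 + [False]     # taken 9 times, then the loop exits
predictor = TwoBitPredictor()
first_run = count_mispredictions(predictor, 0x4010, loop)
second_run = count_mispredictions(predictor, 0x4010, loop)
print(first_run, second_run)

# Two different branches whose low 12 address bits match share one counter.
aliased = (0x4010 & predictor.mask) == (0x5010 & predictor.mask)
```

After the first pass, a single loop exit no longer flips the prediction, so a repeated loop costs one misprediction per execution instead of two.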
Two-Level Predictors and the GShare Algorithm
There is a further refinement we can make to our BHB by correlating the behavior
of other branches. Often called a "Global History Counter", this "two-level
predictor" allows the behavior of other branches to also update the predictor
bits for a particular branch instruction and achieve slightly better overall
prediction accuracy. One implementation is called the "GShare algorithm". This
approach uses a "Global Branch History Register" (a register that stores the
global result of recent branches) that gets "hashed" with bits from the address
of the branch being predicted. The resulting value is used as an index into the
BHB where the prediction entry at that location is used to dynamically predict
the branch direction. Yes, this is complicated stuff, but it's being used in
several modern processors.
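A small sketch may help take some of the mystery out of GShare. The version below uses XOR as the "hash", which is the usual textbook choice; the article does not say which hash any particular CPU uses, and the class name, the 12-bit history length, the initial counter value, and the branch address 0x4010 are assumptions for illustration.

```python
HISTORY_BITS = 12                      # 2**12 = 4096 entries, as in the text
TABLE_SIZE = 1 << HISTORY_BITS
MASK = TABLE_SIZE - 1

class GShare:
    def __init__(self):
        self.history = 0                     # global branch history register
        self.table = [1] * TABLE_SIZE        # 2-bit counters, weakly not-taken

    def _index(self, branch_addr):
        # "Hash" the low address bits with the global history using XOR.
        return (branch_addr ^ self.history) & MASK

    def predict(self, branch_addr):
        return self.table[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        i = self._index(branch_addr)
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & MASK

# A branch that strictly alternates taken / not taken.
predictor = GShare()
hits = 0
for i in range(200):
    taken = (i % 2 == 0)
    if i >= 100:                             # score only after warm-up
        hits += predictor.predict(0x4010) == taken
    predictor.update(0x4010, taken)
print(hits)
```

A lone 2-bit counter struggles with an alternating branch, but here the history register sends the "last one was taken" and "last one was not taken" cases to different table entries, so after warm-up every one of the last 100 predictions in this toy run is correct.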
Using a Branch Target Buffer (BTB) to Further Reduce the Branch Penalty
In addition to a large BHB, most predictors also include a buffer that stores
the actual target address of taken branches (along with optional prediction
bits). This table allows the CPU to look to see if an instruction is a branch
and start fetching at the target address early on in the pipeline processing. By
storing the instruction address and the target address, even before the
processor decodes the instruction, it can know that it is a branch. The figure
below shows an implementation of a BTB. A large BTB can remove most branch
penalties (for correctly-predicted branches) if the CPU looks far enough
ahead to make sure the target instructions are pre-fetched.
Using a Return Address Buffer to Predict the Return from a Subroutine
One technique for dealing with the unconditional branch at the end of a
subroutine is to create a buffer of the most recent return addresses. There are
usually some subroutines that get called quite often in a program, and a return
address buffer can make sure that the correct instructions are in the pipeline
after the return instruction.
Figure 4: Depiction of a Branch Target Buffer (BTB)
URL : http://www.ezdoum.com/upload/ooo/ooo4.gif
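The two structures just described can be sketched in a few lines of Python. This is only a behavioral model: a real BTB is a fixed-size, tagged hardware table rather than a dictionary, and the class names, the 8-entry return buffer depth, and the example addresses are assumptions.

```python
class BranchTargetBuffer:
    """Maps the address of a branch instruction to its last taken target."""
    def __init__(self):
        self.entries = {}

    def lookup(self, fetch_addr):
        # A hit means "this address holds a branch" before it is even decoded.
        return self.entries.get(fetch_addr)

    def record_taken(self, branch_addr, target_addr):
        self.entries[branch_addr] = target_addr

class ReturnAddressBuffer:
    """Small stack of the most recent return addresses."""
    def __init__(self, depth=8):             # depth is an assumed value
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                # drop the oldest entry when full
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

btb = BranchTargetBuffer()
btb.record_taken(0x1000, 0x2000)             # branch at 0x1000 jumped to 0x2000

rab = ReturnAddressBuffer()
rab.on_call(0x1005)                          # outer call
rab.on_call(0x2045)                          # nested call
```

A fetch from 0x1000 now hits in the BTB and can be redirected to 0x2000 right away, and the first return is predicted to go back to 0x2045, the most recent call site.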
Speculative, Out-of-Order Execution Gets a New Name
While RISC chips used the same terms as the rest of the computer engineering
community, the Intel marketing department decided that the average consumer
wouldn't like the idea of a computer that "speculates" or runs programs "out of
order". A nice warm-and-fuzzy term was coined for the P6 architecture, and
"Dynamic Execution" was added to our list of non-descriptive buzzwords.
Both AMD and Intel use a microarchitecture that, after decoding into simpler
RISC instructions, tosses the instructions into a big hopper and allows them to
execute in whatever order best matches the available compute resources. Once the
instructions have finished executing out of order, results get "committed" in
the original program order. The term "speculation" refers to instructions being
speculatively fetched, decoded and executed.
A useful analogy can be drawn to the stock market investor who "speculates" that
a stock will go up in value and justify an investment. For a microprocessor
speculating on instructions in advance, if the speculation turns out to be
incorrect, those instructions are eliminated before any machine state changes
are committed (written to processor registers or memory).
Once Again, Let's Take a Step Back and Try Some More Intuitive Analysis
By now that person in the back of the room has finally gotten used to these
short pauses to look at the big picture. In this case, we just made a huge
change to our machine, and it's hard to easily conceptualize. We've completely
scrambled the notion of how instructions flow down a one-way pipeline. One thing
that becomes obvious is the need for darn good branch prediction. All that
speculation becomes wasted memory bandwidth, execution time, and power if we end
up taking a branch we didn't expect. Following our stock investor analogy, if
the value doesn't go up, then the investment was wasted and could have been more
productively used elsewhere. In fact, the speculation could make us worse off.
The need to wait before committing completed instructions to registers or memory
should probably be obvious, since we could end up with incorrect program
behavior and incorrect data--then have to try to unwind everything when a branch
misprediction (or an exception) comes along. The real power of this approach
would seem to be realized by having lots of superscalar stages, since we can
reorder the instructions to better match the issue restrictions of multiple
compute resources. OK, enough speculation, let's dig into the details:
Register Renaming Creates Virtual Registers
If you're going to have speculative instructions operating out of order, then
you can't have them all trying to change the same registers. You need to create
a "register alias table (RAT)" that renames and maps the eight x86 registers to
a much larger set of temporary internal register storage locations, permitting
multiple instances of any of the original eight registers. An instruction will
load and store values using these temporary registers, while the RAT keeps track
of what the latest known values are for the actual x86 registers. Once the
instructions are completed and re-ordered so that we know the register state is
correct, then the temporary registers are committed back to the real x86 registers.
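Here is a minimal sketch of the renaming idea. The class name, the pool of 40 internal registers, and the three example instructions are assumptions for illustration; the sketch also never returns registers to the free list, which real hardware must do at retirement.

```python
class RegisterAliasTable:
    def __init__(self, arch_regs, num_physical):
        # Initially each architectural register maps to its own physical one.
        self.mapping = {name: i for i, name in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        """Rename one instruction: read the sources through the current
        mapping, then give the destination a fresh physical register."""
        src_phys = [self.mapping[s] for s in sources]
        new_phys = self.free.pop(0)
        self.mapping[dest] = new_phys
        return new_phys, src_phys

X86_REGS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]
rat = RegisterAliasTable(X86_REGS, num_physical=40)

first = rat.rename("eax", ["ebx", "ecx"])    # eax = ebx + ecx
second = rat.rename("eax", ["edx", "esi"])   # eax = edx + esi (independent)
third = rat.rename("edi", ["eax"])           # edi = eax (reads the newest eax)
print(first, second, third)
```

The first two instructions both write eax, yet they land in different internal registers (8 and 9 in this toy model), so they no longer have to wait on each other. The third instruction reads the most recent eax through the table.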
The Reorder Buffer (ROB) Helps Keep Instructions in Order
After an instruction is decoded, it's allowed to execute out of order as soon as
the operands (data) become available. A special Reorder Buffer is created to
keep track of instruction status, such as when the operands become available for
execution, or when the instruction has completed execution and results can be
"committed" or "retired" to architectural registers or memory in the original
program order. These instructions use the renamed register set and are
"dispatched" to the execution units as resources become available, perhaps
spending some time in "reservation stations" that operate as instruction queues
at the front of various execution units. After an instruction has finished
executing, it can be "retired" by the ROB. However, the state still isn't
committed until all the older instructions (with respect to program order) have
been retired first.
A neat thing about using register renaming, reservation stations, and the ROB is
that a result from a completed instruction can be forwarded directly to the
renamed register of a new instruction. Many potential data dependencies go away
completely, and the pipelines are kept moving.
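The in-order retirement rule is easy to show in code. The sketch below is a simplified model, assuming a queue of entries with a single "done" flag; the class and instruction names are invented for the example.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()               # oldest instruction at the left

    def allocate(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)           # allocated in program order
        return entry

    def retire(self):
        """Retire finished instructions from the head, stopping at the first
        one that has not completed, so state is committed in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer()
load, add, sub = (rob.allocate(n) for n in ("load", "add", "sub"))

sub["done"] = True                           # the youngest finishes first
early = rob.retire()                         # nothing can retire yet
load["done"] = True
add["done"] = True
late = rob.retire()                          # now all three retire in order
print(early, late)
```

Even though "sub" finished executing first, it has to sit in the buffer until the older "load" and "add" are done.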
Load and Store Buffering Tries to Hide Data Access Latency
In the same way that instructions are executed as soon as resources become
available, a load or a store instruction can get an early start by using this
speculative approach. Obviously, the stores can't actually get sent all the way
to memory until we're sure the data really should be changed (requiring we
maintain program order). Instead, the stores are buffered, retired, and
committed in order. The loads are a more interesting case, since they are
directly affected by memory latency, the other key problem we highlighted
earlier. The hardware will speculatively execute the load instruction,
calculating the Effective Address out of order. Depending on the implementation,
it may even allow out-of-order cache access, as long as the loads don't access
the same address as a previous store instruction still in the processor
pipeline, but not yet committed. If in fact the load instruction needs the
results of a previous store that has completed but is still in the machine, the
store data can get forwarded directly to the load instruction (saving the memory
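The store-forwarding case can be sketched as a simple search of the pending stores. This is a simplified model that assumes every access is the same size and fully aligned; the function name, the list-of-pairs store buffer, and the addresses are assumptions for illustration.

```python
def execute_load(addr, store_buffer, memory):
    """Return the value a load should see, and where it came from.
    store_buffer holds (address, value) pairs in program order for stores
    that have executed but are not yet committed to memory."""
    for store_addr, value in reversed(store_buffer):
        if store_addr == addr:
            return value, "forwarded"        # the youngest matching store wins
    return memory.get(addr, 0), "memory"

memory = {0x100: 1, 0x104: 2}
pending_stores = [(0x100, 10), (0x200, 20), (0x100, 30)]

print(execute_load(0x100, pending_stores, memory))
print(execute_load(0x104, pending_stores, memory))
```

The load from 0x100 picks up the value 30 from the newest pending store without touching memory, while the load from 0x104 has no matching store and reads memory as usual.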
Analyzing Some Real Microprocessors: P4
We've come to the end of our tutorial on processor microarchitecture. Hopefully,
we've given you enough analytical tools so that you're now ready to dig into the
details of some real products. There are a few common microarchitectural
features (like instruction translation) that we decided would be easier to
explain as we show some real implementations. We'll also look a bit deeper at
the arcane science of branch prediction. Let's now take an objective look at the
Intel P4, AMD Athlon, and VIA/Centaur C3. We'll then do some more big-picture
analysis and gaze forward to predict the future of PC microarchitecture.
Intel Pentium 4 Microarchitecture
Intel is vigorously promoting the Pentium 4 as the preferred desktop processor,
so we'll focus our Intel analysis on this microarchitecture. We'll make a few
comparisons to previous processor generations, but our goal is to gain a
detailed understanding of how the Pentium 4 meets its design goals. We'll leave
it as an "exercise for the reader" to apply your new analytical tools to the
Pentium III. The Pentium 4 is the first x86 chip to use some newer
microarchitectural innovations, offering us an opportunity to explore some of
these new approaches to dealing with the 2 key latency-induced challenges in CPU
We should point out that our analysis only covers the "Willamette" version of
the P4, while the forthcoming "Northwood" will move to a 0.13-micron process
geometry and make slight changes to the microarchitecture (most likely improving
the memory subsystem). We'll update this article when we get more information on
The NetBurst™ Moniker Describes a Collection of Design Features
What's the point of introducing a new product without adding a new Intel
buzzword? In this case, the name doesn't refer to a single architectural
improvement, but is really meant to serve as a name for this family of
microprocessors. The NetBurst design changes include a deeper pipeline, new bus
architecture, more execution resources, and changes to the memory subsystem. The
figure below shows a block diagram of the Pentium 4, and we'll take a look at
each major section.
Figure 5: Block Diagram of the Pentium 4
URL : http://www.ezdoum.com/upload/ooo/ooo5.gif
Deeply Pipelined for Higher Clock Rate
The Pentium 4 has a whopping 20-stage pipeline when processing a branch
misprediction. The figure below shows how this pipeline compares to the 10
stages of the Pentium III. The most interesting thing about the Pentium 4 pipe
is that Intel has dedicated 2 stages for driving data across the chip. This is
fascinating proof that the limiting factor in modern IC design has become the
time it takes to transmit a signal across the wire connections on the chip. To
understand why it's fascinating, consider that it wasn't so long ago that
designers only worried about the speed of transistors, and the time it took to
traverse such a short piece of metal was considered essentially instantaneous.
Now we're moving from aluminum to copper, largely because copper's lower
resistance lets signals cross those wires faster. (I can see that person in the
back of the room is still with us and is nodding in agreement.) This is
fascinating stuff, and Intel is probably the first vendor to design a pipeline
with "Drive" stages.
Figure 6: Comparisons of pipelines between Pentiums 4 and 3
URL : http://www.ezdoum.com/upload/ooo/ooo6.gif
What About All Those Problems with Long Pipelines?
Well, Intel has to work especially hard to make sure they avoid pipeline
hazards. If that long pipeline needs to be flushed very often, then the
performance will be much lower than other designs. We should remind ourselves
that the longer pipeline actually results in less work being done on each clock
cycle. That's the whole point of super-pipelining (or hyper-pipelining, if you
prefer), since doing less work in a clock cycle is what allows the clock cycle
time to be shortened. The pipeline has to run at a higher frequency just to do
the same amount of work as a shorter pipeline. All other things being equal,
you'd expect the Pentium 4 to have less performance than parts with shorter
pipelines at the same frequency.
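A back-of-the-envelope sketch shows why flushes hurt a deep pipeline more. It covers only the flush cost, not the smaller amount of work done per stage. The branch frequency (20% of instructions), the misprediction rate (5%), and the idea that a flush costs roughly one cycle per pipeline stage are all assumptions for illustration, not measurements of any real chip.

```python
def cycles_per_instruction(flush_penalty, branch_fraction=0.20,
                           mispredict_rate=0.05):
    """Average cycles per instruction for an idealized pipeline that issues
    one instruction per clock and pays flush_penalty cycles on each
    mispredicted branch."""
    return 1.0 + branch_fraction * mispredict_rate * flush_penalty

cpi_10_stage = cycles_per_instruction(flush_penalty=10)
cpi_20_stage = cycles_per_instruction(flush_penalty=20)

# Clock-rate advantage the deeper pipeline needs just to break even.
break_even_clock_ratio = cpi_20_stage / cpi_10_stage
print(round(cpi_10_stage, 3), round(cpi_20_stage, 3),
      round(break_even_clock_ratio, 3))
```

Under these assumptions the 20-stage design averages about 1.2 cycles per instruction against 1.1 for the 10-stage design, so it needs roughly a 9% faster clock before it does the same work.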
Searching for Even More Instruction-Level Parallelism
As we learned, there is another thing to realize about long pipelines (besides
being able to run at the high clock rates that motivate uninformed buyers).
Longer pipelines allow more instructions to be in process at the same time. The
compiler (static scheduler) and the hardware (dynamic scheduler) must keep the
faster and deeper pipeline fed with the instructions and data it needs during a
larger instruction "window". The machine is going to have to search even further
to find instructions that can execute in parallel. As you'll see, the Pentium 4
can have an incredible 126 instructions in-flight as it searches further and
further ahead in the instruction stream for something to work on while waiting
for data or resource dependencies to clear.
Pentium 4's Cache Organization
Cache Organization in the Memory Hierarchy
As we described in our article on motherboard technology, there is usually a
trade-off between cache size and speed. This is mostly because of the extra
capacitive loading on the signals that drive the larger SRAM arrays. Refer again
to the block diagram of the Pentium 4 (Figure 5). Intel has chosen to keep the L1
caches rather small so that they can reduce the latency of cache accesses. Even
a data cache hit will take 2 cycles to complete (6 cycles for floating-point
data). We'll talk about the L1 caches in a moment, but further down the
hierarchy we find that the L2 cache is an 8-way, unified (includes both
instruction and data), 256KB cache with a 128B line size.
The 8-way structure means each set holds 8 lines (and 8 tags to check), providing
about the same cache miss rate as a "fully-associative" cache (as good as it gets). This makes the
256KB cache more effective than its size indicates, since the miss rate of this
cache is approximately 60% of the miss rate for a direct-mapped (1-way) cache of
the same size.
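The L2 numbers quoted above (256KB, 8-way, 128-byte lines) are enough to work out the cache's geometry. The sketch below assumes the simplest possible indexing, where the bits just above the line offset select the set; the article does not say how the Pentium 4 actually picks its index bits, and the function names and sample address are invented for the example.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1    # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return {"lines": lines, "sets": sets,
            "offset_bits": offset_bits, "index_bits": index_bits}

def split_address(addr, geometry):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset_bits = geometry["offset_bits"]
    index_bits = geometry["index_bits"]
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

p4_l2 = cache_geometry(256 * 1024, 128, 8)
print(p4_l2)
print([hex(field) for field in split_address(0x12345678, p4_l2)])
```

The cache works out to 2048 lines arranged as 256 sets of 8 lines each, so a lookup selects one set with 8 address bits and then compares 8 tags.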
The downside is that an 8-way cache will be slower to access. Intel states that
the load latency is 7 cycles (this reflects the time it takes an L2 cache line
to be fully retrieved to either the L1 data cache or the x86 instruction
prefetch/decode buffers), but the cache is able to transfer new data every 2
cycles (which is the effective throughput assuming multiple concurrent cache
transfers are initiated). Again, notice that the L2 cache is shared between
instruction fetches and data accesses (unified).
System Bus Architecture is Matched to Memory Hierarchy Organization
One interesting change for the L2 cache is to make the line size 128 bytes,
instead of the familiar 32 bytes. The larger line size can slightly improve the
hit rate (in some cases), but requires a longer latency for cache line refills
from the system bus. This is where the new Pentium 4 bus comes into play. Using
a 100MHz clock and transferring data four times on each bus clock (which Intel
calls a 400MHz data rate), the 64-bit system bus can bring in 32 bytes each
cycle. This translates to a bandwidth of 3.2 GB/sec.
Filling an L2 cache line requires four bus cycles (the same number of cycles as
the P6 bus needs for a 32-byte line). Note that the system bus protocol has a 64-byte
access length (matching the line size of the L1 cache) and requires 2 main
memory request operations to fill an L2 cache line. However, the faster bus only
helps overcome the latency of getting the extra data into the CPU from the North
Bridge. The longer line size still causes a longer latency before getting all
the burst data from main memory. In fact, some analysts note that P4 systems
have about 19% more memory latency than Pentium III systems (measured in
nanoseconds for the demand word of a cache refill). Smart pre-fetching is
critical or else the P4 will end up with less performance on many applications.
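The bus arithmetic from this section can be checked in a few lines. All of the figures come from the text above; only the variable names are invented.

```python
BUS_CLOCK_HZ = 100_000_000       # 100MHz bus clock
TRANSFERS_PER_CLOCK = 4          # data moves four times per bus clock
BUS_WIDTH_BYTES = 8              # 64-bit data bus
L2_LINE_BYTES = 128              # Pentium 4 L2 line size

bytes_per_clock = BUS_WIDTH_BYTES * TRANSFERS_PER_CLOCK
bandwidth_gb_per_s = bytes_per_clock * BUS_CLOCK_HZ / 1e9
clocks_per_l2_line = L2_LINE_BYTES // bytes_per_clock

# The P6 bus moves 8 bytes once per clock and fills a 32-byte line.
p6_clocks_per_line = 32 // 8

print(bytes_per_clock, bandwidth_gb_per_s, clocks_per_l2_line,
      p6_clocks_per_line)
```

This gives 32 bytes per bus clock, 3.2 GB/sec, and four bus clocks for either a 128-byte line on the Pentium 4 bus or a 32-byte line on the P6 bus.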
Pre-Fetching Hardware Can Help if Data Accesses Follow a Regular Pattern
The L2 cache has pre-fetch hardware to request the next 2 cache lines (256
bytes) beyond the current access location. This pre-fetch logic has some
intelligence to allow it to monitor the history of cache misses and try to avoid
unnecessary pre-fetches (that waste bandwidth and cache space). We'll talk more
about the pre-fetcher later, but let's take a quick pause for some of our
patented intuitive analysis. We've described the problem of dealing with
streaming media types (like video) that don't spend much time in the cache. The
hardware pre-fetch logic should easily notice the pattern of cache misses and
then pre-load data, leading to much better performance on these types of
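As a rough sketch of the "next 2 cache lines" behavior, the function below computes which line addresses would be requested after an access. It leaves out the miss-history logic that decides whether to pre-fetch at all, and the function name and sample address are assumptions.

```python
L2_LINE_BYTES = 128
PREFETCH_LINES = 2

def prefetch_addresses(access_addr):
    """Return the base addresses of the next two cache lines after the line
    that contains access_addr."""
    line_base = access_addr - (access_addr % L2_LINE_BYTES)
    return [line_base + L2_LINE_BYTES * i
            for i in range(1, PREFETCH_LINES + 1)]

print([hex(a) for a in prefetch_addresses(0x1010)])
```

An access at 0x1010 falls in the line starting at 0x1000, so the two requests cover 0x1080 and 0x1100, which is 256 bytes of look-ahead.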
Designing for Data Cache Hits
Intel boasts of "new algorithms" to allow faster access to the 8KB, four-way, L1
data cache. They are most likely referring to the fact that the Pentium 4
speculatively processes load instructions as if they always hit in the L1 data
cache (and data TLB). By optimizing for this case, there aren't any extra cycles
burned while cache tags are checked for a miss. The load instruction is sent on
its merry way down the pipeline; if a cache miss delays the load, the processor
passes temporarily incorrect data to dependent instructions that assumed the
data arrived in 2 cycles. Once the hardware discovers the L1 data cache miss and
brings in the actual data from the rest of the memory hierarchy, the machine
must "replay" any instructions that had data dependencies and grabbed the wrong data.
It's unclear how efficient this approach will be, since it obviously depends on
the load pattern for the applications. The worst case would be an application
that constantly loads data that is scattered around memory, while attempting to
immediately perform an operation on each new data value. The hardware pre-fetch
logic would (perhaps mercifully) never "trigger", and the pipeline would be
constantly restarting instructions.
Again, the Pentium 4 design seems to have been optimized for the case of
streaming media (just as Intel claims), since these algorithms are much more
regular and demand high performance. The designers probably hope that the
pathological worst case only occurs for code that doesn't need high performance.
When the L1 data cache does have a miss, it has a "fat pipe" (32 bytes wide) to
the L2 cache, allowing each 64-byte cache line to be refilled in 2 clocks.
However, there is a 7-cycle latency before the L2 data starts arriving, as we
mentioned previously. The Pentium 4 can have up to four L1 data cache misses in
Pentium 4's Trace Cache
The Trace Cache Depends on Good Branch Prediction
Instead of a classic L1 instruction cache, the Pentium 4 designers felt
confident enough in their branch prediction algorithms to implement a trace
cache. Rather than storing standard x86 instructions, the trace cache stores the
instructions after they've already been decoded into RISC-style instructions.
Intel calls them "μops" (micro-ops) and stores 6 μops for each "trace line".
The trace cache can house up to 12K μops. Since the instructions have already
been decoded, hardware knows about any branches and fetches instructions that
follow the branch. As we learned, it's the conditional branches that could
really cause a problem, since we won't know if we're wrong until the branch
condition check in Arithmetic Logic Unit 0 (ALU0) of the execution core. By
then, our trace cache could have pre-fetched and decoded a lot of instructions
we don't need. The pipeline could also allow several out-of-order instructions
to proceed if the branch instruction was forced to wait for ALU0.
Hopefully, the alternative branch address is somewhere in the trace cache.
Otherwise, we'll have to pay those 7 cycles of latency to get the proper
instructions from the L2 cache (pity us if it's not there either, as the L2
cache would need to get the instructions from main memory) plus the time to
decode the fetched x86 instructions. Intel's reference to the 20-stage P4
pipeline actually starts with the trace cache, and does not include the cycles
for instruction or data fetches from system memory or L2 cache.
The Trace Cache has Several Advantages
If our predictors work well, then the trace cache is able to provide (the
correct) three μops per cycle to the execution scheduler. Since the trace cache
is (hopefully) only storing instructions that actually get executed, then it
makes more efficient use of the limited cache space. Since the branch target
instruction has already been decoded and fetched in execution order, there isn't
any extra latency for branches. The person in the back of the room just reminded
us of an interesting point. We never mentioned a TLB check for the trace cache,
because it does not use one. So, the Pentium 4 isn't so complicated after all.
Most of you correctly observed that this cache uses virtual addressing, so there
isn't any need to convert to physical addresses until we access the L2 cache.
Intel documents don't give the size of the instruction TLB for the L2 cache.
Pentium 4 Decoder Relies on Trace Cache to Buffer μops
The Pentium 4 decoder can only convert a single x86 instruction on each clock,
fewer than other architectures. However, since the μops are cached in the trace
buffer (and hopefully reused), the decode bandwidth is probably adequate to
match the instruction issue rate (three μops/cycle). If an x86 instruction
requires more than four μops, then the decoder fetches μops directly from a
μops "Read-Only Memory (ROM)". All x86 processor architectures use some sort of
ROM for infrequently used instructions or multi-cycle string operations.
The Execution Engine Runs Out Of Order
For an out-of-order machine, the main design goal is to provide enough parallel
compute resources to make it worth all the extra complexity. In this case, the
machine is working to schedule instructions for 7 different parallel units,
shown in the figure below. Two of these units dispatch loads and stores (the
Data Access stage of our original computer model). The other processing tasks
use multiple schedulers and are dispatched through the 2 Exec Ports. Each port
could have a fast ALU operation scheduled every half cycle, though other μops
get scheduled every cycle. The figure below shows what each port can dispatch.
Figure 7: Pentium 4 Execution Resources
URL : http://www.ezdoum.com/upload/ooo/ooo7.gif
Notice the numerous issue restrictions (structural hazards). If you were to have
just fast ALU μops on both Exec Ports and a simultaneous Load and Store
dispatch, then a total of 6 μops/cycle (four double-speed ALU instructions, a
Load, and a Store) can be dispatched to execution units. The performance of the
execution engine will depend on the type of program and how well the schedulers
can align μops to match the execution resources.
Retiring Instructions in Order and Updating the Branch Predictors
The Reorder Buffer can retire three μops/cycle, matching the instruction issue
rate. There are some subtle differences in the way the Pentium 4 ROB and
register renaming are implemented compared to other processors like the Pentium
III, but the operation is very similar. As we've shown, a key to performance is
to avoid mispredicted branches. As instructions are retired from the ROB, the
final branch addresses are used to update the Branch Target Buffer and Branch History Buffer.
In case some of you have finally figured out modern branch predictors, Intel has
chosen to rename the combination of a BTB and a BHB. Intel calls the combination
a "Branch Target Buffer (BTB)", ensuring extra confusion for our new students of
Branch Prediction Uses a Combination of Schemes
While there isn't much public information about how the Pentium 4 does branch
prediction, they likely use a two-level predictor and combine information from
the Static Prediction we discussed earlier. They also include a Return Address
Buffer of some undisclosed size. The specific algorithms are part of the "secret
sauce" that processor vendors guard closely. In the past, we've seen various
patent filings describing algorithmic mechanisms used in branch predictors and
other processor subsystems. The patent details shed more light on their
implementations than processor vendors would otherwise choose to disclose
Branch Hints Can Allow Faster Performance on a Known Data Set
The Pentium 4 also allows software-directed branch hints to be passed as
prefixes to branch instructions. These branch hints allow the software to
override the Static Predictor and can be a powerful tool. This is particularly
true if the program is compiled and executed with special features enabled to
collect information about program flow. The information from the prior run can
be fed back to the compiler to create a new executable with Branch Hints that
avoid the earlier mispredictions.
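The profile-feedback loop might look something like the sketch below. The two prefix byte values (0x3E for "taken" and 0x2E for "not taken") are the ones Intel documents for Pentium 4 branch hints; everything else, including the function name, the 90% bias threshold, and the sample counts, is an assumption and not a description of how any real compiler decides.

```python
HINT_TAKEN = 0x3E        # prefix byte hinting "branch taken"
HINT_NOT_TAKEN = 0x2E    # prefix byte hinting "branch not taken"

def choose_hint(taken_count, not_taken_count, min_bias=0.9):
    """Pick a branch hint from profile counts gathered on an earlier run.
    Returns None when the branch is not biased strongly enough."""
    total = taken_count + not_taken_count
    if total == 0:
        return None                      # branch never ran in the profile
    bias = taken_count / total
    if bias >= min_bias:
        return HINT_TAKEN
    if bias <= 1 - min_bias:
        return HINT_NOT_TAKEN
    return None

print(choose_hint(950, 50), choose_hint(5, 995), choose_hint(60, 40))
```

A branch that was taken 950 times out of 1000 gets the "taken" prefix, one taken 5 times out of 1000 gets "not taken", and a 60/40 branch is left for the dynamic predictor to sort out.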
There is some potential for marketing abuse of this feature, since benchmarks
that use a repeatable data set can be optimized to avoid performance-killing
Support for New Media Instructions
The Pentium 4 has retained the earlier x86 instruction extensions (MMX and SSE)
and added 144 new instructions they call SSE2. It will be the task for another
article to give a complete analysis and comparison of the x86 instruction
extensions and execution resources. However, as we've noted several times, the
Pentium 4 is tuned for performance on streaming media applications.
Poor Thermal Management Can Limit Performance
One potentially troubling feature of the Pentium 4 is the "Thermal Monitor" that
can be enabled to slow the internal clock rate to half speed (or less, depending
on the setting) when the die temperature exceeds a certain value. On a 1.5 GHz
Pentium 4 (Willamette), this temperature currently equates to 54.7 Watts of
power (according to Intel's Thermal Design Guide and P4 datasheet). This is
almost certainly a limitation of the package and heat sink, but the maximum
power dissipation of a 1.5 GHz part is currently about 73 Watts.
Intel would argue that this maximum would never be reached, but it is quite
possible that demanding applications will cause a poorly-cooled CPU to exceed
the current thermal cut-off point - losing performance at a time when you need
it the most. As Intel moves to lower voltages in a more advanced manufacturing
process, these limits will be less of a problem - at current clock rates. As
higher clock rate parts are introduced, the potential performance loss will
again be an issue.
Certainly, the Thermal Monitor is a good feature for ensuring that parts don't
destroy themselves. It also is a clever solution to the problem of turning on
fans quickly enough to match the high thermal ramp rates. The concerns may only
arise for low-cost, inadequate heatsinks and fans. Customers may appreciate the
system stability this feature offers, but not the uncertainty about whether
they're getting all the performance they paid for. We've heard from one of
Intel's competitors that certain Dell and HP Pentium 4 systems they tested do
not enable this clock slow-down feature. This is actually a good thing if Dell
and HP are confident about their thermal solution. We plan to write a separate
report on our testing of this feature soon.
Overall Conclusions About the Pentium 4
The large number of complex new features in this processor has required a lot of
explanation. Clearly, this is a design that is intended to scale to dramatically
higher clock rates. Only at higher clock rates does the benefit of the
microarchitecture become realized. It is also likely that the designers were
forced to make painful trade-offs in the sizes for the on-chip memory hierarchy.
With a microarchitecture so sensitive to cache misses, it will be critical to
increase the size of these memories as transistor budgets increase. With good
thermal management, higher clock rates and bigger caches, this chip should
compete well in desktop systems in the future, while doing very well today with
streaming media, memory bandwidth-intensive applications, and functions that use
AMD Athlon Microarchitecture
The Athlon architecture is more similar to our earlier analysis of speculative,
out-of-order machines. This similarity is partly due to the (comforting)
maturity of the architecture, but it should be noted that the original design of
the Athlon microarchitecture emphasized performance above other factors. The
more aggressive initial design approach keeps the architecture sustainable while
minor optimizations are implemented for clock speed or die cost.
AMD will soon ship a new version of Athlon, code-named "Palomino" and possibly
sporting bigger caches and subtle changes to the microarchitecture. For this
article, we examine "Thunderbird", the design introduced in June 2000.
Parallel Compute Resources Benefit From Out-of-Order Approach
The extra complexity of creating an out-of-order machine is wasted if there
aren't parallel compute resources available for taking advantage of those
exposed instructions. Here is where Athlon really shines. The microarchitecture
can execute 9 simultaneous RISC instructions (what AMD calls "OPs").
The figure below shows the block diagram of Athlon. Note the extra resources for
standard floating-point Ops, likely explaining why this processor does so well
on FP-intensive programs. (Well, that person in the back of the room is still
with us.) Yes, indeed the comparative analysis gets more complex if we include
the P4's SSE2 instructions for SIMD floating-point, but we'll have to leave that
analysis for another article. The current Athlon architecture will certainly
have higher performance for applications that don't have high data-level parallelism.
Figure 8: Block Diagram of AMD Athlon
URL : http://www.ezdoum.com/upload/ooo/ooo8.gif
Cache Architecture Emphasizes Size to Achieve High Hit Rate
Note that AMD has chosen to implement large L1 caches. The L1 instruction and
data caches are each 2-way, 64KB caches. The L1 instruction cache has a
line-size of 64 bytes with a 64-byte sequential pre-fetch. The L1 data cache
provides a second data port to avoid structural hazards caused by the
superscalar design. The L2 cache is a 16-way, 256KB unified cache, backed up by
the fast EV6 bus we discussed in the motherboard article.
If we take a step back and think about differences between P4 and Athlon memory
hierarchies, we can make a few observations. Intel's documentation states that
their 12K trace cache will have the same hit rate as an "8K to 16K byte
conventional instruction cache". By that measure, the Athlon will have much
better hit rates, though hits will have longer latency for decoding
instructions. An L1 miss is much worse for the P4's longer pipeline, though
smart pre-fetching can overcome this limitation. Remember, at these high clock
rates, it doesn't take long to drain an instruction cache. It will eventually
come down to the accuracy of the branch predictor, but the Pentium 4 will still
need a bigger trace cache to match Athlon instruction fetch effectiveness.
Pre-Decoding Uses Extra Cache Bits
To deal with the complexities of the x86 instruction set, AMD does some early
decoding of x86 instructions as they are fetched into the L1 instruction cache.
These extra bits help mark the beginning and end of the variable-length
instructions, as well as identify branches for the pre-fetcher (and predictor).
These extra bits and early (partial) decoding give some of the benefits of a
trace cache, though there is still latency for the completion of the decoding.
Final Decoding Follows 2 Different Paths
Figure 9 shows the decode pipeline for the Athlon. Notice that it matches the
flow of our original computer model, breaking up Instruction Access and Decode
stages into 6 pipeline stages. AMD uses a fixed-length instruction format called
a "MacroOp", containing one or more Ops. The instruction scheduler will turn
MacroOps into Ops as it dispatches to the execution units. The "DirectPath
Decoder" generates MacroOps that take one or two Ops. The "VectorPath Decoder"
fetches longer instructions from ROM. Notice in the figure below that the Athlon
can supply three MacroOps/cycle to the instruction decoder (the IDEC stage),
and later they'll enter the instruction scheduler, equating to a maximum of 6
Ops/cycle decode bandwidth. Note that the actual decode performance depends on
the type of instructions.
Figure 9: Decode Pipeline for the Athlon
URL : http://www.ezdoum.com/upload/ooo/ooo9.gif
AMD Athlon Scheduler, Data Access
Integer Scheduler Dispatches Ops to 6 Execution Units
The figure below shows how pipeline stage 7 buffers up to 18 MacroOps that are
dispatched as Ops to the integer execution units. This (reservation station) is
where instructions wait for operands (including data from memory) to become
available before executing out of order. As you'll recall, there is a Reorder
Buffer that keeps track of instruction status, operands, and results ensuring
the instructions are retired and committed in program order. Note that Integer
Multiply instructions require more compute resources and force extra issue
Figure 10: (Integer Scheduler dispatches Ops to 6 Execution Units)
URL : http://www.ezdoum.com/upload/ooo/oooa.gif
Data Access Forces Instructions to Wait
Even for an out-of-order machine, our original computer model still holds up
well. Notice in the figure below that loads and stores will use the "Address
Generation Units (AGUs)" to calculate the Effective Address (cycle 9 ADDGEN
stage) and access the data cache (cycle 10 DC ACC). In cycle 11, the data cache
sends back a hit/miss response (and potentially the data). If another
instruction is waiting in the scheduler for this data, the data is forwarded.
Cache misses will cause the instructions to wait. There is a separate 44-entry
Load/Store Unit (LSU) that manages these instructions.
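The load timing described above can be sketched in a few lines. The stage numbers (9, 10, 11) come from the text, and the effective-address formula is the standard x86 base + index*scale + displacement form. The function names and the miss penalty value are assumptions chosen only for illustration.

```python
def effective_address(base, index=0, scale=1, disp=0):
    """Standard 32-bit x86 effective address: base + index*scale + disp."""
    return (base + index * scale + disp) & 0xFFFFFFFF


def load_data_ready_cycle(addgen_cycle=9, hit=True, miss_penalty=20):
    """Cycle in which load data can be forwarded to a waiting instruction.

    addgen_cycle: AGU computes the address (cycle 9 in the text).
    The next cycle accesses the data cache, and the cycle after that
    returns hit/miss. miss_penalty is a made-up number, not an Athlon spec.
    """
    dc_access_cycle = addgen_cycle + 1
    response_cycle = dc_access_cycle + 1
    return response_cycle if hit else response_cycle + miss_penalty


print(hex(effective_address(0x1000, index=4, scale=8, disp=0x10)))  # 0x1030
print(load_data_ready_cycle())           # 11
print(load_data_ready_cycle(hit=False))  # 31
```

Any instruction in the scheduler that depends on the load simply cannot issue before the returned cycle, which is how a cache miss turns into waiting time.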
Figure 11: (Data Access forces instructions to wait)
URL : http://www.ezdoum.com/upload/ooo/ooob.gif
Floating Point Instructions Have Their Own Scheduler and Pipeline
The Athlon can simultaneously process three types of floating-point instructions
(FADD, FMUL, and FSTORE), as shown in the figure below. The floating-point units
are "fully pipelined", so that new FP instructions can start while other
instructions haven't yet completed. MMX/3DNow! instructions can be executed in
the FADD and FMUL pipelines. The FP instructions execute out of order, and each
of the three pipelines has several different execution units. There are some
issue restrictions that apply to these pipelines. The performance of the
Athlon's fully-pipelined FP units allows it to consistently outperform the
Pentium III at similar clock speeds, and a 1.33GHz Athlon even performs better
than a 1.5GHz Pentium 4 in some FP benchmarks. However, we haven't yet seen
enough SSE2-optimized applications to draw a definitive conclusion about
workloads that may benefit from SSE2.
Figure 12: (Floating point instructions have their own scheduler and pipeline)
URL : http://www.ezdoum.com/upload/ooo/oooc.gif
Branch Prediction Logic is a Combination of the Latest Methods
There is a 2048-entry Branch Target Buffer that caches the predicted target
address. This works in concert with a Global History Table that uses a "bimodal
counter" to predict whether branches are taken. If the prediction is correct,
then there is a single-cycle delay to change the instruction fetcher to the new
address. (Note that the P4 trace cache doesn't have any predicted-branch-taken
delays). If the predictor is wrong, then the minimum delay is 10 cycles. There
is also a 12-entry Return Address Buffer.
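For readers who want to see a bimodal counter in action, here is a minimal sketch of a single 2-bit saturating counter. It is a generic textbook model, not AMD's implementation: the class name, the starting state, and the idea of tracking one branch in isolation are all assumptions, and the table indexing is not modeled.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=1):
        self.state = state  # start "weakly not taken" (an arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes):
    """Run one counter over a list of branch outcomes (True = taken)."""
    counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct


# A loop branch: taken nine times, then falls through once.
print(count_correct([True] * 9 + [False]))  # 8
```

The two misses are the first iteration (the counter has not warmed up) and the loop exit. Using the figures from the text, each miss would cost at least 10 cycles, while each correctly predicted taken branch costs a single cycle.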
Overall Conclusions About the Athlon Microarchitecture
To prevent this article from becoming interminably long, we have to gloss over
many features of the Athlon architecture, and undoubtedly several features will
change as new versions are introduced. The main conclusion is that Athlon is a
more traditional, speculative, out-of-order machine and requires fewer pipeline
stages than the Pentium 4. At the same clock rate, Athlon should perform better
than Pentium 4 on many of today's mainstream applications. The actual comparison
ratio would depend on how well the P4's SSE2 instructions are being used, how
well the P4's branch predictors and pre-fetchers are working, and how well the
system/memory bus is being utilized. Memory bandwidth-intensive applications
favor the P4 today. There is a lot of room for optimizing code to match the
microarchitecture, and both AMD and Intel are working with software developers
to tune the applications. We look forward to seeing what enhancements AMD
delivers with Palomino.
Centaur C3 Microarchitecture
Even though VIA/Centaur doesn't have the same market share as Intel and AMD,
they have an experienced design team and some interesting architectural
innovations. This architecture also makes a nice contrast with the Intel and AMD
approaches, since Centaur has been able to stay with an in-order pipeline and
still achieve good performance. The Centaur chips use the same P6 system bus and
Socket 370 motherboards.
A great cost advantage for C3 is its diminutive size--only 52 sq mm in its
0.18-micron process. This compares to 120 sq mm for Athlon and 217 sq mm for P4. Also,
the fastest C3 today at 800MHz consumes a very modest 17.4 watts max at 1.9V,
with typical power measured at 10.4 watts. This is much more energy-efficient
than Athlon and P4.
Improving the Memory Subsystem to Solve the Key Problems
There are some philosophical differences of opinion on how best to spend the
limited transistor budget, especially for architectures specifically designed
for lower cost and power. Intel and AMD are battling for the high-end where the
fastest CPUs command a price premium. They can tolerate the expense of larger
die sizes and more thermally-effective packages and heat sinks. However, when
the goal of maximum performance drops to a number 2 or 3 slot behind power and
cost, then different design choices are made.
Up until now, Intel and AMD have made slight modifications to their
high-performance architectures to address these other markets. As the markets
bifurcate further, AMD and Intel may introduce parts with microarchitectures
that are more optimized for power and cost.
Centaur Uses Cache Design to Directly Deal with Latency
VIA (Centaur) has made early design choices to target the low-cost markets.
Centaur has stressed the value of optimizing the memory subsystem to solve the
key problems of memory latency. If you're constraining your die size to reduce
cost, then many processor designers feel it's often a better trade-off to use
those transistors in the memory subsystem.
Centaur's chip architects believe that their large L1 caches (four-way, 64KB
each) give them a better performance return than if they had used the die area
(and design time) to more aggressively reschedule instructions in the pipeline.
If latency is the key problem, then clever cache design is a direct way to
address it. The figure below shows the block diagram of the Centaur processor.
The Cyrix name has recently been dropped, and this product is marketed as the
"VIA C3" (internally referred to as C5B).
Figure 13: (Centaur uses cache design to directly deal with latency)
URL : http://www.ezdoum.com/upload/ooo/oood.gif
Decoupling the Pipeline to Reduce Instruction Blockage
Even with a pipeline that processes instructions in-order, it is possible to
solve many of the key design problems by allowing the different pipeline stages
to process groups of instructions. At various stages of the pipeline,
instructions are queued up while waiting for resources to become available.
This approach is called a "decoupled architecture". An in-order machine like the
Centaur C3 processor will have the same performance as the out-of-order approach
we've described, as long as no instructions block the pipeline. If a block
occurs at a later stage
of the pipeline, the in-order machine continues to fill queues earlier in the
pipeline while waiting for dependencies to clear. It can then proceed again at
full speed to drain the queues.
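A toy simulation shows why the queues matter. This is a sketch under strong assumptions: two stages only, one instruction per cycle per stage, and stall cycles supplied by hand. The function name and all numbers are hypothetical and do not describe the real C3 pipeline.

```python
def cycles_to_retire(n_instr, front_stalls, back_stalls, queue_depth):
    """Cycles needed to push n_instr through a front end and a back end.

    front_stalls / back_stalls: sets of cycle numbers in which that stage
    cannot do any work. queue_depth=0 models lock-step stages, where an
    instruction advances only when both stages are free in the same cycle.
    """
    fetched = retired = queued = cycle = 0
    while retired < n_instr:
        front_ok = cycle not in front_stalls
        back_ok = cycle not in back_stalls
        if queue_depth == 0:
            if front_ok and back_ok:
                retired += 1
        else:
            # Back end drains the queue first, then the front end refills it.
            if back_ok and queued > 0:
                queued -= 1
                retired += 1
            if front_ok and fetched < n_instr and queued < queue_depth:
                queued += 1
                fetched += 1
        cycle += 1
    return cycle


front = {6, 7, 8}  # e.g. an instruction fetch delay
back = {2, 3, 4}   # e.g. waiting on a dependency
print(cycles_to_retire(8, front, back, queue_depth=0))  # 14
print(cycles_to_retire(8, front, back, queue_depth=4))  # 12
```

In the lock-step case the two stalls add up. With a four-entry queue, the front end keeps fetching during the back-end stall, and the back end then drains the queue while the front end is stalled, so part of the delay is hidden.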
This is somewhat analogous to the reservation stations in the out-of-order
architectures. As Centaur continues to refine their architecture, they plan to
further decouple the pipeline by adding queues to other stages and execution
Super-Pipelining an In-Order Microarchitecture
The 12 stages of the C3 pipeline are shown on the right-hand side of the block
diagram in figure 13. By now, you're probably able to easily identify what
happens in each stage. Instructions are fetched from the large I-cache and then
pre-decoded (without needing extra pre-decode bits stored in the cache). The
decoder works by first translating x86 instructions into an interim x86 format
and placing them into a five-deep buffer, at which point enough is known about
branches to enable static prediction.
From this buffer, the interim instructions are translated into
micro-instructions, either directly or from a microcode ROM. The
micro-instructions are queued again before passing through the final decoder
where they also receive any data from registers. From there, the instructions
are dispatched to the appropriate execution unit, unless they require access to
the data cache.
Note that this pipeline has the Data Access stages before execution, much
different from our computer model. We'll talk about the implications in a
moment. The floating-point units are not designed for the highest performance,
since they run at half the pipeline frequency and are not fully pipelined (a new
FP instruction starts every other cycle). After the execution stage, all
instructions proceed through a "Store-Branch" stage before the result registers
are updated in the final pipeline stage. Note that the C3 supports MMX and
Breaking Our Simple Load/Store Computer Model
During the Store-Branch stage, a couple of interesting things occur. If a branch
instruction is incorrectly predicted, the new target address is sent to the
I-cache in this stage. The other operation is to move Store data into a store
buffer. Since an instruction has to pass through this pipeline stage anyway,
Centaur was able to directly implement the common Load-ALU and Load-ALU-Store
instructions as single micro-instructions that execute in a single cycle (with
data required to be loaded before the execute stage).
This completely removes the extra Load and Store instructions from the
instruction stream (as found in other current x86 processors following internal
RISC principles), speeding up execution time for these operations. No other
modern x86 processor has this interesting twist to the microarchitecture. It
also has the unfortunate side effect of complicating our original, simple model
of a computer pipeline, since this is a register-memory operation.
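The saving can be illustrated by counting internal operations for a small instruction stream. The table below is an assumption made for illustration: it splits each register-memory instruction into separate load, ALU, and store operations in the RISC-like style the text describes, and does not reflect the exact internal operations of any particular processor.

```python
# Hypothetical RISC-style breakdown of three x86 instruction kinds.
RISC_STYLE_OPS = {
    "alu": ["alu"],
    "load-alu": ["load", "alu"],
    "load-alu-store": ["load", "alu", "store"],
}


def internal_op_count(instr_kinds, single_micro_instruction):
    """Count internal operations for a stream of x86 instruction kinds.

    single_micro_instruction=True models the C3 approach from the text,
    where each of these instructions becomes one micro-instruction.
    """
    if single_micro_instruction:
        return len(instr_kinds)
    return sum(len(RISC_STYLE_OPS[kind]) for kind in instr_kinds)


stream = ["alu", "load-alu", "load-alu-store"]
print(internal_op_count(stream, single_micro_instruction=False))  # 6
print(internal_op_count(stream, single_micro_instruction=True))   # 3
```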
A Sophisticated Branch Prediction Mechanism
Since the C3 pipeline is fairly deep (P4's pipeline has changed our
perspective), good branch prediction becomes quite important. (That person in
the back of the room is going to love this discussion, since Centaur uses every
trick and invents some more.) Centaur takes the interesting approach of directly
calculating the target for unconditional branches that use a displacement value
(to an offset address). The designers decided that including a special adder
early in the pipeline was better than relying on a Branch Target Buffer for
these instructions (about 95% of all branches). Obviously, directly calculating
the address will always give the correct target address, whereas the BTB may not
always contain the target address.
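The calculation that the special adder performs is easy to sketch. For an x86 relative branch, the target is the address of the next instruction plus the sign-extended displacement. The function names and the example addresses below are assumptions for illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def branch_target(instr_addr, instr_len, disp, disp_bits):
    """Target of a relative branch: next instruction address + displacement."""
    next_ip = instr_addr + instr_len
    return (next_ip + sign_extend(disp, disp_bits)) & 0xFFFFFFFF


# A 2-byte short jump at 0x1000 with displacement 0xFE (-2) jumps to itself.
print(hex(branch_target(0x1000, 2, 0xFE, 8)))  # 0x1000
# The same jump with displacement 0x10 lands 16 bytes past the next instruction.
print(hex(branch_target(0x1000, 2, 0x10, 8)))  # 0x1012
```

Because everything needed is in the instruction bytes themselves, the result never depends on whether the branch has been seen before, which is the advantage over a BTB lookup.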
For conditional branches, Centaur uses the G-Share algorithm we described
earlier. This uses a 13-bit Global Branch History that is XOR'd with the branch
instruction address (an exclusive-OR of each pair of bits returns a 1 if ONLY
one input bit is a 1). The result indexes into the Branch History Buffer to look
up the prediction of the branch. Centaur also uses the "agrees-mode" enhancement
to encode a (single) bit that indicates whether the table look-up agrees with
the static predictor. They also have another 4K-entry table that selects which
predictor (simple or history-based) to use for a particular branch (based on the
previous behavior of the branch). Basically, Centaur uses a static predictor and
two different dynamic predictors, as well as a predictor to select which type of
dynamic predictor to use. To that person in the back of the room, if you'd like
to know more, check out Centaur's patent filings. A future ExtremeTech article
will focus specifically on branch prediction methods.
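The G-Share indexing step can be sketched as follows. The 13-bit history and the XOR with the branch address come from the text. Everything else is an assumption: the class name, the table of 2-bit counters, and the initial counter value are generic textbook choices, and the agrees-mode bit and the selector table are left out.

```python
HISTORY_BITS = 13
TABLE_SIZE = 1 << HISTORY_BITS  # 8192 entries, one per possible index


class GShare:
    def __init__(self):
        self.history = 0                  # global history of recent outcomes
        self.counters = [1] * TABLE_SIZE  # 2-bit counters, weakly not taken

    def _index(self, branch_addr):
        # XOR the branch address with the global history, keep 13 bits.
        return (branch_addr ^ self.history) & (TABLE_SIZE - 1)

    def predict(self, branch_addr):
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the newest outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & (TABLE_SIZE - 1)


# A branch that alternates taken / not taken on every execution.
predictor = GShare()
correct_late = 0
for i in range(200):
    taken = (i % 2 == 0)
    if i >= 100 and predictor.predict(0x400) == taken:
        correct_late += 1
    predictor.update(0x400, taken)
print(correct_late)  # 100
```

A lone 2-bit counter cannot learn a strictly alternating branch, but once the history register is mixed into the index, the "last time taken" and "last time not taken" cases land on different table entries and each one becomes predictable.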
Overall Conclusions About the Centaur Architecture
This microarchitecture has some interesting innovations that are made possible
by staying with an in-order pipeline and focusing on low-cost, single-processor
systems. While these microarchitectural features are interesting, our analysis
doesn't draw any conclusions about performance (except to note the half-speed FP
unit). The performance will depend on the type of applications, and a CPU that
is optimized for cost should really be viewed at the system level. If cost is a
primary concern, then the entire system needs to be configured with the minimum
hardware required to acceptably run the applications you care about. Stay tuned
to ExtremeTech for benchmarks of these budget PCs.
This ends our journey through the strange world inside modern CPUs. We started from
basic concepts and went very rapidly through a lot of complicated stuff. We hope
you didn't have too much trouble digesting it all at one sitting. As we stated
at the very beginning, the details about microarchitecture are only interesting
to CPU architects and hard-core PC technology enthusiasts. As you've learned,
the designers have made several trade-offs, and they've been forced to optimize
for certain types of applications. If those applications are important to you,
then check out the appropriate benchmarks running on real systems. In that way,
the CPU microarchitecture can be analyzed in the context of the entire PC
The Future of PC Microarchitectures
It used to be easy to forecast the sort of microarchitectural features coming to
PC processors. All one had to do was look at high-end RISC chips or large
computer systems. Well, most of the high-end design techniques have already made
their way into the PC processor world, and to go forward will require new
innovation by the PC CPU vendors.
Teaching an Old Dog New Tricks
One interesting trend is to return to older approaches that were not previously
viable for the mainstream. The most noteworthy example is "Very Long Instruction
Word (VLIW)" architectures. This is what is referred to as an "exposed pipeline"
where the compiler must specifically encode separate instructions for each
parallel operation in advance of execution. This is much different than forcing
the processor to dynamically schedule instructions while it is running.
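To give a feel for the compiler's job, here is a minimal sketch of a greedy scheduler that packs operations into fixed-width instruction words. The function name, the dependency format, and the greedy policy are assumptions for illustration. Real VLIW compilers also account for operation latencies and the mix of functional units, which this sketch ignores.

```python
def pack_bundles(ops, deps, width):
    """Greedily pack operations into bundles of at most `width` slots.

    ops:  operation names in program order.
    deps: dict mapping an operation to the set of operations it depends on.
    An operation may enter a bundle only if everything it depends on was
    placed in an EARLIER bundle, so operations within a bundle are independent.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) < width and deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependencies")
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
        bundles.append(bundle)
    return bundles


# c needs a and b; d needs c. Only a and b can run in parallel.
deps = {"c": {"a", "b"}, "d": {"c"}}
print(pack_bundles(["a", "b", "c", "d"], deps, width=3))
# [['a', 'b'], ['c'], ['d']]
```

The empty slots in the second and third bundles are the price of doing this work ahead of time: the hardware stays simple, but the parallelism it gets is only what the compiler could prove in advance.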
The key enabler is that compiler technology has improved dramatically, and a
VLIW architecture makes the compiler do more of the work for discovering and
exploiting instruction-level-parallelism. Transmeta has implemented an internal
VLIW architecture for their low-power Crusoe CPUs, counting on their software
morphing technology to exploit the parallel architecture. Intel's new 64-bit
"Itanium" architecture uses a version of VLIW, but it has been slow to get to
market. It will be several years before enough interesting desktop applications
can be ported to Itanium and make it a mainstream desktop CPU.
AMD Plans to Hammer Its Way into the High End of the Market
Instead of counting on new compilers and the willingness of software developers
to support a radically-new architecture (like Itanium), AMD is evolving the x86
instruction set to support full 64-bit processing. With a 64-bit architecture,
the "Hammer" series of processors will be better at working on very large
problems that require more addressing space (servers and workstations). There
will also be a performance gain for some applications, but the real focus will
be support for large, multi-processor systems. Eventually, the Hammer family
could make its way down into the mainstream desktop.
Still Some Features to Copy From RISC
Some new RISC chips have an interesting and exciting feature that hasn't yet
made its way into the PC space. Called "Simultaneous Multithreading (SMT)", this
approach duplicates all the registers and swaps register sets whenever a
"thread" comes to a long-latency operation. A thread is just an independent
instruction sequence, whether explicitly defined in a single program or part of
a completely different process. This is how multi-processing works with advanced
operating systems, dispatching threads to different processors. Imagine that
future CPUs may take thousands of pipeline cycles for a main memory load.
In an SMT machine, rather than have a processor sit idle while waiting for data
from memory, it could just "context switch" to a different register set and run
code from the different thread. The more sets of registers, the more
simultaneous threads the CPU could switch between. It is rumored that Intel's
new Xeon processor based on the P4 core actually has SMT capability built in but
not yet enabled.
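The benefit of switching register sets during a long memory wait can be sketched with a toy model. All of it is assumed for illustration: each thread alternates a fixed burst of work with a fixed memory wait, switching costs nothing, and the function name and numbers are made up.

```python
def total_cycles(n_threads, bursts, work, latency):
    """Cycles until every thread has finished all bursts and memory waits.

    Each thread runs `work` busy cycles, then waits `latency` cycles for
    memory, repeated `bursts` times. Only one thread executes at a time.
    When the running thread starts a wait, the core switches to any thread
    whose data has already arrived.
    """
    remaining = [bursts] * n_threads  # bursts left for each thread
    ready_at = [0] * n_threads        # cycle when each thread can run again
    cycle = 0
    while any(remaining):
        runnable = [t for t in range(n_threads)
                    if remaining[t] and ready_at[t] <= cycle]
        if not runnable:
            # Everyone is waiting on memory: the core idles until data arrives.
            cycle = min(ready_at[t] for t in range(n_threads) if remaining[t])
            continue
        t = runnable[0]
        cycle += work
        remaining[t] -= 1
        ready_at[t] = cycle + latency
    return max(ready_at)


print(total_cycles(1, bursts=4, work=10, latency=30))  # 160
print(total_cycles(4, bursts=4, work=10, latency=30))  # 190
```

With one register set, the core spends three quarters of its time idle. With four, it completes four times the work in 190 cycles rather than the 640 that running the threads back to back would take, because each thread's memory wait is covered by the other threads' work.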
Integration and a Change in Focus
Most of the recent architectural innovation has been directed at performing
better on media-oriented tasks. Instead of just adding instructions for media
processing, why not create a media processor that can also handle x86
instructions? A media processor is a class of CPU that is optimized for
processing multiple streams of timing-critical media data.
The shift in focus from "standard" x86 processing will become even more likely
as CPUs are more tightly-integrated with graphics, video, sound and
communications subsystems. It's unlikely that vendors would market their
products as x86-compatible media processors, rather than just advanced x86
processors, but the shift in design focus is already underway.
Getting Comfortable with Complexity
In all too short a time, even these forthcoming technologies will seem like
simple designs. We'll soon find it humorous that we thought a GHz processor was
a fast chip. We'll eventually consider it quaint that most computers used only a
single processor, since we could be working on machines with hundreds of CPUs on
a chip. Someday we might be forced to pore through complicated descriptions of
the physics of optical processing. We can easily imagine down the road that some
people will long for the simple days when our computers could send data with
metal traces on the chips or circuit boards.
In closing, if you've made it all the way through this article, you agree with
that enthusiastic person in the back of the room. As PC technology enthusiasts,
our hobby will just get better and better. These complex new technologies will
open up yet more worlds for our discovery, and we'll be inspired to explore
every new detail.
List of References
References and Suggestions for Further Reading:
Computer Architecture, a Quantitative Approach, 2nd Edition. Morgan Kaufmann
Publishers. Written by Hennessy & Patterson. This is a great book and is a
collaboration between John Hennessy (the Stanford professor who helped create
the MIPS architecture) and Dave Patterson (the Berkeley professor who helped
create the SPARC architecture).
Pentium Pro and Pentium II System Architecture, 2nd Edition. Mindshare, Inc.
Written by Tom Shanley. This book is slightly out of date, but Tom does a great
job of exposing extra details that aren't part of Intel's official
The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal,
First Quarter 2001.
Written by Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean,
Alan Kyker, and Patrice Roussel of Intel Corporation. This is a
surprisingly-detailed look at the Pentium 4 microarchitecture and design
AMD Athlon Processor x86 Code Optimization.
http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf Appendix A of this
document has an excellent walk-through of the Athlon microarchitecture.