Reduced Instruction Set Computing/Computer (RISC)
- is a type of Instruction Set Architecture (ISA)
- simple instructions that typically take 1 clock cycle to execute
- typically requires more instructions, and therefore more program memory, than CISC for the same task
- the burden of complexity is placed on the software
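
The trade-off in the list above (more but simpler instructions, with the sequencing left to software) can be illustrated with a toy model. This is a minimal sketch, not any real ISA: the opcode names `load`, `store`, `add` and the helpers `run_load_store` and `run_memory_add` are hypothetical, and "memory" is just a Python dict.

```python
# Toy model (hypothetical, not a real ISA): memory is a dict, registers a list.

def run_load_store(program, memory):
    """Run load/store-style instructions; return how many were executed."""
    regs = [0] * 32
    for op, *args in program:
        if op == "load":            # load rd, addr   (only loads touch memory)
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "store":         # store rs, addr  (only stores write memory)
            rs, addr = args
            memory[addr] = regs[rs]
        elif op == "add":           # add rd, rs1, rs2 (registers only)
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        else:
            raise ValueError(f"unknown op {op!r}")
    return len(program)


def run_memory_add(memory, dst, src1, src2):
    """One memory-to-memory add, in the style of a three-operand CISC instruction."""
    memory[dst] = memory[src1] + memory[src2]
    return 1


# c = a + b on the load/store machine: four simple instructions.
risc_mem = {"a": 2, "b": 3, "c": 0}
risc_count = run_load_store(
    [("load", 1, "a"), ("load", 2, "b"), ("add", 3, 1, 2), ("store", 3, "c")],
    risc_mem,
)

# The same statement as a single instruction with three memory operands.
cisc_mem = {"a": 2, "b": 3, "c": 0}
cisc_count = run_memory_add(cisc_mem, "c", "a", "b")

print(risc_count, cisc_count, risc_mem["c"], cisc_mem["c"])
```

Both versions compute the same result; the load/store version spends four instructions where the memory-to-memory version spends one, but each of its instructions touches memory at most once.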
RISC - What is It?
WARNING: you may want to print this one to read it…

(from preceding discussion):

> Anyway, it is not a fair comparison. Not by a long stretch. Let’s see how the Nth generation SPARC, MIPS, and 88K’s do (assuming they last) compared to some new design from scratch.

Well, there is baggage and there is BAGGAGE.

One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION:

a) Architectures persist longer than implementations, especially user-level Instruction-Set Architecture.

b) The first member of an architecture family is usually designed with the current implementation constraints in mind, and if you’re lucky, software people had some input.

c) If you’re really lucky, you anticipate 5-10 years of technology trends, and that modifies your idea of the ISA you commit to.

d) It’s pretty hard to delete anything from an ISA, except where:

   1) You can find that NO ONE uses a feature (the 68020 to 68030 deletions mentioned by someone else).

   2) You believe that you can trap and emulate the feature “fast enough”, i.e., microVAX support for decimal ops, 68040 support for transcendentals.

Now, one might claim that the i486 and 68040 are RISC implementations of CISC architectures … and I think there is some truth to this, but I also think that it can confuse things badly:

Anyone who has studied the history of computer design knows that high-performance designs have used many of the same techniques for years, for all of the natural reasons, that is:

a) They use as much pipelining as they can, in some cases, if this means a high gate-count, then so be it.

b) They use caches (separate I & D if convenient).

c) They use hardware, not micro-code for the simpler operations.

(For instance, look at the evolution of the S/360 products. Recall that the 360/85 used caches, back around 1969, and within a few years, so did any mainframe or supermini.)

So, what difference is there among machines if similar implementation ideas are used?

A: there is a very specific set of characteristics shared by most machines
labeled RISCs, most of which are not shared by most CISCs.

The RISC characteristics:

a) Are aimed at more performance from current compiler technology (i.e., enough registers).

OR

b) Are aimed at fast pipelining in a virtual-memory environment with the ability to still survive exceptions without inextricably increasing the number of gate delays (notice that I say gate delays, NOT just how many gates).

Even though various RISCs have made various decisions, most of them have been very careful to omit those things that CPU designers have found difficult and/or expensive to implement, and especially, things that are painful, for relatively little gain.

I would claim, that even as RISCs evolve, they may have certain baggage that they’d wish weren’t there … but not very much. In particular, there are a bunch of objective characteristics shared by RISC ARCHITECTURES that clearly distinguish them from CISC architectures.

I’ll give a few examples, followed by the detailed analysis:

MOST RISCs:

3a) Have 1 size of instruction in an instruction stream

3b) And that size is 4 bytes

3c) Have a handful (1-4) of addressing modes (* it is VERY hard to count these things; will discuss later).

3d) Have NO indirect addressing in any form (i.e., where you need one memory access to get the address of another operand in memory)

4a) Have NO operations that combine load/store with arithmetic, i.e., like add from memory, or add to memory. (note: this means especially avoiding operations that use the value of a load as input to an ALU operation, especially when that operation can cause an exception.
Loads/stores with address modification can often be OK as they don’t have some of the bad effects)

4b) Have no more than 1 memory-addressed operand per instruction

5a) Do NOT support arbitrary alignment of data for loads/stores

5b) Use an MMU for a data address no more than once per instruction

6a) Have >= 5 bits per integer register specifier

6b) Have >= 4 bits per FP register specifier

These rules provide a rather distinct dividing line among architectures, and I think there are rather strong technical reasons for this, such that there is one more interesting attribute: almost every architecture whose first instance appeared on the market from 1986 onward obeys the rules above …

Note that I didn’t say anything about counting the number of instructions…

So, here’s a table:

Age: number of years since first implementation sold in this family (or first thing with which this is binary compatible).

3a: # instruction sizes

3b: maximum instruction size in bytes

3c: number of distinct addressing modes for accessing data (not jumps). I didn’t count register or literal, but only ones that referenced memory, and I counted different formats with different offset sizes separately.
This was hard work… Also, even when a machine had different modes for register-relative and PC-relative addressing, I counted them only once.

3d: indirect addressing: 0: no, 1: yes

4a: load/store combined with arithmetic: 0: no, 1: yes

4b: maximum number of memory operands

5a: unaligned addressing of memory references allowed in load/store, without specific instructions

   - 0: no never (MIPS, SPARC, etc)
   - 1: sometimes (as in RS/6000)
   - 2: just about any time

5b: maximum number of MMU uses for data operands in an instruction

6a: number of bits for integer register specifier

6b: number of bits for 64-bit or more FP register specifier, distinct from integer registers

Note that all of these are ARCHITECTURE issues, and it is usually quite difficult to either delete a feature (3a-5b) or increase the number of real registers (6a-6b) given an initial instruction set design. (yes, register renaming can help, but…)

Now:

- items 3a, 3b, and 3c are an indication of the decode complexity
- 3d-5b hint at the ease or difficulty of pipelining, especially in the presence of virtual-memory requirements, and need to go fast while still taking exceptions sanely
- items 6a and 6b are more related to ability to take good advantage of current compilers.

There are some other attributes that can be useful, but I couldn’t imagine how to create metrics for them without being very subjective; for example “degree of sequential decode”, “number of write-backs that you might want to do in the middle of an instruction, but can’t, because you have to wait to make sure you see all of the instruction before committing any state, because the last part might cause a page fault,” or “irregularity/asymmetricness of register use”, or “irregularity/complexity of instruction formats”. I’d love to use those, but just don’t know how to measure them. Also, I’d be happy to hear corrections for some of these.

So, here’s a table of 12 implementations of various architectures, one per architecture, with the attributes above.
Just for fun, I’m going to leave the architectures coded at first, although I’ll identify them later. I’m going to draw a line between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of each column, I’m going to put a rule, which, in that column, most of the RISCs obey. Any RISC that does not obey it is marked with a +; any CISC that DOES obey it is marked with a *. So…

    CPU  Age   3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   # ODD
    RULE  <6   =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
    ------------------------------------------------------------------
    A1     4    1    4    1    0    0    1    0    1    8    3+      1
    B1     5    1    4    1    0    0    1    0    1    5    4       -
    C1     2    1    4    2    0    0    1    0    1    5    4       -
    D1     2    1    4    3    0    0    1    0    1    5    0+      1
    E1     5    1    4   10+   0    0    1    0    1    5    4       1
    F1     5    2+   4    1    0    0    1    0    1    4+   3+      3
    G1     1    1    4    4    0    0    1    1    1    5    5       -
    H1     2    1    4    4    0    0    1    0    1    5    4       -  RISC
    ------------------------------------------------------------------
    L4    26    4    8    2*   0*   1    2    2    4    4    2       2  CISC
    M2    12   12   12   15    0*   1    2    2    4    3    3       1
    N1    10   21   21   23    1    1    2    2    4    3    3       -
    O3    11   11   22   44    1    1    2    2    8    4    3       -
    P3    13   56   56   22    1    1    6    2   24    4    0       -

An interesting exercise is to analyze the ODD cases.

First, observe that of 12 architectures, in only 2 cases does an architecture have an attribute that puts it on the wrong side of the line.

Of the RISCs:

-A1 is slightly unusual in having more integer registers, and less FP than usual. [Actually, slightly out of date, 29050 is different, using integer register bank instead, I hear.]

-D1 is unusual in sharing integer and FP registers (that’s what the 0 in D1’s 6b column means).

-E1 seems odd in having a large number of address modes. I think most of this is an artifact of the way that I counted, as this architecture really only has a fundamentally small number of ways to create addresses, but has several different-sized offsets and combinations, but all within 1 4-byte instruction; I believe that its addressing mechanisms are fundamentally MUCH simpler than, for example, M2, or especially N1, O3, or P3, but the specific number doesn't capture it very well.

-F1 ....
is not sold any more.

-H1 one might argue that this processor has 2 sizes of instructions, but I'd observe that at any point in the instruction stream, the instructions are either 4-bytes long, or 8-bytes long, with the setting done by a mode bit, i.e., not dynamically encoded in every instruction.

Of the processors called CISCs:

-L4 happens to be one in which you can tell the length of the instruction from the first few bits, has a fairly regular instruction decode, has relatively few addressing modes, no indirect addressing. In fact, a big subset of its instructions are actually fairly RISC-like, although another subset is very CISCy.

-M2 has a myriad of instruction formats, but fortunately avoided indirect addressing, and actually, MOST instructions only have 1 address, except for a small set of string operations with 2. I.e., in this case, the decode complexity may be high, but most instructions cannot turn into multiple-memory-address-with-side-effects things.

-N1, O3, and P3 are actually fairly clean, orthogonal architectures, in which most operations can consistently have operands in either memory or registers, and there are relatively few weirdnesses of special-cased uses of registers.
Unfortunately, they also have indirect addressing, instruction formats whose very orthogonality almost guarantees sequential decoding, where it's hard to even know how long an instruction is until you parse each piece, and that may have side-effects where you'd like to do a register write-back early, but either:

- must wait until you see all of the instruction before you commit state, or
- must have "undo" shadow-registers, or
- must use instruction-continuation with fairly tricky exception handling to restore the state of the machine

It is also interesting to note that the original member of the family to which O3 belongs was rather simpler in some of the critical areas, with only 5 instruction sizes, of maximum size 10 bytes, and no indirect addressing, and requiring alignment (i.e., it was a much more RISC-like design, and it would be a fascinating speculation to know if that extra complexity was useful in practice). Now, here's the table again, with the labels:

    CPU  Age   3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   # ODD
    RULE  <6   =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
    ------------------------------------------------------------------
    A1     4    1    4    1    0    0    1    0    1    8    3+      1  AMD 29K
    B1     5    1    4    1    0    0    1    0    1    5    4       -  R2000
    C1     2    1    4    2    0    0    1    0    1    5    4       -  SPARC
    D1     2    1    4    3    0    0    1    0    1    5    0+      1  MC88000
    E1     5    1    4   10+   0    0    1    0    1    5    4       1  HP PA
    F1     5    2+   4    1    0    0    1    0    1    4+   3+      3  IBM RT/PC
    G1     1    1    4    4    0    0    1    1    1    5    5       -  IBM RS/6000
    H1     2    1    4    4    0    0    1    0    1    5    4       -  Intel i860
    ------------------------------------------------------------------
    L4    26    4    8    2*   0*   1    2    2    4    4    2       2  IBM 3090
    M2    12   12   12   15    0*   1    2    2    4    3    3       1  Intel i486
    N1    10   21   21   23    1    1    2    2    4    3    3       -  NSC 32016
    O3    11   11   22   44    1    1    2    2    8    4    3       -  MC 68040
    P3    13   56   56   22    1    1    6    2   24    4    0       -  VAX

General comment: this may sound weird, but in the long term, it might be easier to deal with a really complicated bunch of instruction formats, than with a complex set of addressing modes, because at least the former is more amenable to pre-decoding into a cache of decoded instructions that can be pipelined reasonably, whereas the pipeline on the
latter can get very tricky (examples to follow). This can lead to the funny effect that a relatively "clean", orthogonal architecture may actually be harder to make run fast than one that is less clean. Obviously, every weirdness has its penalties.... But consider the fundamental difficulty of pipelining something like (on a VAX):

    ADDL @(R1)+,@(R1)+,@(R2)+

I.e., something that might theoretically arise from:

    register **r1, **r2;
    **r2++ = **r1++ + **r1++;

Now, consider what the VAX has to do:

1) Decode the opcode (ADD)

2) Fetch first operand specifier from I-stream and work on it.

   a) Compute the memory address from (r1)

        If aligned
            run through MMU
                if MMU miss, fixup
            access cache
                if cache miss, do write-back/refill
        Else if unaligned
            run through MMU for first part of data
                if MMU miss, fixup
            access cache for that part of data
                if cache miss, do write-back/refill
            run through MMU for second part of data
                if MMU miss, fixup
            access cache for second part of data
                if cache miss, do write-back/refill

   Now, in either case, we now have a longword that has the address of the actual data.

   b) Increment r1 [well, this is where you'd LIKE to do it, or in parallel with step 2a).] However, see later why not...

   c) Now, fetch the actual data from memory, using the address just obtained, doing everything in step 2a) again, yielding the actual data, which we need to stick in a temporary buffer, since it doesn't actually go in a register.

3) Now, decode the second operand specifier, which goes through everything that we did in step 2, only again, and leaves the results in a second temporary buffer. Note that we'd like to be starting this before we get done with all of 2 (and I THINK the VAX9000 probably does that??) but you have to be careful to bypass/interlock on potential side-effects to registers ....
actually, you may well have to keep shadow copies of every register that might get written in the instruction, since every operand can use auto-increment/decrement. You'd probably want badly to try to compute the address of the second argument and do the MMU access interleaved with the memory access of the first, although the ability of any operand to need 2-4 MMU accesses probably makes this tricky. [Recall that any MMU access may well cause a page fault....]

4) Now, do the add. [could cause exception]

5) Now, do the third specifier .... only, it might be a little different, depending on the nature of the cache, that is, you cannot modify cache or memory, unless you know it will complete. (Why? well, suppose that the location you are storing into overlaps with one of the indirect-addressing words pointed to by r1 or 4(r1), and suppose that the store was unaligned, and suppose that the last byte of the store crossed a page boundary and caused a page fault, and that you'd already written the first 3 bytes. If you did this straightforwardly, and then tried to restart the instruction, it wouldn't do the same thing the second time.)

6) When you're sure all is well, and the store is on its way, then you can safely update the two registers, but you'd better wait until the end, or else, keep copies of any modified registers until you're sure it's safe. (I think both have been done ??)

7) You may say that this code is unlikely, but it is legal, so the CPU must do it.
This style has the following effects:

   a) You have to worry about unlikely cases.

   b) You'd like to do the work, with predictable uses of functional units, but instead, they can make unpredictable demands.

   c) You'd like to minimize the amount of buffering and state, but it costs you in both to go fast.

   d) Simple pipelining is very, very tough: for example, it is pretty hard to do much about the next instruction following the ADDL, (except some early decode, perhaps), without a lot of gates for special-casing. (I've always been amazed that CVAX chips are as fast as they are, and VAX 9000s are REALLY impressive...)

   e) EVERY memory operand can potentially cause 4 MMU uses, and hence 4 MMU faults that might actually be page faults...

8) Consider how "lazy" RISC designers can be:

   a) Every load/store uses exactly 1 MMU access.

   b) The compilers are often free to re-arrange the order, even across what would have been the next instruction on a CISC. This gets rid of some stalls that the CISC may be stuck with (especially memory accesses).

   c) The alignment requirement avoids especially the problem with sending the first part of a store on the way before you're SURE that the second part of it is safe to do.

Finally, to be fair, let me add the two cases that I knew of that were more on the borderline: i960 and Clipper:

    CPU  Age   3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   # ODD
    RULE  <6   =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
    ------------------------------------------------------------------
    J1     5    4+   8+   9+   0    0    1    0    2    4+   3+      5  Clipper
    K1     3    2+   8+   9+   0    0    1    2+   -    5    3+      5  Intel 960KB

SUMMARY:

1) RISCs share certain architectural characteristics, although there are differences, and some of those differences matter a lot.

2) However, the RISCs, as a group, are much more alike than the CISCs as a group.

3) At least some of these architectural characteristics have fairly serious consequences on the pipelinability of the ISA, especially in a virtual-memory, cached environment.

4) Counting instructions turns out to be fairly irrelevant:

   a) It's HARD
to actually count instructions in a meaningful way... (if you disagree, I'll claim that the VAX is RISCier than any RISC, at least for part of its instruction set :-) Why: VAX has a MOV opcode, whereas RISCs usually have a whole set of opcodes for {LOAD/STORE} {BYTE, HALF, WORD}

   b) More instructions aren't what REALLY hurts you, anywhere near as much as features that are hard to pipeline:

   c) RISCs can perfectly well have string-support, or decimal arithmetic support, or graphics transforms ... or lots of strange register-register transforms, and it won't cause problems ..... but compare that with the consequence of adding a single instruction that has 2-3 memory operands, each of which can go indirect, with auto-increments, and unaligned data...

====

    Article: 30346 of comp.arch
    Path: odin!mash.wpd.sgi.com!mash
    Subject: Updated addressing mode table
    Nntp-Posting-Host: mash.wpd.sgi.com

I promised to repost this with fixes, and people have been asking for it, so here it is again: if you saw it before, all that’s really different is some fixes in the table, and a few clarified explanations:

THE GIANT ADDRESSING MODE TABLE (Corrections happily accepted)

This table goes with the higher-level table of general architecture characteristics.

Address mode summary

    r    register
    r+   autoincrement (post) [by size of data object]
    -r   autodecrement (pre) [by size,…and this was the one I meant]
    >r   modify base register [generally, effective address -> base]
         NOTE: sometimes this subsumes r+, -r, etc, and is more general,
         so I categorize it as a separate case.
    d    displacement; d1 & d2 if 2 different displacements
    x    index register
    s    scaled index
    a    absolute [as a separate mode, as opposed to displacement+(0)]
    I    indirect

Shown below are 22 distinct addressing modes [you can argue whether these are right categories]. In the table are the *number* of different encodings/variations [and this is a little fuzzy; you can especially argue about the 4 in the HP PA column, I’m not even sure that’s right].
For example, I counted as different variants on a mode the case where the structure was the same, but there were different-sized displacements that had to be decoded. Note that meaningfully counting addressing modes is *at least as bad* as meaningfully counting opcodes; I did the best I could, and I spent a lot of hours looking at manuals for the chips I hadn’t programmed much, and in some cases, even after hours, it was hard for me to figure out meaningful numbers… *Most* of these architectures are used in general-purpose systems and *most* have at least one version that uses caches: those are important because many of the issues in thinking about addressing modes come from their interactions with MMUs and caches…

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
    r r
    r r r +d1 +d1
    r r r | | r r | r r+ +d +d1 I +s
    r r r +d +x +s| s+ s+|s+ +d +d|r+ +d I I I +s I
    r +d +x +s >r >r >r|r+ -r a a r+|-r +x +s|I I +s +s +d2 +d2 +d2
    -- -- -- -- -- -- --|-- -- -- -- --|-- -- --|-- -- -- -- --- --- ---
    AMD 29K   1 | | |
    Rxxx      1 | | |
    SPARC     1 1 | | |
    88K       1 1 1 | | |
    HP PA     2 1 1 4 1 1| | |
    ROMP      1 2 | | |
    POWER     1 1 1 1 | | |
    i860      1 1 1 1 | | |
    Swrdfish  1 1 1 | 1 | |
    ARM       2 2 2 1 1| 1 1
    Clipper   1 3 1 | 1 1 2 | |
    i960KB    1 1 1 1 | 2 2 | 1 |
    S/360     1 | 1 |
    i486      1 3 1 1 | 1 1 2 | 2 3|
    NSC32K    3 | 1 1 3 3 | 3| 9
    MC68000   1 1 | 1 1 2 | 2 |
    MC68020   1 1 | 1 1 2 | 2 4| 16 16
    VAX       1 3 1 | 1 1 1 1 1| 1 3| 1 3 1 3

COLUMN NOTES:

1) Columns 1-7 are addressing modes used by many machines, but very few, if any clearly-RISC architectures use anything else. They are all characterized by what they don’t have:

   - 2 adds needed before generating the address
   - indirect addressing
   - variable-sized decoding

2) Columns 13-15 include fairly simple-looking addressing modes, which however, *may* require 2 back-to-back adds before the address is available.
[*may* because some of them use index-register=0 or something to avoid indexing, and usually in such machines, you’ll see variable timing figures, depending on use of indexing.]

3) Columns 16-22 use indirect addressing.

ROW NOTES

1) Clipper & i960, of current chips, are more on the RISC-CISC border, or are sort of “modern CISCs”. ARM is also characterized (by ARM people, Hot Chips IV): “ARM is not a ‘pure RISC’”.

2) ROMP has a number of characteristics different from the rest of the RISCs, you might call it “early RISC”, and it is of course no longer made.

3) You might consider HP PA a little odd, as it appears to have more addressing modes, in the same way that CISCs do, but I don’t think this is the case: it’s an issue of whether you call something several modes or one mode with a modifier, just as there is trouble counting opcodes (with & without modifiers). From my view, neither PA nor POWER have truly “CISCy” addressing modes.

4) Notice difference between 68000 and 68020 (and later 68Ks): a bunch of incredibly-general & complex modes got added…

5) Note that the addressing on the S/360 is actually pretty simple, mostly base+displacement, although RX-addressing does take 2 regs+offset.

6) A dimension *not* shown on this particular chart, but also highly relevant, is that this chart shows the different *types* of modes, *not* how many addresses can be found in each instruction.
That may be worth noting also:

    AMD - i960          1    one address per instruction
    S/360 - MC68020     2    up to 2 addresses
    VAX                 6    up to 6

By looking at alignment, indirect addressing, and looking only at those chips that have MMUs, consider the number of times an MMU *might* be used per instruction for data address translations:

    AMD - Clipper       2    [Swordfish & i960KB: no TLB]
    S/360 - NSC32K      4
    MC68Ks (all)        8
    VAX                24

When RS/6000 does unaligned, it must be in the same cache line (and thus also in same MMU page), and traps to software otherwise, thus avoiding numerous ugly cases.

Note: in some sense, S/360s & VAXen can use an arbitrary number of translations per instruction, with MOVE CHARACTER LONG, or similar operations & I don’t count them as more, because they’re defined to be interruptible/restartable, saving state in general-purpose registers, rather than hidden internal state.

SUMMARY:

1) Computer design styles mostly changed from machines with:

   - 2-6 addresses per instruction, with variable sized encoding
   - address specifiers were usually “orthogonal”, so that any could go anywhere in an instruction
   - sometimes indirect addressing
   - sometimes need 2 adds *before* effective address is available
   - sometimes with many potential MMU accesses (and possible exceptions) per instruction, often buried in the middle of the instruction, and often *after* you’d normally want to commit state because of auto-increment or other side effects.

   to machines with:

   - 1 address per instruction
   - address specifiers encoded in small # of bits in 32-bit instruction
   - no indirect addressing
   - never need 2 adds before address available
   - use MMU once per data access

   and we usually call the latter group RISCs.
I say “changed” because if you put this table together with the earlier one, which has the age in years, the older ones were one way, and the newer ones are different.

2) Now, ignoring any other features, but looking at this single attribute (architectural addressing features and implementation effects thereof), it ought to be clear that the machines in the first part of the table are doing something *technically* different from those in the second part of the table. Thus, people may sometimes call something RISC that isn’t, for marketing reasons, but the people calling the first batch RISC really did have some serious technical issues at heart.

3) One more time: this is *not* to say that RISC is better than CISC, or that the few in the middle are bad, or anything like that … but that there are clear technical characteristics…

    -john mashey    DISCLAIMER: generic disclaimer, I speak for me only, etc
    UUCP: mash@sgi.com
    DDD: 415-390-3090 FAX: 415-967-8496
    USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
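
As a cross-check on the labeled attribute table earlier in the post, the following Python sketch recomputes the "# ODD" column from the RULE row. It is only an illustration under stated assumptions: the names `RULES`, `TABLE` and `odd_count` are made up here, the values are transcribed from the table, and the Age column is left out because every RISC row passes the "<6" rule and every CISC row fails it, so it never changes the count. The two borderline rows (Clipper, i960KB) are omitted.

```python
# Rules from the RULE row, one predicate per attribute column (Age omitted,
# see the note above). Names here are illustrative, not from the original post.
RULES = {
    "3a": lambda v: v == 1,   # one instruction size
    "3b": lambda v: v == 4,   # ... and that size is 4 bytes
    "3c": lambda v: v < 5,    # a handful of addressing modes
    "3d": lambda v: v == 0,   # no indirect addressing
    "4a": lambda v: v == 0,   # no load/store combined with arithmetic
    "4b": lambda v: v == 1,   # at most 1 memory operand
    "5a": lambda v: v < 2,    # no arbitrary unaligned access
    "5b": lambda v: v == 1,   # MMU used once per instruction for data
    "6a": lambda v: v > 4,    # >= 5 bits per integer register specifier
    "6b": lambda v: v > 3,    # >= 4 bits per FP register specifier
}

# (class, 3a, 3b, 3c, 3d, 4a, 4b, 5a, 5b, 6a, 6b), transcribed from the table.
TABLE = {
    "AMD 29K":     ("RISC", 1, 4, 1, 0, 0, 1, 0, 1, 8, 3),
    "R2000":       ("RISC", 1, 4, 1, 0, 0, 1, 0, 1, 5, 4),
    "SPARC":       ("RISC", 1, 4, 2, 0, 0, 1, 0, 1, 5, 4),
    "MC88000":     ("RISC", 1, 4, 3, 0, 0, 1, 0, 1, 5, 0),
    "HP PA":       ("RISC", 1, 4, 10, 0, 0, 1, 0, 1, 5, 4),
    "IBM RT/PC":   ("RISC", 2, 4, 1, 0, 0, 1, 0, 1, 4, 3),
    "IBM RS/6000": ("RISC", 1, 4, 4, 0, 0, 1, 1, 1, 5, 5),
    "Intel i860":  ("RISC", 1, 4, 4, 0, 0, 1, 0, 1, 5, 4),
    "IBM 3090":    ("CISC", 4, 8, 2, 0, 1, 2, 2, 4, 4, 2),
    "Intel i486":  ("CISC", 12, 12, 15, 0, 1, 2, 2, 4, 3, 3),
    "NSC 32016":   ("CISC", 21, 21, 23, 1, 1, 2, 2, 4, 3, 3),
    "MC 68040":    ("CISC", 11, 22, 44, 1, 1, 2, 2, 8, 4, 3),
    "VAX":         ("CISC", 56, 56, 22, 1, 1, 6, 2, 24, 4, 0),
}


def odd_count(kind, values):
    """Count the marked cells in a row: '+' for RISCs, '*' for CISCs."""
    obeyed = sum(rule(v) for rule, v in zip(RULES.values(), values))
    # A RISC is "odd" where it breaks a rule; a CISC is "odd" where it obeys one.
    return len(RULES) - obeyed if kind == "RISC" else obeyed


odd = {name: odd_count(kind, vals) for name, (kind, *vals) in TABLE.items()}
for name, n in odd.items():
    print(f"{name:12s} {n}")
```

With the values as transcribed, the counts agree with the table's "# ODD" column (a "-" in the table corresponds to 0 here).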