Alternative as in choice, choice as in freedom

Alternative ISA General is a discussion thread about non-x86 hardware. "Alternative" doesn't mean "unpopular"; it means "alternative to x86". While there have been such threads in the past, they were usually sporadic and poorly connected with one another, which meant that whatever transpired in one thread wasn't carried over to the next one.

Due to the rise of desktop-class ARM chips, interest in alternative hardware has risen, with many Anons even coming up with projects of their own. Therefore, a centralised place was needed, where we could keep track of the development and goals of the community.

While discussion of Intel or AMD hardware is not absolutely prohibited (and even if it were, who is gonna enforce this? LOL), due to the ubiquity of x86 hardware, it is assumed that whatever concerns such architecture can be discussed in any of the other gorillion threads on the board.

Old threads are available on Desuarchive. The OP msg is kept for reference.

1 Ongoing projects
2 Resources
3 ISA Overview
4 ISA Features
5 Microarchitecture
- 5.1 Pipelining
- 5.2 Speculative Execution
6 ISA Implementations
- 6.1 Processors
- 6.2 Home Made Processors
  - 6.2.1 TTL Processors
  - 6.2.2 Soft Cores
7 Making your own ISA
- 7.1 Design
- 7.2 Implementation
8 Alternative OS for Alternative ISA
9 Simulators
10 Links
11 To Do

Ongoing projects

SOON™

Anons are currently interested in porting several open source projects to the PowerPC architecture. Currently the following proposals have been made:

Grand Theft Auto III

Re3 is a homebrew engine intended to replace proprietary RenderWare with an open source implementation. Anons have been discussing making a port for the 32-bit PowerPC version of Mac OS X.

The Elder Scrolls III: Morrowind

OpenMW is a free and open source modern re-implementation of the Gamebryo engine.

Tomb Raider

OpenLara is a Classic Tomb Raider open-source engine.

Resources

Anon has been kind enough to put together a small reference library.

A collection of brief infocards for many different processors is available on TextFiles which also holds huge archives relating to programming and microcomputers, especially 8-bit processors. Also WikiChip and CPU Shack have a lot of information on alternative ISAs, even though the front page is dominated by mainstream processors.

Wikipedia has a Comparison of instruction set architectures. There was also a List of instruction sets, which editors decided to merge into the Comparison article, but in typical Wikipedia fashion a redirect was made without anyone doing the real work of merging the articles. Thus the archived List of instruction sets is still worth a review.

ISA Overview

ISA simply means Instruction Set Architecture. This is what the programmer sees from the outside, which these days is very different from the microcode and state machines operating inside the processors, normally inaccessible to ordinary programmers. The mid-1970s saw a Cambrian explosion of architectures that later fossilised into what we see today. Any assembly programmer and academics such as Hennessy and Patterson agree that x86 stinks, but as usual inertia and money trump speed, efficiency and elegance. Those three qualities are what we instead celebrate in this general.

The same ISA can be implemented by many different internal architectures and microcode. The ISA is the main topic but since many architectures are DIY we discuss both. The best way is to illustrate by examples of processors. An ISA is defined by a fairly large set of parameters.

Most ISAs can be loosely classified as CISC, which is complex, or RISC, which is simple. The RISC definition has drifted over the decades, and changed from "simple" to register-to-register operations with load-store memory handling and greatly reduced and simplified addressing modes.

ISA Features

Registers

The question is simply few or many. Few is good for low latency, many registers are good for lazy programmers and poor compilers. 6502 managed with just one accumulator plus a handful other registers. Compiler writers prefer at least 16 registers. One cannot avoid noting modern processors have tons of registers but still seem sluggish.

Register Use

Operations are performed on registers of some form.

Accumulator based where all processing is via one (or 2, rarely more) main register, examples are most early 8-bit processors.

Advantage: Efficient, the destination register is implied and saves space in instructions.

Disadvantage: Too much pressure on a single register, can be harder for parallelism such as out-of-order (ooo) processing.

Workaround: use two accumulators such as in the 6809, at the cost of needing a single-bit field in the instructions.

Stack based where everything is performed on a stack.

Advantage: Efficient, the destination is implied and makes for compact implementations.

Disadvantage: Too much pressure on the stack, can be harder for parallelism such as out-of-order (ooo) processing.

Workaround: use TOS (top of stack) in a register for fast access and operations, possibly also next on stack (NOS), examples are Novix NC4000 and many virtual processors.

Register file where many registers can be used in similar ways, examples are 68K and many modern processors.

Advantage: Great freedom in register use, ease of parallelism.

Disadvantage: Costs a lot in bitfield space in instructions, especially for 3-op ISA.

Workaround: Split register bank in data- and address registers and cut down on bitfield width, as seen in 68k.
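The stack-based workaround above (keeping the top of stack in a register) can be sketched in C. This is a minimal illustration of the idea, not a model of the Novix NC4000 or any real chip; the `StackCpu` type, the 16-entry depth and the function names are all made up for this example.

```c
#include <stdint.h>

/* Hypothetical toy stack machine: the top of stack (TOS) is held in a
   "register" (a struct field); everything below it lives in memory. */
#define STACK_DEPTH 16

typedef struct {
    int16_t tos;               /* cached top of stack */
    int16_t mem[STACK_DEPTH];  /* items below TOS, mem[depth-2] is NOS */
    int     depth;             /* total number of items, including TOS */
} StackCpu;

/* Push a value: spill the old TOS to memory, load the new one.
   Returns 0 on success, -1 on overflow. */
int sc_push(StackCpu *cpu, int16_t value)
{
    if (cpu->depth > STACK_DEPTH)
        return -1;
    if (cpu->depth > 0)
        cpu->mem[cpu->depth - 1] = cpu->tos;
    cpu->tos = value;
    cpu->depth++;
    return 0;
}

/* ADD: both operands and the destination are implied, so the
   instruction itself needs no register bitfields at all.
   Returns 0 on success, -1 if fewer than two items are present. */
int sc_add(StackCpu *cpu)
{
    if (cpu->depth < 2)
        return -1;
    cpu->tos = (int16_t)(cpu->mem[cpu->depth - 2] + cpu->tos);
    cpu->depth--;
    return 0;
}
```

Note that `sc_add` touches memory only once (to fetch NOS); the result never leaves the TOS register, which is where the speed-up comes from.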

Register Types

Accumulator that is the default destination(s) of operations, usually tightly coupled to the ALU (Arithmetic Logic Unit)

Data Register similar to accumulator, but on processors with register files such as 68K that had 8 data registers

Address Registers used for addressing into memory, usually tightly coupled to the data address generators. Stack pointers can be a form of an address register

Index Registers used for indexing into memory from an offset that may come from an address register

Operands

2-op instructions of the form A += X; or

3-op instructions of the form A = B + C.

The former requires a little extra thought but 3-op is simple for lazy programmers and poor compilers. One cannot avoid noting modern processors drift towards 3-op instructions.
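To see what the operand count costs in instruction bits, here is a rough sketch assuming a made-up 16-bit instruction word and 16 registers (4 bits per register field). The field layout and the names `encode2` and `encode3` are assumptions for illustration only and do not match any real ISA.

```c
#include <stdint.h>

/* Hypothetical 16-bit instruction formats with 16 registers.
   2-op:  [ opcode:8 | rd:4 | rs:4 ]           -> 256 possible opcodes
   3-op:  [ opcode:4 | rd:4 | rs1:4 | rs2:4 ]  ->  16 possible opcodes */

/* 2-op form, e.g. rd += rs */
uint16_t encode2(unsigned opcode, unsigned rd, unsigned rs)
{
    return (uint16_t)(((opcode & 0xFFu) << 8) |
                      ((rd & 0xFu) << 4) |
                       (rs & 0xFu));
}

/* 3-op form, e.g. rd = rs1 + rs2 */
uint16_t encode3(unsigned opcode, unsigned rd, unsigned rs1, unsigned rs2)
{
    return (uint16_t)(((opcode & 0xFu) << 12) |
                      ((rd  & 0xFu) << 8) |
                      ((rs1 & 0xFu) << 4) |
                       (rs2 & 0xFu));
}
```

With these assumptions the third register field eats half of the opcode space, which is one reason 3-op ISAs tend to use wider instruction words.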

Modes

Addressing modes
Mode	Description	Example
Accumulator	The instruction operates on the accumulator (and not, say, memory)	6502: ROL A
Absolute	The instruction operates on memory defined by a full width address	6502, 16-bit address: LDA $FF00
Absolute, X	The instruction operates on memory defined by a full width address, with offset defined by index register X	6502, 16-bit address: LDA $FF00, X
Absolute, Y	The instruction operates on memory defined by a full width address, with offset defined by index register Y	6502, 16-bit address: STA $FF00, Y
Immediate	The instruction uses data in program memory subsequent to the instruction	6502: LDX #$FD
Implied	Data is implied by the instruction	6502: SEI
Indirect	Data is accessed indirectly via a pointer	6502: JMP ($F000)
Indexed Indirect	Data is accessed indirectly via a pointer where the pointer is offset by the X register	6502: LDA ($C0, X)
Indirect Indexed	Data is accessed indirectly via a pointer where the target of the pointer is offset by the Y register	6502: LDA ($D0), Y
Relative	Data is accessed or program counter is accessed by an offset from present position	6502: BNE $F300
Zero Page	Data is accessed from the zero page, addressed by a single byte	6502: LDA $A0
Zero Page, X	Data is accessed from the zero page, addressed by a single byte, with offset defined by index register X	6502: LDA $A2, X
Zero Page, Y	Data is accessed from the zero page, addressed by a single byte, with offset defined by index register Y	6502: LDA $A4, Y
Post Increment	Data is accessed via an address register that is incremented after access	68K: MOVE.L (A0)+,D3
Pre Decrement	Data is accessed via an address register that is decremented before access	68K: MOVE.W -(A7),D4

The common modes are implied, immediate, absolute, absolute indexed, zeropage, zeropage indexed, stack relative, indexed indirect and indirect indexed, all with or without pre/post increment/decrement. Much can be combined. Stack relative addressing is very helpful for C programming, where variables are often transferred to the stack when calling a function. Relative addressing makes it possible to make relocatable code, which is useful when running several programs without an MMU.

Zero page was an addressing mode where the address was a single byte and it was implied this related to addresses 0x00 - 0xff where the high byte of the address (the "page") was set to zero. This saved space and time, typically 30 percent. This was later improved by Direct page wherein the page was set by a separate 8-bit register, such as on the 6809. This allowed moving the active page around and reduced the zero page pressure enormously. Lately zero page and direct page have fallen out of fashion.
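The effective address calculations behind some of these modes can be sketched in C. This is a simplified illustration: the function names are made up, `mem` is assumed to point at a byte array covering at least the zero page, and cycle counts and page-crossing penalties are ignored.

```c
#include <stdint.h>

/* Zero Page, X: the sum wraps around inside the zero page on a 6502. */
uint16_t ea_zero_page_x(uint8_t operand, uint8_t x)
{
    return (uint8_t)(operand + x);
}

/* Absolute, X: full 16-bit base address plus index register. */
uint16_t ea_absolute_x(uint16_t base, uint8_t x)
{
    return (uint16_t)(base + x);
}

/* Indexed Indirect, ($zp, X): index first, then fetch the pointer. */
uint16_t ea_indexed_indirect(const uint8_t *mem, uint8_t operand, uint8_t x)
{
    uint8_t zp = (uint8_t)(operand + x);
    uint8_t lo = mem[zp];
    uint8_t hi = mem[(uint8_t)(zp + 1)];
    return (uint16_t)(lo | (hi << 8));
}

/* Indirect Indexed, ($zp), Y: fetch the pointer first, then index. */
uint16_t ea_indirect_indexed(const uint8_t *mem, uint8_t operand, uint8_t y)
{
    uint8_t lo = mem[operand];
    uint8_t hi = mem[(uint8_t)(operand + 1)];
    return (uint16_t)((lo | (hi << 8)) + y);
}

/* Direct page (6809 style): an 8-bit page register supplies the high byte. */
uint16_t ea_direct_page(uint8_t dp, uint8_t operand)
{
    return (uint16_t)((dp << 8) | operand);
}
```

The only difference between zero page and direct page in this sketch is where the high byte comes from: a constant zero versus the `dp` register.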

Memory Architecture

Von Neumann which is a unified memory architecture for program and data.

Harvard Architecture where program and data are located in different memory spaces.

Recent designs are often hybrids, in that the ISA seen by programmers is von Neumann, while at the level 1 cache there is a Harvard architecture with separate instruction and data caches. This means self-modifying code can fail to work unless the caches are explicitly synchronised.

Overall Design

CPU that we are all familiar with

DSP are Digital Signal Processors that tend to use Harvard or super Harvard architecture, with separate memory buses for X-, Y-, and program memory. These are optimised for processing long series of numbers, typically sampled signals from an ADC, low power consumption and hard real-time requirements. Typically a DSP is an accumulator based design tightly coupled to the MAC (Multiply and Accumulate) unit, which is the heart of the DSP. In a typical clock tick, the DSP loads a parameter from X-memory and a parameter from the Y-memory, multiplies these and sums this with the accumulator, while incrementing the pointers into X- and Y-memory. Optionally there is also a shift and rounding in every tick. The pointer incrementing typically takes place in the data address generators that serve to feed the MAC at maximum speed. A well known C equivalent is

A += *x++ * *y++
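Wrapped in a loop, that one-liner becomes a dot product, which is the core of FIR filters and similar DSP kernels. A minimal sketch in plain C follows; the function name `mac`, the 16-bit sample type and the 64-bit accumulator are choices made for this example, and a real DSP would do each iteration in a single tick with dedicated hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate over two sample arrays, conceptually one from
   X-memory and one from Y-memory. The wide accumulator plays the role
   of the DSP's extended accumulator, so products do not overflow. */
int64_t mac(const int16_t *x, const int16_t *y, size_t n)
{
    int64_t acc = 0;
    while (n--) {
        /* load both operands, multiply, accumulate, bump both pointers */
        acc += (int32_t)*x++ * (int32_t)*y++;
    }
    return acc;
}
```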

GPUs are a more recent design where competition drives towards all-out performance no matter the thermal issues.

Microarchitecture

Pipelining

CPUs nowadays use instruction pipelining. Instructions cannot be performed in just one clock cycle. Without pipelining, a particular instruction will hog the entire CPU for many cycles. However, instructions need not use every single part of the CPU at the same time (they don't need to fetch something from memory and use the ALU and access the register file etc. all in the same clock cycle). To that end, we split up the CPU into stages.

Imagine you were doing laundry, and you had a washer and a dryer, and multiple loads of laundry to wash. You could put the first load into the washer, wait until it's done washing, then put that load into the dryer, wait until it's done drying, and then after the first load is completely finished, start washing the second load. But when the first load is drying, no one is using the washer. So why not put the second load into the washer while the first is drying? That's the basic idea behind pipelining. We have multiple stages, and keep pushing instructions through the pipeline so we're always making full use of the CPU. This way your instruction throughput is ideally 1 instruction per cycle - every cycle, you complete 1 instruction (though this isn't true for the first few cycles - no instructions get completed until the very first instruction gets to the final stage).

You can imagine this will cause some problems depending on how you divide the pipeline. Say we divide our pipeline to have an instruction fetching stage, an instruction decoding/register reading stage, an execution/ALU stage, a memory read/write stage, and a register writing stage. What if we have an instruction that writes a value to a register, and an instruction right after that which reads that register to do some other computation? Take this example:

ADD R1, R2, R2 //r1 = r2+r2

ADD R3, R1, R1 //r3 = r1 + r1

The second instruction will be in the register reading stage while the first instruction will only be in the execution stage. So, we'll read in the old value of R1 in the second instruction and get incorrect behavior. The class of problems that arise from pipelining are called pipeline hazards.

To fix this, we can add a metadata bit to the register file that indicates whether the current value of the register is stale or not. Then, we can just completely stall the pipeline - stop the instruction that's reading a stale value from proceeding to the next stage and stop all the instructions behind it from doing so as well. So, in the previous example, we would have the second instruction and all subsequent instructions stop and wait until the first instruction progresses to and finishes the register writing stage. Doing this is pretty expensive performance-wise; we're making all instructions wait and do nothing until that first instruction finishes.

A better way we might be able to use is register forwarding, where we implement a way for an earlier stage to read the result of a later stage. In the previous example, the execution stage is where the value of R1 is calculated. So, we can add a connection from the result of the execution stage to the register reading stage. This way, if the register reading stage sees that a register is currently stale, it can try reading the forwarded register from the execution stage instead. So ADD R3, R1, R1 can read R1 not from the register file, but from the result of the execution stage where the result R2+R2 is calculated for the sake of the ADD R1, R2, R2 instruction. This method is preferable, since we don't have to stall the pipeline for as many cycles.
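The cost of stalling versus forwarding can be put into numbers with a toy cycle counter. This is a sketch under simplifying assumptions that are not from any real CPU: a 5-stage pipeline, every instruction has one destination and two source registers, without forwarding a result is readable in the cycle after the register writing stage, and with forwarding it is readable in the same cycle the producer is in the execution stage. The names `Instr` and `total_cycles` are made up.

```c
#include <stddef.h>

#define NUM_REGS 8  /* assumed register count; indices must be 0..7 */

typedef struct { int dst, src1, src2; } Instr;

/* Count the cycles a 5-stage pipeline (fetch, decode/read, execute,
   memory, write-back) needs for a program, stalling in the decode
   stage whenever a source register is not yet readable. */
long total_cycles(const Instr *prog, size_t n, int forwarding)
{
    long ready[NUM_REGS] = {0};  /* first cycle each register is readable */
    long id = 1;                 /* decode cycle of the previous instruction;
                                    starts at 1 because fetch is cycle 1 */
    /* decode -> readable: 4 cycles via the register file, 1 if forwarded */
    const long latency = forwarding ? 1 : 4;

    if (n == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        long earliest = id + 1;  /* one instruction per cycle at best */
        if (ready[prog[i].src1] > earliest) earliest = ready[prog[i].src1];
        if (ready[prog[i].src2] > earliest) earliest = ready[prog[i].src2];
        id = earliest;
        ready[prog[i].dst] = id + latency;
    }
    return id + 3;  /* last instruction still needs EX, MEM and WB */
}
```

For the two ADD instructions from the example, this model gives 9 cycles with stalling only (3 stall cycles) and 6 cycles with forwarding (none). Real designs differ in the details, for instance many can write and read the register file in the same cycle.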

Speculative Execution

You might wonder how branches are handled with a pipelined architecture. Say you have code like the following:

BLT R0, R1, END

ADD R1, R0, R0

END: ADD R2, R1, R0

The branch instruction will be loaded into the pipeline. Again, let's assume a 5-stage pipeline like the one above. We'll only find out whether we need to branch to END or not during the execution stage - the third stage. So when the branch instruction gets to the second stage, what should we load into the first stage? We can only know what instruction comes next once the branch is evaluated. One thing we could do is stall the pipeline - wait until the branch instruction reaches the execution stage. This way, we know what instruction to load in next. But we'd like to avoid stalls wherever possible. So instead, we could just guess. Load in either the second instruction or the third instruction. If we happen to be wrong - that's ok. Before the execution stage, notice that we don't actually write anything to memory or the register file. So, if we find out we guessed the wrong branch when the branch instruction gets to the execution stage, we can clear out all the earlier stages - zero out the internal registers, etc, so the instructions become bubbles in the pipeline - NOOPs that don't do anything. Then, in the next clock cycle, we can load in the correct branch.

This is called speculative execution - we're speculating about what branch is going to be taken and starting the execution of the instructions at that branch.

Instead of just randomly guessing or always assuming that a branch is taken or not taken, we can use branch prediction. Modern CPUs will use advanced techniques to guess what branches are taken, based on the history of branches taken and other factors. This way, we usually guess correctly, and we don't have to erase instructions we've already started executing. Branch prediction can have real effects even in high level, compiler-optimized code.
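As a rough illustration of why the data pattern matters, here is a minimal sketch of a 2-bit saturating counter, one of the simplest textbook predictors. It is not how any particular modern CPU predicts branches; the function name `count_mispredictions`, the single shared counter and the starting state are assumptions for this example. It simulates the branch `if (data[i] >= threshold)` and counts wrong guesses.

```c
#include <stddef.h>

/* Simulate one 2-bit saturating counter for the branch
   "data[i] >= threshold" and return the number of mispredictions.
   States 0 and 1 predict not taken, states 2 and 3 predict taken. */
size_t count_mispredictions(const int *data, size_t n, int threshold)
{
    unsigned state = 0;   /* assumed start: strongly not taken */
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        int taken = data[i] >= threshold;
        int predicted = state >= 2;

        if (predicted != taken)
            misses++;

        /* move one step towards the actual outcome, saturating at 0 and 3 */
        if (taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }
    return misses;
}
```

With a threshold of 128, the sorted input {1, 2, 3, 200, 201, 202} is mispredicted twice (only around the switch-over), while the same values interleaved as {1, 200, 2, 201, 3, 202} are mispredicted three times, once for every taken branch. On long arrays the gap grows accordingly, which is why sorting data can speed up a loop that does the same work.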

ISA Implementations

Processors

4 Bit Processors

These arrived in the early 1970s with the Intel 4004, but were soon overtaken by 8-bit processors. The format still exists and is used in huge-volume markets that are extremely cost sensitive; consequently these chips are remarkably cheap.

8 Bit Processors

These usually have 8 bit registers and a 16 bit program counter or instruction pointer (terminologies vary) and can access 64 KB memory. Most are accumulator based, which worked well in the 1980s since in this era memory and CPUs were equally slow.

RCA 1802

This is a weird and wonderful processor implemented as bit serial architecture, which made it slow. It was popular for machines such as the COSMAC ELF. The fabrication process made it radiation resistant, so it was also popular for satellites, and it is still in production. A modern and compact machine is the 1802 Membership Card in the credit card form factor. Later the RCA 1805 was introduced, which used the single previously undefined opcode 0x68 as a prefix code to add several new instructions.

CHIP-8 was a popular virtual processor or very low level language, popular on the 1802 platform, and fast enough for making games. It has been extended and ported to many platforms.

6502

The 6502 was introduced in 1975, and has one accumulator (A), two index registers (X and Y), a stack pointer (S) and a processor status (P), all 8 bit wide; plus a 16 bit program counter (PC). It also has a zero page that could be used as address registers. It entered the market at a much lower price than the 6800 and quickly won a following. It was used in many popular computers of that era including the Apple II, BBC Micro and Commodore 64. For all the limitations it was powerful enough in the hands of skilled programmers to power the first spreadsheet (VisiCalc) which was also the first killer application, as well as 3D space games with hidden line wireframe graphics such as Elite.

The 6502 still has many loyal fans, hugely active communities and dozens of implementations. Complete development platforms, simulators, debuggers, operating systems, libraries and more are available, most for free. It is still supported commercially by The Western Design Center, founded by the original designer. An estimated 200 million chips are made annually for an installed base estimated at 2 billion. Not bad for a nearly 50 year old design. This time span also means it is proven, and is therefore used in applications such as pacemakers, where lifetime guarantees take on an entirely new meaning. It is also seen in robots and the occasional terminator. The 6502 has an extremely low transistor count which makes it interesting for new opportunities such as a flexible version.

The 6502 has two weaknesses. First of all it is awkward for 16 bit pointer handling, which The Woz overcame by making the SWEET-16 virtual processor. The second is that the 6502 is not suited for stack intensive languages such as C. This has been overcome by other virtual processors such as the p-code for the UCSD p-System and VTL-2 (source), both of which exist for several ISAs. A more recent virtual CPU for the 6502 is AcheronVM, self described as the successor to SWEET-16.

Several OSes have been made for the 6502, including LUnix (Little Unix), Minikernel, GeckOS/A65 and many more. GEOS was an add-on OS for the C64 that provided a windowing system plus many applications such as text processing, spreadsheets and more - all of this complex system fitting in 64 KB RAM. GEOS was not multitasking; that extension came with Wheels, which also had a web browser, but increased the RAM requirement to a whopping 128 KB.

Recently a 6502 backend for LLVM has been launched.

6800

This was introduced in 1974 and was thus an early design. It has dual accumulators and one 16-bit index register.

6809

This was the peak of 8-bit architectures with dual accumulators (A and B) that could be merged to a 16 bit accumulator (D), and even featured an opcode for multiplication. Hitachi got a license and made the 6309 variant that includes more registers including another set of dual accumulators (E and F) that could be merged to a 16 bit accumulator (W).
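The way A and B pair up into D, and what the multiply opcode does with them, can be sketched in C. On the 6809, A is the high byte of D and MUL stores the unsigned product of A and B in D; the struct and function names below are made up for illustration, and condition code flags are left out.

```c
#include <stdint.h>

/* The two 8-bit accumulators of the 6809. */
typedef struct {
    uint8_t a;  /* high byte of D */
    uint8_t b;  /* low byte of D */
} Acc6809;

/* Read A and B as the combined 16-bit accumulator D. */
uint16_t reg_d(const Acc6809 *r)
{
    return (uint16_t)((r->a << 8) | r->b);
}

/* Write D, which updates both A and B. */
void set_d(Acc6809 *r, uint16_t d)
{
    r->a = (uint8_t)(d >> 8);
    r->b = (uint8_t)(d & 0xFF);
}

/* MUL: unsigned 8x8 multiply of A and B, 16-bit result left in D. */
void mul6809(Acc6809 *r)
{
    set_d(r, (uint16_t)(r->a * r->b));
}
```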

Motorola made an extensive monitor for 6809 called Assist09.

Zilog Z80

This is an offshoot of Intel's 8080 by Zilog and was hugely popular in business applications thanks to CP/M. The Z80 was also used in the MSX range of home computers. Zilog played evil games and won evil prizes.

While the chip may be old, people are still making new multi tasking windowing operating systems for it.

Three Sega video game consoles used the Z80: The Sega Master System and Sega Game Gear used it as a CPU, whereas the m68k-based Sega Genesis had an onboard Z80 for backwards compatibility with Master System games, and was often used for audio control.

The Z80 enjoyed widespread use in embedded and handheld applications well into the 21st century. Many of Texas Instruments' once-ubiquitous graphing calculators (specifically the TI-81 through TI-86) were built around the Z80; the TI-84 Plus is still in production.

16-bit Processors

Typical 16-bit architectures support 20- or 24-bit addressing and 16-bit data. Typical clock speeds are in the megahertz to low tens of megahertz range.

Intel x86-16 (8086, 80186, 80286)

16-bit offerings from Intel included the 8086, 80186, and 80286.

WDC 65816 (65C816)

The '816 is essentially a 16-bit 6502 with some additional enhancements, such as a relocatable zero page. This processor was used in the Apple IIgs and the Super Famicom (SNES). Significant compatibility with the 6502; on reset, the processor is in compatibility mode, wherein it behaves substantially like a 65C02. The processor is not pin-compatible with the 6502, however.

Zilog Z8000 (Z8001, Z8002, Z8003, Z8004)

Introduced in 1979. Sixteen 16-bit general purpose registers that can be used in 32-bit or 64-bit combinations. Not compatible with the earlier Z80.

32-bit Processors

The 16-bit generation had a short reign before being overtaken by 32-bit processors.

Motorola 68k series (68000, 680x0)

Motorola's evolution of the 6800, introduced in 1979. The first generation processors (68000, 68010, 68012) are generally described as being mixed 16-/32-bit CPUs (the 68008 is described as mixed 8-/32-bit). This is due to the width of its data and address ALUs, and internal and external data buses. Later generations are all fully 32-bit.

The first Apple Macintosh computers used the 68000. Apple continued to use m68k CPUs until transitioning to the PowerPC in the mid-1990s. It is said that WDC was designing a 32-bit successor to the 65C816, but Apple chose to go with Motorola chips and the rest is history. The Amiga, Atari ST, and Sega Genesis also used m68k CPUs.

This ISA brought high performance with many registers (8 data registers and 8 address registers) and numerous addressing modes. The complexity might at first glance seem overwhelming; nevertheless it was very popular and performant with assembly programmers.

NS32000

This was an early 32-bit processor but troubled by bugs. More recently it has been recreated in Verilog with many improvements.

PowerPC (PPC) architecture

Conceived in 1991 by a consortium of Apple, IBM, and Motorola (called AIM). PPC-based Apple Macintosh computers—dubbed "Power Macintosh"—entered the market in 1994, and all m68k-based Macintosh computers were discontinued by mid-1996. Apple used PPC until transitioning to x86 processors in 2006.

PPC was also used in a variety of video game consoles, such as Nintendo's GameCube, Wii, and Wii U; and Microsoft's Xbox 360. The 64-bit Cell processor—developed by a consortium of Sony, Toshiba, and IBM—was notably used in the PlayStation 3 console.

VAX

This is probably the peak of CISC and powered VAX computers, typically running the VMS operating system with a reliability where uptimes were measured in 10+ years. This can be simulated by SimH, see below.

Home Made Processors

Making a CPU chip requires a lot of work and infrastructure. Thankfully there are alternatives. The first is to use several chips, and TTL (Transistor-Transistor-Logic) chips were popular, and also used to prototype processors. Later FPGA (Field Programmable Gate Arrays) made things even simpler and faster.

TTL Processors

These can be wire wrap monsters but work surprisingly well. A well known example is the Home Brew CPU complete with an adapted C-compiler and a port of Minix. It is accessible from the net. Other home built processors can be found at the Homebuilt CPUs WebRing.

A very recent and interesting case is the Gigatron TTL Computer that has a micro code system that can emulate a 6502 processor and a 16-bit processor, at a speed sufficient for simple games.

There is also the scamp-cpu which has its own ISA implemented with microcode.

Soft Cores

Not to be confused with pr0n, a softcore is a description (typically in languages such as VHDL or Verilog) that is compiled and then downloaded into an FPGA. In the raw state an FPGA is a large collection of primitive components such as adders, MUX etc. that are connected together by the bitstream from the compiler, and then turn into nearly any kind of digital device such as a CPU, DSP, GPU, state machine or similar. A large collection of open source designs can be found on Github and OpenCores. These tend to be a lot faster than TTL processors, both in building/programming and in operations.

It should be noted that the FPGA companies also provide softcores, such as

Picoblaze for the Xilinx range of products,
Mico8 from the Lattice range,
Nios for the Altera (acquired by Intel) range

Gartner alleges that Nios is the most widely-used softcore tech in the FPGA industry.

Making your own ISA

This is where things get exciting!

Design

Start simple. Tempting as it may be, making the definitive ISA that once and for all will kick Intel off the market is not a good first project. And face it, if you are here reading this you are fairly new to ISA design. Start simple and get a feel for how it works. Like C or assembly programming, this is about skill and elegance that only come from experience. And if you don't want to make it elegant, well, Intel has shown even that can have utility. So start simple, perhaps 8-bit or even 4-bit. Reimplement an existing ISA; the 6502 is very popular in this respect. Some information can be found in the Nand to Tetris courses.
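To give an idea of how small "simple" can be, here is a sketch of an emulator for an entirely made-up accumulator ISA: every instruction is one byte, with a 4-bit opcode and a 4-bit operand. The opcodes, the `ToyCpu` struct, the `INS` macro and the 16-byte data memory are all invented for this example and are not a recommendation for how your ISA should look. Writing an emulator like this first is a cheap way to try out an instruction set before committing it to TTL or an FPGA.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, held in the high nibble of each instruction. */
enum {
    OP_HLT = 0,  /* stop */
    OP_LDI = 1,  /* acc = operand (4-bit immediate) */
    OP_LDA = 2,  /* acc = data[operand] */
    OP_ADD = 3,  /* acc += data[operand], wraps at 8 bits */
    OP_STA = 4,  /* data[operand] = acc */
    OP_JNZ = 5   /* if acc != 0, jump to instruction number operand */
};

typedef struct {
    uint8_t acc;       /* the single accumulator */
    uint8_t pc;        /* index of the next instruction */
    uint8_t data[16];  /* data memory, addressed by the 4-bit operand */
    int     halted;
} ToyCpu;

/* Assemble one instruction byte from opcode and operand. */
#define INS(op, arg) ((uint8_t)(((op) << 4) | ((arg) & 0x0F)))

/* Run until HLT, the end of the program, or max_steps instructions.
   Returns the number of instructions executed. */
size_t toy_run(ToyCpu *cpu, const uint8_t *prog, size_t prog_len,
               size_t max_steps)
{
    size_t steps = 0;

    while (!cpu->halted && cpu->pc < prog_len && steps < max_steps) {
        uint8_t ins = prog[cpu->pc++];   /* fetch */
        uint8_t op  = ins >> 4;          /* decode */
        uint8_t arg = ins & 0x0F;

        switch (op) {                    /* execute */
        case OP_LDI: cpu->acc = arg; break;
        case OP_LDA: cpu->acc = cpu->data[arg]; break;
        case OP_ADD: cpu->acc = (uint8_t)(cpu->acc + cpu->data[arg]); break;
        case OP_STA: cpu->data[arg] = cpu->acc; break;
        case OP_JNZ: if (cpu->acc != 0) cpu->pc = arg; break;
        default:     cpu->halted = 1; break;  /* HLT and undefined opcodes */
        }
        steps++;
    }
    return steps;
}
```

A program that computes 2 + 3 would be the six bytes `INS(OP_LDI, 2)`, `INS(OP_STA, 0)`, `INS(OP_LDI, 3)`, `INS(OP_ADD, 0)`, `INS(OP_STA, 1)`, `INS(OP_HLT, 0)`, leaving the sum in `data[1]`.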

Implementation

Going for a TTL design on breadboard or wirewrap is an exercise in patience. An FPGA might be simpler and avoids short-circuiting pins, especially if you use a development board with an FPGA and some auxiliary parts such as a display, switches and LEDs. You may have done software debugging using printf, now you might have to do debugging using an LED...

FPGAs are configured using designs in VHDL or Verilog. More information on that can be found in this thread over in Anycpu.org, including links to books.

Alternative OS for Alternative ISA

Many cross platform operating systems are available. Contiki OS is available for 6502, AVR and more. Microware OS-9 is available for 6809, 68K and more. FUZIX is a UNIX like OS available for many 8-bit processors and 68K.

Simulators

Often it can be impractical to run the actual hardware in order to test old software, such as ordering a large VAX to test VMS. The solution is a simulator, such as SimH, which is capable of simulating a large number of architectures.