





# **REVIEW: Embedded System Hardware**

Embedded system hardware is frequently used in a loop (*"hardware in a loop"*):



cyber-physical systems

# **REVIEW: Single-instruction, multiple-data (SIMD)**

- Multimedia instructions exploit the fact that many registers, adders etc. are quite wide (32/64 bit), whereas most multimedia data types are narrow (e.g. 8 bit per color, 16 bit per audio sample per channel)
- 2-8 values can be stored per register and added. E.g.:
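
A minimal Python sketch of the packed addition described above, assuming four 8-bit lanes in one 32-bit word and wrap-around (non-saturating) arithmetic. The helper names `pack`, `unpack` and `packed_add` are illustrative and do not correspond to any real instruction set.

```python
LANE_BITS = 8                      # assumed lane width (e.g. one color channel)
LANES = 4                          # four lanes fill a 32-bit register
LANE_MASK = (1 << LANE_BITS) - 1

def pack(values):
    """Pack four 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def packed_add(a, b):
    """Lane-wise addition: carries must not cross lane boundaries."""
    low7 = 0x7F7F7F7F              # low 7 bits of every lane
    msb = 0x80808080               # top bit of every lane
    # Add the low 7 bits (cannot overflow a lane), then fix the top bit by XOR.
    return ((a & low7) + (b & low7)) ^ ((a ^ b) & msb)

pixels = pack([1, 2, 3, 4])
offsets = pack([10, 20, 30, 40])
print(unpack(packed_add(pixels, offsets)))
```

Real multimedia extensions frequently offer saturating variants as well; the wrap-around behavior here is only one possible choice.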



## **Code-size efficiency**

- Compared with CISC machines, RISC machines are designed for run-time efficiency, not for code-size efficiency
- Compression techniques: key idea



#### **Code-size efficiency**

- Compression techniques (continued):
  - 2nd instruction set, e.g. ARM Thumb instruction set:





- Reduction to 65-70 % of original code size
- 130% of ARM performance with 8/16 bit memory
- 85% of ARM performance with 32-bit memory

Same approach for LSI TinyRisc, ... Requires support by compiler, assembler etc.

[ARM, R. Gupta]

# Dictionary approach, two level control store (indirect addressing of instructions)

"Dictionary-based coding schemes cover a wide range of various coders and compressors.

Their common feature is that the methods use some kind of a dictionary that contains parts of the input sequence which frequently appear.

The encoded sequence in turn contains references to the dictionary elements rather than containing these over and over."

[Á. Beszédes et al.: Survey of Code-Size Reduction Methods, *ACM Computing Surveys*, Vol. 35, Sept. 2003, pp. 223-267]

## Key idea (for d bit instructions)
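
A minimal sketch of the dictionary idea, assuming that every distinct d-bit instruction word is stored once in a table and the program is replaced by b-bit indices into that table (b < d). The function names and the example program are hypothetical; real schemes may compress instruction fragments or sequences instead of whole words.

```python
def compress(program):
    """Return (dictionary, indices): each distinct instruction is stored once."""
    dictionary = []
    index_of = {}
    indices = []
    for instr in program:
        if instr not in index_of:
            index_of[instr] = len(dictionary)
            dictionary.append(instr)
        indices.append(index_of[instr])
    return dictionary, indices

def decompress(dictionary, indices):
    """Indirect addressing: every index selects one dictionary entry."""
    return [dictionary[i] for i in indices]

def sizes_in_bits(program, d):
    """Size of the original program vs. index stream plus dictionary."""
    dictionary, indices = compress(program)
    b = max(1, (len(dictionary) - 1).bit_length())   # bits per index
    original = len(program) * d
    compressed = len(indices) * b + len(dictionary) * d
    return original, compressed

# Hypothetical program: 8 instructions of d = 32 bits, only 4 distinct words.
program = [0xE1A00000, 0xE2811001, 0xE1A00000, 0xE3500000,
           0xE2811001, 0xE1A00000, 0xEAFFFFFB, 0xE1A00000]
print(sizes_in_bits(program, d=32))
```

The scheme only pays off when instructions repeat often enough that the dictionary plus the index stream is smaller than the original code.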



## **Cache-based decompression**

- Main idea: decompression whenever cache-lines are fetched from memory.
- Cache lines ↔ variable-sized blocks in memory: line address tables (LATs) for translation of instruction addresses into memory addresses.
- Tables may become large and have to be bypassed by a line address translation buffer.
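
The translation step can be sketched as follows, assuming compressed blocks are stored back to back and the translation buffer is a small LRU cache in front of the LAT. The line size, the buffer capacity and all names are assumptions made for illustration.

```python
from collections import OrderedDict
from itertools import accumulate

LINE_SIZE = 32  # bytes per uncompressed cache line (assumed)

def build_lat(compressed_sizes):
    """Start offset of each compressed block when blocks are stored contiguously."""
    return [0] + list(accumulate(compressed_sizes))[:-1]

class TranslationBuffer:
    """Small LRU cache in front of the (possibly large) line address table."""

    def __init__(self, lat, entries=2):
        self.lat = lat
        self.entries = entries
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, instr_addr):
        """Map an instruction address to the memory offset of its compressed block."""
        line = instr_addr // LINE_SIZE
        if line in self.cache:
            self.hits += 1
            self.cache.move_to_end(line)        # mark as most recently used
        else:
            self.misses += 1                    # full LAT access needed
            self.cache[line] = self.lat[line]
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[line]

lat = build_lat([20, 12, 30, 8])    # hypothetical compressed block sizes in bytes
buf = TranslationBuffer(lat)
print(buf.translate(36), buf.translate(40), buf.hits, buf.misses)
```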

## **Partitioned register files**

- Many memory ports are required to supply enough operands per cycle.
- Memories with many ports are expensive.
- Registers are partitioned into (typically 2) sets, e.g. for the TMS320C6x: data path A and data path B



## More encoding flexibility with IA-64 Itanium

#### 3 instructions per **bundle**:

| 127       |           |           | 0        |
|-----------|-----------|-----------|----------|
| instruc 1 | instruc 2 | instruc 3 | template |
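
A sketch of how such a bundle can be packed and unpacked, assuming the IA-64 field widths of three 41-bit instruction slots plus a 5-bit template (128 bits in total). The slot order simply follows the table above; exact bit positions should be checked against the architecture manual. The function names are illustrative.

```python
SLOT_BITS = 41        # width of one instruction slot
TEMPLATE_BITS = 5     # width of the template field
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1

def pack_bundle(instr1, instr2, instr3, template):
    """Pack three slots and a template into one 128-bit integer."""
    for slot in (instr1, instr2, instr3):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("instruction does not fit into 41 bits")
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit into 5 bits")
    bundle = instr1
    bundle = (bundle << SLOT_BITS) | instr2
    bundle = (bundle << SLOT_BITS) | instr3
    bundle = (bundle << TEMPLATE_BITS) | template   # template ends at bit 0
    return bundle

def unpack_bundle(bundle):
    """Inverse of pack_bundle: returns (instr1, instr2, instr3, template)."""
    template = bundle & TEMPLATE_MASK
    instr3 = (bundle >> TEMPLATE_BITS) & SLOT_MASK
    instr2 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    instr1 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return instr1, instr2, instr3, template

bundle = pack_bundle(1, 2, 3, 0x10)
print(unpack_bundle(bundle))
```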

There are 5 instruction types:

- A: common ALU instructions
- I: more special integer instructions (e.g. shifts)
- M: Memory instructions
- F: floating point instructions
- B: branches

The following combinations can be encoded in templates:

- MII, MMI, MFI, MIB, MMB, MFB, MMF, MBB, BBB
- MLX, with LX = move 64-bit immediate encoded in 2 slots

The template also encodes instruction grouping information.
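
A small sketch of how a compiler might check which templates can hold a given triple of instruction types. It assumes that A-type instructions may be placed in either an I or an M slot (not stated above), and it treats the two halves of a long immediate as the pseudo-types L and X. The function name is illustrative.

```python
TEMPLATES = ["MII", "MMI", "MFI", "MIB", "MMB",
             "MFB", "MMF", "MBB", "BBB", "MLX"]

def matching_templates(types):
    """Return all templates whose three slots can hold the given instruction types.

    `types` is a 3-character string over A, I, M, F, B (plus L, X for the
    two halves of a 64-bit immediate move).
    """
    if len(types) != 3:
        raise ValueError("a bundle holds exactly 3 instructions")
    result = []
    for template in TEMPLATES:
        ok = all(slot == t or (t == "A" and slot in "IM")   # assumption: A fits I or M
                 for slot, t in zip(template, types))
        if ok:
            result.append(template)
    return result

print(matching_templates("MAI"))
print(matching_templates("FFF"))
```

An empty result means the three instructions cannot share a bundle in that order, so the compiler has to reorder them or pad with no-ops.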

## **Templates and instruction types**



Placement of stops within a bundle is very restricted. Parallel execution is possible within groups and can span several bundles.



# **Reconfigurable Logic**

- Full custom chips may be too expensive, software too slow.
- Combine the speed of HW with the flexibility of SW
  - HW with programmable functions and interconnect.
  - Use of configurable hardware; common form: field programmable gate arrays (FPGAs)
- Applications: bit-oriented algorithms like
  - encryption,
  - fast "object recognition" (medical and military)
  - Adapting mobile phones to different standards.
- Very popular devices from
  - XILINX (XILINX Virtex 6 are recent devices)
  - Actel, Altera and others



## **Floor-plan of VIRTEX II FPGAs**



## Virtex 5 Configurable Logic Block (CLB)



# **Virtex 5 Slice (simplified)**



Memories typically used as look-up tables to implement any Boolean function of  $\leq 6$ variables.
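
As an illustration of the look-up-table principle, the sketch below stores the truth table of a 6-input function in a 64-bit integer and evaluates the function by indexing into it. The helper names are hypothetical and the model ignores all Virtex-specific details.

```python
def make_lut(func, n_inputs=6):
    """Build the truth table as an integer: bit i holds func(bits of i)."""
    table = 0
    for i in range(1 << n_inputs):
        bits = [(i >> k) & 1 for k in range(n_inputs)]
        if func(*bits):
            table |= 1 << i
    return table

def lut_eval(table, *inputs):
    """Evaluate the function: the inputs form the address into the memory."""
    index = sum(bit << k for k, bit in enumerate(inputs))
    return (table >> index) & 1

parity = make_lut(lambda *b: sum(b) % 2)     # 6-input XOR
and6 = make_lut(lambda *b: all(b))           # 6-input AND

print(lut_eval(parity, 1, 0, 1, 1, 0, 0))
print(lut_eval(and6, 1, 1, 1, 1, 1, 1))
```

Changing the function only changes the 64 stored bits, not the circuit, which is what makes the logic reconfigurable.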



- Virtex II Pro devices include up to 4 PowerPC processor cores
- Virtex 5 devices include up to 2 PowerPC processor cores

[© and source: Xilinx Inc.: Virtex-II Pro™ Platform FPGAs: Functional Description, Sept. 2002, //www.xilinx.com]

#### Memory

 Speed gap between processor and main DRAM increases



Similar problems also for embedded systems

Memory access times >> processor cycle times: the "memory wall" problem

[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]

BF - ES

#### **Clock speed**



Copyright © 2011, Elsevier Inc. All rights Reserved. [Hennessy/Patterson: Computer Architecture, 5th ed., 2011]


#### **Parallel performance**



[Hennessy/Patterson: Computer Architecture, 5th ed., 2011]

# **Hierarchical memories** using scratch pad memories (SPM)

SPM is a small, physically separate memory mapped into the address space



Address space



no tag memory



Selection is by an appropriate address decoder (simple!)
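
A minimal sketch of such an address decoder, assuming a 4 KiB scratch pad whose base address is aligned to its size. The base address, the size and the function names are assumptions chosen for illustration.

```python
SPM_BASE = 0x0000_0000   # assumed base address, aligned to SPM_SIZE
SPM_SIZE = 0x0000_1000   # assumed size: 4 KiB (a power of two)

def is_spm(addr):
    """Compare only the upper address bits: no tag memory, no tag comparison."""
    return (addr & ~(SPM_SIZE - 1)) == SPM_BASE

def select_memory(addr):
    """Route an access either to the scratch pad or to main memory."""
    return "SPM" if is_spm(addr) else "main"

for addr in (0x0000, 0x0FFF, 0x1000, 0x8000_0000):
    print(hex(addr), select_memory(addr))
```

In contrast to a cache, the outcome depends only on the address, so whether an access hits the SPM is known statically.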

#### **Energy consumption per memory access**



## Communication



# **Communication requirements**

- Real-time behavior
- Efficient, economical (e.g. centralized power supply)
- Appropriate bandwidth and communication delay
- Robustness
- Fault tolerance
- Diagnosability
- Maintainability
- Security
- Safety

# **Basic techniques: Electrical robustness**

 Single-ended vs. differential signals



Voltage at input of Op-Amp positive  $\rightarrow$  '1'; otherwise  $\rightarrow$  '0'
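
The decoding rule can be sketched in a small simulation, assuming that noise couples equally onto both wires of the differential pair. The voltage levels (0 V / 5 V single-ended, ±2.5 V differential), the threshold and the noise samples are made-up values for illustration.

```python
def decode_single_ended(v, threshold=2.5):
    """Compare one wire against a fixed threshold relative to ground."""
    return 1 if v > threshold else 0

def decode_differential(v_plus, v_minus):
    """Positive difference at the comparator input means '1', otherwise '0'."""
    return 1 if (v_plus - v_minus) > 0 else 0

def transmit(bits, noise):
    """Send bits over both kinds of link with the same noise on every wire."""
    single, differential = [], []
    for bit, n in zip(bits, noise):
        single.append(decode_single_ended((5.0 if bit else 0.0) + n))
        v_plus = (2.5 if bit else -2.5) + n
        v_minus = (-2.5 if bit else 2.5) + n
        differential.append(decode_differential(v_plus, v_minus))
    return single, differential

bits = [1, 0, 1, 1, 0]
noise = [0.0, 3.0, -3.0, 1.0, 4.0]   # hypothetical noise in volts
print(transmit(bits, noise))
```

The common noise term cancels in the subtraction, so the differential link recovers every bit while the single-ended link misreads the heavily disturbed ones.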



## **Evaluation**

#### Advantages:

- Subtraction removes most of the noise
- Changes of voltage levels have no effect
- Reduced importance of ground wiring
- Higher speed

#### Disadvantages:

- Requires negative voltages
- Increased number of wires and connectors

#### Applications:

- USB, FireWire, ISDN
- Ethernet (STP/UTP CAT 5/6 cables)
- differential SCSI
- High-quality analog audio signals (XLR)

## **Priority-based arbitration of communication media**

For example, consider a bus



- Bus arbitration (allocation) is frequently priority-based
- Communication delay depends on communication traffic of other partners
- No tight real-time guarantees, except for highest priority partner
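
A small simulation sketch of priority-based arbitration, assuming one transfer per bus cycle and that a smaller number means higher priority. The partner names and traffic figures are hypothetical.

```python
def simulate(priorities, pending, cycles):
    """Return the bus owner for each cycle.

    priorities: name -> number (smaller number wins)
    pending:    name -> number of queued transfers
    """
    pending = dict(pending)              # do not modify the caller's data
    grants = []
    for _ in range(cycles):
        ready = [name for name, count in pending.items() if count > 0]
        if not ready:
            grants.append(None)          # bus idle
            continue
        winner = min(ready, key=lambda name: priorities[name])
        pending[winner] -= 1
        grants.append(winner)
    return grants

priorities = {"cpu": 0, "dma": 1, "io": 2}
print(simulate(priorities, {"cpu": 2, "dma": 1, "io": 1}, cycles=5))
```

The lowest-priority partner is served only after all others are done, so its delay grows with the traffic of the other partners.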

#### Ethernet

- Carrier-sense multiple-access/collision-detection (CSMA/CD, Standard Ethernet): no guaranteed response time.
- Alternatives:
  - token rings, token busses
  - Carrier-sense multiple-access/collision-avoidance (CSMA/CA)
    - WLAN techniques with request preceding transmission
    - Each partner gets an ID (priority). After a bus transfer, partners try setting their ID on the bus; partners detecting a higher ID disconnect themselves. The highest-priority partner gets a guaranteed response time; others only if they are given a chance.
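
The ID-based arbitration can be sketched bit by bit, assuming a wired-OR bus on which IDs are sent most significant bit first: a partner that sends 0 but reads 1 has seen a higher ID and withdraws. The wired-OR behavior, the ID width and the function name are assumptions; some real buses use the opposite polarity.

```python
def bitwise_arbitration(ids, width=4):
    """Return the ID that wins arbitration, or None if nobody competes."""
    active = set(ids)
    for bit in reversed(range(width)):           # most significant bit first
        bus = any((i >> bit) & 1 for i in active)
        if bus:
            # Partners that sent 0 but read 1 disconnect themselves.
            active = {i for i in active if (i >> bit) & 1}
    if not active:
        return None
    if len(active) != 1:
        raise ValueError("IDs must be unique")
    return active.pop()

print(bitwise_arbitration([0b0101, 0b0110, 0b0011]))
```

No transmission is destroyed during arbitration, since the winner never notices the losers.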

# Time division multiple access (TDMA) busses

 Each communication partner is assigned a fixed time slot. Example:



- TDMA resources have a deterministic timing behavior
- TDMA provides QoS guarantees in networks on chips
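
The deterministic timing can be illustrated with a short sketch, assuming a cyclic schedule of equally long slots. The schedule, the slot length and the function names are hypothetical.

```python
def slot_owner(t, schedule, slot_len):
    """Partner that owns the bus at time t under a cyclic TDMA schedule."""
    return schedule[(t // slot_len) % len(schedule)]

def max_wait_bound(schedule, slot_len, partner):
    """Upper bound on the time a partner waits for the start of its next slot."""
    slots = [i for i, p in enumerate(schedule) if p == partner]
    if not slots:
        raise ValueError("partner has no slot in the schedule")
    n = len(schedule)
    # Distance (in slots) from each owned slot to the next owned slot, cyclically.
    gaps = [((slots[(k + 1) % len(slots)] - s - 1) % n) + 1
            for k, s in enumerate(slots)]
    return max(gaps) * slot_len

schedule = ["A", "B", "C"]
print([slot_owner(t, schedule, slot_len=2) for t in range(8)])
print(max_wait_bound(schedule, 2, "B"))
```

The bound depends only on the schedule and never on the traffic of the other partners, which is what makes guarantees possible.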

## **Overview of embedded systems design**

