# Pinmux, IO Pads, and JTAG Boundary Scan

Links:

* <http://www2.eng.cam.ac.uk/~dmh/4b7/resource/section14.htm>
* <https://www10.edacafe.com/book/ASIC/CH02/CH02.7.php>
* <https://ftp.libre-soc.org/Pin_Control_Subsystem_Overview.pdf>
* <https://bugs.libre-soc.org/show_bug.cgi?id=50>
* <https://git.libre-soc.org/?p=c4m-jtag.git;a=tree;hb=HEAD>

Managing IO on an ASIC is nowhere near as simple as on an FPGA.
An FPGA has built-in IO Pads: the wires terminate inside an
existing silicon block which has been tested for you.
In an ASIC, you are going to have to do everything yourself.
A bi-directional IO Pad requires three wires (in, out,
out-enable) to be routed all the way from the ASIC's internal logic
to the IO Pad, where only then does a wire bond connect
it to a single pin.

[[!img CH02-44.gif]]

When designing an ASIC, there is no guarantee that the IO pad is
working when manufactured. Worse, the peripheral could be
faulty.  How can you tell what the cause is? There are two
possible faults, but only one symptom ("it dunt wurk").
This problem is what JTAG Boundary Scan is designed to solve.
JTAG can be operated from an external digital clock
at very low frequencies (5 kHz is perfectly acceptable),
so there is very little risk of clock skew during that testing.

Additionally, an SoC is designed to be low cost, and to use low cost
packaging. ASICs are typically only 32 to 128 pin QFP in the Embedded
Controller range, and between 300 and 650 pin FBGA in the Tablet /
Smartphone range, with an absolute maximum of 19 mm on a side.
The 2 to 3 inch square, 1,000 pin packages common to Intel desktop
processors are absolutely out of the question.

(*With each pin wire bond smashing
into the ASIC using purely heat of impact to melt the wire,
cracks in the die can occur. The more times
the bonding equipment smashes into the die, the higher the
chances of irreversible damage, hence why larger pin packaged
ASICs are much more expensive: not because of their manufacturing
cost but because far more of them fail due to having been
literally hit with a hammer many more times*)

Yet the expectation from the market is to be able to fit 1,000+
pins' worth of peripherals into only 200 to 400 actual
IO Pads. The solution here is a GPIO Pinmux, described in some
detail here: <https://ftp.libre-soc.org/Pin_Control_Subsystem_Overview.pdf>
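As a rough illustration of the idea (a plain-Python sketch, not the scheme
from the linked PDF), a pinmux can be thought of as a per-pad selector that
picks which one of several peripheral functions currently owns the pad.
All pad names, function names and the table layout below are hypothetical.

```python
# Hypothetical mux table: each pad can carry one of several functions.
# Pad and function names are invented purely for illustration.
PINMUX = {
    "PAD0": ["GPIO0", "UART0_TX", "SPI0_CLK", "PWM0"],
    "PAD1": ["GPIO1", "UART0_RX", "SPI0_MOSI", "I2C0_SDA"],
}


def active_functions(mux_table, selects):
    """Return {pad: function} for the current per-pad mux select values."""
    return {pad: funcs[selects[pad]] for pad, funcs in mux_table.items()}


def total_functions(mux_table):
    """Count how many peripheral signals share the available pads."""
    return sum(len(funcs) for funcs in mux_table.values())


selects = {"PAD0": 1, "PAD1": 1}  # choose the UART function on both pads
print(active_functions(PINMUX, selects))
print(total_functions(PINMUX), "functions on", len(PINMUX), "pads")
```

With four selectable functions per pad, 250 pads would be enough to expose
1,000 peripheral signals, at the cost that only one function per pad can be
active at any given time.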

This page goes over the details and issues involved in creating
an ASIC that combines **both** JTAG Boundary Scan **and** GPIO
Muxing, down to layout considerations using coriolis2.

# JTAG Boundary Scan

JTAG Boundary Scan is defined by a (paywalled) IEEE Standard, 1149.1, which with
a little searching can be found online.  Its purpose is to allow
a well-defined method of testing ASIC IO pads that a Foundry or
ASIC test house may apply easily with off-the-shelf equipment.
Scan chaining can also connect multiple ASICs together so that
the same test can be run on a large batch of ASICs at the same
time.
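The chaining idea can be sketched in plain Python, under the simplifying
assumption that each ASIC contributes only a Boundary Scan register and that
the JTAG TAP state machine is ignored. The function name `shift_chain` and
the list-of-bits representation are hypothetical, chosen for illustration.

```python
def shift_chain(devices, tdi_bits):
    """Shift bits through daisy-chained Boundary Scan registers.

    devices  : list of lists of bits, one list per ASIC, in chain order
               (TDI enters devices[0], TDO leaves devices[-1])
    tdi_bits : bits clocked in on TDI, one per TCK cycle
    Returns (new_devices, tdo_bits).
    """
    lengths = [len(d) for d in devices]
    bits = [b for d in devices for b in d]  # the chain is one long shift register
    tdo = []
    for b in tdi_bits:
        tdo.append(bits[-1])    # last bit of the last device appears on TDO
        bits = [b] + bits[:-1]  # everything else moves one place along
    new_devices, pos = [], 0
    for n in lengths:           # split back into per-device registers
        new_devices.append(bits[pos:pos + n])
        pos += n
    return new_devices, tdo


# Two chained ASICs with 3-bit and 2-bit scan registers: clocking in five
# bits pushes the previously captured contents of both out on TDO.
state, captured = shift_chain([[1, 0, 1], [1, 1]], [0, 0, 0, 0, 0])
print(state, captured)
```

Because the registers are simply concatenated, one pass of shifting reads
out (and reloads) every ASIC on the chain.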

IO Pads generally come in four primary types:

* Input
* Output
* Output with Tristate (enable)
* Bi-directional Tristate Input/Output with direction enable

Interestingly these can all be synthesised from one
Bi-directional Tristate IO Pad.  Other types such as Differential
Pair Transmit may also be constructed from an inverter and a pair
of IO Pads.  Other more advanced features include pull-up
and pull-down resistors, Schmitt triggering for interrupts,
different drive strengths, and so on, but the basics are
that the Pad is either an input, or an output, or both.
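A minimal behavioural sketch of that claim, in plain Python rather than HDL:
one bi-directional pad model with the three core-side wires, and the other
three pad types derived from it by tying wires off. The function names are
hypothetical, and the model assumes nothing external drives the pin while
the pad itself is driving it.

```python
from typing import Optional, Tuple


def bidir_pad(o: int, oe: int, pin_in: Optional[int]) -> Tuple[Optional[int], Optional[int]]:
    """Behavioural model of one bi-directional tristate IO pad.

    o, oe  : core-side output and output-enable wires
    pin_in : level driven onto the pin from outside (None = nothing driving)
    Returns (pin, i): the level on the pin, and the core-side input wire.
    """
    pin = o if oe else pin_in  # the pad drives the pin only when oe is set
    return pin, pin            # the input buffer always reads back the pin


def input_pad(pin_in):
    """Input only: output permanently disabled."""
    return bidir_pad(o=0, oe=0, pin_in=pin_in)[1]


def output_pad(o):
    """Output only: output permanently enabled, input wire ignored."""
    return bidir_pad(o=o, oe=1, pin_in=None)[0]


def tristate_output_pad(o, oe):
    """Output with enable: input wire ignored, None means high-impedance."""
    return bidir_pad(o=o, oe=oe, pin_in=None)[0]
```

For example, `tristate_output_pad(1, 0)` returns `None` (the pin floats),
while `tristate_output_pad(1, 1)` returns `1`.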

The JTAG Boundary Scan therefore needs to know what type
each pad is (In/Out/Bi) and has to "insert" itself in between
*all* the Pad's wires: just an input, just an output,
or, if bi-directional, both plus an "output enable" line.

The "insertion" (or, "Tap") into those wires requires a
pair of Muxes for each wire.  Under normal operation
the Muxes bypass JTAG entirely: the IO Pad is connected,
through the two Muxes,
directly to the Core (a hardware term for a "peripheral",
in Software terminology).

When JTAG Scan is enabled, then for every pin that is
"tapped into", the Muxes flip such that:

* The IO Pad is connected directly to latches controlled
  by the JTAG Shift Register
* The Core (peripheral) is likewise connected to latches, but to
  *different bits* from those that the Pad is connected to

In this way, not only can JTAG control or read the IO Pad,
but it can also read or control the Core (peripheral).
This is its entire purpose: interception to allow for the detection
and triaging of faults.
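The interception can be sketched as a plain-Python behavioural model of one
bi-directional pad's three wires. This is a simplification for illustration
only, not the c4m-jtag implementation: the class name, the field names and
the Shift Register bit layout are all hypothetical, and the shifting itself
is left out.

```python
from dataclasses import dataclass


@dataclass
class TappedIO:
    """One bi-directional pad's wires (i, o, oe) with a JTAG tap inserted."""
    scan: bool = False     # True when Boundary Scan has taken over the wires
    # Bits driven from the JTAG Shift Register (hypothetical layout)
    sr_to_core_i: int = 0  # what the Core sees instead of the real pad input
    sr_to_pad_o: int = 0   # what the Pad outputs instead of the Core's output
    sr_to_pad_oe: int = 0  # what the Pad uses instead of the Core's enable

    def step(self, core_o, core_oe, pad_i):
        """Return (core_i, pad_o, pad_oe, captured) for the given inputs."""
        if self.scan:
            # Muxes flipped: both sides now talk to JTAG, not to each other
            core_i = self.sr_to_core_i
            pad_o, pad_oe = self.sr_to_pad_o, self.sr_to_pad_oe
        else:
            # Normal operation: JTAG is bypassed entirely
            core_i = pad_i
            pad_o, pad_oe = core_o, core_oe
        # Values JTAG can read back: separate bits for the Core and the Pad
        captured = {"core_o": core_o, "core_oe": core_oe, "pad_i": pad_i}
        return core_i, pad_o, pad_oe, captured


normal = TappedIO()
print(normal.step(core_o=1, core_oe=1, pad_i=0))

scanning = TappedIO(scan=True, sr_to_core_i=1, sr_to_pad_o=0, sr_to_pad_oe=1)
print(scanning.step(core_o=1, core_oe=0, pad_i=0))
```

In the second call the Core sees `1` although the pad input is `0`, and the
Pad is driven with `0` although the Core is outputting `1`: each side is
isolated from the other and can be exercised independently.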

* Software may be uploaded and run which sets a bit on
  one of the peripheral outputs (UART Tx for example).
  If the UART Tx IO Pad were faulty, there would be no way
  without Boundary Scan to determine whether the peripheral
  was at fault instead.  With the UART Tx pin function being
  redirected to a JTAG Shift Register, the results of the
  software setting UART Tx may be detected by checking
  the appropriate Shift Register bit.
* Likewise, a voltage may be applied to the UART Rx Pad,
  and the corresponding Shift Register bit checked to see if the
  pad is working.  If the UART Rx peripheral were faulty,
  this would not be possible without Boundary Scan.
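The two checks above amount to a small truth table, sketched here in plain
Python. The function name, arguments and result strings are hypothetical;
the point is only that reading the Core-side bit and the Pad-side bit
separately turns one symptom into distinguishable causes.

```python
def triage(core_bit_expected, core_bit_captured,
           pad_level_applied, pad_bit_captured):
    """Classify a fault from two independent Boundary Scan observations.

    core_bit_expected / core_bit_captured:
        what software set on the peripheral output, and what the
        Shift Register bit tapped from the Core actually showed
    pad_level_applied / pad_bit_captured:
        the logic level applied to the Pad from outside, and what the
        Shift Register bit tapped from the Pad actually showed
    """
    core_ok = core_bit_captured == core_bit_expected
    pad_ok = pad_bit_captured == pad_level_applied
    if core_ok and pad_ok:
        return "no fault found"
    if core_ok:
        return "IO pad faulty"
    if pad_ok:
        return "core (peripheral) faulty"
    return "core (peripheral) and IO pad both faulty"


# Software set the output to 1 and JTAG saw 1 from the Core, but a 1
# applied to the Pad was read back as 0: the Pad is the culprit.
print(triage(1, 1, 1, 0))
```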

<img src="https://libre-soc.org/shakti/m_class/JTAG/jtag-block.jpg"
  width=500 />

## Clock synchronisation

Take for example USB ULPI:

<img src="https://www.crifan.com/files/pic/serial_story/other_site/p_blog_bb.JPG"
width=400 />

Here there is an external incoming clock, generated by the PHY, to which
both Received *and Transmitted* data and control are synchronised.  Notice
very specifically that it is *not the main processor* generating that clock
signal, but the external peripheral (known as a PHY in Hardware terminology).
Firstly: note that the Clock will, obviously, also need to be routed
through JTAG Boundary Scan, because it is being received
through just another ordinary IO Pad, after all.  Secondly: note that
if it were not, then clock skew would occur for that peripheral because
although the Data Wires went through JTAG Boundary Scan Muxes, the
clock did not.  Clearly this would be a problem.

However, clocks are very special signals: they have to be distributed
evenly to all and any Latches (DFFs) inside the peripheral so that
data corruption does not occur because of tiny delays.
To avoid that scenario, Clock Domain Crossing (CDC) is used, with
Asynchronous FIFOs:

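        # RX path: written in the external ULPI clock domain, read in the system (sync) domain
        rx_fifo = stream.AsyncFIFO([("data", 8)], self.rx_depth, w_domain="ulpi", r_domain="sync")
        # TX path: written in the system (sync) domain, read in the ULPI clock domain
        tx_fifo = stream.AsyncFIFO([("data", 8)], self.tx_depth, w_domain="sync", r_domain="ulpi")
        m.submodules.rx_fifo = rx_fifo
        m.submodules.tx_fifo = tx_fifo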

However, the entire FIFO must be covered by two Clock H-Trees: one
driven by the ULPI external clock, and the other by the main system clock.
The size of the ULPI clock H-Tree, and consequently the size of
the PHY on-chip, will result in more Clock Tree Buffers being
inserted into the chain, and, correspondingly, matching buffers
must likewise be inserted on the ULPI data input side so that
the input data timing precisely matches that of its clock.

The problem is not the receiving of data, though: it is transmission
on the output ULPI side.  With the ULPI Clock Tree having buffers
inserted, each buffer creates delay.  The ULPI output FIFO has to
correspondingly be synchronised not to the original incoming clock
but to that clock *after going through H-Tree Buffers*.  Therefore,
there will be a lag on the output data compared to the incoming
(external) clock.
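To get a feel for the size of that lag, here is a back-of-the-envelope
calculation in plain Python. The 60 MHz figure is the standard ULPI clock
rate (not stated on this page); the buffer count and per-buffer delay are
purely hypothetical numbers, since real values depend on the cell library
and the layout.

```python
ULPI_CLOCK_HZ = 60e6  # the ULPI interface clock runs at 60 MHz


def clock_period_ns(freq_hz):
    """Clock period in nanoseconds."""
    return 1e9 / freq_hz


def output_lag_ns(n_buffers, buffer_delay_ns):
    """Lag of the buffered clock (and so of output data) behind the pad clock."""
    return n_buffers * buffer_delay_ns


def lag_fraction_of_period(n_buffers, buffer_delay_ns, freq_hz=ULPI_CLOCK_HZ):
    """Lag expressed as a fraction of one clock period."""
    return output_lag_ns(n_buffers, buffer_delay_ns) / clock_period_ns(freq_hz)


# Hypothetical example: 4 clock tree buffers at 0.25 ns each
lag = output_lag_ns(4, 0.25)
print(f"lag {lag:.2f} ns of a {clock_period_ns(ULPI_CLOCK_HZ):.2f} ns period "
      f"({lag_fraction_of_period(4, 0.25):.1%})")
```

The deeper the H-Tree, the larger this fraction becomes, and the less
margin remains for the output data to be valid at the PHY when the PHY
samples it on its own, unbuffered, clock.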

# GPIO Muxing

[[!img gpio_block.png]]

# nMigen Core/Pad Connection + JTAG Mux

Diagram constructed from the nmigen plat.py file.

[[!img i_o_io_tristate.png]]