# Pinmux, IO Pads, and JTAG Boundary scan

Links:

* <http://www2.eng.cam.ac.uk/~dmh/4b7/resource/section14.htm>
* <https://www10.edacafe.com/book/ASIC/CH02/CH02.7.php>
* <https://ftp.libre-soc.org/Pin_Control_Subsystem_Overview.pdf>
* <https://bugs.libre-soc.org/show_bug.cgi?id=50>
* <https://bugs.libre-soc.org/show_bug.cgi?id=750>
* <https://git.libre-soc.org/?p=c4m-jtag.git;a=tree;hb=HEAD>
* Extra info: [[/docs/pinmux/temp_pinmux_info]]

Managing IO on an ASIC is nowhere near as simple as on an FPGA.
An FPGA has built-in IO Pads: the wires terminate inside an
existing silicon block which has been tested for you.
In an ASIC, you are going to have to do everything yourself.
In an ASIC, a bi-directional IO Pad requires three wires (in, out,
out-enable) to be routed right the way from the ASIC, all
the way to the IO PAD, where only then does a wire bond connect
it to a single external pin.

Below, therefore, is a (simplified) diagram of what is
usually contained in an FPGA's bi-directional IO Pad,
and consequently this is what you must also provide, and explicitly
wire up in your ASIC's HDL.

[[!img asic_iopad_gen.svg]]

When designing an ASIC, there is no guarantee that the IO pad is
working when manufactured. Worse, the peripheral could be
faulty. How can you tell what the cause is? There are two
possible faults, but only one symptom ("it dunt wurk").
This problem is what JTAG Boundary Scan is designed to solve.
JTAG can be operated from an external digital clock,
at very low frequencies (5 kHz is perfectly acceptable)
so there is very little risk of clock skew during that testing.

Additionally, an SoC is designed to be low cost, to use low cost
packaging. ASICs are typically only 32 to 128 pins QFP
in the Embedded
Controller range, and between 300 and 650 FBGA in the Tablet /
Smartphone range, absolute maximum of 19 mm on a side.
2 to 3 in square 1,000 pin packages common to Intel desktop processors are
absolutely out of the question.

(*With each pin wire bond smashing
into the ASIC using purely heat of impact to melt the wire,
cracks in the die can occur. The more times
the bonding equipment smashes into the die, the higher the
chances of irreversible damage, hence why larger pin packaged
ASICs are much more expensive: not because of their manufacturing
cost but because far more of them fail due to having been
literally hit with a hammer many more times*)

Yet, the expectation from the market is to be able to fit 1,000+
pins worth of peripherals into only 200 to 400 worth of actual
IO Pads. The solution here: a GPIO Pinmux, described in some
detail here <https://ftp.libre-soc.org/Pin_Control_Subsystem_Overview.pdf>

This page goes over the details and issues involved in creating
an ASIC that combines **both** JTAG Boundary Scan **and** GPIO
Muxing, down to layout considerations using coriolis2.

# Resources, Platforms and Pins

nmigen HDL Modules typically know nothing about FPGA
Boards or ASICs. They especially do not know anything about the
Peripheral ICs (UART, I2C, USB, SPI, PCIe) connected to a given FPGA
on a given PCB, and they should not have to.

Through the Resources, Platforms and Pins API, a level of abstraction
between peripherals, boards and HDL designs is provided. Peripherals
may be given `(name, number)` tuples, the HDL design may "request"
a peripheral, which is described in terms of Resources, managed
by a ResourceManager, and a Platform may provide that peripheral.
The Platform is given
the responsibility to wire up the Pins to the correct FPGA (or ASIC)
IO Pads, and it is the HDL design's responsibility to connect up
those same named Pins, on the other side, to the implementation
of the PHY/Controller, in the HDL.

Here is a function that defines a UART Resource:

    #!/usr/bin/env python3
    from nmigen.build.dsl import Resource, Subsignal, Pins

    def UARTResource(*args, rx, tx):
        io = []
        io.append(Subsignal("rx", Pins(rx, dir="i", assert_width=1)))
        io.append(Subsignal("tx", Pins(tx, dir="o", assert_width=1)))
        return Resource.family(*args, default_name="uart", ios=io)

Note that the Subsignal is given a convenient name (tx, rx) and that
there are Pins associated with it.
UARTResource would typically be part of a larger function that defines,
for either an FPGA or an ASIC, a full array of IO Connections:

    def create_resources(pinset=None):
        # pinset is unused in this simplified example, hence the default
        resources = []
        resources.append(UARTResource('uart', 0, tx='A20', rx='A21'))
        # add clock and reset
        clk = Resource("clk", 0, Pins("sys_clk", dir="i"))
        rst = Resource("rst", 0, Pins("sys_rst", dir="i"))
        resources.append(clk)
        resources.append(rst)
        return resources

For an FPGA, the Pins names are typically the Ball Grid Array
Pad or Pin name: A12, or N20. ASICs can do likewise: it is
for convenience when referring to schematics, to use the most
recognisable well-known name.

Next, these Resources need to be handed to a ResourceManager or
a Platform (Platform derives from ResourceManager):

    from nmigen.build.plat import TemplatedPlatform

    class ASICPlatform(TemplatedPlatform):
        def __init__(self, resources):
            super().__init__()
            self.add_resources(resources)

An HDL Module may now be created, which, if given
a platform instance during elaboration, may request
a UART (caveat below):

    from nmigen import Elaboratable, Module, Signal

    class Blinker(Elaboratable):
        def elaborate(self, platform):
            m = Module()
            # get the UART resource, mess with the output tx
            uart = platform.request('uart')
            intermediary = Signal()
            m.d.comb += uart.tx.eq(~intermediary) # invert, for fun
            m.d.comb += intermediary.eq(uart.rx) # pass rx to tx

            return m

The caveat here is that the Resources of the platform actually
have to have a UART in order for it to be requestable! Thus:

    resources = create_resources() # contains resource named "uart"
    asic = ASICPlatform(resources)
    hdl = Blinker()
    asic.build(hdl)

Finally the association between HDL, Resources, and ASIC Platform
is made:

* The Resources contain the abstract expression of the
  type of peripheral, its port names, and the corresponding
  names of the IO Pads associated with each port.
* The HDL which knows nothing about IO Pad names requests
  a Resource by name
* The ASIC Platform, given the list of Resources, takes care
  of connecting requests for Resources to actual IO Pads.

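The three-way relationship above can be illustrated without any HDL at all. The sketch below is a toy model only: `ToyPlatform` and its dictionary-based resources are hypothetical stand-ins, not the nmigen `ResourceManager` API, and the rule that a resource may be requested only once is an assumption of the model.

```python
# Toy stand-in for the Resources / Platform relationship (NOT the nmigen API).
class ToyPlatform:
    def __init__(self, resources):
        # Index resources by (name, number), as described in the text.
        self._resources = {(r["name"], r["number"]): r for r in resources}
        self._requested = set()

    def request(self, name, number=0):
        key = (name, number)
        if key not in self._resources:
            raise KeyError(f"resource {name}#{number} does not exist")
        if key in self._requested:
            raise RuntimeError(f"resource {name}#{number} already requested")
        self._requested.add(key)
        # Hand back a mapping of port name -> IO Pad name.
        return dict(self._resources[key]["pins"])


# The Resources: peripheral type, port names, and IO Pad names.
resources = [
    {"name": "uart", "number": 0, "pins": {"tx": "A20", "rx": "A21"}},
]

# The "HDL" side asks by name only and never mentions a pad name itself.
plat = ToyPlatform(resources)
uart = plat.request("uart")
```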
This is the simple version. When JTAG Boundary Scan needs
to be added, it gets a lot more complex.

# JTAG Boundary Scan

JTAG Scanning is a (paywalled) IEEE Standard: 1149.1 which with
a little searching can be found online. Its purpose is to allow
a well-defined method of testing ASIC IO pads that a Foundry or
ASIC test house may apply easily with off-the-shelf equipment.
Scan chaining can also connect multiple ASICs together so that
the same test can be run on a large batch of ASICs at the same
time.

IO Pads generally come in four primary different types:

* Input
* Output
* Output with Tristate (enable)
* Bi-directional Tristate Input/Output with direction enable

Interestingly these can all be synthesised from one
Bi-directional Tristate IO Pad. Other types such as Differential
Pair Transmit may also be constructed from an inverter and a pair
of IO Pads. Other more advanced features include pull-up
and pull-down resistors, Schmitt triggering for interrupts,
different drive strengths, and so on, but the basics are
that the Pad is either an input, or an output, or both.

The JTAG Boundary Scan therefore needs to know what type
each pad is (In/Out/Bi) and has to "insert" itself in between
*all* the Pad's wires, which may be just an input, or just an output,
and, if bi-directional, an "output enable" line.

The "insertion" (or, "Tap") into those wires requires a
pair of Muxes for each wire. Under normal operation
the Muxes bypass JTAG entirely: the IO Pad is connected,
through the two Muxes,
directly to the Core (a hardware term for a "peripheral",
in Software terminology).

When JTAG Scan is enabled, then for every pin that is
"tapped into", the Muxes flip such that:

* The IO Pad is connected directly to latches controlled
  by the JTAG Shift Register
* The Core (peripheral) likewise but to *different bits*
  from those that the Pad is connected to

In this way, not only can JTAG control or read the IO Pad,
but it can also read or control the Core (peripheral).
This is its entire purpose: interception to allow for the detection
and triaging of faults.

* Software may be uploaded and run which sets a bit on
  one of the peripheral outputs (UART Tx for example).
  If the UART Tx IO Pad was faulty, no possibility existed
  without Boundary Scan to determine if the peripheral
  was at fault. With the UART Tx pin function being
  redirected to a JTAG Shift Register, the results of the
  software setting UART Tx may be detected by checking
  the appropriate Shift Register bit.
* Likewise, a voltage may be applied to the UART Rx Pad,
  and the corresponding SR bit checked to see if the
  pad is working. If the UART Rx peripheral was faulty
  this would not be possible.

[[!img jtag-block.svg ]]

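The Mux behaviour described above can be captured in a few lines of plain Python. This is a simplified behavioural sketch, not HDL and not the c4m-jtag implementation: the function names, and the assumption of exactly one drive value and one capture value per pin, are illustrative only.

```python
def output_pin(core_o, sr_drive, scan_enabled):
    """One output pin, e.g. UART Tx.

    Returns (pad_o, sr_capture): what reaches the IO Pad, and the
    core-side value available for capture into the Shift Register.
    """
    pad_o = sr_drive if scan_enabled else core_o
    return pad_o, core_o


def input_pin(pad_i, sr_drive, scan_enabled):
    """One input pin, e.g. UART Rx.

    Returns (core_i, sr_capture): what reaches the core, and the
    pad-side value available for capture into the Shift Register.
    """
    core_i = sr_drive if scan_enabled else pad_i
    return core_i, pad_i


# Normal operation: the Muxes bypass JTAG, so the pad follows the core.
normal = output_pin(core_o=1, sr_drive=0, scan_enabled=False)

# Scan enabled: the pad is driven from the Shift Register instead,
# while the core's own output remains observable for fault triage.
scanned = output_pin(core_o=1, sr_drive=0, scan_enabled=True)
```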
## C4M JTAG TAP

Staf Verhaegen's Chips4Makers JTAG TAP module includes everything
needed to create JTAG Boundary Scan Shift Registers,
as well as the IEEE 1149.1 Finite State Machine to access
them through TMS, TDO, TDI and TCK Signalling. However,
connecting up cores (a hardware term: the equivalent software
term is "peripherals") on one side and the pads on the other is
especially confusing, but deceptively simple. The actual addition
to the Scan Shift Register is this straightforward:

    from c4m.nmigen.jtag.tap import IOType, TAP

    class JTAG(TAP):
        def __init__(self):
            TAP.__init__(self, ir_width=4)
            self.u_tx = self.add_io(iotype=IOType.Out, name="tx")
            self.u_rx = self.add_io(iotype=IOType.In, name="rx")

This results in the creation of:

* Two Records, one of type In named rx, the other an output
  named tx
* Each Record contains a pair of sub-Records: one core-side
  and the other pad-side
* Entries in the Boundary Scan Shift Register which if set
  may control (or read) either the peripheral / core or
  the IO PAD
* A suite of Muxes (as shown in the diagrams above) which
  allow either direct connection between pad and core
  (bypassing JTAG) or interception

During Interception Mode (Scanning) pad and core are connected
to the Shift Register. During "Production" Mode, pad and
core are wired directly to each other (on a per-pin basis,
for every pin. Clearly this is a lot of work).

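For readers unfamiliar with the IEEE 1149.1 Finite State Machine mentioned above, the following is a small software model of the standard's 16-state TAP controller, driven purely by the TMS value sampled on each TCK. It is a sketch for orientation, not the c4m-jtag implementation; the names `TAP_FSM` and `run` are illustrative.

```python
# IEEE 1149.1 TAP controller transitions:
# state -> (next state when TMS=0, next state when TMS=1)
TAP_FSM = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def run(state, tms_bits):
    """Advance the TAP state once per TCK, following the given TMS bits."""
    for tms in tms_bits:
        state = TAP_FSM[state][tms]
    return state


# From reset, TMS = 0,1,0,0 reaches Shift-DR, where data registers such as
# the Boundary Scan Shift Register are shifted in on TDI and out on TDO.
shift_dr = run("TEST_LOGIC_RESET", [0, 1, 0, 0])
```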
It is then your responsibility to:

* connect up each and every peripheral input and output
  to the right IO Core Record in your HDL
* connect up each and every IO Pad input and output
  to the right IO Pad in the Platform.
* **This does not happen automatically and is not the
  responsibility of the TAP Interface, it is yours**

The TAP interface connects the **other** side of the pads
and cores Records: **to the Muxes**. You **have** to
connect **your** side of both core and pads Records in
order for the Scan to be fully functional.

Both of these tasks are painstaking and tedious in the
extreme if done manually, and prone to either sheer boredom,
transliteration errors, dyslexia triggering or just utter
confusion. Despite this, let us proceed, and, augmenting
the Blinker example, wire up a JTAG instance:

    class Blinker(Elaboratable):
        def elaborate(self, platform):
            m = Module()
            m.submodules.jtag = jtag = JTAG()

            # get the records from JTAG instance
            utx, urx = jtag.u_tx, jtag.u_rx
            # get the UART resource from the Platform
            uart = platform.request('uart')

            # uart core-side from JTAG
            intermediary = Signal()
            m.d.comb += utx.core.o.eq(~intermediary) # invert, for fun
            m.d.comb += intermediary.eq(urx.core.i) # pass rx to tx

            # wire up the IO Pads (in right direction) to Platform
            m.d.comb += urx.pad.i.eq(uart.rx) # Platform rx pad feeds JTAG rx pad-side
            m.d.comb += uart.tx.eq(utx.pad.o) # JTAG tx pad-side drives Platform tx pad
            return m

Compared to the non-scan-capable version, which connected UART
Core Tx and Rx directly to the Platform Resource (and the Platform
took care of wiring to IO Pads):

* Core HDL is instead wired to the core-side of JTAG Scan
* JTAG Pad side is instead wired to the Platform
* (the Platform still takes care of wiring to actual IO Pads)

JTAG TAP capability on UART TX and RX has now been inserted into
the chain. Using openocd or another program it is possible to
send TDI, TMS, TDO and TCK signals according to IEEE 1149.1 in order
to intercept both the core and IO Pads, both input and output,
and confirm the correct functionality of one even if the other is
broken, during ASIC testing.

## Libre-SOC Automatic Boundary Scan

Libre-SOC's JTAG TAP Boundary Scan system is a little more sophisticated:
it hooks into (replaces) ResourceManager.request(), intercepting the request
and recording what was requested. The above manual linkup to JTAG TAP
is then taken care of **automatically and transparently**, but to
all intents and purposes looking exactly like a Platform even to
the extent of taking the exact same list of Resources.

    class Blinker(Elaboratable):
        def __init__(self, resources):
            self.jtag = JTAG(resources)

        def elaborate(self, platform):
            m = Module()
            m.submodules.jtag = jtag = self.jtag

            # get the UART resource, mess with the output tx
            uart = jtag.request('uart')
            intermediary = Signal()
            m.d.comb += uart.tx.eq(~intermediary) # invert, for fun
            m.d.comb += intermediary.eq(uart.rx) # pass rx to tx

            return jtag.boundary_elaborate(m, platform)

Connecting up and building the ASIC is as simple as a non-JTAG,
non-scanning-aware Platform:

    resources = create_resources()
    asic = ASICPlatform(resources)
    hdl = Blinker(resources)
    asic.build(hdl)

The differences:

* The list of resources was also passed to the HDL Module
  such that JTAG may create a complete identical list
  of both core and pad matching Pins
* Resources were requested from the JTAG instance,
  not the Platform
* A "magic function" (JTAG.boundary_elaborate) is called
  which wires up all of the seamlessly intercepted
  Platform resources to the JTAG core/pads Resources,
  where the HDL connected to the core side, exactly
  as if this was a non-JTAG-Scan-aware Platform.
* ASICPlatform still takes care of connecting to actual
  IO Pads, except that the Platform.resource requests were
  triggered "behind the scenes". For that to work it
  is absolutely essential that the JTAG instance and the
  ASICPlatform be given the exact same list of Resources.

## Clock synchronisation

Take for example USB ULPI:

<img src="https://www.crifan.com/files/pic/serial_story/other_site/p_blog_bb.JPG"
width=400 />

Here there is an external incoming clock, generated by the PHY, to which
both Received *and Transmitted* data and control is synchronised. Notice
very specifically that it is *not the main processor* generating that clock
Signal, but the external peripheral (known as a PHY in Hardware terminology).

Firstly: note that the Clock will, obviously, also need to be routed
through JTAG Boundary Scan, because, after all, it is being received
through just another ordinary IO Pad. Secondly: note that
if it didn't, then clock skew would occur for that peripheral because
although the Data Wires went through JTAG Boundary Scan MUXes, the
clock did not. Clearly this would be a problem.

However, clocks are very special signals: they have to be distributed
evenly to all and any Latches (DFFs) inside the peripheral so that
data corruption does not occur because of tiny delays.
To avoid that scenario, Clock Domain Crossing (CDC) is used, with
Asynchronous FIFOs:

    rx_fifo = stream.AsyncFIFO([("data", 8)], self.rx_depth, w_domain="ulpi", r_domain="sync")
    tx_fifo = stream.AsyncFIFO([("data", 8)], self.tx_depth, w_domain="sync", r_domain="ulpi")
    m.submodules.rx_fifo = rx_fifo
    m.submodules.tx_fifo = tx_fifo

However the entire FIFO must be covered by two Clock H-Trees: one
by the ULPI external clock, and the other the main system clock.
The size of the ULPI clock H-Tree, and consequently the size of
the PHY on-chip, will result in more Clock Tree Buffers being
inserted into the chain, and, correspondingly, matching buffers
on the ULPI data input side likewise must be inserted so that
the input data timing precisely matches that of its clock.

The problem is not receiving of data, though: it is transmission
on the output ULPI side. With the ULPI Clock Tree having buffers
inserted, each buffer creates delay. The ULPI output FIFO has to
correspondingly be synchronised not to the original incoming clock
but to that clock *after going through H Tree Buffers*. Therefore,
there will be a lag on the output data compared to the incoming
(external) clock.

# Pinmux GPIO Block
The following diagram is an example of a GPIO block with switchable banks and comes from the Ericson presentation on a GPIO architecture.

[[!img gpio-block.svg size="800x"]]

The block we are developing is very similar, but is lacking some of the configuration options of the former (due to complexity and time constraints).

## Diagram
[[!img banked_gpio_block.jpg size="600x"]]

*(Diagram is missing the "ie" signal as part of the bundle of signals given to the peripherals, will be updated later)*

## Explanation
The simple GPIO module is a multi-GPIO block integral to the pinmux system.
To make the block flexible, it has a variable number of I/Os based on an
input parameter.

By default, the block is a memory-mapped WB bus GPIO. The CPU
core can just write the configuration word to the GPIO row address. From this
perspective, it is no different to a conventional GPIO block.

### Bank Select Options
* bank 0 - WB bus has full control (GPIO peripheral)
* bank 1,2,3 - WB bus only controls puen/pden, peripheral gets o/oe/i/ie (Not
  fully specified how this should be arranged yet)

Bank select, however, allows control of the GPIO block to be switched over to
another peripheral. The peripheral will be given sole connectivity to the
o/oe/i/ie signals, while additional parameters such as pull up/down will either
be automatically configured (as is the case for I2C), or will be configurable
via the WB bus. *(This has not been implemented yet, so open to discussion)*

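As the text notes, the bank select arrangement is not yet implemented, so the following is only one possible reading of it, written as a plain Python behavioural model. The function name, the dictionary layout, and the choice that the WB bus always supplies puen/pden (ignoring the "automatically configured" I2C case) are all assumptions made for illustration.

```python
def select_pad_controls(bank_sel, wb, peripherals):
    """Pick which source drives one GPIO's pad-facing control signals.

    bank_sel    -- bank select value; only the low two bits (4 banks) are used
    wb          -- dict with o, oe, ie, puen, pden as set over the WB bus
    peripherals -- dict mapping bank 1, 2, 3 to dicts with o, oe, ie
    """
    bank = bank_sel & 0b11
    # Bank 0: WB bus has full control. Banks 1-3: the peripheral owns o/oe/ie.
    source = wb if bank == 0 else peripherals[bank]
    return {
        "o": source["o"],
        "oe": source["oe"],
        "ie": source["ie"],
        # Assumption: pull-up/down always come from the WB-written config.
        "puen": wb["puen"],
        "pden": wb["pden"],
    }


wb_cfg = {"o": 0, "oe": 0, "ie": 1, "puen": 1, "pden": 0}
periphs = {
    1: {"o": 1, "oe": 1, "ie": 0},
    2: {"o": 0, "oe": 1, "ie": 0},
    3: {"o": 0, "oe": 0, "ie": 1},
}
as_gpio = select_pad_controls(0, wb_cfg, periphs)
as_periph1 = select_pad_controls(1, wb_cfg, periphs)
```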
## Configuration Word
After a discussion with Luke on IRC (14th January 2022), new layout of the
8-bit data word for configuring the GPIO (through CSR):

* oe - Output Enable (see the Ericson presentation for the GPIO diagram)
* ie - Input Enable
* puen - Pull-Up resistor enable
* pden - Pull-Down resistor enable
* i/o - When configured as output (oe set), this bit sets/clears output. When
  configured as input, shows the current state of input (read-only)
* bank_sel[2:0] - Bank Select (only 4 banks used)

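The fields listed above add up to exactly eight bits (five single-bit flags plus a three-bit bank select). The sketch below packs and unpacks such a byte. The bit positions are an assumption: they simply follow the order of the list, starting at bit 0, and may not match the actual layout shown in the CSR diagram. `FIELDS`, `pack_config` and `unpack_config` are illustrative names.

```python
# (name, bit offset, width). ASSUMED positions: list order, starting at bit 0.
FIELDS = (
    ("oe", 0, 1),
    ("ie", 1, 1),
    ("puen", 2, 1),
    ("pden", 3, 1),
    ("io", 4, 1),
    ("bank_sel", 5, 3),
)


def pack_config(**values):
    """Build one 8-bit GPIO configuration byte from named fields."""
    byte = 0
    for name, shift, width in FIELDS:
        value = values.get(name, 0)
        if value >> width:
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        byte |= value << shift
    return byte


def unpack_config(byte):
    """Split an 8-bit configuration byte back into its named fields."""
    return {
        name: (byte >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


# An output pin driven high, handed to bank 2.
cfg = pack_config(oe=1, io=1, bank_sel=2)
```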
### Simultaneous/Packed Configuration
To make the configuration more efficient, multiple GPIOs can be configured with
one data word. The number of GPIOs in one "row" is dependent on the width of the
WB data bus.

If for example, the data bus is 64-bits wide, eight GPIO configuration bytes -
and thus eight GPIOs - are configured in one go. There is no way to specify
which GPIO in a row is configured, so the programmer has to keep the current
state of the configuration as part of the code (essentially a shadow register).

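A shadow register of the kind mentioned above might look like the following sketch. The `ShadowRow` class and its `bus_write` callback are hypothetical, and placing GPIO 0 in the least-significant byte of the data word is an assumption, since the byte order within a row is defined by the diagram rather than the text.

```python
class ShadowRow:
    """Software copy of one row of GPIO configuration bytes."""

    def __init__(self, bus_write, gpios_per_row=8):
        self.bus_write = bus_write          # callable performing the WB write
        self.config = [0] * gpios_per_row   # one config byte per GPIO in the row

    def update(self, gpio, config_byte):
        """Change one GPIO's byte, then rewrite the whole row in one go."""
        self.config[gpio] = config_byte & 0xFF
        # ASSUMPTION: GPIO 0 occupies the least-significant byte of the word.
        word = int.from_bytes(bytes(self.config), "little")
        self.bus_write(word)
        return word


# Stand-in for the bus: just record every word that would be written.
written = []
row = ShadowRow(written.append)
row.update(0, 0x01)
row.update(3, 0x51)   # GPIO 0's earlier setting is preserved by the shadow copy
```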
The diagram below shows the layout of the configuration byte, and how it fits
within a 64-bit data word.

[[!img gpio_csr_example.jpg size="600x"]]

If the block is created with more GPIOs than can fit in a single data word,
the next set of GPIOs can be accessed by incrementing the address.
For example, if 16 GPIOs are instantiated and 64-bit data bus is used, GPIOs
0-7 are accessed via address 0, whereas GPIOs 8-15 are accessed by address 8
(TODO: DOES ADDRESS COUNT WORDS OR BYTES?)

## Example Memory Map
[[!img gpio_memory_example.jpg size="600x"]]

The diagrams above show the difference in memory layout between 16-GPIO block
implemented with 64-bit and 32-bit WB data buses.
The 64-bit case shows there are two rows with eight GPIOs in each, and it will
take two writes (assuming simple WB write) to completely configure all 16 GPIOs.
The 32-bit on the other hand has four address rows, and so will take four write transactions.

64-bit:

* 0x00 - Configure GPIOs 0-7
* 0x01 - Configure GPIOs 8-15

32-bit:

* 0x00 - Configure GPIOs 0-3
* 0x01 - Configure GPIOs 4-7
* 0x02 - Configure GPIOs 8-11
* 0x03 - Configure GPIOs 12-15

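The row arithmetic behind the two memory maps above is a simple division by the number of configuration bytes per data word. The helper names below are illustrative, and the addresses are treated as row indices (0x00, 0x01, ...) exactly as listed; whether the real address counts words or bytes is still an open question in this page.

```python
def gpio_location(gpio, bus_width_bits):
    """Return (row index, byte position within the row) for a GPIO number."""
    gpios_per_row = bus_width_bits // 8   # one 8-bit config byte per GPIO
    return divmod(gpio, gpios_per_row)


def rows_needed(num_gpios, bus_width_bits):
    """Number of rows (and so of simple WB writes) to configure every GPIO."""
    gpios_per_row = bus_width_bits // 8
    return -(-num_gpios // gpios_per_row)   # ceiling division


# GPIO 9 on a 64-bit bus lives in row 1; GPIO 13 on a 32-bit bus in row 3.
loc_64 = gpio_location(9, 64)
loc_32 = gpio_location(13, 32)
```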
## Combining JTAG BS Chain and Pinmux (In Progress)
[[!img io_mux_bank_planning.JPG size="600x"]]

The JTAG BS chain needs to have access to the bank select bits, to allow
selecting different peripherals during testing. At the same time, JTAG may
also require access to the WB bus to access GPIO configuration options
not available to bank 1/2/3 peripherals.

### Proposal
TODO: REWORK BASED ON GPIO JTAG DIAGRAMS BELOW

The proposed JTAG BS chain is as follows:

* Between each peripheral and GPIO block, add a JTAG BS chain. For example
  the I2C SDA line will have core o/oe/i/ie, and from JTAG the pad o/oe/i/ie will
  connect to the GPIO block's ports 1-3.
* Provide a test port for the GPIO block that gives full access to configuration
  (o/oe/i/ie/puen/pden) and bank select. Only allow full JTAG configuration *IF*
  bank select bit 2 is set!
* No JTAG chain between WB bus and GPIO port 0 input *(not sure what to do for
  this, or whether it is even needed)*.

Such a setup would allow the JTAG chain to control the bank select when testing
connectivity of the peripherals, as well as give full control to the GPIO
configuration when bank select bit 2 is set.

For the purposes of muxing peripherals, bank select bit 2 is ignored. This means
that even if JTAG is handed over full control, the peripheral is still connected
to the GPIO block (via the BS chain).

Signals for various ports:

* WB bus or Periph0: WB data read, data write, address, cyc, stb, ack
* Periph1/2/3: o,oe,i,ie (puen/pden are only controlled by WB, test port, or
  fixed by functionality)
* Test port: bank_select[2:0], o,oe,i,ie,puen,pden. In addition, internal
  address to access individual GPIOs will be available (this will consist of a
  few bits, as more than 16 GPIOs per block is likely to be too big).

As you can see from the above list, the GPIO block is becoming quite a complex
beast. If there are suggestions to simplify or reduce some of the signals,
that will be helpful.

The diagrams below show 1-bit GPIO connectivity, as well as the 4-bit case.

[[!img gpio_jtag_1bit.jpg size="600x"]]

[[!img gpio_jtag_4bit.jpg size="600x"]]

# Core/Pad Connection + JTAG Mux

Diagram constructed from the nmigen plat.py file.

[[!img i_o_io_tristate_jtag.svg ]]