# Pinmux, IO Pads, and JTAG Boundary scan

Links:

* <http://www2.eng.cam.ac.uk/~dmh/4b7/resource/section14.htm>
* <https://www10.edacafe.com/book/ASIC/CH02/CH02.7.php>
* <https://ftp.libre-soc.org/Pin_Control_Subsystem_Overview.pdf>
* <https://bugs.libre-soc.org/show_bug.cgi?id=50>
* <https://git.libre-soc.org/?p=c4m-jtag.git;a=tree;hb=HEAD>
* Extra info: [[/docs/pinmux/temp_pinmux_info]]

Managing IO on an ASIC is nowhere near as simple as on an FPGA.
An FPGA has built-in IO Pads: the wires terminate inside an
existing silicon block which has been tested for you.
In an ASIC, you are going to have to do everything yourself.
A bi-directional IO Pad requires three wires (in, out,
out-enable) to be routed all the way from the core of the ASIC
to the IO Pad, where only then does a wire bond connect
it to a single external pin.

Below, therefore, is a (simplified) diagram of what is
usually contained in an FPGA's bi-directional IO Pad.
Consequently, this is what you must also provide, and explicitly
wire up, in your ASIC's HDL.

[[!img asic_iopad_gen.svg]]

When designing an ASIC, there is no guarantee that the IO pad is
working when manufactured. Worse, the peripheral could be
faulty. How can you tell what the cause is? There are two
possible faults, but only one symptom ("it dunt wurk").
This problem is what JTAG Boundary Scan is designed to solve.
JTAG can be operated from an external digital clock
at very low frequencies (5 kHz is perfectly acceptable),
so there is very little risk of clock skew during that testing.

Additionally, an SoC is designed to be low cost, and to use low cost
packaging. ASICs typically have only 32 to 128 pins (QFP)
in the Embedded
Controller range, and between 300 and 650 pins (FBGA) in the Tablet /
Smartphone range, with an absolute maximum of 19 mm on a side.
The 2 to 3 inch square, 1,000 pin packages common to Intel desktop
processors are absolutely out of the question.

(*With each pin, wire bonding smashes
into the ASIC, using purely heat of impact to melt the wire, and
cracks in the die can occur. The more times
the bonding equipment smashes into the die, the higher the
chances of irreversible damage. This is why larger pin packaged
ASICs are much more expensive: not because of their manufacturing
cost but because far more of them fail due to having been
literally hit with a hammer many more times*)

Yet the expectation from the market is to be able to fit 1,000+
pins worth of peripherals into only 200 to 400 actual
IO Pads. The solution is a GPIO Pinmux, described in some
detail here: <https://ftp.libre-soc.org/Pin_Control_Subsystem_Overview.pdf>

This page goes over the details and issues involved in creating
an ASIC that combines **both** JTAG Boundary Scan **and** GPIO
Muxing, down to layout considerations using coriolis2.

# Resources, Platforms and Pins

When nmigen HDL is created as Modules, those Modules typically know
nothing about FPGA Boards or ASICs. They especially do not know anything about the
Peripheral ICs (UART, I2C, USB, SPI, PCIe) connected to a given FPGA
on a given PCB, and they should not have to.

The Resources, Platforms and Pins API provides a level of abstraction
between peripherals, boards and HDL designs. Peripherals
may be given `(name, number)` tuples, the HDL design may "request"
a peripheral, which is described in terms of Resources, managed
by a ResourceManager, and a Platform may provide that peripheral.
The Platform is given
the responsibility to wire up the Pins to the correct FPGA (or ASIC)
IO Pads, and it is the HDL design's responsibility to connect up
those same named Pins, on the other side, to the implementation
of the PHY/Controller, in the HDL.

Here is a function that defines a UART Resource:

    #!/usr/bin/env python3
    from nmigen.build.dsl import Resource, Subsignal, Pins

    def UARTResource(*args, rx, tx):
        io = []
        io.append(Subsignal("rx", Pins(rx, dir="i", assert_width=1)))
        io.append(Subsignal("tx", Pins(tx, dir="o", assert_width=1)))
        return Resource.family(*args, default_name="uart", ios=io)

Note that each Subsignal is given a convenient name (tx, rx) and that
there are Pins associated with it.
UARTResource would typically be part of a larger function that defines,
for either an FPGA or an ASIC, a full array of IO Connections:

    def create_resources(pinset=None):
        # pinset is unused in this simplified example
        resources = []
        resources.append(UARTResource('uart', 0, tx='A20', rx='A21'))
        # add clock and reset
        clk = Resource("clk", 0, Pins("sys_clk", dir="i"))
        rst = Resource("rst", 0, Pins("sys_rst", dir="i"))
        resources.append(clk)
        resources.append(rst)
        return resources

For an FPGA, the Pins names are typically the Ball Grid Array
Pad or Pin name: A12, or N20. ASICs can do likewise: it is
convenient, when referring to schematics, to use the most
recognisable well-known name.

Next, these Resources need to be handed to a ResourceManager or
a Platform (Platform derives from ResourceManager):

    from nmigen.build.plat import TemplatedPlatform

    class ASICPlatform(TemplatedPlatform):
        # simplified: a real platform must also define the abstract
        # attributes that TemplatedPlatform requires (toolchain,
        # file and command templates, and so on)
        def __init__(self, resources):
            super().__init__()
            self.add_resources(resources)

An HDL Module may now be created which, if given
a platform instance during elaboration, may request
a UART (caveat below):

    from nmigen import Elaboratable, Module, Signal

    class Blinker(Elaboratable):
        def elaborate(self, platform):
            m = Module()
            # get the UART resource, mess with the output tx
            uart = platform.request('uart')
            intermediary = Signal()
            m.d.comb += uart.tx.eq(~intermediary) # invert, for fun
            m.d.comb += intermediary.eq(uart.rx) # pass rx to tx

            return m

The caveat here is that the Resources of the platform actually
have to have a UART in order for it to be requestable! Thus:

    resources = create_resources() # contains resource named "uart"
    asic = ASICPlatform(resources)
    hdl = Blinker()
    asic.build(hdl)

Finally the association between HDL, Resources, and ASIC Platform
is made:

* The Resources contain the abstract expression of the
  type of peripheral, its port names, and the corresponding
  names of the IO Pads associated with each port.
* The HDL, which knows nothing about IO Pad names, requests
  a Resource by name.
* The ASIC Platform, given the list of Resources, takes care
  of connecting requests for Resources to actual IO Pads.

This is the simple version. When JTAG Boundary Scan needs
to be added, it gets a lot more complex.

# JTAG Boundary Scan

JTAG Scanning is a (paywalled) IEEE Standard, 1149.1, which with
a little searching can be found online. Its purpose is to provide
a well-defined method of testing ASIC IO pads that a Foundry or
ASIC test house may apply easily with off-the-shelf equipment.
Scan chaining can also connect multiple ASICs together so that
the same test can be run on a large batch of ASICs at the same
time.

IO Pads generally come in four primary types:

* Input
* Output
* Output with Tristate (enable)
* Bi-directional Tristate Input/Output with direction enable

Interestingly, these can all be synthesised from one
Bi-directional Tristate IO Pad. Other types such as Differential
Pair Transmit may also be constructed from an inverter and a pair
of IO Pads. Other more advanced features include pull-up
and pull-down resistors, Schmitt triggering for interrupts,
different drive strengths, and so on, but the basics are
that the Pad is either an input, or an output, or both.

The JTAG Boundary Scan therefore needs to know what type
each pad is (In/Out/Bi) and has to "insert" itself in between
*all* the Pad's wires, which may be just an input, or just an output,
and, if bi-directional, an "output enable" line.

The "insertion" (or, "Tap") into those wires requires a
pair of Muxes for each wire. Under normal operation
the Muxes bypass JTAG entirely: the IO Pad is connected,
through the two Muxes,
directly to the Core (a hardware term for what Software
terminology calls a "peripheral").

When JTAG Scan is enabled, then for every pin that is
"tapped into", the Muxes flip such that:

* The IO Pad is connected directly to latches controlled
  by the JTAG Shift Register
* The Core (peripheral) is likewise connected, but to *different bits*
  from those that the Pad is connected to

In this way, not only can JTAG control or read the IO Pad,
but it can also read or control the Core (peripheral).
This is its entire purpose: interception to allow for the detection
and triaging of faults.

* Software may be uploaded and run which sets a bit on
  one of the peripheral outputs (UART Tx for example).
  If the UART Tx IO Pad was faulty, no possibility existed
  without Boundary Scan to determine if the peripheral
  was at fault. With the UART Tx pin function being
  redirected to a JTAG Shift Register, the results of the
  software setting UART Tx may be detected by checking
  the appropriate Shift Register bit.
* Likewise, a voltage may be applied to the UART Rx Pad,
  and the corresponding SR bit checked to see if the
  pad is working. Without Boundary Scan, if the UART Rx
  peripheral was faulty this would not be possible.

[[!img jtag-block.svg ]]

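The Mux behaviour described above can be sketched as a small behavioural model in plain Python (not HDL). The function names `scan_input` and `scan_output` are hypothetical and exist only for illustration: each models one intercepted wire, assuming a single Shift Register bit drives the far side when scanning is enabled.

```python
def scan_input(pad_i, sr_bit, scan):
    """Model an input pin.

    Returns (value seen by the core, value JTAG captures from the pad).
    """
    core_i = sr_bit if scan else pad_i   # Mux: Shift Register or pad
    return core_i, pad_i


def scan_output(core_o, sr_bit, scan):
    """Model an output pin.

    Returns (value driven onto the pad, value JTAG captures from the core).
    """
    pad_o = sr_bit if scan else core_o   # Mux: Shift Register or core
    return pad_o, core_o


# Normal operation: pad and core are connected straight through.
normal_in = scan_input(pad_i=1, sr_bit=0, scan=False)
# Scan enabled: the core sees the Shift Register bit, while the pad
# value is still captured, so a fault can be pinned on one side.
scanned_in = scan_input(pad_i=1, sr_bit=0, scan=True)
scanned_out = scan_output(core_o=1, sr_bit=0, scan=True)
```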
## C4M JTAG TAP

Staf Verhaegen's Chips4Makers JTAG TAP module includes everything
needed to create JTAG Boundary Scan Shift Registers,
as well as the IEEE 1149.1 Finite State Machine to access
them through TMS, TDO, TDI and TCK Signalling. However,
connecting up cores (a hardware term: the equivalent software
term is "peripherals") on one side and the pads on the other is
especially confusing, but deceptively simple. The actual addition
to the Scan Shift Register is this straightforward:

    from c4m.nmigen.jtag.tap import IOType, TAP

    class JTAG(TAP):
        def __init__(self):
            TAP.__init__(self, ir_width=4)
            self.u_tx = self.add_io(iotype=IOType.Out, name="tx")
            self.u_rx = self.add_io(iotype=IOType.In, name="rx")

This results in the creation of:

* Two Records, one of type In named rx, the other of type Out
  named tx
* A pair of sub-Records within each Record: one core-side
  and the other pad-side
* Entries in the Boundary Scan Shift Register which, if set,
  may control (or read) either the peripheral / core or
  the IO Pad
* A suite of Muxes (as shown in the diagrams above) which
  allow either direct connection between pad and core
  (bypassing JTAG) or interception

During Interception Mode (Scanning) pad and core are connected
to the Shift Register. During "Production" Mode, pad and
core are wired directly to each other (on a per-pin basis,
for every pin. Clearly this is a lot of work).

It is then your responsibility to:

* connect up each and every peripheral input and output
  to the right core-side Record in your HDL
* connect up each and every pad-side Record input and output
  to the right IO Pad in the Platform.
* **This does not happen automatically and is not the
  responsibility of the TAP Interface, it is yours**

The TAP interface connects the **other** side of the pads
and cores Records: **to the Muxes**. You **have** to
connect **your** side of both core and pads Records in
order for the Scan to be fully functional.

Both of these tasks are painstaking and tedious in the
extreme if done manually, and prone to either sheer boredom,
transliteration errors, dyslexia triggering or just utter
confusion. Despite this, let us proceed, and, augmenting
the Blinker example, wire up a JTAG instance:

    class Blinker(Elaboratable):
        def elaborate(self, platform):
            m = Module()
            m.submodules.jtag = jtag = JTAG()

            # get the records from JTAG instance
            utx, urx = jtag.u_tx, jtag.u_rx
            # get the UART resource from the Platform
            uart = platform.request('uart')

            # uart core-side from JTAG
            intermediary = Signal()
            m.d.comb += utx.core.o.eq(~intermediary) # invert, for fun
            m.d.comb += intermediary.eq(urx.core.i) # pass rx to tx

            # wire up the IO Pads (in right direction) to Platform
            m.d.comb += urx.pad.i.eq(uart.rx) # rx: Platform pad into JTAG pad-side
            m.d.comb += uart.tx.eq(utx.pad.o) # tx: JTAG pad-side out to Platform pad
            return m

Compared to the non-scan-capable version, which connected UART
Core Tx and Rx directly to the Platform Resource (and the Platform
took care of wiring to IO Pads):

* Core HDL is instead wired to the core-side of JTAG Scan
* JTAG Pad side is instead wired to the Platform
* (the Platform still takes care of wiring to actual IO Pads)

JTAG TAP capability on UART TX and RX has now been inserted into
the chain. Using openocd or another program it is possible to
drive the TDI, TMS and TCK signals, and read back TDO, according to
IEEE 1149.1 in order
to intercept both the core and IO Pads, both input and output,
and confirm the correct functionality of one even if the other is
broken, during ASIC testing.

## Libre-SOC Automatic Boundary Scan

Libre-SOC's JTAG TAP Boundary Scan system is a little more sophisticated:
it hooks into (replaces) ResourceManager.request(), intercepting the request
and recording what was requested. The above manual linkup to JTAG TAP
is then taken care of **automatically and transparently**, while to
all intents and purposes looking exactly like a Platform, even to
the extent of taking the exact same list of Resources.

    class Blinker(Elaboratable):
        def __init__(self, resources):
            self.jtag = JTAG(resources)

        def elaborate(self, platform):
            m = Module()
            m.submodules.jtag = jtag = self.jtag

            # get the UART resource, mess with the output tx
            uart = jtag.request('uart')
            intermediary = Signal()
            m.d.comb += uart.tx.eq(~intermediary) # invert, for fun
            m.d.comb += intermediary.eq(uart.rx) # pass rx to tx

            return jtag.boundary_elaborate(m, platform)

Connecting up and building the ASIC is as simple as with a non-JTAG,
non-scanning-aware Platform:

    resources = create_resources()
    asic = ASICPlatform(resources)
    hdl = Blinker(resources)
    asic.build(hdl)

The differences:

* The list of resources was also passed to the HDL Module
  such that JTAG may create a complete identical list
  of both core and pad matching Pins
* Resources were requested from the JTAG instance,
  not the Platform
* A "magic function" (JTAG.boundary_elaborate) is called
  which wires up all of the seamlessly intercepted
  Platform resources to the JTAG core/pads Resources,
  where the HDL connected to the core side, exactly
  as if this was a non-JTAG-Scan-aware Platform.
* ASICPlatform still takes care of connecting to actual
  IO Pads, except that the Platform.request calls were
  triggered "behind the scenes". For that to work it
  is absolutely essential that the JTAG instance and the
  ASICPlatform be given the exact same list of Resources.

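The record-then-replay idea behind the interception can be illustrated with a toy model in plain Python. This is not Libre-SOC's implementation: `ToyPlatform` and `RecordingRequester` are hypothetical stand-ins, and resources are reduced to a dictionary of pad names, purely to show why both sides must be given the same list.

```python
class ToyPlatform:
    """Hypothetical stand-in for a Platform: hands out resources by name."""
    def __init__(self, resources):
        self.resources = dict(resources)   # name -> list of pad names

    def request(self, name):
        return self.resources[name]


class RecordingRequester:
    """Hypothetical stand-in for the JTAG instance: records each request."""
    def __init__(self, resources):
        self.resources = dict(resources)
        self.requested = []                # names the HDL asked for

    def request(self, name):
        if name not in self.resources:
            raise KeyError(f"no such resource: {name}")
        self.requested.append(name)
        return f"core-side:{name}"         # the HDL only ever sees this side

    def boundary_elaborate(self, platform):
        # Replay every recorded request against the real platform,
        # which is where the pad side would be wired up.
        return {name: platform.request(name) for name in self.requested}


resources = {"uart": ["A20", "A21"], "clk": ["sys_clk"]}
jtag = RecordingRequester(resources)
core = jtag.request("uart")
wiring = jtag.boundary_elaborate(ToyPlatform(resources))
```

If the two objects were given different resource lists, the replayed `platform.request` would fail for any name the platform does not know, which mirrors the requirement stated in the last bullet above.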
## Clock synchronisation

Take for example USB ULPI:

<img src="https://www.crifan.com/files/pic/serial_story/other_site/p_blog_bb.JPG"
width=400 />

Here there is an external incoming clock, generated by the PHY, to which
both Received *and Transmitted* data and control are synchronised. Notice
very specifically that it is *not the main processor* generating that clock
Signal, but the external peripheral (known as a PHY in Hardware terminology).

Firstly: note that the Clock will, obviously, also need to be routed
through JTAG Boundary Scan because, after all, it is being received
through just another ordinary IO Pad. Secondly: note that
if it were not, then clock skew would occur for that peripheral because
although the Data Wires went through JTAG Boundary Scan MUXes, the
clock did not. Clearly this would be a problem.

However, clocks are very special signals: they have to be distributed
evenly to each and every Latch (DFF) inside the peripheral so that
data corruption does not occur because of tiny delays.
To avoid that scenario, Clock Domain Crossing (CDC) is used, with
Asynchronous FIFOs:

    rx_fifo = stream.AsyncFIFO([("data", 8)], self.rx_depth,
                               w_domain="ulpi", r_domain="sync")
    tx_fifo = stream.AsyncFIFO([("data", 8)], self.tx_depth,
                               w_domain="sync", r_domain="ulpi")
    m.submodules.rx_fifo = rx_fifo
    m.submodules.tx_fifo = tx_fifo

However, the entire FIFO must be covered by two Clock H-Trees: one
driven by the ULPI external clock, and the other by the main system clock.
The size of the ULPI clock H-Tree, and consequently the size of
the PHY on-chip, will result in more Clock Tree Buffers being
inserted into the chain, and, correspondingly, matching buffers
on the ULPI data input side likewise must be inserted so that
the input data timing precisely matches that of its clock.

The problem is not receiving of data, though: it is transmission
on the output ULPI side. With the ULPI Clock Tree having buffers
inserted, each buffer creates delay. The ULPI output FIFO has to
correspondingly be synchronised not to the original incoming clock
but to that clock *after going through H-Tree Buffers*. Therefore,
there will be a lag on the output data compared to the incoming
(external) clock.

# Pinmux GPIO Block
The following diagram is an example of a GPIO block with switchable banks and comes from the Ericson presentation on a GPIO architecture.

[[!img gpio-block.svg size="800x"]]

The block we are developing is very similar, but lacks some of the configurability of the former (due to complexity and time constraints).

## Diagram
[[!img banked_gpio_block.jpg size="600x"]]

*(Diagram is missing the "ie" signal as part of the bundle of signals given to the peripherals, will be updated later)*

## Explanation
The simple GPIO module is a multi-GPIO block integral to the pinmux system.
To make the block flexible, it has a variable number of I/Os based on an
input parameter.

By default, the block is a memory-mapped Wishbone (WB) bus GPIO. The CPU
core can just write the configuration word to the GPIO row address. From this
perspective, it is no different to a conventional GPIO block.

### Bank Select Options
* bank 0 - WB bus has full control (GPIO peripheral)
* bank 1,2,3 - WB bus only controls puen/pden, peripheral gets o/oe/i/ie (Not
  fully specified how this should be arranged yet)

Bank select, however, allows control of the GPIO block to be switched over to
another peripheral. The peripheral will be given sole connectivity to the
o/oe/i/ie signals, while additional parameters such as pull up/down will either
be automatically configured (as is the case for I2C), or will be configurable
via the WB bus. *(This has not been implemented yet, so open to discussion)*

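As a rough behavioural sketch of the bank select described above (plain Python, not the HDL under development), the functions below pick which source drives `o`/`oe` and which bank receives the pad input. The names `select_outputs` and `route_pad_input` are hypothetical, and the choice that unselected banks read 0 is an assumption, since the text notes the arrangement is not yet fully specified.

```python
def select_outputs(bank_sel, wb, periphs):
    """Return the (o, oe) pair that reaches the pad.

    bank_sel -- 0 for the WB bus, 1..3 for a peripheral
    wb       -- (o, oe) as configured over the WB bus
    periphs  -- dict mapping bank 1..3 to that peripheral's (o, oe)
    """
    if bank_sel == 0:
        return wb
    return periphs[bank_sel]


def route_pad_input(bank_sel, pad_i, num_banks=4):
    """Deliver the pad input to the selected bank only.

    Assumption: every unselected bank reads 0.
    """
    return [pad_i if bank == bank_sel else 0 for bank in range(num_banks)]


wb = (1, 1)
periphs = {1: (0, 1), 2: (1, 0), 3: (0, 0)}
from_wb = select_outputs(0, wb, periphs)
from_periph2 = select_outputs(2, wb, periphs)
inputs = route_pad_input(1, pad_i=1)
```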
## Configuration Word
After a discussion with Luke on IRC (14th January 2022), the new layout of the
8-bit data word for configuring the GPIO (through CSR) is:

* oe - Output Enable (see the Ericson presentation for the GPIO diagram)
* ie - Input Enable
* puen - Pull-Up resistor enable
* pden - Pull-Down resistor enable
* i/o - When configured as output (oe set), this bit sets/clears output. When
  configured as input, shows the current state of input (read-only)
* bank_sel[2:0] - Bank Select (only 4 banks used)

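A minimal sketch of packing and unpacking this configuration byte in Python follows. The bit positions are an assumption: they follow the order of the list above, starting from bit 0, with `bank_sel` in the top three bits. The diagram in the next section is the authority on the real layout, and the helper names `pack_config` and `unpack_config` are hypothetical.

```python
# Assumed bit positions (list order, LSB first); check against the diagram.
OE, IE, PUEN, PDEN, IO = 0, 1, 2, 3, 4
BANK_SHIFT, BANK_MASK = 5, 0b111


def pack_config(oe=0, ie=0, puen=0, pden=0, io=0, bank_sel=0):
    """Build one 8-bit GPIO configuration byte from its fields."""
    byte = (oe << OE) | (ie << IE) | (puen << PUEN) | (pden << PDEN)
    byte |= io << IO
    return byte | ((bank_sel & BANK_MASK) << BANK_SHIFT)


def unpack_config(byte):
    """Split an 8-bit GPIO configuration byte back into its fields."""
    return {
        "oe": (byte >> OE) & 1,
        "ie": (byte >> IE) & 1,
        "puen": (byte >> PUEN) & 1,
        "pden": (byte >> PDEN) & 1,
        "io": (byte >> IO) & 1,
        "bank_sel": (byte >> BANK_SHIFT) & BANK_MASK,
    }


output_high = pack_config(oe=1, io=1)                  # drive the pad high
periph_input = pack_config(ie=1, puen=1, bank_sel=2)   # bank 2, pulled up
fields = unpack_config(periph_input)
```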
### Simultaneous/Packed Configuration
To make the configuration more efficient, multiple GPIOs can be configured with
one data word. The number of GPIOs in one "row" is dependent on the width of the
WB data bus.

If, for example, the data bus is 64 bits wide, eight GPIO configuration bytes -
and thus eight GPIOs - are configured in one go. There is no way to specify
which GPIO in a row is configured, so the programmer has to keep the current
state of the configuration as part of the code (essentially a shadow register).

The diagram below shows the layout of the configuration byte, and how it fits
within a 64-bit data word.

[[!img gpio_csr_example.jpg size="600x"]]

If the block is created with more GPIOs than can fit in a single data word,
the next set of GPIOs can be accessed by incrementing the address.
For example, if 16 GPIOs are instantiated and a 64-bit data bus is used, GPIOs
0-7 are accessed via address 0, whereas GPIOs 8-15 are accessed by address 8
(TODO: DOES ADDRESS COUNT WORDS OR BYTES?)

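The shadow register mentioned above might look something like the following Python sketch. The class `GpioRowShadow` and its `bus_write` callback are hypothetical. It assumes GPIO n occupies byte lane n of a 64-bit row and that the row resets to all zeros, neither of which is confirmed by the text.

```python
class GpioRowShadow:
    """Software copy of one 64-bit row of eight 8-bit GPIO config bytes."""

    def __init__(self, bus_write, row_addr=0):
        self.bus_write = bus_write   # hypothetical callable(addr, value)
        self.row_addr = row_addr
        self.value = 0               # assumed reset state: all zeros

    def set_gpio(self, index, config_byte):
        """Change one GPIO's byte, then rewrite the whole row."""
        if not 0 <= index < 8:
            raise ValueError("a 64-bit row holds GPIOs 0-7 only")
        shift = 8 * index            # assumes GPIO n sits in byte lane n
        self.value &= ~(0xFF << shift)
        self.value |= (config_byte & 0xFF) << shift
        self.bus_write(self.row_addr, self.value)


writes = []
shadow = GpioRowShadow(lambda addr, val: writes.append((addr, val)))
shadow.set_gpio(0, 0x11)
shadow.set_gpio(3, 0x46)   # GPIO 0's byte is preserved by the shadow copy
```

Because the hardware cannot address a single GPIO within a row, every update writes all eight bytes; the shadow copy is what keeps the other seven unchanged.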
## Example Memory Map
[[!img gpio_memory_example.jpg size="600x"]]

The diagrams above show the difference in memory layout between a 16-GPIO block
implemented with 64-bit and 32-bit WB data buses.
The 64-bit case shows there are two rows with eight GPIOs in each, and it will
take two writes (assuming simple WB write) to completely configure all 16 GPIOs.
The 32-bit case, on the other hand, has four address rows, and so will take four write transactions.

64-bit:

* 0x00 - Configure GPIOs 0-7
* 0x01 - Configure GPIOs 8-15

32-bit:

* 0x00 - Configure GPIOs 0-3
* 0x01 - Configure GPIOs 4-7
* 0x02 - Configure GPIOs 8-11
* 0x03 - Configure GPIOs 12-15

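The two lists above follow a simple rule, sketched here in Python. The function name `gpio_location` is hypothetical, and the sketch assumes the address counts rows (words), as the lists do; the earlier TODO about word versus byte addressing leaves that open.

```python
def gpio_location(gpio, bus_width_bits):
    """Return (row address, byte lane) for a GPIO number.

    Assumes one configuration byte per GPIO and word (row) addressing.
    """
    gpios_per_row = bus_width_bits // 8
    return gpio // gpios_per_row, gpio % gpios_per_row


loc_64 = gpio_location(9, 64)    # second row of the 64-bit layout
loc_32 = gpio_location(9, 32)    # third row of the 32-bit layout
```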
## Combining JTAG BS Chain and Pinmux (In Progress)
[[!img io_mux_bank_planning.JPG size="600x"]]

The JTAG BS chain needs to have access to the bank select bits, to allow
selecting different peripherals during testing. At the same time, JTAG may
also require access to the WB bus to access GPIO configuration options
not available to bank 1/2/3 peripherals.

### Proposal
TODO: REWORK BASED ON GPIO JTAG DIAGRAMS BELOW

The proposed JTAG BS chain is as follows:

* Between each peripheral and GPIO block, add a JTAG BS chain. For example
  the I2C SDA line will have core o/oe/i/ie, and from JTAG the pad o/oe/i/ie will
  connect to the GPIO block's ports 1-3.
* Provide a test port for the GPIO block that gives full access to configuration
  (o/oe/i/ie/puen/pden) and bank select. Only allow full JTAG configuration *IF*
  bank select bit 2 is set!
* No JTAG chain between WB bus and GPIO port 0 input *(not sure what to do for
  this, or whether it is even needed)*.

Such a setup would allow the JTAG chain to control the bank select when testing
connectivity of the peripherals, as well as give full control to the GPIO
configuration when bank select bit 2 is set.

For the purposes of muxing peripherals, bank select bit 2 is ignored. This means
that even if JTAG is handed over full control, the peripheral is still connected
to the GPIO block (via the BS chain).

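The proposed split of the 3-bit bank select can be written out as a short Python sketch. The function name `decode_bank_select` is hypothetical, and since the proposal itself is marked for rework, this only illustrates the rule as stated in the two paragraphs above.

```python
def decode_bank_select(bank_sel):
    """Split a 3-bit bank_sel into (mux bank 0..3, JTAG full-control flag)."""
    mux_bank = bank_sel & 0b011                  # bit 2 is ignored for muxing
    jtag_full_control = bool(bank_sel & 0b100)   # bit 2 hands config to JTAG
    return mux_bank, jtag_full_control


# Peripheral on bank 1 stays connected while JTAG has full control.
under_test = decode_bank_select(0b101)
normal_use = decode_bank_select(0b010)
```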
Signals for various ports:

* WB bus or Periph0: WB data read, data write, address, cyc, stb, ack
* Periph1/2/3: o,oe,i,ie (puen/pden are only controlled by WB, test port, or
  fixed by functionality)
* Test port: bank_select[2:0], o,oe,i,ie,puen,pden. In addition, an internal
  address to access individual GPIOs will be available (this will consist of a
  few bits, as more than 16 GPIOs per block is likely to be too big).

As you can see from the above list, the GPIO block is becoming quite a complex
beast. If there are suggestions to simplify or reduce some of the signals,
that will be helpful.

The diagrams below show 1-bit GPIO connectivity, as well as the 4-bit case.

[[!img gpio_jtag_1bit.jpg size="600x"]]

[[!img gpio_jtag_4bit.jpg size="600x"]]

# Core/Pad Connection + JTAG Mux

Diagram constructed from the nmigen plat.py file.

[[!img i_o_io_tristate_jtag.svg ]]
555 [[!img i_o_io_tristate_jtag.svg ]]
556