
The code in this directory is for a large example design, BxB Demo 1,
also called the BxBApp.

With this FPGA app and its software, the RFSoC4x2 board can perform
the functions of a spectrum analyzer and an oscilloscope.  It can also
measure transfer functions.  It can characterize performance of the
RFSoC ADC or DAC.  If you're using an RFSoC4x2 to build some design,
you can run this app on the target board instead of the design's app
to double-check signals coming into the board or produce test signals
out.

This is an unusual design, because its DSP processing is designed to
run at a variable clock rate, where the clock rate is programmed via
the board's clock chips, external to the FPGA.  Because the clock rate
is variable, MMCMs and PLLs can't be used.  So all DSP clocks must be
obtained without them.  Furthermore, the ADC's
clock divider in the RFSoC FPGA is unable to divide by the correct
clock ratio to get the desired FPGA frequencies easily.  To overcome
this, a DAC clock is generated in the external clock chips with the
correct clock ratio to the ADC clock and synchronous to it, so that
everything can be correctly clocked.

Support for a variable clock rate allows the software
application to configure the ADC sampling clock to a very wide range
of values -- practically anything desired between 500MHz and 7GHz
(these numbers are from memory, and are probably not perfectly
accurate).  Of particular interest is that the max sampling rate spec
for this part is 5GHz, but this FPGA design is capable of testing
sampling at rates well above this spec.

This variability allows designers to characterize the ADC at almost
any desired sampling rate, which can be useful to understand how the
ADC may perform in a particular end application.

This design demonstrates many things.  Of particular interest is that
no interrupts are used between hardware and software.  The philosophy
is that it is best to avoid interrupts in Linux applications, because
they are much more difficult to implement.  Also, they add little to a
well-thought-out design.  The purpose of interrupts is for immediate
response to an event, for things that are time-critical.  But the
processor is slow compared to the hardware, and it may get bogged
down.  Instead of having time-critical software, it makes more sense
to put anything truly time-critical into a hardware function, if at
all possible, so that software functions have flexibility in their
timing.

Thus instead of having interrupt-driven Linux software, it makes sense
to have a multithreaded application, where the software polls for
hardware events.  One way to implement the polling is for software to
check for an event when it is convenient, at the end of some other
task.  Another way is for the polling to be asynchronous to the other
software processes.  This is done by giving the polling its own
thread, that sleeps for a short time, then polls, then sleeps, then
polls.  Because this is asynchronous to the rest of the program it is
similar in effect to how an interrupt works.  It has minor
disadvantages compared with an interrupt solution: additional overhead
and a small response delay.  However, it needs no kernel driver, and is
thus immensely easier to implement and maintain in a Linux
environment.  As seen in the example BxBApp, great things can be
accomplished this way.
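
The sleep-then-poll thread described above can be sketched as follows.
This is a minimal Python illustration, not the BxBApp's actual code:
the class name HardwarePoller, the read_status callable (standing in
for a read of some hardware status register), and the 1 ms interval
are all assumptions made for the example.

```python
import threading
import time


class HardwarePoller(threading.Thread):
    """Poll a status source on its own thread and pass events to a callback."""

    def __init__(self, read_status, on_event, interval_s=0.001):
        super().__init__(daemon=True)
        self._read_status = read_status  # hypothetical: returns nonzero when an event is pending
        self._on_event = on_event        # called with the nonzero status value
        self._interval_s = interval_s    # sleep time between polls (assumed value)
        self._stop_flag = threading.Event()

    def run(self):
        # Sleep, poll, sleep, poll ... until stop() is called.
        while not self._stop_flag.wait(self._interval_s):
            status = self._read_status()
            if status:
                self._on_event(status)

    def stop(self):
        self._stop_flag.set()
        self.join()


if __name__ == "__main__":
    # Simulated register values, one per poll; real code would read the PL instead.
    pending = [0, 0, 3, 0, 5]
    seen = []
    poller = HardwarePoller(lambda: pending.pop(0) if pending else 0, seen.append)
    poller.start()
    time.sleep(0.1)
    poller.stop()
    print(seen)
```

The rest of the program never waits on the hardware; it only sees the
callback fire, which is what makes this behave much like an interrupt
handler without needing a kernel driver.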

Incidentally, bare metal programming is different from programming in
a Linux environment, and an interrupt solution is preferable for bare
metal.  Also, bare metal applications are preferable in many
situations, where the desired FPGA SoC functionality isn't too
complicated.  However, bare-metal programming greatly increases the
complexity of many things, including the networking and DisplayPort
capabilities of the BxBApp.  Because of this, it would be an immense
task to design something like the BxBApp using bare metal.
Consequently, the BxBApp uses Linux, and thus it uses the polling
solution that is preferable in Linux, rather than interrupts.

Another thing demonstrated is how to avoid DMA for getting data to the
processor.  It is easy for the processor to just read the data
directly, if the hardware is designed for it, and that's what the
BxBApp does.

Sometimes a DMA is considered to be more efficient.  However with a
DMA, the DMA must read the data from its capture buffer in the PL,
then write the data to PS SDRAM, then have the processor read it
again.  So there is an extra read and write versus when the processor
just reads the data directly.  So DMA's efficiency isn't clear cut.

A DMA scheme that bypasses storage in the PL is also a possibility.
This would write data to the PS SDRAM as it is generated.  This at
least saves buffer memory in the PL.  However, this has complications
because the AXI bus might be backed up, causing data loss that needs
recovery methods.  Also, it still requires more reads/writes than
having the processor directly access data from the PL.

This design also demonstrates several utility modules.  There is a
module for measuring clock frequencies, relative to a known reference
clock.  There is a module for reading the FPGA's DNA value, which can
be used as a random seed to differentiate boards or as part of a
scheme for copy protection.
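
As a rough illustration of measuring a clock against a reference, the
usual approach is to count cycles of both clocks over the same gate
interval and scale by the ratio.  The sketch below assumes that
approach; it is not taken from the BxBApp module, and the function and
parameter names are hypothetical.

```python
def measured_frequency_hz(ref_hz, ref_count, meas_count):
    """Estimate an unknown clock frequency from two cycle counts over the same gate.

    ref_hz     -- known reference clock frequency in Hz
    ref_count  -- reference clock cycles counted during the gate
    meas_count -- cycles of the clock under test counted during the same gate
    """
    if ref_count <= 0:
        raise ValueError("ref_count must be positive")
    return ref_hz * meas_count / ref_count


def resolution_hz(ref_hz, ref_count):
    """Frequency step that corresponds to one count of the measured clock."""
    if ref_count <= 0:
        raise ValueError("ref_count must be positive")
    return ref_hz / ref_count


if __name__ == "__main__":
    # Illustrative numbers only: 100 MHz reference, gate of 1,000,000 reference cycles.
    print(measured_frequency_hz(100_000_000, 1_000_000, 2_500_000))
    print(resolution_hz(100_000_000, 1_000_000))
```

A longer gate (larger ref_count) gives finer resolution at the cost of
a slower measurement.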

Another thing demonstrated is a method of getting Vivado to meet
timing.  Many designs may struggle to meet timing, and it is desirable
to have a technique that enables meeting timing with the least
possible manual labor.

This BxBApp design meets timing almost always.  However, because
the clock rate for this application is variable and may be much higher
than the default, it is desirable to meet timing with the greatest
possible Fmax.  So in this case, a technique that improves timing is
also quite valuable.

There are many strategies for meeting timing, regarding changing place
options, changing route options, or re-running place or route with
additional guidance from past runs.  The technique demonstrated here
is none of these.  It is entirely automated, yet in my experience
attains results at least as good as these other more manual methods.

The technique is based on the idea that most of the timing issues are
caused in the placement stage, where design elements are placed too
far from each other.  Also, the technique is based on the realization
that this placement, although intelligent, is also in large part
random.  It's repeatable, in that if you make no change to the design
you get identical results.  However, if you make even a small change
to the design you get an entirely different placement, and thus
entirely different timing results.  Designs can easily have hundreds
of MHz different Fmax from minor changes that a person would consider
to be almost entirely cosmetic.

The idea behind the technique is that this randomness can be
exploited.  If minor changes to the design can make hundreds of MHz
difference, then one could purposely make minor and inconsequential
changes to the design until the best result is obtained.  One merely
needs a method to make minor and inconsequential design changes in an
automated fashion.  Then a machine can be let loose to try many design
perturbations until a good one is found, without a person doing any
additional work to meet timing.

It turns out that one such minimal change is to change the value of
clock uncertainty, using set_clock_uncertainty.  This doesn't change
any of the PL source code at all, and great variability in placement
can be had by using it.  That's what is done in this BxBApp example,
by the scripts "compile_design.sh" (to compile the design with a
specified clock uncertainty) and "multi_compile.sh" (to loop over
many clock uncertainties, trying each one).  The script
"report_timing_best_to_worst.sh" prints out a summary of what Fmax was
achieved for each design, by extracting that information from the
Vivado timing reports.  Fmax is measured on the clock that typically
has the greatest difficulty meeting timing.  This allows all of the
runs created by "multi_compile.sh" to be quickly evaluated to identify
the best one, which can then be examined in detail to make sure that
all other timing is met.
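
For readers who want to see the arithmetic behind such a ranking, here
is a small Python sketch.  It assumes the common relation Fmax = 1 /
(target period - worst slack); how "report_timing_best_to_worst.sh"
actually extracts and computes its numbers is not shown here, and the
run names and slack values below are made up for illustration.

```python
def fmax_mhz(period_ns, wns_ns):
    """Fmax implied by a clock's target period and worst slack (both in ns)."""
    achieved_ns = period_ns - wns_ns
    if achieved_ns <= 0:
        raise ValueError("slack must be smaller than the period")
    return 1000.0 / achieved_ns


def rank_runs(period_ns, wns_by_run):
    """Return (run, Fmax in MHz) pairs sorted from best to worst."""
    scored = [(run, fmax_mhz(period_ns, wns)) for run, wns in wns_by_run.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical runs keyed by the clock uncertainty used during placement.
    example = {"unc_0.10": -0.10, "unc_0.15": 0.20, "unc_0.20": 0.00}
    for run, fmax in rank_runs(2.0, example):
        print("%-10s %7.1f MHz" % (run, fmax))
```

Positive slack gives an Fmax above the target frequency, and negative
slack gives one below it.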

The process of running these scripts has been performed for you, with
the most important results saved off in the "best_results" directory.
The results are selected and saved off automatically by running
"cache_best_results.sh" after "multi_compile.sh" completes.  The file
"best_results/README.txt" explains some of the cached results and
their purpose.  It also gives a list of all the different runs that
were made, and how they all performed in Fmax before the best one was
selected.

The details of how "compile_design.sh" changes the clock uncertainty
are as follows.  To give a random placement, set_clock_uncertainty is
used to set a specific desired uncertainty prior to placement.
However, the value set with set_clock_uncertainty must be removed
prior to routing.  This is so that final timing estimates don't
include this false uncertainty, and so that routing produces routes
targeted at the correct timing values.  In this way, the finished
design is fully checked off for the correct timing, as if
set_clock_uncertainty had never been used -- except for its effect on
placement.

The "compile_design.sh" script compiles in project mode.  This is
because the design is a project mode design created with the GUI, then
saved off as a TCL file.  So the script re-creates the project and
then builds it using project mode.  One result of this is that after
the "compile_design.sh" script is run to build the project, the
project can again be opened with the Vivado GUI, to make changes.
Then a new TCL file can be saved off, the old one replaced, and the
project is then updated with the changes from the GUI.  This makes
development somewhat convenient, having some of the better parts of
both scripting and GUI.

However, compiling in Project Mode makes changing clock uncertainty a
bit tricky.  In project mode, set_clock_uncertainty can only be used
in a constraints file.  So for "compile_design.sh" to change the clock
uncertainty for placement but then restore it for routing,
"compile_design.sh" must add an appropriate set_clock_uncertainty to a
constraints file, and then later remove that set_clock_uncertainty in
the middle of the design run, so that it's not there when routing happens.
The TCL commands I settled on to do this can be seen in "compile_design.sh".

In non-project flow it's easier, since set_clock_uncertainty can be
entered directly before placement, and then zeroed out again after
placement, without so much fuss.
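
To make the non-project ordering concrete, the Python sketch below
just builds the sequence of Tcl commands such a run might issue.  It
is an assumption-laden outline, not what "compile_design.sh" does: the
helper name and the clock name "dsp_clk" are hypothetical, and a real
script would also need the synthesis, optimization, and reporting
steps around it.

```python
def placement_perturbation_tcl(clock_name, uncertainty_ns):
    """Build Tcl lines that apply extra clock uncertainty for placement only."""
    clk = "[get_clocks {%s}]" % clock_name
    return [
        # The extra uncertainty is present only while the placer runs.
        "set_clock_uncertainty %.3f %s" % (uncertainty_ns, clk),
        "place_design",
        # Zero it again so routing and final timing see the real constraints.
        "set_clock_uncertainty 0 %s" % clk,
        "route_design",
        "report_timing_summary",
    ]


if __name__ == "__main__":
    # "dsp_clk" is a placeholder clock name for illustration.
    print("\n".join(placement_perturbation_tcl("dsp_clk", 0.15)))
```

Sweeping uncertainty_ns over a range of values and running each
command list in its own Vivado session gives one placement-randomized
run per value.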

Another thing demonstrated is how to deal with clock transitions.
This design has a number of them, since various signals and data need
to be communicated between the AXI clock domain and the DSP clock
domain.  This interface is fully asynchronous, with synchronizer
circuits between the domains and adequate timing guards to prevent
any issues.  However, Vivado still tries to do timing analysis
across the synchronizer circuits, even with them marked as ASYNC_REG.
These clock transitions need to be marked as false paths, since the
signals are fully asynchronous and thus any timing is acceptable.

Vivado has no specific method to mark asynchronous paths in
synchronizers as false paths from the Verilog source code.  This can
be done in the constraints file with TCL commands, but it's really
annoying to do it there since it's hard to get the TCL to match the
Verilog, it's hard to maintain the match, and it's easy to miss
something.

However, there is a trick I picked up from some generous soul on the
internet (whose name sadly I forgot).  First, you give the register
receiving the asynchronous signal the attribute FALSE_PATH_DEST.
Vivado recognizes no such attribute, but Vivado carries it along
anyway and the corresponding cell has that attribute after synthesis.
Then, in the TCL constraints, you add a command as follows:

set_false_path -to [get_cells -hier -filter {(FALSE_PATH_DEST == 1)}]

This command finds all of the registers in the Verilog source that you
marked with a FALSE_PATH_DEST attribute, and it marks paths to them as
false paths.  This makes things very easy.  In your Verilog source,
you have:

`define raxi   always@(posedge S_AXI_ACLK)
`define rref   always@(posedge REF_CLK)

 (* ASYNC_REG  = "TRUE" *)                            reg   enable_ar;
 (* ASYNC_REG  = "TRUE" *) (* FALSE_PATH_DEST = 1 *)  reg   enable_xr;
 (* ASYNC_REG  = "TRUE" *)                            reg   enable_xrr;

`raxi  enable_ar    <= input_w;     // Input register clocked on S_AXI_ACLK
`rref  enable_xr    <= enable_ar;   // First register clocked on REF_CLK
`rref  enable_xrr   <= enable_xr;   // Second register clocked on REF_CLK

Then with the above TCL in the constraints, all the false paths you
marked in the Verilog become correctly marked as false paths to
Vivado.

Doing this greatly relieves Vivado's confusion about clock
transitions.  For this design, if you don't mark the clock transitions
as false paths, Vivado only rarely gives an Fmax that is above the
target Fmax -- about 3 times out of 50 placement-randomized runs.
Even in those cases, Fmax is only barely over the desired value.  If
you do mark these paths as false paths as shown, all 50 runs achieve
timing on this design, and in some cases they achieve Fmax that is
nearly 80MHz above the requested value.

This example is not by any means a perfect design; some parts of it
were created before I had an adequate understanding of some issues, some
parts just grew instead of being well designed, some parts were
rushed, and for some parts I just had a bad day.  However, I think
there are many excellent things in this design, and I hope you learn
something from it if that is your desire.

If you have a tip for me, please send me a note!  I know there are
lots of other experts out there, each with their own tricks and tips.
I'd love to hear from you.

Thanks for reading!  I hope you enjoy this example.  The example is
free for non-commercial use and distribution, but of course without
warranty of any kind.

For commercial use of any part of this example, contact Bit by Bit
Signal Processing.  Contact info is available at https://bxbsp.com.

Regards,

Ross Martin
ross@bxbsp.com
