Basic UVM testbench for a Stream Processor

The generic Device Under Test

This section describes the generic stream processor (SP): its interfaces, some internals, and how it is expected to be used. It elaborates on the Overview but stops short of heavy detail. What is described here is how to use the SP independent of any particular function it will perform. Here is a block diagram again for reference. The bus interface signals are fully described in the specification.

In brief:

The stream processor looks like two independent single port peripherals.

The operation is:
1) Data/command frames are written to the SP via the write channel.
2) Computations are performed.
3) Results can be read via the read channel when the computations are complete.

The content of the frame, or data record, determines the computation and format of the results on read.

A portion of the frame is reserved for control/command information, a portion for input data, and a portion for result data.

Frames include a frameID since:
1) frames can complete out-of-order, and
2) frames can be chained as a stream of pipelined computations; the chaining is done via bit 7 of the CMMD field and the frameID.

An abort signal is provided on the write channel; asserting it causes all computational units to abandon processing and return to idle.

Neither the write nor read channel need know the number of computational units available. That number is N.

Neither the write nor read channel need know the number of computation units/functions available in each computational unit. That number is M.

Computational units can run in parallel or be scheduled serially as they become available.

The signal ok2write indicates that at least one processing unit is available.

The signal ok2read indicates that at least one processing unit is done and results can be read.
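As a rough illustration of the two status signals, here is a minimal behavioural sketch in Python. It is not the RTL: the class name, the three unit states, and the rule that a unit returns to idle once its result has been read are assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    IDLE = 0   # unit can accept a frame
    BUSY = 1   # unit is computing
    DONE = 2   # unit holds results waiting to be read


class StreamProcessorModel:
    """Toy status model of an SP with n_units processing units (hypothetical)."""

    def __init__(self, n_units):
        self.units = [State.IDLE] * n_units

    @property
    def ok2write(self):
        # At least one processing unit is available.
        return any(s is State.IDLE for s in self.units)

    @property
    def ok2read(self):
        # At least one processing unit is done and can be read.
        return any(s is State.DONE for s in self.units)

    def write_frame(self):
        idx = self.units.index(State.IDLE)  # raises ValueError if none is idle
        self.units[idx] = State.BUSY
        return idx

    def finish(self, idx):
        self.units[idx] = State.DONE

    def read_frame(self):
        idx = self.units.index(State.DONE)  # raises ValueError if none is done
        self.units[idx] = State.IDLE        # assumption: reading frees the unit
        return idx
```

Note that neither caller needs to know N: the writer only checks ok2write and the reader only checks ok2read.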

The core(s) of a stream processor are the "oneproc" units. They contain some number of kernel functions. In this project M = 2.

oneproc is "one processing unit".

oneproc_control manages the bus interfaces of the write and read channels to/from the internal memory, and manages control of the processor via the computational unit control.

There are only 4 fixed bytes in the data record: the frameID, the CMMD field, and arg0 and arg1 of the CMMD. The oneproc_control uses CMMD, arg0, and arg1 to select the correct computation unit.

All other values in the data record are interpreted in the context of the CMMD, arg0, and arg1 fields. oneproc and oneproc_control have no knowledge of what computation is being performed. oneproc_control simply uses those three bytes to set some multiplexor select lines, pass control to the computational unit control, and wait.
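To make the fixed part of the data record concrete, here is a small Python sketch that splits off the four fixed bytes. The byte order (frameID, CMMD, arg0, arg1 in bytes 0 to 3) and the use of the low seven bits of CMMD as a function code are assumptions for illustration; the actual layout is defined by the DUT package. Only the role of bit 7 of CMMD comes from the text.

```python
from typing import NamedTuple

FINAL_BIT = 0x80  # bit 7 of the CMMD field


class Header(NamedTuple):
    frame_id: int
    cmmd: int
    arg0: int
    arg1: int

    @property
    def final(self) -> bool:
        # True when bit 7 of CMMD is set.
        return bool(self.cmmd & FINAL_BIT)

    @property
    def opcode(self) -> int:
        # Assumption: the remaining seven bits select the function.
        return self.cmmd & 0x7F


def parse_header(record: bytes) -> Header:
    """Return the four fixed bytes of a data record (assumed order)."""
    if len(record) < 4:
        raise ValueError("a data record needs at least 4 bytes")
    return Header(*record[:4])
```

Everything after the header is left untouched here, which mirrors the point above: its meaning depends entirely on CMMD, arg0, and arg1.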

The diagram below (the internal memory block is not shown) shows a oneproc processor that can perform a single function. That is, it recognizes a single command from the CMMD field of the data record. New functions can be added to a oneproc unit with no change to the external interface. This provides a means to drop-in new computational units as the need arises for the two classes of stream processors I need:
• Independent computational units performing different tasks on the same data sets or different elements in a data set (task parallelism), and
• Multiple units assigned identical computational tasks on the same data sets or different elements in a data set (data parallelism).
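The idea that functions can be added without touching the external interface can be sketched as a dispatch table keyed on the command. This is a Python analogy, not the hardware: the class, the two example functions, and the record layout (CMMD in byte 1, payload from byte 4) are all hypothetical.

```python
class OneProc:
    """Toy oneproc: one run() entry point, any number of registered functions."""

    def __init__(self):
        self._functions = {}

    def register(self, opcode, fn):
        # Adding a function only means reserving a command code.
        if opcode in self._functions:
            raise ValueError(f"command {opcode:#x} is already reserved")
        self._functions[opcode] = fn

    def run(self, record: bytearray):
        opcode = record[1] & 0x7F       # assumed position of CMMD, bit 7 masked off
        self._functions[opcode](record)  # the function interprets the record itself


def checksum(record):
    # Hypothetical function: 8-bit sum of the payload, stored in the last byte.
    record[-1] = sum(record[4:-1]) & 0xFF


def double_in_place(record):
    # Hypothetical function: modifies the payload in place, returns no result.
    for i in range(4, len(record)):
        record[i] = (record[i] * 2) & 0xFF
```

The signature of run() is the same whether one function or M functions are registered, which is the property the text relies on.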

Here is a block diagram of an M-function oneproc processor. Again, adding functions has no impact on the external interface. It only means reserving a CMMD and correctly using the data record for that function:

In later pages I will refer to what I call theResource of a stream processor. There is one theResource per stream processor and it will comprise N oneproc units each with M functions:

In the specific DUT I'll be testing, the HWSP, for "Hello World Stream Processor", there will be two computation units per oneproc and 8 oneproc units. One of the computational units does a Pearson's r correlation computation, and the other does a smoothing function on a set of data. The Pearson's r correlation is done in a "computation unit" up to 21 times. Those 21 iterations on the set of data are controlled by the computation unit control and are transparent at both the read and write interfaces. Results are available in reserved fields of the data frame. The smoothing function returns no results but rather modifies the data in place in the internal memory.

In addition to the block diagrams above it may be helpful to provide explanations of expected use cases of the stream processor, again, independent of any particular function. Here are four such cases. N is the number of oneproc processing units, and M is the number of functions a processing unit can perform.

An NxM stream processor using each of N serially.
The write channel writes when it's ok2write, and the read channel reads when it's ok2read. On reads, it is the responsibility of the user of the stream processor to know how to interpret the read data from the data record. The frameID is provided for such an accounting mechanism.
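One way the user side might do this accounting is a small ledger keyed on frameID, so that results arriving out of order can still be matched to what was issued. This is a sketch only: the class name, the context strings, and the assumption that frameID is byte 0 and the payload starts at byte 4 are illustrative.

```python
class FrameLedger:
    """Tracks issued frames so out-of-order results can be matched by frameID."""

    def __init__(self):
        self._pending = {}

    def issued(self, frame_id, context):
        if frame_id in self._pending:
            raise ValueError(f"frameID {frame_id:#x} is already in flight")
        self._pending[frame_id] = context

    def completed(self, record: bytes):
        frame_id = record[0]                     # assumed position of frameID
        context = self._pending.pop(frame_id)    # KeyError if never issued
        return context, record[4:]               # payload after the fixed bytes

    @property
    def in_flight(self):
        return len(self._pending)
```

For example, if frames 0x5A and 0x5B are issued in that order and 0x5B is read back first, completed() still returns the context stored for 0x5B.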

An NxM stream processor using each of N in parallel.
In this mode all the processors can operate on the same data set performing the same function, albeit with different parameters, or operate on the same data set performing different functions. Using different data sets in the parallel mode may not bring any benefits over just running serially since data is written serially.

An NxM stream processor using each of N in parallel, then aborting and issuing new CMMDs in serial.
In this situation it's a matter of taking the first result that meets a criterion, aborting all unfinished computations, and launching a new set of computation tasks.
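The control flow of this use case can be sketched in a few lines of Python. The function name, the (unit, value) completion tuples, and the threshold in the usage note are made up for illustration; in the real system the abort would be the signal on the write channel.

```python
def run_until_match(completions, criterion, n_units):
    """Take the first result that satisfies criterion, then abort the rest.

    completions: iterable of (unit, value) pairs in completion order.
    Returns (unit, value, aborted_units), or None if nothing matched.
    """
    busy = set(range(n_units))      # all N units start out computing
    for unit, value in completions:
        busy.discard(unit)          # this unit has finished
        if criterion(value):
            aborted = sorted(busy)  # units still running when abort is asserted
            busy.clear()            # abort: everything returns to idle
            return unit, value, aborted
    return None
```

With four units and completions arriving as (2, 0.1), (0, 0.95), the call `run_until_match(..., lambda r: r > 0.9, 4)` accepts the result from unit 0 and reports units 1 and 3 as aborted, after which a new set of CMMDs can be issued.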

An NxM stream processor as an N-stage pipeline.
The initial data is written to the internal memory, the processor is given a CMMD, it performs the function, and result data is written to the internal memory. In order to chain a sequence of functions together to get a streaming effect, successive writes need only change the CMMD/arg0/arg1 fields. The final step of processing needs to set bit 7 of the CMMD. Bit 7 is the "final" bit and when 0 indicates to the read channel that it should NOT read more than bytes 0-3. It should say thanks and free the oneproc unit for further processing. Keeping the same frameID throughout a pipelined process is up to the external task scheduler.
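A short Python sketch of the chaining rule: every step of a chain carries the same frameID, and only the last step has bit 7 of CMMD set. The helper names, the seven-bit function code, and the header byte order are assumptions for illustration.

```python
FINAL_BIT = 0x80  # bit 7 of CMMD marks the final step of a chain


def chain_headers(frame_id, steps):
    """Build the 4-byte headers for a chain of (opcode, arg0, arg1) steps."""
    headers = []
    for i, (opcode, arg0, arg1) in enumerate(steps):
        cmmd = opcode & 0x7F
        if i == len(steps) - 1:
            cmmd |= FINAL_BIT   # only the last step is marked final
        headers.append(bytes([frame_id, cmmd, arg0, arg1]))
    return headers


def bytes_to_read(header, record_len):
    """How much the read channel should fetch for this step."""
    # Final bit clear: read bytes 0-3 only. Final bit set: read the full record.
    return record_len if header[1] & FINAL_BIT else 4
```

For a three-step chain on frame 0x5A, the first two headers tell the reader to fetch only four bytes and free the unit, and the third tells it to fetch the whole record.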

The notation below is <oneproc number> "frame" <frame id>. So "3 frame 5a" is frame ID 5a running on oneproc 3.

Note that if the data set and/or data result is not small, then the oneproc "data" can consist of pointers to a larger memory. In that situation, data in a block of external memory is operated on by successive processing units. "External memory" need not be actual memory: it could be sensor/drive I/O in a continuous control system.

Note also that proc 1 is used for all three frame 1 functions, proc 2 is used for all three frame 2 functions, etc; but that only happens to be the case here. It need not be the case that the same oneproc unit be used to perform all functions in a stream.

The details of how each data record is structured for any given computation unit can be found in dut_pkg.sv. The diagram below shows that inside each oneproc unit is a memory, and with the exception of the first 32-bit word, the meaning of the contents is wholly determined by what's in the dut_pkg. Not to put too fine a point on it:

the meaning of fields in the data record is not fixed,
the length of a data record is not fixed,
the write and read channels are independent, the data bus widths can differ from each other, and are not fixed,

meaning that all those not-fixed parameters are determined by what's in the dut_pkg.svh file. This poses some interesting complications for the testbench in that the drivers and monitors need to be configured by the contents of the dut_pkg.
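As an illustration of why the drivers and monitors depend on the package, consider how many bus transfers a record takes on each channel. The Python sketch below uses a hypothetical configuration object; the field names and the example values are not taken from dut_pkg, and the assumption that a record is sent as whole bus-width transfers is mine.

```python
from dataclasses import dataclass


def _ceil_div(a, b):
    # Integer ceiling division for positive operands.
    return (a + b - 1) // b


@dataclass(frozen=True)
class DutConfig:
    """Stand-in for parameters a testbench would pull from the DUT package."""
    record_bytes: int    # length of a data record
    write_bus_bits: int  # write channel data width
    read_bus_bits: int   # read channel data width

    def write_beats(self):
        # Transfers a driver needs to send one full record.
        return _ceil_div(self.record_bytes * 8, self.write_bus_bits)

    def read_beats(self):
        # Transfers a monitor should expect when a full record is read.
        return _ceil_div(self.record_bytes * 8, self.read_bus_bits)
```

With a 10-byte record, a 32-bit write bus, and an 8-bit read bus, this gives 3 write transfers and 10 read transfers, so the same record looks quite different to the two agents.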

Eventually there will be 4 forms of the stream processors:

Serial Load Data Parallel SLDP (this is the current state of the RTL)
Serial Load Task Parallel SLTP
Parallel Load Data Parallel PLDP
Parallel Load Task Parallel PLTP

So that's a description of the generic stream processor. To give some minimal real content to the DUT I've made a design which I call HWSP, for "Hello World Stream Processor". The details of HWSP are here: A Specific Device Under Test.

This is a work in progress