FIELD OF THE INVENTION
[0001] This invention relates to the instruction set architecture of a dual multiply-accumulator
(MAC) based digital signal processor.
CROSS REFERENCE TO RELATED APPLICATIONS
[0002] This application claims priority under 35 U.S.C. § 119 from U.S. Provisional Application
Serial No. 60/058,157 entitled "NEAR-ORTHOGONAL DUAL-MAC INSTRUCTION SET WITH MINIMAL
ENCODING BITS," filed on September 8, 1997, the contents of which are hereby incorporated
by reference.
BACKGROUND OF THE INVENTION
[0003] A digital signal processor (DSP) is a special-purpose CPU utilized for digital processing
and analysis of signals from analog sources, such as sound. The analog signals are
converted into digital data and analyzed using various algorithms, such as Fast Fourier
Transforms. DSPs are designed for particularly fast performance of certain operations,
such as multiplication, multiplying and accumulating, and shifting and accumulating,
because the math-intensive processing applications for DSPs rely heavily on such operations.
For this reason, a DSP will typically include special hardware circuits to perform
multiplication, accumulation and shifting operations.
[0004] One popular form of DSP architecture is known as a Multiply-Accumulate or MAC processor.
The MAC processor implements an architecture that takes advantage of the fact that
the most common data processing operations involve multiplying two values, then adding
the resulting value to another and accumulating the result. These basic operations
are efficiently carried out utilizing specially configured, high-speed multipliers
and accumulators, hence the "Multiply-Accumulate" nomenclature. In order to increase
the processing power of MAC processors, they have been designed to perform different
processes concurrently. Towards this end, DSP architectures with plural MAC structures
have been developed. For example, a dual MAC processor is capable of performing two
independent MAC operations concurrently.
[0005] A conventional multiply-accumulator (MAC) has a 2-input multiplier M which stores
its output in a product register P. The product register is connected to one input
of a two-input adder A whose output is stored in one of several accumulator registers.
A second input of the adder is connected to the accumulator array to allow for a continuous
series of cumulative operations. Conventional vector processors are made of several
MAC processors operating in parallel. Each MAC operates on its own independent data
stream and the parallel MACs are joined only by a common set of accumulators. The
number of instructions available for each individual MAC is fairly limited and thus,
even when several MACs are combined in a parallel vector processor, the total number
of MAC commands which must be encoded is relatively small.
[0006] The architecture of the newly designed dual-MAC processor shown in Fig. 1 differs
from conventional parallel vector processors by the addition of the cross-connecting
data lines. The dual-MAC architecture of Fig. 1 consists of two 32-bit input registers x and
y (the 16-bit high and low data halves will be referred to as xh and yh, and xl and yl,
respectively) which hold the operands to the two multipliers M0 and M1. The x and y
registers are cross-connected to both of the multipliers so that each multiplier
can operate on any two of the four possible input factors. The products p0 and p1 are
accumulated with the contents of any of the accumulators a0 to a7 by the two adders
A0 and A1. The p0 product is also cross-connected to the A1 adder, which is capable of
3-input addition.
In the preferred embodiment, the dual-MAC processor is implemented in conjunction
with an aligned double word memory architecture which can return two double words
in a single 32-bit fetch.
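The datapath described above can be modeled loosely in software. The following Python sketch is illustrative only and is not part of the disclosed hardware: the class name DualMac and its methods are hypothetical, register widths and overflow are ignored, and the three-input form aD = aS +/- p0 +/- p1 is assumed from the description of adder A1.

```python
from dataclasses import dataclass, field


@dataclass
class DualMac:
    """Toy model of the cross-connected dual-MAC datapath of Fig. 1 (hypothetical)."""
    xh: int = 0
    xl: int = 0
    yh: int = 0
    yl: int = 0
    p0: int = 0
    p1: int = 0
    a: list = field(default_factory=lambda: [0] * 8)  # accumulators a0..a7

    def multiply(self, m0=None, m1=None):
        """Drive M0 and/or M1; each argument is a pair of operand names.

        Because x and y are cross-connected to both multipliers, either
        multiplier may be given any two of "xh", "xl", "yh", "yl".
        """
        if m0 is not None:
            self.p0 = getattr(self, m0[0]) * getattr(self, m0[1])
        if m1 is not None:
            self.p1 = getattr(self, m1[0]) * getattr(self, m1[1])

    def accumulate2(self, d, s, sign=+1):
        """Two-input add on the p0 path: aD = aS +/- p0."""
        self.a[d] = self.a[s] + sign * self.p0

    def accumulate3(self, d, s, sign0=+1, sign1=+1):
        """Three-input add on adder A1 (assumed form): aD = aS +/- p0 +/- p1."""
        self.a[d] = self.a[s] + sign0 * self.p0 + sign1 * self.p1


mac = DualMac(xh=2, xl=3, yh=5, yl=7)
mac.multiply(m0=("xh", "yh"), m1=("xl", "yl"))  # p0 = 2 * 5, p1 = 3 * 7
mac.accumulate3(d=0, s=0)                       # a0 = a0 + p0 + p1
```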
[0007] This cross-connected dual-MAC architecture allows a single FIR or IIR digital filter
applied to a single data stream to be processed by both MACs in parallel, two taps
at a time, where each "tap" is a multiply-accumulate operation. Conventional vector
processors with no interconnects can compute two FIR filters in parallel, but each
filter is processed one tap at a time. Thus, for a single FIR or IIR filter, the cross-connected
architecture operates twice as quickly as a conventional vector processor.
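The two-taps-at-a-time behavior can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the disclosed implementation: the function names are hypothetical, an even number of taps is assumed, and the placement of coefficients in x and samples in y is chosen only for illustration.

```python
def fir_single_mac(coeffs, samples):
    """One tap per step, as on a single conventional MAC."""
    acc = 0
    steps = 0
    for c, s in zip(coeffs, samples):
        acc += c * s  # one multiply-accumulate per step
        steps += 1
    return acc, steps


def fir_dual_mac(coeffs, samples):
    """Two taps per step, modeling the cross-connected dual-MAC."""
    if len(coeffs) != len(samples) or len(coeffs) % 2:
        raise ValueError("this sketch assumes equal, even lengths")
    acc = 0
    steps = 0
    for i in range(0, len(coeffs), 2):
        xh, xl = coeffs[i], coeffs[i + 1]    # one double-word fetch into x
        yh, yl = samples[i], samples[i + 1]  # one double-word fetch into y
        p0 = xh * yh                         # multiplier M0
        p1 = xl * yl                         # multiplier M1
        acc = acc + p0 + p1                  # three-input add on A1
        steps += 1
    return acc, steps


coeffs = [1, 2, 3, 4]
samples = [5, 6, 7, 8]
print(fir_single_mac(coeffs, samples))  # (70, 4)
print(fir_dual_mac(coeffs, samples))    # (70, 2)
```

Both functions return the same filter output; the dual-MAC model reaches it in half the number of steps because the p0 and p1 products feed a single accumulator through the three-input adder.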
[0008] Figure 2 is a list of all the possible instruction commands for the cross-connected
dual-MAC architecture of Fig. 1. The commands are divided into accumulate statements
and product statements. Each statement represents either a single operation done on
one MAC side or the other, or two operations done on both sides in parallel. When
these commands are implemented as part of architected instructions, each will either
be an accumulate statement (add), a product statement (multiply), or a combination
of addition and multiplication.
[0009] As shown in Fig. 2, there are 12 possible accumulate combinations and 20 possible
product combinations in the orthogonal dual-MAC instruction set. Thus, the total number
of commands which can be architected in the command processor and encoded within the
commands is 12 * 20 + 12 + 20 = 272. The architected cross-connections result in a
combinatorial multiplication of the number of possible functions which can be encoded
as architected commands.
[0010] An issue which arises with this architecture is that encoding 272 separate dual-MAC
operations within a command code requires 9 bits. It is advantageous to reduce the
number of bits required to encode dual-MAC instructions without impacting available
functionality. This is especially true when the number of bits available to encode
commands is limited and other commands must also be encoded within the same limited
number of bits. In the specific dual-MAC processor at issue, only 7 bits have been
dedicated to encoding commands for dual-MAC operations.
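The arithmetic behind the 272-command count and the 9-bit requirement can be checked with a few lines of Python. The function names below are hypothetical and serve only to restate the calculation given above.

```python
def combined_commands(n_accumulate: int, n_product: int) -> int:
    """Count accumulate/product pairs plus each statement type used alone."""
    pairs = n_accumulate * n_product
    return pairs + n_accumulate + n_product


def bits_needed(n_commands: int) -> int:
    """Smallest number of bits b such that 2**b >= n_commands."""
    return (n_commands - 1).bit_length()


orthogonal = combined_commands(12, 20)
print(orthogonal, bits_needed(orthogonal))  # 272 9
```

Since 2^8 = 256 is less than 272, eight bits are insufficient, and a 7-bit field (128 codes) falls well short of the orthogonal set.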
[0011] According to the present invention, a near-orthogonal dual-MAC instruction set is
provided which implements virtually the entire functionality of the orthogonal instruction
set of 272 commands using only 65 commands. The reduced instruction set is achieved
by eliminating instructions based on symmetry with respect to the result of the commands
and by imposing simple restrictions related to items such as the order of data presentation
by the programmer. Specific selections of commands are also determined by the double
word aligned memory architecture which is associated with the dual-MAC architecture.
The reduced instruction set architecture preserves the functionality and inherent
parallelism of the command set and requires fewer command bits to implement than the
full orthogonal set.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The foregoing and other features of the present invention will be more readily apparent
from the following detailed description and drawings of illustrative embodiments of
the invention in which:
FIG. 1 is a simplified block diagram of a dual-MAC processor.
FIG. 2 is a table showing an orthogonal command set for the dual-MAC processor of
Fig. 1.
FIG. 3 is a table showing a near-orthogonal command set according to the present invention
for the dual-MAC processor of Fig. 1.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0013] As shown in Fig. 2, there are 12 possible accumulate statements and 20 possible product
statements for the dual-MAC processor shown in Fig. 1, which can be architected separately
or in combination. The reduction of the orthogonal instruction set to provide a near-orthogonal
reduced instruction set is discussed with respect to each function type.
[0014] There are four possible single two-input accumulate statements:
aD = aS + p0
aD = aS - p0
aD = aS + p1
aD = aS - p1
where aD and aS are any of the 8 accumulators for the dual-MAC processor of Fig.
1. It should be noted that, in addition to encoding the dual-MAC command itself, the
identity of the D and S accumulators to which the command is directed also must be
stored within the command op code. When one of these commands is executed, only one
of the two MAC paths will be engaged. The other will be idle. Because the result of
the accumulate statement is stored in an accumulator which is equally accessible to
both MAC paths, there is no need to encode commands for both paths. By default, the
p0 path is chosen. The reduced command set is therefore: aD = aS +/- p0.
[0015] There are four possible 3-input accumulate statements:
aD = aS + p0 + p1
aD = aS + p0 - p1
aD = aS - p0 + p1
aD = aS - p0 - p1
Theoretically, a reduction of these commands could be made by recognizing that
p0 and p1 may be "swapped" if the programmer adjusts the order of the inputs to the
multiplier accordingly. However, this strategy cannot be used to reduce both the accumulate
instruction set and the product instruction set, discussed below, because the accumulate
and product commands must be capable of being encoded as accumulate/product pairs.
Thus, only one of the two commands in the pair can be reduced in this manner. The
input-swap strategy provides the greatest reduction for the product statements, discussed
below, and therefore all four of the 3-input accumulate statements are implemented.
[0016] There are four possible dual two-input accumulate statements, each consisting of
two-input accumulate statements which are executed in parallel:
aD0 = aS0 + p0    aD1 = aS1 + p1
aD0 = aS0 + p0    aD1 = aS1 - p1
aD0 = aS0 - p0    aD1 = aS1 + p1
aD0 = aS0 - p0    aD1 = aS1 - p1
where aD0, aD1, aS0, and aS1 each refer to one of the 8 accumulators. The number
of commands in this set is not reduced for the same reasons discussed with respect
to the three-input accumulate statements, above. However, a reduction in encoding
bits elsewhere in the command op code may be achieved by limiting the accumulators
that can be selected in the command to adjacent accumulator pairs. In the preferred
embodiment, aD0 is limited to even accumulator addresses and aD1 is defined as the
accumulator aD0+1. The consecutive pairs of accumulators are designated aD and aDP.
Similarly, accumulator pairs aS0 and aS1 are limited to aS and aSP. This reduction
provides a modified command set of:
aD = aS + p0    aDP = aSP + p1
aD = aS + p0    aDP = aSP - p1
aD = aS - p0    aDP = aSP + p1
aD = aS - p0    aDP = aSP - p1
Although four commands must still be encoded, only one accumulator of the pair needs
to be specified since the commands are limited to adjacent accumulator pairs. Because
there are eight accumulators, three bits are required to identify an accumulator.
By limiting the accumulators to adjacent pairs so that only two accumulators need
to be identified, as opposed to four, and selecting the even accumulator of the pair
to define, the total number of bits needed to identify the accumulators in the op
code is reduced from 12 to 4.
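The reduction from 12 bits to 4 can be illustrated by packing the two even-numbered accumulator addresses into a 4-bit field. The Python sketch below is an assumption-laden illustration: the field layout (aD in the upper two bits, aS in the lower two) and the function names are hypothetical and are not specified in this description.

```python
def encode_pair_fields(aD: int, aS: int) -> int:
    """Pack even accumulator numbers aD and aS into 4 bits (2 bits each).

    The low bit of an even address is always 0, so it need not be stored.
    """
    for reg in (aD, aS):
        if not 0 <= reg < 8 or reg % 2:
            raise ValueError("aD and aS must be even accumulators in a0..a6")
    return ((aD >> 1) << 2) | (aS >> 1)


def decode_pair_fields(bits: int):
    """Recover (aD, aDP, aS, aSP) from the hypothetical 4-bit field."""
    aD = ((bits >> 2) & 0b11) << 1
    aS = (bits & 0b11) << 1
    return aD, aD + 1, aS, aS + 1  # aDP and aSP are the adjacent odd registers


field_bits = encode_pair_fields(4, 2)
print(field_bits, decode_pair_fields(field_bits))  # 9 (4, 5, 2, 3)
```

Without the pairing restriction, four independent 3-bit accumulator numbers (12 bits) would be required for the same dual statement.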
[0017] There are four possible factors which can be presented to the multipliers. These
factors are stored in two divided registers. The x-register holds the xh and xl factors
and the y-register holds the yh and yl factors. Because of the cross-connected architecture,
any two of the four factors may be input to each multiplier M0, M1.
[0019] There are twelve possible dual product statements:
1.  p0 = xh * yh    p1 = xh * yl
2.  p0 = xh * yh    p1 = xl * yh
3.  p0 = xh * yh    p1 = xl * yl
4.  p0 = xh * yl    p1 = xh * yh
5.  p0 = xh * yl    p1 = xl * yh
6.  p0 = xh * yl    p1 = xl * yl
7.  p0 = xl * yh    p1 = xh * yh
8.  p0 = xl * yh    p1 = xh * yl
9.  p0 = xl * yh    p1 = xl * yl
10. p0 = xl * yl    p1 = xh * yh
11. p0 = xl * yl    p1 = xh * yl
12. p0 = xl * yl    p1 = xl * yh
Several of these commands result in identical multiplications being performed, differing
only in which MAC processor is used and thus which product register the result appears
in. The symmetric pairs are 1-4, 2-7, 3-10, 5-8, 6-11, and 9-12. Thus, a first reduction
can take advantage of this symmetry and encode only one command of each symmetric
pair. The reduction results in the 6 commands shown below:
1.  p0 = xh * yh    p1 = xh * yl
2.  p0 = xh * yh    p1 = xl * yh
3.  p0 = xh * yh    p1 = xl * yl
5.  p0 = xh * yl    p1 = xl * yh
6.  p0 = xh * yl    p1 = xl * yl
9.  p0 = xl * yh    p1 = xl * yl
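The twelve dual product statements and their reduction by symmetry can be reproduced mechanically. The following Python sketch is illustrative; the variable names are hypothetical, and the numbering simply follows the order of the list above.

```python
from itertools import permutations

# The four x-by-y products available to each multiplier.
PRODUCTS = ["xh * yh", "xh * yl", "xl * yh", "xl * yl"]

# Ordered (p0, p1) pairs of distinct products: 4 * 3 = 12 statements,
# generated in the same order as statements 1 to 12.
dual = list(permutations(PRODUCTS, 2))

# Two statements are symmetric when they compute the same two products
# and differ only in which product register receives which result.
first_seen = {}
symmetric_pairs = []
for number, (p0, p1) in enumerate(dual, start=1):
    key = frozenset((p0, p1))
    if key in first_seen:
        symmetric_pairs.append((first_seen[key], number))
    else:
        first_seen[key] = number

reduced = sorted(first_seen.values())
print(len(dual))                # 12
print(sorted(symmetric_pairs))  # [(1, 4), (2, 7), (3, 10), (5, 8), (6, 11), (9, 12)]
print(reduced)                  # [1, 2, 3, 5, 6, 9]
```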
[0020] The set can be further reduced by recognizing that "nearly-symmetric" pairs can be
eliminated by relying on the ability of the programmer to direct data into the x-
or y-register as desired. For items 1 and 2, note that the p0 operations are identical.
The p1 operations differ only in which register the high-word factor and low-word
factor are chosen from. Switching the x- and y- register data in command 2 gives the
same result as command 1. Thus, only one of the two commands needs to be implemented.
No functionality is lost because the programmer can simply switch the order of the
inputs. Items 6 and 9 are also nearly symmetric.
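The near-symmetry of items 1 and 2, and of items 6 and 9, can be checked symbolically. In the Python sketch below, which is an illustration with hypothetical names, each product is written as a pair (x half, y half), and swap_xy expresses what a statement computes, in terms of the original data, when the programmer loads the x data into the y-register and the y data into the x-register.

```python
H, L = "h", "l"

# Each statement is (p0, p1); each product is (x half, y half).
CMD1 = ((H, H), (H, L))  # p0 = xh * yh, p1 = xh * yl
CMD2 = ((H, H), (L, H))  # p0 = xh * yh, p1 = xl * yh
CMD6 = ((H, L), (L, L))  # p0 = xh * yl, p1 = xl * yl
CMD9 = ((L, H), (L, L))  # p0 = xl * yh, p1 = xl * yl


def swap_xy(stmt):
    """Rewrite a statement as seen after exchanging the x and y register data.

    With the registers exchanged, "xh * yl" multiplies the old yh by the
    old xl, which is the product "xl * yh" of the original data.
    """
    return tuple((y_half, x_half) for (x_half, y_half) in stmt)


print(swap_xy(CMD2) == CMD1)  # True
print(swap_xy(CMD9) == CMD6)  # True
```

Because the exchange involves only which register each double word is loaded into, implementing one command of each such pair loses no functionality.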
[0021] It should be noted that in the preferred embodiment, the dual-MAC processor is implemented
with an aligned double word memory architecture. As a result, this near-symmetry is
not available with respect to items 3 and 5. Although in theory, the programmer has
absolute control over where the factor data is stored in the registers and thus how
it is presented to the dual-MAC processor, the aligned double word memory architecture
of the preferred embodiment provides for two data values to be fetched in a single
double word operation and stored in the x- or y-register. Switching which register
the two data values are stored in does not carry with it a performance penalty. However,
dividing the data requires extra commands and therefore carries a performance penalty.
To avoid this situation, both commands 3 and 5 are implemented. The final reduced
dual product command set is:
p0 = xh * yh    p1 = xh * yl
p0 = xh * yh    p1 = xl * yl
p0 = xh * yl    p1 = xl * yh
p0 = xl * yh    p1 = xl * yl
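The reason commands 3 and 5 are both retained can be seen numerically. The Python sketch below is illustrative only: the names are hypothetical, and each register is modeled as a (high, low) pair as delivered by one aligned double-word fetch.

```python
HI, LO = 0, 1

CMD3 = ((HI, HI), (LO, LO))  # p0 = xh * yh, p1 = xl * yl
CMD5 = ((HI, LO), (LO, HI))  # p0 = xh * yl, p1 = xl * yh


def run(stmt, x, y):
    """Evaluate a dual product statement on (high, low) register pairs."""
    (x0, y0), (x1, y1) = stmt
    return x[x0] * y[y0], x[x1] * y[y1]


x = (2, 3)  # xh, xl from one double-word fetch
y = (5, 7)  # yh, yl from one double-word fetch

print(run(CMD3, x, y))  # (10, 21)
print(run(CMD5, x, y))  # (14, 15)

# Emulating command 5 with command 3 requires exchanging the two halves
# inside the y-register, that is, splitting a fetched double word.
y_halves_swapped = (y[LO], y[HI])
print(run(CMD3, x, y_halves_swapped))  # (14, 15)
```

The half exchange in the last step stands for the extra commands that dividing the data would require, which is the performance penalty avoided by implementing both commands.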
[0022] The complete reduced command set is illustrated in Fig. 3. There are 10 different
accumulate statements and 5 different product statements, resulting in a total number
of architected commands equal to 10 * 5 + 10 + 5 = 65. Encoding these operations with
7 bits of the op code allows for an additional 63 commands to be implemented without
increasing the number of required bits.
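The 65-command count can be confirmed by enumerating the reduced set and assigning each command a code. The Python sketch below is a hypothetical illustration: the opcode numbering is arbitrary and not taken from Fig. 3, the three-input forms are assumed to be aD = aS +/- p0 +/- p1, and "||" is a notation chosen here for statements executed in parallel.

```python
SIGNS = ["+", "-"]

two_input = [f"aD = aS {s} p0" for s in SIGNS]
three_input = [f"aD = aS {s0} p0 {s1} p1" for s0 in SIGNS for s1 in SIGNS]
dual_two_input = [
    f"aD = aS {s0} p0 || aDP = aSP {s1} p1" for s0 in SIGNS for s1 in SIGNS
]
accumulate = two_input + three_input + dual_two_input  # 2 + 4 + 4 = 10

product = [
    "p0 = xh * yh",                     # the single product statement
    "p0 = xh * yh || p1 = xh * yl",
    "p0 = xh * yh || p1 = xl * yl",
    "p0 = xh * yl || p1 = xl * yh",
    "p0 = xl * yh || p1 = xl * yl",
]                                       # 1 + 4 = 5

# Accumulate alone, product alone, and every accumulate/product pair.
commands = (
    [(a, None) for a in accumulate]
    + [(None, p) for p in product]
    + [(a, p) for a in accumulate for p in product]
)

# Arbitrary sequential codes; the real encoding is not specified here.
opcodes = {command: code for code, command in enumerate(commands)}

OPCODE_BITS = 7
spare = 2 ** OPCODE_BITS - len(commands)
print(len(commands), max(opcodes.values()), spare)  # 65 64 63
```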
1. A method for constructing a reduced set of instructions for operating a cross-connected
dual-MAC processor having a complete command set including at least 4 two-input accumulator
statements, at least 4 three-input accumulator statements, at least 4 dual two-input
accumulator statements, at least 8 single product statements, and at least 12 dual
product statements, comprising the steps of:
including in said instruction set only two-input accumulate statements associated
with a particular one of said two MAC processors;
including in said instruction set said three-input accumulator statements;
including in said instruction set said dual two-input accumulator statements and restricting
said dual two-input accumulator statements to adjacently numbered pairs of accumulators;
including in said instruction set only one single product statement; and
including in said instruction set only one dual product statement from each of six
symmetric dual product statement pairs.
2. The method of claim 1, further comprising the step of additionally reducing said instruction
set by including in said instruction set only one dual product statement from a nearly-symmetric
pair of dual product statements.
3. A method of constructing a reduced set of instructions for controlling a pair of multipliers
in a processor having a cross-connected dual-MAC architecture supporting 4n possible
dual-product multiplier statements, said method comprising the steps of:
arranging the 4n dual-product statements into 2n symmetric pairs of dual-product statements;
and
including in said instruction set only one dual product statement from each of said
2n symmetric dual product statement pairs.
4. The method of claim 3, further comprising the steps of:
arranging the 2n dual-product statements selected from said 2n symmetric pairs into
n pairs of nearly-symmetric dual-product statements; and
for at least one of said n nearly-symmetric pairs, including in said instruction set
only one dual-product statement from said at least one of n nearly-symmetric pairs.
5. In a microprocessor including two cross-connected MAC processors, each MAC processor
having a multiplier connected to an adder, said adders connected to an accumulator
having a plurality of registers, a reduced set of instructions in which the instructions
for controlling said multipliers and adders are limited essentially to:
two single two-input accumulate statements associated with a particular one of said
two MAC processors;
four dual two-input accumulate statements, each of said dual two-input accumulate
statements restricted to acting on predefined pairs of accumulator registers;
four three-input accumulate statements;
one single-product statement; and
four dual-product statements, no two of which form a symmetric pair.
6. The microprocessor of claim 5, wherein:
said accumulator comprises eight accumulator registers;
said predefined pairs of registers comprise four pairs of numerically adjacent registers;
and
two bits are reserved in said dual two-input accumulate statements to identify an accumulator
pair.
7. In a microprocessor including four operand registers xh, xl, yh, and yl, selectively
connected as inputs for first and second two-input multipliers having output product
registers p0 and p1, a first adder receiving as input the value in p0 and the value
of a register selected from an accumulator array and providing an output to the accumulator
array, a second adder receiving as input the value in p0, the value in p1, and a data
value selected from a register in the accumulator array and providing an output to
the accumulator array, said microprocessor connected to a memory system supporting
aligned-double word fetches of data, a reduced set of instructions, in which the instructions
for controlling said multipliers and adders are limited essentially to:
two two-input accumulate statements of the form:
aD = aS +/- p0
where aD and aS indicate destination and source accumulator registers, respectively;
four three-input accumulate statements of the form:
aD = aS +/- p0 +/- p1
four dual two-input accumulate statements of the form:
aD = aS +/- p0    aDP = aSP +/- p1
where aDP and aSP indicate destination and source accumulator registers, respectively,
the destination registers aD and aDP and the source registers aS and aSP each indicating
a predefined pair of accumulator registers;
one single product statement of the form p0 = xh * yh; and
four dual product statements of the form:
p0 = xh * yh    p1 = xh * yl;
p0 = xh * yh    p1 = xl * yl;
p0 = xh * yl    p1 = xl * yh;
p0 = xl * yh    p1 = xl * yl.
8. A reduced set of instructions for controlling a microprocessor including two cross-connected
MAC processors, each MAC processor having a multiplier connected to an adder, said
adders connected to an accumulator having a plurality of registers, in which the instructions
in said instruction set for controlling said multipliers and adders are limited essentially
to:
two single two-input accumulate statements associated with a particular one of said
two MAC processors;
four dual two-input accumulate statements, each of said dual two-input accumulate
statements restricted to acting on predefined pairs of accumulator registers;
four three-input accumulate statements;
one single-product statement; and
four dual-product statements, no two of which form a symmetric pair.
9. The instruction set of claim 5, wherein:
said accumulator comprises eight accumulator registers;
said predefined pairs of registers comprise four pairs of numerically adjacent registers;
and
two bits are reserved in said dual two-input accumulate statements to identify an accumulator
pair.
10. A reduced set of instructions for controlling a microprocessor including four operand
registers xh, xl, yh, and yl, selectively connected as inputs for first and second
two-input multipliers having output product registers p0 and p1, a first adder receiving
as input the value in p0 and the value of a register selected from an accumulator
array and providing an output to the accumulator array, a second adder receiving as
input the value in p0, the value in p1, and a data value selected from a register
in the accumulator array and providing an output to the accumulator array, said microprocessor
connected to a memory system supporting aligned-double word fetches of data, in which
the instructions in said instruction set for controlling said multipliers and adders
are limited essentially to:
two two-input accumulate statements of the form:
aD = aS +/- p0
where aD and aS indicate destination and source accumulator registers, respectively;
four three-input accumulate statements of the form:
aD = aS +/- p0 +/- p1
four dual two-input accumulate statements of the form:
aD = aS +/- p0    aDP = aSP +/- p1
where aDP and aSP indicate destination and source accumulator registers, respectively,
the destination registers aD and aDP and the source registers aS and aSP each indicating
a predefined pair of accumulator registers;
one single product statement of the form p0 = xh * yh; and
four dual product statements of the form:
p0 = xh * yh    p1 = xh * yl;
p0 = xh * yh    p1 = xl * yl;
p0 = xh * yl    p1 = xl * yh;
p0 = xl * yh    p1 = xl * yl.