1 Replies - 3954 Views - Last Post: 02 April 2011 - 05:11 PM

#1 TopdeK   User is offline

  • New D.I.C Head

Reputation: 1
  • View blog
  • Posts: 3
  • Joined: 02-April 11

MIPS64 Optimization of Code

Posted 02 April 2011 - 05:06 PM

Hi all,

I was wondering if you could take a look at some code for me and see if I've done enough to optimize the code. Firstly, just to explain, my assignment was to translate a C++ program used to find the legendre symbol for two values X and P into a working MIPS64 code. Then after translating the C++ code, optimize the MIPS64 code. I have translated the code to the best of my ability and then I optimized it too. I have it down to 278 cycles using the values X: 40902 and P: 84143. My biggest issues seem to be Structural Stalls (25) and Branch Taken Stalls (25). I'm using a simulator (winmips64) provided by my lecturer and its got some limitations. I've provided the Instruction Set for the simulator below. Finally, the latency for my pipelines is as follows: Addition - 3, Multiplication - 4 and Division - 10. Can anyone spot a simple rescheduling of code to reduce it from 278. If not, I'm happy to submit it at 278. If rumor is true, it should get me about 87% of the marks. But I'm aiming for top 5 - 10%.

Thanks in advance. No worries if nothing can be done. I'm just looking for an outside perspective.

NOTE: For t = t % x, I used the algorithm: "t - [t / x] * x"


x:	.word 40902
p:	.word 84143

	ld r1, x(r0)		; load x
	ld r2, p(r0)		; load p

	daddi r15, r0, 1	; value of 1 for check with while loop
	dadd r3, r0, r0		; variable m set to zero

main:	daddi r30, r2, 0	; set R30 to p

	beq r30, r15, end	; if equal to 1, finish ... [while(p > 1)]

for:	andi r16, r1, 1		; x % 2
	andi r5, r2, 7		; p8 = p % 8
	bnez r16, if		; if x is odd, branch.
	daddi r4, r4, 1		; increment k by 1
	dsrl r1, r1, 1		; divide x by 2
	j for			; loop around

if:	andi r17, r4, 1		; k % 2
	dmul r18, r5, r5	; p8 * p8
	beqz r17, skip		; if k is even, branch
	nop			; dmul is not ready with R18. RAW stall
	daddi r18, r18, -1	; (p8 * p8) - 1
	dsrl r27, r18, 3	; [(p8 * p8) - 1] / 8
	dadd r3, r3, r27	; add to m

skip:	ddiv r19, r2, r1	; p / x ... since t=p and t=t%x, optimize as p%x
	daddi r5, r5, -1	; p8 - 1
	andi r23, r1, 3		; x % 4
	daddi r29, r23, -1	; [x % 4] - 1
	dmul r24, r29, r5	; (p8 - 1) * [(x % 4) - 1]
	daddi r31, r0, -1	; -1 for negative output at end, rescheduled to reduce stalls
	dadd r4, r0, r0		; variable k
	dsrl r25, r24, 2	; divide by 4
	dadd r3, r3, r25	; add to m
	dmul r20, r19, r1	; (p / x) * x
	andi r3, r3, 1		; m % 2
	dadd r2, r0, r1		; p = x
	dsub r6, r30, r20	; p - [(p / x) * x] ... Assign to t
	dadd r1, r0, r6		; x = t
	j main			; jump back to check condition of while loop
end:	bnez r3, minus
	daddi r10, r10, 1
	j exit

minus:	dadd r10, r0, r31

exit:	halt

;r1 = x
;r2 = p
;r3 = m
;r4 = k
;r5 = p8
;r6 = t
;r10 = legendre symbol

int legendre(int x, int p)
	int m, k, p8, t;

	m = 0;
	while(p > 1)
		for(k=0; x%2==0; k++)

		p8 = p%8;

	if(m==0) return 1;
	else return -1;

The following assembler directives are supported

.data                 - start of data segment
.text                 - start of code segment
.code                 - start of code segment (same as .text)  
.org    <n>           - start address
.space  <n>           - leave n empty bytes
.asciiz <s>           - enters zero terminated ascii string
.ascii  <s>           - enter ascii string
.align  <n>           - align to n-byte boundary
.word   <n1>,<n2>..   - enters word(s) of data (64-bits)
.byte   <n1>,<n2>..   - enter bytes
.word32 <n1>,<n2>..   - enters 32 bit number(s)
.word16 <n1>,<n2>..   - enters 16 bit number(s)
.double <n1>,<n2>..   - enters floating-point number(s)

where <n> denotes a number like 24, <s> denotes a string like "fred", and
<n1>,<n2>.. denotes numbers seperated by commas. The integer registers can 
be referred to as r0-r31, or R0-R31, or $0-$31 or using standard MIPS 
pseudo-names, like $zero for r0, $t0 for r8 etc. Note that the size of an 
immediate is limited to 16-bits. The maximum size of an immediate register 
shift is 5 bits (so a shift by greater than 31 is illegal).

Floating point registers can be referred to as f0-f31, or F0-F31

The following instructions are supported

lb      - load byte
lbu     - load byte unsigned
sb      - store byte
lh      - load 16-bit half-word
lhu     - load 16-bit half word unsigned
sh      - store 16-bit half-word
lw      - load 32-bit word
lwu     - load 32-bit word unsigned
sw      - store 32-bit word
ld      - load 64-bit double-word
sd      - store 64-bit double-word
l.d     - load 64-bit floating-point
s.d     - store 64-bit floating-point
halt    - stops the program

daddi   - add immediate
daddui  - add immediate unsigned
andi    - logical and immediate
ori     - logical or immediate
xori    - exclusive or immediate
lui     - load upper half of register immediate

slti    - set if less than or equal immediate
sltiu   - set if less than or equal immediate unsigned

beq     - branch if pair of registers are equal
bne     - branch if pair of registers are not equal
beqz    - branch if register is equal to zero
bnez    - branch if register is not equal to zero

j       - jump to address
jr      - jump to address in register
jal     - jump and link to address (call subroutine)
jalr    - jump and link to address in register (call subroutine)

dsll    - shift left logical
dsrl    - shift right logical
dsra    - shift right arithmetic
dsllv   - shift left logical by variable amount 
dsrlv   - shift right logical by variable amount
dsrav   - shift right arithmetic by variable amount
movz    - move if register equals zero
movn    - move if register not equal to zero
nop     - no operation
and     - logical and
or      - logical or
xor     - logical xor
slt     - set if less than
sltu    - set if less than unsigned
dadd    - add integers
daddu   - add integers unsigned
dsub    - subtract integers
dsubu   - subtract integers unsigned
dmul    - signed integer multiplication
dmulu   - unsigned integer multiplication
ddiv    - signed integer division
ddivu   - unsigned integer division

add.d   - add floating-point
sub.d   - subtract floating-point
mul.d   - multiply floating-point
div.d   - divide floating-point
mov.d   - move floating-point
cvt.d.l - convert 64-bit integer to a double floating-point format
cvt.l.d - convert double floating-point to a 64-bit integer format
c.lt.d  - set FP flag if less than
c.le.d  - set FP flag if less than or equal to
c.eq.d  - set FP flag if equal to
bc1f    - branch to address if FP flag is FALSE
bc1t    - branch to address if FP flag is TRUE 
mtc1    - move data from integer register to floating-point register
mfc1    - move data from floating-point register to integer register

Memory Mapped I/O area

Addresses of CONTROL and DATA registers

CONTROL: .word32 0x10000
DATA:    .word32 0x10008

Set CONTROL = 1, Set DATA to Unsigned Integer to be output
Set CONTROL = 2, Set DATA to Signed Integer to be output
Set CONTROL = 3, Set DATA to Floating Point to be output
Set CONTROL = 4, Set DATA to address of string to be output
Set CONTROL = 5, Set DATA+5 to x coordinate, DATA+4 to y coordinate, and DATA to RGB colour to be output
Set CONTROL = 6, Clears the terminal screen
Set CONTROL = 7, Clears the graphics screen
Set CONTROL = 8, read the DATA (either an integer or a floating-point) from the keyboard
Set CONTROL = 9, read one byte from DATA, no character echo.

Is This A Good Question/Topic? 0
  • +

Replies To: MIPS64 Optimization of Code

#2 macosxnerd101   User is offline

  • Games, Graphs, and Auctions
  • member icon

Reputation: 12742
  • View blog
  • Posts: 45,926
  • Joined: 27-December 08

Re: MIPS64 Optimization of Code

Posted 02 April 2011 - 05:11 PM

Moved to Assembly.
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1