【Question Title】: Optimize Sieve of Eratosthenes in MIPS
【Posted】: 2015-12-06 21:31:35
【Question】:

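I have made some progress optimizing my Sieve of Eratosthenes code, but I need to further improve the instruction count and the data cache hit rate. Any help is appreciated.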

    .data           # the data segment to store global data
space:  .asciiz " "     # whitespace to separate prime numbers

    .text           # the text segment to store instructions
    .globl  main        # define main to be a global label
main:   li  $s0, 0x00000000 # initialize $s0 with zeros
    nor $s1, $s0, $s0   # set $s1 to all ones (0xFFFFFFFF); saves one ALU count over using li
    li  $t9, 200    # find prime numbers from 2 to $t9

    add $s2, $sp, 0 # backup bottom of stack address in $s2

    li  $t0, 2      # set counter variable to 2

init:   sw  $s1, ($sp)  # write ones to the stackpointer's address
    add $t0, $t0, 1 # increment counter variable
    sub $sp, $sp, 4 # subtract 4 bytes from stackpointer (push)
    bne     $t0, $t9, init  # take loop if $t0 != $t9, changed ble to bne
    addi    $t8, $t0, 15    # approximate square root of 200
    li  $t0, 1      # reset counter variable to 1

outer:  add     $t0, $t0, 2 # increment counter variable (start at 2)
    mul $t1, $t0, $t0   # squaring $t0 and save it to $t1
    bgt $t1, $t8, print # start printing prime numbers when $t1 > $t8, changed so only bgt if $t1 > square root

check:  add $t2, $s2, 0 # save the bottom of stack address to $t2
    sll $t3, $t0, 2 # calculate the number of bytes to jump over
    sub $t2, $t2, $t3   # subtract them from bottom of stack address
    add $t2, $t2, 8 # add 2 words - we started counting at 2!

    lw  $t3, ($t2)  # load the content into $t3

    beq $t3, $s0, outer # only 0's? go back to the outer loop

inner:  add $t2, $s2, 0 # save the bottom of stack address to $t2
    sll $t3, $t1, 2 # calculate the number of bytes to jump over

    add     $t4, $t1, 2 # save $t1 + 2 into $t4, added
    sll     $t5, $t4, 2 # mul by 4, added

    sub $t2, $t2, $t3   # subtract them from bottom of stack address

    add $t2, $t2, 8 # add 2 words - we started counting at 2!

    sw  $s0, ($t2)  # store 0's -> it's not a prime number!

    add $t1, $t1, $t0   # do this for every multiple of $t0

    add $t1, $t1, $t0   # adding $t0 to $t1, added
    sub $t2, $t2, $t5   # save $t2 - $t5 into $t2, added
    add $t2, $t2, 8 # adding 8 to $t2, added
    sw  $s0, ($t2)  # store contents of $s0 at address contained in $t2, added
    add $t4, $t4, $t0   # adding $t0 to $t4, added

    blt $t1, $t9, inner # some multiples left? go back to inner loop, changed to blt and branching to inner

    j   outer       # every multiple done? go back to outer loop, changed to branching to outer

print:  li  $t0, 1      # reset counter variable to 1

    # hard coding a 2
    li  $v0, 1
    li  $a0, 2      # load the constant 2 ($a0 is not guaranteed to be zero here)
    syscall

    # hard coding a space
    li  $v0, 4
    la  $a0, space
    syscall

count:  add $t0, $t0, 2 # increment counter variable (start at 2), skipping even numbers

    bgt $t0, $t9, exit  # make sure to exit when all numbers are done (branch to exit if $t0 > $t9)

    add $t2, $s2, 0 # save the bottom of stack address to $t2
    sll $t3, $t0, 2 # calculate the number of bytes to jump over
    sub $t2, $t2, $t3   # subtract them from bottom of stack address
    add $t2, $t2, 8 # add 2 words - we started counting at 2!

    lw  $t3, ($t2)  # load the content into $t3
    beq $t3, $s0, count # only 0's? go back to count loop

    add $t3, $s2, 0 # save the bottom of stack address to $t3

    sub $t3, $t3, $t2   # subtract lower from higher address (= bytes)
    srl $t3, $t3, 2 # changed div to srl
    add $t3, $t3, 2 # add 2 (words) = the final prime number!

    li  $v0, 1      # system code to print integer
    add $a0, $t3, 0 # the argument will be our prime number in $t3
    syscall         # print it!

    li  $v0, 4      # system code to print string
    la  $a0, space  # the argument will be a whitespace
    syscall         # print it!

    bne $t0, $t9, count # take loop when $t0 != $t9, changed ble to bne

exit:   li  $v0, 10     # set up system call 10 (exit)
    syscall 

【Comments】:

    Tags: mips sieve-of-eratosthenes


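    【Solution 1】:

    The first optimization is to code the algorithm in C/C++ and do due diligence (clean up and tighten the code). If that is not fast enough, start applying the usual optimizations for the Sieve of Eratosthenes: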

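    • use a packed bit array instead of representing numbers with bytes or words
    • use an odds-only sieve (pulling the prime 2 out of thin air when needed)
    • sieve in small, cache-friendly segments (for example, sized to the 32 KB L1 cache found on many CPUs)
    • remember offsets between segments to avoid expensive modulo divisions
    • initialize segments with a bit pattern that corresponds to sieving by a handful of small primes
    • dispatch segments so that all available cores sieve in parallel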
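The first two items of the list can be sketched in C as follows. This is a minimal illustration rather than a tuned implementation; the function name `count_primes` and the choice of letting bit i stand for the odd number 2*i + 3 are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout: bit i stands for the odd number 2*i + 3,
 * and a set bit marks that number as composite. */
static int is_marked(const uint32_t *bits, size_t i) {
    return (bits[i >> 5] >> (i & 31)) & 1u;
}

static void mark(uint32_t *bits, size_t i) {
    bits[i >> 5] |= (uint32_t)1 << (i & 31);
}

/* Counts the primes <= limit (limit >= 2) with an odds-only packed bit sieve.
 * Returns 0 if the allocation fails. */
size_t count_primes(uint32_t limit) {
    assert(limit >= 2);
    size_t n_odd = (limit - 1) / 2;              /* odd numbers in 3..limit */
    uint32_t *bits = calloc(n_odd / 32 + 1, sizeof *bits);
    if (bits == NULL) return 0;

    for (size_t i = 0; i < n_odd; i++) {
        uint64_t p = 2 * (uint64_t)i + 3;
        if (p * p > limit) break;                /* larger p cannot strike anything new */
        if (is_marked(bits, i)) continue;        /* p is composite, skip it */
        /* p*p is odd and has index (p*p - 3) / 2. Stepping the index by p
         * steps the number by 2*p, so even multiples are never touched. */
        for (uint64_t j = (p * p - 3) / 2; j < n_odd; j += p)
            mark(bits, (size_t)j);
    }

    size_t count = 1;                            /* the prime 2, "out of thin air" */
    for (size_t i = 0; i < n_odd; i++)
        if (!is_marked(bits, i)) count++;
    free(bits);
    return count;
}
```

For the range in the question, `count_primes(200)` returns 46. The sieve needs one bit per odd number, which for 200 is 99 bits (four 32-bit words) instead of the 198 stack words the MIPS code initializes.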

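    "Sieve of Eratosthenes - segmented to increase speed and range" shows a simple implementation of a segmented sieve and explains these optimizations (except multi-threading); it also links to compilable test programs that demonstrate them.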
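The following is not the code from that post. It is a minimal C sketch of segmenting combined with remembered offsets, under several assumptions: the function name `count_primes_segmented` is made up, one byte per number is used instead of packed bits to keep the sketch short, even numbers are not skipped, and the limit is restricted to at most 2^32 - 1.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SEGMENT_SIZE 32768u   /* assumption: one byte per number, sized to a 32 KB L1 cache */

/* Counts the primes <= limit by sieving [2, limit] one segment at a time.
 * Returns 0 if limit < 2 or an allocation fails. */
uint64_t count_primes_segmented(uint64_t limit) {
    assert(limit <= 0xFFFFFFFFu);                /* range this sketch is meant for */
    if (limit < 2) return 0;

    uint64_t root = 0;                           /* floor(sqrt(limit)) */
    while ((root + 1) * (root + 1) <= limit) root++;

    unsigned char *small = calloc(root + 1, 1);  /* 1 = composite */
    uint64_t *primes = malloc((root + 1) * sizeof *primes);
    uint64_t *next = malloc((root + 1) * sizeof *next);
    unsigned char *seg = malloc(SEGMENT_SIZE);
    if (!small || !primes || !next || !seg) {
        free(small); free(primes); free(next); free(seg);
        return 0;
    }

    /* Base primes up to sqrt(limit), found with a plain sieve. */
    size_t n_primes = 0;
    for (uint64_t p = 2; p <= root; p++) {
        if (small[p]) continue;
        primes[n_primes] = p;
        next[n_primes] = p * p;                  /* first multiple that needs striking */
        n_primes++;
        for (uint64_t m = p * p; m <= root; m += p) small[m] = 1;
    }

    uint64_t count = 0;
    for (uint64_t low = 2; low <= limit; low += SEGMENT_SIZE) {
        uint64_t high = low + SEGMENT_SIZE - 1;
        if (high > limit) high = limit;
        size_t len = (size_t)(high - low + 1);
        memset(seg, 0, len);

        for (size_t k = 0; k < n_primes; k++) {
            uint64_t m = next[k];
            for (; m <= high; m += primes[k]) seg[m - low] = 1;
            next[k] = m;                         /* remembered: no division in the next segment */
        }
        for (size_t i = 0; i < len; i++) count += !seg[i];
    }

    free(small); free(primes); free(next); free(seg);
    return count;
}
```

Because `next[k]` always holds the first multiple of `primes[k]` beyond the segment just finished, each new segment starts striking immediately without computing `low % p`. Only the base primes and one segment buffer stay hot in the cache, regardless of how large the limit is.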

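    Unless the C compiler is exceptionally bad, assembly coding can only speed up a plain C implementation of the sieve by about 10%. In real-world numbers this means the C version might take 15 seconds to sieve the numbers up to 2^32, while the assembly version might take 13 seconds. Applying the first five optimizations from the list above can get the C version down to about 2 seconds, and adding the last one can get you under one second (about 0.3 seconds if you have 8 idle cores).

    If that is still not fast enough, assembly coding probably won't help, because optimization is now more about reducing contention, avoiding the latency of the deep memory cache hierarchy, and keeping the pipeline supplied with instructions that are ready to execute (instead of waiting on dependencies). In other words, the optimization war is won at the algorithmic and architectural levels, not with assembly micro-optimizations. Current compilers tend to be very good at applying micro-optimizations automatically, such as instruction scheduling and avoiding branch misprediction penalties. When hand-coding assembly, you have to work very hard just to match the speed of compiled code, let alone exceed it.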

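    The next level of optimization extends the idea of the odds-only sieve to exclude multiples of more small primes (3, 5, 7 and so on). This is usually called a "wheel". However, adding more spokes gives diminishing returns. The two-spoke wheel (the odds-only sieve) already cuts the work in half, but adding 3 to the picture only removes another third of what is left, and so on, while complicating the code considerably. The added complexity can make the code slower than the odds-only version, so you need higher-order wheels to see a significant improvement.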
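The diminishing returns can be checked with a little arithmetic. A wheel built from the primes p1 to pk repeats every p1*...*pk numbers, and (p1 - 1)*...*(pk - 1) numbers per turn remain as sieve candidates. The helper names `wheel_modulus` and `wheel_candidates` below are hypothetical, chosen only for this sketch, which is limited to the first six primes.

```c
#include <assert.h>

static const unsigned SMALL_PRIMES[] = {2, 3, 5, 7, 11, 13};

/* Length of one wheel turn: the product of the first n_primes primes. */
unsigned wheel_modulus(unsigned n_primes) {
    assert(n_primes <= 6);
    unsigned m = 1;
    for (unsigned i = 0; i < n_primes; i++) m *= SMALL_PRIMES[i];
    return m;
}

/* Numbers per wheel turn that are not divisible by any wheel prime
 * and therefore still have to be sieved. */
unsigned wheel_candidates(unsigned n_primes) {
    assert(n_primes <= 6);
    unsigned c = 1;
    for (unsigned i = 0; i < n_primes; i++) c *= SMALL_PRIMES[i] - 1;
    return c;
}
```

The odds-only sieve keeps 1 of every 2 numbers (50%), the wheel for 2 and 3 keeps 2 of 6 (about 33%), adding 5 keeps 8 of 30 (about 27%), and adding 7 keeps 48 of 210 (about 23%). Each extra prime saves less than the one before, while the number of residues the code has to track grows quickly.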

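    In any case, optimization needs prototyping and a reference implementation in a high-level language, that is, something human-readable such as C/C++ that does not need interpreting by a high priest of MIPS.

    【Comments】: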