Intro to SIMD in avo
Sun 21 December 2025
In the previous post we wrote a Hello World in avo. Let's do something practical this time, e.g., something performance-related, since we're going to all this trouble of writing Go assembly. You can find the code examples in github.com/marselester/misc.
Processing more data in a single CPU instruction makes our programs faster. That's what the SIMD (Single Instruction, Multiple Data) technique is for. The caveat is that we need to think in terms of vectors, not scalars. For example, let's say we want to find the sum of eight 64-bit integers. Our options look as follows:
- sum of scalars: 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
- sum of vectors: [1, 2, 3, 4] + [5, 6, 7, 8] or [1, 2] + [3, 4] + [5, 6] + [7, 8]
The first option is straightforward.
func Sum(input []int64) int64 {
	var sum int64
	for _, v := range input {
		sum += v
	}
	return sum
}
The second one — not so much 😬.
At least my CPU (Intel i5-10600) supports AVX2,
meaning it can execute 256-bit SIMD instructions.
That's exactly enough to add our vectors [1, 2, 3, 4] and [5, 6, 7, 8] with just a single CPU instruction.
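By the way, if you're not sure what your CPU supports, the golang.org/x/sys/cpu package can tell you. A quick standalone check (not part of the sum code):

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// True if the CPU supports 256-bit AVX2 instructions.
	fmt.Println("AVX2:", cpu.X86.HasAVX2)
}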
The plan is to add the two 4-element vectors, then keep folding the resulting vector by adding its halves, see the calculations below.
[1, 2, 3, 4]       [6, 8]       [16, 20]
      +               +              +
[5, 6, 7, 8]   ➡  [10, 12]  ➡   [0, 16]
      =               =              =
[6, 8, 10, 12]     [16, 20]      [16, 36]
                                     🏁
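Before reaching for assembly, here is the same folding written in ordinary Go (no SIMD involved), just to pin down the arithmetic we're about to encode:

input := [8]int64{1, 2, 3, 4, 5, 6, 7, 8}

var v [4]int64
for i := 0; i < 4; i++ {
	v[i] = input[i] + input[4+i] // One vector addition: [6, 8, 10, 12].
}
half := [2]int64{v[0] + v[2], v[1] + v[3]} // [16, 20]
println(half[0] + half[1])                 // 36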
With this in mind, let's implement it in Go assembler!
Adding vectors
We can start small and just focus on adding 8 numbers.
The first step is to create a dummy function SumVec and a corresponding test.
It always returns zero no matter what input it gets.
Note, we used asm.XORQ(sum, sum) to set the register associated with the sum variable to zero.
We'll see the Q suffix quite often later on; it stands for quadword (8 bytes) on amd64.
🔻 asm.go

//go:build ignore

package main

import asm "github.com/mmcloughlin/avo/build"

//go:generate go run asm.go -out sum.s -stubs sum.go
func main() {
	asm.TEXT("SumVec", asm.NOSPLIT, "func(input []int64) int64")
	sum := asm.GP64()
	asm.XORQ(sum, sum)
	asm.Store(sum, asm.ReturnIndex(0))
	asm.RET()
	asm.Generate()
}
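Besides sum.s, avo also writes the sum.go stubs file declaring the function, so regular Go code (like the test below) can call it. It looks roughly like this:

// Code generated by command: go run asm.go -out sum.s -stubs sum.go. DO NOT EDIT.

package sum

func SumVec(input []int64) int64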
🔻 sum_test.go
package sum

import "testing"

func TestSumVec(t *testing.T) {
	input := []int64{1, 2, 3, 4, 5, 6, 7, 8}
	var want int64 = 36
	if got := SumVec(input); got != want {
		t.Fatalf("expected %d got %d", want, got)
	}
}
Not surprisingly, the test fails as it expects the sum to be 36.
﹩ go generate ./sum/asm.go && go test ./sum
--- FAIL: TestSumVec (0.00s)
sum_test.go:10: expected 36 got 0
The second step is to learn the input []int64 slice's length and
where its backing array is located in memory,
so we can load its elements into a vector register.
When the function is called, a three-field slice structure is passed on the stack.
type slice struct {
	array unsafe.Pointer
	len   int
	cap   int
}
Its fields can be accessed in assembler as follows:
- input_base+0(FP) is a pointer to the underlying array (the base memory address)
- input_len+8(FP) is the length of the slice
- input_cap+16(FP) is the capacity of the slice
The avo API is very similar;
here is how we can load the array pointer and the length
into general-purpose registers AX and CX assigned by avo:
inputData := asm.GP64() // Base pointer of the slice is in AX.
inputLen := asm.GP64() // Number of elements in the slice is in CX.
// MOVQ input_base+0(FP), AX
asm.Load(asm.Param("input").Base(), inputData)
// MOVQ input_len+8(FP), CX
asm.Load(asm.Param("input").Len(), inputLen)
The third step is to load the left half of the array into a vector register.
vecLeft := asm.YMM() // 256-bit vector register Y0.
// VMOVDQU (AX), Y0
asm.VMOVDQU(operand.Mem{Base: inputData}, vecLeft)
Examining the generated Go assembly, we'll see the VMOVDQU (AX), Y0 instruction:
- VMOVDQU stands for Vector MOVe Double Quadword Unaligned. It copies the [1, 2, 3, 4] elements from a possibly unaligned memory address stored in AX to vector register Y0. Unaligned means not starting at a memory address that is a multiple of the vector's size. We don't use VMOVDQA (the aligned version) since we don't know if the array's address is aligned to 32 bytes (256 bits). Despite its "double quadword" (128-bit vector) naming, the instruction is capable of moving 256 bits.
- The (AX) operand means use the address from register AX. Its avo equivalent is operand.Mem{Base: inputData}.
- The Y0 operand is a 256-bit vector register allocated by vecLeft := asm.YMM().
🦉 Since we mentioned vectors of different sizes, let's name them for reference:
- 512-bit ZMM registers: Z0...Z31 for AVX-512 (not our case)
- 256-bit YMM registers: Y0...Y15 for AVX, and Y0...Y31 for AVX-512
- 128-bit XMM registers: X0...X15 for AVX, and X0...X31 for AVX-512
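In avo, each of these register classes has a corresponding allocation helper (the concrete register numbers in the comments are just examples, avo's allocator picks them):

x := asm.XMM() // 128-bit register, e.g., X2.
y := asm.YMM() // 256-bit register, e.g., Y2.
z := asm.ZMM() // 512-bit register, AVX-512 only.

We'll only need YMM registers and their lower XMM halves (obtained with AsX()) during the horizontal reduction later.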
Moving on to the fourth step — loading the right half of the array into another vector register.
The important part is to determine the memory address from which to copy four 64-bit integers.
As we can see from the diagram below, we need to start at the array index 4.
We can deduce the address of element 5 like this:
inputData + index * int64InBytes = 0xc000054760 + 4 * 8, assuming the array is stored at 0xc000054760.
        0xc000054760
        ⬇️
array: [1, 2, 3, 4, 5, 6, 7, 8]
index:  0  1  2  3  4  5  6  7
                    ⬆️
                    0xc000054760 + 4 * 8
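The same arithmetic in plain Go, just as an illustration (it needs the unsafe import and Go 1.17+ for unsafe.Add; the real code doesn't need any of this):

input := []int64{1, 2, 3, 4, 5, 6, 7, 8}
base := unsafe.Pointer(&input[0]) // E.g., 0xc000054760.
addr := unsafe.Add(base, 4*8)     // inputData + index*int64InBytes.
println(*(*int64)(addr))          // 5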
The assembly code looks similar to what we saw in the previous step:
- MOVQ copies the literal value 0x00000004 (our index 4 represented as a 32-bit unsigned integer) into the 64-bit CX register.
- VMOVDQU copies 256 bits starting from the memory address defined by the operand (AX)(CX*8) to vector register Y1. The operand (AX)(CX*8) reads as AX + CX * 8, i.e., take the memory address stored in the AX register (0xc000054760 in our example), then add the product of the value stored in the CX register ($0x00000004) and the scaling factor 8, since the array contains 64-bit integers.
MOVQ $0x00000004, CX
VMOVDQU (AX)(CX*8), Y1
The assembler DSL is a little bit verbose, but it provides type safety.
For instance, it makes sure we pass a valid immediate value when setting the index to 4
(the asm.MOVQ() docs indicate imm32 and imm64) as the first operand in asm.MOVQ(operand.U32(4), index).
Note, operand.U64(4) would also work.
index := asm.GP64() // The array index is stored in register CX.
// MOVQ $0x00000004, CX
asm.MOVQ(operand.U32(4), index)
vecRight := asm.YMM() // 256-bit vector register Y1.
// VMOVDQU (AX)(CX*8), Y1
asm.VMOVDQU(
	operand.Mem{
		Base:  inputData, // Array starts at 0xc000054760 address.
		Index: index,     // Array index is 4.
		Scale: 8,         // The multiplier of the index is 8 bytes (int64).
	},
	vecRight,
)
Now that we've got both vectors filled, we can finally add them up!
It's done with the VPADDQ Y0, Y1, Y0 instruction, which reads
as Vector Packed ADD Quadword, i.e., the 64-bit elements of vectors Y0 and Y1 are added
and the result is stored in Y0.
"Packed" signifies that the instruction operates on all the elements packed within the register,
i.e., it is not a scalar operation.
// VPADDQ Y0, Y1, Y0
asm.VPADDQ(vecLeft, vecRight, vecLeft)
Now Y0 contains [6, 8, 10, 12].
Adding half-vectors
We sum up the Y0 = [6, 8, 10, 12] vector by adding its halves [6, 8] and [10, 12].
That's called a horizontal reduction (a sum across the lanes of a vector).
[6, 8]
+
[10, 12]
=
[16, 20]
To do that, we can copy its left half (bits 128-255) to a 128-bit XMM vector register X1
using VEXTRACTI128 (Vector Extract Integer 128-bit) instruction.
Y0 = [6, 8, 10, 12]
⬇️ ⬇️
X1 = [6, 8]
The first operand $0x01 in VEXTRACTI128 $0x01, Y0, X1 is a control byte
that refers to extracting the upper 128-bit lane.
The second operand is the source YMM register (vecLeft in our avo program),
and the third one is an XMM register (we use vecRight.AsX()
which is the lower portion of vecRight register).
vecRightLow := vecRight.AsX()
// VEXTRACTI128 $0x01, Y0, X1
asm.VEXTRACTI128(operand.U8(1), vecLeft, vecRightLow)
Since X0 represents the right half of Y0, we can add X0 and X1
which by now contains the left half of Y0.
Y0 = [6, 8, 10, 12]
⬇️ ⬇️ [10, 12] = X0
X1 = [6, 8]
Go code looks familiar.
vecLeftLow := vecLeft.AsX()
// VPADDQ X0, X1, X0
asm.VPADDQ(vecLeftLow, vecRightLow, vecLeftLow)
At this point X0 contains [16, 20].
Our goal is to line up 16 with 20 to get our scalar result 36.
To line them up, we can shift the vector right by 8 bytes, since we're dealing with 64-bit (8-byte) integers.
Before: [16, 20]
          ➡️
After:  [ 0, 16]
The VPSRLDQ $0x08, X0, X1 instruction does that, i.e., it shifts the 128-bit value in X0 right by 8 bytes,
fills the vacated bytes with zeros, and stores the result in X1.
The addition instruction is the same VPADDQ X0, X1, X0.
[16, 20] X0
+
[0, 16] X1
=
[16, 36] X0
🏁
Here is the avo code.
// VPSRLDQ $0x08, X0, X1
asm.VPSRLDQ(operand.U8(8), vecLeftLow, vecRightLow)
// VPADDQ X0, X1, X0
asm.VPADDQ(vecLeftLow, vecRightLow, vecLeftLow)
That's it, we've got our final result 36 in the X0 = [16, 36] vector.
We just need to somehow return it from the SumVec function 🤔.
The cool thing about the VMOVQ instruction is that it can copy the lower quadword
(our 36 value) from a vector register to a scalar register like this: VMOVQ X0, AX.
Note, VMOVQ Y0, AX wouldn't work since a YMM operand isn't supported.
These are the final lines of Go code that generate Go assembly. The complete example is on GitHub.
sum := asm.GP64()
// VMOVQ X0, AX
asm.VMOVQ(vecLeftLow, sum)
// MOVQ AX, ret+24(FP)
asm.Store(sum, asm.ReturnIndex(0))
// RET
asm.RET()
This time the tests pass 🎉.
﹩ go generate ./sum/asm.go && go test ./sum
ok myprog/sum 0.289s
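For reference, the generated sum.s should look roughly like the listing below (stitched together from the instruction comments above; the exact register allocation and formatting are up to avo):

// Code generated by command: go run asm.go -out sum.s -stubs sum.go. DO NOT EDIT.

#include "textflag.h"

// func SumVec(input []int64) int64
TEXT ·SumVec(SB), NOSPLIT, $0-32
	MOVQ input_base+0(FP), AX
	MOVQ input_len+8(FP), CX
	VMOVDQU (AX), Y0
	MOVQ $0x00000004, CX
	VMOVDQU (AX)(CX*8), Y1
	VPADDQ Y0, Y1, Y0
	VEXTRACTI128 $0x01, Y0, X1
	VPADDQ X0, X1, X0
	VPSRLDQ $0x08, X0, X1
	VPADDQ X0, X1, X0
	VMOVQ X0, AX
	MOVQ AX, ret+24(FP)
	RET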
Working with larger arrays
It would be even better if SumVec worked with arrays larger than 8 items (what two YMM registers can fit).
We can use the sum of scalars approach when an array length is less than 8,
and leverage the SIMD technique in a loop for bigger arrays.
Oftentimes, we would have to use a scalar sum anyway to
add up the tail of an array if its length isn't a multiple of 4,
e.g., [1, 2, 3, 4] + [5, 6, 7, 8] + 9 + 10 where 9, 10 is the tail.
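As a plain-Go model of the control flow we're about to build (a hypothetical sumModel helper, not the generated code; the acc array plays the role of the Y0 accumulator):

func sumModel(input []int64) int64 {
	var sum int64
	if len(input) < 8 {
		for _, v := range input { // scalar_loop only
			sum += v
		}
		return sum
	}

	var acc [4]int64
	copy(acc[:], input[:4]) // vecLeft = the first 4 elements
	loopEnd := len(input) & -4
	for i := 4; i < loopEnd; i += 4 { // vector_loop
		for lane := 0; lane < 4; lane++ { // VPADDQ
			acc[lane] += input[i+lane]
		}
	}
	sum = acc[0] + acc[1] + acc[2] + acc[3] // horizontal reduction
	for i := loopEnd; i < len(input); i++ { // scalar_loop for the tail
		sum += input[i]
	}
	return sum
}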
The function starts similarly: inputData points to the array, inputLen holds its length,
and the sum and index variables are set to zero.
🔻 asm.go part #1
inputData := asm.GP64()
inputLen := asm.GP64()
asm.Load(asm.Param("input").Base(), inputData)
asm.Load(asm.Param("input").Len(), inputLen)
sum := asm.GP64()
index := asm.GP64()
asm.XORQ(sum, sum)
asm.XORQ(index, index)
asm.CMPQ(inputLen, operand.U8(8))
asm.JL(operand.LabelRef("scalar_loop"))
The new things here are the CMPQ and JL instructions; they implement a fallback to a scalar sum:
- CMPQ CX, $0x08 means compare quadword inputLen to 8
- JL scalar_loop means jump to the code block labeled scalar_loop if inputLen is less than 8
// CMPQ CX, $0x08
asm.CMPQ(inputLen, operand.U8(8))
// JL scalar_loop
asm.JL(operand.LabelRef("scalar_loop"))
Otherwise, keep adding the YMM vectors in the loop like this.
/*
Y0 + Y1 = [1, 2, 3, 4] + [5, 6, 7, 8]
Y0 + Y1 = [6, 8, 10, 12] + [9, 10, 11, 12]
Y0 + Y1 = [15, 18, 21, 24] + [13, 14, 15, 16]
...
Y0 + Y1
*/
for index += 4; index < loopEnd; index += 4 {
	// ...
}
🔻 asm.go part #2
vecLeft := asm.YMM()
vecRight := asm.YMM()
asm.VMOVDQU(operand.Mem{Base: inputData}, vecLeft)

asm.Comment("loopEnd = inputLen - (inputLen % 4)")
loopEnd := asm.GP64()
asm.MOVQ(inputLen, loopEnd)
asm.ANDQ(operand.I8(-4), loopEnd)

asm.Label("vector_loop")
{
	asm.ADDQ(operand.U32(4), index)
	asm.CMPQ(loopEnd, index)
	asm.JLE(operand.LabelRef("vector_loop_end")) // Exit the vector loop.

	asm.VMOVDQU(
		operand.Mem{
			Base:  inputData,
			Index: index,
			Scale: 8,
		},
		vecRight,
	)
	asm.VPADDQ(vecLeft, vecRight, vecLeft)
	asm.JMP(operand.LabelRef("vector_loop"))
}
asm.Label("vector_loop_end")

asm.Comment("Horizontal reduction.")
{
	vecRightLow := vecRight.AsX()
	asm.VEXTRACTI128(operand.U8(1), vecLeft, vecRightLow)
	vecLeftLow := vecLeft.AsX()
	asm.VPADDQ(vecLeftLow, vecRightLow, vecLeftLow)
	asm.VPSRLDQ(operand.U8(8), vecLeftLow, vecRightLow)
	asm.VPADDQ(vecLeftLow, vecRightLow, vecLeftLow)
	asm.VMOVQ(vecLeftLow, sum)
	asm.VZEROUPPER()
}

asm.Comment("Set index = loopEnd to sum the tail of the array.")
asm.MOVQ(loopEnd, index)
The loop end is calculated as loopEnd = inputLen - (inputLen % 4), e.g.,
if an array's length is 9, the loopEnd will be 8 = 9 - (9 % 4),
so the Y1 register is always fully filled with 4 integers on each iteration.
Since 4 is a power of 2, we can efficiently calculate the loopEnd (in 1 CPU cycle)
using bitwise AND: 9 AND -4 rounds 9 down to 8, the nearest lower multiple of 4.
loopEnd := asm.GP64() // The loop end is stored in SI register.
// MOVQ CX, SI
asm.MOVQ(inputLen, loopEnd)
// ANDQ $-4, SI
asm.ANDQ(operand.I8(-4), loopEnd)
How does it work?
All bits starting from the third bit (counting from the right) represent a multiple of 4,
and the two lowest bits are the remainder.
In the example below the number 15 (1111 in binary) has a remainder of 3 (11 in binary).
Therefore, we just need to zero out the last two bits of inputLen to calculate the loopEnd.
bit #:   4  3  2  1
         ⬅️     ➡️
bits:    1  1  1  1
powers:  8  4  2  1
       loopEnd = 12 | remainder = 3
Computers use the two's complement
method to represent integers, so -4 is represented as 1111 1100 in binary
(i.e., invert the bits of 4, then add 1).
That can be used to mask out the remainder bits with the ANDQ instruction.
4      = 0000 0100
^4     = 1111 1011
^4 + 1 = 1111 1100 = -4

9      = 0000 1001
9 & -4 = 0000 1000 = 8
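A quick plain-Go check of the masking (the same trick can be written as n &^ 3 in Go):

for _, n := range []int64{7, 8, 9, 15} {
	println(n, n&-4) // Prints 7 4, 8 8, 9 8, 15 12.
}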
With that out of the way, let's look at the loop itself:
- vector_loop: defines the vector_loop label, which is a named memory location that denotes the beginning of our loop. We jump there unconditionally with JMP vector_loop at the end of the loop's body.
- ADDQ $0x00000004, BX increments the array index by 4
- CMPQ SI, BX compares the loopEnd to the index
- JLE vector_loop_end exits the loop by jumping to the label vector_loop_end if loopEnd <= index
- VMOVDQU (AX)(BX*8), Y1 like before loads a 256-bit chunk from inputData[index:index+4] into the Y1 register
- VPADDQ Y0, Y1, Y0 adds the vectors
vector_loop:
	ADDQ    $0x00000004, BX
	CMPQ    SI, BX
	JLE     vector_loop_end
	VMOVDQU (AX)(BX*8), Y1
	VPADDQ  Y0, Y1, Y0
	JMP     vector_loop

vector_loop_end:
The assembly above was generated from this Go code. Note, I've used curly braces to make the code look nicer.
asm.Label("vector_loop")
{
asm.ADDQ(operand.U32(4), index)
asm.CMPQ(loopEnd, index)
asm.JLE(operand.LabelRef("vector_loop_end")) // Exit the loop.
asm.VMOVDQU(
operand.Mem{
Base: inputData,
Index: index,
Scale: 8,
},
vecRight,
)
asm.VPADDQ(vecLeft, vecRight, vecLeft)
asm.JMP(operand.LabelRef("vector_loop"))
}
asm.Label("vector_loop_end")
The vector loop is followed by the horizontal reduction logic we've already seen before.
There is a new instruction though: VZEROUPPER.
It zeroes bits 128-255 of all YMM registers.
From what I understand, we should place it right after we're done using the 256-bit SIMD instructions
to prevent a potential performance penalty when the CPU later executes legacy SSE code.
vecRightLow := vecRight.AsX()
asm.VEXTRACTI128(operand.U8(1), vecLeft, vecRightLow)
vecLeftLow := vecLeft.AsX()
asm.VPADDQ(vecLeftLow, vecRightLow, vecLeftLow)
asm.VPSRLDQ(operand.U8(8), vecLeftLow, vecRightLow)
asm.VPADDQ(vecLeftLow, vecRightLow, vecLeftLow)
asm.VMOVQ(vecLeftLow, sum)
asm.VZEROUPPER()
After the reduction, we need to sum up the tail of the array,
so we set our array index = loopEnd to start the scalar loop
where the vector loop left off.
// MOVQ SI, BX
asm.MOVQ(loopEnd, index)
And here is the scalar loop itself.
for ; index < inputLen; index++ {
	sum += inputData[index]
}
You'll notice that this loop resembles the vector loop,
except that we've got an ADDQ instead of a VPADDQ instruction,
and the index gets incremented by one instead of four.
asm.Label("scalar_loop")
{
asm.CMPQ(inputLen, index)
asm.JLE(operand.LabelRef("scalar_loop_end"))
asm.ADDQ(
operand.Mem{
Base: inputData,
Index: index,
Scale: 8,
},
sum,
)
asm.INCQ(index)
asm.JMP(operand.LabelRef("scalar_loop"))
}
asm.Label("scalar_loop_end")
Finally, let's see how the SIMD SumVec stacks up against the scalar Sum implementation.
It's roughly 2x faster on my machine.
﹩ benchstat old.txt new.txt
name    old time/op  new time/op  delta
Sum-12  38.9µs ± 0%  17.8µs ± 2%  -54.23%  (p=0.000 n=8+10)
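The numbers came from a benchmark along these lines (the input size here is made up; the real benchmark is in the repository):

func BenchmarkSum(b *testing.B) {
	input := make([]int64, 100_000)
	for i := range input {
		input[i] = int64(i)
	}
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		Sum(input) // Swap in SumVec(input) to get the "new" numbers.
	}
}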
You can find the full code and benchmarks here. Cheers!
References:
- avo docs and examples by Michael McLoughlin
- From slow to SIMD: A Go optimization story by Camden Cheek
- Advanced Vector Extensions
- Two's complement
- x86 and amd64 instruction reference by Félix Cloutier