May 23, 2025 · 5 min 52 sec read

Single Instruction, Multiple Data (SIMD)

Remember the first time you discovered that your sorting algorithm could run in O(n log n) instead of O(n^2)? That moment when everything clicked and you realized you'd been doing things the hard way? That's exactly what happened to me when I discovered SIMD - except the performance gains were even more dramatic.

During the past few weeks, I committed to diving into the deep internals of Deep Learning models and how engineers bridge the gap between computer science and math. Yeah, all the optimizations in math are cool and all, but computers have constraints. It sounds obvious, but most of the code we write isn't optimized for being executed hundreds of thousands of times in a row. Imagine what a 5% speedup means at that scale!

Deep Learning models, e.g. LLMs, are at their core chains of matmul operations. Lots of them. I know it sounds obvious again, but bear with me. Every single time you interact with an LLM, trillions of FLOPs are executed, and most of the time it happens so fast that we don't even take a moment to appreciate it. So I wanted to find out how this is optimized at the lowest software level, and that's when I stumbled across SIMD - Single Instruction, Multiple Data.

If you've ever wondered why your carefully optimized code still feels sluggish while libraries like NumPy seem to work magic, you're in the right place.

Parallel execution, for real

SIMD is a parallel computing architecture that allows a single instruction to be executed on multiple data elements simultaneously. Instead of processing data one element at a time, SIMD instructions can perform the same operation on vectors of data in a single CPU cycle.

Think of your CPU as a chef in a kitchen: It's rush hour, everyone wants their meal as soon as possible, and some of your teammates called in sick. Traditional scalar processing is like chopping vegetables one by one - it works, but a bit of speedup could help. SIMD is like having a knife that can slice through 8 onions at once with a single motion.

SIMD is tightly coupled to your hardware, so there are different ways to implement it depending on your system. It's like having to slice bread instead of onions - you still need a knife, but it's slightly different.

The most common SIMD instruction sets are:

  • x86: SSE, AVX, AVX2, AVX-512
  • ARM: NEON, SVE (Scalable Vector Extension)
  • GPU: CUDA cores, Metal shaders (strictly SIMT rather than an instruction set, but the same one-instruction-many-elements idea)

SIMD is particularly effective for workloads involving repetitive operations on large datasets: image processing, audio processing, mathematical computations, machine learning, and scientific simulations.
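
If you're curious which of those instruction sets your own build is targeting, the compiler exposes predefined macros you can check. Here's a minimal sketch; the macro names shown are the GCC/Clang ones (an assumption on my part, other compilers spell them differently):

// prints the widest SIMD target this file was compiled for
// (predefined macros as set by GCC/Clang)
#include <cstdio>

int main() {
#if defined(__AVX512F__)
    std::puts("compiled for AVX-512: 512-bit vectors");
#elif defined(__AVX2__)
    std::puts("compiled for AVX2: 256-bit vectors");
#elif defined(__SSE2__)
    std::puts("compiled for SSE2: 128-bit vectors");
#elif defined(__ARM_NEON)
    std::puts("compiled for ARM NEON: 128-bit vectors");
#else
    std::puts("no SIMD target detected at compile time");
#endif
    return 0;
}

On 64-bit ARM, NEON is always available; on x86-64 you typically have to opt in to the wider sets with flags like -mavx2 or -march=native.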

What SIMD looks like

Imagine we're trying to perform a simple operation: adding two vectors. It sounds straightforward, something anyone with a minimal interest in computer science could implement in their language of choice. The logic is simple: make sure the two vectors are the same size and add them element by element. Something like the following will do the job:

for (size_t i = 0; i < a.size(); i++) {
    result[i] = a[i] + b[i];
}

And like that, it just works. We test it with ten-element vectors, we even dare to add some thousand-element vectors, and it runs fast, almost instantly. However, give it vectors with millions of elements, and you'll soon realize something must be off. Worry no more! There's nothing wrong with your code; it's just a skill issue.

Look at the following code:

// AVX intrinsics from <immintrin.h>; assumes a.size() is a multiple of 8
for (size_t i = 0; i < a.size(); i += 8) {
    __m256 va = _mm256_loadu_ps(&a[i]);  // load 8 floats (unaligned-safe)
    __m256 vb = _mm256_loadu_ps(&b[i]);  // load 8 floats
    __m256 vr = _mm256_add_ps(va, vb);   // add 8 pairs at once
    _mm256_storeu_ps(&result[i], vr);    // store 8 results
}

You might think this looks like too much code for such a simple problem, that it might even be sluggish, and that such complexity isn't needed. That, my friend, issues roughly 8x fewer instructions than our initial approach, because each iteration adds 8 floats at once. The only practical catch: when the vector length isn't a multiple of 8, you finish the last few elements with a plain scalar loop, as sketched below.
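
Here's a minimal sketch of that pattern (assuming float vectors and an AVX-capable build; the unaligned loadu/storeu variants avoid any alignment requirements on the data):

#include <cstddef>
#include <immintrin.h>
#include <vector>

// adds two equal-length float vectors: SIMD for the bulk, scalar for the leftover tail
// (result must already be sized to match a and b)
void add_vectors(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& result) {
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                 // bulk: 8 floats per iteration
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&result[i], _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) {                         // tail: at most 7 remaining elements
        result[i] = a[i] + b[i];
    }
}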

The hard(ware) truth: What's actually inside your processor

Every modern CPU contains specialized vector processing units that have been waiting for you to use them:

  • SSE (128-bit): 4 floats or integers per instruction
  • AVX2 (256-bit): 8 floats or integers per instruction
  • AVX-512 (512-bit): 16 floats or integers per instruction

When you execute _mm256_add_ps(va, vb), your CPU's vector execution unit contains 8 parallel floating-point adders that all fire simultaneously. It's not sequential, it's genuinely parallel hardware.
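
Which of those units you can actually use depends on the CPU you're running on, not just on your compile flags. On x86 with GCC or Clang you can ask at runtime; a small sketch (this builtin is x86-only, so treat it as illustrative):

#include <cstdio>

// runtime feature check via the GCC/Clang x86 builtin __builtin_cpu_supports
int main() {
    if (__builtin_cpu_supports("avx512f"))
        std::puts("AVX-512 available: 16 floats per instruction");
    else if (__builtin_cpu_supports("avx2"))
        std::puts("AVX2 available: 8 floats per instruction");
    else if (__builtin_cpu_supports("sse"))
        std::puts("SSE available: 4 floats per instruction");
    else
        std::puts("no vector extensions reported");
    return 0;
}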

Numbers don't lie

Let me show you what this looks like in practice. Here's a simple benchmark comparing scalar vs SIMD vector addition:

#include <arm_neon.h>
#include <chrono>
#include <iostream>
#include <vector>
 
void benchmarkVectorAddition() {
    const size_t SIZE = 10000000;  // 10 million elements
    std::vector<float> a(SIZE), b(SIZE), result(SIZE);
    
    // fill with test data
    for (size_t i = 0; i < SIZE; ++i) {
        a[i] = i * 0.1f;
        b[i] = i * 0.2f;
    }   
    
    // scalar version
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < SIZE; ++i) {
        result[i] = a[i] + b[i];
    }   
    auto scalar_time = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::high_resolution_clock::now() - start).count();
    
    // SIMD version (ARM NEON - 4 floats at once; SIZE is a multiple of 4, so no tail loop is needed)
    start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < SIZE; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        float32x4_t vr = vaddq_f32(va, vb);
        vst1q_f32(&result[i], vr);
    }   
    auto simd_time = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::high_resolution_clock::now() - start).count();
    
    std::cout << "scalar: " << scalar_time << "ms\n";
    std::cout << "SIMD:   " << simd_time << "ms\n";
    std::cout << "speedup: " << (double)scalar_time / simd_time << "x\n";
}
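
For completeness, here's a tiny driver and the kind of build line I'd expect to use (the file name and exact flags are assumptions; on 64-bit ARM, NEON is enabled by default, so no extra flag is needed):

// hypothetical driver for the benchmark above
int main() {
    benchmarkVectorAddition();
    return 0;
}

// build (assumed: clang on a 64-bit ARM machine, e.g. Apple Silicon):
//   clang++ -O2 -std=c++17 simd_benchmark.cpp -o simd_benchmark

One caveat: at higher optimization levels the compiler may auto-vectorize the scalar loop as well, so the measured gap depends on your flags.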

My results:

  • scalar: 24ms
  • SIMD: 13ms
  • speedup: 1.84615x

That's not the theoretical 4x, because a plain element-wise add over 10 million floats is limited mostly by memory bandwidth rather than arithmetic, but it's still nearly twice as fast for the same work.

Why this matters for Deep Learning and LLMs

Remember how I mentioned that Deep Learning is essentially chained matrix multiplications? Here's where SIMD becomes crucial:

matmul: The heart of AI

// simple approach - one multiplication at a time
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k) {
            sum += A[i*N + k] * B[k*N + j];  // one at a time
        }
        C[i*N + j] = sum;
    }
}

// SIMD approach - 8 multiply-adds per instruction (AVX2/FMA)
// assumes N is a multiple of 8 and that B has been transposed into Bt,
// with Bt[j*N + k] == B[k*N + j], so each dot product reads contiguous memory
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
        __m256 sum = _mm256_setzero_ps();
        for (int k = 0; k < N; k += 8) {
            __m256 a = _mm256_loadu_ps(&A[i*N + k]);
            __m256 b = _mm256_loadu_ps(&Bt[j*N + k]);
            sum = _mm256_fmadd_ps(a, b, sum);  // 8 multiply-adds at once
        }
        // horizontal sum to collapse the 8 lanes into the final result
        C[i*N + j] = horizontal_sum(sum);
    }
}
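
One note on the helper: horizontal_sum isn't an intrinsic, it's a small reduction you write yourself to collapse the 8 lanes into one float. One possible AVX implementation (a sketch; there are several equivalent ways to write it):

#include <immintrin.h>

// collapses the 8 float lanes of an __m256 into a single sum
static inline float horizontal_sum(__m256 v) {
    __m128 lo   = _mm256_castps256_ps128(v);        // lanes 0..3
    __m128 hi   = _mm256_extractf128_ps(v, 1);      // lanes 4..7
    __m128 sum4 = _mm_add_ps(lo, hi);               // 4 partial sums
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));       // 2 partial sums
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 0x1)); // final sum in lane 0
    return _mm_cvtss_f32(sum1);
}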

This is why libraries like PyTorch, TensorFlow, and NumPy are so fast: They're built on highly optimized SIMD implementations.

Python: Getting SIMD without the complexity

The beautiful thing about Python is that you can get SIMD performance without writing a single intrinsic:

import numpy as np
import time

# Create large arrays
size = 5_000_000
a = np.random.random(size).astype(np.float32)
b = np.random.random(size).astype(np.float32)

# pure Python (slow)
start = time.time()
result_python = [a[i] + b[i] for i in range(len(a))]
python_time = time.time() - start

# NumPy (SIMD-optimized automatically)
start = time.time()
result_numpy = a + b
numpy_time = time.time() - start

print(f"pure Python: {python_time:.3f}s")
print(f"numPy:       {numpy_time:.3f}s") 
print(f"speedup:     {python_time/numpy_time:.1f}x")

My results:

  • pure Python: 0.604s
  • NumPy: 0.002s
  • speedup: 369.5x

NumPy automatically uses the best available SIMD instructions on your processor. No intrinsics, no complexity, just massive performance gains.

Conclusions

SIMD is a powerful tool, but not everything is rainbows and sunshine: SIMD is not the right choice for software with complex branching logic, small datasets, irregular memory access patterns, and so on, but it shines in mathematical operations over arrays, image and video processing, and linear algebra, to name a few. It isn't just another optimization technique; it is fundamental to how we implement complex computations nowadays.