Show HN: Lockstep – A data-oriented programming language (github.com/seanwevans)
8 points by goosethe 1 day ago | 4 comments
https://github.com/seanwevans/lockstep

I want to share Lockstep, my work-in-progress systems language, with its v0.1.0 release. It is a data-oriented systems programming language designed for high-throughput, deterministic compute pipelines.

I built Lockstep to bridge the gap between the productivity of C and the execution efficiency of GPU compute shaders. Instead of traditional control flow, Lockstep enforces straight-line SIMD execution. You will not find any if, for, or while statements inside compute kernels; branching is entirely replaced by hardware-native masking and stream-splitting.
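To make the masking idea concrete, here is a scalar C sketch (my illustration, not Lockstep syntax) of how a branchy per-element computation becomes a mask-and-blend, which is what every SIMD lane executes regardless of the condition:

```c
#include <stdint.h>

/* Branchy form:    out[i] = x[i] > 0.0f ? x[i] : 0.0f;
 * Branchless form: build an all-ones/all-zeros mask and AND it in.
 * The outer loop stands in for the host driving the pipeline; the
 * loop body itself is the straight-line "kernel". */
static void relu_branchless(const float *x, float *out, int n) {
    for (int i = 0; i < n; i++) {
        uint32_t mask = -(uint32_t)(x[i] > 0.0f); /* 0x0 or 0xFFFFFFFF */
        union { float f; uint32_t u; } v = { .f = x[i] };
        v.u &= mask;   /* keep x[i] or zero it, no jump taken */
        out[i] = v.f;
    }
}
```

Every element pays the same fixed cost, which is the trade masking makes: no mispredicted branches, but no skipped work either.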

Memory is handled via a static arena provided by the Host. There is no malloc, no hidden threads, and no garbage collection, which guarantees predictable performance and eliminates race conditions by construction.

Under the hood, Lockstep targets LLVM IR directly to leverage industrial-grade optimization passes. It also generates a C-compatible header for easy integration with host applications written in C, C++, Rust, or Zig.

v0.1.0 includes a compiler with LLVM IR and C header emission, a CLI simulator for validating pipeline wiring and cardinality on small datasets, and an opt-in LSP server for real-time editor diagnostics, hover type info, and autocompletion.

You can check out the repository to see the syntax, and the roadmap outlines where the project is heading next, including parameterized SIMD widths and multi-stage pipeline composition.

I would love to hear feedback on the language semantics, the type system, and the overall architecture!




For similar languages / compilers in this area:

1. ISPC [1], the Intel® Implicit SPMD Program Compiler also compiles SIMD programs with branches and other control flow efficiently using predication/masking etc.

2. Futhark [2] compiles nice-looking functional programs into efficient parallel GPU/CPU code.

[1]: https://github.com/ispc/ispc

[2]: https://futhark-lang.org/


It's a bit philosophically different from ISPC. When SIMD lanes diverge, the ISPC compiler implicitly handles the execution masks and lane disabling behind the scenes.

We take a more draconian approach. We completely ban if and else inside compute kernels. If you want conditional logic, you must explicitly use branchless intrinsics like mix, step, or select. The goal is to make the cost of divergence mathematically explicit to the programmer rather than hiding it in the compiler. If a pathway is truly divergent, you handle it at the pipeline level using a filter node to split the stream.

We also ban arbitrary pointers entirely. All memory is handled via a Host-Owned Static Arena, and structs are automatically decomposed into Struct-of-Arrays layouts. Because the compiler controls the exact byte offset and knows there are no arbitrary pointers, it can aggressively decorate every LLVM IR pointer with noalias.
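For readers unfamiliar with those intrinsics, here is a scalar C model using the GLSL-style definitions of step and mix (assumed semantics; check the repo for Lockstep's exact signatures). On real hardware, the comparison lowers to a compare mask, not a branch:

```c
/* Scalar models of the branchless intrinsics (GLSL-style, assumed). */
static float ls_step(float edge, float x) { return (x < edge) ? 0.0f : 1.0f; }
static float ls_mix(float a, float b, float t) { return a + t * (b - a); }

/* The branchy "y = (x >= threshold) ? hi : lo;" becomes: */
static float gate(float x, float threshold, float lo, float hi) {
    return ls_mix(lo, hi, ls_step(threshold, x));
}
```

Because both arms of the conditional are always evaluated, the divergence cost is visible right in the source, which is the point the comment above is making.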


What about real performance? Does parallelization via SIMD instructions work well? What about supporting older CPUs with fewer SIMD instructions?

Does it have some loophole to allow loops? Or it just allows linear execution?

Can it read or write external memory (allocated within other language like C++)?


LLVM is able to auto-vectorize the generated IR extremely well. There are no branches to mispredict, so, theoretically, it just blasts through the data.

Since it emits standard LLVM IR, LLVM handles the actual instruction set targeting. Right now in v0.1.0, the compiler hardcodes a SIMD width of 8 (assuming AVX2). However, parameterized SIMD widths are already on the roadmap for v0.4.0. Once that is added, you will be able to pass a --target-width flag to compile down to narrower vector units (like SSE on older CPUs) or up to AVX-512 and ARM NEON.

There are strictly no loopholes for loops inside the compute kernels. Inside a shader block, execution is 100% linear. However, the host application calling the pipeline effectively acts as the loop over the data elements. To help, we allow linear accumulators: you consume these with a fold operation, which the compiler lowers into a lock-free parallel reduction tree rather than a traditional for loop.
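A minimal C sketch of that lowering (my illustration, not Lockstep's actual codegen): a pairwise reduction tree runs in log2(n) levels, and within each level every addition touches disjoint slots, so the adds are independent and vectorize cleanly:

```c
/* Sum via a pairwise reduction tree. For this sketch, n must be a
 * power of two and buf is reduced in place. Each level halves the
 * active range; adds within a level have no dependence on each other. */
static float fold_sum_tree(float *buf, int n) {
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)
            buf[i] += buf[i + stride];  /* independent across i */
    return buf[0];
}
```

A sequential fold carries a dependence from one iteration to the next; the tree form trades that chain for log2(n) fully parallel levels.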

The memory model is a host-owned static arena where your host application allocates a flat, contiguous block of memory and passes that pointer to Lockstep_BindMemory(ptr). Lockstep does all its reads and writes exclusively within that allocated buffer. Because it doesn't have arbitrary pointers, it can't reach outside that arena, which is exactly how we mathematically guarantee the noalias pointer optimizations in LLVM.
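Host-side, the handshake looks roughly like this in C. Lockstep_BindMemory is the call named above; here it is stubbed with a mock recorder so the sketch is self-contained, and the real declaration would come from the generated C header:

```c
#include <stddef.h>
#include <stdlib.h>

/* Mock of the runtime binding call; the real Lockstep_BindMemory is
 * declared in the compiler-generated C header. */
static void *g_arena = NULL;
static void Lockstep_BindMemory(void *ptr) { g_arena = ptr; }

/* Host side: allocate one flat, contiguous block and hand it to the
 * pipeline. All kernel reads and writes stay inside this buffer. */
static void *host_setup(size_t bytes) {
    void *arena = malloc(bytes);
    if (arena)
        Lockstep_BindMemory(arena);
    return arena;
}
```

The host owns the allocation and its lifetime; the kernels only ever see offsets into the bound block.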



