GDSL is written in C++ using the STL, templates, and lambdas, amounting to about 2,600 lines of source. There is no self-hosting: neither the Lisp compiler nor the C compiler can compile itself. No operating system is implemented; the word "kernel" in the title means something else.
FYI, here is a 700-line subset-of-C compiler that can compile itself: https://github.com/valdanylchuk/xcc700 . Note that the linker and the libc are not included.
"Kernel" as used in the title means a shared processing core, not necessarily an OS kernel, which probably deserves clarification. As for Lisp and C, those are meant to be subset demonstrations, not yet full compilers, thus the phrasing. The compiler you linked is also a subset of C, with only what's needed for a minimal compiler.
I’ve been working towards making it self-hosting eventually, which is why there’s the start of code for generics in GDSL-C, though that’s a significant project which will take significant time. Add to that, a lot of people would prefer to write handlers in C++ rather than learn a new language. So it may stay as a C++ project for a while.
That being said, I was also hesitant to post this until I had it self-hosting, but I decided that being able to implement even the demonstrated subsets was interesting enough to be worth sharing. This is a project in progress, and I hope to expand it massively in the months to come.
I would love to compare notes, reading through Mu has given me a lot to think about that I may be writing about soon.
I see in Mu a mirror of my own aspirations, GDSL is something I intend to take down to the metal once I have access to a computer where I can reach deeper than my Mac allows. Though the path from here to an OS is by no means a straight one.
Mu is what I would call a MIX layer, the real substance of a process which turns one line of code into the next. Arguably, a MIX is the core of what makes a program a program, and the work of Mu, like Lisp and others, is to elevate it high enough that it becomes the whole interface.
For a deeply curious mind, the MIX is the only thread they wish to pull, because the comprehension of things at their most fundamental level is fuel enough. Unfortunately, the majority of people are not nearly so curious, thus the dominance of languages like Python.
So what makes Python so much more appealing than direct control? Pure TAST. Sugar for the hungry, and grammar that lets you say everything while knowing nothing. Somewhere, between the sugar and the substance lives the real heart of what makes a tool, and that’s what I’ve been picking at from both angles.
I would be curious to see how these could be unified, a Python TAST, Rust or Haskell DRE (for type systems and borrow checking) and a Mu MIX underneath. Let the user be lured in by the promise of ease, look under the hood to see the entire compiler fitting in just under ten thousand lines, and burrow their way down to the MIX and fundamental understanding.
Good to see you kicking around, and in a thread about things small enough to think about! I've not seen any blog posts from you in a while, Kartik, but I come back to your Lua musings from time to time.
You made my day with your kind words! I don't know if we've spoken before, but feel free to hit me up offline if you'd like to chat more about stuff like this.
Sparkie is correct, and an actually fully working C compiler in GDSL would probably be 12,000 lines or so total, possibly much more.
The point in showing this is that so much of the size of these compilers just comes from managing seams, as I discuss in the essays, and by preserving information through the compiler they shrink dramatically.
I would encourage you to compare the implementation of GDSL-C's identifiers to GCC, for a good example of how information availability allows me to link and disambiguate in just 80 lines what GCC takes upwards of a thousand spread across many files to do.
That's the MIX. As mentioned in the README, I haven't set up a system for these yet, just made the scaffolding, as a lot of the work in this space is actually around just the MIX layer.
I'll probably be posting about more backend systems once I get started on making them, this project has only been in the works for 5 weeks so far. Though I would invite anyone interested to contribute a MIX module, that's where I have the least expertise.
The demos run as interpreted trees through the x stage, no code emission yet, just direct execution of the AST. That's why they're demos rather than compilers. The scaffolding for native emission is there but empty.
My thoughts exactly, and it's what I intend to test in the coming weeks.
I say 'subset' because the point isn't "here's a full C compiler"; it's "here's two wildly different frontends handled by one system".
I tried to get started on the backend this Friday, but got frustrated with the restrictions of running on a Mac. I wanted to go down lower and work from the metal for this, so I'm waiting to get access to a device I can start programming an OS on. If I find success, I'll definitely be posting about it.
In the meantime though, I am more than open to anyone interested in helping on the backend, it's the biggest gap in my abilities and a new domain for me.
Really interesting writeup. What stood out to me most was the shift from the earlier node execution path to the streamed path. The benchmark gap between execute_r_nodes and execute_stream is huge, and the latter getting relatively close to the handwritten C++ baseline is the part I keep thinking about.
After building this, where do you think most compiler complexity actually comes from? My impression from your post is that a lot of the “millions of lines” are not from the core syntax-to-execution path itself, but from language surface area, tooling, diagnostics, optimization passes, and long-tail ecosystem baggage.
The streaming (essentially a JIT) was actually from the early architecture three weeks ago, though I'm glad you've read even the first post. On the current architecture, performance hasn't been a target yet; the core hasn't changed, though, so I could build a MIX that streams and it would reach the same benchmark.
And I love the question. A lot of the complexity is coming from the management of seams: places where we have to go from one representation of information to another. The tooling, diagnostics, and optimization passes are as large as they are precisely because of these seams. Consider a liveness pass in LLVM, which spends a lot of time reconstructing information thrown away by the compiler so it could emit SSA. In GDSL, a liveness pass is simply handlers in the e_stage; in the example I snuck into GDSL-C, print statements stamp liveness tokens onto their children via qualifiers, and at assignment nodes, those without such tokens are killed. I can do the logic in a straightforward manner because we have all the information to work with, no seams, no SSA to derive scopes from, which is why a subset of it fits in 80 lines instead of 80,000.