Making your own programming language is actually far easier that you think it is, mostly because there’s a good chance you have some definitions mixed up. Let’s get those straightened out first so we know what we’re talking about it future posts.
What is a programming language?
A programming language is an artificial language for expressing computer programs.
This is the definition for a programming language that you would typically find on Wikipedia. Notice how it says nothing about a program to “understand” and execute said program. In essence, a programming language, is a set of alphabets or symbols, abiding a set of rules, known as the grammar or syntax of the language. Creating any new programming language is pretty trivial. Define a set of symbols you are allowed to use in the language, and then define rules on how to arrange such symbols so that it may convey some meaning. Designing a good programming language is however pretty hard.
Many have tried and failed in creating the next big C or Python. And therefore I will not be trying too hard to make my programming language good. It requires absolutely zero coding, just theoretical knowledge which I lack, currently. So for now we are going to work with a grammar built off of things I like from other languages and just vibes. I may indulge in some theory in my spare time and tweak and edit my language but, no promises.
Then what am I building here?
A compiler. “Well, what is a compiler then?”, you may be thinking. And to that Wikipedia answers,
In computing, a compiler is software that translates computer code written in one programming language into another language. The name “compiler” is primarily used for programs that translate source code from a high-level programming language to a low-level programming language to create an executable program.
So when you tell someone you’ve made your own programming language, this is probably what they’ll be thinking of, instead of the actual language. The compiler is the actually hard (and fun) part. Converting the high level language you created down into simple machine code that the computer can understand and execute, is the gist of what a compiler does. But a compiler can be further broken down into 3-ish parts.
Parts of a compiler
1. The lexer
The first part of the compiler is called the lexer (or tokenizer) converts the program from a simple byte string to an array of tokens. A token is the smallest measure of data with any meaning to it. Let’s take the following code block as an example.
if (var1 < 0) {
var2 = 12 + 23 * 34;
} else {
var2 = false;
}
Assuming var1
and var2
are integers defined previously, what this code tries to do is pretty straightforward right? The code looks like this right now.
[i, f, (, v, a, r, 1, , <, 0, ), {, v, a, r, 2, =, 1, 2, +, 2, 3, *, ..., }]
The lexer takes this array of bytes, and derives tokens from them, giving some meaning to it
[if, (, var1, <, 0, ), {, var2, =, 12, +, 23, *, 34, ;, }, else, {, var2, =, false, ;, }]
Breaking the code down into tokens makes it far easier for the parser to “understand” whats going on in the program you just wrote.
2. The parser
The stream of tokens generated by the lexer is passed on to the parser, which checks the grammar/syntax of the code and structures the code into an abstract syntax tree (AST). Performing a traversal of this tree gives the correct order to execute the code.
The above token stream gets converted to the following AST
This will be the AST generated by the parser from the token stream. An interpreter takes this AST and traverses this tree to run the code. Each node is executed by executing the child nodes. The
Binary Operation
nodes, for instance, execute each operand node, and then execute the node itself. The If
node, works differently, it executes the condition node and then executes the corresponding then
or else
node.
2.5. The semantic analyzer and type checker
The semantic analyzer analyzes the AST and just makes sure everythings in check. Sometimes the code may be correct syntax-wise, but logically it wouldnt be making sense.
const a = 0;
a = 3;
There’s nothing wrong with this syntax, but symantically, it doesn’t make much sense. You can’t reassign a const right? This is what the semantic analyzer checks when we’re compiling down a program.
The type checker also runs on this similar level, and it checks if the types of variables and operands are of the correct types or not. In our case, the type checker would catch the b = false;
assignement as b
is an int
while false
is a boolean value.
3. The compiler or code-generator
Now onto the final part, the compiler (or code generator). This is the part that converts the AST generated by the parser into compiled target code, which could be x86 assembly, a binary format or even another language. In my case, a custom assembly/binary encoding that runs on a custom VM/byte-code interpreter that runs said code on a stack machine.
blt t0, x0, THEN # if (t0 < 0) jump to THEN
j ELSE # otherwise go to ELSE
THEN:
li t2, 23 # t2 = 23
li t3, 34 # t3 = 34
mul t2, t2, t3 # t2 = 23 * 34
li t3, 12 # t3 = 12
add t1, t2, t3 # var2 = 12 + (23*34)
j END
ELSE:
li t1, 0 # "false" = 0
END:
# continue...
This would an example of the assembly a C compiler would generate for the code above.
And that’s about all you’ll need to know about compilers. The rest, we’ll learn along the way as we build this thing up.
What stopped me from doing this before?
As I mentioned, I had made an interpreter for a toy language before, written from scratch in Python. It was pretty decent, I had functions implemented and everything. But I wanted a compiler. I wanted a binary I could share with anyone. So I decided I’d make one. I got all the way to a complete parser but I had absolutely no clue how to turn the AST into a binary. After some research I eventually landed on converting it into assembly code. Great news, except I didn’t know any assembly. So I went on another tangent learning about assembly and such.
After I picked up enough Arm64 assembly, I started writing the compiler again. Only to find out I still had no clue how to bridge the gap between AST and assembly. It all felt so confusing until I just gave up and moved on.
Recently, I saw a video about how someone else had made a compiler which compiled to byte-code that an interpreter then runs, just like JVM! And that scratched my compiler itch just right, even the VM itch too. So two birds one stone, I dove right into the project.
Current status of the project
Currently the compiler is statically typed. It supports integers and booleans, arithmetic, and comparisons between numbers. It also supports variable definitions and assignments. The definition of the the language right now can be seen on file on the repository.
The VM right now is a stack machine. There are flags for the comparisons, jump instructions implemented, and local variables are stored on the stack. There’s no IO right now but I hope to at least get output support soon with memory-mapped IO. I want the VM to simulate a real machine as much as a I can without getting too into the details. Basically I draw the line when I’m too lazy to implement some hyper-specific feature.
That’s all for now, you can go check out the project on the repo and I’ll see you on the next one for some actual developement!