SimpleHLST needs to read in some code, written by humans, to begin generating data structures that we can actually work with. This can take up to three stages:
-
Lexical Analysis – convert the input text into a list of tokens that distinguishes language features from whitespace, comments, pre-processor commands, etc. (“Unknown character on line 2″ errors are produced by this stage).
-
Syntax Analysis – convert the list of tokens into a syntax tree which shows the entire file hierarchically as perfectly valid expressions (this is where “Missing semicolon on line 4″ type errors come from).
-
Semantic Analysis – process the syntax tree to make sure that it makes sense according to the rules of the language. Type checking falls under this category for instance, so this is where “warning, assigning ‘int’ to ’short’ – possible loss of data” type errors come from.
SimpleHLST only needs a very basic language and I’m not planning to have it support complex data, so I don’t think we’ll need a semantic analysis stage at the moment.
An example
Suppose we have the following mathematical expression, f(t) = 2t + k, and we want to implement it in some program; we could write something like
f(t) = 2*t + k.
Before any more advanced stages can take place, the compiler must extract the information from this expression. This is exactly the same principle as we humans extracting information from text. We have languages that each have a vocabulary of words with specific definitions which we can use to convey detailed information. Each word only represents a small piece of information; the context and sentence structure contains the rest.
Programming languages are exactly the same and this is where Lexical Analysis (or, colloquially, Tokenisation) gets its name from. Programming languages tend to use symbols, or tokens, instead of words for brevity. The output of this stage would be an array that looks somewhat like the following table.
|
Index |
Type |
Data |
Index |
Type |
Data |
|
0 |
Char |
f |
5 |
Number |
2 |
|
1 |
Symbol |
( |
6 |
Symbol |
* |
|
2 |
Char |
t |
7 |
Char |
t |
|
3 |
Symbol |
) |
8 |
Symbol |
+ |
|
4 |
Symbol |
= |
9 |
Char |
k |
Now the parser can operate on this set of tokens instead of having to wade through the text itself. This example is, obviously, very simplistic and having a separate tokenising and parsing stages may even complicate things, but it is incredibly useful for handling more complex languages such as C and Verilog. (continue reading…)


Recent Comments