SimpleHLST needs to read in some code, written by humans, to begin generating data structures that we can actually work with. This can take up to three stages:

  • Lexical Analysis – convert the input text into a list of tokens that distinguishes language features from whitespace, comments, pre-processor commands, etc. (“Unknown character on line 2″ errors are produced by this stage).
  • Syntax Analysis – convert the list of tokens into a syntax tree which shows the entire file hierarchically as perfectly valid expressions (this is where “Missing semicolon on line 4″ type errors come from).
  • Semantic Analysis – process the syntax tree to make sure that it makes sense according to the rules of the language. Type checking falls under this category for instance, so this is where “warning, assigning ‘int’ to ’short’ – possible loss of data” type errors come from.

SimpleHLST only needs a very basic language and I’m not planning to have it support complex data, so I don’t think we’ll need a semantic analysis stage at the moment.

An example

Suppose we have the following mathematical expression, f(t) = 2t + k, and we want to implement it in some program; we could write something like

f(t) = 2*t + k.

Before any more advanced stages can take place, the compiler must extract the information from this expression. This is exactly the same principle as we humans extracting information from text. We have languages that each have a vocabulary of words with specific definitions which we can use to convey detailed information. Each word only represents a small piece of information; the context and sentence structure contains the rest.

Programming languages are exactly the same and this is where Lexical Analysis (or, colloquially, Tokenisation) gets its name from. Programming languages tend to use symbols, or tokens, instead of words for brevity. The output of this stage would be an array that looks somewhat like the following table.

Index

Type

Data

Index

Type

Data

0

Char

f

5

Number

2

1

Symbol

(

6

Symbol

*

2

Char

t

7

Char

t

3

Symbol

)

8

Symbol

+

4

Symbol

=

9

Char

k

Now the parser can operate on this set of tokens instead of having to wade through the text itself. This example is, obviously, very simplistic and having a separate tokenising and parsing stages may even complicate things, but it is incredibly useful for handling more complex languages such as C and Verilog.

Finite Automata

It’s very tempting to try and define all the jobs of the parser as regular expressions (regex) to begin with, but there’s a very commonly used construct that they cannot cope with: parenthesis. It’s easy to define a regex that can handle say “()” but that can’t adapt to “(())” and so on. We need something more general.

Finite Automata (FA) are abstract state machines that operate on sets of data. Despite being very abstract we can actually use regular expression to define a simple FA but we may not be able to do the reverse of this. Let’s consider a compiler reading in a keyword or a variable name.

This diagram shows a Deterministic Finite Automata (DFA) that will transition into the terminal state when it encounters a letter and will stay in that state for every alphanumeric character or underscore it encounters thereafter. Every state transition is clearly defined therefore it is said to be deterministic but this may not always be the case.

One of the state transitions in this diagram is marked with ε which means that it is taken if no other transitions match. This allows us to make very complex state transitions with optional states, possibly even bypassing large sections of the state diagram (much like the * and ? operators allow complex regular expressions). Unfortunately every state transition is no longer rigidly defined and hence they are called Non-deterministic Finite Automata (NFA).

ε-edges should be used with caution. It is perfectly acceptable (at least, in the theory) to have several ε-edges transitioning out of a single state but this means that all of these edges must be taken at once. The first deterministic edge taken on any of these branches then defines the next state of the NFA hence all other edges must be discarded. In practice this makes NFAs unattractive for compiler front-ends because they potentially need large amounts of memory and computation. It is possible to convert NFAs into DFAs but I think I’ve rambled on long enough about the theory now.

From Theory to Code

Theory is useful and worth understanding but it can sometimes be difficult to map onto actual code. I learned a lot from Tommy Carlier’s blog series before I wrote my own lexer so it might be worth looking there as well to fill in any gaps I leave. There are several ways to implement finite automata in code and the final decision will probably depend on the language you’re using. I’ve decided to write SimpleHLST in C++ which might not be the best decision but it’s the language I feel most comfortable with.

output = 15*in_1 + 39*in_2

If we look at the above expression there are only actually three types of token (words, integers and symbols) so the finite automaton only needs three states. The SimpleHLST tokeniser implements each state as a function call inside the main worker function, readNextToken, as shown below.

int Tokeniser::readNextToken ()
{
    // Skip over all whitespace
    munchWhitespace ();
 
    // Make sure we don't advance beyond the end of file
    if (atEnd ())
        m_curTokType = T_EOF;
    else if (isalpha (m_char) || m_char == '_')
        readWordToken ();
    else if (isdigit (m_char))
        readNumericalToken ();
    else // Assume it must be a symbol
        readSymbolToken ();
 
    // Return the token type
    return m_curTokType;
}

Before attempting to read a token, this function skips over all whitespace using the function shown below. The obvious improvement this function needs is to be able to skip over comments.

void Tokeniser::munchWhitespace ()
{
    // Just keep reading until there's no whitespace left.
    while (isspace (m_char) && !m_eof)
        readNextChar ();
}

Finally, readWordToken is listed below which is capable of extracting “output” and “in_1″ from the expression given earlier. The pushBuffer and extractBuffer functions operate on a std::string member variable that is used to store words and integers. An accessor function, tokenData, can then be used to find out what the token data is.

void Tokeniser::readWordToken ()
{
    // Keep reading alphanumeric characters into the buffer.
    while (isalnum (m_char) || m_char == '_')
        pushBuffer ();
 
    // Store the token details
    extractBuffer ();
    m_curTokType = T_WORD;
}

You can download the source code for the tokeniser in the form of a Visual Studio 2010 solution. I haven’t used any features particular to VS2010 so it should be possible to build with any prior version or g++ but I can’t confirm this. I’ve also included some sample code that I hope will become the HDL we’ll feed into SimpleHLST at a later stage.

To try it out: build the project and open a command prompt in the same directory as the SimpleHLST executable. Then type SimpleHLST <expression or filename> on the command line to see the output. The picture above shows some sample output.

T_EOF

Now we can read a stream in terms of tokens instead of characters we can get started on the parser! Next time, of course…