Go to the first, previous, next, last section, table of contents.


Implementation

The Parser

The `CSlang' parser is based on the ANSI C compatible grammar developed by James A. Roskind. This grammar solves almost all the shift/reduce conflicts in the ANSI C proposed draft, most of them being typedef ambiguities. The only ambiguity left is the traditional if-if-else conflict whose removal doesn't seem to worth the effort.

Here is a brief description of one of the problems solved by the grammar.

ANSI C section 3.5.4.3, "In a parameter declaration, a single typedef name in parentheses is taken to be an abstract declarator that specifies a function with a single parameter, not as redundant parentheses around the identifier". This is extended to cover the following cases:

typedef float T;
int noo(const (T[5]));
int moo(const (T(int)));

Where again the '(' immediately to the left of 'T' is interpreted as being the start of a parameter type list, and not as a redundant paren around a redeclaration of T. Hence an equivalent code fragment is:

typedef float T;
int noo(const int identifier1 (T identifier2 [5]));
int moo(const int identifier1 (T identifier2 (int identifier3)));

The parser's main job is to create a forest of type chains for each and every `constant', `identifier', `expression' and `statement' parsed and to build an execute tree corresponding to the given `C' input file. The interpretor will execute that tree after parsing will be over and perform all the operations specified.

This double purpose of the parser coerced us to implement a scheme where two (in fact three) kinds of information are propagated during parsing. The first one is the type of the expression just parsed. That is, while building expressions, each time we enhance one, we have to recompute its type and propagate the new type to upper level rules so that pertinent decisions can be taken regarding further processing of that expression.

The second information that we need to propagate during parsing is the root of the execute sub-tree currently built. As it will be explained later, the decision on the position of such a sub-tree in the big execute tree cannot be taken right away.

The third type of information that we need to propagate is the token returned by the lexer. This information might not be required when that token is first detected. For example, when such an `assignment_expression' will be parsed:

i;

by looking ahead at the ; token, the parser will be able to detect that i is an identifier, and the action corresponding to the `IDENTIFIER' rule will be run. After that, the name of the identifier (stored in the structure returned by the lexer) will no longer be needed.

However, when a function call will be parsed, as in:

f();

the parser will look ahead at the ( token and take the decision of reducing `IDENTIFIER' to `primary_expression' and later to `postfix_expression' when the name of that identifier will be required in reporting errors in checking the arguments of the call against the available prototypes.

IDENTIFIER <-- `f'
|                 .... look ahead at '('
primary_expression
postfix_expression
|       '(' <-- `('
|       |                 .... look ahead at ')'
|       |       ')' <-- `)'
+-------+-------+
|
postfix_expression
|                 .... look ahead at ';'

Parsing and Building C Types

`CSlang' will use the types provided by the native `C' compiler. We will briefly explain the algorithm for building type nodes for each of them.

A simple example will demonstrate that building a type chain is not a simple issue, even for the most common declarations. Let's consider the following declaration:

int *i;

The way this declaration is parsed might be confusing and weird at first sight. Even though one might expect the * token to be first reduced with the int and then with the identifier x, things are a little bit different:

	  .... look ahead at INT   `int'
INT <-- `int'
basic_type_name
basic_type_specifier
|                 .... look ahead at '*'   `*'
type_specifier
|       '*' <-- `*'
|       |                 .... look ahead at IDENTIFIER   `x'
|       |       IDENTIFIER <-- `x'
|       |       paren_identifier_declarator
|       |       |                 .... look ahead at ';'   `;'
|       |       identifier_declarator
|       +-------+
|       |
|       unary_identifier_declarator
|       identifier_declarator
|       declarator
|       |       $$7
|       |       |       initializer_opt
+-------+-------+-------+
|
declaring_list
|       ';' <-- `;'
+-------+
|

Even though the way this declaration is reduced might seem unnatural, it is nevertheless correct. The * token is reduced with the x token, instead of being reduced with int, because in `C' a single declaration can specify more than one variable. For example, a declaration like:

int *x, p;

will reduce int *x first but propagate the type node of int (`type_specifier') to the upper level rules so that when the p identifier will be reduced, the parser will be able to connect it with is correct type.

	  .... look ahead at INT   `int'
INT <-- `int'
basic_type_name
basic_type_specifier
|                 .... look ahead at '*'   `*'
type_specifier
|       '*' <-- `*'
|       |                 .... look ahead at IDENTIFIER   `x'
|       |       IDENTIFIER <-- `x'
|       |       paren_identifier_declarator
|       |       |                 .... look ahead at ','   `,'
|       |       identifier_declarator
|       +-------+
|       |
|       unary_identifier_declarator
|       identifier_declarator
|       declarator
|       |       $$7
|       |       |       initializer_opt
+-------+-------+-------+
|
declaring_list
|       ',' <-- `,'
|       |                 .... look ahead at IDENTIFIER   `c'
|       |       IDENTIFIER <-- `c'
|       |       paren_identifier_declarator
|       |       |                 .... look ahead at ';'   `;'
|       |       identifier_declarator
|       |       declarator
|       |       |       $$8
|       |       |       |       initializer_opt
+-------+-------+-------+-------+
|
declaring_list
|       ';' <-- `;'
+-------+
|

As I said before, it might not be obvious. One might think that a better idea would have been to reduce int and *, creating a type node denoting `pointer to int', then simply connect the x identifier with that type node. It is not correct, and the previous example shows why.

The only problem that arise from the way declarations are parsed is that after building a quasi complete type chain, as in:

int *a[2][3]

we end up with a chain of types like array 2 of array 3 of pointer to to which we have to connect the `type_specifier' int. In order to do so, we have to find the tail of the chain. This might sound inefficient at first and there are solutions, of course, but considering the fact that `C' declaration hardly exceed four or five levels of indirections, it doesn't seem to worth the trouble.

A word on type equivalence. It is important to mention that `CSlang' implements two types of type equivalence, precisely the same way `C' does. A strict equivalence is required in declarations (i.e. int f(signed char c) is not the same thing as int f(unsigned char c)) while relaxed equivalence is required in expressions.

Basic Types

`CSlang' supports all the basic types supported by `C' except for the floating point types float and double. That is, it supports int, char and void. In addition, all the basic type modifiers supported by `C' are available: long, short, signed and unsigned.

Basic types are build on the basic_type_name grammar rule. During upper level reductions, basic type can be mixed with `type qualifiers' or with `storage class' specifiers. The initial type build on basic_type_name can be incomplete, as basic_type_name is a grammar rule where tokens representing both `basic types' and `basic type modifiers' are read from the lexer. Depending on the input, type constructs representing `basic types' and `basic type modifiers' are build separately and unified later, if the unification makes sense.

As an example, here is an excerpt of how a declaration containing both `basic types' and `basic type modifiers' is reduced by the parser:

	  .... look ahead at UNSIGNED   `unsigned'
UNSIGNED <-- `unsigned'
basic_type_name
basic_type_specifier
|                 .... look ahead at SHORT   `short'
|       SHORT <-- `short'
|       basic_type_name
+-------+
|
basic_type_specifier
|                 .... look ahead at INT   `int'
|       INT <-- `int'
|       basic_type_name
+-------+
|
basic_type_specifier
|                 .... look ahead at IDENTIFIER   `i'
type_specifier
|       IDENTIFIER <-- `i'

As you can see, the parser first detects the unsigned keyword. A type node containing only a `basic type modifier' will be created. That node will be propagated and later unified with the node created similarly for the short keyword. The result will be eventually reduced again by the `basic_type_specifier' rule below and unified with the type created for the `int' keyword. This way, we end up with a single node containing one `basic type' (int) and two `basic type modifiers' (unsigned and short). The type nodes created temporary to hold the `basic type modifiers' are deleted as soon as they are no longer needed.

basic_type_specifier:
	| basic_type_specifier basic_type_name
	;

Of course, since the grammar itself accepts a superset of the language, it cannot detect conflicts in the way `basic types' and `basic type modifiers' are used. This is for the grammar actions to do. Whenever two or more `basic types' are specified or two `basic type modifiers' conflict, `CSlang' will report the error. As an example, you cannot specify both long and short.

Moreover, when trying to unify `basic types' and `basic type modifiers', if the `basic type' is missing, it is the job of the grammar actions to fill the type node with de default `basic type' (int).

Structures and Unions

First of all, the current `CSlang' implementation does not support `bit fields'. Therefore we will not discuss them.

The main problem in dealing with structures in `C' comes from the fact that one can define incomplete structures, that is structures for witch only the tag name is specified. This is useful in implementing self and forward references, as in the following examples:

struct data_tag
{
    ...
    struct data_tag *next;
};
struct data1_tag
{
    ...
    struct data2_tag *info;
};

struct data2_tag
{
    ...
    struct data1_tag *info;
};

`CSlang' will detect invalid uses of incomplete types by inspecting the size of objects it deals with. When such an object ie encountered, an error message is issued.

Unions are treated precisely the same way structures are, the only difference being that all the union's fields start at offset 0.

Arrays

`Arrays' type nodes are build a little bit different than the other nodes. The major difference is in the way the size of such nodes is computed.

While the size of `basic types', `pointers', `structures' and `unions' can be computed by simply using the sizeof() operator or sometimes adding up the individual sizes as in the case of `structures', in the case of `arrays' we cannot compute their size until the entire declaration has been parsed. The reason is that the size of the array depends on the size of the object pointed to by the last type node in the type chain. For example, the following declaration:

int *a[2][3];

creates chain of type nodes that looks like this:

	`array 2 of array 3 of pointers to int'

As you can easily see the size of the entire array depends on the knowledge of the fact that the last element in the chain is int. Moreover, all the type nodes in the chain that are of type `array' need to have their size computed at the end of the declaration parsing. Therefore, we have to recursively parse the type chain and compute the size of all `array' nodes from tail to head.

Pointers

`Pointers' are probably the most simple types. At run time, when the size of an `array' node is no longer important, `pointers' and `arrays' are treated more or less in the same way. Of course, we must prevent assignments to `arrays', as they are not `lvalues', but as far as the code emulating memory references is concerned, there is no big difference between them.

Labels

`Labels' can be defined anywhere inside a function. We cannot check and link the `jump statements' (i.e. `gotos') with their `target statements' until the end of the current function is reached. When we encounter a jump to a label that has already been defined, we simply specify the execute node of the statement associated with that label as the `target' execute tree node. Unfortunately, when the label has not yet been defined, we have to build a list of labels referenced in the current function and try to resolve them later, when the entire body of the current function will have been parsed.

Functions

At parse time we can encounter both `function' prototypes and function definitions. Of course, an unlimited number of prototypes can be encountered, but only one `function' definition is allowed.

`CSlang' follows the `C' convention regarding function prototypes specifying no arguments. Such prototypes do not actually say that the function cannot receive arguments, they simply state that the given identifier is a function and let the types of arguments be specified by a later prototype or by the function definition itself.

This leads us the a somehow tricky implementation. Each time we encounter `function prototypes' or `definitions', we have to check them against older `function prototypes' or `definitions'. While it is allowed to have several prototypes declare a function with or without arguments, if a prototype has specified that a function should receive some arguments, the `function definition' cannot conflict with it.

The concept of `strict' and `relaxed' equivalence of types is extended to functions in such a way that `strict equivalence' is required between a `function prototypes' and a `function definition', while `relaxed equivalence' is allowed between two `function prototypes'.

The Run Time System

Address Space Emulation

`Unix' programs are designed with a flat memory model in mind. That is, each `Unix' program will see its address space as a contiguous sequence of bytes, starting at some address greater than 0 and ending up at some huge address, somewhere around 4 Gb for 32bit machines or even upper on 64bit ones. The text segment is usually located at the beginning of this address space, with the data segment immediately after it. The stack usually grows downwards from the end of the address space. The data segment grows upwards. There is a small risk of overlap between the data segment and the stack but this is being monitored by the operating system.

There is an important thing we should point out here. No properly designed program makes assumptions on the locations of the text, data or stack segments into memory, or on their relative position.

With `C' being a relatively low level language, we need to emulate this things, to be able to provide the run time system with a proper data and stack segments. We won't bother emulating a text segment because the only reason why we would need it is to deal with pointers to functions. Pointer to functions will be implemented by associating them with function nodes in the execute tree, therefore eliminating the need for a text segment emulation.

The emulation of the data and stack segments is heavily based on the virtual memory subsystem found in all the modern operating systems. `CSlang' allocates two big chunks of memory at startup, relying on the underlying operating system not to actually allocate that memory, but reserving space for it in the run time heap.

Let's say we ask the `CSlang' initialization procedure to allocate 8 Mb for the data segment and 4 Mb for the stack segment. `CSlang' will allocate them using malloc(), but the operating system will actually allocate that memory only when `CSlang' will use it. The restriction is that once this memory is filled, it will be impossible to extend it.

I have taken this approach in implementing the address space emulation because it has several advantages. First, it provides a method of getting rid of the reallocation problems. `CSlang' implements pointers, thus reallocation of the data and stack segments is impossible. Second, it provides the simplest and fastest way of detecting invalid memory references. Each time a pointer is dereferred, all we have to do is check that pointer against the bounds of the data and stack segments. This way, no invalid memory reference can hang the interpretor or the program in which it is embedded.

The only disadvantage of this scheme is that `CSlang' will not be usable at its full capacity on older systems that do not provide virtual memory (like MS-DOS). To be precise, if `CSlang' will attempt to allocate 16 Mb of memory, the system will actually try to allocate that memory and most likely fail to do so. The solution is to allocate a smaller amounts of memory and to try not to exceed their limits.

Memory References

In order to prevent buggy programs to hang the interpretor, each memory access is checked against the limits of the data and stack segments. That is, `CSlang' provides for the `C' programs it interprets the same functionality the `Unix' kernel offers to user level programs. If nothing goes wrong in the interpretor itself, every attempt to read or write past the end or before the beginning of the data and stack segments will be caught and reported.

The Execute Tree

During parsing, the grammar actions are used to build the execute tree of the program to be interpreted. That is, all the constructions that are normally encountered in `C' programs will appear as a collection of nodes in the execute tree. At the end of parsing, the `CSlang' run time system will try to locate a user specified function (usually main()) and run it.

Here is the detailed description of almost all the nodes that can be found in an execute tree. Some of them express simple actions (like assignments or declaration of identifiers), others carry out sets of computations (like postfix expressions of function calls). Once a sub-expression is computed by the run time system, its value (or a pointer to it, depending on the expression's type and on whether or not it is an `lvalue') is returned to the caller in a structure of type `value_t' that also contains additional information about that expression (the type and the `lvalue' flag). This structures are located on the interpretor's stack, that is, there is no auxiliary stack for them.

An important thing to notice is that not all the nodes are connected to the execute tree from the very beginning. Quite complex execution sub-trees can be build at parse time, their root being connected to the execute tree only later, when the parser will be able to make a decision on where that sub-tree should be connected.

For example, an expression will eventually be reduced to an `assignment_expression' and later to a `comma_expression'. The `comma_expression' cannot be connected to the inner scope's list of `comma_expressions' right away because the parser must decide whether that comma expression is stand alone (it will reduce to a `statement') or it is part of an `iteration_statement' (for, do, while) or `selection_statement' (if, switch). In the later case, the root of the `comma_expression' sub-tree will be connected to the execute tree through the node of the `iteration' or `selection statement'.

Identifiers and Constants

Identifiers and constants are the most simple nodes in the execute tree nodes. Each time the run time system encounters an identifier or a constant, a `value_t' structure is initialized for it and returned to the caller.

Postfix expressions

The need for nodes containing `postfix expressions' might not be obvious. We can quite simply create a sub-tree for prefix expressions like ++i:

 ++i
  |
  +
 / \
i   1

because, as you see, the result of ++i will be used further in evaluating the rest of the expression, whatever that be. Unfortunately we cannot create a sub-tree for postfix expressions like i++ because the result of i++ is not what we need in further evaluating the expression of which i++ is a sub-expression. This is why we have to create a special kind of node that will first set up the return value as i and only after that will evaluate the sub-expression i++, whose value will be ignored.

Structure members

Structure members are another type of execute tree nodes whose evaluation is quite simple. Once the structure to which they belong has been evaluated and its address found, the `value_t' structure to be returned as the result of evaluating the structure member will be filled with the address of that structure plus the offset of the member into it. For efficiency reasons, instead of keeping the field names in the execute tree node and figure out the offsets at run time, the offsets are computed at parse time and stored directly into the execute tree node.

Array subscripts

Array subscripting is in fact an operation that applies to both pointers and arrays. As explained earlier, every pointer dereference that will access memory outside the data and stack segments emulated by the run time system will be caught and reported.

Comma expressions

Comma expressions are implemented as a node containing a list of assignment expressions that are executed, one by one, at run time. The only thing that is worth mentioning here is the fact that in `C' most `comma expressions' contain only one `assignment expression', thus being subject of some simple optimizations as explained in section Optimizations.

Conditional expressions

Nothing important about them.

Declarations

Declarations are connected to `scope' statements as soon as they are encountered. For each symbol, an appropriate entry is created in the symbol table. Since this is pretty straightforward, it needs no further explanation.

Unary operations

There is nothing special about unary operation. The result is computed and returned to the caller.

Binary operations

Binary operations have been optimized a little when basic types are involved. This has been done by using a statically defined array of types that should result from each given combination of basic types. By storing the codes corresponding to the basic types involved in the binary operation in the execute tree node, the interpretor no longer has to recompute the resulting type at run time. The implementation is based on a big switch statement that has been organised based on the basic type codes in such a way that the an optimizing compiler can immediately detect how each type should be promoted before actually performing the binary operation.

As a special case, the || and && logical operators are treated differently since the execution of the right hand expression depends on the result of the evaluation of the left hand one.

Assignment operations

The only thing that is worth mentioning about assignment operations is related to assignments to basic types. Due to the big number of possible combinations of basic types, the same optimization as in binary operations has been implemented. That is, basic type codes are kept in the execute tree nodes and used at run time to detect the precise type promotion required.

Cast operations

Yet another simple issue. Casts are checked at parse time. Casts from void or from aggregate types (structures and unions) are rejected. Casts to arrays, functions or aggregate types are rejected as well. The only thing that is left to be done at run time is to simply set the type of the expression to the type specified in the cast.

If-else Statements

The implementation of the `if-else' statement at run time is quite simple. The interpretor first evaluates the `if' condition and based on the value of the result (0 or non-zero) it will execute the true or false branch of the statement.

Switch Statements

The implementation of the `switch' statement is not very simple. It is tightly connected with the implementation of the `scope' statement. The interpretor first evaluates the `switch' condition, then searches through all the `case' labels of its `scope' statement in order to locate the entry point. Once this has been accomplished it sets up the environment in such a way that when the `scope' statement will be executed, it will start at the sub-statement pointed to by the `case' label that matched the value of the `switch' condition. If no `case' label matched, the sub-statement pointed to by the `default' label will be used as the entry point. If no `default' label is found, the statement associated with the `switch' is simply ignored.

Iteration Statements

Since `while' statements are so similar to `for' statements, they are using the same type of execute tree nodes. The implementation for both `for/while' loops and for `do' loops is simple and doesn't worth a detailed description.

Scope statements

What we call here `scope statements' or simply `scopes' are practically compound statements. We call them `scopes' because each time a compound statement is parsed or executed, a new scope is created where local variables can be defined. The life time of those variables, their `scope' is the compound statement.

`Scope' statements are necessary as part of `selection' statements, `iteration' statements, goto statements and functions. By far the most demanding is the switch statement which requires run time selectable entry points (one for each case label). The emulation of the `switch' statement has been coded carefully with the idea of getting the best performance with an implementation that is common to all the four composite statements previously mentioned.

Practically each `scope' statement has to keep a list of execute tree nodes connected to it (its children), a list of case labels, their constant expressions and target statements, the index of the default statement, etc. At run time, it keeps the number of the statement to be executed next in such a way that this number can be modified by inner goto statements or by outer switch statements when they need to coerce the interpretor to continue the execution from a specific point. The value of the run time stack is saved when a `scope' statement is entered and restored before exit.

Goto Statements

goto statements are only allowed to exit `scopes', as explained in section Jumps. They work by simply setting the number of the statement to be executed next in the destination `scope' to the index of the target statement.

Function definitions

Function definitions are in fact execute sub-trees that are connected to the big tree in all the places where calls to the functions they represent are issued.

Break Statements

The break statement will simply set a flag that will coerce every `compound statement' currently executing to stop whatever it was doing and return until the level of the innermost switch or `iteration statement' is reached. When this is accomplished, the flag is reset.

Continue Statements

The continue statement will behave more or less the same way as the break statement. It will set a flag based on which the callers will decide whether they should continue whatever they were doing or simply return until the scope of the targeted loop is reached.

Return Statements

return statements are a little bit tricky because in order to correctly convert the returned expression to the type of the function's return value, an assignment has to be performed. While in a compiler generated code the return value is stored in a register, the `CSlang' interpretor stores it in a global variable. The return statement will also set a flag that will signal the fact that the current function should return immediately. Besides being used by all the composed statements up to the level of the caller to check whether they are allowed to continue, this flag is used by the function call code to detect whether or not the current function has been terminated due to a return statement or to a natural termination of its statement (for void functions only).

In the future, it might worth switching to an implementation based on non-local jumps for the break, `continue' and `return' statements. Such an implementation might be faster, with the drawback that it might leave uncleaned things on the interpretor stack.

Function Calls

Function calls are by far the most complex and time consuming operation performed by the run time system. Every time a function call is performed, the interpretor has to assign the the call's arguments to to the formal arguments (which of course are of the same type as the function's arguments) in order to perform all the necessary type conversions and integer promotions. After that, the interpretor pushes the formals onto the emulated stack. For extra call arguments, no conversion is necessary, so they are pushed on the stack precisely the same way they have been specified (no type conversion). Even though `CSlang' does not currently implement variadic functions, with the current implementation such a feature will not be difficult to add.

Since the name of the formal arguments must be visible in the `scope' statement of the function's body, the interpretor has to create a symbol for each formal argument for which a name has been specified in the function definition.

After executing the function's body, the interpretor will check for a return value and if such a value exists (a return statement has been executed) it will be returned to the caller.

Unfortunately, since `CSlang' is an interpretor and does not perform flow control analysis, it is impossible to detect at parse time whether or not a function that should return a non-void value actually does return such a value using a return statement. However, such a situation can and will be detected and reported at run time.

Optimizations

Until now, the main effort was towards making `CSlang' as close to `C' as possible. However, during development a few simple optimization have become obvious and I have decided to implement them.

The first one is related to `comma expressions'. The code of an execute tree node implementing a `comma expressions' is a simple loop that executes each `assignment expression' in that `comma expression' and returns the value of the last one. Most `comma expressions' consist of only one `assignment expression' and when this is the case, it makes no sense to connect the `comma expression' node in the execute tree instead of directly connecting that `assignment expression'. This will both save time by not executing the additional et_run() call, and memory by saving a stack frame for function calls.

As an example, this is the `iteration_statement' rule. As you can see, it is heavily based on `comma expressions' (`comma_expression_opt' is a grammar rule that can reduce to a `comma_expression' or to nothing).

iteration_statement:
	  WHILE '(' comma_expression ')' statement
	| DO statement WHILE '(' comma_expression ')' ';'
	| FOR '(' comma_expression_opt ';'
		  comma_expression_opt ';'
		  comma_expression_opt ')' statement
	;

The second improvement is in the way function calls, assignments and return statements are stored in the execute tree nodes, particularly when basic types are involved. I have tried to put in these execute tree nodes as many information required for the default conversions as possible, so that there will be no need to recompute them at run time. Remember, `CSlang' is an interpretor, not a byte-compiler. The optimization consists of computing at parse time the basic types to which the expressions involved should be promoted and storing the codes of this basic types into the execute tree node. When the node will be run, the interpretor will know precisely what kind of integer promotions are required.

Reporting Errors

Most modern compilers are able to recover from errors and continue to parse the input, with the idea of reporting as many errors as possible. Such a feature is not present in `CSlang', in the attempt of keeping it small and simple.


Go to the first, previous, next, last section, table of contents.