Math Expression Parser (1/2)

In this article a parser for simple mathematical expressions, for example expressions of the form 5-(2*3-4) or -sin(pi/2*x) will be developed. First, a scanner for decomposing the input string into a list of tokens will be developed. Then an algorithm for converting the input to postfix notation will be discussed. Finally, we will see how to evaluate the formula in postfix notation.

GitLab repositories:

wo80/math-parser (C# version)
wo80/math-parser-ts (Typescript version)

Only the C# code will be presented in this article.

The Scanner class

Scanners (also called tokenizers or lexers) convert a source string into a sequence of syntactic units, also called tokens. Before starting the design of our parser, it is important to understand what the building blocks of the language to be scanned should look like. The scanner then ultimately works like a finite state machine, reading the input character by character and trying to recognize the tokens. The following token types are used:

public enum TokenType
{
    Operand,
    Operator,
    Parenthesis,
    Symbol,
    Function,
    Separator
}

An operand is a number (more precisely, a floating point number), e.g. 2 or 3.14. An operator is an element of the set { +, -, *, /, ^ }. A token of type parenthesis is an opening ( or a closing ) parenthesis. A symbol represents a variable or constant, like pi or e. A function is just a special type of symbol, representing functions like sin or cos. Finally, a separator token represents a separator for function arguments. A comma , is the only valid separator and the parser currently only accepts functions with a maximum of two arguments (like min and max).

In the second example expression above -sin(pi/2*x), the minus sign has a different meaning than the minus sign in the first example 5-(2*3-4). It is a so-called unary (prefix) operator, i.e. the evaluation affects only one operand (the one after the operator). The unary minus is essentially the only exception and requires special treatment in further processing during parsing.

An example of a unary postfix operator is the factorial (e.g. 5! = 1*2*3*4*5). Not only to avoid further special cases, but also because of the known problematic properties of the factorial (keyword fast growth) this operator is not supported by the parser.

Examples

The expression 5-(2*3-4) will be scanned as:

   5     Operand
   -     Operator
   (     Parenthesis
   2     Operand
   *     Operator
   3     Operand
   -     Operator
   4     Operand
   )     Parenthesis

The expression -sin(pi/2*x) will be scanned as:

   -     Operator
 sin     Function
   (     Parenthesis
  pi     Symbol
   /     Operator
   2     Operand
   *     Operator
   x     Symbol
   )     Parenthesis

The scanner has no knowledge of the semantics of the expression, so in general no checks for correctness are made when the input is decomposed. For performance reasons, however, it may be useful to perform some checks during the scanning phase. A standard example is the check for balanced parentheses.

The Token class

The Token class is a simple data record holding information about token type, content and position of the token in the input string. When a token of type operand is created, it is automatically tried to convert the string into a double. The value can be accessed using the Value property.

public class Token
{
    public Token(TokenType type, string text, int position)
    {
        Type = type;
        Text = text;
        Position = position;
    }

    public Token(TokenType type, string text, int position, double value)
        : this(type, text, position)
    {
        Value = value;
    }

    #region Public properties

    /// <summary>
    /// Gets the position of token in the input expression.
    /// </summary>
    public int Position { get; }

    /// <summary>
    /// Gets the token type.
    /// </summary>
    public TokenType Type { get; }

    /// <summary>
    /// Gets the text that is represented by the token.
    /// </summary>
    public string Text { get; }

    /// <summary>
    /// If token is of type operand, the value of the operand is returned. Otherwise, zero.
    /// </summary>
    public double Value { get; }

    #endregion

    public override string ToString() => Text;
}

The Scanner class

The Scanner class basically consists of one public method Tokenize(), which returns an array of tokens.

public static Token[] Tokenize(string expression)
{
    int position = 0;

    var token = new List<Token>();

    int length = expression.Length;

    // Counts opening and closing brackets
    int balance = 0;

    // Scan the input expression
    while (position < length)
    {
        char c = expression[position];

        if (char.IsWhiteSpace(c))
        {
            // Ignore whitespace
        }
        else if (Operator.IsKnownOperator(c))
        {
            // Detect unary prefix operators.
            if (Operator.IsUnaryPrefixOperator(c))
            {
                int count = token.Count;

                if (count == 0 || token[count - 1].Type == TokenType.Operator
                               || token[count - 1].Type == TokenType.Separator
                               || token[count - 1].Text.Equals("("))
                {
                    // Unary minus is the only prefix operator.
                    c = '~';
                }
            }

            token.Add(new Token(TokenType.Operator, c.ToString(), position));
        }
        else if (c == '(')
        {
            token.Add(new Token(TokenType.Parenthesis, "(", position));
            balance++;
        }
        else if (c == ')')
        {
            balance--;

            if (balance < 0)
            {
                throw new ScannerException("Unbalanced parenthesis", position);
            }

            /*
            int count = token.Count;

            // Empty parenthesis are not allowed
            if (count == 0 || token[count - 1].Text.Equals("("))
            {
                throw new ScannerException("Invalid parenthesis", position);
            }
            */

            token.Add(new Token(TokenType.Parenthesis, ")", position));
        }
        else if (char.IsDigit(c) || c == '.')
        {
            token.Add(ReadNumber(expression, length, ref position));
        }
        else if (char.IsLetter(c) || c == '_')
        {
            token.Add(ReadSymbol(expression, length, ref position));
        }
        else if (c == Parser.ArgumentSeparator)
        {
            token.Add(new Token(TokenType.Separator, c.ToString(), position));
        }
        else
        {
            throw new ScannerException("Unsupported input character", position);
        }

        position++;
    }

    if (balance != 0)
    {
        throw new ScannerException("Unbalanced parenthesis", position - 1);
    }

    return token.ToArray();
}

The private ReadNumber() method is called to recognize numbers:

private static Token ReadNumber(string expression, int length, ref int position)
{
    // A regex like "\d*\.?\d+([eE][-+]?\d+)?" could do the same.

    int start = position;
    int i = position;

    var sb = new StringBuilder(16);

    // Read digits before the decimal point.
    consumeDigits();

    if (i < length && expression[i] == '.')
    {
        sb.Append('.');
        i++;

        // Read digits after the decimal point.
        consumeDigits();
    }

    // Read exponential part.
    if (i < length && char.ToLower(expression[i]) == 'e')
    {
        if (expression[i - 1] == '.')
        {
            throw new ScannerException("Invalid floating point format", i);
        }

        sb.Append('e');
        i++;

        if (i == length)
        {
            throw new ScannerException("Invalid floating point format", i);
        }

        if (expression[i] == '+' || expression[i] == '-')
        {
            sb.Append(expression[i]);
            i++;
        }

        if (i == length)
        {
            throw new ScannerException("Invalid floating point format", i);
        }

        if (!char.IsDigit(expression[i]))
        {
            throw new ScannerException("Invalid floating point format", i);
        }

        // Read exponential digits.
        consumeDigits();
    }

    string text = sb.ToString();

    if (!double.TryParse(text, NumberStyles.Float, Parser.NumberFormat, out var value))
    {
        throw new ScannerException("Invalid floating point format", start);
    }

    position = (i < length) ? i - 1 : i;

    return new Token(TokenType.Operand, text, start, value);

    void consumeDigits()
    {
        while (i < length && char.IsDigit(expression[i]))
        {
            sb.Append(expression[i]);
            i++;
        }
    }
}

Note that the scanner will also split expressions like 4.5.5 into the tokens 4.5 and .5. We don’t care about this in the scanning phase. In the course of further parsing such input expressions are recognized as incorrect.

The private method ReadSymbol() returns a token of type symbol. A symbol starts with either a letter or an underscore. Valid identifiers are for example x, sin, pi or _x_pow_2. A function is recognized by the fact that the symbol is followed by an opening parenthesis.

private static Token ReadSymbol(string expression, int length, ref int position)
{
    var type = TokenType.Symbol;

    int start = position;
    int i = position;

    var sb = new StringBuilder(8);

    while (i < length && (char.IsLetterOrDigit(expression[i]) || expression[i] == '_'))
    {
        sb.Append(expression[i]);
        i++;
    }

    // There may be space between function name and arguments
    while (i < length && char.IsWhiteSpace(expression[i])) i++;

    position = i - 1;

    if (i < length && expression[i] == '(')
    {
        type = TokenType.Function;
    }

    return new Token(type, sb.ToString(), start);
}

Take a look at the unit tests to see how the Scanner is used.

The Parser class

After successful scanning of the input, the parser is responsible to transform the list of input tokens in a way, so that the expression can be easily evaluated.

The way we learn how to write down mathematical expressions in school is called infix notation. Unfortunately, the infix notation requires knowledge of some implicit math rules which makes it hard for automatic interpretation by a computer. Let us consider the expression 4-2*3. A correct interpretation is only possible if we are familiar with the priorities of the operators. Reading and applying the operators from left to right would lead to a wrong result. Another problem is the associativity of operators. The expression 4/2*3 evaluates to 6 due to the left-associativity of the operators. The expression 4^2^3 evaluates to 65536 since the exponential operator ^ is considered to be a right-associative.

All the problems the infix notation brings along are avoided in postfix notation. It is also called Reverse Polish Notation (RPN) and corresponds to a post-order traversal of the expression’s syntax tree. A queue is used to store the tokens in postfix notation. Information about parenthesis, operator priority and associativity is determined by the position of the tokens in the queue.

The Shunting-Yard Algorithm

The shunting-yard algorithm, which goes back to Edsger W. Dijkstra, will compute the RPN of an input expression in infix notation. The postfix expression is inserted piece by piece into an output queue. Furthermore, a temporary stack is needed, on which operators, functions and opening brackets are stored during the processing of the input tokens. The pseudo code for the algorithm:

For each token in the input array:

If token is an operand, add it to the output queue.
If token is a function, push it onto operators stack.
If token is a symbol
- If token is a known constant, change type to operand and add it to the output queue.
- If token is a variable, add it to the output queue.
If token is a unary prefix operator, push it onto operators stack.
If token is a (binary) operator A:
- While operators stack contains an operator B that has higher precedence than A or has same precedence and is left-associative
  - Pop B from stack and add it to the output queue.
- Push A onto operators stack.
If token is an argument separator:
- While operator A on top of the operators stack is not a left parenthesis:
  - Add A to output queue.
If token is left parenthesis, push it onto operators stack.
If token is right parenthesis:
- While operator A on top of the operators stack is not a left parenthesis:
  - Pop A and add to output queue.
- Pop the left parenthesis from the operators stack and discard it.
- If item on top of the operators stack is a function, add it to output queue.

When all tokens are read:

While there are tokens on the operators stack:
- Pop operator and add it to the output queue.

The ToPostfix() method of the Parser class converts the infix input token array to postfix notation using the shunting-yard algorithm:

private void ToPostfix(Token[] input)
{
    int length = input.Length;

    if (input[length - 1].Type == TokenType.Operator)
    {
        // We don't support postfix operators, so the last token can never be one.
        throw new ParserException("Invalid postfix operator", input[length - 1]);
    }

    // The output queue.
    var output = new List<Token>(length);

    // Temporary operator stack.
    var operators = new Stack<Token>(length);

    // For all tokens in the array
    for (int i = 0; i < length; i++)
    {
        var token = input[i];

        if (token.Type == TokenType.Operand)
        {
            // Encountered two successive operands.
            if (i > 0 && input[i - 1].Type == TokenType.Operand)
            {
                throw new ParserException("Invalid operand token", token);
            }

            // Enqueue operand
            output.Add(token);
        }
        else if (token.Type == TokenType.Symbol)
        {
            if (Constants.TryGetValue(token.Text, out double value))
            {
                output.Add(new Token(TokenType.Operand, token.Text, token.Position, value));
            }
            else if (Variables.ContainsKey(token.Text))
            {
                output.Add(token);
            }
            else
            {
                throw new ParserException("Undefined symbol", token);
            }
        }
        else if (token.Type == TokenType.Function)
        {
            if (Function.IsKnownFunction(token.Text))
            {
                operators.Push(token);
            }
            else
            {
                throw new ParserException("Undefined function", token);
            }
        }
        else if (token.Type == TokenType.Operator)
        {
            if (token.IsUnaryMinus())
            {
                operators.Push(token);
                continue;
            }

            if (operators.Count > 0)
            {
                int a = Operator.GetPrecedence(token.Text);
                int b = Operator.GetPrecedence(operators.Peek().Text);

                bool leftAssociative = Operator.IsLeftAssociative(token.Text);

                while (operators.Count > 0 && (a < b || (a == b && leftAssociative)))
                {
                    // Pop operator from stack and enqueue
                    output.Add(operators.Pop());

                    if (operators.Count > 0)
                    {
                        b = Operator.GetPrecedence(operators.Peek().Text);
                    }
                }
            }

            // Push operator to stack
            operators.Push(token);
        }
        else if (token.Type == TokenType.Separator)
        {
            // Pop operators from stack until there is an opening parenthesis
            while (operators.Count > 0 && !operators.Peek().Text.Equals("("))
            {
                output.Add(operators.Pop());
            }
        }
        else if (token.Text.Equals("("))
        {
            // Push to stack
            operators.Push(token);
        }
        else if (token.Text.Equals(")"))
        {
            // Pop operators from stack until there is an opening parenthesis
            while (operators.Count > 0)
            {
                var t = operators.Peek();

                if (t.Type == TokenType.Parenthesis && t.Text.Equals("("))
                {
                    operators.Pop();
                    break;
                }

                output.Add(operators.Pop());
            }

            // Check if a function is on top the stack
            if (operators.Count > 0 && operators.Peek().Type == TokenType.Function)
            {
                output.Add(operators.Pop());
            }
        }
    }

    // Pop the remaining operators from stack and enqueue them
    while (operators.Count > 0)
    {
        if (operators.Peek().Type == TokenType.Parenthesis)
        {
            throw new ParserException("Invalid parenthesis", operators.Peek());
        }

        output.Add(operators.Pop());
    }
    
    expression = output.ToArray();
}

Take a look at the unit tests to see how the Parser is used.

Example

The algorithm will now be applied to the example expression 5-(2*3-4) consisting of nine tokens.

   Infix-Notation | Operators Stack | Postfix Queue
5 - ( 2 * 3 - 4 ) |                 |
  - ( 2 * 3 - 4 ) |                 | 5
    ( 2 * 3 - 4 ) |  -              | 5
      2 * 3 - 4 ) |  - (            | 5
        * 3 - 4 ) |  - (            | 5 2
          3 - 4 ) |  - ( *          | 5 2
            - 4 ) |  - ( *          | 5 2 3
              4 ) |  - ( -          | 5 2 3 *
                ) |  - ( -          | 5 2 3 * 4
                  |  -              | 5 2 3 * 4 -
                  |                 | 5 2 3 * 4 - -

The Expression class

The parser returns an instance of the Expression class, which will hold the expression in postfix notation. Evaluation of the postfix expression is very easy. All that is needed is an evaluation stack on which the already calculated values are stored during the processing of the postfix queue:

While there are tokens in the queue:

If token is an operand, push its value onto the evaluation stack.
If token is an operator O:
- If O is a binary operator:
  - Pop value op2 from stack.
  - Pop value op1 from stack.
  - Push value of operation O(op1, op2) onto the stack.
- If O is unary operator:
  - Pop value op from stack.
  - Push value of operation O(op) onto the stack.
If token is a function F:
- If F is a binary function:
  - Pop value op2 from stack.
  - Pop value op1 from stack.
  - Push value of operation F(op1, op2) onto the stack.
- If F is unary function:
  - Pop value op from stack.
  - Push value of operation F(op) onto the stack.
If token is a variable, push its value onto the stack.

The single top element on the stack is the value of the expression.

If the stack contains more than one element after passing the queue, the input expression is invalid.

public double Evaluate()
{
    int length = expression.Length;

    if (length == 0)
    {
        throw new InvalidOperationException();
    }

    var stack = new double[length];
    int index = 0;

    try
    {
        for (int i = 0; i < length; i++)
        {
            var token = expression[i];
            
            if (token.Type == TokenType.Operand)
            {
                stack[index++] = token.Value;
            }
            else if (token.Type == TokenType.Operator)
            {
                if (Operator.IsBinaryOperator(token.Text))
                {
                    double op2 = stack[--index];
                    double op1 = stack[--index];

                    stack[index++] = Operator.Evaluate(token.Text, op1, op2);
                }
                else
                {
                    double op = stack[--index];
                    stack[index++] = Operator.Evaluate(token.Text, op);
                }
            }
            else if (token.Type == TokenType.Function)
            {
                if (Function.TryGetFunction(token.Text, out Func1 func1))
                {
                    double op = stack[--index];

                    stack[index++] = func1(op);
                }
                else if (Function.TryGetFunction(token.Text, out Func2 func2))
                {
                    double op2 = stack[--index];
                    double op1 = stack[--index];

                    stack[index++] = func2(op1, op2);
                }
            }
            else if (token.Type == TokenType.Symbol)
            {
                stack[index++] = Variables[token.Text];
            }
            else
            {
                throw new ParserException("Unknown token", token);
            }
        }
    }
    catch (Exception e)
    {
        throw new ParserException("Invalid postfix expression", e);
    }

    if (index != 1)
    {
        throw new ParserException("Error evaluating postfix expression");
    }

    int digits = Round <= 0 ? 12 : Math.Min(16, Round);

    return Math.Round(stack[0], digits);
}

Example

The evaluation algorithm will now be applied to the postfix queue 5 2 3 * 4 - -:

Postfix Queue | Evaluation Stack
5 2 3 * 4 - - |
  2 3 * 4 - - | 5.0
    3 * 4 - - | 5.0  2.0
      * 4 - - | 5.0  2.0  3.0
        4 - - | 5.0  6.0
          - - | 5.0  6.0  4.0
            - | 5.0  2.0
              | 3.0

Remarks

The parser supports adding user-defined functions using the static Parser.Functions property. Functions with zero, one or two arguments are supported. Constants can be added using the static Parser.Constants property. For example

Parser.Constants.Add("one", 1);
Parser.Functions.Add("sqr", x => x * x);
Parser.Functions.Add("rnd", () => Random.Shared.NextDouble());

In case you should wonder why an expression like -5+5.1 returns 0.0999999999999996, I recommend reading What Every Computer Scientist Should Know About Floating-Point Arithmetic. The Expression class provides a static property Round. Setting it to its maximum allowed value 15, the previous expression evaluates to 0.1, which might be more convenient to display to a user.

Please use the contact form or the GitLab issues section for comments.