In this article a parser for simple mathematical expressions, for example expressions of the form 5-(2*3-4)
or -sin(pi/2*x)
will be developed. First, a scanner for decomposing the input string into a list of tokens will be developed. Then an algorithm for converting the input to postfix notation will be discussed. Finally, we will see how to evaluate the formula in postfix notation.
GitLab repositories:
- wo80/math-parser (C# version)
- wo80/math-parser-ts (Typescript version)
Only the C# code will be presented in this article.
The Scanner class
Scanners (also called tokenizers or lexers) convert a source string into a sequence of syntactic units, also called tokens. Before starting the design of our parser, it is important to understand what the building blocks of the language to be scanned should look like. The scanner then ultimately works like a finite state machine, reading the input character by character and trying to recognize the tokens. The following token types are used:
public enum TokenType
{
Operand,
Operator,
Parenthesis,
Symbol,
Function,
Separator
}
An operand is a number (more precisely, a floating point number), e.g. 2
or 3.14
. An operator is an element of the set { +, -, *, /, ^ }
. A token of type parenthesis is an opening (
or a closing )
parenthesis. A symbol represents a variable or constant, like pi
or e
. A function is just a special type of symbol, representing functions like sin
or cos
. Finally, a separator token represents a separator for function arguments. A comma ,
is the only valid separator and the parser currently only accepts functions with a maximum of two arguments (like min
and max
).
In the second example expression above -sin(pi/2*x)
, the minus sign has a different meaning than the minus sign in the first example 5-(2*3-4)
. It is a so-called unary (prefix) operator, i.e. the evaluation affects only one operand (the one after the operator). The unary minus is essentially the only exception and requires special treatment in further processing during parsing.
An example of a unary postfix operator is the factorial (e.g. 5! = 1*2*3*4*5
). Not only to avoid further special cases, but also because of the known problematic properties of the factorial (keyword fast growth) this operator is not supported by the parser.
Examples
The expression 5-(2*3-4)
will be scanned as:
5 Operand
- Operator
( Parenthesis
2 Operand
* Operator
3 Operand
- Operator
4 Operand
) Parenthesis
The expression -sin(pi/2*x)
will be scanned as:
- Operator
sin Function
( Parenthesis
pi Symbol
/ Operator
2 Operand
* Operator
x Symbol
) Parenthesis
The scanner has no knowledge of the semantics of the expression, so in general no checks for correctness are made when the input is decomposed. For performance reasons, however, it may be useful to perform some checks during the scanning phase. A standard example is the check for balanced parentheses.
The Token class
The Token
class is a simple data record holding information about token type, content and position of the token in the input string. When a token of type operand is created, it is automatically tried to convert the string into a double
. The value can be accessed using the Value
property.
public class Token
{
public Token(TokenType type, string text, int position)
{
Type = type;
Text = text;
Position = position;
}
public Token(TokenType type, string text, int position, double value)
: this(type, text, position)
{
Value = value;
}
#region Public properties
/// <summary>
/// Gets the position of token in the input expression.
/// </summary>
public int Position { get; }
/// <summary>
/// Gets the token type.
/// </summary>
public TokenType Type { get; }
/// <summary>
/// Gets the text that is represented by the token.
/// </summary>
public string Text { get; }
/// <summary>
/// If token is of type operand, the value of the operand is returned. Otherwise, zero.
/// </summary>
public double Value { get; }
#endregion
public override string ToString() => Text;
}
The Scanner class
The Scanner
class basically consists of one public method Tokenize()
, which returns an array of tokens.
public static Token[] Tokenize(string expression)
{
int position = 0;
var token = new List<Token>();
int length = expression.Length;
// Counts opening and closing brackets
int balance = 0;
// Scan the input expression
while (position < length)
{
char c = expression[position];
if (char.IsWhiteSpace(c))
{
// Ignore whitespace
}
else if (Operator.IsKnownOperator(c))
{
// Detect unary prefix operators.
if (Operator.IsUnaryPrefixOperator(c))
{
int count = token.Count;
if (count == 0 || token[count - 1].Type == TokenType.Operator
|| token[count - 1].Type == TokenType.Separator
|| token[count - 1].Text.Equals("("))
{
// Unary minus is the only prefix operator.
c = '~';
}
}
token.Add(new Token(TokenType.Operator, c.ToString(), position));
}
else if (c == '(')
{
token.Add(new Token(TokenType.Parenthesis, "(", position));
balance++;
}
else if (c == ')')
{
balance--;
if (balance < 0)
{
throw new ScannerException("Unbalanced parenthesis", position);
}
/*
int count = token.Count;
// Empty parenthesis are not allowed
if (count == 0 || token[count - 1].Text.Equals("("))
{
throw new ScannerException("Invalid parenthesis", position);
}
*/
token.Add(new Token(TokenType.Parenthesis, ")", position));
}
else if (char.IsDigit(c) || c == '.')
{
token.Add(ReadNumber(expression, length, ref position));
}
else if (char.IsLetter(c) || c == '_')
{
token.Add(ReadSymbol(expression, length, ref position));
}
else if (c == Parser.ArgumentSeparator)
{
token.Add(new Token(TokenType.Separator, c.ToString(), position));
}
else
{
throw new ScannerException("Unsupported input character", position);
}
position++;
}
if (balance != 0)
{
throw new ScannerException("Unbalanced parenthesis", position - 1);
}
return token.ToArray();
}
The private ReadNumber()
method is called to recognize numbers:
private static Token ReadNumber(string expression, int length, ref int position)
{
// A regex like "\d*\.?\d+([eE][-+]?\d+)?" could do the same.
int start = position;
int i = position;
var sb = new StringBuilder(16);
// Read digits before the decimal point.
consumeDigits();
if (i < length && expression[i] == '.')
{
sb.Append('.');
i++;
// Read digits after the decimal point.
consumeDigits();
}
// Read exponential part.
if (i < length && char.ToLower(expression[i]) == 'e')
{
if (expression[i - 1] == '.')
{
throw new ScannerException("Invalid floating point format", i);
}
sb.Append('e');
i++;
if (i == length)
{
throw new ScannerException("Invalid floating point format", i);
}
if (expression[i] == '+' || expression[i] == '-')
{
sb.Append(expression[i]);
i++;
}
if (i == length)
{
throw new ScannerException("Invalid floating point format", i);
}
if (!char.IsDigit(expression[i]))
{
throw new ScannerException("Invalid floating point format", i);
}
// Read exponential digits.
consumeDigits();
}
string text = sb.ToString();
if (!double.TryParse(text, NumberStyles.Float, Parser.NumberFormat, out var value))
{
throw new ScannerException("Invalid floating point format", start);
}
position = (i < length) ? i - 1 : i;
return new Token(TokenType.Operand, text, start, value);
void consumeDigits()
{
while (i < length && char.IsDigit(expression[i]))
{
sb.Append(expression[i]);
i++;
}
}
}
Note that the scanner will also split expressions like 4.5.5
into the tokens 4.5
and .5
. We don’t care about this in the scanning phase. In the course of further parsing such input expressions are recognized as incorrect.
The private method ReadSymbol()
returns a token of type symbol. A symbol starts with either a letter or an underscore. Valid identifiers are for example x
, sin
, pi
or _x_pow_2
. A function is recognized by the fact that the symbol is followed by an opening parenthesis.
private static Token ReadSymbol(string expression, int length, ref int position)
{
var type = TokenType.Symbol;
int start = position;
int i = position;
var sb = new StringBuilder(8);
while (i < length && (char.IsLetterOrDigit(expression[i]) || expression[i] == '_'))
{
sb.Append(expression[i]);
i++;
}
// There may be space between function name and arguments
while (i < length && char.IsWhiteSpace(expression[i])) i++;
position = i - 1;
if (i < length && expression[i] == '(')
{
type = TokenType.Function;
}
return new Token(type, sb.ToString(), start);
}
Take a look at the unit tests to see how the Scanner
is used.
The Parser class
After successful scanning of the input, the parser is responsible to transform the list of input tokens in a way, so that the expression can be easily evaluated.
The way we learn how to write down mathematical expressions in school is called infix notation. Unfortunately, the infix notation requires knowledge of some implicit math rules which makes it hard for automatic interpretation by a computer. Let us consider the expression 4-2*3
. A correct interpretation is only possible if we are familiar with the priorities of the operators. Reading and applying the operators from left to right would lead to a wrong result. Another problem is the associativity of operators. The expression 4/2*3
evaluates to 6
due to the left-associativity of the operators. The expression 4^2^3
evaluates to 65536
since the exponential operator ^
is considered to be a right-associative.
All the problems the infix notation brings along are avoided in postfix notation. It is also called Reverse Polish Notation (RPN) and corresponds to a post-order traversal of the expression’s syntax tree. A queue is used to store the tokens in postfix notation. Information about parenthesis, operator priority and associativity is determined by the position of the tokens in the queue.
The Shunting-Yard Algorithm
The shunting-yard algorithm, which goes back to Edsger W. Dijkstra, will compute the RPN of an input expression in infix notation. The postfix expression is inserted piece by piece into an output queue. Furthermore, a temporary stack is needed, on which operators, functions and opening brackets are stored during the processing of the input tokens. The pseudo code for the algorithm:
For each token in the input array:
- If
token
is an operand, add it to theoutput
queue. - If
token
is a function, push it ontooperators
stack. - If
token
is a symbol- If
token
is a known constant, change type to operand and add it to theoutput
queue. - If
token
is a variable, add it to theoutput
queue.
- If
- If
token
is a unary prefix operator, push it ontooperators
stack. - If
token
is a (binary) operatorA
:- While
operators
stack contains an operatorB
that has higher precedence thanA
or has same precedence and is left-associative- Pop
B
from stack and add it to theoutput
queue.
- Pop
- Push
A
ontooperators
stack.
- While
- If
token
is an argument separator:- While operator
A
on top of theoperators
stack is not a left parenthesis:- Add
A
tooutput
queue.
- Add
- While operator
- If
token
is left parenthesis, push it ontooperators
stack. - If
token
is right parenthesis:- While operator
A
on top of theoperators
stack is not a left parenthesis:- Pop
A
and add tooutput
queue.
- Pop
- Pop the left parenthesis from the
operators
stack and discard it. - If item on top of the
operators
stack is a function, add it tooutput
queue.
- While operator
When all tokens are read:
- While there are tokens on the
operators
stack:- Pop operator and add it to the
output
queue.
- Pop operator and add it to the
The ToPostfix()
method of the Parser
class converts the infix input token array to postfix notation using the shunting-yard algorithm:
private void ToPostfix(Token[] input)
{
int length = input.Length;
if (input[length - 1].Type == TokenType.Operator)
{
// We don't support postfix operators, so the last token can never be one.
throw new ParserException("Invalid postfix operator", input[length - 1]);
}
// The output queue.
var output = new List<Token>(length);
// Temporary operator stack.
var operators = new Stack<Token>(length);
// For all tokens in the array
for (int i = 0; i < length; i++)
{
var token = input[i];
if (token.Type == TokenType.Operand)
{
// Encountered two successive operands.
if (i > 0 && input[i - 1].Type == TokenType.Operand)
{
throw new ParserException("Invalid operand token", token);
}
// Enqueue operand
output.Add(token);
}
else if (token.Type == TokenType.Symbol)
{
if (Constants.TryGetValue(token.Text, out double value))
{
output.Add(new Token(TokenType.Operand, token.Text, token.Position, value));
}
else if (Variables.ContainsKey(token.Text))
{
output.Add(token);
}
else
{
throw new ParserException("Undefined symbol", token);
}
}
else if (token.Type == TokenType.Function)
{
if (Function.IsKnownFunction(token.Text))
{
operators.Push(token);
}
else
{
throw new ParserException("Undefined function", token);
}
}
else if (token.Type == TokenType.Operator)
{
if (token.IsUnaryMinus())
{
operators.Push(token);
continue;
}
if (operators.Count > 0)
{
int a = Operator.GetPrecedence(token.Text);
int b = Operator.GetPrecedence(operators.Peek().Text);
bool leftAssociative = Operator.IsLeftAssociative(token.Text);
while (operators.Count > 0 && (a < b || (a == b && leftAssociative)))
{
// Pop operator from stack and enqueue
output.Add(operators.Pop());
if (operators.Count > 0)
{
b = Operator.GetPrecedence(operators.Peek().Text);
}
}
}
// Push operator to stack
operators.Push(token);
}
else if (token.Type == TokenType.Separator)
{
// Pop operators from stack until there is an opening parenthesis
while (operators.Count > 0 && !operators.Peek().Text.Equals("("))
{
output.Add(operators.Pop());
}
}
else if (token.Text.Equals("("))
{
// Push to stack
operators.Push(token);
}
else if (token.Text.Equals(")"))
{
// Pop operators from stack until there is an opening parenthesis
while (operators.Count > 0)
{
var t = operators.Peek();
if (t.Type == TokenType.Parenthesis && t.Text.Equals("("))
{
operators.Pop();
break;
}
output.Add(operators.Pop());
}
// Check if a function is on top the stack
if (operators.Count > 0 && operators.Peek().Type == TokenType.Function)
{
output.Add(operators.Pop());
}
}
}
// Pop the remaining operators from stack and enqueue them
while (operators.Count > 0)
{
if (operators.Peek().Type == TokenType.Parenthesis)
{
throw new ParserException("Invalid parenthesis", operators.Peek());
}
output.Add(operators.Pop());
}
expression = output.ToArray();
}
Take a look at the unit tests to see how the Parser
is used.
Example
The algorithm will now be applied to the example expression 5-(2*3-4)
consisting of nine tokens.
Infix-Notation | Operators Stack | Postfix Queue
5 - ( 2 * 3 - 4 ) | |
- ( 2 * 3 - 4 ) | | 5
( 2 * 3 - 4 ) | - | 5
2 * 3 - 4 ) | - ( | 5
* 3 - 4 ) | - ( | 5 2
3 - 4 ) | - ( * | 5 2
- 4 ) | - ( * | 5 2 3
4 ) | - ( - | 5 2 3 *
) | - ( - | 5 2 3 * 4
| - | 5 2 3 * 4 -
| | 5 2 3 * 4 - -
The Expression class
The parser returns an instance of the Expression
class, which will hold the expression in postfix notation. Evaluation of the postfix expression is very easy. All that is needed is an evaluation stack on which the already calculated values are stored during the processing of the postfix queue:
While there are tokens in the queue:
- If token is an operand, push its value onto the evaluation stack.
- If token is an operator O:
- If O is a binary operator:
- Pop value
op2
from stack. - Pop value
op1
from stack. - Push value of operation O(op1, op2) onto the stack.
- Pop value
- If O is unary operator:
- Pop value
op
from stack. - Push value of operation O(op) onto the stack.
- Pop value
- If O is a binary operator:
- If token is a function F:
- If F is a binary function:
- Pop value
op2
from stack. - Pop value
op1
from stack. - Push value of operation F(op1, op2) onto the stack.
- Pop value
- If F is unary function:
- Pop value
op
from stack. - Push value of operation F(op) onto the stack.
- Pop value
- If F is a binary function:
- If token is a variable, push its value onto the stack.
The single top element on the stack is the value of the expression.
If the stack contains more than one element after passing the queue, the input expression is invalid.
public double Evaluate()
{
int length = expression.Length;
if (length == 0)
{
throw new InvalidOperationException();
}
var stack = new double[length];
int index = 0;
try
{
for (int i = 0; i < length; i++)
{
var token = expression[i];
if (token.Type == TokenType.Operand)
{
stack[index++] = token.Value;
}
else if (token.Type == TokenType.Operator)
{
if (Operator.IsBinaryOperator(token.Text))
{
double op2 = stack[--index];
double op1 = stack[--index];
stack[index++] = Operator.Evaluate(token.Text, op1, op2);
}
else
{
double op = stack[--index];
stack[index++] = Operator.Evaluate(token.Text, op);
}
}
else if (token.Type == TokenType.Function)
{
if (Function.TryGetFunction(token.Text, out Func1 func1))
{
double op = stack[--index];
stack[index++] = func1(op);
}
else if (Function.TryGetFunction(token.Text, out Func2 func2))
{
double op2 = stack[--index];
double op1 = stack[--index];
stack[index++] = func2(op1, op2);
}
}
else if (token.Type == TokenType.Symbol)
{
stack[index++] = Variables[token.Text];
}
else
{
throw new ParserException("Unknown token", token);
}
}
}
catch (Exception e)
{
throw new ParserException("Invalid postfix expression", e);
}
if (index != 1)
{
throw new ParserException("Error evaluating postfix expression");
}
int digits = Round <= 0 ? 12 : Math.Min(16, Round);
return Math.Round(stack[0], digits);
}
Example
The evaluation algorithm will now be applied to the postfix queue 5 2 3 * 4 - -
:
Postfix Queue | Evaluation Stack
5 2 3 * 4 - - |
2 3 * 4 - - | 5.0
3 * 4 - - | 5.0 2.0
* 4 - - | 5.0 2.0 3.0
4 - - | 5.0 6.0
- - | 5.0 6.0 4.0
- | 5.0 2.0
| 3.0
Remarks
The parser supports adding user-defined functions using the static Parser.Functions
property. Functions with zero, one or two arguments are supported.
Constants can be added using the static Parser.Constants
property. For example
Parser.Constants.Add("one", 1);
Parser.Functions.Add("sqr", x => x * x);
Parser.Functions.Add("rnd", () => Random.Shared.NextDouble());
In case you should wonder why an expression like -5+5.1
returns 0.0999999999999996
, I recommend reading What Every Computer Scientist Should Know About Floating-Point Arithmetic. The Expression
class provides a static property Round
. Setting it to its maximum allowed value 15, the previous expression evaluates to 0.1
, which might be more convenient to display to a user.
Please use the contact form or the GitLab issues section for comments.