In this article, we will learn how to efficiently read structured data from large text files. The basic idea is to avoid the creation of temporary strings, thus reducing memory pressure and avoiding garbage collections. The article is motivated by Strings Are Evil, written by Indy Singh, but adds some points of reusability:
- A StringTokenizer class removing the need to call string.Split(...)
- An efficient floating point number parser
- Buffer classes providing similar functionality to the .NET StringBuilder
Before going into details, let me explain what I mean by structured data.
Source code:
Motivation
The MatrixMarket format is a text file format used in scientific computing to store dense or sparse matrices. If you don't have a mathematical background and find yourself asking "What is the Matrix?": an m x n matrix is a 2-dimensional array of numbers with m rows and n columns.
This is a rather simple and non-mathematical definition, but if you are a programmer, just think of a 2-dimensional array of numbers. A matrix is called sparse if most of its entries are zero. A collection of sparse matrices can be found at sparse.tamu.edu and we will use the FIDAP example 11 matrix as our test matrix (direct download). This matrix has 16,614 rows and columns and 1,096,948 nonzeros, so the density is less than 0.004, which is pretty sparse.
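For illustration, a small matrix in MatrixMarket coordinate format looks like this (lines starting with % are comments, the size line gives rows, columns and nonzeros, and the entries are 1-based):
%%MatrixMarket matrix coordinate real general
% A 3 x 3 sparse matrix with 4 nonzeros.
3 3 4
1 1 2.0
2 2 3.5
3 1 -1.0
3 3 4.25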
Let us parse the MatrixMarket file using CSparse and see how we perform, using the following code:
Console.WriteLine("Test CSparse MatrixMarketReader");
var fi = new FileInfo(file);
if (!fi.Exists)
{
throw new Exception("File not found: " + file);
}
Console.WriteLine("File Name: {0}", file);
Console.WriteLine("File Size: {0} kb", fi.Length / 1024);
// Enable resource monitoring to track processor time and allocations.
AppDomain.MonitoringIsEnabled = true;
var A = CSparse.IO.MatrixMarketReader.ReadMatrix<double>(file);
Console.WriteLine("Matrix: {0} x {1}, {2} non-zeros, Memory: {3} kb", A.RowCount, A.ColumnCount, A.NonZerosCount, GetBytes(A) / 1024);
Console.WriteLine("Total Processor Time: {0:0.0} ms", AppDomain.CurrentDomain.MonitoringTotalProcessorTime.TotalMilliseconds);
Console.WriteLine("Allocated Memory: {0} kb", AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize / 1024);
Console.WriteLine("Peak Working Set: {0} kb", Process.GetCurrentProcess().PeakWorkingSet64 / 1024);
Console.WriteLine("Gen 0 collections: {0}", GC.CollectionCount(0));
The output is
$ ./benchmark.exe -b csparse ../ex11.mtx
Test CSparse MatrixMarketReader
File Name: ../ex11.mtx
File Size: 30707,7 kb
Matrix: 16614 x 16614, 1096948 non-zeros, Memory: 12919 kb
Total Processor Time: 1531,3 ms
Allocated Memory: 305571 kb
Peak Working Set: 62712 kb
Gen 0 collections: 34
Reading the matrix took about 1.5 seconds. Since we have no comparison, there’s no way to tell if this is good or not. Additionally, the runtime depends on machine parameters like CPU or hard disk type and will most likely be different for you.
What looks suspicious is the number of allocated bytes: with a file size of 30 MB and the matrix occupying 13 MB of memory, over 300 MB were allocated to read the matrix. In the following sections, we will try to reduce the number of allocated bytes, thus reducing gen 0 collections.
Note that our sparse matrix can be considered small. Solving scientific problems today often involves matrices with billions of nonzeros (more than a thousand times larger than our test matrix), so optimizing read performance, both for speed and memory consumption, is an important aspect.
Reading lines from a text file
When reading data from a text file in C#, you might find yourself writing code like this:
using var reader = new StreamReader(file);
string line;
while ((line = reader.ReadLine()) != null)
{
ProcessLine(line);
}
This will allocate a new string for each line in the text file. Let's see how the ReadLine method is implemented by looking at the .NET source code:
- Legacy .NET framework TextReader.cs
- Legacy .NET framework StreamReader.cs
- Modern dotnet TextReader.cs
- Modern dotnet StreamReader.cs
All implementations use a local StringBuilder to collect the characters of the current line (the modern StreamReader actually uses a ValueStringBuilder and stackalloc, which is more efficient). So the obvious approach to eliminate both the temporary storage and the returned string is to write an extension method for the TextReader class which takes a StringBuilder as an argument:
using System.IO;
using System.Text;
public static class TextReaderExtensions
{
/// <summary>
/// Reads a line of characters from the text reader and adds it to the <see cref="System.Text.StringBuilder"/> instance.
/// </summary>
/// <param name="reader">The <see cref="System.IO.TextReader"/> source.</param>
/// <param name="sb">The <see cref="System.Text.StringBuilder"/> target.</param>
/// <returns>Returns false, if no characters were added, otherwise true.</returns>
public static bool ReadLine(this TextReader reader, StringBuilder sb)
{
while (true)
{
int ch = reader.Read();
if (ch == -1) break;
if (ch == '\r' || ch == '\n')
{
if (ch == '\r' && reader.Peek() == '\n') reader.Read();
return true;
}
sb.Append((char)ch);
}
return sb.Length > 0;
}
}
Using this extension method, the above code reads
using var reader = new StreamReader(file);
var sb = new StringBuilder(1024);
while (reader.ReadLine(sb))
{
ProcessLine(sb);
sb.Clear();
}
Now we can read lines from a text file using a StringBuilder as a buffer without allocating any additional memory.
But how do we then process the data in those lines? Let's assume each line of our text file contains three items separated by whitespace: two integers followed by a floating point number, for example:
1 1 0.1
1 2 0.2
2 1 0.3
2 2 0.4
A naive approach to parse the data could look like
void ProcessLine(string line)
{
var items = line.Split();
if (items.Length != 3)
{
throw new Exception("Expected 3 items.");
}
int i = int.Parse(items[0]);
int j = int.Parse(items[1]);
var a = double.Parse(items[2]);
ProcessData(i, j, a);
}
Leaving aside the fact that the method expects a string and not a StringBuilder, the string.Split() method will allocate three additional temporary strings and an array object.
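If you want to verify such claims, modern .NET can report per-thread allocations directly; a quick sketch:
// Measure the bytes allocated by a single Split call
// (GC.GetAllocatedBytesForCurrentThread is available in
// .NET Core 3.0 and later; exact numbers vary by runtime).
long before = GC.GetAllocatedBytesForCurrentThread();

var items = "1 2 0.2".Split();

long after = GC.GetAllocatedBytesForCurrentThread();

Console.WriteLine("Allocated: {0} bytes", after - before);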
The StringTokenizer class
A tokenizer processes a given input string and returns a list of tokens. A token is just a subsection of the input string. In our case, the tokenizer will be rather stupid: it splits the string into sections, but has no notion of semantics (that is, the tokenizer will not check if a token at a given position is valid or not).
Since a token is just a subsection of the input string with no further meaning attached (just like with the string.Split() method), it is represented by the TextSpan structure:
public readonly struct TextSpan
{
/// <summary>
/// Gets the start position of the text span.
/// </summary>
public readonly int Start;
/// <summary>
/// Gets the end position of the text span.
/// </summary>
public readonly int End;
/// <summary>
/// Gets the length of the text span.
/// </summary>
public readonly int Length => End - Start;
public TextSpan(int start, int end)
{
Start = start;
End = end;
}
public bool IsEmpty() => Start == End;
}
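Note that a TextSpan stores positions only, not the text itself, so producing a token allocates nothing. If a token is actually needed as a string (for an error message, say), a small hypothetical helper can materialize it:
// Hypothetical helper (not part of the project): materialize a
// token as a string. This allocates, so use it sparingly.
static string GetToken(char[] buffer, TextSpan span)
{
    return new string(buffer, span.Start, span.Length);
}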
The tokenizer will then allow us to discover tokens by calling the Next() method:
public TextSpan Next()
{
position = strategy.Next(buffer, position, length, out TextSpan span);
return span;
}
To discover tokens, the ITokenizerStrategy interface must be implemented:
public interface ITokenizerStrategy
{
/// <summary>
/// Read the next token from the buffer.
/// </summary>
/// <param name="buffer">The character buffer.</param>
/// <param name="position">The current position in the buffer.</param>
/// <param name="length">The length of the current string (less or euqal buffer length).</param>
/// <param name="span">The <see cref="TextSpan"/>.</param>
/// <returns>The new position in the buffer (might differ from TextSpan.End).</returns>
int Next(char[] buffer, int position, int length, out TextSpan span);
}
The default implementation is the WhitespaceStrategy:
/// <summary>
/// Use whitespace as token delimiter.
/// </summary>
public class WhitespaceStrategy : ITokenizerStrategy
{
public int Next(char[] buffer, int position, int length, out TextSpan span)
{
int i = position;
// Skip leading whitespace.
while (i < length && char.IsWhiteSpace(buffer[i])) i++;
int start = i;
// Read the token up to the next whitespace.
while (i < length && !char.IsWhiteSpace(buffer[i])) i++;
span = new TextSpan(start, i);
return i;
}
}
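The StringTokenizer class itself is not listed in full in this article. Pieced together from the fragments shown here and the methods used later, a rough sketch might look like this (the project's actual implementation may differ):
public class StringTokenizer
{
    private readonly ITokenizerStrategy strategy;

    private char[] buffer;
    private int length;
    private int position;

    public StringTokenizer()
        : this(new WhitespaceStrategy())
    {
    }

    public StringTokenizer(ITokenizerStrategy strategy)
    {
        this.strategy = strategy;
        this.buffer = Array.Empty<char>();
    }

    public StringTokenizer(char[] buffer)
        : this(new WhitespaceStrategy())
    {
        Set(buffer, buffer.Length);
    }

    // Point the tokenizer at a (possibly reallocated) buffer
    // and reset the read position.
    public void Set(char[] buffer, int length)
    {
        this.buffer = buffer;
        this.length = length;
        this.position = 0;
    }

    // Sketch: assumes whitespace delimiters. True if any
    // non-whitespace characters remain after the current position.
    public bool HasNext()
    {
        for (int i = position; i < length; i++)
        {
            if (!char.IsWhiteSpace(buffer[i])) return true;
        }

        return false;
    }

    // Shown earlier: discover the next token.
    public TextSpan Next()
    {
        position = strategy.Next(buffer, position, length, out TextSpan span);
        return span;
    }
}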
To see how the StringTokenizer can be used, take a look at the unit tests. Here's an example:
[Test]
public void TestWhiteSpace()
{
var s = "1 2 3".ToCharArray();
var tokenizer = new StringTokenizer(s);
var span = tokenizer.Next();
Assert.That(span.Start, Is.EqualTo(0));
Assert.That(span.End, Is.EqualTo(1));
Assert.That(tokenizer.HasNext());
span = tokenizer.Next();
Assert.That(span.Start, Is.EqualTo(2));
Assert.That(span.End, Is.EqualTo(3));
Assert.That(tokenizer.HasNext());
span = tokenizer.Next();
Assert.That(span.Start, Is.EqualTo(4));
Assert.That(span.End, Is.EqualTo(5));
Assert.That(!tokenizer.HasNext());
span = tokenizer.Next();
Assert.That(span.IsEmpty());
}
Besides the default WhitespaceStrategy, there's a QuotedWhitespaceStrategy which allows reading quoted tokens (see unit tests), and a DelimiterStrategy which allows reading tokens that are separated by a delimiter, like a comma in CSV files (see unit tests).
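To illustrate how another strategy can be plugged in, here is a sketch of a simple delimiter-based strategy (hypothetical; the project's DelimiterStrategy may handle trimming, quoting and edge cases differently):
/// <summary>
/// Use a single character (e.g. a comma) as token delimiter. Sketch only.
/// </summary>
public class SimpleDelimiterStrategy : ITokenizerStrategy
{
    private readonly char delimiter;

    public SimpleDelimiterStrategy(char delimiter)
    {
        this.delimiter = delimiter;
    }

    public int Next(char[] buffer, int position, int length, out TextSpan span)
    {
        int start = position;
        int i = position;

        // Read until the next delimiter. Empty tokens are allowed,
        // as in CSV files with consecutive commas.
        while (i < length && buffer[i] != delimiter) i++;

        span = new TextSpan(start, i);

        // Skip the delimiter so the next call starts behind it.
        return i < length ? i + 1 : i;
    }
}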
Parsing floating point numbers
To convert the string representation of a number read by the StringTokenizer to its numeric equivalent, we can use the static methods of the Parse class, which allow converting Int32, Int64, Single (32-bit floating point) and Double (64-bit floating point) numbers. The following overloads are provided for double precision floating-point numbers:
public static bool Double(string s, out double result);
public static bool Double(string s, int start, int end, out double result);
public static bool Double(char[] buffer, int start, int end, out double result);
public static bool Double(ReadOnlySpan<char> buffer, out double result);
Until recent versions of .NET, there was no way to achieve the above with the built-in types. Luckily, since .NET 7 we have a Double.Parse(...) overload taking a ReadOnlySpan<char>, see https://learn.microsoft.com/en-us/dotnet/api/system.double.parse.
Let's test how our number parser performs compared to this method:
double.Parse (.NET 8): 0.3078 s
Parse.Double (custom): 0.1188 s
The performance gain is probably due to the built-in method allowing for more configuration via the IFormatProvider. Our parser only reads numbers formatted according to the CultureInfo.InvariantCulture, and that's good enough. See the unit tests for details.
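To sketch the idea behind such a parser: accumulate the digits manually, assuming invariant-culture format. The following ParseDouble is a simplified illustration, not the project's actual Parse.Double (which also handles special values and edge cases; note that naive digit accumulation can lose a little precision compared to double.Parse):
// Minimal sketch of an invariant-culture double parser: optional
// sign, integer and fractional digits, optional exponent.
// No NaN/Infinity or overflow handling.
static bool ParseDouble(char[] buffer, int start, int end, out double result)
{
    result = 0.0;
    int i = start;

    // Optional sign.
    double sign = 1.0;
    if (i < end && (buffer[i] == '-' || buffer[i] == '+'))
    {
        if (buffer[i] == '-') sign = -1.0;
        i++;
    }

    // Integer part.
    bool any = false;
    double value = 0.0;
    while (i < end && buffer[i] >= '0' && buffer[i] <= '9')
    {
        value = value * 10.0 + (buffer[i++] - '0');
        any = true;
    }

    // Fractional part.
    if (i < end && buffer[i] == '.')
    {
        i++;
        double scale = 0.1;
        while (i < end && buffer[i] >= '0' && buffer[i] <= '9')
        {
            value += (buffer[i++] - '0') * scale;
            scale *= 0.1;
            any = true;
        }
    }

    if (!any) return false;

    // Optional exponent.
    if (i < end && (buffer[i] == 'e' || buffer[i] == 'E'))
    {
        i++;
        int esign = 1;
        if (i < end && (buffer[i] == '-' || buffer[i] == '+'))
        {
            if (buffer[i] == '-') esign = -1;
            i++;
        }
        int exp = 0;
        bool eany = false;
        while (i < end && buffer[i] >= '0' && buffer[i] <= '9')
        {
            exp = exp * 10 + (buffer[i++] - '0');
            eany = true;
        }
        if (!eany) return false;
        value *= Math.Pow(10.0, esign * exp);
    }

    result = sign * value;

    // Succeed only if the whole token was consumed.
    return i == end;
}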
Now, to profit from our custom parser, the StringTokenizer provides some overloads of the Next(...) method:
public bool Next(out int value)
{
position = strategy.Next(buffer, position, length, out TextSpan span);
return Parse.Int32(buffer, span.Start, span.End, out value);
}
public bool Next(out double value)
{
position = strategy.Next(buffer, position, length, out TextSpan span);
return Parse.Double(buffer, span.Start, span.End, out value);
}
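Parsing one of the example lines from above then becomes a matter of three calls, without any intermediate strings:
var tokenizer = new StringTokenizer("1 2 0.2".ToCharArray());

tokenizer.Next(out int i);    // i = 1
tokenizer.Next(out int j);    // j = 2
tokenizer.Next(out double a); // a = 0.2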
The CharBuffer class
We now have everything in place to process the lines of our input file. Since the .NET StringBuilder
does not allow direct access to its internal buffer (besides the GetChunks method), a second buffer would be needed to copy the StringBuilder
s content and then pass that to the StringTokenizer
. Instead , the project provides a CharBuffer
class, which derives from a generic Buffer<T>
class with the following interface providing the same functionality as the StringBuilder
and allowing direct access to the buffer data:
public abstract class Buffer<T> where T : struct, IEquatable<T>
{
/// <summary>
/// Initializes a new instance of the <see cref="Buffer{T}"/> class.
/// </summary>
public Buffer();
/// <summary>
/// Initializes a new instance of the <see cref="Buffer{T}"/> class.
/// </summary>
public Buffer(int capacity);
/// <summary>
/// Initializes a new instance of the <see cref="Buffer{T}"/> class.
/// </summary>
public Buffer(int capacity, int maxCapacity);
/// <summary>
/// Initializes a new instance of the <see cref="Buffer{T}"/> class.
/// </summary>
public Buffer(T[] value);
/// <summary>
/// Initializes a new instance of the <see cref="Buffer{T}"/> class.
/// </summary>
public Buffer(T[] value, int capacity);
/// <summary>
/// Gets the length of this buffer.
/// </summary>
public int Length { get; }
/// <summary>
/// Gets or sets the maximum number of items that can be contained
/// in the memory allocated by this buffer.
/// </summary>
public int Capacity { get; set; }
/// <summary>
/// Gets the maximum capacity this buffer is allowed to have.
/// </summary>
public int MaxCapacity { get; }
/// <summary>
/// Gets the array buffer.
/// </summary>
public T[] Data { get; }
/// <summary>
/// Ensures that the capacity of this buffer is at least the specified value.
/// </summary>
public int EnsureCapacity(int capacity);
/// <summary>
/// Appends a copy of the specified values to this instance.
/// </summary>
public void Append(T[] value);
/// <summary>
/// Appends a copy of a specified data to this instance.
/// </summary>
public void Append(T[] value, int startIndex, int count);
/// <summary>
/// Appends a copy of a specified data to this instance.
/// </summary>
public void Append(ReadOnlySpan<T> value);
/// <summary>
/// Removes all data from the current <see cref="Buffer{T}" /> instance.
/// </summary>
public void Clear();
/// <summary>
/// Copies the data of this instance to the specified destination buffer.
/// </summary>
public void CopyTo(Buffer<T> destination);
/// <summary>
/// Copies the data of this instance to the specified destination array.
/// </summary>
public void CopyTo(T[] destination);
/// <summary>
/// Copies the data from a specified segment of this instance to a
/// specified segment of a destination array.
/// </summary>
public void CopyTo(int sourceIndex, T[] destination, int destinationIndex, int count);
/// <summary>
/// Determines whether the beginning of this buffer matches the specified value.
/// </summary>
public bool StartsWith(T value);
/// <summary>
/// Determines whether the beginning of this buffer matches the specified values.
/// </summary>
public bool StartsWith(T[] value);
/// <summary>
/// Determines whether the end of this buffer matches the specified values.
/// </summary>
public bool EndsWith(T[] value);
/// <summary>
/// Returns a value indicating whether a specified values occurs within this buffer.
/// </summary>
public bool Contains(T[] value);
/// <summary>
/// Reports the zero-based index of the first occurrence of the specified value in this buffer.
/// </summary>
public int IndexOf(T value);
/// <summary>
/// Reports the zero-based index of the first occurrence of the specified value in this buffer.
/// The search starts at a specified character position.
/// </summary>
public int IndexOf(T value, int startIndex);
/// <summary>
/// Reports the zero-based index of the first occurrence of the specified values in this buffer.
/// </summary>
public int IndexOf(T[] value);
/// <summary>
/// Reports the zero-based index of the first occurrence of the specified values in this buffer.
/// The search starts at a specified character position.
/// </summary>
public int IndexOf(T[] value, int startIndex);
/// <summary>
/// Returns the buffer content as a <see cref="ReadOnlySpan{T}"/>.
/// </summary>
public ReadOnlySpan<T> AsSpan();
}
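The final parser shown in the conclusion below calls reader.ReadLine(buffer) with a CharBuffer instead of a StringBuilder, so the project provides a corresponding extension method. A minimal sketch, analogous to the StringBuilder version above and using the span-based Append from the Buffer<T> interface:
public static class CharBufferReaderExtensions
{
    // Sketch only; the project's actual overload may differ.
    public static bool ReadLine(this TextReader reader, CharBuffer buffer)
    {
        // Unlike the StringBuilder loop earlier, the final parser
        // loop does not clear the buffer itself, so we do it here.
        buffer.Clear();

        // Single-character scratch span for the span-based Append.
        Span<char> one = stackalloc char[1];

        while (true)
        {
            int ch = reader.Read();

            if (ch == -1) break;

            if (ch == '\r' || ch == '\n')
            {
                if (ch == '\r' && reader.Peek() == '\n') reader.Read();

                return true;
            }

            one[0] = (char)ch;
            buffer.Append(one);
        }

        return buffer.Length > 0;
    }
}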
Conclusion
The final version of the file parser could now look something like this:
using var reader = new StreamReader(file);
var buffer = new CharBuffer(1024);
var tokenizer = new StringTokenizer();
while (reader.ReadLine(buffer))
{
tokenizer.Set(buffer.Data, buffer.Length);
tokenizer.Next(out int i);
tokenizer.Next(out int j);
tokenizer.Next(out double value);
// Process entry (i, j, value) ...
}
Note that inside the loop we update the tokenizer buffer by calling tokenizer.Set(buffer.Data, buffer.Length). This might not be necessary if you are sure that the initial size of the buffer is large enough (meaning no line read from the text file is longer than this initial size, so the buffer won't be expanded while reading the lines). Since in general this assumption might not hold and the internal char buffer might be reallocated, it's better to update the buffer of the StringTokenizer accordingly.
Let’s see how the CustomMatrixMarketReader performs:
$ ./benchmark.exe -b csparse ../ex11.mtx
Test CSparse MatrixMarketReader
File Name: ../ex11.mtx
File Size: 30707,7 kb
Matrix: 16614 x 16614, 1096948 non-zeros, Memory: 12919 kb
Total Processor Time: 1531,3 ms
Allocated Memory: 305571 kb
Peak Working Set: 62712 kb
Gen 0 collections: 34
$ ./benchmark.exe -b custom ../ex11.mtx
Test custom MatrixMarketReader
File Name: ../ex11.mtx
File Size: 30707 kb
Matrix: 16614 x 16614, 1096948 non-zeros, Memory: 12919 kb
Total Processor Time: 578,1 ms
Allocated Memory: 30308 kb
Peak Working Set: 53600 kb
Gen 0 collections: 1
We reduced memory allocations by a factor of 10, execution time by almost a factor of 3, and gen 0 collections from 34 to 1.
Please use the contact form or the GitLab issues section for comments.