Maximizing C# Regex Performance: A Comprehensive Guide

Understanding Regular Expressions

Regular expressions, commonly known as regex, serve as robust tools for identifying patterns within text. They empower developers to specify search patterns that can be utilized for finding, replacing, or manipulating certain segments of a string. Regex offers a succinct and adaptable approach to pinpoint specific patterns in textual data.

Throughout my career, I have leveraged regular expressions for a variety of tasks, including:

  • Matching patterns in user inputs
  • Scraping web data
  • Parsing logs and other data files
  • Recovering data in digital forensics

For those interested in diving deeper into regex, I recommend checking out these resources:

  • Basic examples of C# regular expressions
  • Options available in C# Regex

Why Benchmark Regex Performance in C#?

Beyond my routine efforts to profile and optimize applications, I enjoy conducting benchmarks out of sheer curiosity. This is especially true when I discover multiple methods that seem to achieve the same outcome. It prompts me to explore the genuine differences, beyond just syntax and usability. A recent experience with collection initializers demonstrated surprising performance variations, reinforcing the importance of curiosity.

C# regular expressions come in various forms:

  • Static method calls
  • Compiled flag usage
  • Source generators

While the compiled flag is expected to enhance performance, it raises questions about the overhead associated with static method calls, known for their convenience. Additionally, the relatively new source generators for regex in C# piqued my interest, especially after encountering some excellent Microsoft documentation on compiled regex and source generation.
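To make the three forms concrete, here is a minimal sketch that puts them side by side. The `\d+` pattern, the sample string, and the `RegexForms` class name are placeholders chosen for illustration, not part of the benchmarks; the `[GeneratedRegex]` attribute assumes a project targeting .NET 7 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input, used only for this illustration.
const string sample = "a1 b22 c333";

// 1. Static method call: convenient, and Regex keeps an internal cache
//    of recently used patterns behind the scenes.
int viaStatic = Regex.Matches(sample, RegexForms.Pattern).Count;

// 2. Instance constructed with the Compiled flag: the pattern is compiled
//    to IL at construction time, which costs more up front.
Regex compiled = new(RegexForms.Pattern, RegexOptions.Compiled);
int viaCompiled = compiled.Matches(sample).Count;

// 3. Source generator: the matching code is emitted at build time.
int viaGenerated = RegexForms.Generated().Matches(sample).Count;

Console.WriteLine($"{viaStatic} {viaCompiled} {viaGenerated}");

public static partial class RegexForms
{
    // Placeholder pattern: one or more digits.
    public const string Pattern = @"\d+";

    [GeneratedRegex(Pattern)]
    public static partial Regex Generated();
}
```

All three calls should report the same number of matches; what differs is when the pattern is parsed and how the matching code is produced.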

My focus is on the performance differences when retrieving all matches from a body of text. Since these approaches look nearly identical at the call site, I want to find out which one actually performs best.

Setting Up C# Regex Performance Benchmarks

While you're likely eager to dive into the details, I encourage you to pause and grasp the benchmarks' context first. My intention when sharing benchmarking insights isn't to persuade you to alter your coding practices, but rather to foster curiosity about your coding choices. Although some pitfalls may be evident, the essence lies in nurturing inquisitiveness.

For this benchmarking endeavor, I'll utilize BenchmarkDotNet, a useful tool for running performance tests. If you're interested in creating your own benchmarks, I have numerous articles on BenchmarkDotNet available for your perusal.

The Test Data for Benchmarking

Our goal here isn't to gauge absolute performance (although you might be interested in that when profiling your own applications) but to evaluate the relative performance of the various Regex options. Other factors to consider include:

  • The Regex pattern we employ may affect performance across different methods.
  • The data source we match against might influence the outcomes.

I call out these variables because their effect is uncertain. I don't expect the source data to matter much, but compiled and source-generated regular expressions may apply different heuristics depending on the input. Likewise, the chosen Regex pattern might either amplify or mask any performance advantage, and I want to be upfront about that.

To keep the comparison realistic, I decided to search for a pattern in authentic text: words ending in "ing" or "ed". I sourced the text from Project Gutenberg, using an e-book containing over 2200 lines of English text, which provides ample opportunities for pattern matches.
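As a quick illustration of what such a pattern picks up, here is a small sketch. It assumes the pattern is written as `\b\w*(ing|ed)\b`, and the sample sentence is made up for this example rather than taken from the e-book.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed form of the pattern: a whole word whose last letters are "ing" or "ed".
const string pattern = @"\b\w*(ing|ed)\b";

// Hypothetical sample sentence, not taken from the benchmark data.
const string line = "He was reading in bed when the king arrived.";

// MatchCollection can be enumerated as a sequence of Match objects.
string[] words = Regex.Matches(line, pattern)
    .Select(match => match.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", words));
// prints "reading, bed, king, arrived"
```

Note that the pattern matches on spelling alone, so words such as "bed" and "king" count as matches even though they are not verb forms. That is fine for a benchmark, where all that matters is having plenty of matches to find.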

The C# Regex Benchmark Code

These benchmarks aren't particularly unique in comparison to previous ones I've conducted, but here are some highlights to note:

  • I'm utilizing [Params] to load the source file, enabling testing across different datasets.
  • The source data is loaded during the global setup.
  • Some Regex instances are cached for specific benchmarks, performed during the global setup.

The benchmark code is available on GitHub and is reproduced below:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.IO;
using System.Reflection;
using System.Text.RegularExpressions;

BenchmarkRunner.Run(
    Assembly.GetExecutingAssembly(),
    args: args);

[MemoryDiagnoser]
[MediumRunJob]
public partial class RegexBenchmarks
{
    // Matches whole words ending in "ing" or "ed".
    private const string RegexPattern = @"\b\w*(ing|ed)\b";

    private string? _sourceText;
    private Regex? _regex;
    private Regex? _regexCompiled;
    private Regex? _generatedRegex;
    private Regex? _generatedRegexCompiled;

    [GeneratedRegex(RegexPattern, RegexOptions.None, "en-US")]
    private static partial Regex GetGeneratedRegex();

    // Note: the source generator ignores RegexOptions.Compiled.
    [GeneratedRegex(RegexPattern, RegexOptions.Compiled, "en-US")]
    private static partial Regex GetGeneratedRegexCompiled();

    [Params("pg73346.txt")]
    public string? SourceFileName { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        // Load the text once and cache the Regex instances used by the "Cached" benchmarks.
        _sourceText = File.ReadAllText(SourceFileName!);
        _regex = new(RegexPattern);
        _regexCompiled = new(RegexPattern, RegexOptions.Compiled);
        _generatedRegex = GetGeneratedRegex();
        _generatedRegexCompiled = GetGeneratedRegexCompiled();
    }

    [Benchmark(Baseline = true)]
    public MatchCollection Static()
    {
        return Regex.Matches(_sourceText!, RegexPattern);
    }

    [Benchmark]
    public MatchCollection New()
    {
        Regex regex = new(RegexPattern);
        return regex.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection New_Compiled()
    {
        Regex regex = new(RegexPattern, RegexOptions.Compiled);
        return regex.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Cached()
    {
        return _regex!.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Cached_Compiled()
    {
        return _regexCompiled!.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated()
    {
        return GetGeneratedRegex().Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated_Cached()
    {
        return _generatedRegex!.Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated_Compiled()
    {
        return GetGeneratedRegexCompiled().Matches(_sourceText!);
    }

    [Benchmark]
    public MatchCollection Generated_Cached_Compiled()
    {
        return _generatedRegexCompiled!.Matches(_sourceText!);
    }
}

For our benchmarks, we will consider the static method on the Regex class as the baseline, providing a reference point for evaluating performance results.

C# Regex Performance Results

Now that we've reviewed the BenchmarkDotNet code, let's explore the results:

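NOTE: These results may be misleading due to how MatchCollection operates. Regex.Matches populates its MatchCollection lazily, so a benchmark that returns the collection without enumerating it may not measure the full cost of matching. I recommend reading this article for further clarification.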
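The caveat about MatchCollection can be sketched as follows. According to the .NET documentation, `Matches` uses lazy evaluation, and accessing a member such as `Count` forces the collection to be populated. The input text and the `CountAll` helper here are hypothetical and only meant to show one way a benchmark could force the full scan.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input: 1000 repetitions of a phrase containing two matching words.
string text = string.Concat(Enumerable.Repeat("walking and talked ", 1000));
Regex regex = new(@"\b\w*(ing|ed)\b");

// Matches hands back the collection without scanning the whole input yet.
MatchCollection lazy = regex.Matches(text);

// Reading Count forces every match to be found.
int total = lazy.Count;
Console.WriteLine(total); // prints 2000

// A benchmark method could return this count instead of the MatchCollection,
// so the matching work is included in the measurement.
static int CountAll(Regex r, string input) => r.Matches(input).Count;

Console.WriteLine(CountAll(regex, text) == total); // prints True
```

Returning an `int` from each benchmark method in this way would be one option for measuring the complete matching work; it is a suggestion, not what the benchmarks above do.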

The results show that creating a new Regex instance for every match attempt is roughly 100 times slower than using the static method. That is a significant cost, so avoid this practice when performance is critical. Adding the compiled flag makes matters worse: it is nearly 1000 times slower than the static method, or about ten times slower than creating an uncompiled instance each time. These are practices to steer clear of.

The cached variations show that we can reverse this trend: reusing a single Regex instance runs almost 30% faster than the static method. However, it's essential to temper expectations, as the gain may scale differently across other datasets and patterns.
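In application code, the cached approach usually amounts to holding the Regex in a static readonly field. The sketch below assumes the `\b\w*(ing|ed)\b` pattern; the `WordEndings` class and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(WordEndings.Count("jumped and running"));     // prints 2
Console.WriteLine(WordEndings.CountSlow("jumped and running")); // prints 2

public static class WordEndings
{
    private const string Pattern = @"\b\w*(ing|ed)\b";

    // Constructed (and compiled) once, then reused by every call.
    private static readonly Regex Cached = new(Pattern, RegexOptions.Compiled);

    public static int Count(string text) => Cached.Matches(text).Count;

    // The practice to avoid: paying the construction and compilation cost on every call.
    public static int CountSlow(string text) =>
        new Regex(Pattern, RegexOptions.Compiled).Matches(text).Count;
}
```

Both methods return the same result; the difference is that `CountSlow` repeats the expensive setup on every call, which corresponds to the New_Compiled benchmark.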

Source-generated C# regular expressions also outperform the static method, landing roughly on par with the cached variants. Although source-generated regexes are cached for you and should incur no extra overhead, two of the benchmark variations suggest that holding your own reference may perform marginally better. Given how close the numbers are, though, these differences could simply be noise.

Wrapping Up C# Regex Performance

To optimize C# Regex performance, keep these key takeaways in mind:

  • Avoid constructing a new Regex instance immediately before each use.
  • If you do construct instances on the fly, steer clear of the compiled flag, as it can severely hinder performance.

Overall, the Regex class using static methods remains a safe option, but the most significant benefits arise from compiling and caching your regex. Microsoft suggests that using C# Regex source generators can further enhance performance in numerous situations.

