Regular Expressions and C#, .NET
This article explores regular expressions in the context of C# and .NET: the regular expression support in .NET, meta-characters and their descriptions, character escapes, substitutions, character classes, regular expression options, and atomic zero-width assertions.
What are regular expressions?
Regular expressions are patterns that can be used to match strings. You can think of a regular expression as a formula for matching strings that follow some pattern. Regular expressions can also be considered a language designed for manipulating text. With them, you can ask questions such as:
- “Does the given string match the pattern?”, or
- “Does the given string contain characters that match a pattern?”.
Regular expressions can be used to find one or more occurrences of a pattern of characters within a string. You can then replace each occurrence with other characters or perform other tasks based on the results. The patterns can be simple or very complex. A regular expression generally consists of two types of characters:
1) Literal or normal characters, such as “abcd123”
2) Special characters that have a special meaning, such as “.”, “$” or “^”
Because of these special characters, regular expressions are a very powerful means of manipulating strings and text.
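The two questions above, and the find and replace tasks, can be sketched with the static methods of the Regex class in System.Text.RegularExpressions. This is a minimal sketch, assuming a C# 9 or later project where top-level statements are allowed; the sample input string is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input, chosen only for illustration.
string input = "Order 66 shipped on day 12";

// "Does the given string contain characters that match a pattern?"
bool containsDigits = Regex.IsMatch(input, "[0-9]+");

// "Does the given string match the pattern?"
// Anchoring with ^ and $ requires the whole string to be digits.
bool isAllDigits = Regex.IsMatch(input, "^[0-9]+$");

// Find every occurrence of the pattern within the string.
MatchCollection numbers = Regex.Matches(input, "[0-9]+");

// Replace each occurrence with some other characters.
string masked = Regex.Replace(input, "[0-9]+", "#");

Console.WriteLine(containsDigits);  // True
Console.WriteLine(isAllDigits);     // False
Console.WriteLine(numbers.Count);   // 2
Console.WriteLine(masked);          // Order # shipped on day #
```

Here “[0-9]” and “+” are special characters, while everything in the input is matched as literal text.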
.NET support for Regular Expressions:
.NET provides an extensive set of regular expression language elements that you can use to create, modify or compare strings. They can be classified as follows:
a) Character Escapes
b) Substitutions
c) Character Classes
d) Regular Expression Options
e) Atomic Zero-Width Assertions
f) Quantifiers
g) Grouping Constructs
h) Backreference Constructs
i) Alternation Constructs
j) Miscellaneous Constructs
Meta-characters and their Description
. | Matches any single character. For example, the regular expression s.t matches the strings sat and sit, but not sight.
$ | Matches the end of a line. For instance, the regular expression reason$ matches the end of the string "He has a reason" but not the string "He has his reasons". In .NET, $ anchors to the end of the whole input unless RegexOptions.Multiline is set.
^ | Matches the beginning of a line. For instance, the regular expression ^Where matches the beginning of the string "Where is my cap" but does not match "Do you know Where it is". In .NET, ^ anchors to the start of the whole input unless RegexOptions.Multiline is set.
* | Matches zero or more occurrences of the character immediately preceding. For example, the regular expression .* matches any number of any characters.
\ | The escape or quoting character. The character after it is treated as an ordinary character. For example, \^ matches the caret character (^) rather than the beginning of a line. Similarly, the expression \. matches the “.” character.
[ ] | Matches any one of the characters between the brackets. Ranges of characters can be specified by using a hyphen. To match any character except those in the range (the complement range), use the caret as the first character after the opening bracket.
\< \> | Matches the beginning (\<) or end (\>) of a word. For example, \<the matches "the" in the string "for the older" but does not match "the" in "rather". These two anchors come from tools such as grep and vi; .NET does not support them and uses \b for a word boundary instead, as in \bthe.
( ) | Treats the expression between ( and ) as a group and saves the characters matched by the expression into temporary holding areas. The first nine saved matches can be referenced as \1 through \9; .NET also accepts higher group numbers and named groups.
| | Combines two alternatives with a logical OR. For example, (him|her) matches the line "it belongs to him" and the line "it belongs to her" but does not match the line "it belongs to them."
+ | Matches one or more occurrences of the character or regular expression immediately preceding. For example, the regular expression 9+ matches 9, 99, 999.
? | Matches 0 or 1 occurrence of the character or regular expression immediately preceding.
{i} {i,j} | Matches a specific number of instances, or a number of instances within a range, of the preceding character. For example, the expression [0-9]{4,6} matches any sequence of 4, 5, or 6 digits.
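The examples from the meta-character descriptions can be checked directly with Regex.IsMatch and Regex.Match. The following is a minimal sketch, assuming C# 9 or later top-level statements; the variable names and the extra sample strings ("id 123456789", "x999y", "hello") are illustrative additions. Patterns are written as verbatim strings (@"...") so that C# passes the backslash through to the regular expression parser unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

// . matches any single character
bool sat = Regex.IsMatch("sat", "s.t");                                   // True
bool sight = Regex.IsMatch("sight", "s.t");                               // False

// $ and ^ anchor the pattern to the end and the beginning
bool endsWithReason = Regex.IsMatch("He has a reason", "reason$");        // True
bool endsWithReasons = Regex.IsMatch("He has his reasons", "reason$");    // False
bool startsWithWhere = Regex.IsMatch("Where is my cap", "^Where");        // True
bool whereInMiddle = Regex.IsMatch("Do you know Where it is", "^Where");  // False

// | chooses between alternatives inside a group
bool her = Regex.IsMatch("it belongs to her", "(him|her)");               // True
bool them = Regex.IsMatch("it belongs to them", "(him|her)");             // False

// Quantifiers: + and {i,j} take as many characters as they are allowed to
string nines = Regex.Match("x999y", "9+").Value;                          // "999"
string digits = Regex.Match("id 123456789", "[0-9]{4,6}").Value;          // "123456"

// \ turns a special character into a literal one
bool literalCaret = Regex.IsMatch("2^8", @"\^");                          // True

// ( ) saves what it matched; \1 refers back to it (here: a doubled letter)
bool doubledLetter = Regex.IsMatch("hello", @"(.)\1");                    // True

// \b is the .NET word-boundary anchor
bool theAsWord = Regex.IsMatch("for the older", @"\bthe");                // True
bool theInRather = Regex.IsMatch("rather", @"\bthe");                     // False

Console.WriteLine($"{nines} {digits} {doubledLetter}");
```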
Character Escapes
The escape character (a single backslash) signals to the regular expression parser that the character following the backslash is not an operator. The backslash also introduces the following escapes for characters that are hard to type directly.
\b | Matches a backspace when used inside a [ ] character class; elsewhere in a pattern, \b is a word boundary.
\t | Matches a tab.
\r | Matches a carriage return.
\v | Matches a vertical tab.
\f | Matches a form feed.
\n | Matches a new line.
\e | Matches an escape.
\040 | Matches an ASCII character as octal (up to three digits); without a leading zero, the digits may be interpreted as a backreference instead.
\x20 | Matches an ASCII character using hexadecimal representation (exactly two digits).
\cC | Matches an ASCII control character; for example, \cC is control-C.
\u0020 | Matches a Unicode character using hexadecimal representation (exactly four digits).
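A few of these escapes can be tried out as follows. This is a minimal sketch, assuming C# 9 or later top-level statements; the sample strings are invented for illustration. Regex.Escape, which is not covered above, is a .NET helper that adds the backslashes needed to treat a whole string as literal text.

```csharp
using System;
using System.Text.RegularExpressions;

// A C# string containing a real tab, carriage return and new line.
string text = "name\tvalue\r\n";

// Verbatim strings (@"...") hand the backslash to the regex parser as-is.
bool hasTab = Regex.IsMatch(text, @"\t");                 // True
bool hasCrLf = Regex.IsMatch(text, @"\r\n");              // True

// Three ways to write the space character (decimal 32).
bool spaceOctal = Regex.IsMatch("a b", @"a\040b");        // True
bool spaceHex = Regex.IsMatch("a b", @"a\x20b");          // True
bool spaceUnicode = Regex.IsMatch("a b", @"a\u0020b");    // True

// \cC is control-C, which is character code 3.
bool ctrlC = Regex.IsMatch("\u0003", @"\cC");             // True

// An escaped dot matches only a literal dot; a bare dot matches anything.
bool bareDot = Regex.IsMatch("3x14", "3.14");             // True
bool escapedDot = Regex.IsMatch("3x14", @"3\.14");        // False

// Regex.Escape inserts the backslashes for you.
string escaped = Regex.Escape("1+1=2?");                  // 1\+1=2\?

Console.WriteLine(escaped);
```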
Substitutions:
Substitutions are the special constructs used in replacement patterns. They are allowed only within replacement patterns.
Character | Description
$number | Substitutes the last substring matched by group number number (decimal).
${name} | Substitutes the last substring matched by a (?<name> ) group.
$$ | Substitutes a single "$" literal.
$& | Substitutes a copy of the entire match itself.
$` | Substitutes all the text of the input string before the match.
$' | Substitutes all the text of the input string after the match.
$+ | Substitutes the last group captured.
$_ | Substitutes the entire input string.
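Substitutions are used in the replacement argument of Regex.Replace. The following is a minimal sketch, assuming C# 9 or later top-level statements; the input strings and group names (year, month, day) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// ${name}: reorder the parts of a date using named groups.
string datePattern = "(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})";
string reordered = Regex.Replace("2024-01-15", datePattern, "${day}/${month}/${year}");

// $number: swap two words using numbered groups.
string swapped = Regex.Replace("hello world", "([a-z]+) ([a-z]+)", "$2 $1");

// $&: wrap each whole match in brackets.
string bracketed = Regex.Replace("a1b22", "[0-9]+", "[$&]");

// $$ followed by $&: a literal dollar sign in front of the match.
string priced = Regex.Replace("cost: 25", "[0-9]+", "$$$&");

// $` and $': the input text before and after the match.
string context = Regex.Replace("abc", "b", "[$`|$']");

Console.WriteLine(reordered);  // 15/01/2024
Console.WriteLine(swapped);    // world hello
Console.WriteLine(bracketed);  // a[1]b[22]
Console.WriteLine(priced);     // cost: $25
Console.WriteLine(context);    // a[a|c]c
```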
Character Classes
A character class is a set of characters that will find a match if any one of the characters included in the set matches.
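The definition above can be sketched in code: a set such as [aeiou] matches one character at a time, and it succeeds whenever that character is any member of the set. This is a minimal sketch, assuming C# 9 or later top-level statements; the sample words are illustrative, and Lu is the Unicode category name for uppercase letters.

```csharp
using System;
using System.Text.RegularExpressions;

// A positive set matches if any one of its characters is present.
bool seaHasVowel = Regex.IsMatch("sea", "[aeiou]");                      // True
bool skyHasVowel = Regex.IsMatch("sky", "[aeiou]");                      // False

// Removing the set, or its complement, from a word.
string withoutVowels = Regex.Replace("regular", "[aeiou]", "");          // "rglr"
string onlyVowels = Regex.Replace("regular", "[^aeiou]", "");            // "eua"

// Ranges: the whole string must consist of hexadecimal digits.
bool isHex = Regex.IsMatch("1aF", "^[0-9a-fA-F]+$");                     // True
bool isNotHex = Regex.IsMatch("1aG", "^[0-9a-fA-F]+$");                  // False

// The period skips the new line unless the Singleline option is set.
bool dotDefault = Regex.IsMatch("a\nb", "a.b");                          // False
bool dotSingleline = Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline); // True

// Named class: count the uppercase letters.
int upperCount = Regex.Matches("Hello World", @"\p{Lu}").Count;          // 2

// Non-word characters replaced by underscores.
string underscored = Regex.Replace("a-b c", @"\W", "_");                 // "a_b_c"

Console.WriteLine($"{withoutVowels} {onlyVowels} {upperCount} {underscored}");
```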
Character class | Description
. | Matches any character except \n. If modified by the Singleline option, a period character matches any character.
[aeiou] | Matches any single character included in the specified set of characters.
[^aeiou] | Matches any single character not in the specified set of characters.
[0-9a-fA-F] | Use of a hyphen (-) allows specification of contiguous character ranges.
\p{name} | Matches any character in the named character class specified by {name}.
\P{name} | Matches text not included in groups and block ranges specified in {name}.
\w | Matches any word character.
\W | Matches any non-word character.