regexintro

Package: WA2L/WinTools 1.2.08
Section: File Formats (4)
Updated: 15 July 2018

NAME

regexintro - introduction to regular expression usage

SYNOPSIS

regexintro, regex, regexp

AVAILABILITY

WA2L/WinTools

DESCRIPTION

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters.

Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Regular expressions are heavily used by the commands awk(1), egrep(1), gawk(1), grep(1), and sed(1).

BASIC CONCEPTS

A regular expression, often called a pattern, is an expression that describes a set of strings. Patterns are usually used to give a concise description of a set without having to list all of its elements.

For example, the set containing the three strings Handel , Händel , and Haendel can be described by the pattern H(ä|ae?)ndel (or alternatively, it is said that the pattern matches each of the three strings).

In most formalisms, if any regex matches a particular set, then infinitely many such expressions exist. Most formalisms provide the following operations to construct regular expressions:

Alternation

A vertical bar separates alternatives. For example, gray|grey can match gray or grey .

Grouping

Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of gray and grey .
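
The effect of grouping on the scope of alternation can be sketched as follows. This is only an illustration that assumes Python's re module as the regular expression processor (it is not one of the tools described in this manpage); the helper accepted and the word list are hypothetical.

```python
import re

words = ["gray", "grey", "gra", "ey", "graey"]

def accepted(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(accepted(r"gray|grey"))  # alternation between two whole words
print(accepted(r"gr(a|e)y"))   # the group limits the alternation to one letter
print(accepted(r"gra|ey"))     # without the group, | splits the whole pattern
```

The first two calls return the same words, while the third accepts only gra and ey , which shows why the parentheses matter.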

Quantification

A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are ? , * , and + .

?: The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both color and colour .
*: The asterisk indicates there are zero or more of the preceding element. For example, ab*c matches ac , abc , abbc , abbbc , and so on.
+: The plus sign indicates that there is one or more of the preceding element. For example, ab+c matches abc , abbc , abbbc , and so on, but not ac.
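
The three quantifiers can be tried out with a short sketch. It assumes Python's re module, which writes these quantifiers the same way; the helper accepted is a hypothetical name used only for this example.

```python
import re

def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

candidates = ["ac", "abc", "abbc", "abbbc"]

zero_or_one = accepted(r"colou?r", ["color", "colour", "colouur"])
zero_or_more = accepted(r"ab*c", candidates)
one_or_more = accepted(r"ab+c", candidates)

print(zero_or_one)   # ['color', 'colour']
print(zero_or_more)  # ['ac', 'abc', 'abbc', 'abbbc']
print(one_or_more)   # ['abc', 'abbc', 'abbbc']
```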

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations + , - , and * . For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel .
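
As a rough check of that equivalence, the three patterns can be run against a small list of candidate strings. The sketch assumes Python's re module, and the candidate list is made up for illustration; agreeing on a finite sample is evidence, not a proof, that the patterns describe the same set.

```python
import re

patterns = [r"H(ä|ae?)ndel", r"H(ae?|ä)ndel", r"H(a|ae|ä)ndel"]
candidates = ["Handel", "Händel", "Haendel", "Hndel", "Haaendel", "Hendel"]

# For each pattern, collect the candidates it matches in full.
results = [
    {c for c in candidates if re.fullmatch(p, c)}
    for p in patterns
]

for pattern, matched in zip(patterns, results):
    print(pattern, sorted(matched))
```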

The precise syntax for regular expressions varies among tools and with context; more detail is given in the Syntax section.

SYNTAX

POSIX BASIC REGULAR EXPRESSIONS

Traditional Unix regular expression syntax followed common conventions but often differed from tool to tool.

The IEEE POSIX Basic Regular Expressions (BRE) standard was released alongside an alternative flavor called Extended Regular Expressions (ERE). BRE was designed mostly for backward compatibility with the traditional syntax, but it provided a common standard that has since been adopted as the default syntax of many Unix regular expression tools, though often with some variation or additional features.

Many such tools also provide support for ERE syntax with command line arguments.

In the BRE syntax, most characters are treated as literals - they match only themselves (e.g., a matches a ). The exceptions, listed below, are called meta characters or meta sequences.

.

Matches any single character. Many applications exclude newlines; exactly which characters count as newlines depends on the flavor, character encoding, and platform, but it is safe to assume that the line feed character is among them. Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches abc , etc., but [a.c] matches only a , . , or c .
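
The difference between a bare dot and a dot inside brackets can be sketched as follows, assuming Python's re module. Python is one of the applications that, by default, excludes the line feed from what the dot matches.

```python
import re

samples = ["abc", "a.c", "a-c", "ac"]
# Outside brackets the dot stands for any single character.
dot_any = [s for s in samples if re.fullmatch(r"a.c", s)]

chars = ["a", ".", "c", "b"]
# Inside brackets the dot is an ordinary character.
dot_literal = [ch for ch in chars if re.fullmatch(r"[a.c]", ch)]

# In Python's default mode the dot does not match a line feed.
newline_match = re.fullmatch(r"a.c", "a\nc")

print(dot_any)        # ['abc', 'a.c', 'a-c']
print(dot_literal)    # ['a', '.', 'c']
print(newline_match)  # None
```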

[ ]

A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches a , b , or c . [a-z] specifies a range which matches any lowercase letter from a to z . These forms can be mixed: [abcx-z] matches a , b , c , x , y , and z, as does [a-cx-z] .

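The - character is treated as a literal character if it is the first or the last character within the brackets: [-abc] or [abc-] . Some tools also accept a backslash escape, as in [a\-bc] , but POSIX bracket expressions treat a backslash as a literal character, so that form is not portable.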

[^ ]

Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than a , b , or c . [^a-z] matches any single character that is not a lowercase letter from a to z . As above, literal characters and ranges can be mixed.
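
Bracket expressions and their negated form can be explored by listing which characters of a small alphabet they accept. This sketch assumes Python's re module; the helper members and the alphabet (lowercase letters plus the hyphen) are hypothetical choices for the example.

```python
import re
import string

alphabet = string.ascii_lowercase + "-"

def members(bracket):
    """Return the characters of `alphabet` matched by a bracket expression."""
    return "".join(ch for ch in alphabet if re.fullmatch(bracket, ch))

print(members(r"[abcx-z]"))  # abcxyz
print(members(r"[a-cx-z]"))  # abcxyz
print(members(r"[abc-]"))    # abc-  (trailing - is literal)
print(members(r"[-abc]"))    # abc-  (leading - is literal)
print(members(r"[^a-w]"))    # xyz-  (everything except a to w)
```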

^

Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

$

Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
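
The two anchors can be sketched with Python's re module, which is an assumption of this example. By default Python treats the input as one string; its re.MULTILINE flag approximates the behavior of line-based tools. The sample text is made up.

```python
import re

text = "hat trick\nthat cat\ncat"

# Default mode: ^ and $ refer to the start and end of the whole string.
start_of_string = re.findall(r"^[hc]at", text)
end_of_string = re.findall(r"[hc]at$", text)

# MULTILINE mode: ^ and $ also match at the start and end of every line.
start_of_line = re.findall(r"^[hc]at", text, re.MULTILINE)
end_of_line = re.findall(r"[hc]at$", text, re.MULTILINE)

# $ also matches just before a newline that ends the string.
before_final_newline = re.findall(r"cat$", "cat\n")

print(start_of_string, end_of_string)  # ['hat'] ['cat']
print(start_of_line, end_of_line)      # ['hat', 'cat'] ['cat', 'cat']
print(before_final_newline)            # ['cat']
```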

*

Matches the preceding element zero or more times. For example, ab*c matches ac , abc , abbbc , etc. [xyz]* matches the empty string, x , y , z , zx , zyx , xyzzy , and so on. \(ab\)* matches the empty string, ab , abab , ababab , and so on.
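
The following sketch applies the star to a bracket expression and to a group. It assumes Python's re module, where a group is written ( ) without backslashes, unlike the BRE form \( \) ; the helper accepted is a hypothetical name.

```python
import re

def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Star after a bracket expression: any run of x, y, z, including the empty run.
bracket_star = accepted(r"[xyz]*", ["", "x", "zyx", "xyzzy", "xyw"])

# Star after a group: any number of repetitions of the whole group.
group_star = accepted(r"(ab)*", ["", "ab", "abab", "aba"])

print(bracket_star)  # ['', 'x', 'zyx', 'xyzzy']
print(group_star)    # ['', 'ab', 'abab']
```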

\{m,n\}

Matches the preceding element at least m and not more than n times. For example, a\{3,5\} matches only aaa , aaaa , and aaaaa . A few older regular expression implementations do not support this construct, so it should be avoided where compatibility with them matters.
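
A small sketch of the interval quantifier follows. It assumes Python's re module, which spells the interval {3,5} without backslashes, so the pattern below corresponds to the BRE a\{3,5\} rather than reproducing it literally.

```python
import re

# Python spelling of the BRE interval a\{3,5\}.
interval = re.compile(r"a{3,5}")

# Try runs of 0 to 7 letters and keep the lengths that match in full.
matching_lengths = [n for n in range(8) if interval.fullmatch("a" * n)]

print(matching_lengths)  # [3, 4, 5]
```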

POSIX EXTENDED REGULAR EXPRESSIONS

In the POSIX Extended Regular Expression (ERE) syntax, the meaning of a backslash is reversed for some characters: the grouping and interval operators are written ( ) and { } without backslashes, and a backslash before a meta character causes it to be treated as a literal character. Additionally, support is removed for \n back references and the following meta characters are added:

?

Matches the preceding element zero or one time. For example, ba? matches b or ba .

+

Matches the preceding element one or more times. For example, ba+ matches ba , baa , baaa , and so on.

|

The choice (aka alternation or set union) operator matches either the expression before or the expression after the operator. For example, abc|def matches abc or def .
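
The ERE backslash rule described at the start of this section can be sketched with Python's re module, which is an assumption of this example but follows the same convention for these characters: unescaped they are operators, escaped they are literals.

```python
import re

# Unescaped + is the one-or-more operator.
operator_hits = [s for s in ["b", "ba", "baa"] if re.fullmatch(r"ba+", s)]

# Escaped \+ is a literal plus sign.
literal_hits = [s for s in ["ba+", "baa", "ba"] if re.fullmatch(r"ba\+", s)]

# Escaped parentheses and bar match those characters themselves.
literal_group = re.fullmatch(r"\(abc\|def\)", "(abc|def)") is not None

print(operator_hits)  # ['ba', 'baa']
print(literal_hits)   # ['ba+']
print(literal_group)  # True
```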

POSIX CHARACTER CLASSES

Since many ranges of characters depend on the chosen locale setting (e.g., in some settings letters are organized as abc...zABC...Z, while in others as aAbBcC...zZ), the POSIX standard defines some classes or categories of characters, as listed below. These constructs are expected to be less portable than expressions written with the more basic constructs above. For compatibility reasons, it is therefore recommended to avoid them.

[:alnum:]: Alphanumeric characters.
[:alpha:]: Alphabetic characters.
[:blank:]: Space and tab.
[:cntrl:]: Control characters.
[:digit:]: Digits.
[:graph:]: Visible characters.
[:lower:]: Lowercase letters.
[:print:]: Visible characters and spaces.
[:punct:]: Punctuation characters.
[:space:]: White-space characters.
[:upper:]: Uppercase letters.
[:xdigit:]: Hexadecimal digits.

POSIX character classes can only be used within bracket expressions. For example, [[:upper:]ab] matches the uppercase letters and lowercase a and b .
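
Python's re module does not understand the [:name:] notation, so the sketch below translates a few classes into explicit ranges before matching. The table ASCII_CLASSES and the function expand_posix_classes are hypothetical helpers, the table covers only some of the classes, and the ranges assume the plain ASCII (C/POSIX) locale; other locales may include more characters.

```python
import re

# ASCII ranges for some POSIX classes, assuming the C/POSIX locale.
ASCII_CLASSES = {
    "alnum": r"A-Za-z0-9",
    "alpha": r"A-Za-z",
    "blank": r" \t",
    "digit": r"0-9",
    "lower": r"a-z",
    "space": r" \t\n\r\f\v",
    "upper": r"A-Z",
    "xdigit": r"A-Fa-f0-9",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with explicit ranges; unknown names raise KeyError."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: ASCII_CLASSES[m.group(1)], pattern)

translated = expand_posix_classes("[[:upper:]ab]")
matched = [ch for ch in "aBcZb" if re.fullmatch(translated, ch)]

print(translated)  # [A-Zab]
print(matched)     # ['a', 'B', 'Z', 'b']
```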

EXAMPLES

1)

.at matches any three-character string ending with at , including hat , cat , and bat .

2)

[hc]at matches hat and cat .

3)

[^b]at matches all strings matched by .at except bat .

4)

^[hc]at matches hat and cat , but only at the beginning of the string or line.

5)

[hc]at$ matches hat and cat , but only at the end of the string or line.

6)

[hc]+at matches hat , cat , hhat , chat , hcat , ccchat , and so on, but not at .

7)

[hc]*at matches hat , cat , hhat , chat , hcat , ccchat , and so on, and also at .

8)

[hc]?at matches hat , cat , and at .

9)

cat|dog matches cat or dog .

10)

.* matches any sequence of zero or more characters, including the empty string.
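
Most of the examples above can be verified with a short script. This sketch assumes Python's re module and a made-up word list; it checks whole words with fullmatch, so the anchored examples 4) and 5) are left out.

```python
import re

words = ["at", "hat", "cat", "bat", "hhat", "chat", "hcat", "ccchat", "dog"]
patterns = [r".at", r"[hc]at", r"[^b]at", r"[hc]+at",
            r"[hc]*at", r"[hc]?at", r"cat|dog", r".*"]

# Map each pattern to the words it matches in full.
results = {
    p: [w for w in words if re.fullmatch(p, w)]
    for p in patterns
}

for pattern, matched in results.items():
    print(f"{pattern:10} {matched}")
```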

NOTES

This manpage is an extract of the Wikipedia page https://en.wikipedia.org/wiki/Regular_expression version 219305661 (https://en.wikipedia.org/w/index.php?oldid=219305661), written by Boldt Axel and many others. See that page for the complete description of regular expressions.

BUGS

AUTHOR

regexintro was developed by Christian Walther. Send suggestions and bug reports to wa2l@users.sourceforge.net .

COPYRIGHT

This is free software; see WA2LWinTools/doc/COPYING for copying conditions. There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
