regexintro
Package: WA2L/WinTools 1.2.08
Section: File Formats (4)
Updated: 15 July 2018
NAME
regexintro - introduction to regular expression usage
SYNOPSIS
regexintro, regex, regexp
AVAILABILITY
WA2L/WinTools
DESCRIPTION
In computing, regular expressions provide a concise and flexible means
for identifying strings of text of interest, such as particular characters,
words, or patterns of characters.
Regular expressions (abbreviated as regex
or regexp, with plural forms regexes, regexps, or regexen) are written in a
formal language that can be interpreted by a regular expression processor,
a program that either serves as a parser generator or examines text and
identifies parts that match the provided specification.
Regular expressions are heavily used in the commands:
awk(1),
egrep(1),
gawk(1),
grep(1),
and
sed(1).
BASIC CONCEPTS
A regular expression, often called a pattern, is an expression that
describes a set of strings. Patterns are usually used to give a concise
description of a set, without having to list all of its elements.
For example, the set containing the three strings
Handel
,
Händel
, and
Haendel
can be described by the pattern
H(ä|ae?)ndel
(or alternatively, it is said that the pattern matches each of the three strings).
In most formalisms, if there is any regex that matches a particular set then
there is an infinite number of such expressions. Most formalisms provide
the following operations to construct regular expressions:
- Alternation
-
A vertical bar separates alternatives. For example,
gray|grey
can match
gray
or
grey
.
- Grouping
-
Parentheses are used to define the scope and precedence of the operators
(among other uses). For example,
gray|grey
and
gr(a|e)y
are equivalent patterns which both describe the set of
gray
and
grey
.
- Quantification
-
A quantifier after a token (such as a character) or group specifies how often
that preceding element is allowed to occur. The most common quantifiers are
?
,
*
, and
+
.
-
- ?
-
The question mark indicates there is zero or one of the preceding element.
For example,
colou?r
matches both
color
and
colour
.
- *
-
The asterisk indicates there are zero or more of the preceding element.
For example,
ab*c
matches
ac
,
abc
,
abbc
,
abbbc
, and so on.
- +
-
The plus sign indicates that there is one or more of the preceding element.
For example,
ab+c
matches
abc
,
abbc
,
abbbc
, and so on, but not
ac.
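The three operations above can be tried out with a short script. This is a
minimal sketch using Python's re module, which is not a POSIX implementation
but handles |, ( ), ?, * and + the same way for these patterns. The helper
name matches_whole is hypothetical and exists only for this illustration.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Alternation and grouping describe the same set of strings.
for word in ("gray", "grey"):
    assert matches_whole(r"gray|grey", word)
    assert matches_whole(r"gr(a|e)y", word)

# Quantifiers: ? (zero or one), * (zero or more), + (one or more).
samples = ("ac", "abc", "abbc")
print(matches_whole(r"colou?r", "color"), matches_whole(r"colou?r", "colour"))
print([s for s in samples if matches_whole(r"ab*c", s)])  # ['ac', 'abc', 'abbc']
print([s for s in samples if matches_whole(r"ab+c", s)])  # ['abc', 'abbc']
```

re.fullmatch is used so that a pattern has to account for the whole string,
which corresponds to the "set of strings" view taken in this section.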
These constructions can be combined to form arbitrarily complex expressions,
much like one can construct arithmetical expressions from numbers and the operations
+
,
-
, and
*
. For example,
H(ae?|ä)ndel
and
H(a|ae|ä)ndel
are both valid patterns which match the same strings as the earlier example,
H(ä|ae?)ndel
.
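That the three patterns describe the same set can be checked mechanically.
The sketch below again assumes Python's re module as a stand-in for a regular
expression processor; the candidate list and the name matched_set are chosen
for illustration only.

```python
import re

patterns = [r"H(ä|ae?)ndel", r"H(ae?|ä)ndel", r"H(a|ae|ä)ndel"]
candidates = ["Handel", "Händel", "Haendel", "Hndel", "Haaendel"]

def matched_set(pattern):
    """Collect the candidates that the pattern matches in full."""
    return {s for s in candidates if re.fullmatch(pattern, s)}

results = [matched_set(p) for p in patterns]
# All three patterns accept exactly the same candidates.
print(results[0] == results[1] == results[2])  # True
print(results[0] == {"Handel", "Händel", "Haendel"})  # True
```

A finite candidate list cannot prove equivalence in general, but it shows
that none of the sample strings separates the three patterns.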
The precise syntax for regular expressions varies among tools and with context;
more detail is given in the Syntax section.
SYNTAX
POSIX BASIC REGULAR EXPRESSIONS
Traditional Unix regular expression syntax followed common conventions but
often differed from tool to tool.
The IEEE POSIX Basic Regular Expressions (BRE) standard (released alongside
an alternative flavor called Extended Regular Expressions or ERE) was designed
mostly for backward compatibility with the traditional syntax but provided
a common standard which has since been adopted as the default syntax of
many Unix regular expression tools, though there is often some variation
or additional features.
Many such tools also provide support for ERE syntax with command line arguments.
In the BRE syntax, most characters are treated as literals - they match only
themselves (i.e.,
a
matches
a
). The exceptions, listed below, are called
meta characters or meta sequences.
- .
-
Matches any single character. Many applications exclude newlines, and
exactly which characters count as newlines depends on the flavor, the
character encoding, and the platform, but it is safe to assume that the
line feed character is included. Within POSIX bracket expressions,
the dot character matches a literal dot. For example,
a.c
matches
abc
, etc., but
[a.c] matches only
a
,
.
, or
c
.
- [ ]
-
A bracket expression. Matches a single character that is contained
within the brackets. For example,
[abc]
matches
a
,
b
, or
c
.
[a-z]
specifies a range which matches any lowercase letter from
a
to
z
. These forms can be mixed:
[abcx-z]
matches
a
,
b
,
c
,
x
,
y
, and
z, as does
[a-cx-z]
.
The
-
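character is treated as a literal character if it is the
last or the first character within the brackets:
[abc-]
or
[-abc]
. Some flavors also accept a hyphen escaped with a backslash,
[a\-bc]
, but inside a POSIX bracket expression the backslash is itself a literal character.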
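The bracket expressions above can be explored by testing each character of a
small alphabet against them. This is a sketch under the assumption that
Python's re module treats these particular bracket expressions like a POSIX
tool does; the helper chars_matched and its default alphabet are hypothetical.

```python
import re

def chars_matched(pattern, alphabet="abcxyz-."):
    """Return the characters of the alphabet that the pattern matches."""
    return "".join(ch for ch in alphabet if re.fullmatch(pattern, ch))

print(chars_matched(r"[abcx-z]"))  # abcxyz
print(chars_matched(r"[a-cx-z]"))  # abcxyz
print(chars_matched(r"[abc-]"))    # abc-
print(chars_matched(r"[-abc]"))    # abc-
print(chars_matched(r"[a.c]"))     # ac.
```

The last line shows that the dot loses its special meaning inside brackets.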
- [^ ]
-
Matches a single character that is not contained within the brackets.
For example,
[^abc]
matches any character other than
a
,
b
, or
c
.
[^a-z]
matches any single character that is not a lowercase letter
from
a
to
z
. As above, literal characters and ranges can be mixed.
- ^
-
Matches the starting position within the string. In line-based tools,
it matches the starting position of any line.
- $
-
Matches the ending position of the string or the position just before
a string-ending newline. In line-based tools, it matches the ending
position of any line.
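The difference between string-based and line-based anchoring can be sketched
with Python's re module (an assumption, not one of the tools named in this
page). Its re.MULTILINE flag approximates the behavior of a line-based tool.

```python
import re

text = "cat hat\nhat cat\n"

# Default mode: ^ anchors only at the start of the whole string.
start_default = re.findall(r"^[hc]at", text)
# MULTILINE mode: ^ and $ anchor at the start and end of every line.
start_lines = re.findall(r"^[hc]at", text, flags=re.MULTILINE)
end_lines = re.findall(r"[hc]at$", text, flags=re.MULTILINE)
# Default mode: $ matches just before a string-ending newline.
end_default = re.findall(r"[hc]at$", text)

print(start_default)  # ['cat']
print(start_lines)    # ['cat', 'hat']
print(end_lines)      # ['hat', 'cat']
print(end_default)    # ['cat']
```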
- *
-
Matches the preceding element zero or more times. For example,
ab*c
matches
ac
,
abc
,
abbbc
, etc.
[xyz]*
matches the empty string,
x
,
y
,
z
,
zx
,
zyx
,
xyzzy
, and so on.
\(ab\)*
matches the empty string,
ab
,
abab
,
ababab
, and so on.
- \{m,n\}
-
Matches the preceding element at least m and not more than n times. For
example,
a\{3,5\}
matches only
aaa
,
aaaa
, and
aaaaa
. A few older regular expression implementations do not support this
construct, so avoid it where compatibility with such tools matters.
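The repetition operators can be checked with a small script. Note the
assumption: Python's re module writes groups and intervals without
backslashes, so the BRE forms \(ab\)* and a\{3,5\} appear below as (ab)* and
a{3,5}. The helper name accepted is hypothetical.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Zero or more characters from the set, including the empty string.
print(accepted(r"[xyz]*", ["", "x", "zyx", "xyzzy", "xa"]))
# Zero or more repetitions of the group "ab".
print(accepted(r"(ab)*", ["", "ab", "abab", "aba"]))
# Between three and five repetitions of "a".
print(accepted(r"a{3,5}", ["aa", "aaa", "aaaa", "aaaaa", "aaaaaa"]))
```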
POSIX EXTENDED REGULAR EXPRESSIONS
In the POSIX Extended Regular Expression (ERE) syntax, the role of the
backslash is reversed for some characters: grouping and interval operators
are written without a backslash, as ( ) and { }, and a backslash causes a
meta character to be treated as a literal character. Additionally, support
is removed for \n back references (where n is a digit),
and the following meta characters are added:
- ?
-
Matches the preceding element zero or one time. For example,
ba?
matches
b
or
ba
.
- +
-
Matches the preceding element one or more times. For example,
ba+
matches
ba
,
baa
,
baaa
, and so on.
- |
-
The choice (aka alternation or set union) operator matches either the
expression before or the expression after the operator. For example,
abc|def
matches
abc
or
def
.
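The added meta characters and the literal-making backslash can be sketched as
follows. Python's re module is assumed here because its syntax for these
constructs coincides with ERE; it is not a POSIX implementation.

```python
import re

def full(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"ba?", "b"), full(r"ba?", "ba"))          # True True
print(full(r"ba+", "baaa"), full(r"ba+", "b"))        # True False
print(full(r"abc|def", "def"))                        # True
# A backslash turns the meta character + into a literal plus sign.
print(full(r"1\+1", "1+1"), full(r"1\+1", "11"))      # True False
```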
POSIX CHARACTER CLASSES
Since many ranges of characters depend on the chosen locale setting
(e.g., in some settings letters are ordered as abc...zABC...Z, while
in others as aAbBcC...zZ), the POSIX standard defines some classes
or categories of characters, as shown in the following list.
These constructs are expected to be less portable than expressions built
from the more basic constructs above. Therefore, for
compatibility reasons, it is recommended to avoid the following constructs.
- [:alnum:]
-
Alphanumeric characters.
- [:alpha:]
-
Alphabetic characters.
- [:blank:]
-
Space and tab.
- [:cntrl:]
-
Control characters.
- [:digit:]
-
Digits.
- [:graph:]
-
Visible characters.
- [:lower:]
-
Lowercase letters.
- [:print:]
-
Visible characters and spaces.
- [:punct:]
-
Punctuation characters.
- [:space:]
-
White-space characters.
- [:upper:]
-
Uppercase letters.
- [:xdigit:]
-
Hexadecimal digits.
POSIX character classes can only be used within bracket expressions. For
example,
[[:upper:]ab]
matches the uppercase letters and lowercase
a
and
b
.
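Tools without POSIX class support need an explicit range instead. Python's re
module is one such case, so the sketch below rewrites a few class names to
their ASCII equivalents before matching. The mapping assumes the C (POSIX)
locale, covers only some classes, and the names POSIX_ASCII and
expand_posix_classes are hypothetical.

```python
import re

# ASCII equivalents for a few POSIX classes (assumption: C locale).
POSIX_ASCII = {
    "[:upper:]": "A-Z",
    "[:lower:]": "a-z",
    "[:digit:]": "0-9",
    "[:alpha:]": "A-Za-z",
    "[:alnum:]": "A-Za-z0-9",
    "[:xdigit:]": "0-9A-Fa-f",
}

def expand_posix_classes(pattern):
    """Replace class names by ranges; valid only inside bracket expressions."""
    for name, ascii_range in POSIX_ASCII.items():
        pattern = pattern.replace(name, ascii_range)
    return pattern

pattern = expand_posix_classes("[[:upper:]ab]")
print(pattern)  # [A-Zab]
print([ch for ch in "aAbBcC" if re.fullmatch(pattern, ch)])
```

The plain text replacement is deliberately naive: it does not check that a
class name really occurs inside a bracket expression.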
EXAMPLES
- 1)
-
.at
matches any three-character string ending with
at
, including
hat
,
cat
, and
bat
.
- 2)
-
[hc]at
matches
hat
and
cat
.
- 3)
-
[^b]at
matches all strings matched by
.at
except
bat
.
- 4)
-
^[hc]at
matches
hat
and
cat
, but only at the beginning of the string or line.
- 5)
-
[hc]at$
matches
hat
and
cat
, but only at the end of the string or line.
- 6)
-
[hc]+at
matches
hat
,
cat
,
hhat
,
chat
,
hcat
,
ccchat
, and so on, but not
at
.
- 7)
-
[hc]*at
matches
hat
,
cat
,
hhat
,
chat
,
hcat
,
ccchat
, and so on, and also
at
.
- 8)
-
[hc]?at
matches
hat
,
cat
, and
at
.
- 9)
-
cat|dog
matches
cat
or
dog
.
- 10)
-
.*
matches any sequence of zero or more characters, including the empty string.
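Several of the examples above can be verified against a word list. This is a
sketch that assumes Python's re module, whose syntax agrees with ERE for these
patterns; the word list and the name table are chosen for illustration.

```python
import re

words = ["at", "hat", "cat", "bat", "chat", "ccchat", "dog"]
patterns = [r".at", r"[hc]at", r"[^b]at", r"[hc]+at",
            r"[hc]*at", r"[hc]?at", r"cat|dog", r".*"]

# Map every pattern to the words it matches in full.
table = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for pattern, matched in table.items():
    print(pattern, matched)
```

Because re.fullmatch is used, the anchored examples 4) and 5) are left out:
every pattern here is already tied to both ends of the word.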
SEE ALSO
wintoolsintro(1),
awk(3),
egrep(1),
grep(1),
sed(1),
sed1line(1),
https://en.wikipedia.org/w/index.php?oldid=219305661,
https://en.wikipedia.org/wiki/Regular_expression
- [AWK]
-
The AWK Programming Language, October 1988, Aho Alfred V., Weinberger
Peter J., Kernighan Brian W., ISBN 0-201-07981-X
- [REX]
-
Regular Expression, Wikipedia the Free Encyclopedia, 14.06.2008,
Version 219305661, Boldt Axel,
File: https://en.wikipedia.org/w/index.php?oldid=219305661
- [SSP]
-
Shellscript Programmierung, Sun Service, Revision C21 February 1994, Sun Microsystems Inc.,
Sun Part No: 8xx-xxxx-xx
NOTES
This manpage is an extract of the Wikipedia page
https://en.wikipedia.org/wiki/Regular_expression
version 219305661
(https://en.wikipedia.org/w/index.php?oldid=219305661)
which was written by Boldt Axel and many others.
See the mentioned
web page to view the complete regular expression description.
BUGS
-
AUTHOR
regexintro was developed by Christian Walther. Send suggestions
and bug reports to wa2l@users.sourceforge.net .
COPYRIGHT
Copyright © 2020
Christian Walther
This is free software; see
WA2LWinTools/doc/COPYING
for copying conditions. There is ABSOLUTELY NO WARRANTY; not
even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.