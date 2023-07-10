The RegEx standard supported by the RXP Compiler is based on the PCRE pattern specification (see section Related Documentation). The RXP Compiler supports a subset of this standard with the supported and unsupported constructs summarized in sections Backslash and Word Boundary respectively. The RXP Compiler also supports some functionality in addition to the PCRE specification. The extra functionality bolts on to the top of the PCRE support and is summarized in the following table.

Table 13. Additional Functionality Supported by the RXP Compiler

Feature             Description
Anchored to offset  This allows the traditional Start of Subject (SOS) anchor to be extended so that the start point can be redefined as a user-specified value (see section Anchored to Offset).
XML schema          The XML schema features are not enabled by default and must be switched on before use. The XML schema offers four extra character classes (see table "Supported XML Schema Classes" in section Predefined Classes) and the character class subtraction feature (see table "Supported Character Class Notation" in section User-defined Character Classes). When the XML schema is enabled, the \c escape sequence no longer denotes a control character (see section Non-printing Characters) but instead denotes the XML class shown in table "Supported XML Schema Classes" in section Predefined Classes.

The RXP compiler makes the following assumptions:

All rules are ungreedy. This means the RXP will always report back the shortest possible match. For example, for the RegEx /AB*/ and the data ABBBBB the RXP would match A as this is the shortest match. If this was scanned by PCRE in greedy mode it would match ABBBBB as this is the longest match.

To save hardware resources, all parentheses are non-capturing by default (they will only capture if back referencing is required).
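The difference between ungreedy and greedy matching can be illustrated with a general-purpose engine. The sketch below uses Python's standard `re` module rather than the RXP, so it only approximates the behavior described above; the lazy quantifier `*?` is used here as a stand-in for the RXP's ungreedy default.

```python
import re

data = "ABBBBB"

# Lazy quantifier: stops at the shortest possible match, as the RXP would report.
shortest = re.match(r"AB*?", data).group()

# Greedy quantifier: the PCRE default, which consumes as much as possible.
longest = re.match(r"AB*", data).group()

print(shortest)  # A
print(longest)   # ABBBBB
```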

The " \ " (backslash) metacharacter is used in many of the constructs listed below. In every usage scenario, the backslash changes the meaning of the character directly following it. Only a defined set of characters can follow a backslash, and the character that follows determines which function the resulting "backslash+character" metasequence performs. The following is a list of all the backslash metasequence types that the RXP Compiler supports:

Quoting. For example, /\*\?/ will match literally *?.

Non-printing characters. For example, /\n/ will match the newline character.

Hexadecimal formats. For example, /\x0A/ will match the newline character.

Octal formats. For example, /\012/ will match the newline character.

Predefined general classes. For example, /\s/ will match any whitespace character.

Back references. For example, /(ABC)\1/ will match ABCABC.

Simple assertions. For example, /\AABC\Z/ will match ABC with the A occurring at the beginning of the subject and C occurring at the end of the subject.
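Several of the metasequences listed above can be tried out in a general-purpose engine. The following is a minimal sketch using Python's standard `re` module as a stand-in; it is not the RXP Compiler, and the variable names are illustrative only.

```python
import re

# Three spellings of the newline character (value 0x0A).
newline_patterns = [r"\n", r"\x0A", r"\012"]
all_match_newline = all(re.fullmatch(p, "\n") for p in newline_patterns)

# Quoting: the backslash removes the special meaning of * and ?.
quoted = re.fullmatch(r"\*\?", "*?") is not None

# Back reference: \1 repeats the text captured by the first group.
backref = re.fullmatch(r"(ABC)\1", "ABCABC") is not None

# Simple assertions: \A and \Z pin the match to the subject boundaries.
asserted = re.search(r"\AABC\Z", "ABC") is not None
not_at_start = re.search(r"\AABC\Z", "xABC") is not None

print(all_match_newline, quoted, backref, asserted, not_at_start)
```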

The following table provides a brief description of each RegEx construct the RXP Compiler supports. These are described in more detail in the sections that follow.

Table 14. Regex Constructs Supported by the RXP Compiler

Construct                       Example                       Description
Alternation                     /ABC|DEF/                     Functions in the same manner as the logical OR. The example will match ABC or DEF.
Anchored to Offset              /^.{12}ABCD$/                 Match must occur at the specified offset from the beginning of the subject. In the example, ABCD must begin at byte pointer 12.
Anchoring                       /^ABCD$/                      Anchors mean the match must occur at the beginning or end of the data stream. Anchors can also apply to \n characters if the m modifier is selected (see section Modifiers). The example will only match ABCD if this is the case.
Back References                 /<(A)>BCDE<\1>/               Back references are variables that refer to text matched earlier in the RegEx within capturing parentheses. In the example \1 will match A as it has been captured within the parentheses.
Capturing Parentheses           /(ABCD)|(DEFG)\1/             For back referencing it is required to capture a match to be referenced. This is achieved through capturing parentheses. In the example, ABCD will be captured and stored in \1 whereas DEFG will not be captured unless a \2 back reference is specified.
Conditional Statement           /1234(A)?(?(1)B|C)/           A special type of back reference that allows an if/then/else statement to be expressed in a RegEx in the form (?(if)then|else). In the example B will only be matched if an A had matched before. Otherwise C will be matched.
Dot Metacharacter               /ABCD.F/                      Can match any ASCII value except newline. The example will match ABCD followed by any character except newline then followed by F.
Hexadecimal Formats             /\xFF\xFF/                    Use the \x delimiter to signify hexadecimal values. The example will match 2 bytes in a row each set to 255.
Inline Comments                 /ABC(?#DEF)GHI/               Inline comments allow notes to be placed within the RegEx. All comments will be ignored by the compiler. The example would match the string ABCGHI as the comment would not be compiled as part of the RegEx.
Internal Option Setting         /ABCD(?i)DEFG(?-i)/           The case sensitivity (i), multiline (m), dotall (s), and free-spacing (x) options can be toggled off and on within the rule. In the example ABCD will be case-sensitive whereas DEFG will be case-insensitive.
Literal Strings                 /ABCD/                        A string of ASCII characters. The example will match the string ABCD.
Modifiers                       /ABCD/six                     Occur directly after the RegEx and affect the whole RegEx. See section Modifiers for more details on the modifiers supported by the RXP Compiler.
Non-capturing Parentheses       /(?:ABCD)*/                   Specify grouping and precedence. In the example, the star symbol will apply to all characters contained within the parentheses. ?: will allow for parentheses to be explicitly defined as non-capturing.
Non-printing Characters         /\t\n/                        Control characters such as tab, newline etc. The example will match a tab followed by a newline.
Octal Formats                   /\377\377/                    Use the \ delimiter to specify octal values. The example will match 2 bytes in a row each set to 255.
POSIX Character Classes         /AB[\n[:^digit:][:alpha:]]/   The POSIX notation for character classes is supported. The exact list of POSIX classes is locale dependent. The RXP Compiler supports the most widely used ones. The example will first match the text AB. This is then followed by a character class, or what POSIX refers to as a bracket expression, which will match a newline, any non-digit or any alphabetic character.
Predefined Classes              /\d\s/                        Groups of characters such as digits, whitespace etc. The example will match a digit character followed by a whitespace character.
Quoting                         /\*\+ABC\Q?*+\E/              Quoting will remove the special meaning from special characters. This can be achieved on single characters by preceding them with a backslash. To remove the special meaning from a sequence of characters, place them between \Q and \E. The example will match the literal string *+ABC?*+.
Repetition                      /A*B+C?/                      Specifies multiple occurrences of any construct. The example will match zero to many occurrences of the letter A followed by one to many occurrences of the letter B followed by zero or one occurrences of the letter C.
Reset Subpattern Numbers        /(?|(AB)DE|(FG))\1/           This allows for the subpattern reference number to be reset for each captured alternation within the group. In the example, because the two alternatives are inside a (?| group they will both be numbered one. The example will therefore match both ABDEAB and FGFG.
User-defined Character Classes  /[ABCD0-9]/                   Classes that represent a user-defined range. The example will match A, B, C, D or any digit within the range zero to nine.
Word Boundary                   /\bABCD\b/                    Matches if the current position sits between a word and non-word character or at the start/end of the job.

Alternation is applied using the “|” (vertical bar) metacharacter, e.g. /(ABC|DE|FGHI)/ will match ABC, DE or FGHI. There is no restriction to the number of alternatives that appear, and an alternative may be empty e.g. /A(B|)/ is equivalent to /AB?/ and will match A or AB.

The anchored to offset construct allows the match to be anchored to a specified offset from the beginning of the packet. It is represented by a dot metacharacter with a repetition value placed directly after the symbol for the start anchor, e.g. /^.{4}AB/. If the anchored to offset construct is invalidly specified, it will follow the same rules as the repetition construct (see section Repetition). Multiple offsets may also be specified for clarity; these will all be merged into one anchor to offset structure. For example, /^.{4}.{8}AB/ is equivalent to /^.{12}AB/. The following table lists the anchor to offset metasequences supported by the RXP Compiler.
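The merging of multiple offsets can be checked with a short sketch. It uses Python's standard `re` module on byte strings as an approximation of the RXP; `re.DOTALL` is passed so that the dot accepts every byte value, and the sample packets are invented for illustration.

```python
import re

# re.DOTALL lets the dot match any byte, including newline (0x0A).
merged = re.compile(rb"^.{12}AB", re.DOTALL)
split = re.compile(rb"^.{4}.{8}AB", re.DOTALL)

packet = b"0123456789abAB-trailing"   # AB begins at byte offset 12
shifted = b"0123456789aAB-trailing"   # AB begins at byte offset 11

# Both spellings accept the packet with AB at offset 12 and reject the other.
results = {
    "merged/packet": merged.match(packet) is not None,
    "split/packet": split.match(packet) is not None,
    "merged/shifted": merged.match(shifted) is not None,
    "split/shifted": split.match(shifted) is not None,
}
print(results)
```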

Table 15. Supported Anchored to Offset Operators

Metasequence  Description
^.{m, n}      Match begins between m and n bytes from the start of the subject.
^.{m}         Match begins exactly m bytes from the start of the subject.
^.{m,}        Match begins m or more bytes from the start of the subject.
^..           Each dot advances the offset by one byte (here, an offset of 2).

Anchoring can be achieved using the ^ and $ anchor metacharacters, and by the use of simple assertions. The multiline modifier (/m) is also supported, which affects the way the anchors are interpreted (see section Modifiers). The following table lists the anchoring formats supported by the RXP Compiler.

Table 16. Supported Anchors and Simple Assertions

Metasequence  Description
^             Anchor to start of subject and, if in multiline mode, after newline also.
\A            Anchor to start of subject.
$             Anchor to end of subject and before newline at end of subject. If in multiline mode, anchor before any newline.
\Z            Always anchor to end of subject and before newline at end of subject.
\z            Always anchor to end of subject.

It is important to note that anchoring must be applied at a point in the expression where it can match the beginning or the end of the datastream (or match the beginning or end of a line if multi-line mode is enabled and the ^ or $ metacharacters are used). If this is not adhered to, the anchoring metacharacter will be invalid. An example of a RegEx that will not match anything would be /ABC^DEF/. It is also possible to apply anchoring to each alternation individually e.g. /(^ABC|DEF|^GHI$|JKL)/ or also to elements that occur after optional elements e.g. /(ABC)?(^DEF|GHI)/. In the previous example the DEF part of the alternation will only match if ABC does not occur before it. Valid matches would be ABCGHI, DEF and GHI but not ABCDEF.
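The anchoring examples above can be reproduced in a general-purpose engine. The sketch below uses Python's standard `re` module, which happens to treat these particular patterns the same way; it is an illustration, not a statement about the RXP implementation.

```python
import re

# An anchor placed where it can never be at the start of the subject.
never_matches = re.search(r"ABC^DEF", "ABCDEF")

# Anchoring applied to one alternative that follows an optional element.
rule = re.compile(r"(ABC)?(^DEF|GHI)")
subjects = ["ABCGHI", "DEF", "GHI", "ABCDEF"]
matched = {s: rule.fullmatch(s) is not None for s in subjects}

print(never_matches)  # None
print(matched)
```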

Back references offer the ability to reuse a captured part of the RegEx match. The pattern that is back referenced is obtained by using capturing parentheses. The default method of back referencing is to use a backslash followed by a number greater than zero to invoke the back reference. This numeric value will increment for each set of capturing parentheses encountered in the RegEx. It is also possible to reference by name or relative reference. The table below lists the back-referencing formats supported by the RXP Compiler. When a back reference (named or numbered) is used in a RegEx it must be possible to pair it up with its referenced capture, otherwise an error will occur. All named back references must be less than 32 characters in size and can only contain alphanumeric characters and underscores and cannot begin with a number.

Table 17. Supported Back Referencing Styles

Back Reference  Description
\n              Reference by number n (can be ambiguous with Octal notation (see Octal Formats)).
\gn             Reference by number n.
\g{n}           Reference by number n.
\g{-n}          Relative reference by number n.
\k<name>        Reference by name (Perl notation).
\k'name'        Reference by name (Perl notation).
\g{name}        Reference by name (Perl notation).
\k{name}        Reference by name (.NET notation).
(?P=name)       Reference by name (Python notation).

The use of the \g sequence with a negative number signifies a relative reference. For example, /(ABC)(DEF)\g{-1}/ would match ABCDEFDEF and /(ABC)(DEF)\g{-2}/ matches ABCDEFABC.
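One way to picture a relative reference is as an absolute reference computed from the number of capturing groups opened so far. The sketch below is a hypothetical helper, `resolve_relative_refs`, written for illustration only; it rewrites `\g{-n}` into a numbered back reference that Python's `re` understands. It assumes every plain "(" opens a capturing group and ignores character classes and named groups.

```python
import re

def resolve_relative_refs(pattern: str) -> str:
    """Rewrite \\g{-n} as an absolute numbered back reference (simplified)."""
    out = []
    groups_opened = 0
    i = 0
    while i < len(pattern):
        rel = re.match(r"\\g\{-(\d+)\}", pattern[i:])
        if rel:
            absolute = groups_opened - int(rel.group(1)) + 1
            if absolute < 1:
                raise ValueError("relative reference points before the first group")
            # Wrap in (?:...) so a following digit cannot extend the number.
            out.append(f"(?:\\{absolute})")
            i += rel.end()
        elif pattern[i] == "\\":
            out.append(pattern[i:i + 2])  # copy any other escape untouched
            i += 2
        else:
            # Count only plain capturing groups: "(" not followed by "?".
            if pattern[i] == "(" and pattern[i + 1:i + 2] != "?":
                groups_opened += 1
            out.append(pattern[i])
            i += 1
    return "".join(out)

last_group = resolve_relative_refs(r"(ABC)(DEF)\g{-1}")
first_group = resolve_relative_refs(r"(ABC)(DEF)\g{-2}")
print(last_group)   # (ABC)(DEF)(?:\2)
print(first_group)  # (ABC)(DEF)(?:\1)
```

With the rewritten patterns, `re.fullmatch(last_group, "ABCDEFDEF")` and `re.fullmatch(first_group, "ABCDEFABC")` both succeed, mirroring the two examples in the text.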

An ambiguity exists with the “reference by number” type of back reference and the octal number format (see section Octal Formats). An example of this is \4: in theory this could be a back reference to the value captured in the fourth set of capturing parentheses or it could be the octal number four. To overcome this ambiguity the following rules apply in order of precedence:

Single digit escapes between \1 and \9 will always be interpreted as back references.

An escaped number beginning with zero is always an octal escape. E.g. \010 matches the “backspace” character.

If there is at least that number of previous capturing subpatterns, it will be taken as a back reference. E.g. \10 will be taken as a back reference if there are at least 10 sets of capturing parentheses before it. If there are not at least 10 sets of capturing parentheses, it will then be taken as the octal escape sequence for the “backspace” character.

Otherwise if the value is a qualifying octal number (\000 to \377) then the value will be taken as such.
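The precedence rules above can be written down as a small decision function. This is a sketch of the rules as stated in the text, not the RXP Compiler's code; the function name and the "unspecified" result for cases the text does not cover are assumptions.

```python
def classify_numeric_escape(digits: str, groups_before: int) -> str:
    """Classify the digits following a backslash, outside a character class."""
    # Rule 1: \1 to \9 are always back references.
    if len(digits) == 1 and digits in "123456789":
        return "back reference"
    # Rule 2: a leading zero always means octal.
    if digits.startswith("0"):
        return "octal"
    # Rule 3: enough earlier capturing groups makes it a back reference.
    if int(digits) <= groups_before:
        return "back reference"
    # Rule 4: otherwise accept a legal octal value, \000 to \377.
    if all(d in "01234567" for d in digits) and int(digits, 8) <= 0o377:
        return "octal"
    return "unspecified"  # assumption: the text does not cover this case

print(classify_numeric_escape("4", 0))     # back reference
print(classify_numeric_escape("010", 20))  # octal
print(classify_numeric_escape("10", 10))   # back reference
print(classify_numeric_escape("10", 3))    # octal
```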

These rules are discussed in the PCRE standard documentation and are also referred to in the O'Reilly book "Mastering Regular Expressions". As back references are not supported in character classes, any digit following a backslash inside a character class will always be interpreted as an octal digit.

Besides grouping part of a RegEx together, round brackets also capture the part of the match that occurs within them which can then be used later as a back reference. The table below lists each of the capturing formats along with descriptions supported by the RXP Compiler. Groups will only capture data if the group has an associated back reference or conditional statement to use the captured data. If this is not the case, the group will be treated as non-capturing. All named captures must be less than 32 characters in size and can only contain alphanumeric characters and underscores and cannot begin with a number.

Table 18. Supported Capturing Styles

Capture       Description
(…)           Capturing group.
(?<name>…)    Named capturing group (Perl).
(?'name'…)    Named capturing group (Perl).
(?P<name>…)   Named capturing group (Python).

If a repetition (see section Repetition) value has been applied to a captured group, the captured value will be reset on all iterations of the loop and not appended to. An example of this is /(A|B)*C\1/ , which will match AACA and ABCB but not ABCAB .
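The reset-on-each-iteration behavior can be observed in a general-purpose engine. The sketch below uses Python's standard `re` module, which also keeps only the last iteration's capture; it is offered as an illustration rather than as a description of the RXP internals.

```python
import re

rule = re.compile(r"(A|B)*C\1")

# The group holds only the text from its final iteration.
matched = {s: rule.fullmatch(s) is not None for s in ["AACA", "ABCB", "ABCAB"]}
last_capture = rule.fullmatch("ABCB").group(1)

print(matched)
print(last_capture)  # B
```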

The conditional statement allows an if/then/else statement to be expressed and evaluated within a RegEx. The conditional statement takes the form (?(condition)yes-pattern|no-pattern). The condition is always a back reference, which is evaluated as to whether or not it has matched previously, yielding a boolean result. A simple example of the conditional statement is /1234(A)?(?(1)B|C)/, which will match 1234AB and 1234C but not 1234AC. The conditions can be expressed in various formats as shown in the table below. All named references must be less than 32 characters in size, can only contain alphanumeric characters and underscores, and cannot begin with a number.
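The example conditional can be exercised with Python's standard `re` module, which supports the same `(?(1)yes|no)` syntax. This is a sketch for illustration and does not involve the RXP Compiler.

```python
import re

rule = re.compile(r"1234(A)?(?(1)B|C)")

# B is required when group 1 (the optional A) took part in the match,
# otherwise C is required.
matched = {s: rule.fullmatch(s) is not None for s in ["1234AB", "1234C", "1234AC"]}
print(matched)
```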

Table 19. Condition Styles

Condition Format  Description
(?(n)…)           Absolute reference condition.
(?(+n)…)          Positive relative reference condition (not supported as it requires a forward reference).
(?(-n)…)          Negative relative reference condition.
(?(<name>)…)      Named reference condition (Perl).
(?('name')…)      Named reference condition (Perl).
(?(name)…)        Named reference condition (PCRE).

The dot metacharacter is supported. By default, it matches any ASCII character including newline to help maximize sustainable throughput. This can be overridden globally by using the RXP Compiler utility's –s option, then using the /s modifier on individual rules.

Inline comments permit notes to be interspersed with the RegEx, where they are used by the RegEx writer to help make the RegExes more understandable. Comments are specified within the construct (?#...). The compiler will ignore all comments when processing the input file, e.g. /ABC(?#DEF)GHI/ will match ABCGHI.

Internal option setting allows for features usually specified as mode modifiers (see section Modifiers) to be toggled on and off within the rule. These will usually modify the way the pattern matching operation should be performed. A list of supported internal options is given in the following table.

Table 20. Supported Internal Options

Modifier     Description
(?i)…(?-i)   This toggles on and off case insensitivity. This is functionally equivalent to the /i modifier.
(?m)…(?-m)   This toggles on and off multi-line mode. This is functionally equivalent to the /m modifier.
(?s)…(?-s)   This toggles on and off single-line mode. This is functionally equivalent to the /s modifier.
(?x)…(?-x)   This toggles on and off free-spacing mode. This is functionally equivalent to the /x modifier.

It is possible to specify multiple options in one statement, e.g. (?ix) to set case insensitive and free-spacing mode. It is also possible to combine the setting and unsetting of internal options such as (?ix-s) to set case insensitive and free-spacing mode and unset single-line mode. All internal options set within a set of parentheses will be turned off at the closing parentheses and the options that were set outside the parentheses will be reinstated. An example of this is /((?i)a)a/, which matches Aa or aa but not AA. It is also possible to specify a span for a sub-pattern where the options are set e.g. (?i-sx:sub-pattern) will match the sub-pattern inside the span with the options "i" and "x" turned on, and "s" turned off. The option settings will carry onto subsequent alternation branches even if the branch on which it occurs is not encountered during the matching process. The reason for this is that the option settings are all dealt with and applied at compile time and their span is applied to the two-dimensional “text” version of the RegEx without knowledge of the RegEx execution engine.
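The scoping of an option to its enclosing parentheses can be sketched with Python's standard `re` module. Recent Python versions only accept a bare `(?i)` at the very start of a pattern, so the span form `(?i:…)` is used here as the closest equivalent of /((?i)a)a/; this substitution is an assumption made for the illustration.

```python
import re

# Case insensitivity applies only inside the span; the trailing "a" stays
# case-sensitive.
rule = re.compile(r"(?i:a)a")

matched = {s: rule.fullmatch(s) is not None for s in ["Aa", "aa", "AA"]}
print(matched)
```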

The “\x” delimiter is used to signify hexadecimal values within rules. If the “\x” delimiter is encountered on its own, it will be interpreted as a hexadecimal escape with no following digits, giving a value of zero. Hexadecimal digits may be defined using upper and/or lower case letters. The following table describes the hexadecimal formats that are supported.

Table 21. Supported Hexadecimal Formats

Hexadecimal Format  Description
\xh                 h is a one-digit hexadecimal value representing a single character. Will be interpreted as \x0h.
\xhh                hh is a two-digit hexadecimal value representing a single character.
\x{hh}              hh is a two-digit hexadecimal value representing a single character.

Literal strings of characters such as letters, digits, and special characters (including those escaped with a backslash) are supported. The characters will be grouped together where possible into strings of arbitrary length. The RXP Compiler will strive to form the longest possible strings in order to take advantage of the RXP's ability to process multiple characters per clock cycle.

Mode modifiers are operators appended to the end of a RegEx to modify the way the pattern matching operation should be performed. A list of supported RegEx modifiers along with descriptions is given in the following table.

Table 22. Supported Mode Modifiers

Modifier  Description
/i        If this modifier is set, letters in the pattern match both upper and lower-case letters in the subject string i.e. caseless or case insensitive. Caseless matching is only supported for characters with an ASCII value of less than 128. For caseless matching of characters with a value of greater than 128, Unicode must be supported.
/m        By default, PCRE treats the subject string as consisting of a single “line” of characters (even if it contains several newlines). The “start of line” metacharacter ^ matches only the start of the string, while the “end of line” metacharacter $ matches only at the end of the string, or before a terminating newline. When this modifier is set, the start of line and end of line constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. If there are no “\n” characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.
/s        Enables single-line mode. If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. A negated class such as [^a] always matches a newline character, independent of the setting of this modifier.
/x        Enables free-spacing mode where all space characters (\x20) between RegEx tokens are ignored. It is important to note that only the space between tokens is ignored. In free-spacing mode the space character can be inserted into the RegEx by escaping it with a backslash or as part of a character class i.e. “\ ” or “[ ]”.
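The effect of the /i, /m and /s modifiers can be illustrated with the corresponding flags of Python's standard `re` module (`re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`). This is a sketch of the PCRE-style semantics described in the table; /x is left out because Python's verbose mode also strips other whitespace and comments, which differs from the space-only behavior described above.

```python
import re

subject = "A\nB\nC"

# /m: ^ and $ also match next to embedded newlines.
anchors_default = re.findall(r"^B$", subject)
anchors_multiline = re.findall(r"^B$", subject, re.MULTILINE)

# /s: the dot also matches a newline.
dot_default = re.fullmatch(r"A.B", "A\nB") is not None
dot_single_line = re.fullmatch(r"A.B", "A\nB", re.DOTALL) is not None

# A negated class matches a newline regardless of /s.
negated_class = re.fullmatch(r"[^a]", "\n") is not None

# /i: caseless matching of ASCII letters.
caseless = re.fullmatch(r"abcd", "AbCd", re.IGNORECASE) is not None

print(anchors_default, anchors_multiline)  # [] ['B']
print(dot_default, dot_single_line, negated_class, caseless)
```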

The RXP also has a set of custom modifiers that can affect the way each individual rule is compiled. A list of RXP custom RegEx modifiers along with descriptions is given in the following table.

Table 23. Supported RXP Custom Mode Modifiers

Modifier  Description
/c        If this is set then subpattern matching will be enabled for this rule.
/o        If this modifier is set, the rule will be split at its alternations.
/O        If this modifier is set, the rule will not be split at its alternations, even if the global switch is set.
/q        If this modifier is set, strict-quantifier mode will be used for this rule. To help with performance, by default the RXP Compiler treats non-fixed bounded quantifiers as unbounded e.g. .{0,2048} will be the same as .*. This has the caveat of false positives being possible. This modifier ensures that the original construct is used; performance will be worse but no false positives will occur.
/Q        If this modifier is set, strict-quantifier mode will not be used for this rule, even if the global switch is set.
/p        If this modifier is set, the PTPB filter will be ignored for this rule.
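The false positives mentioned for the default (non-strict) quantifier handling can be pictured by comparing a bounded quantifier with its unbounded relaxation. The sketch below uses Python's standard `re` module and a small bound of 3 instead of 2048 to keep the data short; it models the effect only and is not the RXP Compiler's transformation.

```python
import re

strict = re.compile(r"A.{0,3}B", re.DOTALL)   # what /q would preserve
relaxed = re.compile(r"A.*B", re.DOTALL)      # bounded quantifier treated as unbounded

data = "A12345B"  # five bytes between A and B, more than the bound allows

strict_hit = strict.search(data) is not None
relaxed_hit = relaxed.search(data) is not None

# The relaxed rule reports a match the original rule would not: a false positive.
print(strict_hit, relaxed_hit)  # False True
```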

Non-capturing parentheses are only used to group together parts of a RegEx and not used for capturing. By default, all parentheses will be interpreted by the RXP Compiler as non-capturing unless a back reference is paired with its captured data. Non-capturing parentheses can be explicitly expressed in the form (?:…).

The “\” (backslash) character can be used to encode non-printing characters in a visible form. Hexadecimal and octal notation can also be used to encode these characters; however, when used excessively they may obfuscate the RegEx. Inside a character class the “\b” metasequence represents the backspace character. The non-printing characters given in the following table are supported.

Table 24. Supported Non-Printing Characters

Non-Printing Character           Hexadecimal Value  Description
\0                               \x00               NULL.
\a                               \x07               Alarm (BEL).
\cx (x = any alphabetic char)    Dependent on x     If x is a lower case letter it will be converted to uppercase. Then bit 6 of the character will be inverted. This feature is not available in XML mode.
\e                               \x1B               Escape.
\f                               \x0C               Formfeed.
\n                               \x0A               Newline.
\r                               \x0D               Carriage return.
\t                               \x09               Horizontal tab.
\b                               \x08               Backspace (only in character class).

The following table lists all the octal formats along with a description that are supported by the RXP Compiler.

Table 25. Supported Octal Formats

Octal Format  Description
\d            d is a one digit octal value ranging from \0 to \7.
\dd           dd is a two digit octal value ranging from \00 to \77.
\ddd          ddd is a three digit octal value ranging from \000 to \377.

The RXP Compiler will attempt to extract the maximum number of octal digits immediately following an octal escape sequence to create a legal octal value (\000 to \377) indicating a single character.

There are issues arising with the \d, \dd and \ddd notation as it has an ambiguity with the “reference by number” back reference notation. See section Back References for more details on this and possible avoidance measures.

Note that if the \0 is immediately followed by non-legal octal digits, the \0 shall be interpreted as a NULL character (see section Non-printing Characters).

The POSIX classes given in the table below are supported. These can be inserted in character classes to denote a range of characters e.g. /ABC[0-9[:alpha:]]/ will match ABC followed by an alphanumeric character. The POSIX character classes that are available depend on the POSIX locale. The RXP Compiler supports all of the most widely used POSIX character classes. Note that POSIX classes include letters and digits defined in the locale, not just those in ASCII.

Table 26. Supported POSIX and Shorthand Equivalent Classes

Class       Equivalent                               Description
[:alnum:]   [a-zA-Z0-9]                              Alphanumeric characters.
[:alpha:]   [a-zA-Z]                                 Alphabetic characters.
[:ascii:]   [\x00-\x7F]                              ASCII characters.
[:blank:]   [\x20\t]                                 Space or tab.
[:cntrl:]   [\x00-\x1F\x7F]                          Control characters.
[:digit:]   [0-9]                                    Any decimal digit.
[:graph:]   [\x21-\x7E]                              Any visible or printing character.
[:lower:]   [a-z]                                    Any lowercase alphabetic character.
[:print:]   [\x20-\x7E]                              Visible characters including space also.
[:punct:]   [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]    Any punctuation character.
[:space:]   [\x20\t\r\n\v\f]                         Any whitespace character.
[:upper:]   [A-Z]                                    Any uppercase alphabetic character.
[:word:]    [a-zA-Z0-9_]                             Any word character.
[:xdigit:]  [0-9A-Fa-f]                              Any hexadecimal digit.
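The shorthand equivalents in the table above can be used to translate a POSIX class into a plain range for engines that lack POSIX notation. The sketch below does this for a subset of the classes and checks the result with Python's standard `re` module; `expand_posix` and `POSIX_EQUIVALENTS` are hypothetical names, and only the ASCII equivalents from the table are assumed.

```python
import re

# ASCII equivalents copied from the table above (subset).
POSIX_EQUIVALENTS = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern: str) -> str:
    """Replace [:name:] inside a bracket expression with its ASCII range."""
    for name, body in POSIX_EQUIVALENTS.items():
        pattern = pattern.replace(f"[:{name}:]", body)
    return pattern

expanded = expand_posix(r"ABC[0-9[:alpha:]]")
print(expanded)  # ABC[0-9a-zA-Z]

rule = re.compile(expanded)
matched = {s: rule.fullmatch(s) is not None for s in ["ABC7", "ABCx", "ABC_"]}
print(matched)
```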

The “\” (backslash) character is used to introduce predefined classes. The general type of predefined classes is used to specify a range of commonly used character classes. A list of these is given in the following table.

Table 27. Supported General Classes

Class  Equivalent             Description
\d     [0-9]                  Any decimal digit.
\D     [^0-9]                 Not a decimal digit.
\h     [\t\x20]               Any horizontal whitespace character.
\H     [^\t\x20]              Not a horizontal whitespace character.
\s     [\t\n\f\r\x20]         Any whitespace character.
\S     [^\t\n\f\r\x20]        Not a whitespace character.
\v     [\n\x0B\f\r\x85]       Any vertical whitespace character.
\V     [^\n\x0B\f\r\x85]      Not a vertical whitespace character.
\w     [a-zA-Z0-9_]           Any word character.
\W     [^a-zA-Z0-9_]          Not a word character.

The RXP Compiler also has the option to use the XML schema RegExes predefined classes. There are four extra predefined classes supported that are not normally supported by any other flavor. The XML schema classes are shown in the following table.

Table 28. Supported XML Schema Classes

Class  Equivalent          Description
\i     [_:A-Za-z]          Any character that can be the first character of an XML tag name.
\I     [^_:A-Za-z]         Not a character that can be the first character of an XML tag name.
\c     [-._:A-Za-z0-9]     Any character that may occur after the first character of an XML tag name.
\C     [^-._:A-Za-z0-9]    Not a character that may occur after the first character of an XML tag name.

As can be seen from this table, the \c escape sequence is used to represent one of the XML predefined classes. This means when the XML schema mode is enabled the \c sequence no longer represents a control character escape sequence (see section Non-printing Characters) as it conflicts with the XML class notation.

Quoting is used to remove the special meaning from characters and allows them to be treated as literals. Single metacharacters can be quoted by using the “\” (backslash) character e.g. /A\*/ will literally match A*. If a literal backslash character is desired then the \\ sequence can be used i.e. a backslash quoting a backslash. Quoting using a backslash can only be used on single characters. Multiple characters can be quoted by surrounding them with \Q…\E e.g. /\QA*?+\E/ will literally match A*?+. It is important to note that the \E can be omitted at the end so /\QA*?+/ is equivalent to /\QA*?+\E/.
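For engines without \Q…\E support, the same effect can be obtained by escaping each quoted character individually. The sketch below is a hypothetical helper, `expand_quoting`, built on `re.escape` from Python's standard library; it treats a missing \E as running to the end of the pattern, as described above, and does not handle an escaped backslash directly followed by Q.

```python
import re

def expand_quoting(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as individually escaped characters (simplified)."""
    head, *quoted_parts = pattern.split(r"\Q")
    out = [head]
    for part in quoted_parts:
        # Everything up to \E is literal; a missing \E quotes to the end.
        literal, _, rest = part.partition(r"\E")
        out.append(re.escape(literal) + rest)
    return "".join(out)

with_end = expand_quoting(r"\QA*?+\E")
without_end = expand_quoting(r"\QA*?+")
mixed = expand_quoting(r"\*\+ABC\Q?*+\E")

print(with_end == without_end)
print(re.fullmatch(with_end, "A*?+") is not None)
print(re.fullmatch(mixed, "*+ABC?*+") is not None)
```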

Repetition can be applied to the following supported constructs:

A literal character.

The dot metacharacter.

Any escape that matches a single character.

A character class.

A back reference.

A parenthesized subpattern that is not an assertion.

Repetition is specified by quantifiers which specify a minimum and maximum number of permitted matches. The three most common quantifiers have been given single character abbreviations. All the repetition metacharacters (or metasequences, if they consist of more than one character) given in the following table are supported.

Table 29. Supported Repetition Operators

Repetition Metacharacters  Description
?                          0 or 1 occurrences of previous construct.
+                          1 or more occurrences of previous construct.
*                          0 or more occurrences of previous construct.
{m, n}                     Between m and n occurrences of previous construct.
{m}                        Exactly m occurrences of previous construct.
{m,}                       m or more occurrences of previous construct.

It is important to note that the RXP will treat all repetition as ungreedy. The quantifier {0} is permitted causing the expression to behave as if the previous construct and quantifier were not present. Infinite loops can be constructed by following a subpattern that can match no characters with a quantifier that has no upper limit e.g. /(a?)*/. If a repetition construct cannot be interpreted as valid, it will be interpreted as a literal string e.g. /A{,4}/ will match the string A{,4} or /A{1,4aa}/ will match the string A{1,4aa}. If the quantifiers are out of order i.e. the minimum repetition value is greater than the maximum repetition value, this will cause an error. Also if the repetition value exceeds the maximum repetition value of 32K, an error will be generated.
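Some of the quantifier rules above can be observed in a general-purpose engine. The sketch below uses Python's standard `re` module, which behaves the same way for these three cases; the /A{,4}/ example is omitted because Python accepts {,4} as a valid quantifier, and the 32K limit is specific to the RXP and is not modelled.

```python
import re

# {0}: the construct and its quantifier behave as if they were absent.
zero_repeat = re.fullmatch(r"AB{0}C", "AC") is not None

# An invalid quantifier is read as a literal string.
literal_braces = re.fullmatch(r"A{1,4aa}", "A{1,4aa}") is not None

# Minimum greater than maximum is rejected at compile time.
try:
    re.compile(r"A{4,1}")
    out_of_order_rejected = False
except re.error:
    out_of_order_rejected = True

print(zero_repeat, literal_braces, out_of_order_rejected)
```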

The reset subpattern numbers construct allows the subpattern reference number to be reset for each alternation, e.g. /(?|(A)B|(C))\1/ will match ABA and also CC. This means that when the pattern matches, captured substring one can be used, regardless of which alternative matched. This can be used when it is desirable to capture part of one of a number of alternatives. The captures are numbered as normal inside a “reset subpattern numbers” group except that the number is reset at the start of each alternation. E.g. in /(A)(?|(B)|(C(D)))/ the capturing parentheses from left to right would be numbered 1, 2, 2, 3.

Subpattern matching can be switched on for a rule by using the “c” modifier. If subpattern matching is switched on then subpattern matches are reported alongside a full match, e.g. /A(B|C)DEFG/c will match ACDEFG and also return a subpattern match of C.
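The idea of reporting a subpattern match alongside the full match corresponds to reading a capture group in a general-purpose engine. The sketch below uses Python's standard `re` module and the subject ACDEFG; how the RXP actually reports subpattern matches is not shown here.

```python
import re

match = re.fullmatch(r"A(B|C)DEFG", "ACDEFG")

full_match = match.group(0)   # the whole matched text
subpattern = match.group(1)   # the text matched inside the parentheses

print(full_match, subpattern)  # ACDEFG C
```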

The RXP Compiler is capable of supporting user-defined character classes. There are only certain metacharacters that are recognized within a user defined character class. The following table lists each of these along with their usage restrictions.

Table 30. Supported Character Class Metacharacters

Metacharacter  Description
[              The opening square bracket will begin the user-defined character class. It is recognized within the class as a metacharacter only when it is the beginning of a POSIX class (see section POSIX Character Classes).
]              The closing square bracket will terminate the user-defined character class. To use a closing square bracket as a literal member of a class it must either be escaped by a backslash ‘\’, or occur directly after the opening square bracket. The closing bracket symbol cannot be used as the end character of a range.
^              The caret symbol can be used to negate the character class. This means that the subject must not match one of the class’s members to be successful. To use the caret symbol as a literal member of the class it must either be escaped by a backslash, or occur anywhere except directly after the opening square bracket.
\              The backslash symbol is used to remove the special meaning from characters and allows them to be treated as literals. The backslash can also be used within a character class to represent RegEx literal items such as hexadecimal notation, octal notation and non-printing characters.
\Q…\E          Multiple characters can have their special meaning removed by surrounding them with \Q…\E (see section Quoting).
-              The minus symbol is used to specify ranges of characters. To use the minus symbol as a literal member of the class it must be escaped by a backslash or positioned in a place where it cannot be interpreted as indicating a range. The range of characters must be in ascending order e.g. [a-z] is valid whereas [z-a] is not.

There are standard methods of defining character classes using the metacharacters as discussed in the previous table. The combination of these metacharacters into the class notation is shown and discussed in the following table.

Table 31. Supported Character Class Notation

Class Notation   Description
[…]              Character class matching one of the characters contained within the square brackets.
[^…]             Negated character class matching any one character that is not contained within the square brackets.
[x-y]            Character class matching one of the characters in the range x to y.
[^x-y]           Negated character class matching any one character that is not in the range x to y.
[[:xxx:]]        Match any one character contained in the POSIX set xxx (see section POSIX Character Classes).
[[:^xxx:]]       Match any one character not contained in the POSIX set xxx (see section POSIX Character Classes).
[a-z-[aeiou]]    Character class subtraction is only available in XML schema mode and allows the matching of a character which is present in one list but not present in the subtracted list. The subtracted list must always be the last element in its containing character class e.g. [a-z1-4-[aeiou]] is valid but [a-z-[aeiou]1-4] is not. The subtraction will be applied to the entire class. The example shown in the class notation column will match any lowercase consonant i.e. by removing the vowels. Nested character class subtraction is also supported. E.g. [0-9-[0-6-[0-3]]] first subtracts 0-3 from 0-6, yielding [0-9-[4-6]], or [0-37-9], which matches any character in the string 0123789.
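Character class subtraction can be worked through with ordinary set arithmetic. The sketch below models the two subtraction examples from the table using Python sets, and then emulates the consonant class with a negative lookahead, since Python's standard `re` module has no subtraction syntax; `char_range` is a hypothetical helper.

```python
import re

def char_range(first: str, last: str) -> set:
    """Return the set of characters from first to last, inclusive."""
    return {chr(code) for code in range(ord(first), ord(last) + 1)}

# [a-z-[aeiou]]: lowercase letters minus the vowels.
consonants = char_range("a", "z") - set("aeiou")

# [0-9-[0-6-[0-3]]]: the inner subtraction is evaluated first.
inner = char_range("0", "6") - char_range("0", "3")   # 4, 5, 6
nested = "".join(sorted(char_range("0", "9") - inner))
print(nested)  # 0123789

# Emulation in an engine without subtraction: reject vowels, then accept a-z.
consonant_rule = re.compile(r"(?![aeiou])[a-z]")
print(consonant_rule.fullmatch("b") is not None)
print(consonant_rule.fullmatch("e") is not None)
```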

The character class supports ranges of numerically specified characters. For example, the character class [\x61-\x7A] in hexadecimal notation, or its equivalent in octal format [\141-\172], is equivalent to the character class [a-z]. It is also valid to use predefined classes within user-defined character classes. For example, the character class [\dA-Za-z] will match any alphanumeric character and is equivalent to the \w predefined class without the underscore.

The character class supports all of the non-printing characters. It also supports a special meaning for the \b escape sequence which usually means word boundary. This is used to represent the backspace character (\x08) in a character class.

The word boundary metacharacter \b matches at a position where the character on one side is a word character and the character on the other side is not, i.e. the position sits across a word boundary. This is similar to the following RegEx:

(\w\W|\W\w)





Note that the word boundary also applies to start and end of a data stream i.e. anchors.

The negated version of the word boundary can be specified as \B.
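The word boundary rules can be tried out with Python's standard `re` module, whose \b and \B follow the same definition for ASCII text. This is a sketch for illustration; the subjects are invented and the RXP is not involved.

```python
import re

boundary = re.compile(r"\bABCD\b")
subjects = ["ABCD", "x ABCD y", "xABCDy", "ABCD_"]

# The start and end of the data count as boundaries; "_" is a word character.
boundary_hits = {s: boundary.search(s) is not None for s in subjects}
print(boundary_hits)

# \B succeeds only where there is no word boundary.
non_boundary_hit = re.search(r"\BABCD\B", "xABCDy") is not None
print(non_boundary_hit)  # True
```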

