Built-in rules
Besides ANY, which matches any single Unicode character, pest provides several rules to make parsing text more convenient.
ASCII rules
Among the printable ASCII characters, it is often useful to match alphabetic
characters and numbers. For numbers, pest provides digits in common radixes (bases):
Built-in rule | Equivalent |
---|---|
ASCII_DIGIT | '0'..'9' |
ASCII_NONZERO_DIGIT | '1'..'9' |
ASCII_BIN_DIGIT | '0'..'1' |
ASCII_OCT_DIGIT | '0'..'7' |
ASCII_HEX_DIGIT | '0'..'9' \| 'a'..'f' \| 'A'..'F' |
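As a rough illustration of how the digit rules combine, the sketch below describes integer literals in four radixes. The rule names (hex, octal, binary, decimal, integer) and the "0x", "0o" and "0b" prefixes are assumptions made for this example; only the ASCII_* rules are built in.

```pest
// Hypothetical rule names; only the ASCII_* rules are provided by pest.
// Atomic rules (@) keep implicit whitespace out of the middle of a literal.
hex     = @{ "0x" ~ ASCII_HEX_DIGIT+ }
octal   = @{ "0o" ~ ASCII_OCT_DIGIT+ }
binary  = @{ "0b" ~ ASCII_BIN_DIGIT+ }

// A decimal literal is either a lone zero or starts with a nonzero digit.
decimal = @{ "0" | (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*) }

// Choice is ordered: the prefixed forms are tried before decimal, which
// would otherwise match the leading "0" of "0x1F" and stop there.
integer = { hex | octal | binary | decimal }
```

Using ASCII_NONZERO_DIGIT for the first decimal digit is one way to reject literals with leading zeros such as "007".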
For alphabetic characters, distinguishing between uppercase and lowercase:
Built-in rule | Equivalent |
---|---|
ASCII_ALPHA_LOWER | 'a'..'z' |
ASCII_ALPHA_UPPER | 'A'..'Z' |
ASCII_ALPHA | 'a'..'z' \| 'A'..'Z' |
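The case-specific rules are handy when a grammar assigns meaning to capitalization. The following sketch assumes a made-up naming convention in which constants start with an uppercase letter and variables with a lowercase one; the rule names are hypothetical.

```pest
// Hypothetical naming convention; only the ASCII_* rules are built in.
constant = @{ ASCII_ALPHA_UPPER ~ (ASCII_ALPHA_UPPER | ASCII_DIGIT | "_")* }
variable = @{ ASCII_ALPHA_LOWER ~ (ASCII_ALPHA_LOWER | ASCII_DIGIT | "_")* }
name     = { constant | variable }
```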
And for miscellaneous use:
Built-in rule | Meaning | Equivalent |
---|---|---|
ASCII_ALPHANUMERIC | any digit or letter | ASCII_DIGIT \| ASCII_ALPHA |
NEWLINE | any line feed format | "\n" \| "\r\n" \| "\r" |
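To show ASCII_ALPHANUMERIC and NEWLINE working together, here is a minimal sketch of a line-oriented key=value format. The format and the rule names (key, value, pair, file) are invented for this example and are not part of pest.

```pest
// Hypothetical rules; ASCII_ALPHA, ASCII_ALPHANUMERIC, NEWLINE, ANY,
// SOI and EOI are built in.
key   = @{ ASCII_ALPHA ~ (ASCII_ALPHANUMERIC | "_")* }

// A value runs to the end of the line, whichever line ending is used.
value = @{ (!NEWLINE ~ ANY)* }

pair  = { key ~ "=" ~ value }

// Zero or more newline-terminated pairs; the last newline is optional.
file  = { SOI ~ (pair ~ NEWLINE)* ~ pair? ~ EOI }
```

Because NEWLINE covers "\n", "\r\n" and "\r", the same grammar should accept files written on different platforms without extra alternatives.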
Unicode rules
To make it easier to correctly parse arbitrary Unicode text, pest includes a large number of rules corresponding to Unicode character properties. These rules are divided into general category and binary property rules.
Unicode characters are partitioned into categories based on their general purpose. Every character belongs to a single category, in the same way that every ASCII character is a control character, a digit, a letter, a symbol, or a space.
In addition, every Unicode character has a list of binary properties (true or false) that it does or does not satisfy. Characters can belong to any number of these properties, depending on their meaning.
For example, the character "A", "Latin capital letter A", is in the general category "Uppercase Letter" because its general purpose is being a letter. It has the binary property "Uppercase" but not "Emoji". By contrast, the character "🅰", "negative squared Latin capital letter A", is in the general category "Other Symbol" because it does not generally occur as a letter in text. It has both the binary properties "Uppercase" and "Emoji".
For more details, consult Chapter 4 of The Unicode Standard.
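The difference between a general category and a binary property can be sketched with the two characters from the example above, "Latin capital letter A" and "negative squared Latin capital letter A" (U+1F170). The rule names are hypothetical; UPPERCASE_LETTER, UPPERCASE and EMOJI are the built-in rules.

```pest
// General category "Uppercase Letter": matches "A" but not U+1F170.
capital_letter  = { UPPERCASE_LETTER }

// Binary property "Uppercase": matches both "A" and U+1F170.
uppercase_char  = { UPPERCASE }

// Two binary properties at once: the positive lookahead checks EMOJI
// without consuming input, then UPPERCASE consumes the character.
uppercase_emoji = { &EMOJI ~ UPPERCASE }
```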
General categories
Formally, categories are non-overlapping: each Unicode character belongs to exactly one category, and no category contains another. However, since certain groups of categories are often useful together, pest exposes the hierarchy of categories below. For example, the rule CASED_LETTER is not technically a Unicode general category; it instead matches characters that are UPPERCASE_LETTER or LOWERCASE_LETTER, which are general categories.
LETTER
  CASED_LETTER
    UPPERCASE_LETTER
    LOWERCASE_LETTER
  TITLECASE_LETTER
  MODIFIER_LETTER
  OTHER_LETTER
MARK
  NONSPACING_MARK
  SPACING_MARK
  ENCLOSING_MARK
NUMBER
  DECIMAL_NUMBER
  LETTER_NUMBER
  OTHER_NUMBER
PUNCTUATION
  CONNECTOR_PUNCTUATION
  DASH_PUNCTUATION
  OPEN_PUNCTUATION
  CLOSE_PUNCTUATION
  INITIAL_PUNCTUATION
  FINAL_PUNCTUATION
  OTHER_PUNCTUATION
SYMBOL
  MATH_SYMBOL
  CURRENCY_SYMBOL
  MODIFIER_SYMBOL
  OTHER_SYMBOL
SEPARATOR
  SPACE_SEPARATOR
  LINE_SEPARATOR
  PARAGRAPH_SEPARATOR
OTHER
  CONTROL
  FORMAT
  SURROGATE
  PRIVATE_USE
  UNASSIGNED
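The grouping rules at the top of the hierarchy make coarse tokenizers short to write. The sketch below, with hypothetical rule names, splits a single line of Unicode text into words, numbers, and punctuation or symbols.

```pest
// Hypothetical rules; LETTER, MARK, DECIMAL_NUMBER, PUNCTUATION, SYMBOL
// and SEPARATOR are built in.

// A word starts with a letter and may carry combining marks (accents).
word   = @{ LETTER ~ (LETTER | MARK)* }
number = @{ DECIMAL_NUMBER+ }
token  = { word | number | PUNCTUATION | SYMBOL }

// Line feeds are control characters (OTHER), not separators, so this
// sketch only accepts a single line of text.
text   = { SOI ~ (token | SEPARATOR)* ~ EOI }
```

Under these assumptions an input such as "Hello, world 42!" would break into the words "Hello" and "world", the number "42", and two punctuation tokens.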
Binary properties
Many of these properties are used to define Unicode text algorithms, such as the bidirectional algorithm and the text segmentation algorithm. Such properties are not likely to be useful for most parsers.
However, the properties XID_START and XID_CONTINUE are particularly notable because they are defined "to assist in the standard treatment of identifiers", "such as programming language variables". See Technical Report 31 for more details.
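A minimal sketch of an identifier rule built on these two properties might look like the following. The rule name is hypothetical, and allowing a leading underscore is an extra assumption borrowed from common language conventions: "_" has the XID_CONTINUE property but not XID_START.

```pest
// Hypothetical rule; XID_START and XID_CONTINUE are built in.
// The "_" alternative is an added assumption, not part of XID_START.
identifier = @{ (XID_START | "_") ~ XID_CONTINUE* }
```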
ALPHABETIC
BIDI_CONTROL
BIDI_MIRRORED
CASE_IGNORABLE
CASED
CHANGES_WHEN_CASEFOLDED
CHANGES_WHEN_CASEMAPPED
CHANGES_WHEN_LOWERCASED
CHANGES_WHEN_TITLECASED
CHANGES_WHEN_UPPERCASED
DASH
DEFAULT_IGNORABLE_CODE_POINT
DEPRECATED
DIACRITIC
EMOJI
EMOJI_COMPONENT
EMOJI_MODIFIER
EMOJI_MODIFIER_BASE
EMOJI_PRESENTATION
EXTENDED_PICTOGRAPHIC
EXTENDER
GRAPHEME_BASE
GRAPHEME_EXTEND
GRAPHEME_LINK
HEX_DIGIT
HYPHEN
IDS_BINARY_OPERATOR
IDS_TRINARY_OPERATOR
ID_CONTINUE
ID_START
IDEOGRAPHIC
JOIN_CONTROL
LOGICAL_ORDER_EXCEPTION
LOWERCASE
MATH
NONCHARACTER_CODE_POINT
OTHER_ALPHABETIC
OTHER_DEFAULT_IGNORABLE_CODE_POINT
OTHER_GRAPHEME_EXTEND
OTHER_ID_CONTINUE
OTHER_ID_START
OTHER_LOWERCASE
OTHER_MATH
OTHER_UPPERCASE
PATTERN_SYNTAX
PATTERN_WHITE_SPACE
PREPENDED_CONCATENATION_MARK
QUOTATION_MARK
RADICAL
REGIONAL_INDICATOR
SENTENCE_TERMINAL
SOFT_DOTTED
TERMINAL_PUNCTUATION
UNIFIED_IDEOGRAPH
UPPERCASE
VARIATION_SELECTOR
WHITE_SPACE
XID_CONTINUE
XID_START
Script properties
pest also includes built-in rules for the Unicode Script property, which match characters belonging to a particular writing system.
For example:
We want to match a string that contains any CJK (regexp: \p{CJK}) characters, such as 你好世界 or こんにちは世界 or 안녕하세요 세계.
HAN: representing Chinese characters, including Simplified Chinese, Traditional Chinese, Japanese kanji, and Korean hanja.
HIRAGANA: representing the Japanese hiragana syllabary.
KATAKANA: representing the Japanese katakana syllabary.
HANGUL: representing Korean alphabetical characters.
BOPOMOFO: representing Chinese phonetic symbols.
So we define a rule named CJK like this:
CJK = { HAN | HIRAGANA | KATAKANA | HANGUL | BOPOMOFO }
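The CJK rule matches one character at a time. To accept whole strings like the samples above, it can be wrapped in further rules; the sketch below is one possible way to do that, with cjk_word and cjk_text as hypothetical names and a plain ASCII space assumed as the word separator.

```pest
// CJK is the rule defined above, repeated so the fragment is self-contained.
CJK = { HAN | HIRAGANA | KATAKANA | HANGUL | BOPOMOFO }

// Hypothetical rules built on CJK.
// A run of one or more CJK characters, treated as a single token.
cjk_word = @{ CJK+ }

// One or more words separated by spaces, spanning the whole input.
cjk_text = { SOI ~ cjk_word ~ (" "+ ~ cjk_word)* ~ EOI }
```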
All available rules:
ADLAM
AHOM
ANATOLIAN_HIEROGLYPHS
ARABIC
ARMENIAN
AVESTAN
BALINESE
BAMUM
BASSA_VAH
BATAK
BENGALI
BHAIKSUKI
BOPOMOFO
BRAHMI
BRAILLE
BUGINESE
BUHID
CANADIAN_ABORIGINAL
CARIAN
CAUCASIAN_ALBANIAN
CHAKMA
CHAM
CHEROKEE
CHORASMIAN
COMMON
COPTIC
CUNEIFORM
CYPRIOT
CYPRO_MINOAN
CYRILLIC
DESERET
DEVANAGARI
DIVES_AKURU
DOGRA
DUPLOYAN
EGYPTIAN_HIEROGLYPHS
ELBASAN
ELYMAIC
ETHIOPIC
GEORGIAN
GLAGOLITIC
GOTHIC
GRANTHA
GREEK
GUJARATI
GUNJALA_GONDI
GURMUKHI
HAN
HANGUL
HANIFI_ROHINGYA
HANUNOO
HATRAN
HEBREW
HIRAGANA
IMPERIAL_ARAMAIC
INHERITED
INSCRIPTIONAL_PAHLAVI
INSCRIPTIONAL_PARTHIAN
JAVANESE
KAITHI
KANNADA
KATAKANA
KAWI
KAYAH_LI
KHAROSHTHI
KHITAN_SMALL_SCRIPT
KHMER
KHOJKI
KHUDAWADI
LAO
LATIN
LEPCHA
LIMBU
LINEAR_A
LINEAR_B
LISU
LYCIAN
LYDIAN
MAHAJANI
MAKASAR
MALAYALAM
MANDAIC
MANICHAEAN
MARCHEN
MASARAM_GONDI
MEDEFAIDRIN
MEETEI_MAYEK
MENDE_KIKAKUI
MEROITIC_CURSIVE
MEROITIC_HIEROGLYPHS
MIAO
MODI
MONGOLIAN
MRO
MULTANI
MYANMAR
NABATAEAN
NAG_MUNDARI
NANDINAGARI
NEW_TAI_LUE
NEWA
NKO
NUSHU
NYIAKENG_PUACHUE_HMONG
OGHAM
OL_CHIKI
OLD_HUNGARIAN
OLD_ITALIC
OLD_NORTH_ARABIAN
OLD_PERMIC
OLD_PERSIAN
OLD_SOGDIAN
OLD_SOUTH_ARABIAN
OLD_TURKIC
OLD_UYGHUR
ORIYA
OSAGE
OSMANYA
PAHAWH_HMONG
PALMYRENE
PAU_CIN_HAU
PHAGS_PA
PHOENICIAN
PSALTER_PAHLAVI
REJANG
RUNIC
SAMARITAN
SAURASHTRA
SHARADA
SHAVIAN
SIDDHAM
SIGNWRITING
SINHALA
SOGDIAN
SORA_SOMPENG
SOYOMBO
SUNDANESE
SYLOTI_NAGRI
SYRIAC
TAGALOG
TAGBANWA
TAI_LE
TAI_THAM
TAI_VIET
TAKRI
TAMIL
TANGSA
TANGUT
TELUGU
THAANA
THAI
TIBETAN
TIFINAGH
TIRHUTA
TOTO
UGARITIC
VAI
VITHKUQI
WANCHO
WARANG_CITI
YEZIDI
YI
ZANABAZAR_SQUARE