Built-in rules

Besides ANY, which matches any single Unicode character, pest provides several built-in rules that make parsing text more convenient.

ASCII rules

Among the printable ASCII characters, it is often useful to match alphabetic characters and numbers. For numbers, pest provides digits in common radixes (bases):

Built-in rule         Equivalent
ASCII_DIGIT           '0'..'9'
ASCII_NONZERO_DIGIT   '1'..'9'
ASCII_BIN_DIGIT       '0'..'1'
ASCII_OCT_DIGIT       '0'..'7'
ASCII_HEX_DIGIT       '0'..'9' | 'a'..'f' | 'A'..'F'
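
As a rough illustration of how the digit rules combine, the sketch below defines integer literals in several radixes. The rule names (decimal, binary, octal, hex, integer) and the 0b/0o/0x prefixes are assumptions made for this example; only the ASCII_* rules are built into pest.

```pest
// Hypothetical grammar fragment: only the ASCII_* rules are pest built-ins.
// "@" makes a rule atomic, so no implicit whitespace is allowed between digits.

// A decimal integer without leading zeros: "0", "7", "42".
decimal = @{ "0" | (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*) }

binary = @{ "0b" ~ ASCII_BIN_DIGIT+ }
octal  = @{ "0o" ~ ASCII_OCT_DIGIT+ }
hex    = @{ "0x" ~ ASCII_HEX_DIGIT+ }

// Prefixed forms come first: choice is ordered, so trying decimal first
// would match only the leading "0" of an input such as "0x1F".
integer = { hex | octal | binary | decimal }
```

The ordering inside integer matters more than the digit rules themselves; swapping the alternatives is a common source of surprising partial matches.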

For alphabetic characters, distinguishing between uppercase and lowercase:

Built-in rule       Equivalent
ASCII_ALPHA_LOWER   'a'..'z'
ASCII_ALPHA_UPPER   'A'..'Z'
ASCII_ALPHA         'a'..'z' | 'A'..'Z'
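
One possible use of the case-specific rules is enforcing a naming convention. In the sketch below, type_name and var_name are hypothetical rule names chosen for illustration; only the ASCII_ALPHA* rules come from pest.

```pest
// Hypothetical rules; only ASCII_ALPHA, ASCII_ALPHA_LOWER and
// ASCII_ALPHA_UPPER are built in.

// A name that must start with an uppercase letter, e.g. "Point".
type_name = @{ ASCII_ALPHA_UPPER ~ ASCII_ALPHA* }

// A name that must start with a lowercase letter, e.g. "point".
var_name = @{ ASCII_ALPHA_LOWER ~ ASCII_ALPHA* }
```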

And for miscellaneous use:

Built-in rule        Meaning                Equivalent
ASCII_ALPHANUMERIC   any digit or letter    ASCII_DIGIT | ASCII_ALPHA
NEWLINE              any line feed format   "\n" | "\r\n" | "\r"
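
The following sketch shows these two rules in a small line-oriented grammar. The rule names key and keys are assumptions for this example. SOI and EOI are pest's built-in rules for the start and end of input, which are not otherwise covered in this section.

```pest
// Hypothetical rules; ASCII_ALPHA, ASCII_ALPHANUMERIC, NEWLINE, SOI and EOI
// are built in.

// A key made of letters, digits and underscores that does not start with a digit.
key = @{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHANUMERIC | "_")* }

// One key per line. NEWLINE accepts "\n", "\r\n" and "\r" alike, so the same
// grammar handles Unix, Windows and classic Mac OS line endings.
// The last line may omit its trailing newline.
keys = { SOI ~ (key ~ NEWLINE)* ~ key? ~ EOI }
```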

Unicode rules

To make it easier to correctly parse arbitrary Unicode text, pest includes a large number of rules corresponding to Unicode character properties. These rules are divided into general category, binary property, and script rules.

Unicode characters are partitioned into categories based on their general purpose. Every character belongs to a single category, in the same way that every ASCII character is a control character, a digit, a letter, a symbol, or a space.

In addition, every Unicode character has a list of binary properties (true or false) that it does or does not satisfy. Characters can belong to any number of these properties, depending on their meaning.

For example, the character "A", "Latin capital letter A", is in the general category "Uppercase Letter" because its general purpose is being a letter. It has the binary property "Uppercase" but not "Emoji". By contrast, the character "🅰", "negative squared Latin capital letter A", is in the general category "Other Symbol" because it does not generally occur as a letter in text. It has both the binary properties "Uppercase" and "Emoji".

For more details, consult Chapter 4 of The Unicode Standard.

General categories

Formally, categories are non-overlapping: each Unicode character belongs to exactly one category, and no category contains another. However, since certain groups of categories are often useful together, pest exposes the hierarchy of categories below. For example, the rule CASED_LETTER is not technically a Unicode general category; it instead matches characters that are UPPERCASE_LETTER or LOWERCASE_LETTER, which are general categories.

  • LETTER
    • CASED_LETTER
      • UPPERCASE_LETTER
      • LOWERCASE_LETTER
    • TITLECASE_LETTER
    • MODIFIER_LETTER
    • OTHER_LETTER
  • MARK
    • NONSPACING_MARK
    • SPACING_MARK
    • ENCLOSING_MARK
  • NUMBER
    • DECIMAL_NUMBER
    • LETTER_NUMBER
    • OTHER_NUMBER
  • PUNCTUATION
    • CONNECTOR_PUNCTUATION
    • DASH_PUNCTUATION
    • OPEN_PUNCTUATION
    • CLOSE_PUNCTUATION
    • INITIAL_PUNCTUATION
    • FINAL_PUNCTUATION
    • OTHER_PUNCTUATION
  • SYMBOL
    • MATH_SYMBOL
    • CURRENCY_SYMBOL
    • MODIFIER_SYMBOL
    • OTHER_SYMBOL
  • SEPARATOR
    • SPACE_SEPARATOR
    • LINE_SEPARATOR
    • PARAGRAPH_SEPARATOR
  • OTHER
    • CONTROL
    • FORMAT
    • SURROGATE
    • PRIVATE_USE
    • UNASSIGNED
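
As a minimal sketch of how the hierarchy can be used, the fragment below mixes parent rules (LETTER, MARK, PUNCTUATION) with a leaf category (DECIMAL_NUMBER). The rule names word, number, and punct are hypothetical.

```pest
// Hypothetical rules; only the category rules are built in.

// A word in any script: it starts with a letter, and marks are allowed
// afterwards so that combining accents stay attached to their base letter.
word = @{ LETTER ~ (LETTER | MARK)* }

// A run of decimal digits from any script, not only ASCII '0'..'9'.
number = @{ DECIMAL_NUMBER+ }

// The parent rule PUNCTUATION covers all seven punctuation subcategories
// listed in the hierarchy above.
punct = { PUNCTUATION }
```

Using a parent rule keeps the grammar short; a leaf category is the better choice when only one specific class of character is acceptable.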

Binary properties

Many of these properties are used to define Unicode text algorithms, such as the bidirectional algorithm and the text segmentation algorithm. Such properties are not likely to be useful for most parsers.

However, the properties XID_START and XID_CONTINUE are particularly notable because they are defined "to assist in the standard treatment of identifiers", "such as programming language variables". See Unicode Technical Report 31 (now published as Unicode Standard Annex #31) for more details.

  • ALPHABETIC
  • BIDI_CONTROL
  • BIDI_MIRRORED
  • CASE_IGNORABLE
  • CASED
  • CHANGES_WHEN_CASEFOLDED
  • CHANGES_WHEN_CASEMAPPED
  • CHANGES_WHEN_LOWERCASED
  • CHANGES_WHEN_TITLECASED
  • CHANGES_WHEN_UPPERCASED
  • DASH
  • DEFAULT_IGNORABLE_CODE_POINT
  • DEPRECATED
  • DIACRITIC
  • EMOJI
  • EMOJI_COMPONENT
  • EMOJI_MODIFIER
  • EMOJI_MODIFIER_BASE
  • EMOJI_PRESENTATION
  • EXTENDED_PICTOGRAPHIC
  • EXTENDER
  • GRAPHEME_BASE
  • GRAPHEME_EXTEND
  • GRAPHEME_LINK
  • HEX_DIGIT
  • HYPHEN
  • IDS_BINARY_OPERATOR
  • IDS_TRINARY_OPERATOR
  • ID_CONTINUE
  • ID_START
  • IDEOGRAPHIC
  • JOIN_CONTROL
  • LOGICAL_ORDER_EXCEPTION
  • LOWERCASE
  • MATH
  • NONCHARACTER_CODE_POINT
  • OTHER_ALPHABETIC
  • OTHER_DEFAULT_IGNORABLE_CODE_POINT
  • OTHER_GRAPHEME_EXTEND
  • OTHER_ID_CONTINUE
  • OTHER_ID_START
  • OTHER_LOWERCASE
  • OTHER_MATH
  • OTHER_UPPERCASE
  • PATTERN_SYNTAX
  • PATTERN_WHITE_SPACE
  • PREPENDED_CONCATENATION_MARK
  • QUOTATION_MARK
  • RADICAL
  • REGIONAL_INDICATOR
  • SENTENCE_TERMINAL
  • SOFT_DOTTED
  • TERMINAL_PUNCTUATION
  • UNIFIED_IDEOGRAPH
  • UPPERCASE
  • VARIATION_SELECTOR
  • WHITE_SPACE
  • XID_CONTINUE
  • XID_START
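
A typical use of XID_START and XID_CONTINUE is an identifier rule. The sketch below is one possible definition. The rule name identifier and the decision to allow a leading underscore are assumptions for this example, not something pest prescribes.

```pest
// Hypothetical rule; XID_START and XID_CONTINUE are the built-in properties.
// XID_CONTINUE already includes "_", but XID_START does not, so a leading
// underscore has to be allowed explicitly.
identifier = @{ (XID_START | "_") ~ XID_CONTINUE* }
```

With this definition, names such as café, 変数, or _tmp1 are accepted, while a name starting with a digit is not.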

Script properties

pest also includes built-in rules for the Unicode Script property. Each rule matches characters that belong to a particular writing system (script), which is often used by one or more languages.

For example, suppose we want to match a string that contains CJK characters (regexp: \p{CJK}), such as 你好世界, こんにちは世界, or 안녕하세요 세계. The relevant script rules are:

  • HAN: representing Chinese characters, including Simplified Chinese, Traditional Chinese, Japanese kanji, and Korean hanja.
  • HIRAGANA: representing the Japanese hiragana syllabary.
  • KATAKANA: representing the Japanese katakana syllabary.
  • HANGUL: representing Korean alphabetical characters.
  • BOPOMOFO: representing Chinese phonetic symbols.

So we define a rule named CJK like this:

CJK = { HAN | HIRAGANA | KATAKANA | HANGUL | BOPOMOFO }
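
To show how such a rule might be used, the sketch below builds a run of CJK characters on top of it. The rule name cjk_text is hypothetical.

```pest
// Repeated from above so that the fragment is complete on its own.
CJK = { HAN | HIRAGANA | KATAKANA | HANGUL | BOPOMOFO }

// Hypothetical rule: one or more consecutive CJK characters.
// Spaces and most punctuation belong to the COMMON script, so an input such
// as "안녕하세요 세계" is matched as two separate runs, not as a single one.
cjk_text = @{ CJK+ }
```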

All available rules:

  • ADLAM
  • AHOM
  • ANATOLIAN_HIEROGLYPHS
  • ARABIC
  • ARMENIAN
  • AVESTAN
  • BALINESE
  • BAMUM
  • BASSA_VAH
  • BATAK
  • BENGALI
  • BHAIKSUKI
  • BOPOMOFO
  • BRAHMI
  • BRAILLE
  • BUGINESE
  • BUHID
  • CANADIAN_ABORIGINAL
  • CARIAN
  • CAUCASIAN_ALBANIAN
  • CHAKMA
  • CHAM
  • CHEROKEE
  • CHORASMIAN
  • COMMON
  • COPTIC
  • CUNEIFORM
  • CYPRIOT
  • CYPRO_MINOAN
  • CYRILLIC
  • DESERET
  • DEVANAGARI
  • DIVES_AKURU
  • DOGRA
  • DUPLOYAN
  • EGYPTIAN_HIEROGLYPHS
  • ELBASAN
  • ELYMAIC
  • ETHIOPIC
  • GEORGIAN
  • GLAGOLITIC
  • GOTHIC
  • GRANTHA
  • GREEK
  • GUJARATI
  • GUNJALA_GONDI
  • GURMUKHI
  • HAN
  • HANGUL
  • HANIFI_ROHINGYA
  • HANUNOO
  • HATRAN
  • HEBREW
  • HIRAGANA
  • IMPERIAL_ARAMAIC
  • INHERITED
  • INSCRIPTIONAL_PAHLAVI
  • INSCRIPTIONAL_PARTHIAN
  • JAVANESE
  • KAITHI
  • KANNADA
  • KATAKANA
  • KAWI
  • KAYAH_LI
  • KHAROSHTHI
  • KHITAN_SMALL_SCRIPT
  • KHMER
  • KHOJKI
  • KHUDAWADI
  • LAO
  • LATIN
  • LEPCHA
  • LIMBU
  • LINEAR_A
  • LINEAR_B
  • LISU
  • LYCIAN
  • LYDIAN
  • MAHAJANI
  • MAKASAR
  • MALAYALAM
  • MANDAIC
  • MANICHAEAN
  • MARCHEN
  • MASARAM_GONDI
  • MEDEFAIDRIN
  • MEETEI_MAYEK
  • MENDE_KIKAKUI
  • MEROITIC_CURSIVE
  • MEROITIC_HIEROGLYPHS
  • MIAO
  • MODI
  • MONGOLIAN
  • MRO
  • MULTANI
  • MYANMAR
  • NABATAEAN
  • NAG_MUNDARI
  • NANDINAGARI
  • NEW_TAI_LUE
  • NEWA
  • NKO
  • NUSHU
  • NYIAKENG_PUACHUE_HMONG
  • OGHAM
  • OL_CHIKI
  • OLD_HUNGARIAN
  • OLD_ITALIC
  • OLD_NORTH_ARABIAN
  • OLD_PERMIC
  • OLD_PERSIAN
  • OLD_SOGDIAN
  • OLD_SOUTH_ARABIAN
  • OLD_TURKIC
  • OLD_UYGHUR
  • ORIYA
  • OSAGE
  • OSMANYA
  • PAHAWH_HMONG
  • PALMYRENE
  • PAU_CIN_HAU
  • PHAGS_PA
  • PHOENICIAN
  • PSALTER_PAHLAVI
  • REJANG
  • RUNIC
  • SAMARITAN
  • SAURASHTRA
  • SHARADA
  • SHAVIAN
  • SIDDHAM
  • SIGNWRITING
  • SINHALA
  • SOGDIAN
  • SORA_SOMPENG
  • SOYOMBO
  • SUNDANESE
  • SYLOTI_NAGRI
  • SYRIAC
  • TAGALOG
  • TAGBANWA
  • TAI_LE
  • TAI_THAM
  • TAI_VIET
  • TAKRI
  • TAMIL
  • TANGSA
  • TANGUT
  • TELUGU
  • THAANA
  • THAI
  • TIBETAN
  • TIFINAGH
  • TIRHUTA
  • TOTO
  • UGARITIC
  • VAI
  • VITHKUQI
  • WANCHO
  • WARANG_CITI
  • YEZIDI
  • YI
  • ZANABAZAR_SQUARE