Unicode character class escape: \p{...}, \P{...}

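Baseline Widely available

This feature is well established and works across many devices and browser versions. It has been available across browsers since July 2015.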

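A **Unicode character class escape** is a kind of character class escape that matches a set of characters specified by a Unicode property. It is only supported in Unicode-aware mode. When the v flag is enabled, it can also be used to match finite-length strings.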

Syntax

\p{loneProperty}
\P{loneProperty}

\p{property=value}
\P{property=value}

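Parameters

loneProperty: A lone Unicode property name or value, following the same syntax as value. It specifies either a value of the General_Category property or a binary property name. In v mode, it can also be a binary Unicode property of strings.

Note: ICU syntax also allows the Script property name to be omitted, but JavaScript does not support this, because most of the time Script_Extensions is more useful than Script.
property: A Unicode property name. It must consist of ASCII letters (A–Z, a–z) and underscores (_), and must be one of the non-binary property names.
value: A Unicode property value. It must consist of ASCII letters (A–Z, a–z), underscores (_), and digits (0–9), and must be one of the supported values listed in PropertyValueAliases.txt.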
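As a rough illustration of these rules, the sketch below probes which spellings an engine accepts. The helper name `isValidPropertyEscape` is hypothetical and not part of any API; it only relies on the `RegExp` constructor throwing a `SyntaxError` for unsupported property names or values.

```javascript
// Hypothetical helper: returns true if the pattern source compiles in Unicode-aware mode.
function isValidPropertyEscape(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
}

const checks = {
  // Lone General_Category value
  loneGeneralCategory: isValidPropertyEscape("\\p{L}"),
  // Lone binary property name
  loneBinaryProperty: isValidPropertyEscape("\\p{Alphabetic}"),
  // Non-binary property name with a value
  propertyAndValue: isValidPropertyEscape("\\p{Script=Latin}"),
  // Script value without the property name: allowed by ICU, rejected by JavaScript
  loneScriptValue: isValidPropertyEscape("\\p{Latin}"),
};

console.log(checks);
```

Only the last entry is expected to be rejected, since a lone name must be a General_Category value or a binary property.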

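Description

\p and \P are only supported in Unicode-aware mode (that is, with the u or v flag). In Unicode-unaware mode, they are identity escapes for the p or P character.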
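The difference between the two modes can be seen with a small sketch. The variable names are illustrative only.

```javascript
// Without the u or v flag, \p is just an escaped "p", and "{L}" is literal text.
const unawarePattern = /\p{L}/;
// With the u flag, \p{L} is a Unicode character class escape for letters.
const awarePattern = /\p{L}/u;

const unawareResults = [unawarePattern.test("p{L}"), unawarePattern.test("a")];
const awareResults = [awarePattern.test("a"), awarePattern.test("1")];

console.log(unawareResults); // [true, false]
console.log(awareResults); // [true, false]
```

Forgetting the flag therefore does not throw; the pattern silently matches the literal text `p{L}` instead.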

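Every Unicode character has a set of properties that describe it. For example, the character a has the General_Category property with the value Lowercase_Letter, and the Script property with the value Latn. The \p and \P escape sequences let you match a character based on its properties. For example, a can be matched by \p{Lowercase_Letter} (the General_Category property name is optional) as well as by \p{Script=Latn}. \P creates a *complement class* that consists of the code points without the specified property.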
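A minimal sketch of these three escapes, applied to a few sample characters chosen for illustration:

```javascript
// Anchored so that each pattern tests exactly one character.
const isLowercaseLetter = /^\p{Lowercase_Letter}$/u; // same as \p{General_Category=Lowercase_Letter}
const isLatin = /^\p{Script=Latn}$/u;
const isNotLowercaseLetter = /^\P{Lowercase_Letter}$/u; // complement class

const propertyTable = {};
for (const char of ["a", "A", "ε", "1"]) {
  propertyTable[char] = {
    lowercaseLetter: isLowercaseLetter.test(char),
    latin: isLatin.test(char),
    notLowercaseLetter: isNotLowercaseLetter.test(char),
  };
}

console.table(propertyTable);
```

For any single character, the \p and \P results for the same property are always opposite.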

To combine multiple properties, use the character set intersection syntax enabled by the v flag, or see pattern subtraction and intersection.

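In v mode, \p may match a sequence of code points, possibly more than one character long, which Unicode defines as "properties of strings". This is most useful for emojis, which are often composed of multiple code points. However, \P can only complement character properties.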

Note: There are plans to port the properties of strings feature to u mode as well.

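Examples

General categories

General categories are used to classify Unicode characters, and subcategories are available to define a more precise categorization. Either the short or the long form can be used in Unicode property escapes.

They can be used to match letters, numbers, symbols, punctuation, spaces, and so on. For a more exhaustive list of general categories, please refer to the Unicode specification.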

// finding all the letters of a text
const story = "It's the Cheshire Cat: now I shall have somebody to talk to.";

// Most explicit form
story.match(/\p{General_Category=Letter}/gu);

// It is not mandatory to use the property name for General categories
story.match(/\p{Letter}/gu);

// This is equivalent (short alias):
story.match(/\p{L}/gu);

// This is also equivalent (alternation over all the subcategories using short aliases)
story.match(/\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}/gu);
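The same approach works for categories other than letters. The sketch below uses a made-up sample string and the short aliases Nd (Decimal_Number), P (Punctuation), and Zs (Space_Separator).

```javascript
// \u0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
const sample = "Room 42, floor \u0663!";

const decimalDigits = sample.match(/\p{Nd}/gu);
const punctuation = sample.match(/\p{P}/gu);
const spaces = sample.match(/\p{Zs}/gu);

console.log(decimalDigits); // ["4", "2", "٣"]
console.log(punctuation); // [",", "!"]
console.log(spaces.length); // 3
```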

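Scripts and script extensions

Some languages use different scripts for their writing system. For instance, English and Spanish are written using the Latin script, while Arabic and Russian are written with other scripts (Arabic and Cyrillic, respectively). The Script and Script_Extensions Unicode properties allow regular expressions to match characters according to the script they are mainly used with (Script) or according to the set of scripts they belong to (Script_Extensions).

For example, A belongs to the Latin script and ε to the Greek script.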

const mixedCharacters = "aεЛ";

// Using the canonical "long" name of the script
mixedCharacters.match(/\p{Script=Latin}/u); // a

// Using a short alias (ISO 15924 code) for the script
mixedCharacters.match(/\p{Script=Grek}/u); // ε

// Using the short name sc for the Script property
mixedCharacters.match(/\p{sc=Cyrillic}/u); // Л

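For more details, refer to the Unicode specification, the scripts table in the ECMAScript specification, and the ISO 15924 list of script codes.

If a character is used in a limited set of scripts, the Script property only matches the "predominant" script. To match characters based on a "non-predominant" script, use the Script_Extensions property (Scx for short).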

// ٢ is the digit 2 in Arabic-Indic notation
// while it is predominantly written within the Arabic script
// it can also be written in the Thaana script

"٢".match(/\p{Script=Thaana}/u);
// null as Thaana is not the predominant script

"٢".match(/\p{Script_Extensions=Thaana}/u);
// ["٢", index: 0, input: "٢", groups: undefined]

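Unicode property escapes vs. character classes

With JavaScript regular expressions, it is also possible to use character classes, especially \w or \d, to match letters or digits. However, such forms only match ASCII characters (in other words, a to z, A to Z, 0 to 9, and _ for \w, and 0 to 9 for \d). As shown in this example, working with non-Latin text this way can be a bit clumsy.

Unicode property escape categories cover many more characters, and \p{Letter} or \p{Number} will work for any script.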

// Trying to use ranges to avoid \w limitations:

const nonEnglishText = "Приключения Алисы в Стране чудес";
const regexpBMPWord = /([\u0000-\u001F\u0021-\uFFFF])+/gu;
// BMP goes through U+0000 to U+FFFF but space is U+0020, so it is left out

console.table(nonEnglishText.match(regexpBMPWord));

// Using Unicode property escapes instead
const regexpUPE = /\p{L}+/gu;
console.table(nonEnglishText.match(regexpUPE));
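To make the limitation of \w concrete, the sketch below compares it with a simplified Unicode-aware approximation built from property escapes. The character class `[\p{L}\p{N}_]` is an assumption chosen for illustration, not an exact equivalent of any standard definition of a "word character".

```javascript
const mixedText = "Алиса in Wonderland 2";

// \w only knows ASCII letters, digits, and underscore, so the Cyrillic word is skipped.
const asciiWords = mixedText.match(/\w+/g);

// Simplified approximation: any letter, any number, or underscore, in any script.
const unicodeWords = mixedText.match(/[\p{L}\p{N}_]+/gu);

console.log(asciiWords); // ["in", "Wonderland", "2"]
console.log(unicodeWords); // ["Алиса", "in", "Wonderland", "2"]
```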

Matching prices

The following example matches prices in a string:

function getPrices(str) {
  // Sc stands for "currency symbol"
  return [...str.matchAll(/\p{Sc}\s*[\d.,]+/gu)].map((match) => match[0]);
}

const str = `California rolls $6.99
Crunchy rolls $8.49
Shrimp tempura $10.99`;
console.log(getPrices(str)); // ["$6.99", "$8.49", "$10.99"]

const str2 = `US store $19.99
Europe store €18.99
Japan store ¥2000`;
console.log(getPrices(str2)); // ["$19.99", "€18.99", "¥2000"]

Matching strings

With the v flag, \p{…} can match strings that are potentially longer than one character by using a property of strings:

const flag = "🇺🇳";
console.log(flag.length); // 4 (two code points, each encoded as a surrogate pair)
console.log(/\p{RGI_Emoji_Flag_Sequence}/v.exec(flag)); // [ '🇺🇳' ]

However, you can't use \P to match "a string that does not have a property", because it is unclear how many characters should be consumed.

/\P{RGI_Emoji_Flag_Sequence}/v; // SyntaxError: Invalid regular expression: Invalid property name

Specifications

Specification
ECMAScript Language Specification # prod-CharacterClassEscape

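See also

Character classes guide
Regular expressions
Character class: [...], [^...]
Character class escape: \d, \D, \w, \W, \s, \S
Character escape: \n, \u{...}
Disjunction: |
Unicode character property on Wikipedia
ES2018: RegExp Unicode property escapes by Dr. Axel Rauschmayer (2017)
Unicode regular expressions § Properties
Unicode Utilities: UnicodeSet
RegExp v flag with set notation and properties of strings on v8.dev (2022)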