aboutsummaryrefslogtreecommitdiff
path: root/files/zh-cn/web/javascript/guide/regular_expressions/unicode_property_escapes/index.html
blob: 64254c284c5900ec5c35827ed263d3c6aa3910f8 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
title: Unicode property escapes
slug: Web/JavaScript/Guide/Regular_Expressions/Unicode_Property_Escapes
translation_of: Web/JavaScript/Guide/Regular_Expressions/Unicode_Property_Escapes
---
<p>{{jsSidebar("JavaScript Guide")}}</p>

<p><strong>Unicode property escapes</strong> <a href="https://developer.mozilla.org/zh-CN/docs/Web/JavaScript/Guide/Regular_Expressions">正则表达式</a> 支持根据 Unicode 属性进行匹配,例如我们可以用它来匹配出表情、标点符号、字母(甚至适用特定语言或文字)等。同一符号可以拥有多种 Unicode 属性,属性则有 binary ("boolean-like") 和 non-binary 之分。</p>

<div>{{EmbedInteractiveExample("pages/js/regexp-unicode-property-escapes.html", "taller")}}</div>

<div class="note">
<p><strong>备注:</strong>使用 Unicode 属性转义依靠 <a href="/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode"><code>\u</code> 标识</a><code>\u</code> 表示该字符串被视为一串 Unicode 代码点。参考 <code><a href="/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode">RegExp.prototype.nicode</a></code>.</p>
</div>

<div class="note">
<p><strong>备注:</strong> 某些 Unicode 属性比<a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes">字符类</a>(如 <code>\w</code> 只匹配拉丁字母 <code>a</code> 到 <code>z</code>)包含更多的字符  ,但后者浏览器兼容性更好 (截至2020 一月).</p>
</div>

<h2 id="句法">句法</h2>

<pre class="brush: js">// Non-binary 属性
\p{<em>Unicode属性值</em>}
\p{<em>Unicode属性名</em>=<em>Unicode属性值</em>}

// Binary and non-binary 属性
\p{<em>UnicodeBinary属性名</em>}

// \P 为 \p 取反
\P{<em>Unicode属性值</em>}
\P{<em>UnicodeBinary属性名</em>}
</pre>

<ul>
 <li><a href="https://unicode.org/reports/tr18/#General_Category_Property">General_Category</a> (<code>gc</code>)</li>
 <li><a href="https://unicode.org/reports/tr24/#Script">Script</a> (<code>sc</code>)</li>
 <li><a href="https://unicode.org/reports/tr24/#Script_Extensions">Script_Extensions</a> (<code>scx</code>)</li>
</ul>

<p>参考 <a href="https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt">PropertyValueAliases.txt </a></p>

<dl>
 <dt>UnicodeBinary属性名</dt>
 <dd><a href="https://tc39.es/ecma262/#table-binary-unicode-properties">Binary 属性</a>名. E.g.: <code><a href="https://unicode.org/reports/tr18/#General_Category_Property">ASCII</a></code><code><a href="https://unicode.org/reports/tr44/#Alphabetic">Alpha</a></code>, <code>Math</code>, <code><a href="https://unicode.org/reports/tr44/#Diacritic">Diacritic</a></code><code><a href="https://unicode.org/reports/tr51/#Emoji_Properties">Emoji</a></code><code><a href="https://unicode.org/reports/tr44/#Hex_Digit">Hex_Digit</a></code><code>Math</code>, <code><a href="https://unicode.org/reports/tr44/#White_Space">White_space</a></code>, 等. 另见 <a href="https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt">Unicode Data PropList.txt</a>.</dd>
 <dt>Unicode属性名</dt>
 <dd><a href="https://tc39.es/ecma262/#table-nonbinary-unicode-properties">Non-binary</a> 属性名:</dd>
 <dt>Unicode属性值</dt>
 <dd>很多值有同名或简写(e.g. 对应着 <code>General_Category</code> 属性名的属性值 <code>Decimal_Number</code> 可以写作 <code>Nd</code>, <code>digit</code>, 或 <code>Decimal_Number</code>). 大多数属性值的 <em><code>Unicode属性名</code></em> 和等号可以省去。如果想明确某 <em><code>Unicode属性名</code></em>,必须给出它的值。</dd>
</dl>

<div class="note">
<p><strong>备注:</strong> 因为可使用的属性和值太多,这里不一一赘述,仅提供几个例子。</p>
</div>

<h2 id="基本原理">基本原理</h2>

<p>在 ES2018 之前,JavaScript 没有强有效的方式用匹配出不同<code>文字</code>(如马其顿语,希腊语,Georgian 等)或不同 <code>属性名</code> (如 Emoji 等)的字符。另见 <a href="https://github.com/tc39/proposal-regexp-unicode-property-escapes">tc39 Proposal on Unicode Property Escapes</a>.</p>

<h2 id="例子">例子</h2>

<h3 id="(一般类别)General_categories">(一般类别)General categories</h3>

<p>General categories 对 Unicode 字符进行分类,子类别用于精确定义类别。长名和简写的 Unicode 属性转义都可用.</p>

<p>它们可匹配字母、数字、符号、标点符号、空格等等。一般类别详见 <a href="https://unicode.org/reports/tr18/#General_Category_Property">the Unicode specification</a>.</p>

<pre class="brush: js">// finding all the letters of a text
let story = "It’s the Cheshire Cat: now I shall have somebody to talk to.";

// Most explicit form
story.match(/\p{General_Category=Letter}/gu);

// It is not mandatory to use the property name for General categories
story.match(/\p{Letter}/gu);

// This is equivalent (short alias):
story.match(/\p{L}/gu);

// This is also equivalent (conjunction of all the subcategories using short aliases)
story.match(/\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}/gu);
</pre>

<h3 id="文字(Script)和文字扩充(Script_Extensions)">文字(Script)和文字扩充(Script_Extensions)</h3>

<p>某些语言使用不同的文字,如英语和西班牙语使用拉丁文,而阿拉伯语和俄语用阿拉伯文和俄文。<code>Script</code> 和 <code>Script_Extensions</code> Unicode 属性允许正则表达式根据字符所属的<code>文字</code>或该文字所属的<code>文字扩充</code>进行匹配。</p>

<p>比如,<code>A</code> 属于 <code>拉丁文</code><code>ε</code> 属于<code>希腊(Greek)</code>文。</p>

<pre class="brush: js">let mixedCharacters = "aεЛ";

// Using the canonical "long" name of the script
mixedCharacters.match(/\p{Script=Latin}/u); // a

// Using a short alias for the script
mixedCharacters.match(/\p{Script=Grek}/u); // ε

// Using the short name Sc for the Script property
mixedCharacters.match(/\p{Sc=Cyrillic}/u); // Л
</pre>

<p>详见 <a href="https://unicode.org/reports/tr24/#Script">the Unicode specification</a><a href="https://tc39.es/ecma262/#table-unicode-script-values">Scripts table in the ECMAScript specification</a>.</p>

<p>某字符用于多种文字时,<code>Script</code> 优先匹配最主要使用那个字符的文字。如果想要根据非主要的文字进行匹配,我们可以使用 <code>Script_Extensions</code> 属性 (简写为<code>Scx</code>).</p>

<pre class="brush: js">// ٢ is the digit 2 in Arabic-Indic notation
// while it is predominantly written within the Arabic script
// it can also be written in the Thaana script

"٢".match(/\p{Script=Thaana}/u);
// null as Thaana is not the predominant script        super()

"٢".match(/\p{Script_Extensions=Thaana}/u);
// ["٢", index: 0, input: "٢", groups: undefined]
</pre>

<h3 id="Unicode_属性转义_vs._字符类">Unicode 属性转义 vs. 字符类</h3>

<p>JavaScript 正则表达式可以使用 <a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes">字符类</a> 尤其是 <code>\w</code> 或 <code>\d</code> 匹配字母或数字,然而,这样的形式只匹配拉丁文字的字符 (换言之,<code>a</code> 到 <code>z</code>、 <code>A</code> 到 <code>Z</code><code>\w</code><code>0</code> 到 <code>9</code> 的 <code>\d</code>),见<a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Looking_for_a_word_from_Unicode_characters">例子</a>,这样的使用放到非拉丁文本中是有些蠢的。</p>

<p>Unicode 属性转义 categories 包含更多字符, <code>\p{Letter}</code> 或 <code>\p{Number}</code> 将会适用于任何文字。</p>

<pre class="brush: js">// Trying to use ranges to avoid \w limitations:

const nonEnglishText = "Приключения Алисы в Стране чудес";
const regexpBMPWord = /([\u0000-\u0019\u0021-\uFFFF])+/gu;
// BMP goes through U+0000 to U+FFFF but space is U+0020

console.table(nonEnglishText.match(regexpBMPWord));

// Using Unicode property escapes instead
const regexpUPE = /\p{L}+/gu;
console.table(nonEnglishText.match(regexpUPE));
</pre>

<h2 id="Specifications">Specifications</h2>

<table class="standard-table">
 <tbody>
  <tr>
   <th scope="col">Specification</th>
  </tr>
  <tr>
   <td>{{SpecName('ESDraft', '#sec-runtime-semantics-unicodematchproperty-p', 'RegExp: Unicode property escapes')}}</td>
  </tr>
 </tbody>
</table>

<h2 id="Browser_compatibility">Browser compatibility</h2>

<p>For browser compatibility information, check out the <a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Browser_compatibility">main Regular Expressions compatibility table</a>.</p>

<h2 id="See_also">See also</h2>

<ul>
 <li><a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions">Regular expressions guide</a>

  <ul>
   <li><a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes">Character classes</a></li>
   <li><a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Assertions">Assertions</a></li>
   <li><a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Quantifiers">Quantifiers</a></li>
   <li><a href="/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Groups_and_Ranges">Groups and ranges</a></li>
  </ul>
 </li>
 <li><a href="/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp">The <code>RegExp()</code> constructor</a></li>
 <li><code><a href="/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode">RegExp.prototype.unicode</a></code></li>
 <li><a href="https://en.wikipedia.org/wiki/Unicode_character_property">Unicode character property — Wikipedia</a></li>
 <li><a href="https://2ality.com/2017/07/regexp-unicode-property-escapes.html">A blog post from Axel Rauschmayer about Unicode property escapes</a></li>
 <li><a href="https://unicode.org/reports/tr18/#Categories">The Unicode document for Unicode properties</a></li>
</ul>