1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
|
---
title: String.prototype.normalize()
slug: Web/JavaScript/Reference/Global_Objects/String/normalize
tags:
- ECMAScript 2015
- JavaScript
- Method
- String
- Unicode
- 参考
- 字符串
- 方法
translation_of: Web/JavaScript/Reference/Global_Objects/String/normalize
---
<p>{{JSRef}}</p>
<p><strong>normalize()</strong> 方法会按照指定的一种 Unicode 正规形式将当前字符串正规化。(如果该值不是字符串,则首先将其转换为一个字符串)。</p>
<div>{{EmbedInteractiveExample("pages/js/string-normalize.html", "taller")}}</div>
<h2 id="语法">语法</h2>
<pre class="syntaxbox"><code><var>str</var>.normalize([<var>form</var>])</code></pre>
<h3 id="参数">参数</h3>
<dl>
<dt><code><var>form</var></code> {{optional_inline}}</dt>
<dd>
<p>四种 Unicode 正规形式(Unicode Normalization Form)<code>"NFC"</code>、<code>"NFD"</code>、<code>"NFKC"</code>,或 <code>"NFKD"</code> 其中的一个, 默认值为 <code>"NFC"</code>。</p>
<p>这四个值的含义分别如下:</p>
<dl>
<dt><code>"NFC"</code></dt>
<dd>Canonical Decomposition, followed by Canonical Composition.</dd>
<dt><code>"NFD"</code></dt>
<dd>Canonical Decomposition.</dd>
<dt><code>"NFKC"</code></dt>
<dd>Compatibility Decomposition, followed by Canonical Composition.</dd>
<dt><code>"NFKD"</code></dt>
<dd>Compatibility Decomposition.</dd>
</dl>
</dd>
</dl>
<h3 id="返回值">返回值</h3>
<p>含有给定字符串的 Unicode 规范化形式的字符串。</p>
<h3 id="可能出现的异常">可能出现的异常</h3>
<dl>
<dt>{{jsxref("RangeError")}}</dt>
<dd>如果给 <code>form</code> 传入了上述四个字符串以外的参数,则会抛出 <code>RangeError</code> 异常。</dd>
</dl>
<h2 id="描述">描述</h2>
<p>Unicode assigns a unique numerical value, called a <em>code point</em>, to each character. For example, the code point for <code>"A"</code> is given as U+0041. However, sometimes more than one code point, or sequence of code points, can represent the same abstract character — the character <code>"ñ"</code> for example can be represented by either of:</p>
<ul>
<li>The single code point U+00F1.</li>
<li>The code point for <code>"n"</code> (U+006E) followed by the code point for the combining tilde (U+0303).</li>
</ul>
<pre class="brush: js">let string1 = '\u00F1';
let string2 = '\u006E\u0303';
console.log(string1); // ñ
console.log(string2); // ñ
</pre>
<p>However, since the code points are different, string comparison will not treat them as equal. And since the number of code points in each version is different, they even have different lengths.</p>
<pre class="brush: js">let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
console.log(string1 === string2); // false
console.log(string1.length); // 1
console.log(string2.length); // 2
</pre>
<p>The <code>normalize()</code> method helps solve this problem by converting a string into a normalized form common for all sequences of code points that represent the same characters. There are two main normalization forms, one based on <strong>canonical equivalence</strong> and the other based on <strong>compatibility</strong>.</p>
<h3 id="Canonical_equivalence_normalization">Canonical equivalence normalization</h3>
<p>In Unicode, two sequences of code points have canonical equivalence if they represent the same abstract characters, and should always have the same visual appearance and behavior (for example, they should always be sorted in the same way).</p>
<p>You can use <code>normalize()</code> using the <code>"NFD"</code> or <code>"NFC"</code> arguments to produce a form of the string that will be the same for all canonically equivalent strings. In the example below we normalize two representations of the character <code>"ñ"</code>:</p>
<pre class="brush: js">let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
string1 = string1.normalize('NFD');
string2 = string2.normalize('NFD');
console.log(string1 === string2); // true
console.log(string1.length); // 2
console.log(string2.length); // 2
</pre>
<h4 id="Composed_and_decomposed_forms">Composed and decomposed forms</h4>
<p>Note that the length of the normalized form under <code>"NFD"</code> is <code>2</code>. That's because <code>"NFD"</code> gives you the <strong>decomposed</strong> version of the canonical form, in which single code points are split into multiple combining ones. The decomposed canonical form for <code>"ñ"</code> is <code>"\u006E\u0303"</code>.</p>
<p>You can specify <code>"NFC"</code> to get the <strong>composed</strong> canonical form, in which multiple code points are replaced with single code points where possible. The composed canonical form for <code>"ñ"</code> is <code>"\u00F1"</code>:</p>
<pre class="brush: js">let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
string1 = string1.normalize('NFC');
string2 = string2.normalize('NFC');
console.log(string1 === string2); // true
console.log(string1.length); // 1
console.log(string2.length); // 1
console.log(string2.codePointAt(0).toString(16)); // f1</pre>
<h3 id="Compatibility_normalization">Compatibility normalization</h3>
<p>In Unicode, two sequences of code points are compatible if they represent the same abstract characters, and should be treated alike in some — but not necessarily all — applications.</p>
<p>All canonically equivalent sequences are also compatible, but not vice versa.</p>
<p>For example, the code point U+FB00 represents the <a href="/en-US/docs/Glossary/Ligature">ligature</a> <code>"ff"</code>. It is compatible with two consecutive U+0066 code points (<code>"ff"</code>).</p>
<p>In some respects (such as sorting) they should be treated as equivalent—and in some (such as visual appearance) they should not, so they are not canonically equivalent.</p>
<p>You can use <code>normalize()</code> using the <code>"NFKD"</code> or <code>"NFKC"</code> arguments to produce a form of the string that will be the same for all compatible strings:</p>
<pre class="brush: js">let string1 = '\uFB00';
let string2 = '\u0066\u0066';
console.log(string1); // ff
console.log(string2); // ff
console.log(string1 === string2); // false
console.log(string1.length); // 1
console.log(string2.length); // 2
string1 = string1.normalize('NFKD');
string2 = string2.normalize('NFKD');
console.log(string1); // ff <- visual appearance changed
console.log(string2); // ff
console.log(string1 === string2); // true
console.log(string1.length); // 2
console.log(string2.length); // 2
</pre>
<p>When applying compatibility normalization it's important to consider what you intend to do with the strings, since the normalized form may not be appropriate for all applications. In the example above the normalization is appropriate for search, because it enables a user to find the string by searching for <code>"f"</code>. But it may not be appropriate for display, because the visual representation is different.</p>
<p>As with canonical normalization, you can ask for decomposed or composed compatible forms by passing <code>"NFKD"</code> or <code>"NFKC"</code>, respectively.</p>
<h2 id="示例">示例</h2>
<h3 id="使用_normalize">使用 <code>normalize()</code></h3>
<pre class="brush:js;">// Initial string
// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
var str = "\u1E9B\u0323";
// Canonically-composed form (NFC)
// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
str.normalize("NFC"); // "\u1E9B\u0323"
str.normalize(); // same as above
// Canonically-decomposed form (NFD)
// U+017F: LATIN SMALL LETTER LONG S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
str.normalize("NFD"); // "\u017F\u0323\u0307"
// Compatibly-composed (NFKC)
// U+1E69: LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
str.normalize("NFKC"); // "\u1E69"
// Compatibly-decomposed (NFKD)
// U+0073: LATIN SMALL LETTER S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
str.normalize("NFKD"); // "\u0073\u0323\u0307"
</pre>
<h2 id="规范">规范</h2>
<table class="standard-table">
<tbody>
<tr>
<th scope="col">规范</th>
<th scope="col">状态</th>
<th scope="col">备注</th>
</tr>
<tr>
<td>{{SpecName('ESDraft', '#sec-string.prototype.normalize', 'String.prototype.normalize')}}</td>
<td>{{Spec2('ESDraft')}}</td>
<td></td>
</tr>
<tr>
<td>{{SpecName('ES2015', '#sec-string.prototype.normalize', 'String.prototype.normalize')}}</td>
<td>{{Spec2('ES2015')}}</td>
<td>Initial definition.</td>
</tr>
</tbody>
</table>
<h2 id="浏览器兼容性">浏览器兼容性</h2>
<p>{{Compat("javascript.builtins.String.normalize")}}</p>
<h2 id="参见">参见</h2>
<ul>
<li><a href="http://www.unicode.org/reports/tr15/">Unicode Standard Annex #15, Unicode Normalizatoin Forms</a></li>
<li><a href="http://zh.wikipedia.org/zh-cn/Unicode%E7%AD%89%E5%83%B9%E6%80%A7">Unicode 等价性</a></li>
</ul>
|