How to Properly Match Chinese Characters With Regular Expressions

三日月綾香

Simplified Chinese version

tl;dr

Python

[\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002ebef\U00030000-\U000323af\ufa0e\ufa0f\ufa11\ufa13\ufa14\ufa1f\ufa21\ufa23\ufa24\ufa27\ufa28\ufa29\u3006\u3007][\ufe00-\ufe0f\U000e0100-\U000e01ef]?

Python (requires pip install regex)

[\p{Unified_Ideograph}\u3006\u3007][\ufe00-\ufe0f\U000e0100-\U000e01ef]?

JavaScript (RegExp: Unicode property escapes)

Browser support: Can I use, "RegExp: Unicode property escapes"

[\p{Unified_Ideograph}\u3006\u3007][\ufe00-\ufe0f\u{e0100}-\u{e01ef}]?

JavaScript (RegExp: Unicode)

Browser support: Can I use, "RegExp: Unicode"

[\u4e00-\u9fff\u3400-\u4dbf\u{20000}-\u{2a6df}\u{2a700}-\u{2ebef}\u{30000}-\u{323af}\ufa0e\ufa0f\ufa11\ufa13\ufa14\ufa1f\ufa21\ufa23\ufa24\ufa27\ufa28\ufa29\u3006\u3007][\ufe00-\ufe0f\u{e0100}-\u{e01ef}]?

JavaScript (without the u flag, using surrogate pairs)

([\u4e00-\u9fff\u3400-\u4dbf\ufa0e\ufa0f\ufa11\ufa13\ufa14\ufa1f\ufa21\ufa23\ufa24\ufa27\ufa28\ufa29\u3006\u3007]|[\ud840-\ud868\ud86a-\ud879\ud880-\ud887][\udc00-\udfff]|\ud869[\udc00-\udedf\udf00-\udfff]|\ud87a[\udc00-\udfef]|\ud888[\udc00-\udfaf])([\ufe00-\ufe0f]|\udb40[\udd00-\uddef])?

Examples

Python

import json
import re

pattern = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002ebef\U00030000-\U000323af\ufa0e\ufa0f\ufa11\ufa13\ufa14\ufa1f\ufa21\ufa23\ufa24\ufa27\ufa28\ufa29\u3006\u3007][\ufe00-\ufe0f\U000e0100-\U000e01ef]?')

for i, match in enumerate(pattern.finditer('a〆文𦫖﨑禰󠄀')):
    print(f'Match {i}:', match[0], json.dumps(match[0]))

# Match 0: 〆 "\u3006"
# Match 1: 文 "\u6587"
# Match 2: 𦫖 "\ud85a\uded6"
# Match 3: 﨑 "\ufa11"
# Match 4: 禰󠄀 "\u79b0\udb40\udd00"

Python (requires pip install regex)

import json
import regex as re

pattern = re.compile(r'[\p{Unified_Ideograph}\u3006\u3007][\ufe00-\ufe0f\U000e0100-\U000e01ef]?')

for i, match in enumerate(pattern.finditer('a〆文𦫖﨑禰󠄀')):
    print(f'Match {i}:', match[0], json.dumps(match[0]))

# Match 0: 〆 "\u3006"
# Match 1: 文 "\u6587"
# Match 2: 𦫖 "\ud85a\uded6"
# Match 3: 﨑 "\ufa11"
# Match 4: 禰󠄀 "\u79b0\udb40\udd00"

JavaScript (RegExp: Unicode property escapes)

Browser support: Can I use, "RegExp: Unicode property escapes"

const pattern = /[\p{Unified_Ideograph}\u3006\u3007][\ufe00-\ufe0f\u{e0100}-\u{e01ef}]?/gmu;

'a〆文𦫖﨑禰󠄀'.match(pattern).forEach((match, i) => {
   console.log(`Match ${i}: ${match}, length: ${match.length}`);
});
// Match 0: 〆, length: 1
// Match 1: 文, length: 1
// Match 2: 𦫖, length: 2
// Match 3: 﨑, length: 1
// Match 4: 禰󠄀, length: 3

JavaScript (RegExp: Unicode)

Browser support: Can I use, "RegExp: Unicode"

const pattern = /[\u4e00-\u9fff\u3400-\u4dbf\u{20000}-\u{2a6df}\u{2a700}-\u{2ebef}\u{30000}-\u{323af}\ufa0e\ufa0f\ufa11\ufa13\ufa14\ufa1f\ufa21\ufa23\ufa24\ufa27\ufa28\ufa29\u3006\u3007][\ufe00-\ufe0f\u{e0100}-\u{e01ef}]?/gmu;

'a〆文𦫖﨑禰󠄀'.match(pattern).forEach((match, i) => {
   console.log(`Match ${i}: ${match}, length: ${match.length}`);
});
// Match 0: 〆, length: 1
// Match 1: 文, length: 1
// Match 2: 𦫖, length: 2
// Match 3: 﨑, length: 1
// Match 4: 禰󠄀, length: 3

JavaScript (without the u flag, using surrogate pairs)

const pattern = /([\u4e00-\u9fff\u3400-\u4dbf\ufa0e\ufa0f\ufa11\ufa13\ufa14\ufa1f\ufa21\ufa23\ufa24\ufa27\ufa28\ufa29\u3006\u3007]|[\ud840-\ud868\ud86a-\ud879\ud880-\ud887][\udc00-\udfff]|\ud869[\udc00-\udedf\udf00-\udfff]|\ud87a[\udc00-\udfef]|\ud888[\udc00-\udfaf])([\ufe00-\ufe0f]|\udb40[\udd00-\uddef])?/gm;

'a〆文𦫖﨑禰󠄀'.match(pattern).forEach((match, i) => {
   console.log(`Match ${i}: ${match}, length: ${match.length}`);
});
// Match 0: 〆, length: 1
// Match 1: 文, length: 1
// Match 2: 𦫖, length: 2
// Match 3: 﨑, length: 1
// Match 4: 禰󠄀, length: 3
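
The surrogate ranges in the last JavaScript pattern come from splitting each supplementary-plane boundary into its UTF-16 high and low surrogates. The following is a minimal sketch of that arithmetic in Python; the helper name to_surrogate_pair is hypothetical and not part of any library.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the 20-bit offset
    return high, low


# Boundaries of the supplementary ranges used in the patterns above.
boundaries = [0x20000, 0x2A6DF, 0x2A700, 0x2EBEF,
              0x30000, 0x323AF, 0xE0100, 0xE01EF]

for cp in boundaries:
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:05X} -> \\u{high:04x} \\u{low:04x}")

# U+20000 -> \ud840 \udc00
# U+2A6DF -> \ud869 \udedf
# U+2A700 -> \ud869 \udf00
# U+2EBEF -> \ud87a \udfef
# U+30000 -> \ud880 \udc00
# U+323AF -> \ud888 \udfaf
# U+E0100 -> \udb40 \udd00
# U+E01EF -> \udb40 \uddef
```

Because U+2A6DF and U+2A700 share the high surrogate \ud869, that surrogate gets its own alternative with two low-surrogate ranges, while high surrogates that are fully covered can be paired with the whole \udc00-\udfff range.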

Explanation

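CJK Unified Ideographs (the ranges used in the explicit patterns above):

  1. U+4E00..U+9FFF: CJK Unified Ideographs
  2. U+3400..U+4DBF: CJK Unified Ideographs Extension A
  3. U+20000..U+2A6DF: CJK Unified Ideographs Extension B
  4. U+2A700..U+2EBEF: CJK Unified Ideographs Extensions C, D, E and F
  5. U+30000..U+323AF: CJK Unified Ideographs Extensions G and H

12 CJK Unified Ideographs in the CJK Compatibility Ideographs block:

  U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29

2 characters in the CJK Symbols and Punctuation block that are often regarded as Chinese characters:

  1. U+3006 〆 (IDEOGRAPHIC CLOSING MARK)
  2. U+3007 〇 (IDEOGRAPHIC NUMBER ZERO)

Variation Selectors (an optional selector may follow the base character):

  1. U+FE00..U+FE0F: Variation Selectors (VS1 to VS16)
  2. U+E0100..U+E01EF: Variation Selectors Supplement (VS17 to VS256)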
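
The four groups above can also be assembled into the pattern programmatically, which makes it easier to see which part of the character class comes from where. This is only a sketch: the constant and helper names (UNIFIED_IDEOGRAPH_RANGES, esc, char_class and so on) are made up for illustration, and the code points are copied from the tl;dr pattern.

```python
import re

# (first, last) code point pairs, copied from the tl;dr pattern.
UNIFIED_IDEOGRAPH_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2EBEF),  # Extensions C through F
    (0x30000, 0x323AF),  # Extensions G and H
]
# The 12 unified ideographs inside the CJK Compatibility Ideographs block.
COMPATIBILITY_BLOCK_UNIFIED = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]
# U+3006 and U+3007 from CJK Symbols and Punctuation.
SYMBOLS = [0x3006, 0x3007]
VARIATION_SELECTOR_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]


def esc(cp: int) -> str:
    """Render a code point as a \\uXXXX or \\UXXXXXXXX regex escape."""
    return f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"


def char_class(ranges, singles=()) -> str:
    """Build a bracketed character class from ranges and single code points."""
    parts = [f"{esc(first)}-{esc(last)}" for first, last in ranges]
    parts += [esc(cp) for cp in singles]
    return "[" + "".join(parts) + "]"


# One base character, optionally followed by one variation selector.
source = (
    char_class(UNIFIED_IDEOGRAPH_RANGES, COMPATIBILITY_BLOCK_UNIFIED + SYMBOLS)
    + char_class(VARIATION_SELECTOR_RANGES)
    + "?"
)
pattern = re.compile(source)

# Same sample as in the examples, written with escapes to keep the selector visible.
sample = 'a\u3006\u6587\U00026ad6\ufa11\u79b0\U000e0100'
matches = [m[0] for m in pattern.finditer(sample)]
print(len(matches))  # 5
```

The generated source string is identical to the Python pattern in the tl;dr section, so either form can be used.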

Wrong Solutions

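  1. Solutions containing \p{sc=Han} (the Han script in Unicode) are wrong because it matches more than Chinese characters, for example CJK radicals such as U+2E80 and the iteration mark U+3005.
  2. Solutions containing \p{Ideo} (the Ideographic property in Unicode) are wrong because it matches more than Chinese characters, for example Tangut ideographs and CJK compatibility ideographs that are not Unified_Ideograph.
  3. Solutions containing \p{Variation_Selector} are wrong because it also matches the Mongolian free variation selectors, such as U+180B..U+180D.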
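
As a quick check of the points above, the sketch below runs the explicit Python pattern from the tl;dr section against a few code points that the broader properties accept according to the Unicode data files. It uses only the standard re module, so it shows what the recommended pattern rejects; it does not evaluate \p{...} itself. The sample code points and their labels were chosen for illustration.

```python
import re

# The explicit pattern from the tl;dr section.
PATTERN = re.compile(
    r'[\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002ebef'
    r'\U00030000-\U000323af\ufa0e\ufa0f\ufa11\ufa13\ufa14\ufa1f\ufa21\ufa23'
    r'\ufa24\ufa27\ufa28\ufa29\u3006\u3007][\ufe00-\ufe0f\U000e0100-\U000e01ef]?'
)

# Code points covered by the broader properties but not by the pattern.
look_alikes = {
    '\u2e80': 'CJK radical (Script=Han)',
    '\u3005': 'ideographic iteration mark (Script=Han)',
    '\uf900': 'CJK compatibility ideograph (Script=Han, Ideographic)',
    '\U00017000': 'Tangut ideograph (Ideographic)',
}

for ch, why in look_alikes.items():
    matched = PATTERN.fullmatch(ch) is not None
    print(f"U+{ord(ch):04X} {why}: matched={matched}")  # matched=False for all four

# A Mongolian free variation selector is not accepted after an ideograph,
# while a selector from U+FE00..U+FE0F is.
print(PATTERN.fullmatch('\u6587\u180b') is not None)  # False
print(PATTERN.fullmatch('\u6587\ufe00') is not None)  # True
```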
