[\u3006\u3007\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002ebef\U00030000-\U0003134f]
U+3006: Character 〆 (often regarded as a Chinese character)
U+3007: Character 〇 (often regarded as a Chinese character)
U+4E00-U+9FFF: CJK Unified Ideographs
U+3400-U+4DBF: CJK Unified Ideographs Extension A
U+20000-U+2A6DF: CJK Unified Ideographs Extension B
U+2A700-U+2B73F: CJK Unified Ideographs Extension C
U+2B740-U+2B81F: CJK Unified Ideographs Extension D
U+2B820-U+2CEAF: CJK Unified Ideographs Extension E
U+2CEB0-U+2EBEF: CJK Unified Ideographs Extension F
U+30000-U+3134F: CJK Unified Ideographs Extension G

>>> import re
>>> han_regex = re.compile(r'[\u3006\u3007\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002ebef\U00030000-\U0003134f]')
>>> is_han = lambda c: bool(han_regex.fullmatch(c))
>>> print([is_han(c) for c in 'm!文𦫖〇〆'])
[False, False, True, True, True, True]

If ES6 Unicode code point escapes (\u{...}) are available:
> const isHan = (c) => /^[\u3006\u3007\u4e00-\u9fff\u3400-\u4dbf\u{20000}-\u{2a6df}\u{2a700}-\u{2ebef}\u{30000}-\u{3134f}]$/u.test(c);
> console.log([...'m!文𦫖〇〆'].map(isHan));
[ false, false, true, true, true, true ]

If ES6 is not available, surrogate pairs must be used:
> const isHan = (c) => /^(?:[\u3006\u3007\u4e00-\u9fff\u3400-\u4dbf]|[\ud840-\ud868\ud86a-\ud879\ud880-\ud883][\udc00-\udfff]|\ud869[\udc00-\udedf\udf00-\udfff]|\ud87a[\udc00-\udfef]|\ud884[\udc00-\udf4f])$/.test(c);
> console.log([...'m!文𦫖〇〆'].map(isHan));
[ false, false, true, true, true, true ]

\p{sc=Han}

The \p{...} syntax is called a Unicode property escape. The first part is the Unicode property name, where sc stands for Script; the second part, Han, is the Unicode property value.
To see which characters belong to the Han script, consult Scripts.txt in the UCD.
U+2E80-U+2E99: CJK Radicals
U+2E9B-U+2EF3: CJK Radicals
U+2F00-U+2FD5: Kangxi Radicals
U+3005: Ideographic Iteration Mark
U+3007: Ideographic Number Zero
U+3021-U+3029: Suzhou Numerals
U+3038-U+303A: Suzhou Numerals
U+303B: Vertical Ideographic Iteration Mark
U+3400-U+4DBF: CJK Unified Ideographs Extension A
U+4E00-U+9FFF: CJK Unified Ideographs
U+F900-U+FA6D: CJK Compatibility Ideographs
U+FA70-U+FAD9: CJK Compatibility Ideographs
U+16FE2: Old Chinese Hook Mark
U+16FE3: Old Chinese Iteration Mark
U+16FF0-U+16FF1: Vietnamese Alternate Reading Marks
U+20000-U+2A6DF: CJK Unified Ideographs Extension B
U+2A700-U+2B738: CJK Unified Ideographs Extension C
U+2B740-U+2B81D: CJK Unified Ideographs Extension D
U+2B820-U+2CEA1: CJK Unified Ideographs Extension E
U+2CEB0-U+2EBE0: CJK Unified Ideographs Extension F
U+2F800-U+2FA1D: CJK Compatibility Ideographs Supplement
U+30000-U+3134A: CJK Unified Ideographs Extension G

As the list shows, the Han script covers more than just Chinese characters.
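As a minimal sketch of how \p{sc=Han} behaves in practice, the snippet below uses it in a JavaScript regex. It assumes an engine that supports Unicode property escapes with the u flag (ES2018 or later). The helper name isHanScript and the sample characters are illustrative and not part of the original text.

```javascript
// Hypothetical helper: true if the string is exactly one code point whose
// Script property is Han. Requires the `u` flag (ES2018 property escapes).
const isHanScript = (c) => /^\p{sc=Han}$/u.test(c);

// Sample code points, written as escapes so the file encoding does not matter.
const samples = [
  ['m', 'Latin letter m'],
  ['\u6587', 'U+6587 CJK Unified Ideograph'],
  ['\u3005', 'U+3005 Ideographic Iteration Mark'],
  ['\u2F00', 'U+2F00 Kangxi Radical One'],
  ['\u3006', 'U+3006 Ideographic Closing Mark'],
];

for (const [ch, label] of samples) {
  console.log(`${label}: ${isHanScript(ch)}`);
}
```

The iteration mark and the Kangxi radical are in the Han script even though they are not Chinese characters in the narrow sense. U+3006 is absent from the Han script list, so the explicit-range pattern and \p{sc=Han} disagree on it.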
\p{Ideo}

Similarly, Ideo here is short for Ideographic. To see which characters have the Ideographic property, consult PropList.txt in the UCD.
U+3006: Ideographic Closing Mark
U+3007: Ideographic Number Zero
U+3021-U+3029: Suzhou Numerals
U+3038-U+303A: Suzhou Numerals
U+3400-U+4DBF: CJK Unified Ideographs Extension A
U+4E00-U+9FFF: CJK Unified Ideographs
U+F900-U+FA6D: CJK Compatibility Ideographs
U+FA70-U+FAD9: CJK Compatibility Ideographs
U+16FE4: Khitan Small Script Filler
U+17000-U+187F7: Tangut Ideographs
U+18800-U+18AFF: Tangut Components
U+18B00-U+18CD5: Khitan Small Script Characters
U+18D00-U+18D08: Tangut Ideographs Supplement
U+1B170-U+1B2FB: Nushu Characters
U+20000-U+2A6DF: CJK Unified Ideographs Extension B
U+2A700-U+2B738: CJK Unified Ideographs Extension C
U+2B740-U+2B81D: CJK Unified Ideographs Extension D
U+2B820-U+2CEA1: CJK Unified Ideographs Extension E
U+2CEB0-U+2EBE0: CJK Unified Ideographs Extension F
U+2F800-U+2FA1D: CJK Compatibility Ideographs Supplement
U+30000-U+3134A: CJK Unified Ideographs Extension G

As the list shows, the Ideographic property also covers more than just Chinese characters.
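To make the difference between the two properties concrete, here is a small sketch that tests a few code points against both \p{Ideo} and \p{sc=Han}. It assumes a JavaScript engine with Unicode property escapes (ES2018 or later). The helper names isIdeographic and inHanScript are hypothetical and chosen only for this illustration.

```javascript
// Hypothetical helpers: each tests a string that is exactly one code point.
// Both require the `u` flag (ES2018 Unicode property escapes).
const isIdeographic = (c) => /^\p{Ideo}$/u.test(c);   // binary property Ideographic
const inHanScript = (c) => /^\p{sc=Han}$/u.test(c);   // Script=Han

// U+6587 CJK ideograph, U+3006 closing mark, U+3005 iteration mark,
// U+2F00 Kangxi radical, U+17000 Tangut ideograph.
const samples = ['\u6587', '\u3006', '\u3005', '\u2F00', '\u{17000}'];

for (const ch of samples) {
  const cp = 'U+' + ch.codePointAt(0).toString(16).toUpperCase();
  console.log(cp, { ideo: isIdeographic(ch), han: inHanScript(ch) });
}
```

Neither property is a drop-in test for "is a Chinese character". \p{Ideo} admits Tangut, Khitan and Nushu characters, while \p{sc=Han} admits radicals and iteration marks, so the explicit ranges remain the most precise option when exactly that set is wanted.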