Oniguruma + Unicode's [[:print:]] apparently isn't POSIX-style
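Update: Just to be clear, this post is not about which behavior is "correct".
Update 2: Fixed a place where Technical Report had been written as Annex.
Update 3: Corrected a part I had slightly misunderstood. The conclusion is unchanged.
I saw this tweet from id:ockeghem:
"With the POSIX regex class [:print:], Perl and PHP disagree on whether newlines and tabs match. Perl doesn't match them, PHP does. Which one is right?"
So I dug into it, and it looks like
this time it wasn't PHP's fault,
which was a relief in more ways than one.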
First, let's try it
PHP:
<?php
foreach (str_split("\x09\x0a\x0d a") as $c) {
    var_dump(ord($c));
    echo "preg_match(): ";
    var_dump(preg_match("([[:print:]])u", $c));
    echo "ereg(): ";
    var_dump(ereg("[[:print:]]", $c));
    mb_regex_encoding("ASCII");
    echo "[ASCII] mb_ereg(): ";
    var_dump(mb_ereg("[[:print:]]", $c));
    mb_regex_encoding("UTF-8");
    echo "[UTF-8] mb_ereg(): ";
    var_dump(mb_ereg("[[:print:]]", $c));
    mb_regex_encoding("UTF-16");
    echo "[UTF-16] mb_ereg(): ";
    var_dump(mb_ereg("\0[\0[\0:\0p\0r\0i\0n\0t\0:\0]\0]", "\0$c"));
}
?>
Ruby:
"\x09\x0a\x0d a".each_char { |c| p c[0], c =~ /[[:print:]]/ }
Perl:
for (split(//, "\x09\x0a\x0d a")) { print ord, " ", 0+/[[:print:]]/, "\n" }
C:
echo '#include<stdio.h>~#include<regex.h>~main(){regex_t r;regcomp(&r,"^[[:print:]]",0);for(char*p="\x09\x0a\x0d a";*p;++p)printf("%d %d\n",*p,!regexec(&r,p,0,0,0));}' | tr '~' '\n' | gcc -std=c99 -xc - && ./a.out
Python doesn't seem to have POSIX-regex-style character classes, but anyway:
import re
for c in "\x09\x0a\x0d a":
    print(ord(c), re.match('[[:print:]]', c))
Results
PHP (5.2.9) results:
int(9)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(10)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(13)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(32)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(97)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
PHP (4.4.9) results (the UTF-16 case is omitted because it is not supported):
int(9)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): bool(false)
int(10)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): bool(false)
int(13)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): bool(false)
int(32)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)
int(97)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)
Ruby results:
9
nil
10
nil
13
nil
32
0
97
0
Perl results:
9 0
10 0
13 0
32 1
97 1
C results:
9 0
10 0
13 0
32 1
97 1
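Is this Oniguruma's specification?
Since only PHP 5 behaves differently (mb_ereg() in the UTF-8 and UTF-16 modes), the cause can only lie in Oniguruma. So I consulted the Oniguruma documentation: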
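Non-Unicode case:
- alnum: alphanumeric characters
- alpha: letters
- ascii: 0 - 127
- blank: \t, \x20
- cntrl: (no entry)
- digit: 0-9
- graph: includes all multibyte characters
- lower: (no entry)
- print: includes all multibyte characters
- punct: (no entry)
- space: \t, \n, \v, \f, \r, \x20
- upper: (no entry)
- xdigit: 0-9, a-f, A-F
- word: alphanumeric characters, "_", and multibyte characters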
Unicode case:
POSIX bracket - Oniguruma Regular Expressions Version 5.9.1 2007/09/05
- alnum: Letter | Mark | Decimal_Number
- alpha: Letter | Mark
- ascii: 0000 - 007F
- blank: Space_Separator | 0009
- cntrl: Control | Format | Unassigned | Private_Use | Surrogate
- digit: Decimal_Number
- graph: [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
- lower: Lowercase_Letter
- print: [[:graph:]] | [[:space:]]
- punct: Connector_Punctuation | Dash_Punctuation | Close_Punctuation | Final_Punctuation | Initial_Punctuation | Other_Punctuation | Open_Punctuation
- space: Space_Separator | Line_Separator | Paragraph_Separator | 0009 | 000A | 000B | 000C | 000D | 0085
- upper: Uppercase_Letter
- xdigit: 0030 - 0039 | 0041 - 0046 | 0061 - 0066 (0-9, a-f, A-F)
- word: Letter | Mark | Decimal_Number | Connector_Punctuation
So the specification of the POSIX character classes differs between non-Unicode encodings (character sets) and Unicode.
On top of that, the HISTORY file bundled in the archive says:
2004/11/08: [spec] fix Unicode character types. 0x00ad (soft hyphen) should be [:cntrl:] and [:space:] type. [0x0009..0x000d], 0x0085 should be [:print:] type. 0x00ad should not be [:punct:] type.
This shows that the specification was changed at that point, which makes it clear that the behavior is intentional.
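As a rough cross-check, the Unicode definitions above can be modeled in Python. This is only a sketch: it uses the standard unicodedata module rather than Oniguruma itself, the helper names is_space, is_graph and is_print are made up for illustration, and the category data depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points that the Unicode [:space:] definition lists explicitly.
EXPLICIT_SPACE = {0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0085}


def is_space(ch):
    """space: Space_Separator | Line_Separator | Paragraph_Separator | 0009-000D | 0085."""
    return unicodedata.category(ch) in ("Zs", "Zl", "Zp") or ord(ch) in EXPLICIT_SPACE


def is_graph(ch):
    """graph: not space, and not Control / Unassigned / Surrogate."""
    return not is_space(ch) and unicodedata.category(ch) not in ("Cc", "Cn", "Cs")


def is_print(ch):
    """print: graph | space."""
    return is_graph(ch) or is_space(ch)


# Same five test characters as the snippets at the top of the post.
for ch in "\x09\x0a\x0d a":
    print(ord(ch), is_print(ch))
```

Under these definitions all five test characters, including tab, LF and CR, count as print, which is consistent with the PHP 5.2.9 mb_ereg() results in UTF-8 mode.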
By the way, what does POSIX say?
Regular Expressions - The Single UNIX ® Specification, Version 2
A character class expression represents the set of characters belonging to a character class, as defined in the LC_CTYPE category in the current locale. All character classes specified in the current locale will be recognised. A character class expression is expressed as a character class name enclosed within bracket-colon ([: :]) delimiters.
The following character class expressions are supported in all locales:
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
In addition, character class expressions of the form:
[:name:]
are recognised in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category.
The definition of this LC_CTYPE category is in the Locale document.
Locale (excerpt) - The Single UNIX ® Specification, Version 2
- upper: Define characters to be classified as upper-case letters. In the POSIX locale, the 26 upper-case letters are included: {A B C D E F G H I J K L M N O P Q R S T U V W X Y Z} In a locale definition file, no character specified for the keywords cntrl, digit, punct or space can be specified. The upper-case letters A to Z, as defined in Character Set Description File (the portable character set), are automatically included in this class.
- lower: Define characters to be classified as lower-case letters. In the POSIX locale, the 26 lower-case letters are included: {a b c d e f g h i j k l m n o p q r s t u v w x y z} In a locale definition file, no character specified for the keywords cntrl, digit, punct or space can be specified. The lower-case letters a to z of the portable character set are automatically included in this class.
- alpha: Define characters to be classified as letters. In the POSIX locale, all characters in the classes upper and lower are included. In a locale definition file, no character specified for the keywords cntrl, digit, punct or space can be specified. Characters classified as either upper or lower are automatically included in this class.
- digit: Define the characters to be classified as numeric digits. In the POSIX locale, only: { 0 1 2 3 4 5 6 7 8 9 } are included. In a locale definition file, only the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 can be specified, and in contiguous ascending sequence by numerical value. The digits 0 to 9 of the portable character set are automatically included in this class. The definition of character class digit requires that only ten characters - the ones defining digits - can be specified; alternative digits (for example, Hindi or Kanji) cannot be specified here. However, the encoding may vary if an implementation supports more than one encoding.
- space: Define characters to be classified as white-space characters. In the POSIX locale, at a minimum, the characters space, form-feed, newline, carriage-return, tab and vertical-tab are included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, graph or xdigit can be specified. The characters space, form-feed, newline, carriage-return, tab and vertical-tab of the portable character set, and any characters included in the class blank are automatically included in this class.
- cntrl: Define characters to be classified as control characters. In the POSIX locale, no characters in classes alpha or print are included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, punct, graph, print or xdigit can be specified.
- punct: Define characters to be classified as punctuation characters. In the POSIX locale, neither the space character nor any characters in classes alpha, digit or cntrl are included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit or as the space character can be specified.
- graph: Define characters to be classified as printable characters, not including the space character. In the POSIX locale, all characters in classes alpha, digit and punct are included; no characters in class cntrl are included. In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit and punct are automatically included in this class. No character specified for the keyword cntrl can be specified.
- print: Define characters to be classified as printable characters, including the space character. In the POSIX locale, all characters in class graph are included; no characters in class cntrl are included. In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct and the space character are automatically included in this class. No character specified for the keyword cntrl can be specified.
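Restating just the description of print:
It defines the characters classified as printable, including the space character. In the POSIX locale, every character in graph is included and no character in the cntrl class is. Characters classified as upper, lower, alpha, digit, xdigit or punct in a locale definition file, plus the space character, are automatically included in this class. A character specified for cntrl cannot be specified here.
There was also an interesting footnote below this description: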
The space character, which is part of the space and blank classes, cannot belong to punct or graph, but automatically belongs to the print class. Other space or blank characters can be classified as any of punct, graph or print.
Locale (excerpt) - The Single UNIX ® Specification, Version 2
- The space character (" ") is part of the space and blank classes.
- The space character never belongs to the punct or graph classes.
- The space character automatically belongs to the print class.
- Other space or blank characters can be classified as any of punct, graph or print.
So the crux seems to be what gets included in cntrl.
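The constraints quoted above can be written down as plain set relations. The sketch below hand-codes ASCII classes in the way C's iscntrl() and isprint() conventionally classify them in the "C" locale. That table, and in particular the membership of CNTRL, is an assumption made for illustration and is not read from a real locale definition file.

```python
# Hand-written ASCII classes (assumption: conventional "C"-locale classification).
CNTRL = set(range(0x00, 0x20)) | {0x7F}
GRAPH = set(range(0x21, 0x7F))   # 0x21..0x7E
PRINT = GRAPH | {0x20}           # graph plus the space character

# Constraints taken from the quoted definition of print and its footnote.
assert GRAPH <= PRINT                        # all of graph is in print
assert 0x20 in PRINT and 0x20 not in GRAPH   # space is print but not graph
assert not (PRINT & CNTRL)                   # no cntrl character is in print

for code in (0x09, 0x0A, 0x0D, 0x20, 0x61):
    print(code, int(code in PRINT))
# 9 0
# 10 0
# 13 0
# 32 1
# 97 1
```

With tab, LF and CR placed in cntrl, they cannot be print, which is the behavior seen in the Perl and C results.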
Unicode Standard Technical Report #18
Unicode Standard TR #18 defines something called Unicode Regular Expressions. Its section "Annex C: Compatibility Properties" contains a table that maps the POSIX character classes to Unicode character classes.
Annex C: Compatibility Properties - Unicode Standard Technical Report #18 Unicode Regular Expressions
Property | Standard Recommendation | POSIX Compatible | Comments |
cntrl | \p{gc=Control} | \p{gc=Control} | The characters in \p{gc=Format} share some, but not all aspects of control characters. Many format characters are required in the representation of plain text. |
print | \p{graph} \p{blank} -- \p{cntrl} | \p{graph} \p{blank} -- \p{cntrl} | Includes graph and space-like characters. |
space / \s | \p{Whitespace} | \p{Whitespace} | See PropList [UCD] for the definition of Whitespace. |
blank | \p{Whitespace} -- [\N{LF} \N{VT} \N{FF} \N{CR} \N{NEL} \p{gc=Line_Separator} \p{gc=Paragraph_Separator}] | \p{Whitespace} -- [\N{LF} \N{VT} \N{FF} \N{CR} \N{NEL} \p{gc=Line_Separator} \p{gc=Paragraph_Separator}] | "horizontal" whitespace. |
graph | [^\p{space}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}] | [^\p{space}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}] | Warning: the set to the left is defined by excluding space, controls, and so on with ^. |
In the table, \p{gc=...} matches the characters whose Unicode General Category property is "...", and \N{...} denotes the character with the given character name (for example, \N{LATIN CAPITAL LETTER A}). "--" is the set difference between its left-hand and right-hand sides.
UnicodeData.txt classifies U+0000 through U+001F in the category Cc (Control Character).
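The General Category values can be checked without opening UnicodeData.txt. The following sketch uses Python's standard unicodedata module, so it reflects the Unicode version bundled with the interpreter rather than the file consulted above. The variable name all_c0_are_control is just for illustration.

```python
import unicodedata

# \N{...} also works in Python string literals.
print("\N{LATIN CAPITAL LETTER A}")

# \p{gc=Control} corresponds to the General Category value "Cc".
for code in (0x09, 0x0A, 0x0D, 0x20, 0x61):
    print("U+%04X" % code, unicodedata.category(chr(code)))

# Check that every code point from U+0000 to U+001F is reported as Cc.
all_c0_are_control = all(
    unicodedata.category(chr(code)) == "Cc" for code in range(0x20)
)
print(all_c0_are_control)
```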
According to PropList.txt, the characters matched by \p{Whitespace} are:
0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
0020 ; White_Space # Zs SPACE
0085 ; White_Space # Cc <control-0085>
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
180E ; White_Space # Zs MONGOLIAN VOWEL SEPARATOR
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
2028 ; White_Space # Zl LINE SEPARATOR
2029 ; White_Space # Zp PARAGRAPH SEPARATOR
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
The blank class is therefore this set minus the following characters:
LF (line feed) | U+000A |
VT (vertical tab) | U+000B |
FF (form feed) | U+000C |
CR (carriage return) | U+000D |
NEL (next line) | U+0085 |
gc=Line_Separator | U+2028 |
gc=Paragraph_Separator | U+2029 |
As a result, the characters in Unicode's blank class are:
0009 ; White_Space # Cc HORIZONTAL TAB
0020 ; White_Space # Zs SPACE
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
180E ; White_Space # Zs MONGOLIAN VOWEL SEPARATOR
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
The print class takes the blank class with the characters classified as cntrl removed, that is, without the tab character:
0020 ; White_Space # Zs SPACE
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
180E ; White_Space # Zs MONGOLIAN VOWEL SEPARATOR
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
and combines it with the characters that belong to the graph class.
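The derivation above can be replayed as a small Python sketch. The White_Space ranges are copied from the PropList.txt excerpt quoted in this post (newer Unicode versions may differ, for example for U+180E). General categories come from the standard unicodedata module. The names WHITE_SPACE, VERTICAL, BLANK, is_cntrl, is_graph and is_print are made up for illustration and are not part of any regex engine.

```python
import unicodedata

# White_Space ranges as listed in the PropList.txt excerpt in this post.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x180E, 0x180E), (0x2000, 0x200A), (0x2028, 0x2028),
    (0x2029, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000),
]
WHITE_SPACE = {c for lo, hi in WHITE_SPACE_RANGES for c in range(lo, hi + 1)}

# LF, VT, FF, CR, NEL, Line_Separator, Paragraph_Separator
VERTICAL = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}
BLANK = WHITE_SPACE - VERTICAL  # "horizontal" whitespace


def is_cntrl(ch):
    # \p{gc=Control}
    return unicodedata.category(ch) == "Cc"


def is_graph(ch):
    # [^\p{space}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}]
    return (ord(ch) not in WHITE_SPACE
            and unicodedata.category(ch) not in ("Cc", "Cs", "Cn"))


def is_print(ch):
    # \p{graph} \p{blank} -- \p{cntrl}
    return (is_graph(ch) or ord(ch) in BLANK) and not is_cntrl(ch)


for ch in "\x09\x0a\x0d a":
    print(ord(ch), int(is_print(ch)))
# 9 0
# 10 0
# 13 0
# 32 1
# 97 1
```

Under this reading of the table, tab, LF and CR are not print, which matches the Perl and C results rather than the Oniguruma Unicode behavior.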
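POSIX, on the other hand, does not explicitly say whether all of U+0000-U+001F must be treated as control characters (as far as I could find).
Even so, I set out to investigate a PHP bug and ended up diving far deeper than I expected...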