
Apparently [[:print:]] in Oniguruma + Unicode is not POSIX-style

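Addendum: just to be clear, this post is not about which behavior is "correct".
Addendum 2: fixed a place where "Technical Report" had been written as "Annex".
Addendum 3: corrected a part where I had slightly misunderstood something. The conclusion is unchanged.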

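After seeing this tweet from id:ockeghem,
"Whether [:print:] in POSIX regular expressions matches newlines and tabs differs between Perl and PHP. Perl does not match them; PHP does. Which one is right?", I dug into it, and it turns out that

this time it was not PHP's fault,

which is a relief in several ways.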

First, let's try it

PHP:

<?php
foreach (str_split("\x09\x0a\x0d a") as $c) {
    var_dump(ord($c));

    echo "preg_match(): ";
    var_dump(preg_match("([[:print:]])u", $c));

    echo "ereg(): ";
    var_dump(ereg("[[:print:]]", $c));

    mb_regex_encoding("ASCII");
    echo "[ASCII] mb_ereg(): ";
    var_dump(mb_ereg("[[:print:]]", $c));

    mb_regex_encoding("UTF-8");
    echo "[UTF-8] mb_ereg(): ";
    var_dump(mb_ereg("[[:print:]]", $c));

    mb_regex_encoding("UTF-16");
    echo "[UTF-16] mb_ereg(): ";
    var_dump(mb_ereg("\0[\0[\0:\0p\0r\0i\0n\0t\0:\0]\0]", "\0$c"));
}
?>

Ruby:

"\x09\x0a\x0d a".each_char { |c| p c[0], c =~ /[[:print:]]/ }

Perl:

for (split(//, "\x09\x0a\x0d a")) { print ord, " ", 0+/[[:print:]]/, "\n" }

C:

echo '#include<stdio.h>~#include<regex.h>~int main(){regex_t r;regcomp(&r,"^[[:print:]]",0);for(char*p="\x09\x0a\x0d a";*p;++p)printf("%d %d\n",*p,!regexec(&r,p,0,0,0));}' | tr '~' '\n' | gcc -std=c99 -xc - && ./a.out

Python does not seem to have POSIX-style regex character classes, though:

for c in "\x09\x0a\x0d a": import re; print ord(c), re.match('[[:print:]]', c)
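
As a rough illustration of what the Python line above ends up testing (a sketch, assuming Python 3 and the standard re module): re has no POSIX bracket classes, so the pattern is read as an ordinary set containing the characters [ : p r i n t, followed by a literal ]. The names PATTERN and compiled are only for illustration.

```python
import re
import warnings

PATTERN = "[[:print:]]"

with warnings.catch_warnings():
    # Newer Python versions warn about a "possible nested set" here.
    warnings.simplefilter("ignore")
    compiled = re.compile(PATTERN)

# Parsed as the set "[[:print:]" followed by a literal "]".
print(compiled.match("a"))   # one character alone cannot satisfy set + "]"
print(compiled.match("p]"))  # "p" is in the set, then the literal "]"
print(compiled.match(":]"))  # ":" is in the set as well
```

So every single-character input in the test loop fails to match, which says nothing about printability.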

Results

Results for PHP (5.2.9):

int(9)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(10)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(13)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(32)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)
int(97)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)
[UTF-16] mb_ereg(): int(1)

Results for PHP (4.4.9) (UTF-16 is omitted because it is not supported):

int(9)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): bool(false)
int(10)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): bool(false)
int(13)
preg_match(): int(0)
ereg(): bool(false)
[ASCII] mb_ereg(): bool(false)
[UTF-8] mb_ereg(): bool(false)
int(32)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)
int(97)
preg_match(): int(1)
ereg(): int(1)
[ASCII] mb_ereg(): int(1)
[UTF-8] mb_ereg(): int(1)

Ruby results:

9
nil
10
nil
13
nil
32
0
97
0

Perl results:

9 0
10 0
13 0
32 1
97 1

C results:

9 0
10 0
13 0
32 1
97 1

So only mb_ereg() in PHP 5.2.9 with UTF-8 or UTF-16 behaves differently from everything else.

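Is this Oniguruma's specification?

Since only the PHP 5 series behaves differently, the cause can only lie in Oniguruma (鬼車), the regex engine behind mb_ereg() in PHP 5. So I looked at the Oniguruma documentation: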

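For non-Unicode encodings:

alnum    alphanumeric characters
alpha    alphabetic characters
ascii    0 - 127
blank    \t, \x20
cntrl    -
digit    0-9
graph    includes all multibyte characters
lower    -
print    includes all multibyte characters
punct    -
space    \t, \n, \v, \f, \r, \x20
upper    -
xdigit   0-9, a-f, A-F
word     alphanumerics, "_", and multibyte characters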

For Unicode:

alnum    Letter | Mark | Decimal_Number
alpha    Letter | Mark
ascii    0000 - 007F
blank    Space_Separator | 0009
cntrl    Control | Format | Unassigned | Private_Use | Surrogate
digit    Decimal_Number
graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
lower    Lowercase_Letter
print    [[:graph:]] | [[:space:]]
punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation | Final_Punctuation | Initial_Punctuation | Other_Punctuation | Open_Punctuation
space    Space_Separator | Line_Separator | Paragraph_Separator | 0009 | 000A | 000B | 000C | 000D | 0085
upper    Uppercase_Letter
xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066 (0-9, a-f, A-F)
word     Letter | Mark | Decimal_Number | Connector_Punctuation
POSIX bracket - Oniguruma Regular Expressions Version 5.9.1 2007/09/05

As shown, the POSIX character class definitions differ between non-Unicode encodings (character sets) and Unicode.
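
The Unicode rows for space, graph, and print can be modeled in a few lines. This is only a sketch, assuming Python 3 and its unicodedata module (whose Unicode version differs from the one Oniguruma 5.9.1 shipped with). The function names onig_space, onig_graph, and onig_print are made up for illustration and are not Oniguruma APIs.

```python
import unicodedata

# Code points listed explicitly in the [:space:] row of the table.
EXTRA_SPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}

def onig_space(ch):
    # Space_Separator | Line_Separator | Paragraph_Separator | 0009..000D | 0085
    return unicodedata.category(ch) in ("Zs", "Zl", "Zp") or ord(ch) in EXTRA_SPACE

def onig_graph(ch):
    # [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
    return not onig_space(ch) and unicodedata.category(ch) not in ("Cc", "Cn", "Cs")

def onig_print(ch):
    # [[:graph:]] | [[:space:]]
    return onig_graph(ch) or onig_space(ch)

onig_results = [(ord(c), onig_print(c)) for c in "\x09\x0a\x0d a"]
print(onig_results)
```

Because print is defined as graph plus space, and space explicitly lists U+0009 to U+000D, tab, LF, and CR all come out as printable in this model, in line with the UTF-8 results from mb_ereg() above.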

Furthermore, the HISTORY file bundled in the archive says:

2004/11/08: [spec] fix Unicode character types.
                   0x00ad (soft hyphen) should be [:cntrl:] and [:space:] type.
                   [0x0009..0x000d], 0x0085 should be [:print:] type.
                   0x00ad should not be [:punct:] type.

This shows that the specification was changed at this point, which makes it clear that the behavior is intentional.

By the way, what does POSIX say?

According to The Open Group's online documentation:



  1. A character class expression represents the set of characters belonging to a character class, as defined in the LC_CTYPE category in the current locale. All character classes specified in the current locale will be recognised. A character class expression is expressed as a character class name enclosed within bracket-colon ([: :]) delimiters.


    The following character class expressions are supported in all locales:


    [:alnum:]   [:cntrl:]   [:lower:]   [:space:]
    [:alpha:]   [:digit:]   [:print:]   [:upper:]
    [:blank:]   [:graph:]   [:punct:]   [:xdigit:]

    In addition, character class expressions of the form:


    [:name:]

    are recognised in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category.



Regular Expressions - The Single UNIX ® Specification, Version 2

The LC_CTYPE category itself is defined in the Locale document.

upper
Define characters to be classified as upper-case letters. In the POSIX locale, the 26 upper-case letters are included: {A B C D E F G H I J K L M N O P Q R S T U V W X Y Z} In a locale definition file, no character specified for the keywords cntrl, digit, punct or space can be specified. The upper-case letters A to Z, as defined in Character Set Description File (the portable character set), are automatically included in this class.
lower
Define characters to be classified as lower-case letters. In the POSIX locale, the 26 lower-case letters are included: {a b c d e f g h i j k l m n o p q r s t u v w x y z} In a locale definition file, no character specified for the keywords cntrl, digit, punct or space can be specified. The lower-case letters a to z of the portable character set are automatically included in this class.
alpha
Define characters to be classified as letters. In the POSIX locale, all characters in the classes upper and lower are included. In a locale definition file, no character specified for the keywords cntrl, digit, punct or space can be specified. Characters classified as either upper or lower are automatically included in this class.
digit
Define the characters to be classified as numeric digits. In the POSIX locale, only: { 0 1 2 3 4 5 6 7 8 9 } are included. In a locale definition file, only the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 can be specified, and in contiguous ascending sequence by numerical value. The digits 0 to 9 of the portable character set are automatically included in this class. The definition of character class digit requires that only ten characters - the ones defining digits - can be specified; alternative digits (for example, Hindi or Kanji) cannot be specified here. However, the encoding may vary if an implementation supports more than one encoding.
space
Define characters to be classified as white-space characters. In the POSIX locale, at a minimum, the characters space, form-feed, newline, carriage-return, tab and vertical-tab are included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, graph or xdigit can be specified. The characters space, form-feed, newline, carriage-return, tab and vertical-tab of the portable character set, and any characters included in the class blank are automatically included in this class.
cntrl
Define characters to be classified as control characters. In the POSIX locale, no characters in classes alpha or print are included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, punct, graph, print or xdigit can be specified.
punct
Define characters to be classified as punctuation characters. In the POSIX locale, neither the space character nor any characters in classes alpha, digit or cntrl are included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit or as the space character can be specified.
graph
Define characters to be classified as printable characters, not including the space character. In the POSIX locale, all characters in classes alpha, digit and punct are included; no characters in class cntrl are included. In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit and punct are automatically included in this class. No character specified for the keyword cntrl can be specified.
print
Define characters to be classified as printable characters, including the space character. In the POSIX locale, all characters in class graph are included; no characters in class cntrl are included. In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct and the space character are automatically included in this class. No character specified for the keyword cntrl can be specified.
Locale (excerpt) - The Single UNIX ® Specification, Version 2

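Restating just the description of print:

It defines the characters classified as printable characters, including the space character. In the POSIX locale, all characters belonging to graph are included, and no characters in class cntrl are included. In a locale definition file, characters classified as upper, lower, alpha, digit, xdigit, or punct, plus the space character, are automatically included in this class. Characters specified for cntrl cannot be specified here.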

There was also an interesting note below this description:

The space character, which is part of the space and blank classes, cannot belong to punct or graph, but automatically belongs to the print class. Other space or blank characters can be classified as any of punct, graph or print.

Locale (excerpt) - The Single UNIX ® Specification, Version 2

  1. The space character (" ") is part of the space and blank classes.
  2. The space character never belongs to the punct or graph classes.
  3. The space character automatically belongs to the print class.
  4. Other space or blank characters can be classified as punct, graph, or print.

So the crux appears to be what gets included in cntrl.
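
To see how one concrete C library divides these characters between cntrl and print, the standard isprint() and iscntrl() functions can be called directly. This is a minimal sketch, assuming Python 3 on a Unix-like system where ctypes.CDLL(None) exposes the C library (it will not work on Windows). The helper name classify is made up for illustration.

```python
import ctypes

# Assumption: a Unix-like system where CDLL(None) gives access to libc.
libc = ctypes.CDLL(None)

def classify(code):
    # isprint()/iscntrl() return nonzero when the character is in the class.
    return bool(libc.isprint(code)), bool(libc.iscntrl(code))

libc_results = {code: classify(code) for code in (0x09, 0x0A, 0x0D, 0x20, 0x61)}
for code, (is_print, is_cntrl) in libc_results.items():
    print(code, "print" if is_print else "-", "cntrl" if is_cntrl else "-")
```

Whatever a given implementation places in cntrl is thereby excluded from print, which is the relationship the quoted text requires.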

Unicode Standard Technical Report #18

Unicode Standard TR #18 defines Unicode Regular Expressions. Its section "Annex C: Compatibility Properties" contains a table that maps POSIX character classes to Unicode character classes.

Property | Standard Recommendation | POSIX Compatible | Comments
cntrl | \p{gc=Control} | \p{gc=Control} | The characters in \p{gc=Format} share some, but not all aspects of control characters. Many format characters are required in the representation of plain text.
print | \p{graph} \p{blank} -- \p{cntrl} | \p{graph} \p{blank} -- \p{cntrl} | Includes graph and space-like characters.
space / \s | \p{Whitespace} | \p{Whitespace} | See PropList [UCD] for the definition of Whitespace.
blank | \p{Whitespace} -- [\N{LF} \N{VT} \N{FF} \N{CR} \N{NEL} \p{gc=Line_Separator} \p{gc=Paragraph_Separator}] | \p{Whitespace} -- [\N{LF} \N{VT} \N{FF} \N{CR} \N{NEL} \p{gc=Line_Separator} \p{gc=Paragraph_Separator}] | "horizontal" whitespace.
graph | [^\p{space}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}] | [^\p{space}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}] | Warning: the set to the left is defined by excluding space, controls, and so on with ^.
Annex C: Compatibility Properties - Unicode Standard Technical Report #18 Unicode Regular Expressions

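In the table, \p{gc=...} matches characters whose Unicode General Category property is "...", and \N{...} denotes the character with the given character name (for example, \N{LATIN CAPITAL LETTER A}). "--" is set difference: the left-hand side minus the right-hand side.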

According to UnicodeData.txt, U+0000 through U+001F are all classified in the category Cc (control characters).
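
This is easy to check against a local copy of the Unicode Character Database. A small sketch, assuming Python 3, whose unicodedata module is built from UnicodeData.txt (possibly a newer Unicode version than the one consulted for this post):

```python
import unicodedata

# Collect the General Category of every code point from U+0000 to U+001F.
c0_categories = {unicodedata.category(chr(cp)) for cp in range(0x00, 0x20)}
print(c0_categories)

# Tab, LF, and CR individually.
for cp in (0x09, 0x0A, 0x0D):
    print("U+%04X" % cp, unicodedata.category(chr(cp)))
```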

According to PropList.txt, \p{Whitespace} covers the following characters:

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

The blank class is the set above with the following characters removed:

LF (line feed) U+000A
VT (vertical tab) U+000B
FF (form feed) U+000C
CR (carriage return) U+000D
NEL (next line) U+0085
gc=Line_Separator U+2028
gc=Paragraph_Separator U+2029

As a result, the Unicode blank class contains the following characters:

0009          ; White_Space # Cc       HORIZONTAL TAB
0020          ; White_Space # Zs       SPACE
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

For the print class, first take the blank class and remove the characters classified as cntrl, that is, the tab character:

0020          ; White_Space # Zs       SPACE
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

Adding the characters that belong to the graph class to this set gives the print class.
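
The derivation above can be written out as set operations. This is a sketch under stated assumptions: Python 3 with unicodedata for General Category lookups, and a White_Space set copied by hand from the PropList.txt excerpt quoted in this post (newer versions of PropList.txt may differ). The tr18_* names are made up for illustration.

```python
import unicodedata

# White_Space code points, copied from the PropList.txt excerpt in this post.
WHITE_SPACE = (
    set(range(0x0009, 0x000E))
    | {0x0020, 0x0085, 0x00A0, 0x1680, 0x180E}
    | set(range(0x2000, 0x200B))
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)
# LF, VT, FF, CR, NEL, Line_Separator, Paragraph_Separator.
VERTICAL = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}
# blank = Whitespace -- [vertical whitespace]
BLANK = WHITE_SPACE - VERTICAL

def tr18_cntrl(ch):
    # \p{gc=Control}
    return unicodedata.category(ch) == "Cc"

def tr18_graph(ch):
    # [^\p{space}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}]
    return (ord(ch) not in WHITE_SPACE
            and unicodedata.category(ch) not in ("Cc", "Cs", "Cn"))

def tr18_print(ch):
    # \p{graph} \p{blank} -- \p{cntrl}
    return (tr18_graph(ch) or ord(ch) in BLANK) and not tr18_cntrl(ch)

tr18_results = [(ord(c), tr18_print(c)) for c in "\x09\x0a\x0d a"]
print(tr18_results)
```

Under this reading, tab, LF, and CR are not in print while the space character and "a" are, which matches the Perl and C results rather than the Oniguruma Unicode behavior.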

POSIX, on the other hand, does not explicitly state whether all of U+0000-U+001F should be treated as control characters (as far as I could find).

Still, I was supposed to be looking into a PHP bug, and I ended up diving far deeper than I expected...