Perlの正規表現のバグ? (後編)

追記: ケース4についての記述がなかったので言及。

前回の調査で、[[:print:]] と \p{IsPrint} は実は等価でないということが分かったわけだけど、これだけではまだ再現できなかった理由、つまり、

use encoding 'utf-8';
$_ = "\t";
# 1. マッチしない
printf("%d\n", /[[:print:]]/);
# 2. マッチする
printf("%d\n", /^[[:print:]]/);
# 3. マッチする
printf("%d\n", /[[:print:]]$/);
# 4. マッチする
printf("%d\n", /^[[:print:]]$/);
# 5. マッチしない
printf("%d\n", /[[:print:]]+/);
# 6. マッチする
printf("%d\n", /^[[:print:]]+/);
# 7. マッチしない
printf("%d\n", /[[:print:]]+$/);

これらに違いが出る理由を説明できてない。なのでこの辺を見ていこうと思う。

まず、上記を "use re 'Debug', 'COMPILE', 'EXECUTE'" つきで走らせた結果を表形式にまとめたのが以下。






























































パタン マッチするか? コンパイル結果 マッチ結果
1

[[:print:]]


×

Compiling REx "[[:print:]]"
Starting first pass (sizing)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]<   |    |          clas   
Required size 13 nodes
Starting second pass (creation)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]<   |    |          clas   
 ><             |  14|    tail~ ANYOF[ -~+utf8::IsPrint] (1) -> END
first:>  1: ANYOF[ -~+utf8::IsPrint] (13)
first at 1
Peep>  1: ANYOF[ -~+utf8::IsPrint] (13)
minlen: 1 r->minlen:0
Final program:
   1: ANYOF[ -~+utf8::IsPrint] (13)
  13: END (0)
stclass ANYOF[ -~+utf8::IsPrint] minlen 1 
r->extflags: UTF8 

Matching REx "[[:print:]]" against "%t"
UTF-8 pattern and string...
Matching stclass ANYOF[ -~+utf8::IsPrint] against "%t" (1 chars)
Contradicts stclass... [regexec_flags]
Match failed

2

[[:print:]]



Compiling REx "^[[:print:]]"
Starting first pass (sizing)
 >^[[:print:... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[[:print:]... |   2|      piec   
                |    |        atom   
 >[:print:]]<   |    |          clas   
Required size 14 nodes
Starting second pass (creation)
 >^[[:print:... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[[:print:]... |   2|      piec   
                |    |        atom   
 >[:print:]]<   |    |          clas   
 ><             |  14|      tail~ BOL (1) -> ANYOF
                |  15|    tail~ BOL (1)  
                |    |        ~ ANYOF[ -~+utf8::IsPrint] (2) -> END
first:>  1: BOL (2)
first:>  2: ANYOF[ -~+utf8::IsPrint] (14)
first at 2
Peep>  1: BOL (2)
Peep>  2: ANYOF[ -~+utf8::IsPrint] (14)
minlen: 1 r->minlen:0
Final program:
   1: BOL (2)
   2: ANYOF[ -~+utf8::IsPrint] (14)
  14: END (0)
stclass ANYOF[ -~+utf8::IsPrint] anchored(BOL) minlen 1 
r->extflags: ANCH_BOL UTF8 

Matching REx "^[[:print:]]" against "%t"
UTF-8 pattern and string...
regmatch start
   0 <> <%t>                 |  1:BOL(2)
   0 <> <%t>                 |  2:ANYOF[ -~+utf8::IsPrint](14)
   1 <%t> <>                 | 14:END(0)
Match successful!


3

[[:print:]]$



Compiling REx "[[:print:]]$"
Starting first pass (sizing)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 >$<            |  13|      piec   
                |    |        atom   
Required size 14 nodes
Starting second pass (creation)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 >$<            |  13|      piec   
                |    |        atom   
 ><             |  14|      tail~ ANYOF[ -~+utf8::IsPrint] (1) -> EOL
                |  15|    tail~ ANYOF[ -~+utf8::IsPrint] (1)  
                |    |        ~ EOL (13) -> END
first:>  1: ANYOF[ -~+utf8::IsPrint] (13)
first at 1
Peep>  1: ANYOF[ -~+utf8::IsPrint] (13)
Peep> 13: EOL (14)
minlen: 1 r->minlen:0
Final program:
   1: ANYOF[ -~+utf8::IsPrint] (13)
  13: EOL (14)
  14: END (0)
anchored ""$ at 1 stclass ANYOF[ -~+utf8::IsPrint] minlen 1 
r->extflags: UTF8 

Matching REx "[[:print:]]$" against "%t"
UTF-8 pattern and string...
regmatch start
   0 <> <%t>                 |  1:ANYOF[ -~+utf8::IsPrint](13)
   1 <%t> <>                 | 13:EOL(14)
   1 <%t> <>                 | 14:END(0)
Match successful!


4

^[[:print:]]$



Compiling REx "^[[:print:]]$"
Starting first pass (sizing)
 >^[[:print:... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[[:print:]... |   2|      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 >$<            |  14|      piec   
                |    |        atom   
Required size 15 nodes
Starting second pass (creation)
 >^[[:print:... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[[:print:]... |   2|      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 >$<            |  14|      tail~ BOL (1) -> ANYOF
                |    |      piec   
                |    |        atom   
 ><             |  15|      tail~ ANYOF[ -~+utf8::IsPrint] (2) -> EOL
                |  16|    tail~ BOL (1)  
                |    |        ~ ANYOF[ -~+utf8::IsPrint] (2)  
                |    |        ~ EOL (14) -> END
first:>  1: BOL (2)
first:>  2: ANYOF[ -~+utf8::IsPrint] (14)
first at 2
Peep>  1: BOL (2)
Peep>  2: ANYOF[ -~+utf8::IsPrint] (14)
Peep> 14: EOL (15)
minlen: 1 r->minlen:0
Final program:
   1: BOL (2)
   2: ANYOF[ -~+utf8::IsPrint] (14)
  14: EOL (15)
  15: END (0)
anchored ""$ at 1 stclass ANYOF[ -~+utf8::IsPrint] anchored(BOL) minlen 1 
r->extflags: ANCH_BOL UTF8 

Matching REx "^[[:print:]]$" against "%t"
UTF-8 pattern and string...
regmatch start
   0 <> <%t>                 |  1:BOL(2)
   0 <> <%t>                 |  2:ANYOF[ -~+utf8::IsPrint](14)
   1 <%t> <>                 | 14:EOL(15)
   1 <%t> <>                 | 15:END(0)
Match successful!


5

[[:print:]]+


×

Compiling REx "[[:print:]]+"
Starting first pass (sizing)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 ><             |  13|        inst - PLUS
Required size 14 nodes
Starting second pass (creation)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 ><             |  13|        inst - PLUS
                |  15|    tail~ PLUS (1) -> END
first:>  2: ANYOF[ -~+utf8::IsPrint] (0)
first at 2
Peep>  1: PLUS (14)
minlen: 1 r->minlen:0
Final program:
   1: PLUS (14)
   2:   ANYOF[ -~+utf8::IsPrint] (0)
  14: END (0)
stclass ANYOF[ -~+utf8::IsPrint] plus minlen 1 
r->extflags: UTF8

Matching REx "[[:print:]]+" against "%t"
UTF-8 pattern and string...
Matching stclass ANYOF[ -~+utf8::IsPrint] against "%t" (1 chars)
Contradicts stclass... [regexec_flags]
Match failed


6

^[[:print:]]+



Compiling REx "^[[:print:]]+"
Starting first pass (sizing)
 >^[[:print:... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[[:print:]... |   2|      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 ><             |  14|        inst - PLUS
Required size 15 nodes
Starting second pass (creation)
 >^[[:print:... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[[:print:]... |   2|      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 ><             |  14|        inst - PLUS
                |  15|      tail~ BOL (1) -> PLUS
                |  16|    tail~ BOL (1)  
                |    |        ~ PLUS (2) -> END
first:>  1: BOL (2)
first:>  2: PLUS (15)
first at 2
Peep>  1: BOL (2)
Peep>  2: PLUS (15)
  Peep>  3: ANYOF[ -~+utf8::IsPrint] (0)
synthetic stclass "ANYOF[ -~+utf8::IsPrint]".
minlen: 1 r->minlen:0
Final program:
   1: BOL (2)
   2: PLUS (15)
   3:   ANYOF[ -~+utf8::IsPrint] (0)
  15: END (0)
stclass ANYOF[ -~+utf8::IsPrint] anchored(BOL) minlen 1 
r->extflags: ANCH_BOL UTF8

Matching REx "^[[:print:]]+" against "%t"
UTF-8 pattern and string...
regmatch start
   0 <> <%t>                 |  1:BOL(2)
   0 <> <%t>                 |  2:PLUS(15)
                                  ANYOF[ -~\11-\15 -~\302\205\302\240-\303\277...+utf8::IsPrint] can match 1 times out of 2147483647...
   1 <%t> <>                 | 15:  END(0)
Match successful!


7

[[:print:]]+$


×

Compiling REx "[[:print:]]+$"
Starting first pass (sizing)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 >$<            |  13|        inst - PLUS
                |  14|      piec   
                |    |        atom   
Required size 15 nodes
Starting second pass (creation)
 >[[:print:]... |   1|  reg    
                |    |    brnc   
                |    |      piec   
                |    |        atom   
 >[:print:]]... |    |          clas   
 >$<            |  13|        inst - PLUS
                |  14|      piec   
                |    |        atom   
 ><             |  15|      tail~ PLUS (1) -> EOL
                |  16|    tail~ PLUS (1)  
                |    |        ~ EOL (14) -> END
first:>  2: ANYOF[ -~+utf8::IsPrint] (0)
first at 2
Peep>  1: PLUS (14)
Peep> 14: EOL (15)
minlen: 1 r->minlen:0
Final program:
   1: PLUS (14)
   2:   ANYOF[ -~+utf8::IsPrint] (0)
  14: EOL (15)
  15: END (0)
floating ""$ at 1..2147483647 (checking floating) stclass ANYOF[ -~+utf8::IsPrint] plus minlen 1 
r->extflags: UTF8 USE_INTUIT_NOML USE_INTUIT_ML INTUIT_TAIL 

Guessing start of match in sv for REx "[[:print:]]+$" against "%t"
UTF-8 pattern and string...
Found floating substr ""$ at offset 1...
start_shift: 1 check_at: 1 s: 0 endpos: 1
Could not match STCLASS...
Match rejected by optimizer


これから次のようなことが分かる。

  1. コンパイル結果に anchored と出ている場合にマッチしている
  2. パタン 7だけはパタン 1, 5 とは違う理由でマッチしていない

anchored となっている場合のマッチはどうなってるの?

「Final program:」以下を表示している関数、regdump() を見てみると、

Perl_regdump(pTHX_ const regexp *r)
{
    ... (略) ...
    (void)dumpuntil(r, ri->program, ri->program + 1, NULL, NULL, sv, 0, 0);

    /* Header fields of interest. */
    if (r->anchored_substr) {
        RE_PV_QUOTED_DECL(s, 0, dsv, SvPVX_const(r->anchored_substr), 
            RE_SV_DUMPLEN(r->anchored_substr), 30);
        PerlIO_printf(Perl_debug_log,
                      "anchored %s%s at %"IVdf" ",
                      s, RE_SV_TAIL(r->anchored_substr),
                      (IV)r->anchored_offset);
    } else if (r->anchored_utf8) {
        RE_PV_QUOTED_DECL(s, 1, dsv, SvPVX_const(r->anchored_utf8), 
            RE_SV_DUMPLEN(r->anchored_utf8), 30);
        PerlIO_printf(Perl_debug_log,
                      "anchored utf8 %s%s at %"IVdf" ",
                      s, RE_SV_TAIL(r->anchored_utf8),
                      (IV)r->anchored_offset);
    }
    ... (略) ...          
    if (ri->regstclass) {
        regprop(r, sv, ri->regstclass);
        PerlIO_printf(Perl_debug_log, "stclass %s ", SvPVX_const(sv));
    }
    if (r->extflags & RXf_ANCH) {
        PerlIO_printf(Perl_debug_log, "anchored");
        if (r->extflags & RXf_ANCH_BOL)
            PerlIO_printf(Perl_debug_log, "(BOL)");
        if (r->extflags & RXf_ANCH_MBOL)
            PerlIO_printf(Perl_debug_log, "(MBOL)");
        if (r->extflags & RXf_ANCH_SBOL)
            PerlIO_printf(Perl_debug_log, "(SBOL)");
        if (r->extflags & RXf_ANCH_GPOS)
            PerlIO_printf(Perl_debug_log, "(GPOS)");
        PerlIO_putc(Perl_debug_log, ' ');
    }
    if (r->extflags & RXf_GPOS_SEEN)
	PerlIO_printf(Perl_debug_log, "GPOS:%"UVuf" ", (UV)r->gofs);
    if (r->intflags & PREGf_SKIP)
	PerlIO_printf(Perl_debug_log, "plus ");
    if (r->intflags & PREGf_IMPLICIT)
	PerlIO_printf(Perl_debug_log, "implicit ");
    PerlIO_printf(Perl_debug_log, "minlen %"IVdf" ", (IV)r->minlen);
    if (r->extflags & RXf_EVAL_SEEN)
	PerlIO_printf(Perl_debug_log, "with eval ");
    PerlIO_printf(Perl_debug_log, "\n");
    DEBUG_FLAGS_r(regdump_extflags("r->extflags: ",r->extflags));            
    ... (略) ...
}

のようになっていて、

  • ((regexp *)r)->anchored_substr が NULL でない
  • ((regexp *)r)->extflags & RXf_ANCH が 0 でない

場合に "anchored" と表示されることが分かる。

で、次に、パタンがこのような場合にマッチャーでどういうことが起こるのかを見てみる。前回は読み飛ばしていた条件分岐をrevisitする必要があるみたい。

    /* Simplest case:  anchored match need be tried only once. */
    /*  [unless only anchor is BOL and multiline is set] */
    if (prog->extflags & (RXf_ANCH & ~RXf_ANCH_GPOS)) {

        ... (ケース1) ...

    } else if (RXf_GPOS_CHECK == (prog->extflags & RXf_GPOS_CHECK)) {

        ... (ケース2) ...

    }

    /* Messy cases:  unanchored match. */
    if ((prog->anchored_substr || prog->anchored_utf8) && prog->intflags & PREGf_SKIP) {

        ... (ケース3) ...

    } else if (prog->anchored_substr != NULL
	      || prog->anchored_utf8 != NULL
	      || ((prog->float_substr != NULL || prog->float_utf8 != NULL)
		  && prog->float_max_offset < strend - s)) {

        ... (ケース4) ...

    } else if ( (c = progi->regstclass) ) {

        ... (ケース5) ...

    } else {

        ... (ケース6) ...

    }

これ。ちなみに、前回の場合は (ケース6) に該当していた。フラグの値から察すると、

パタン ケース
^[[:print:]] 1
[[:print:]]$ 4
^[[:print:]]$ 1
^[[:print:]]+ 1

となるはずだ。

まずは、ケース1を見てみよう。ケース1の場合の処理は以下のようになっている。

        if (s == startpos && regtry(&reginfo, &startpos))
            goto got_it;
        else if (multiline || (prog->intflags & PREGf_IMPLICIT)
                 || (prog->extflags & RXf_ANCH_MBOL)) /* XXXX SBOL? */
        {
            char *end;

            if (minlen)
                dontbother = minlen - 1;
            end = HOP3c(strend, -dontbother, strbeg) - 1;
            /* for multiline we only have to try after newlines */
            if (prog->check_substr || prog->check_utf8) {
                if (s == startpos)
                    goto after_try;
                while (1) {
                    if (regtry(&reginfo, &s))
                        goto got_it;
                  after_try:
                    if (s > end)
                        goto phooey;
                    if (prog->extflags & RXf_USE_INTUIT) {
                        s = re_intuit_start(prog, sv, s + 1, strend, flags, NULL);
                        if (!s)
                            goto phooey;
                    }
                    else
                        s++;
                }                
            } else {
                if (s > startpos)
                    s--;
                while (s < end) {
                    if (*s++ == '\n') {        /* don't need PL_utf8skip here */
                        if (regtry(&reginfo, &s))
                            goto got_it;
                    }
                }                
            }
        }
        goto phooey;

もうこうなるとどこで分岐するか分からないので、デバッガを使う。

とりあえず最初のregtry() が呼び出されるか知りたいので

(gdb) break S_regtry

と素直にやればいいように思える。ところが、

use encoding 'utf-8';
$_ = "\t";
/^[[:print:]]/;

以上のようなスクリプトを「gdb --args perl /tmp/test.pl」のようにしてデバッガ上で走らせたときにどうなるかというと、

GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
Warning: /usr/src/glibc-2.7/glibc-2.7/wcsmbs: No such file or directory.
(gdb) break S_regtry
Function "S_regtry" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 1 (S_regtry) pending.
(gdb) run
Starting program: /home/moriyoshi/opt/perl-5.10.0/bin/perl /tmp/test.pl

Breakpoint 1, S_regtry (reginfo=0x7fff933b6d00, startpos=0x7fff933b6d60)
    at regexec.c:2205
2205    {
(gdb) bt
#0  S_regtry (reginfo=0x7fff933b6d00, startpos=0x7fff933b6d60)
    at regexec.c:2205
#1  0x00007fbb8af3a044 in Perl_regexec_flags (prog=0xa9d9d0,
    stringarg=0xa98eca "/strict.pm", strend=0xa98ed4 "",
    strbeg=0xa98ea0 "/home/moriyoshi/opt/perl-5.10.0/lib/5.10.0/strict.pm",
    minend=<value optimized out>, sv=0xa95590, data=0x0, flags=3)
    at regexec.c:2034
#2  0x00007fbb8aec9f4b in Perl_pp_match () at pp_hot.c:1330
#3  0x00007fbb8ae9db34 in Perl_runops_debug () at dump.c:1931
#4  0x00007fbb8aebcc5f in Perl_call_sv (sv=0xa95350, flags=6) at perl.c:2646
#5  0x00007fbb8aebd1c4 in Perl_call_list (oldscope=14, paramList=0xa952a8)
    at perl.c:5133
#6  0x00007fbb8ae6d2d3 in S_process_special_blocks (
    fullname=<value optimized out>, gv=0xa95410, cv=0xa95350) at op.c:5631
#7  0x00007fbb8ae7aa1f in Perl_newATTRSUB (floor=305, o=<value optimized out>,
    proto=<value optimized out>, attrs=0x0, block=0xa964d0) at op.c:5604
#8  0x00007fbb8ae798fa in Perl_utilize (aver=1, floor=305, version=0xa96240,
    idop=0xa95f90, arg=0x0) at op.c:3757
#9  0x00007fbb8ae6c198 in Perl_yyparse () at perly.y:654
#10 0x00007fbb8aefdf1b in S_doeval (gimme=0, startop=0x0, outside=0x0,
    seq=<value optimized out>) at pp_ctl.c:2916
#11 0x00007fbb8af0042a in Perl_pp_require () at pp_ctl.c:3520
#12 0x00007fbb8ae9db34 in Perl_runops_debug () at dump.c:1931
#13 0x00007fbb8aebcc5f in Perl_call_sv (sv=0xa950b0, flags=6) at perl.c:2646
#14 0x00007fbb8aebd1c4 in Perl_call_list (oldscope=8, paramList=0xa86680)
    at perl.c:5133
#15 0x00007fbb8ae6d2d3 in S_process_special_blocks (
    fullname=<value optimized out>, gv=0xa95170, cv=0xa950b0) at op.c:5631
#16 0x00007fbb8ae7aa1f in Perl_newATTRSUB (floor=166, o=<value optimized out>,
    proto=<value optimized out>, attrs=0x0, block=0xa96d90) at op.c:5604
#17 0x00007fbb8ae798fa in Perl_utilize (aver=1, floor=166, version=0xa96b00,
    idop=0xa96a60, arg=0x0) at op.c:3757
#18 0x00007fbb8ae6c198 in Perl_yyparse () at perly.y:654
#19 0x00007fbb8aefdf1b in S_doeval (gimme=0, startop=0x0, outside=0x0,
    seq=<value optimized out>) at pp_ctl.c:2916
#20 0x00007fbb8af0042a in Perl_pp_require () at pp_ctl.c:3520
#21 0x00007fbb8ae9db34 in Perl_runops_debug () at dump.c:1931
#22 0x00007fbb8aebcc5f in Perl_call_sv (sv=0xa86458, flags=6) at perl.c:2646
#23 0x00007fbb8aebd1c4 in Perl_call_list (oldscope=2, paramList=0xa86560)
    at perl.c:5133
#24 0x00007fbb8ae6d2d3 in S_process_special_blocks (
    fullname=<value optimized out>, gv=0xa86530, cv=0xa86458) at op.c:5631
#25 0x00007fbb8ae7aa1f in Perl_newATTRSUB (floor=27, o=<value optimized out>,
    proto=<value optimized out>, attrs=0x0, block=0xa906f0) at op.c:5604
#26 0x00007fbb8ae798fa in Perl_utilize (aver=1, floor=27, version=0xa90430,
    idop=0xa902f0, arg=0xa903d0) at op.c:3757
#27 0x00007fbb8ae6c198 in Perl_yyparse () at perly.y:654
#28 0x00007fbb8aebef3b in S_parse_body (env=0x0, xsinit=0x400ce0 <xs_init>)
    at perl.c:2230
#29 0x00007fbb8aec0095 in perl_parse (my_perl=<value optimized out>,
    xsinit=0x400ce0 <xs_init>, argc=2, argv=0x7fff933b8398, env=0x0)
    at perl.c:1650
#30 0x0000000000400c8c in main (argc=2, argv=0x7fff933b8398,
    env=0x7fff933b83b0) at perlmain.c:111
(gdb)

正規表現はいろいろなところで使われているから、問題としているコードの処理に到達するまでにも頻繁にブレークポイントを通過することになってしまうのだ。もちろんひたすら continue コマンドを打ちつづけるのでもいいけど、非効率的なので、ちょっとスクリプトを直してやることで、ちょっとしたテクニックを使えるようにする。

use encoding 'utf-8';
$_ = "\t";
my $r = qr(^[[:print:]]);
sin(0);
$_ =~ $r;

これで、最初に sin にブレークポイントを設定してから実行して、止まったところで S_regtry にブレークポイントを設定するというように、段階的にブレークポイントを設定することで、早く目的の箇所に到達できるようになる。ちなみに sin() 関数そのものは便宜的なもので意味はなく、単に以下のような条件を満たす関数だからという理由で選んだものだ。

  • スクリプトの側の関数と C 側の関数の呼び出しが 1 対 1 で対応している
  • 副作用がない (関数呼び出しの結果がスクリプトの他の箇所の実行内容に影響しない)

(この辺も「Debug Hacks」に書いてあるのかな?)

GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
Warning: /usr/src/glibc-2.7/glibc-2.7/wcsmbs: No such file or directory.
(gdb) break sin
Function "sin" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 1 (sin) pending.
(gdb) run
Starting program: /home/moriyoshi/opt/perl-5.10.0/bin/perl /tmp/test.pl

Breakpoint 1, __sin (x=0) at ../sysdeps/ieee754/dbl-64/s_sin.c:89
89      ../sysdeps/ieee754/dbl-64/s_sin.c: No such file or directory.
        in ../sysdeps/ieee754/dbl-64/s_sin.c
(gdb) break S_regtry
Breakpoint 2 at 0x7f67e7593de0: file regexec.c, line 2205.
(gdb) cont
Continuing.

Breakpoint 2, S_regtry (reginfo=0x7fffefa23630, startpos=0x7fffefa23688)
    at regexec.c:2205
2205    {
(gdb) bt
#0  S_regtry (reginfo=0x7fffefa23630, startpos=0x7fffefa23688)
    at regexec.c:2205
#1  0x00007f67e75a4bc8 in Perl_regexec_flags (prog=0x1e23640,
    stringarg=0x1e24100 "\t", strend=0x1e24101 "", strbeg=0x1e24100 "\t",
    minend=<value optimized out>, sv=0x1e6e338, data=0x0,
    flags=<value optimized out>) at regexec.c:1867
#2  0x00007f67e7534f4b in Perl_pp_match () at pp_hot.c:1330
#3  0x00007f67e7508b34 in Perl_runops_debug () at dump.c:1931
#4  0x00007f67e7528bdb in perl_run (my_perl=<value optimized out>)
    at perl.c:2384
#5  0x0000000000400ccc in main (argc=2, argv=0x7fffefa23a08,
    env=0x7fffefa23a20) at perlmain.c:113
(gdb)

というわけで、1867行目

        if (s == startpos && regtry(&reginfo, &startpos)) // →ここ
            goto got_it;

ここで s == startpos という条件が成立し、regtry() が呼ばれていることが分かった。では、regtry() のここでの結果はどうなっているのかを見てみる。

(gdb) disable
(gdb) fin
Run till exit from #0  S_regtry (reginfo=0x7fff2d968570, 
    startpos=0x7fff2d9685c8) at regexec.c:2205
0x00007f0d254e9bc8 in Perl_regexec_flags (prog=0x18e5640, 
    stringarg=0x18e6100 "\t", strend=0x18e6101 "", strbeg=0x18e6100 "\t", 
    minend=<value optimized out>, sv=0x1930338, data=0x0, 
    flags=<value optimized out>) at regexec.c:1867
1867            if (s == startpos && regtry(&reginfo, &startpos))
Value returned is $1 = 1

というわけで、true が返ってきたので、最終的に got_it (マッチしたよ) に飛ぶことが分かった。LOST って TV ドラマで Sawyer という登場人物がよく口にするのは「gotcha.」だったな。

さて、ケース4も忘れずに見ておく必要がある。

        ... (略)...
        /* XXXX check_substr already used to find "s", can optimize if
           check_substr==must. */
        scream_pos = -1;
        dontbother = end_shift;
        strend = HOPc(strend, -dontbother);
        while ( (s <= last) &&
                ((flags & REXEC_SCREAM)
                 ? (s = screaminstr(sv, must, HOP3c(s, back_min, (back_min<0 ? strbeg : strend)) - strbeg,
                                    end_shift, &scream_pos, 0))
                 : (s = fbm_instr((unsigned char*)HOP3(s, back_min, (back_min<0 ? strbeg : strend)),
                                  (unsigned char*)strend, must,
                                  multiline ? FBMrf_MULTILINE : 0))) ) {
            /* we may be pointing at the wrong string */
            if ((flags & REXEC_SCREAM) && RX_MATCH_COPIED(prog))
                s = strbeg + (s - SvPVX_const(sv));
            DEBUG_EXECUTE_r( did_match = 1 );
            if (HOPc(s, -back_max) > last1) {
                last1 = HOPc(s, -back_min);
                s = HOPc(s, -back_max);
            }
            else {
                char * const t = (last1 >= PL_bostr) ? HOPc(last1, 1) : last1 + 1;

                last1 = HOPc(s, -back_min);
                s = t;
            }
            if (do_utf8) {
                while (s <= last1) {
                    if (regtry(&reginfo, &s))
                        goto got_it;
                    s += UTF8SKIP(s);
                }
            }
            else {
                while (s <= last1) {
                    if (regtry(&reginfo, &s))
                        goto got_it;
                    s++;
                }
            }
        }
        ... (略) ...
        goto phooey;

screaminstr() も fbm_instr() もやっていることは同じで、結局は部分文字列を探して、その後に続く文字列がパタンにマッチするかどうかを見ているだけだ。結局 regtry() が呼ばれていることには違いはない。というわけで、regtry() を見てみる。

では、regtry() の中身は?

regtry() を見ていく前に、ケース6で regtry() が呼ばれていないことを確認しておく。前回見たのだけど、

STATIC char *
S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, 
    const char *strend, regmatch_info *reginfo)
{
    ... (略) ....
    register const bool do_utf8 = PL_reg_match_utf8;
    ... (略) ....

    /* We know what class it must start with. */
    switch (OP(c)) {
    case ANYOF:
        if (do_utf8) {
             REXEC_FBC_UTF8_CLASS_SCAN((ANYOF_FLAGS(c) & ANYOF_UNICODE) ||
                      !UTF8_IS_INVARIANT((U8)s[0]) ?
                      reginclass(prog, c, (U8*)s, 0, do_utf8) :
                      REGINCLASS(prog, c, (U8*)s));
        }
        else {
            while (s < strend) {
                STRLEN skip = 1;

                if (REGINCLASS(prog, c, (U8*)s) ||
                    (ANYOF_FOLD_SHARP_S(c, s, strend) &&
                    /* The assignment of 2 is intentional:
                     * for the folded sharp s, the skip is 2. */
                    (skip = SHARP_S_SKIP))) {
                    if (tmp && (!reginfo || regtry(reginfo, &s)))
                        goto got_it;
                    else
                        tmp = doevery;
                }
                else 
                    tmp = 1;
                s += skip;
            }
        }
        break;

    ... (略) ....

    }

    ... (略) ....
}

ここで、regtry() が呼ばれるケースは do_utf8 が false の場合のみなので、大丈夫そうだ。

さて、

STATIC I32			/* 0 failure, 1 success */
S_regtry(pTHX_ regmatch_info *reginfo, char **startpos)
{
    dVAR;
    CHECKPOINT lastcp;
    regexp *prog = reginfo->prog;
    RXi_GET_DECL(prog,progi);
    GET_RE_DEBUG_FLAGS_DECL;
    reginfo->cutpoint=NULL;

    if ((prog->extflags & RXf_EVAL_SEEN) && !PL_reg_eval_set) {
        ... (略) ...
    }
    DEBUG_EXECUTE_r(PL_reg_starttry = *startpos);
    prog->offs[0].start = *startpos - PL_bostr;
    PL_reginput = *startpos;
    PL_reglastparen = &prog->lastparen;
    PL_reglastcloseparen = &prog->lastcloseparen;
    prog->lastparen = 0;
    prog->lastcloseparen = 0;
    PL_regsize = 0;
    PL_regoffs = prog->offs;
    if (PL_reg_start_tmpl <= prog->nparens) {
        PL_reg_start_tmpl = prog->nparens*3/2 + 3;
        if(PL_reg_start_tmp)
            Renew(PL_reg_start_tmp, PL_reg_start_tmpl, char*);
        else
            Newx(PL_reg_start_tmp, PL_reg_start_tmpl, char*);
    }

    /* XXXX What this code is doing here?!!!  There should be no need
       to do this again and again, PL_reglastparen should take care of
       this!  --ilya*/

    /* Tests pat.t#187 and split.t#{13,14} seem to depend on this code.
     * Actually, the code in regcppop() (which Ilya may be meaning by
     * PL_reglastparen), is not needed at all by the test suite
     * (op/regexp, op/pat, op/split), but that code is needed otherwise
     * this erroneously leaves $1 defined: "1" =~ /^(?:(\d)x)?\d$/
     * Meanwhile, this code *is* needed for the
     * above-mentioned test suite tests to succeed.  The common theme
     * on those tests seems to be returning null fields from matches.
     * --jhi updated by dapm */
#if 1
    if (prog->nparens) {
        regexp_paren_pair *pp = PL_regoffs;
        register I32 i;
        for (i = prog->nparens; i > (I32)*PL_reglastparen; i--) {
            ++pp;
            pp->start = -1;
            pp->end = -1;
        }
    }
#endif
    REGCP_SET(lastcp);
    if (regmatch(reginfo, progi->program + 1)) {
        PL_regoffs[0].end = PL_reginput - PL_bostr;
        return 1;
    }
    if (reginfo->cutpoint)
        *startpos= reginfo->cutpoint;
    REGCP_UNWIND(lastcp);
    return 0;
}

RXf_EVAL_SEEN は /(?{...})/ が使われたときのみに true となるフラグなので無視するとして、その先。

prog->nparens はキャプチャ対象となるグループの数を表していて、PL_reg_start_tmp はキャプチャ用のバッファ*1、PL_reg_start_tmplはバッファの大きさ。バッファの大きさが要求されるものより小さければ、Renew() でバッファのサイズを大きくし、そもそもバッファが確保されていなければ Newx() で新しくバッファを確保する。って全然マッチの挙動と関係ないじゃん!

そんなこんなで、結局は regmatch() を見る必要が出てきた。

ラスボス - regmatch()

regtry() がバラモスなら、regmatch() はさしずめゾーマというところか。

regmatch() はド定番な感じの VM ですね。

STATIC I32                        /* 0 failure, 1 success */
S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
{
    register const bool do_utf8 = PL_reg_match_utf8;
    const U32 uniflags = UTF8_ALLOW_DEFAULT;
    ... (略) ...
    scan = prog;
    while (scan != NULL) {
        ... (略) ...
        switch (state_num) {
        case BOL:
            if (locinput == PL_bostr)
            {
                /* reginfo->till = reginfo->bol; */
                break;
            }
            sayNO;
        ... (略) ...
        case ANYOF:
            if (do_utf8) {
                STRLEN inclasslen = PL_regeol - locinput;

                if (!reginclass(rex, scan, (U8*)locinput, &inclasslen, do_utf8))
                    goto anyof_fail;
                if (locinput >= PL_regeol)
                    sayNO;
                locinput += inclasslen ? inclasslen : UTF8SKIP(locinput);
                nextchr = UCHARAT(locinput);
                break;
            }
            else {
                if (nextchr < 0)
                    nextchr = UCHARAT(locinput);
                if (!REGINCLASS(rex, scan, (U8*)locinput))
                    goto anyof_fail;
                if (!nextchr && locinput >= PL_regeol)
                    sayNO;
                nextchr = UCHARAT(++locinput);
                break;
            }
        anyof_fail:
            /* If we might have the case of the German sharp s
             * in a casefolding Unicode character class. */

            if (ANYOF_FOLD_SHARP_S(scan, locinput, PL_regeol)) {
                 locinput += SHARP_S_SKIP;
                 nextchr = UCHARAT(locinput);
            }
            else
                 sayNO;
            break;
        ... (略) ...
        default:
            PerlIO_printf(Perl_error_log, "%"UVxf" %d\n",
                          PTR2UV(scan), OP(scan));
            Perl_croak(aTHX_ "regexp memory corruption");
            
        } /* end switch */ 

        /* switch break jumps here */
        scan = next; /* prepare to execute the next op and ... */
        continue;    /* ... jump back to the top, reusing st */
        /* NOTREACHED */
    }

    /*
    * We get here only if there's trouble -- normally "case END" is
    * the terminating point.
    */
    Perl_croak(aTHX_ "corrupted regexp pointers");
    /*NOTREACHED*/
    sayNO;

yes:
    ... (略) ...
    result = 1;
    goto final_exit;

no:
    ... (略) ...
no_silent:
    ... (略) ...
    result = 0;

  final_exit:
    ... (略) ...
    return result;
}

ANYOF の処理を見ると、do_utf8 が true なら無条件に reginclass を、false なら REGINCLASS() を呼んでいる。つまり find_byclass() と扱いが異なっているってことだ!!

挙動の違いを表にまとめると以下のようになる。ちなみに UTF8_IS_INVARIANT() は utf8.h で次のように定義されている。

... (略) ...
#ifdef EBCDIC
/* The equivalent of these macros but implementing UTF-EBCDIC
   are in the following header file:
 */

#include "utfebcdic.h"

#else
... (略) ...
/* Transform after encoding */
#define NATIVE_TO_UTF(ch)        (ch)
... (略) ...
#define UNI_IS_INVARIANT(c)		(((UV)c) <  0x80)
#define UTF8_IS_INVARIANT(c)		UNI_IS_INVARIANT(NATIVE_TO_UTF(c))
... (略) ...
#endif /* EBCDIC vs ASCII */
regnode*のANYOF_UNICODEフラグ UTF8_IS_INVARIANT(s[0]) find_byclass() regmatch()
false false REGINCLASS() reginclass()
false true reginclass() reginclass()
true false reginclass() reginclass()
true true reginclass() reginclass()

動作が一貫していないってわけですね。

で、これはバグなの?

少なくとも一貫性がない時点でバグ確定。

find_byclass() の方の REXEC_FBC_UTF8_CLASS_SCAN() の中身である条件式を見ると、なんか理由があってそうしているという感じなので、(regmatch() と並べてみたときに) どっちが意図する挙動なのかは外野からは判断できない。perlre に書いてあることが正しいんだとすれば、find_byclass() の方が誤りということになるんじゃないのかなあ。

あ、そういえば、パタン 7について調べてなかった。番外編に続くかも。