Issues

ZF-5063: Segmentaion fault on preg_replace in Zend_Db_Statement

Description

This seems not to exactly be a Zend Framework bug, but more likely PHP or PRCE lib bug, but I'll post it here for your opinion.

On line 190 of Zend_Db_Statement is the following preg_replace:

$sql = preg_replace("/$q($qe|\\{2}|[^$q])*$q/", '', $sql);

When the sql passed is longer than about 4100 characters the process dies with the following error in the log:

[notice] child pid 7426 exit signal Segmentation fault (11)

I can drill it down to be this part of the regex that fails: /([^'])*/

The content in the sql i UTF8 encoded text - all characters are ascii though. I don't know if that might be of any help.

It seems to be somewhat related to this filed php bug:

http://bugs.php.net/bug.php?id=46551

I'm using PHP 5.2.6-0.dotdeb.1 with Suhosin-Patch 0.9.6.2 (cli) (built: May 2 2008 10:24:03) on a debian-server. I will upgrade to latest snapshot tomorrow to see if that helps.

Comments

I guess it has something to do with this: http://bugs.debian.org/cgi-bin/bugreport.cgi/…

This appears to be an issue within php itself and not an issue in ZF. Also appears to have been fixed in php.

It appears that it is actually an issue with pcre, which has been flagged as "won't fix".

See: http://bugs.exim.org/show_bug.cgi?id=797

Raising the process stack limit helps handle larger queries, but doesn't "fix" the issue.

// simple "query" that triggers segfault. $sql = str_pad("'", 6000, 'a');

// should throw invalid sql, actually segfaults $dbAdapter->query($sql);

Is there another way to write code to prevent large queries from segfaulting php since there doesn't appear to be a fix coming for the pcre library?

This issue can be fixed or better, worked around by changing the regexp slightly. Basically, the (..|..|[^.])* will 'eat' every character between quotes one-by-one, incrementing the stack every character. Adding a + sign after the negated character class will eat as much characters as it can, significantly reducing the stack usage.

I've attached a patch that will fix this issue, and also will correct http://framework.zend.com/issues/browse/ZF-3025

submitted patch also fix a wrong fix for issue ZF-3025

I'd really like to see this fix in 1.9.4 when ZF-7911 is not ready yet.

The patch works, why hasn't this fix made it to production in the last six months?

this cost me a couple hours. please bump up the priority on this one (the patch works), the seg fault might not be ZF's... fault.. but it's extremely painful.

This Issue still exists in ZF.

the patch doesn't work for me. also it's not correct though adding '+' after $qe means 'last character of $qe repeated one or more times'. this would be better: $sql = preg_replace("/$d(($de)+|(\\{2})+|([^$d])+)*$d/", '', $sql); and the best solution for me is just making the pattern ungreedy. i'm not an expert in PCRE, but as i read on wikipedia, the PCRE engine would work faster with ungreedy patterns. and with this pattern greediness actually doesn't matter

Guys, does anybody know how to fight this problem? I have RHE6, ZF 1.11.3, and issue is still there. Published patch fixes nothing as well as others published in ZF-7911 Today is 10/3/2011, meaning it has been in the air for almost 3 years now!!!!!!

@Misha Krassovski besides patches published here, there are two workarounds that worked perfectly for me: 1) make the pattern ungreedy by adding /U (more at http://php.net/manual/en/…) here you need to change ZF code, but it will help with all your queries 2) use the adapter directly with those queries that fail, i.e: $db->getAdapter()->getConnection()->exec($query);

Are there any known side-effects to switching the preg_replace call to ungreedy? That seems like the simplest solution, assuming we can come up with a test to prove it actually solves the issue.

The problem with this issue, is that it is not something that can reliably be unit tested. The segfault will only appear on certain systems when enough data is passed, where 'enough' depends on the system and how PHP and PCRE are build.

The patch that I suggested for this issue has been (and probably still is) in production. The patch would probably not fix the problem entirely, if enough data is passed, the query might still fail. I feel that if ZF-7911 is solved properly, this issue will go away or become even more unlikely.

In the mean time, I've got a new job where ZF is not used (yet).

Could someone here who is experiencing this issue please try the patch i've attached (ZF-7911_ungreedy.patch) and report back the result? It's a modified version of the fix suggested in ZF-7911, with the addition of the ungreedy modifier as suggested here by [~kipelovets]

[~adamlundrigan], it worked for me

Just casting my opinion on this - I've also wasted a couple of hours banging my head with this issue a few good months ago when it appeared on the zf-based product I was working on, but it only happened on my local env (which was Windows XP) and not on our linux production env. Nevertheless, important and really annoying issue that takes a while to drill down to.

I just ran into this issue with a very long query (inserting a PHP serialized array) with ZF 1.11.11 on PHP 5.3.3 and PHP 5.3.6. I tried out the ungreedy patch. I can't tell if the code in _stripQuoted() is now still doing what it is supposed to do, but the segmentation fault on preg_replace did go away.

This is a 3 year old bug, and knowing from experience now, it is pretty nasty once you run into it. I don't even remember running into segfault issues with PHP before. It took me a while to find the culprit and then this bug report.

If the maintainers are not willing to fix the problem, at least put a disclaimer in the manual of Zend_Db, that one should refrain from using Zend_Db_Statement, if they need to run certain long queries.

I agree that the bug is in libpcre, but with a different regex, the probability of segfault can be reduced. The below fix is mysql specific and also addresses ZF-7911. libpcre still goes into a recursion, but the depth seems to be only a fraction of the original. A proper fix would be in the libpcre. Or ZF has to write its own FSM. But doing it in pure php is probably slow.


class Company_Db_Statement_Pdo_Mysql extends Zend_Db_Statement_Pdo
{
    /* remove "foo""b\ar\"b\\az" and 'foo''b\ar\'b\\az' style quoted strings from sql queries. `backticks` are not touched. Specific for mysql syntax.
     * for `foo``bar` style quotes, that regex part could be added (not tested):
     *    `([^`]+|``)*?`(?!`)
     */
    protected function _stripQuoted($sql)
    {
        $regex = "/\"([^\\\\\"]+|\\\\.|\"\")*?\"(?!\")|'([^\\\\']+|\\\\.|'')*?'(?!')/";
        return preg_replace($regex, '', $sql);
    }
}

$db = Zend_Db::factory('PDO_MYSQL', $config->params);
$db->setStatementClass('Company_Db_Statement_Pdo_Mysql');

Added a patch that tries to solve the following: - adds unit tests as proposed by Vincent de Lau with some supposed improvements where they seemed wrong - completes the _stripQuoted function, as it was quite incomplete if tested against the new unit test - used the U modifier in the regexes that need it to avoid the segfault that I could reproduce over and over again

Please review and give your opinion. If I get no complaints I will commit in about two weeks from now.

If I get no complaints

SELECT ''?`;

expected result: SELECT ; I get: SELECT '?`;

Also, \ has no special meaning within `` style quotes in mysql. i.e. \ or \\ are both valid and different.

\ or \ \ (without the blank) are both valid and different.

Joern, can you be more specific. It's good that you complain / comment. I'm just the other developer trying to fix something that has been broken for many years. I don't expect it to be fixed overnight.

From what I have understood from the research of Vincent de Lau, is that no attempt is made to validate or improve the query. The query is assumed to be correct and identifiers are stripped, so that the query can be parsed. Quoted identifiers have to be stripped as well.

From the research of others I have understood that a backtick can't be escaped with a backslash, only with an extra backtick. Please use [] to quote strings in comments here, as Vincent did, to avoid confusion.

When complaining about expected results, please do so in a format that I can readily use in the unit test.

In your case, this would be: Expected:


in: [SELECT '`'`?`;]
out: [SELECT ;]

Actual:


in: [SELECT '`'`?`;]
out: [SELECT '?`;]

I can explain why you get what you do not expect: the two mathing backticks are stripped first. What remains is the actual output. Why is this wrong? Can you explain that? If so, I'll by happy to fix it.

Either way, it doesn't segfault anymore, while it covers the most obvious use cases. I will welcome edge cases to be solved in separate issues if needed. Maybe we should even add an adapter specific way to override the order and format of the regexes if needed, but it seems to me that solving the segfault issue and the common use cases would already help a lot of people.

I find it hard to find a realistic use case for the format you supplied, do you know of any?

I'll try to make a more realistic example. Not sure if anyone would actually do it, but it looks like someone might. I hope that now it is obvious why the actual output is wrong.

Expected:


in: [SELECT '?`' `x`, col `y` FROM t WHERE u = ?;]
out: [SELECT  , col  FROM t WHERE u = ?;]

Actual:


in: [SELECT '?`' `x`, col `y` FROM t WHERE u = ?;]
out: [SELECT '?xy` FROM t WHERE u = ?;]

Here is another unrealistic use case which segfaults:


<?php

$regex = "/'(\\\\'|[^'])*'/U";
$sql = "'";
for ($i = 0; $i < 1000000; ++$i) $sql .= "\\x";
$sql .= "'";
$sql = preg_replace($regex, '', $sql);

I wish I could come up with a perfect solution myself, but everything I try is either deadly slow, segfaults, would require libpcre changes or would require making a C module for stripping quotes :-)

Thanks for your explanation and the excellent formatting of your code examples.

I still have trouble understanding, because I think the query is wrong and we do not attempt to validate the query. In this input:

[SELECT '?`' `x`, col `y` FROM t WHERE u = ?;]

the first backtick is not allowed in my opinion and that's why is goes 'wrong' with the output. Or should a backtick not be considered a quote if inside single quotes and that is what I'm missing? Why would one want a single backtick inside the single quotes? I still find it very, very hard to imagine a use case for that.

I could imagine something like this:


[INSERT INTO `pcre` (`test`) VALUES ('In MySQL, the backtick (`) is used to quoted identifiers')]

Will I be in trouble with this?:


[INSERT INTO `pcre` (`test`) VALUES ('In MySQL, the backtick (`) is used to quoted identifiers, and here is another backtick: ` ...ooops')]

I admit that I will have to look deeper into examples like these and make sure I am confident about how these should be formatted to be valid queries and then what the expected output should be. Any help is still appreciated. Even if it doesn't get perfect, It would help if it would improve, or?

Your segfaulting examples worries me, although it is unrealistic. Maybe I should not be worried, in php you can't do everything either. Like allocating memory inside an infinite loop will certainly crash php.

{quote} Thanks for your explanation and the excellent formatting of your code examples. {quote} I'm trying my best to learn the formatting language of this bug tracker :-)

{quote} Or should a backtick not be considered a quote if inside single quotes and that is what I'm missing? Why would one want a single backtick inside the single quotes? I still find it very, very hard to imagine a use case for that. {quote}

Inside quotes (whatever the type) all other quotes lose their special meaning and are just ordinary characters. So '?' is a string consisting of a question mark and a backtick. My query would return two columns, named x and y, where the value of the former is always '?' and the latter's value comes from the database:

mysql> create table t (u int unsigned not null, col varchar(20) not null);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t values (10,'ten'), (20,'twenty'), (42,'fortytwo');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT '?`' `x`, col `y` FROM t WHERE u = 20;
+----+--------+
| x  | y      |
+----+--------+
| ?` | twenty |
+----+--------+
1 row in set (0.00 sec)

And your two queries are perfectly valid in mysql:

mysql> create table pcre (test varchar(255));
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO `pcre` (`test`) VALUES ('In MySQL, the backtick (`) is used to quoted identifiers');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO `pcre` (`test`) VALUES ('In MySQL, the backtick (`) is used to quoted identifiers, and here is another backtick: ` ...ooops');
Query OK, 1 row affected (0.00 sec)

mysql> select * from pcre;
+----------------------------------------------------------------------------------------------------+
| test                                                                                               |
+----------------------------------------------------------------------------------------------------+
| In MySQL, the backtick (`) is used to quoted identifiers                                           |
| In MySQL, the backtick (`) is used to quoted identifiers, and here is another backtick: ` ...ooops |
+----------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Of course my examples appear to be unrealistic. But they can get very real when user-provided data is sent to the database and people can cause your application to segfault (resulting in DoS) or worse. I believe that with unrealistic code this bug can allow a user to at least inject random data into the database or tamper with SELECT results.

Thank you very much. You just clarified a key point I missed: the fact that inside any type of quotes all other quotes lose their meaning entirely. It is obvious, but I overlooked it's importance for this function completely. I'll have to redo my homework. It will have to wait for a while, but in any case, I won't commit my patches until I have give it additional attention.

By changing the order by which the quoted identifiers and values are stripped, I now have your use cases - and mine - convered by the unit test. So, maybe I solved the related ZF-7911 issue.

As far as the segfault is concerned: I tested that with 9000 chars. It segfaulted. I added the Ungreedy modifier, it stopped segfaulting. But if I try with only a few more characters, like 9500, it segfaults just the same, even with the /U modifier. No sophisticated string needs to be inserted to reproduce it: I just insert 9500 times the [a] character:


$db = Zend_Db_Table::getDefaultAdapter();
$value = str_repeat('a', 9500);
$query = "INSERT INTO `pcre` (`test`) VALUES ('$value')";
$db->query($query);

I commited some changes to trunk using issue ZF-7911. This does not mean everything is fixed and fine now. However, my fix adds a test that should cover what has been asked for here and in ZF-7911.

Today I did another commit with a slightly changed regex. While the old regex would segfault near 9,500 characters, the new regex segfaults near 65,000 characters. It's still not good, but we are improving.

I also want to point everyone at this new bugreport in the PCRE library, there is some usefull information in there, that we might be able to use in order to get regular errors from PCRE instead of segfaults and to allow us to increase the memory limit of PCRE based on the memory limit we set for php:

http://bugs.exim.org/show_bug.cgi?id=1233

I would appreciate help from anyone who has experience compiling php. What I would like to do is change it so that: - PCRE uses heap instead of stack - PCRE is configured with a memory limit based on the memory limit of php, so that if we increase the memory limit for php, we also increase the memory limit for PCRE and so that we get regular errors instead of segfaults.

I committed a new regex a while ago, that passes all unit tests added in ZF-7911. With this regex, you can use a query of around 65,000 characters if you use the default settings of php.

In addition, by asking the PCRE developers, I found there is a setting you can put in php.ini. pcre.recursion_limit and pcre.backtrack_limit

I experimented with those, and I found that if I set the pcre.recursion_limit to 1000, which is a very low value, compared to the default of 100k, it 'just works' (all the unit tests pass). You get a regular memory_limit exceeded php error if you use too long a query, but you can increase the memory_limit until you get a MySQL has gone away error (because the maximum packet size was exceeded). In either case, it is no longer Zend_Db_Statement that is the problem. I could insert 100k utf8 characters before MySQL gave up.

See also: http://bugs.exim.org/show_bug.cgi?id=1233

If you still experience problems even with the new regex and tweaked settings, feel free to reopen this issue.

Note that if you set the recursion limit to low, to 10 for example, the resulting replacement string will be empty and you will have no query anymore. In that case, the unit tests will not pass.