Zend Framework

Incorrect encoding of 'subject' in Chinese e-mail, sent with Zend_Mail

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Critical
  • Resolution: Duplicate
  • Affects Version/s: 1.5.2
  • Fix Version/s: None
  • Component/s: Zend_Mail
  • Labels:
    None

Description

With the following code, I wish to send an e-mail with Chinese in the subject:

// ZendFramework-1.5.2

$mail = new Zend_Mail('UTF-8');

$mail->setFrom('info@example.org');
$mail->addTo('sam@example.org');
$mail->setSubject('机器视觉组件生产商');
$mail->setBodyText('机器视觉组件生产商 机器视觉组件生产商');
$mail->send();

However, the subject of the sent e-mail is garbled and displayed incorrectly in the e-mail client (Thunderbird):

"Subject: =?UTF-8?Q?=E6=9C=BA=E5=99=A8=E8=A7=86=E8=A7=89=E7=BB=84=E4=BB=B6=E7=94=9F=E4=BA=A7= =E5=95=86?="

The full e-mail, as sent, is below:

From - Sat Aug  9 14:17:45 2008
X-Account-Key: account2
X-UIDL: BNb"!%58"!UQ]!!J>'"!
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Return-Path: <xxx@xxx.org>
X-Original-To: yyy@xxx.org
Delivered-To: yyy@xxx.org
Received: from xxx.xxx.org (localhost [127.0.0.1])
    by localhost (Postfix) with ESMTP id B6B67501847
    for <yyy@xxx.org>; Sat,  9 Aug 2008 14:14:08 +0200 (CEST)
To: <yyy@xxx.org>
Subject: =?UTF-8?Q?=E6=9C=BA=E5=99=A8=E8=A7=86=E8=A7=89=E7=BB=84=E4=BB=B6=E7=94=9F=E4=BA=A7= =E5=95=86?=
From: "" <info@example.org>
Date: Sat, 09 Aug 2008 14:27:55 +0200
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <20080809122755.C0CAB8C056@xxx.xxx.org>


=E6=9C=BA=E5=99=A8=E8=A7=86=E8=A7=89=E7=BB=84=E4=BB=B6=E7=94=9F=E4=BA=A7=E5=
=95=86 =E6=9C=BA=E5=99=A8=E8=A7=86=E8=A7=89=E7=BB=84=E4=BB=B6=E7=94=9F=E4=
=BA=A7=E5=95=86

It seems that the method Zend_Mail::_encodeHeader() is causing the problem.

E-mail with a Chinese subject is sent correctly when the following code is used for the method:

protected function _encodeHeader($value)
{
    if (Zend_Mime::isPrintable($value)) {
        return $value;
    } else {
        return '=?' . $this->_charset . '?B?' . base64_encode($value) . '?=';
    }
}

Jacky Chen posted this solution on Fri, 28 Dec 2007 17:21:26 -0800 at:

http://www.mail-archive.com/fw-general@lists.zend.com/msg09005.html

Can his version of Zend_Mail::_encodeHeader() be used to update the official version?

In the meantime, I am subclassing Zend_Mail and overriding Zend_Mail::_encodeHeader() with his version, as I need to send e-mail with Chinese subjects and also Chinese From and To addresses (Zend_Mail::_encodeHeader() is used in multiple methods).

Thank you for your kind consideration of this change.

Jonathan Maron

Activity

Tomas Markauskas added a comment -

This solution works only for short headers, because RFC 2047 does not allow an encoded-word in a header to be longer than 75 characters. Longer values have to be split into several encoded-words, and to avoid breaking multibyte characters the splits have to fall between characters, never inside one. I had the same problem and now have a solution that fixes it for UTF-8 strings (other charsets later).

I will post a patch in the next few days. It should also close a few other tickets related to Zend_Mail/Zend_Mime.
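
As a rough illustration of the splitting rule described in this comment (not the patch itself), the sketch below B-encodes a UTF-8 value into encoded-words of at most 75 characters and only breaks between whole characters. The function name encodeHeaderBase64Utf8 is hypothetical and is not part of Zend_Mail or Zend_Mime; it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper, NOT part of Zend_Mail: B-encode a UTF-8 header value
// as one or more RFC 2047 encoded-words of at most 75 characters each.
function encodeHeaderBase64Utf8($value, $maxWordLength = 75)
{
    $prefix = '=?UTF-8?B?';
    $suffix = '?=';

    // Characters left for the base64 text inside one encoded-word.
    $room = $maxWordLength - strlen($prefix) - strlen($suffix);
    // Base64 turns every 3 octets into 4 characters.
    $maxOctets = ((int) floor($room / 4)) * 3;

    // Split into whole UTF-8 characters so no character is cut in half.
    $chars = preg_split('//u', $value, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Value is not valid UTF-8');
    }

    $words = array();
    $chunk = '';
    foreach ($chars as $char) {
        // Start a new encoded-word before the octet budget is exceeded.
        if ($chunk !== '' && strlen($chunk) + strlen($char) > $maxOctets) {
            $words[] = $prefix . base64_encode($chunk) . $suffix;
            $chunk = '';
        }
        $chunk .= $char;
    }
    if ($chunk !== '') {
        $words[] = $prefix . base64_encode($chunk) . $suffix;
    }

    // Encoded-words are separated by CRLF SPACE (header folding).
    return implode("\r\n ", $words);
}
```

Because each chunk is base64-encoded on its own, every encoded-word decodes to complete characters. The header name is not counted here, so the first line can still exceed the line limit.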

Jonathan Maron added a comment -

Hello Tomas

Thank you very much for your prompt answer.

I very much look forward to seeing your patch.

Would it be possible for you to send me a copy already?

TIA

Jonathan Maron

Tomas Markauskas added a comment -

Patch for Zend_Mail/Zend_Mime (@10901)

This fixes the described problem (and a few more).

Tomas Markauskas added a comment -

What was the problem:

From http://tools.ietf.org/html/rfc2047#section-2

An encoded-word may not be more than 75 characters long, including
charset, encoding, encoded-text, and delimiters. If it is desirable
to encode more text than will fit in an encoded-word of 75
characters, multiple encoded-words (separated by CRLF SPACE) may be
used.

While there is no limit to the length of a multiple-line header
field, each line of a header field that contains one or more
encoded-words is limited to 76 characters.

From http://tools.ietf.org/html/rfc2047#section-5

The 'encoded-text' in an 'encoded-word' must be self-contained;
'encoded-text' MUST NOT be continued from one 'encoded-word' to
another. This implies that the 'encoded-text' portion of a "B"
'encoded-word' will be a multiple of 4 characters long; for a "Q"
'encoded-word', any "=" character that appears in the 'encoded-text'
portion will be followed by two hexadecimal characters.

Each 'encoded-word' MUST encode an integral number of octets. The
'encoded-text' in each 'encoded-word' must be well-formed according
to the encoding specified; the 'encoded-text' may not be continued in
the next 'encoded-word'. (For example, "=?charset?Q?=?=
=?charset?Q?AB?=" would be illegal, because the two hex digits "AB"
must follow the "=" in the same 'encoded-word'.)

Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.
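
To make the quoted rules concrete, here is a small sketch of a checker for a single "Q" encoded-word. The name isValidQEncodedWord is hypothetical, and the check covers only the rules quoted above (75-character limit, no whitespace inside the word, every "=" followed by two hex digits); it does not verify that the word holds whole characters.

```php
<?php
// Hypothetical checker, for illustration only: tests one "Q" encoded-word
// against the RFC 2047 rules quoted above.
function isValidQEncodedWord($word)
{
    // An encoded-word may not be more than 75 characters long.
    if (strlen($word) > 75) {
        return false;
    }
    // =?charset?Q?encoded-text?= where encoded-text is printable ASCII
    // without "?", spaces, or a bare "=", and every "=" is followed by
    // exactly two hexadecimal digits.
    $pattern = '/^=\?[^?\s]+\?Q\?(?:[\x21-\x3C\x3E\x40-\x7E]|=[0-9A-F]{2})+\?=$/i';
    return preg_match($pattern, $word) === 1;
}

// The Subject value from the report: longer than 75 characters and
// containing "= " (a soft line break) inside the encoded-word.
$reported = '=?UTF-8?Q?=E6=9C=BA=E5=99=A8=E8=A7=86=E8=A7=89=E7=BB=84=E4=BB=B6=E7=94=9F=E4=BA=A7= =E5=95=86?=';
var_dump(isValidQEncodedWord($reported));
```

Under these rules the reported Subject value is rejected on both counts, which is consistent with the garbled display in the mail client.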

There is still one problem with my patch: the first line of the header will probably exceed the length of 76 characters because we don't count the length of the header name. Zend_Mime::encodeQuotedPrintableHeader should get the length of the header name, so it could reduce the length of the first "encoded-word". I haven't seen any cases where this would cause big problems, so I haven't resolved it.
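
One way the remaining problem could be handled is sketched below: the encoder receives the header name and shortens the first encoded-word accordingly. This is only an assumption about how such a change might look, not the attached patch; encodeHeaderQUtf8 and its parameters are hypothetical names, and the sketch deliberately encodes everything except ASCII letters, digits and spaces.

```php
<?php
// Hypothetical helper (NOT the attached patch): Q-encode a UTF-8 header value,
// counting the header name against the first line's 76-character limit.
function encodeHeaderQUtf8($headerName, $value, $maxLineLength = 76)
{
    $prefix = '=?UTF-8?Q?';
    $suffix = '?=';
    $overhead = strlen($prefix) + strlen($suffix);

    // The first line also carries "Name: "; continuation lines start with
    // one space. An encoded-word itself is capped at 75 characters.
    $firstRoom = min(75, $maxLineLength - strlen($headerName) - 2) - $overhead;
    $nextRoom = min(75, $maxLineLength - 1) - $overhead;

    // Work on whole UTF-8 characters so none is split across encoded-words.
    $chars = preg_split('//u', $value, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Value is not valid UTF-8');
    }

    $words = array();
    $text = '';
    $room = $firstRoom;
    foreach ($chars as $char) {
        if ($char === ' ') {
            $encoded = '_';                       // Q encoding for a space
        } elseif (preg_match('/^[A-Za-z0-9]$/', $char) === 1) {
            $encoded = $char;                     // safe to send literally
        } else {
            $encoded = '';
            foreach (str_split($char) as $octet) {
                $encoded .= sprintf('=%02X', ord($octet));
            }
        }
        // Close the current encoded-word before it would overflow the line.
        if ($text !== '' && strlen($text) + strlen($encoded) > $room) {
            $words[] = $prefix . $text . $suffix;
            $text = '';
            $room = $nextRoom;
        }
        $text .= $encoded;
    }
    if ($text !== '') {
        $words[] = $prefix . $text . $suffix;
    }

    return $headerName . ': ' . implode("\r\n ", $words);
}
```

With a very long header name the first encoded-word would still hold at least one character, so the limit can be exceeded in that edge case; the sketch does not try to handle it.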

Tomas Markauskas added a comment -

It would probably fix these two issues too.

Niko Sams added a comment -

Patch worked out fine for me. Thanks.

Tomas Markauskas added a comment -

All these issues seem to be related.

Tomáš Fejfar added a comment -

Someone should fix this, it's a critical bug for everyone using a non-ASCII charset...

Tomas Markauskas added a comment -

I would like to know if there's any need for other character sets than UTF-8.
Another option to deal with multibyte characters would be to convert all multibyte character sets to UTF-8 in encodeQuotedPrintableHeader (see attached patch).
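
A minimal sketch of that conversion step, assuming PHP's iconv extension is available, might look like the following. The helper name headerValueToUtf8 is hypothetical and this is not the attached patch; the result would then be passed to a UTF-8-only header encoder.

```php
<?php
// Hypothetical helper, for illustration only: convert a header value from
// its declared charset to UTF-8 before it is encoded for the header.
function headerValueToUtf8($value, $charset)
{
    // Nothing to do when the value is already declared as UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $value;
    }
    // iconv() returns false when the input cannot be converted.
    $converted = iconv($charset, 'UTF-8', $value);
    if ($converted === false) {
        throw new InvalidArgumentException(
            'Cannot convert header value from ' . $charset . ' to UTF-8'
        );
    }
    return $converted;
}
```

The trade-off is that the outgoing header would always be labelled UTF-8, regardless of the charset passed to the Zend_Mail constructor.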
