
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl) Component Proposal

Proposed Component Name Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl)
Developer Notes http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl)
Proposers Martin Hujer
Alexander Veremyev (Zend Liaison)
Revision 1.0 - 24th January 2008: Initial proposal
1.1 - 28th January 2008: Renamed to Zend_Filter_Sanitize (wiki revision: 17)

Table of Contents

1. Overview

Zend_Filter_Sanitize is a component that converts a string into an SEO-friendly URL.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • This filter must convert any string into an SEO-friendly URL without requiring any configuration.
  • This component will correctly convert a string into an SEO-friendly URL.
  • This component will convert all characters to lowercase.
  • This component will transliterate special characters such as 'č' or 'ů' into 'c' and 'u' (handled by Zend_Filter_Transliteration).
  • This component will allow the word delimiter characters to be changed (defaults: ' ', '.', '\', '/', '-', '_').
  • This component will allow the delimiter replacement character to be changed (default: dash).
  • This component will strip characters that are not allowed in a URL.
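
The requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: the function name sanitize_sketch and its parameters are hypothetical, the input is assumed to be already transliterated to ASCII, and the replacement is assumed to be a single character.

```php
<?php
// Hypothetical sketch; not the Zend_Filter_Sanitize API.
function sanitize_sketch(
    string $value,
    string $replacement = '-',
    array $delimiters = [' ', '.', '\\', '/', '-', '_']
): string {
    // Lowercase first; the input is assumed to be plain ASCII already.
    $value = strtolower($value);

    // Turn every word delimiter into the replacement character.
    $value = str_replace($delimiters, $replacement, $value);

    // Strip everything that is not a letter, a digit, or the replacement.
    $quoted = preg_quote($replacement, '/');
    $value = preg_replace('/[^a-z0-9' . $quoted . ']/', '', $value);

    // Collapse runs of the replacement and trim it from both ends.
    $value = preg_replace('/(?:' . $quoted . ')+/', $replacement, $value);

    return trim($value, $replacement);
}
```

With the defaults, sanitize_sketch('Hello World.Foo/Bar') yields 'hello-world-foo-bar'.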

4. Dependencies on Other Framework Components

5. Theory of Operation

...

6. Milestones / Tasks

7. Class Index

  • Zend_Filter_Sanitize

8. Use Cases

9. Class Skeletons

]]></ac:plain-text-body></ac:macro>

]]></ac:plain-text-body></ac:macro>

  1. Jan 25, 2008

    <p>Will this work with non-Latin characters, like Cyrillic etc?</p>

    <p>I like this, but I would suggest creating a more general transliteration filter. Then anyone could just replace all spaces with dashes if wanted...</p>

    1. Jan 25, 2008

      <p>Hi,</p>

      <p>It works with these characters. I'm from the Czech Republic and I need to transform Czech characters such as ě š č ř ž ý á í é into their ASCII equivalents. That was one of the reasons why I decided to write this class.</p>

      <p>I'm not sure I understand your second note. Do you mean creating a class that just replaces spaces with dashes, or adding some options to this class?</p>

      1. Jan 25, 2008

        <p>No, I meant that this could be not a SeoURL filter but simply a transliteration filter that outputs any text as transliterated ASCII text.</p>

        <p>After that you could use the output for URLs; you would only need to lowercase the text and replace all the spaces with dashes...</p>

        1. Jan 27, 2008

          <p>Yes, good idea, but when I want to transliterate 'ěščřžýáíéůú' (typical Czech characters), it works this way:</p>

          <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
          $s = iconv("utf-8", "us-ascii//TRANSLIT", "ěščřžýáíéůú");
          // $s contains "escrz'y'a'i'eu'u"
          // not "escrzyaieuu"
          ]]></ac:plain-text-body></ac:macro>
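
          <p>One possible way to clean up such output is to strip the stand-in marks after calling iconv. The sketch below is only an assumption about how that could be done, not the code of Zend_Filter_Transliteration; both function names are hypothetical. Note that the output of //TRANSLIT depends on the iconv implementation and locale, and that stripping apostrophes also removes legitimate ones from the input.</p>

```php
<?php
// Hypothetical helper: removes the marks that //TRANSLIT may leave behind.
function strip_translit_marks(string $ascii): string
{
    // ', ", ^, ` and ~ are common stand-ins for stripped diacritics.
    return str_replace(["'", '"', '^', '`', '~'], '', $ascii);
}

// Hypothetical wrapper: transliterate with iconv, then strip the marks.
function transliterate_sketch(string $utf8): string
{
    // The result of //TRANSLIT varies between iconv implementations.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

    // Fall back to the unmodified input if iconv fails.
    return $ascii === false ? $utf8 : strip_translit_marks($ascii);
}
```

          <p>Applied to the output quoted above, strip_translit_marks("escrz'y'a'i'eu'u") gives "escrzyaieuu".</p>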

          1. Jan 30, 2008

            <p>I've solved it.</p>

            <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
            $s = Zend_Filter_Transliteration::filter("ěščřžýáíéůú");
            // $s now contains "escrzyaieuu" as expected
            ]]></ac:plain-text-body></ac:macro>

            <p>You can check it out from the SVN repository of <a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer">Zend_Filter_Transliteration</a>.</p>

  2. Jan 27, 2008

    <p>I'd suggest renaming this class to Zend_Filter_Sanitize. The same name is used in other languages, and it seems wrong to bind a class name to only one task (SEO optimization) when it could be used for several other tasks as well.</p>

    1. Jan 27, 2008

      <p>I agree with Daniel: this class can be useful for several other tasks as well, so renaming it to Zend_Filter_Sanitize makes sense. Of course, it would be useful only if it can handle every Latin-derived (or similar) alphabet (European alphabets with diacritic Latin letters, Cyrillic and so on).</p>

      1. Jan 27, 2008

        <p>It works correctly with Czech, so I suppose it will work well with other European alphabets. If you have some problematic characters in your language, I'll be happy to add them to the unit tests.</p>

  3. Jan 28, 2008

    <p><strong>Renamed to Zend_Filter_Sanitize</strong></p>

    1. Jan 28, 2008

      <p>Perhaps it'd also be appropriate to update the proposal to reflect the new name (e.g. "This component will correctly converts [sic] string into SEO url."). On the other hand, if you were going to rename it SanitizeUrl (which I also think is more appropriate) you can keep on renaming...</p>

  4. Jan 28, 2008

    <p>I think you will have problems with the character encoding (UTF-8 versus ISO-8859-1, for example).<br />
    How will you handle that?</p>

    <p>I always use Sanitize in my projects. I think I can help with my experience.</p>

    1. Jan 28, 2008

      <p>It converts UTF-8 strings well. If the input uses another encoding, it needs to be converted to UTF-8 first.</p>

      <p>I'd really appreciate your tips.</p>

  5. Jan 28, 2008

    <p>Nice idea about this class, Martin...</p>

    <p>Btw, I haven't seen you on #zftalk at all in recent months...</p>

  6. Jan 28, 2008

    <p>Hey Martin,</p>

    <p>I might propose SanitizeUrl over simply 'Sanitize'.</p>

    <p>The current naming doesn't give any context to what you are attempting to filter for and against.</p>

    <p>Another thought might be to simply create a "Transliteration" filter first, as that might have a much broader audience than URLs alone.</p>

    <p>My 2cents <ac:emoticon ac:name="wink" /></p>

    <p>-ralph</p>

    1. Jan 28, 2008

      <p>Hey Ralph,</p>

      <p>I think this filter could also be used to filter directory names on the filesystem (for example). I don't think it's a good idea to rename it to *anything*Url.</p>

      <ul class="alternate">
      <li>Daniel</li>
      </ul>

      1. Jan 28, 2008

        <p>I agree with Ralph, simply because this class has some replacements that have been specifically designed for URLs, especially SEO URLs.<br />
        In particular, replacing spaces with dashes (instead of underscores) is a common practice when you design search-engine-friendly routes.</p>

        <p>I have a question for you, Martin.<br />
        How will the filter handle special characters such as slashes (/) or dots?</p>

        <p>Additionally, if you really want to make a URL search engine friendly, you should ensure that URLs with and without a trailing slash are normalized to a unique version (usually without, for this type of framework-powered URL).<br />
        Do you think this filter should normalize the trailing slash too?</p>

        1. Feb 09, 2008

          <p>I'd prefer to leave dots untouched - would be handy when normalizing filenames like MoZzIlla FiREfOx 1.0.0.12.EXE (mozilla-firefox-1.0.0.12.exe).</p>

          1. Feb 10, 2008

            <p>Maybe an option to handle this would do it <ac:emoticon ac:name="smile" /></p>

          2. Feb 18, 2008

            <p>Added <ac:emoticon ac:name="smile" /></p>

            <p>See use cases and update from SVN.</p>

            <p>Martin H.</p>

      2. Jan 28, 2008

        <p>That said, doesn't that also make a good case for a specific Zend_Filter_Transliteration filter?</p>

        <p>After just a cursory review, I think having a transliteration filter would be a "Good Thing" (r).</p>

        <p>-ralph</p>

  7. Jan 28, 2008

    <p>Hello,<br />
    I have thought about it (the discussion on IRC helped me) and I will create a proposal for Zend_Filter_Transliteration (or just Translitere). It will handle only transliteration and maybe apostrophe removal.</p>

    <p>Simone: slashes and dots are currently stripped away, but they should be converted to a dash (or to the configured replacement character). I'll do it tomorrow.</p>

    <p>New name of this component could be one of these:</p>
    <ol>
    <li>Zend_Filter_Sanitize</li>
    <li>Zend_Filter_SanitizeUrl</li>
    <li>Zend_Filter_StringToSlug</li>
    </ol>

    1. Jan 28, 2008

      <p>I would vote for Zend_Filter_Sanitize. It could be used then to filter strings for URLs, maybe for filenames (ie. for uploaded files, not to contain invalid/unwanted characters) and probably for lots of other things.</p>

      1. Jan 29, 2008

        <p>The idea behind Zend_Filter_StringToSlug is nice.<br />
        Being inspired by Tomas's feedback, what about StringToPath?</p>

  8. Jan 29, 2008

    <p>Your proposal is basically an aggregation of three filters:</p>
    <ul>
    <li>Transliteration</li>
    <li>Lowercase</li>
    <li>Space replacement</li>
    </ul>
    <p>I suggest implementing these as separate filters and then aggregating the three into a single one, which could be called NiceUrl or something (btw: URLs are overrated in SEO, domains are not, but this doesn't matter here).</p>

  9. Jan 30, 2008

    <p><strong>I have split off part of this component into <a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer">Zend_Filter_Transliteration</a></strong></p>

    1. Jan 30, 2008

      <p>Do you think it is necessary? :|</p>

      <p>I don't know if it can help you, but no more than 3 days ago I did the same for a Rails project.<br />
      Here's my piece of code.</p>

      <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
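      require 'iconv'

      module PathFilters

        # Transliterate to ASCII, then sanitize; fall back to plain
        # sanitizing if iconv cannot convert the string.
        def to_path
          Iconv.iconv("ASCII//IGNORE//TRANSLIT", "UTF-8", self).join.sanitize_path
        rescue
          sanitize_path
        end

        # Drop disallowed characters, turn whitespace and dots into
        # underscores, then let dasherize turn underscores into dashes.
        def sanitize_path
          gsub(/[^a-z._0-9 -]/i, "").gsub(/\s+|\./, "_").dasherize.downcase
        end

      end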
      ]]></ac:plain-text-body></ac:macro>

      <p>I used iconv as well, as you posted at
      <a class="external-link" href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Sanitize+-+Martin+Hujer?focusedCommentId=42104#comment-42104">http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Sanitize+-+Martin+Hujer?focusedCommentId=42104#comment-42104</a><br />
      Feel free to take inspiration from it if you need. <ac:emoticon ac:name="smile" /></p>

      1. Jan 30, 2008

        <p>Thanks, Rails looks like Python <ac:emoticon ac:name="smile" /></p>

        <p>I wanted to write a more general transliteration filter, and when I use this iconv command, it converts some diacritical marks into ' or " or ^.<br />
        The Zend_Filter_Transliteration filter strips these characters out.</p>

        <p>And the Sanitize filter has some improvements that make it more general (e.g. creating file system paths).</p>

        <p>Martin.</p>

  10. Feb 08, 2008

    <p>I started using Sanitize instead of my own helpers.</p>

    <p>It works great so far; I will test it more thoroughly in the next few days.</p>

  11. May 28, 2008

    <p>A correction from my side to Zend_Filter_Transliteration, which I just found out:</p>

    <p>The transliteration table should look more like this for German (according to German spelling rules):</p>

    <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
    $table = array(
        'ä' => 'ae',
        'ë' => 'e',
        'ï' => 'i',
        'ö' => 'oe',
        'ü' => 'ue',
        'Ä' => 'Ae',
        'Ë' => 'E',
        'Ï' => 'I',
        'Ö' => 'Oe',
        'Ü' => 'Ue',
        'ß' => 'ss',
    );
    ]]></ac:plain-text-body></ac:macro>

    1. Jun 07, 2008

      <p>Code updated to reflect this.</p>

      1. Jun 09, 2008

        <p>I have to add one more thing on that topic:</p>

        <p>When the word is written entirely in uppercase (ÄPFEL), it should result in "AEPFEL", while "drüben" still results in "drueben".</p>

        <p>I guess you could check whether the next character in the word is uppercase. If there is no next character, check the previous character. And don't mind words with just two letters, afaik there aren't any with umlauts <ac:emoticon ac:name="smile" /></p>
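
        <p>The neighbour check described above could be sketched as follows. This is only an illustration, not the code in SVN: the function name transliterate_umlauts is hypothetical, the input is assumed to be valid UTF-8, and only the three umlauts are covered (ß and the other table entries are left out).</p>

```php
<?php
// Hypothetical sketch of the "look at the neighbouring letter" rule.
function transliterate_umlauts(string $text): string
{
    $upper = ['Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue'];
    $lower = ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue'];

    // An uppercase umlaut followed by an uppercase letter, or one that ends
    // a word and follows an uppercase letter, sits in an all-caps word:
    // use "AE" rather than "Ae" there.
    $result = preg_replace_callback(
        '/[ÄÖÜ](?=\p{Lu})|(?<=\p{Lu})[ÄÖÜ](?!\p{L})/u',
        function (array $match) use ($upper) {
            return strtoupper($upper[$match[0]]);
        },
        $text
    );

    // preg_replace_callback returns null on invalid UTF-8; keep the input then.
    $result = $result ?? $text;

    // Every remaining umlaut gets the ordinary mixed-case replacement.
    return strtr($result, $upper + $lower);
}
```

        <p>With this sketch, "ÄPFEL" becomes "AEPFEL", "Äpfel" becomes "Aepfel" and "drüben" becomes "drueben".</p>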

  12. May 29, 2008

    <p>It would be nice to have an option to supply a dictionary in which you can specify alternatives for commonly used symbols: there's nothing more annoying than URLs like /blog/dr-jekyll-mr-hide generated from the title Dr Jekyll & Mr Hide. In this case the ampersand could be replaced with the text 'and', 'und', 'et', 'y', 'és' and so on...<br />
    This dictionary could be specified in a PHP array, INI or XML config file.</p>
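
    <p>A dictionary of this kind could look roughly like the sketch below. It is an assumption about one possible shape, not part of the proposal: the function name replace_symbols and the dictionary entries are hypothetical, and the dictionary is passed as a plain PHP array.</p>

```php
<?php
// Hypothetical sketch: replace symbols with words before sanitizing a title.
function replace_symbols(string $title, array $dictionary): string
{
    // Pad each word with spaces so it is not glued to its neighbours.
    $padded = array_map(
        function ($word) {
            return ' ' . $word . ' ';
        },
        $dictionary
    );

    // strtr replaces all symbols in one pass, so a replacement is never
    // rescanned for further symbols.
    $title = strtr($title, $padded);

    // Collapse the extra whitespace introduced by the padding.
    return trim(preg_replace('/\s+/', ' ', $title));
}

// Illustrative English dictionary; other languages would supply their own.
$english = ['&' => 'and', '@' => 'at', '%' => 'percent'];
```

    <p>Here replace_symbols('Dr Jekyll & Mr Hide', $english) returns 'Dr Jekyll and Mr Hide', which would then sanitize to dr-jekyll-and-mr-hide.</p>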

    1. May 29, 2008

      <p>Yeah, I was also going to suggest that. You <strong>could</strong> supply a Zend_Translate instance.</p>

      <p>But I also see a problem with that. It may be problematic when you have a title like "That silly &nbsp;", which would then be converted to "that-silly-andnbsp".</p>

      <p>Sure, in that case you could say that there must be a space before and after the &, but how about "The & character in HTML & other things", which would be converted to "the-and-character-in-html-and-other-things"? You would want the second & to be converted, but not the first.</p>

      <p>As you can see, there is no logical way to determine whether to convert a special character or not.</p>

      1. May 30, 2008

        <p>Yes, Zend_Translate support would be nice.<br />
        However, I don't really see the problem here. The first case is more or less unambiguous: I personally pronounce it 'and-n-b-s-p', and I suppose I'm not the only one. We could also optionally use regular expressions to convert character references into a form such as 'ampersand-nbsp' or 'nbsp', or, if you want, just hardcode it and use 'non-breaking-space'.<br />
        As for the second case, I suppose you will admit that it is not very realistic, and 'the-and-character-in-html-and-other-things' is a perfectly acceptable conversion.</p>

        1. May 30, 2008

          <p>Ok, agreed. Just wanted to make sure that everything is clear <ac:emoticon ac:name="smile" /></p>

  13. Jul 01, 2008

    <p>This filter is ready for Zend review (together with <a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer">Zend_Filter_Transliteration</a>).</p>

    <p>I'm not sure whether I should decouple this from Zend_Filter_Transliteration.</p>

    <p>These two filters were originally a single 15-line filter that created an SEO-friendly URL from a string.</p>

  14. Aug 05, 2008

    <ac:macro ac:name="note"><ac:parameter ac:name="title">Zend Comments</ac:parameter><ac:rich-text-body>
    <p>The proposal is archived because it has a hard dependency on the <a href="http://framework.zend.com/wiki/x/7qQ">Zend_Filter_Transliteration Component Proposal</a>, which is already archived.</p></ac:rich-text-body></ac:macro>