<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[
Zend_Filter_Sanitize is a component that converts a string into an SEO-friendly URL.
Zend Framework: Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl) Component Proposal
Proposed Component Name
Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl)
Developer Notes
http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl)
Proposers
Martin Hujer
Alexander Veremyev (Zend Liaison)
Revision
1.0 - 24th January 2008: Initial proposal
1.1 - 28th January 2008: Renamed to Zend_Filter_Sanitize (wiki revision: 17)
Table of Contents
1. Overview
2. References
3. Component Requirements, Constraints, and Acceptance Criteria
', '/', '-', '_'))
4. Dependencies on Other Framework Components
- Zend_Filter_Interface
- Zend_Filter_Exception
- Zend_Filter_StringToLower
- Zend_Filter_StringTrim
- Zend_Filter_Transliteration
5. Theory of Operation
...
6. Milestones / Tasks
- Milestone 1: [DONE] Finalize this proposal
- Milestone 2: [DONE] Working prototype checked into http://zfdev.googlecode.com/svn/trunk/ZendFilterSanitize/
- Milestone 3: [DONE] Unit tests exist, work, and are checked into SVN.
- Milestone 4: Community and Zend review
7. Class Index
- Zend_Filter_Sanitize
8. Use Cases
37 Comments
Jan 25, 2008
Tomas Markauskas
<p>Will this work with non-Latin characters, like Cyrillic etc?</p>
<p>I like this, but I would suggest creating a more general transliteration filter. Then anyone could just replace all spaces with dashes if wanted...</p>
Jan 25, 2008
Martin Hujer
<p>Hi,</p>
<p>It works with these characters. I'm from the Czech Republic and I need to transform Czech characters, such as ě š č ř ž ý á í é, into their ASCII equivalents. That was one of the reasons why I decided to write this class.</p>
<p>I'm not sure I understand your second note. Do you mean creating a class that just replaces spaces with dashes? Or just adding some options to this class?</p>
Jan 25, 2008
Tomas Markauskas
<p>No, I meant that this could be not a SeoURL filter but just a transliteration filter that outputs any text as transliterated ASCII text.</p>
<p>After that you could just use the output for the URLs; you would only need to lowercase the text and replace all the spaces with dashes...</p>
Jan 27, 2008
Martin Hujer
<p>Yes, good idea, but when I want to transliterate 'ěščřžýáíéůú' (typical Czech symbols) it works this way:</p>
<ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
$s = iconv("utf-8", "us-ascii//TRANSLIT", "ěščřžýáíéůú");
// $s contains "escrz'y'a'i'eu'u"
// not "escrzyaieuu"
]]></ac:plain-text-body></ac:macro>
Jan 30, 2008
Martin Hujer
<p>I've solved it.</p>
<ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
$s = Zend_Filter_Transliteration::filter("ěščřžýáíéůú");
// $s now contains "escrzyaieuu" as expected
]]></ac:plain-text-body></ac:macro>
<p>You can check it out from the SVN repository of <a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer">Zend_Filter_Transliteration</a>.</p>
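<p>For illustration, a table-based approach avoids the stray apostrophes entirely. The following is only a minimal sketch, not the actual Zend_Filter_Transliteration code: the function name <code>transliterate_czech</code> is hypothetical and the table covers just the characters from the example.</p>

```php
<?php
// Hypothetical helper, NOT the real Zend_Filter_Transliteration code.
// Each accented character is mapped directly to its ASCII equivalent,
// so no apostrophes or carets are produced along the way.
function transliterate_czech(string $value): string
{
    // Assumption: only the characters from the example are covered;
    // a real table would be much larger.
    static $table = array(
        'ě' => 'e', 'š' => 's', 'č' => 'c', 'ř' => 'r', 'ž' => 'z',
        'ý' => 'y', 'á' => 'a', 'í' => 'i', 'é' => 'e', 'ů' => 'u',
        'ú' => 'u',
    );

    // strtr() with an array replaces whole byte sequences, which is
    // safe for UTF-8 input when this source file is UTF-8 as well.
    return strtr($value, $table);
}

echo transliterate_czech('ěščřžýáíéůú'), "\n"; // escrzyaieuu
```

<p>A lookup table gives the same result on every platform, whereas the output of iconv with //TRANSLIT depends on the underlying iconv implementation.</p>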
Jan 27, 2008
Daniel Freudenberger
<p>I'd suggest renaming this class to Zend_Filter_Sanitize. The same name is used in other languages, and it seems wrong to bind a class name to only one task (SEO optimization) when it could be used for several other tasks as well.</p>
Jan 27, 2008
Joó Ádám
<p>I agree with Daniel: this class can be useful for several other tasks as well, so renaming it to Zend_Filter_Sanitize makes sense. Of course, it would be useful only if it can handle every Latin-derived (or closely related) alphabet (European alphabets with diacritic Latin letters, Cyrillic and so on).</p>
Jan 27, 2008
Martin Hujer
<p>It works correctly with Czech, so I suppose it will work well with other European alphabets. If you have some problematic characters in your language, I'll be happy to add them to the unit tests.</p>
Jan 28, 2008
Martin Hujer
<p><strong>Renamed to Zend_Filter_Sanitize</strong></p>
Jan 28, 2008
Vincent
<p>Perhaps it'd also be appropriate to update the proposal to reflect the new name (e.g. "This component will correctly converts [sic] string into SEO url."). On the other hand, if you were going to rename it SanitizeUrl (which I also think is more appropriate) you can keep on renaming...</p>
Jan 28, 2008
Renan Gonçalves
<p>I think you will have problems with the character encoding (UTF-8 versus ISO-8859-1, for example).<br />
How will you handle that?</p>
<p>I always use Sanitize in my projects. I think I can help with my experience.</p>
Jan 28, 2008
Martin Hujer
<p>It converts UTF-8 strings well. If the input uses another encoding, it needs to be converted to UTF-8 first.</p>
<p>I'd really appreciate your tips.</p>
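<p>A minimal sketch of such a pre-conversion step, assuming the mbstring extension is available. The helper name <code>to_utf8</code> and the default source encoding are hypothetical; the source encoding has to be known by the caller, since it cannot be detected reliably from the bytes alone.</p>

```php
<?php
// Hypothetical helper: ensure the input is UTF-8 before it is filtered.
// $assumedEncoding is supplied by the caller (an assumption, not a
// detected value).
function to_utf8(string $value, string $assumedEncoding = 'ISO-8859-1'): string
{
    // Valid UTF-8 (which includes plain ASCII) is passed through as-is.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    // Anything else is treated as the assumed legacy encoding.
    return mb_convert_encoding($value, 'UTF-8', $assumedEncoding);
}

$latin1 = "caf\xE9";                    // "café" encoded as ISO-8859-1
echo bin2hex(to_utf8($latin1)), "\n";   // 636166c3a9
```

<p>Note that a few ISO-8859-1 byte sequences happen to be valid UTF-8 too, so the check is a heuristic rather than a guarantee.</p>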
Jan 28, 2008
Cristian Bichis
<p>Nice idea about this class, Martin...</p>
<p>Btw, I haven't seen you on #zftalk at all in the last few months...</p>
Jan 28, 2008
Ralph Schindler
<p>Hey Martin,</p>
<p>I might propose SanitizeUrl over simply 'Sanitize'.</p>
<p>The current naming doesn't give any context to what you are attempting to filter for and against.</p>
<p>Another thought might be to simply create a "Transliteration" filter first, as that might have a much broader audience than that of URLs.</p>
<p>My 2cents <ac:emoticon ac:name="wink" /></p>
<p>-ralph</p>
Jan 28, 2008
Daniel Freudenberger
<p>Hey Ralph,</p>
<p>I think this filter could also be used to filter directory names on the filesystem (for example). I don't think it's a good idea to rename it to *anything*Url.</p>
<p>- Daniel</p>
Jan 28, 2008
Simone Carletti
<p>I agree with Ralph, because this class has some replacements that have been designed specifically for URLs, especially SEO URLs.<br />
In particular, replacing spaces with dashes (instead of underscores) is a common practice when you design search-engine-friendly routes.</p>
<p>I have a question for you, Martin.<br />
How will the filter handle special characters such as slashes (/) or dots?</p>
<p>Additionally, if you really want to make a URL search-engine friendly, you should ensure that URLs with and without a trailing slash are normalized to a single version (usually the one without, for this type of framework-powered URL).<br />
Do you think this filter should normalize the trailing slash too?</p>
Feb 09, 2008
Joó Ádám
<p>I'd prefer to leave dots untouched. That would be handy when normalizing filenames like MoZiLLa FiREfOx 1.0.0.12.EXE (mozilla-firefox-1.0.0.12.exe).</p>
Feb 10, 2008
Martin Hujer
<p>Maybe an option to handle this would do it <ac:emoticon ac:name="smile" /></p>
Feb 18, 2008
Martin Hujer
<p>Added <ac:emoticon ac:name="smile" /></p>
<p>See use cases and update from SVN.</p>
<p>Martin H.</p>
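<p>As a rough illustration of how such an option might behave, here is a sketch in plain PHP. The function name <code>sanitize_name</code> and its parameters are hypothetical and are not taken from the proposal's use cases; the input is assumed to be ASCII already (i.e. transliterated beforehand).</p>

```php
<?php
// Hypothetical function and option names; this is NOT the API of
// Zend_Filter_Sanitize. Input is assumed to be ASCII already.
function sanitize_name(string $value, bool $keepDots = false, string $delimiter = '-'): string
{
    $value = strtolower(trim($value));

    // Characters allowed to survive; the dot is added only on request.
    $allowed = 'a-z0-9' . ($keepDots ? '.' : '');

    // Collapse every run of other characters into a single delimiter.
    $value = preg_replace('/[^' . $allowed . ']+/', $delimiter, $value);

    // Drop delimiters left over at either end.
    return trim($value, $delimiter);
}

echo sanitize_name('MoZiLLa FiREfOx 1.0.0.12.EXE', true), "\n"; // mozilla-firefox-1.0.0.12.exe
echo sanitize_name('MoZiLLa FiREfOx 1.0.0.12.EXE'), "\n";       // mozilla-firefox-1-0-0-12-exe
```

<p>Keeping the default at "replace dots" matches the URL use case, while the flag covers the filename use case.</p>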
Jan 28, 2008
Ralph Schindler
<p>That said, doesn't that also make a good case for a specific Zend_Filter_Transliteration filter?</p>
<p>After just a cursory review, I think having a transliteration filter would be a "Good Thing" (r).</p>
<p>-ralph</p>
Jan 28, 2008
Martin Hujer
<p>Hello,<br />
I have thought about it (discussion on IRC helped me) and I will create a proposal for Zend_Filter_Transliteration (or just Translitere). It will handle just transliteration and maybe apostrophe removal.</p>
<p>Simone: slashes and dots are currently stripped away, but they should be converted to a dash (or to the configured replacement character). I'll do it tomorrow.</p>
<p>New name of this component could be one of these:</p>
<ol>
<li>Zend_Filter_Sanitize</li>
<li>Zend_Filter_SanitizeUrl</li>
<li>Zend_Filter_StringToSlug</li>
</ol>
Jan 28, 2008
Tomas Markauskas
<p>I would vote for Zend_Filter_Sanitize. It could then be used to filter strings for URLs, maybe for filenames (i.e. so that uploaded files do not contain invalid or unwanted characters) and probably for lots of other things.</p>
Jan 29, 2008
Simone Carletti
<p>The idea behind Zend_Filter_StringToSlug is nice.<br />
Being inspired by Tomas's feedback, what about StringToPath?</p>
Jan 29, 2008
Lars Strojny
<p>Your proposal is basically an aggregation of three filters:</p>
<ul>
<li>Transliteration</li>
<li>Lowercase</li>
<li>Space replacement</li>
</ul>
<p>I suggest implementing those three as separate filters and then aggregating them into a single one, which could be called NiceUrl or something similar (btw: URLs are overrated in SEO, domains are not, but this doesn't matter here).</p>
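<p>The aggregation idea can be sketched with plain closures. This is only an illustration under stated assumptions: the step names and the <code>run_chain</code> helper are hypothetical stand-ins for separate filter classes, and the transliteration table is deliberately tiny.</p>

```php
<?php
// Hypothetical stand-ins for three separate filters.
$transliterate = function (string $v): string {
    // Tiny illustrative table; a real filter would cover far more.
    return strtr($v, array(
        'Č' => 'C', 'č' => 'c', 'á' => 'a', 'é' => 'e',
        'í' => 'i', 'š' => 's', 'ž' => 'z',
    ));
};
$lowercase = function (string $v): string {
    // Safe here because the previous step already produced ASCII.
    return strtolower($v);
};
$replaceSpaces = function (string $v): string {
    return preg_replace('/\s+/', '-', trim($v));
};

// The aggregate simply runs each step in order on the previous result.
function run_chain(array $steps, string $value): string
{
    foreach ($steps as $step) {
        $value = $step($value);
    }
    return $value;
}

$steps = array($transliterate, $lowercase, $replaceSpaces);
echo run_chain($steps, ' Černá Káva '), "\n"; // cerna-kava
```

<p>Because each step is independent, any of them can be reused alone, for example transliteration without lowercasing.</p>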
Jan 30, 2008
Martin Hujer
<p><strong>I have split off part of this component into <a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer">Zend_Filter_Transliteration</a></strong></p>
Jan 30, 2008
Simone Carletti
<p>Do you think it is necessary? :|</p>
<p>I don't know if it can help you, but no more than 3 days ago I did the same for a Rails project.<br />
Here's my piece of code.</p>
<ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
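# Intended to be mixed into String (it calls gsub on self);
# dasherize comes from ActiveSupport.
module PathFilters
  def to_path
    # Transliterate to ASCII, then sanitize the result.
    Iconv.iconv("ASCII//IGNORE//TRANSLIT", "UTF-8", self).join.sanitize_path
  rescue
    # Fall back to sanitizing the raw string if the conversion fails.
    sanitize_path
  end

  def sanitize_path
    # Drop disallowed characters, turn whitespace and dots into
    # underscores, then let dasherize convert underscores to dashes.
    gsub(/[^a-z._0-9 -]/i, "").gsub(/\s+|\./, "_").dasherize.downcase
  end
end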
]]></ac:plain-text-body></ac:macro>
<p>I used iconv as well, as you posted at
<a class="external-link" href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Sanitize+-+Martin+Hujer?focusedCommentId=42104#comment-42104">http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Sanitize+-+Martin+Hujer?focusedCommentId=42104#comment-42104</a><br />
You can get inspired, if you need. <ac:emoticon ac:name="smile" /></p>
Jan 30, 2008
Martin Hujer
<p>Thanks, Rails looks like Python <ac:emoticon ac:name="smile" /></p>
<p>I wanted to write a more general transliteration filter, and when I use this iconv command, it converts some diacritical marks into ' or " or ^<br />
The Zend_Filter_Transliteration filter strips these characters out.</p>
<p>And the Sanitize filter has some improvements to make it more general (e.g. creating file system paths).</p>
<p>Martin.</p>
Feb 08, 2008
Cristian Bichis
<p>I started using sanitize instead of my own helpers.</p>
<p>It works great so far; I will test it more thoroughly over the next few days.</p>
May 28, 2008
Ben Scholzen
<p>A correction from my side to Zend_Filter_Transliteration, which I just noticed:</p>
<p>The transliteration table should look more like this for German (according to German grammar):</p>
<ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
$table = array(
    'ä' => 'ae',
    'ë' => 'e',
    'ï' => 'i',
    'ö' => 'oe',
    'ü' => 'ue',
    'Ä' => 'Ae',
    'Ë' => 'E',
    'Ï' => 'I',
    'Ö' => 'Oe',
    'Ü' => 'Ue',
    'ß' => 'ss',
);
]]></ac:plain-text-body></ac:macro>
Jun 07, 2008
Martin Hujer
<p>Code updated to reflect this.</p>
Jun 09, 2008
Ben Scholzen
<p>Have to add another thing to that topic:</p>
<p>When the word is written entirely uppercase (ÄPFEL), it should result in "AEPFEL", while "drüben" still results in "drueben".</p>
<p>I guess you could check, if the next character in the word is uppercase. If there is no next character, check the previous character. And don't mind words with just two letters, afaik there aren't any with umlauts <ac:emoticon ac:name="smile" /></p>
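<p>The neighbour check described above could be sketched roughly as follows, assuming the mbstring extension and PHP 7.4 or later for <code>mb_str_split</code>. All function names here are hypothetical, only the German umlauts are covered, and ß is always mapped to 'ss' regardless of case.</p>

```php
<?php
// Hypothetical helpers sketching the neighbour-based heuristic.

// True for a single letter of any script.
function is_letter(string $char): bool
{
    return preg_match('/^\p{L}$/u', $char) === 1;
}

// True if the character changes when lowercased.
function is_upper(string $char): bool
{
    return $char !== '' && mb_strtolower($char, 'UTF-8') !== $char;
}

function transliterate_german(string $value): string
{
    $lower = array('ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss');
    $upper = array('Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue');

    $chars = mb_str_split($value, 1, 'UTF-8');
    $out = '';
    foreach ($chars as $i => $char) {
        if (isset($lower[$char])) {
            $out .= $lower[$char];
        } elseif (isset($upper[$char])) {
            // Prefer the next letter of the word; if there is none,
            // fall back to the previous character.
            $next = $chars[$i + 1] ?? '';
            $prev = $chars[$i - 1] ?? '';
            $neighbour = is_letter($next) ? $next : $prev;
            // An uppercase neighbour means an all-caps context: "AE".
            $out .= is_upper($neighbour) ? strtoupper($upper[$char]) : $upper[$char];
        } else {
            $out .= $char;
        }
    }
    return $out;
}

echo transliterate_german('ÄPFEL'), "\n";  // AEPFEL
echo transliterate_german('Äpfel'), "\n";  // Aepfel
echo transliterate_german('drüben'), "\n"; // drueben
```

<p>Looking at a single neighbour keeps the rule simple, at the cost of misjudging mixed-case words such as acronyms followed by lowercase letters.</p>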
May 29, 2008
Joó Ádám
<p>It would be nice to have an option to supply a dictionary in which you can specify replacements for commonly used signs: there's nothing more annoying than URLs like /blog/dr-jekyll-mr-hyde generated from the title Dr Jekyll & Mr Hyde. In this case the ampersand could be replaced with the text 'and', 'und', 'et', 'y', 'és' and so on...<br />
This dictionary could be specified in a PHP array, INI or XML config file.</p>
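<p>A minimal sketch of the dictionary idea, using a plain PHP array per language. The function name <code>slug_with_dictionary</code> and the dictionary contents are hypothetical, and the input is assumed to be ASCII apart from the listed symbols.</p>

```php
<?php
// Hypothetical sketch: the dictionary maps symbols to words for one
// language and is applied before the usual slug clean-up.
function slug_with_dictionary(string $title, array $dictionary): string
{
    // Pad each replacement with spaces so that "R&D" becomes "r-and-d"
    // rather than "randd".
    foreach ($dictionary as $symbol => $word) {
        $title = str_replace($symbol, ' ' . $word . ' ', $title);
    }

    $title = strtolower($title);
    // Collapse every run of non-alphanumeric characters into one dash.
    $title = preg_replace('/[^a-z0-9]+/', '-', $title);

    return trim($title, '-');
}

$english = array('&' => 'and', '@' => 'at');
$german  = array('&' => 'und');

echo slug_with_dictionary('Dr Jekyll & Mr Hyde', $english), "\n"; // dr-jekyll-and-mr-hyde
echo slug_with_dictionary('Dr Jekyll & Mr Hyde', $german), "\n";  // dr-jekyll-und-mr-hyde
```

<p>The same array could just as well be loaded from an INI or XML config file, since only a flat symbol-to-word mapping is needed.</p>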
May 29, 2008
Ben Scholzen
<p>Yeah, I was also going to mention that. I suggest that you <strong>could</strong> supply a Zend_Translate instance.</p>
<p>But I also see a problem with that. It may be problematic when you have a title like "That silly &amp;nbsp;", which would then be converted to "that-silly-andnbsp".</p>
<p>Sure, in that case you could say, that there must be a space before and after the &, but how about: "The & character in HTML & other things", which would be converted to "the-and-in-html-and-other-things". You would want the first one not to be converted, but the last one.</p>
<p>As you can see, there is no logical way to determine whether to convert a special character or not.</p>
May 30, 2008
Joó Ádám
<p>Yes, Zend_Translate support would be nice.<br />
However, I don't really see the problem here. The first case is more or less unambiguous: I personally pronounce it 'and-n-b-s-p', and I suppose I'm not the only one. Also, we could optionally use regular expressions to say that character references should be converted to the form 'ampersand-nbsp' or 'nbsp', or, if you want, just hardcode it and use 'non-breaking-space'.<br />
In the second case, I suppose you will admit that this does not seem too realistic, and converting it to 'the-and-character-in-html-and-other-things' is a totally acceptable result.</p>
May 30, 2008
Ben Scholzen
<p>Ok, agreed. Just wanted to make sure that everything is clear <ac:emoticon ac:name="smile" /></p>
Jul 01, 2008
Martin Hujer
<p>This filter is ready for Zend review (together with <a href="http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer">Zend_Filter_Transliteration</a>).</p>
<p>I'm not sure whether I should decouple this from Zend_Filter_Transliteration.</p>
<p>These two filters were originally a single 15-line filter that created an SEO-friendly URL from a string.</p>
Aug 05, 2008
Alexander Veremyev
<ac:macro ac:name="note"><ac:parameter ac:name="title">Zend Comments</ac:parameter><ac:rich-text-body>
<p>The proposal is archived since it has a hard dependency on the <a href="http://framework.zend.com/wiki/x/7qQ">Zend_Filter_Transliteration Component Proposal</a>, which is already archived.</p></ac:rich-text-body></ac:macro>