Added by Martin Hujer, last edited by Alexander Veremyev on Aug 05, 2008

Zend Framework: Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl) Component Proposal

Proposed Component Name Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl)
Developer Notes http://framework.zend.com/wiki/display/ZFDEV/Zend_Filter_Sanitize (formerly Zend_Filter_SeoUrl)
Proposers Martin Hujer
Alexander Veremyev (Zend Liaison)
Revision 1.0 - 24th January 2008: Initial proposal
1.1 - 28th January 2008: Renamed to Zend_Filter_Sanitize (wiki revision: 16)

1. Overview

Zend_Filter_Sanitize is a component that converts a string into an SEO-friendly URL.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • This filter must convert any string into an SEO-friendly URL without requiring any configuration.
  • This component will correctly convert a string into an SEO-friendly URL.
  • This component will convert all characters to lowercase.
  • This component will transliterate accented characters, such as accented forms of 'c' and 'u', into plain 'c' and 'u' (this step is provided by Zend_Filter_Transliteration).
  • This component will allow the set of word delimiter characters to be changed (defaults are ' ', '.', '\', '/', '-', '_').
  • This component will allow the delimiter replacement character to be changed (default is a dash).
  • This component will strip characters that are not allowed in a URL.
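
The requirements above can be read as a small pipeline: transliterate, lowercase, strip disallowed characters, then replace delimiters. The following is a minimal sketch of that pipeline in Python, for illustration only. The proposed component is a PHP class, and the function name, parameter names, and the Unicode-decomposition approach to transliteration are assumptions of this sketch, not part of the proposal.

```python
import re
import unicodedata

# Assumed defaults, read off the requirements list; all names here are hypothetical.
DEFAULT_DELIMITERS = " .\\/-_"
DEFAULT_REPLACEMENT = "-"


def sanitize(text, delimiters=DEFAULT_DELIMITERS, replacement=DEFAULT_REPLACEMENT):
    """Turn an arbitrary string into a lowercase ASCII slug."""
    # Transliterate: decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase everything.
    text = text.lower()
    # Strip characters that are neither ASCII letters/digits nor delimiters.
    delimiter_class = re.escape(delimiters)
    text = re.sub("[^a-z0-9" + delimiter_class + "]", "", text)
    # Collapse each run of delimiters into a single replacement character.
    text = re.sub("[" + delimiter_class + "]+", replacement, text)
    # Trim replacement characters left over at either end.
    return text.strip(replacement)
```

Note that decomposition only covers letters that split into a base letter plus a combining mark. Letters without such a decomposition (and non-Latin scripts) are simply stripped by this sketch, which is where a real transliteration table would be needed.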

4. Dependencies on Other Framework Components

5. Theory of Operation

...

6. Milestones / Tasks

7. Class Index

  • Zend_Filter_Sanitize

8. Use Cases

9. Class Skeletons

Will this work with non-Latin characters, like Cyrillic etc?

I like this, but I would suggest creating a more general transliteration filter. Then anyone could just replace all spaces with dashes if wanted...

Hi,

It works with these characters: I'm from the Czech Republic and I need to transform Czech characters such as ě š č ř ž ý á í é into their unaccented equivalents. That was one of the reasons I decided to write this class.

I'm not sure I understand your second note. Do you mean creating a class that just replaces spaces with dashes? Or just adding some options to this class?

No, I meant that this could be not a SeoURL filter but just a transliteration filter that outputs any text as transliterated ASCII text.

After that you could just use the output for URLs; you would only need to lowercase the text and replace all the spaces with dashes...

Yes, good idea, but when I want to transliterate 'ěščřžýáíéůú' (typical Czech characters) it works this way:

I've solved it.

You can check out Zend_Filter_Transliteration from SVN.

I'd suggest renaming this class to Zend_Filter_Sanitize. The same name is used in other languages, and it seems wrong to bind a class name to only one task (SEO optimization) when it could be used for several other tasks as well.

I agree with Daniel: this class can be useful for several other tasks as well, so renaming it to Zend_Filter_Sanitize makes sense. Of course, it would be useful only if it can handle every Latin-derived (or close to it) alphabet (European alphabets with diacriticized Latin letters, Cyrillic and so on).

It works correctly with Czech, so I suppose it will work well with other European alphabets. If you have some problematic characters in your language, I'll be happy to add them to the unit tests.

Renamed to Zend_Filter_Sanitize

Perhaps it'd also be appropriate to update the proposal to reflect the new name (e.g. "This component will correctly converts [sic] string into SEO url."). On the other hand, if you were going to rename it SanitizeUrl (which I also think is more appropriate) you can keep on renaming...

I think you will have problems with the character encoding (UTF-8 versus ISO-8859-1, for example).
How will you handle that?

I always use Sanitize in my projects. I think I can help with my experiences.

It converts UTF-8 strings well. If the input is in another encoding, it needs to be converted to UTF-8 beforehand.
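
A minimal sketch of that pre-conversion step, in Python for illustration only. The function name and the ISO-8859-1 fallback are assumptions of this sketch. The check is a heuristic: when the source encoding is actually known, it is more reliable to convert from it explicitly.

```python
def ensure_utf8(raw, fallback_encoding="iso-8859-1"):
    """Return UTF-8 bytes for input that is either UTF-8 or in the fallback encoding."""
    try:
        # Input that decodes cleanly as UTF-8 is passed through unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Otherwise assume the fallback encoding and re-encode as UTF-8.
        return raw.decode(fallback_encoding).encode("utf-8")
```

For example, the ISO-8859-1 bytes for "café" are re-encoded, while bytes that are already valid UTF-8 are returned as they are.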

I'd really appreciate your tips.

Nice idea about this class, Martin...

Btw, I haven't seen you on #zftalk at all these last months...

Hey Martin,

I might propose SanitizeUrl over simply 'Sanitize'.

The current naming doesn't give any context to what you are attempting to filter for and against.

Another thought might be to simply create a "Transliteration" filter first, as that might have a much broader audience than one limited to URLs.

My 2cents

-ralph

Hey Ralph,

I think this filter could also be used to filter directory names on the filesystem (for example). I don't think it's a good idea to rename it to *anything*Url.

-Daniel

I agree with Ralph, simply because this class has some replacements that have been designed specifically for URLs, especially SEO URLs.
In particular, replacing spaces with dashes (instead of underscores) is common practice when you design search-engine-friendly routes.

I have a question for you, Martin.
How will the filter handle special characters such as slashes or dots?

Additionally, if you really want to make a URL search engine friendly, you should ensure that URLs with and without a trailing slash are normalized to a single version (usually the one without, for this type of framework-powered URL).
Do you think this filter should normalize the trailing slash too?

I'd prefer to leave dots untouched - that would be handy when normalizing filenames like MoZiLlA FiREfOx 1.0.0.12.EXE (mozilla-firefox-1.0.0.12.exe).
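
A minimal sketch of that dot-preserving variant, in Python for illustration only. The function name and the exact delimiter set are assumptions of this sketch. Dots are simply left out of the delimiters and added to the allowed characters.

```python
import re


def sanitize_keep_dots(text, replacement="-"):
    """Lowercase a name and replace delimiters, leaving dots untouched."""
    text = text.lower()
    # Strip everything except ASCII letters, digits, dots and delimiters.
    text = re.sub(r"[^a-z0-9. \\/_-]", "", text)
    # Dots are deliberately not part of the delimiter set here.
    text = re.sub(r"[ \\/_-]+", replacement, text)
    return text.strip(replacement)
```

With this, a version string such as 1.0.0.12 and the file extension survive, while spaces still become dashes.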

That said, doesn't that also make a good case for a dedicated Zend_Filter_Transliteration filter?

After just a cursory review, I think having a transliteration filter would be a "Good Thing" (r).

-ralph

Hello,
I have thought about it (a discussion on IRC helped me) and I will create a proposal for Zend_Filter_Transliteration (or just Translitere). It will handle just transliteration and maybe apostrophe removal.

Simone: slashes and dots are currently stripped away, but they should be converted to a dash (or to the configured replacement character). I'll do it tomorrow.

New name of this component could be one of these:

  1. Zend_Filter_Sanitize
  2. Zend_Filter_SanitizeUrl
  3. Zend_Filter_StringToSlug

I would vote for Zend_Filter_Sanitize. It could then be used to filter strings for URLs, maybe for filenames (i.e. so that uploaded files do not contain invalid or unwanted characters), and probably for lots of other things.

The idea behind Zend_Filter_StringToSlug is nice.
Being inspired by Tomas's feedback, what about StringToPath?

Your proposal is basically an aggregation of three filters:

  • Transliteration
  • Lowercase
  • Space replacement

I suggest implementing these as separate filters and then aggregating the three into a single one, which could be called NiceUrl or something similar (btw: URLs are overrated in SEO, domains are not, but this doesn't matter here).
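
The aggregation suggested here could look roughly like the following sketch, in Python for illustration only. The three filter functions, the chain helper, and the nice_url name are all hypothetical. The point is only that each step stays usable on its own while the combined filter is a plain composition.

```python
import unicodedata
from functools import reduce


def transliterate(text):
    """Drop combining marks after Unicode decomposition (a stand-in for a real table)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lowercase(text):
    return text.lower()


def replace_spaces(text, replacement="-"):
    """Join whitespace-separated words with the replacement character."""
    return replacement.join(text.split())


def chain(*filters):
    """Build one filter that applies the given filters from left to right."""
    def run(text):
        return reduce(lambda value, current: current(value), filters, text)
    return run


# The aggregate filter is just the three independent filters in sequence.
nice_url = chain(transliterate, lowercase, replace_spaces)
```

Each of the three functions can still be used separately, for example transliterate alone for plain ASCII output.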

I have split off part of this component into [Zend_Filter_Transliteration http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Transliteration+-+Martin+Hujer]

Do you think it is necessary? :|

I don't know if it can help you, but no more than 3 days ago I did the same for a Rails project.
Here's my piece of code.

I used iconv as well, as you posted at
http://framework.zend.com/wiki/display/ZFPROP/Zend_Filter_Sanitize+-+Martin+Hujer?focusedCommentId=42104#comment-42104
You can get inspired, if you need.

Thanks, Rails looks like Python

I wanted to write a more general transliteration filter, and when I use this iconv command, it converts some diacritical marks into ' or " or ^.
The Zend_Filter_Transliteration filter strips these characters out.

And the Sanitize filter has some improvements to be more general (e.g. file system paths creation)

Martin.

I started using sanitize instead of my own helpers.

Works great for now; I will test it more thoroughly over the next few days.

Some correction from my side to Zend_Filter_Transliteration, which I just found out:

For German, the transliteration table should look more like this (according to German spelling rules):

Code updated to reflect this.

Have to add another thing to that topic:

When the word is written entirely uppercase (ÄPFEL), it should result in "AEPFEL", while "drüben" still results in "drueben".

I guess you could check whether the next character in the word is uppercase. If there is no next character, check the previous character. And don't mind words with just two letters; afaik there aren't any with umlauts.
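
The neighbour check described above can be sketched as follows, in Python for illustration only. The mapping table is an assumption of this sketch, built from the standard German substitutions and the two examples given in the thread. It is not the table from the component.

```python
# Assumed mapping; uppercase umlauts default to capitalized digraphs.
UMLAUTS = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def transliterate_german(text):
    """Replace umlauts, using 'AE' style digraphs inside all-uppercase words."""
    out = []
    for i, ch in enumerate(text):
        digraph = UMLAUTS.get(ch)
        if digraph is None:
            out.append(ch)
            continue
        if ch.isupper():
            # Look at the next letter of the word, or the previous character
            # when the umlaut is the last letter.
            if i + 1 < len(text) and text[i + 1].isalpha():
                neighbour = text[i + 1]
            elif i > 0:
                neighbour = text[i - 1]
            else:
                neighbour = ""
            if neighbour.isupper():
                digraph = digraph.upper()
        out.append(digraph)
    return "".join(out)
```

An uppercase umlaut followed by a lowercase letter keeps the capitalized form, so "Äpfel" becomes "Aepfel" while "ÄPFEL" becomes "AEPFEL".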

It would be nice to have an option to supply a dictionary in which you can specify alternatives for symbols in common use: there's nothing more annoying than URLs like /blog/dr-jekyll-mr-hide generated from the title Dr Jekyll & Mr Hide. In this case the ampersand could be replaced with the text 'and', 'und', 'et', 'y', 'és' and so on...
This dictionary could be specified in a PHP array, INI or XML config file.
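
A minimal sketch of such a dictionary, in Python for illustration only. The structure, the language keys, and the function name are assumptions of this sketch. In the component the same data could equally come from a PHP array, INI or XML file as suggested above.

```python
# Hypothetical per-language dictionary of symbol replacements.
SYMBOL_WORDS = {
    "en": {"&": "and"},
    "de": {"&": "und"},
    "fr": {"&": "et"},
    "es": {"&": "y"},
    "hu": {"&": "és"},
}


def expand_symbols(text, language="en", dictionary=SYMBOL_WORDS):
    """Replace each known symbol with its word for the given language."""
    for symbol, word in dictionary.get(language, {}).items():
        # Pad with spaces so the word never fuses with its neighbours.
        text = text.replace(symbol, " " + word + " ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())
```

Run before the slug filter, this turns the title into "Dr Jekyll and Mr Hide", which then yields /blog/dr-jekyll-and-mr-hide instead of dropping the ampersand.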

Yeah, I was also going to say that. I suggest that you could supply a Zend_Translate instance.

But I also see a problem with that. Sometimes it may be problematic when you have a title like, for example, "That silly &nbsp;", which would then be converted to "that-silly-andnbsp".

Sure, in that case you could say that there must be a space before and after the &, but how about "The & character in HTML & other things", which would be converted to "the-and-character-in-html-and-other-things"? You would want the second one to be converted, but not the first.

As you can see, there is no logical way to determine whether to convert a special character or not.

Yes, Zend_Translate support would be nice.
However, I don't really see the problem here. The first case is more or less unambiguous: I personally pronounce it 'and-n-b-s-p', and I suppose I'm not the only one. We could also optionally use regular expressions to specify that character references are converted to a form such as 'ampersand-nbsp' or 'nbsp', or, if you want, just hardcode it and use 'non-breaking-space'.
In the second case, I suppose you'll admit that this is not too realistic, and 'the-and-character-in-html-and-other-things' is a totally acceptable conversion.
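
The regular-expression idea can be sketched as follows, in Python for illustration only. The pattern, the function name, and the choice of the bare 'nbsp' form are assumptions of this sketch. Named character references are handled first, and only the remaining ampersands are turned into a word.

```python
import re

# Matches named character references such as "&nbsp;" or "&amp;".
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_ampersands(text, word="and"):
    """Spell out character references by name, then replace bare ampersands."""
    # "&nbsp;" becomes its bare name, "nbsp".
    text = ENTITY.sub(lambda match: " " + match.group(1) + " ", text)
    # Any ampersand still present is treated as the word "and".
    text = text.replace("&", " " + word + " ")
    # Collapse the padding whitespace.
    return " ".join(text.split())
```

Under these assumptions the first title becomes "That silly nbsp" and the second becomes "The and character in HTML and other things", matching the conversion accepted above.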

Ok, agreed. Just wanted to make sure that everything is clear

This filter is ready for Zend review (also with Zend_Filter_Transliteration)

I'm not sure, whether I should decouple this from Zend_Filter_Transliteration.

These two filters were originally a single, 15-line filter that created an SEO-friendly URL from a string.

Zend Comments

This proposal is archived because it has a hard dependency on the Zend_Filter_Transliteration component proposal, which is already archived.