Issues

ZF-2956: Consider the rel-attribute in getLinks

Description

It would be nice if the Zend_Search_Lucene_Document_Html would use the rel-attribute of links. The getLinks method no fetches all links of a document.

Patch:

Index: Search/Lucene/Document/Html.php

--- Search/Lucene/Document/Html.php (revision 9039) +++ Search/Lucene/Document/Html.php (working copy) @@ -105,7 +105,7 @@

     $linkNodes = $this->_doc->getElementsByTagName('a');
     foreach ($linkNodes as $linkNode) {

- if (($href = $linkNode->getAttribute('href')) != '') { + if (($href = $linkNode->getAttribute('href')) != '' && $linkNode->getAttribute('rel') != 'nofollow' ) { $this->_links[] = $href; } }

Comments

Please categorize/fix as needed.

Done.

I don't think it's good idea to have this behavior as default since 'nofollow' initiative is not a W3C standard. But it's really useful to have such option.

links with 'nofollow' rel attribute can be excluded now using the following code:


Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true);
$doc = Zend_Search_Lucene_Document_Html::loadHTML($html);

This functionality is merged into 1.5 and 1.6 release branches. So it will be included into ZF 1.5.3 and ZF 1.6 (documentation mentions it's only available starting from 1.6, so it's "undocumented" feature for the ZF 1.5.3)

Updating for the 1.6.0 release.