
<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

<ac:macro ac:name="unmigrated-inline-wiki-markup"><ac:plain-text-body><![CDATA[

Zend Framework: Zend_Markup Component Proposal

Proposed Component Name: Zend_Markup
Developer Notes: http://framework.zend.com/wiki/display/ZFDEV/Zend_Markup
Proposers: Pieter Kokx
Zend Liaison: Ralph Schindler
Revision:
  1.1 - 31 January 2008: Created the proposal.
  1.2 - 4 February 2008: Finished the proposal and submitted for community review.
  1.3 - 25 April 2008: Changed a use case.
  1.4 - 26 April 2008: Added the source code.
  1.5 - 5 June 2008: Added link to subversion repository.
  2.0 - 9 August 2008: Refactored Zend_TextParser to Zend_Markup
  2.1 - 25 October 2008: Submitted the proposal for community review (wiki revision: 38)

1. Overview

Zend_Markup should provide an extensible way to tokenize and render lightweight markup languages, like BBcode and Textile.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • This component will provide the extensibility to tokenize different lightweight markup languages.
  • This component will provide the extensibility to render into different (lightweight) markup languages.
  • This component will provide an easy way to create your own tags.
  • It will not be possible to retrieve the original source from the rendered text.

4. Dependencies on Other Framework Components

  • Zend_Exception
  • Zend_Filter
  • Zend_Loader_PluginLoader

5. Theory of Operation

Zend_Markup tokenizes lightweight markup languages and renders them into another format. Because there are many lightweight markup languages, it should be compatible with the most important ones. It should also be possible to create your own parsers.

Zend_Markup consists of parsers and renderers. A parser splits the input text into an array containing all the information it could extract from that text.

For example, the Zend_Markup_Parser_Bbcode parser should produce this array from the string '[tag="a" attr=val]value[/tag]':

A renderer loops through the generated array and uses the provided information to generate the output.
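
As a rough illustration of the parser/renderer split described above, the following self-contained sketch tokenizes a small BBCode-like string into a flat array and then loops through that array to build HTML. The function names (sketchTokenize, sketchRender), the token keys ('type', 'name', 'text') and the tag map are assumptions made for this example only; they are not the array format or the API of Zend_Markup, and the sketch does not check that tags are properly nested.

```php
<?php
// Hypothetical sketch: NOT the Zend_Markup token format or API.

/**
 * Split BBCode-like input into a flat list of tokens.
 * Each token is array('type' => 'open'|'close'|'text', ...).
 */
function sketchTokenize($input)
{
    // Split on [tag] and [/tag], keeping the delimiters in the result.
    $parts = preg_split('#(\[/?[a-z]+\])#', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $tokens = array();
    foreach ($parts as $part) {
        if (preg_match('#^\[(/?)([a-z]+)\]$#', $part, $m)) {
            $tokens[] = array(
                'type' => $m[1] === '/' ? 'close' : 'open',
                'name' => $m[2],
            );
        } else {
            $tokens[] = array('type' => 'text', 'text' => $part);
        }
    }
    return $tokens;
}

/**
 * Loop through the token list and build HTML from it.
 * Tags missing from the map are skipped; text is always escaped.
 */
function sketchRender(array $tokens)
{
    $map = array('b' => 'strong', 'i' => 'em'); // assumed tag map
    $out = '';
    foreach ($tokens as $token) {
        if ($token['type'] === 'text') {
            $out .= htmlspecialchars($token['text']);
        } elseif (isset($map[$token['name']])) {
            $out .= ($token['type'] === 'open' ? '<' : '</')
                  . $map[$token['name']] . '>';
        }
    }
    return $out;
}

echo sketchRender(sketchTokenize('[b]bold[/b] and [i]italic[/i]')), "\n";
```

Because the renderer only sees the token array, a second renderer with a different tag map could reuse the same tokenizer output unchanged.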

6. Milestones / Tasks

  • Milestone 1: [DONE] Create the proposal
  • Milestone 2: [DONE] Initial class design
  • Milestone 3: [DONE] Submit the proposal for community review
  • Milestone 4: [DONE] Create working prototype (see the bottom of this proposal)
  • Milestone 5: [DONE] Create code-covering unit tests.

7. Class Index

  • Zend_Markup
  • Zend_Markup_Parser_Interface
  • Zend_Markup_Parser_BbCode
  • Zend_Markup_Parser_Textile
  • Zend_Markup_Parser_Parsed
  • Zend_Markup_Renderer_Abstract
  • Zend_Markup_Renderer_Html
  • Zend_View_Helper_Markup

8. Use Cases

UC-01

Simple usage (bbcode)

UC-02

Creating your own tags (bbcode)

UC-03

Create an object with a textile parser (uses the same renderer as with the BBcode parser).

UC-04

Filtering (bbcode)

9. Class Skeletons

Zend_Markup_Parser_Interface
Zend_Markup_Renderer_Abstract
Zend_Markup
Zend_View_Helper_Markup

You can retrieve the current code from the incubator.

]]></ac:plain-text-body></ac:macro>

]]></ac:plain-text-body></ac:macro>

  1. Feb 05, 2008

    <p>According to the coding guidelines, Bbcode should be BbCode. It would also be better to use attribute(s) instead of attr(s).</p>

    1. Feb 08, 2008

      <p>That's a pretty good idea; when I'm back from work I'm going to change that.</p>

  2. Feb 08, 2008

    <p>I'd love to see support for Markdown [1] added.</p>

    <p>[1] <a class="external-link" href="http://daringfireball.net/projects/markdown/">http://daringfireball.net/projects/markdown/</a></p>

    1. Feb 08, 2008

      <p>In the beginning, only tokenizers for BBcode and MediaWiki will be available, but you can still write your own tokenizer for any language. If you would like, you can also help me with a tokenizer for Markdown.</p>

      <p>I think I will write a tokenizer for Markdown later, but first I'm going to build out my current ideas for Zend_TextParser.</p>

  3. Feb 13, 2008

    <p>Hi, <br />
    This looks really interesting, I've actually been thinking about a similar idea. One thing I thought of is parsing the given mark-up into a DOMDocument first, and then writing that back out as HTML. This would allow you to convert any mark-up, to any other mark-up, and give you a common, standard, in-between format. Something like:</p>

    <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
    "BBCode" -> BBCode Tokenizer & Parser -> (DOMDocument) -> saveHTML() // get HTML output
                                                           -> MediaWiki Renderer -> "MediaWiki"
                                                           -> Textile Renderer -> "Textile"
    ]]></ac:plain-text-body></ac:macro>

    <p>And so on. Of course I have no idea how to write tokenizers or parsers (something I'm trying to learn), so I can't really do this, but I thought I'd make the suggestion.</p>

    1. Feb 16, 2008

      <p>Hi Jack,</p>

      <p>Can you explain your idea for using DOMDocument further, with an example of how you would use it? Right now it looks like it has a lot of unnecessary overhead.</p>

      <p>And writing tokenizers isn't very hard; you can look at the interface for the return format.</p>

      1. Feb 19, 2008

        <p>Hi Pieter,<br />
        Well, my general idea was that, rather than reading in BBCode and then directly writing it back out as HTML, you could instead parse the BBCode and store it in a standard intermediary format, something that could, for instance, be serialized and saved into a database. Making this format an extension of DOMDocument would allow for modifications to the DOM and allow people to do all sorts of other processing on the markup. This is roughly how it could work:</p>

        <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
        $parser = new Zend_TextParser_BBCode_Parser();
        $dom = $parser->parse('[b][i]some text[/b][/i] some text');

        // to get HTML you'd then do
        echo $dom->saveHTML();

        // or you could render it back out as another format
        $renderer = new Zend_TextParser_Textile_Renderer();
        echo $renderer->render($dom);
        ]]></ac:plain-text-body></ac:macro>

        <p>And say for instance it were a news article with various header tags, if the mark-up was available as a DOMDocument object, someone could very easily scan through it (manually or via XPath), and build a table of contents based on the header tags, in an OO fashion. And there are all sorts of other possibilities.</p>
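
        <p>A minimal sketch of the table-of-contents idea, using only PHP's bundled DOM extension. The helper name buildToc() and the sample HTML are made up for illustration, and nothing here depends on the proposed component:</p>

```php
<?php
// Illustrative only: buildToc() is a hypothetical helper, not proposal code.

/**
 * Collect the text of all h1-h3 elements, in document order.
 * Returns a list of array('level' => int, 'title' => string).
 */
function buildToc(DOMDocument $dom)
{
    $xpath = new DOMXPath($dom);
    $toc   = array();
    foreach ($xpath->query('//h1 | //h2 | //h3') as $heading) {
        $toc[] = array(
            'level' => (int) substr($heading->nodeName, 1), // "h2" -> 2
            'title' => trim($heading->textContent),
        );
    }
    return $toc;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><h1>News</h1><p>Intro</p><h2>Details</h2></body></html>');

// Print an indented outline, two spaces per heading level.
foreach (buildToc($dom) as $entry) {
    echo str_repeat('  ', $entry['level'] - 1), $entry['title'], "\n";
}
```

        <p>The same query would work on any document that ends up as a DOMDocument, whichever markup language it was originally written in.</p>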

        <p>You also have the advantage of not having to parse and render the mark-up every time you need it. If for example you stored the serialized version of the DOM object in your database, rather than the raw BBCode, you only need to render it each time.</p>

        <p>Of course I've no idea what the overheads would be like by doing this, but it's just an idea. In my mind it seems quite tidy to have the data in a standard, almost mark-up language agnostic format.</p>

        1. Feb 19, 2008

          <p>Also, using DOMDocument allows you to write the mark-up back out as XML (if you wanted it as XHTML), and the loadHTML() method allows you to parse even badly formed HTML. You could, for example, load in an external document (without worrying whether it's valid) and then write that back out as BBCode. Like:</p>

          <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
          $dom = new DOMDocument();
          $dom->loadHTML(file_get_contents('http://example.com'));

          $renderer = new Zend_TextParser_BBCode_Renderer();
          echo $renderer->render($dom);
          ]]></ac:plain-text-body></ac:macro>

          <p>I'm probably getting carried away with the examples here, but I'm sure you can see the flexibility that using DOMDocument introduces.</p>

          1. Oct 05, 2009

            <p>Hi Jack,</p>

            <p>There are still some problems with using DOMDocument. Many developers like to have [code] tags with syntax highlighting on their sites. As you know, it is very hard to remove syntax highlighting, and that isn't the task of a renderer. OK, it would be possible, but it would have a really huge overhead.</p>

            <p>Also, your example should be possible later, in this form:</p>

            <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
            $html = Zend_TextParser::factory('html', 'bbcode');

            echo $html->render(file_get_contents('http://example.com'));
            ]]></ac:plain-text-body></ac:macro>

            <p>Also, using DOMDocument means that you have to implement a renderer inside a tokenizer, which makes tokenizers a lot more complicated. And it doesn't really conform to the framework's DRY and KISS standards.</p>

  4. Mar 30, 2008

    <p>A few comments:</p>

    <ul>
    <li>A tokenizer implies that there is a set of Tokens that are generated. Does this mean that there is a standard set of tokens that will be shared by all parsers and generators?</li>
    <li>How are token trees handled? For example, [b][i]Foo[/i][/b] implies that there is a token for bold with contents "[i]Foo[/i]", but that further implies there is a token within it for italic with contents "Foo".</li>
    <li>Will tokenized input be able to be serialized for output in any format that a renderer is available for?</li>
    </ul>

    <p>Just a few questions to get started <ac:emoticon ac:name="wink" /></p>

    <p>-ralph</p>

    1. Mar 30, 2008

      <p>I don't have much time, so I will only answer your last question now; I will answer the other two later.</p>

      <p>Yes, you can pass the same tokenizer output to any renderer.</p>

      1. May 17, 2008

        <p>Well, a tokenizer generates an array as described in "Theory of Operation". This is totally independent of the renderer, so there is a small problem with tokenizers like the MediaWiki tokenizer. For example, if you pass "''some text''" to the MediaWiki tokenizer, it will output an array with three tokens. The first token has the tagname "i" and contains "''"; it also defines "''" as the stopper. The second token contains only "some text". The third token contains exactly the same information as the first token.</p>

        <p>There are also some tokens defined in the MediaWiki tokenizer which aren't supported by a renderer. This is because a renderer isn't able to gather information like the signature of the current user. But you can still use '~~~' with a MediaWiki tokenizer, and if you define a new tag (see UC-02) of the single-replace callback type (not yet implemented!) with the name 'signature', then it will work fine <ac:emoticon ac:name="wink" />.</p>

        <p>Token trees are handled very simply. For example, given input like "[b]...[i]...[/b]...", it will output something like: "<strong>...<em>...</em></strong><em>...</em>". This is because the renderer first sees the '[b]' token and adds all the text to a string until the '[i]' token, where it creates another string for the italic tag. All the text is added to the italic tag until the '[/b]' tag is found. There it first ends the italic tag and adds it to the bold tag, which is then closed as well. After that, it restarts the italic tag with exactly the same parameters and adds it to the return string.</p>
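
        <p>A rough, self-contained sketch of that close-and-restart behaviour, assuming a simple stack of open tags. The function name renderMisnested() and the tag map are invented for this example; it is not the component's renderer and it ignores attributes entirely:</p>

```php
<?php
// Hypothetical sketch of the behaviour described above; not the component's renderer.

function renderMisnested($input)
{
    $map   = array('b' => 'strong', 'i' => 'em'); // assumed tag map
    $parts = preg_split('#(\[/?[a-z]+\])#', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $open = array(); // stack of currently open BBCode tag names
    $out  = '';

    foreach ($parts as $part) {
        if (!preg_match('#^\[(/?)([a-z]+)\]$#', $part, $m) || !isset($map[$m[2]])) {
            $out .= htmlspecialchars($part); // plain text or unknown tag
            continue;
        }
        $name = $m[2];
        if ($m[1] === '') { // opening tag
            $open[] = $name;
            $out   .= '<' . $map[$name] . '>';
            continue;
        }
        if (!in_array($name, $open, true)) { // stray closing tag: ignore it
            continue;
        }
        // Close everything opened after $name, innermost first.
        $reopen = array();
        while (($top = array_pop($open)) !== $name) {
            $out     .= '</' . $map[$top] . '>';
            $reopen[] = $top;
        }
        $out .= '</' . $map[$name] . '>';
        // Restart the interrupted tags in their original order.
        foreach (array_reverse($reopen) as $tag) {
            $open[] = $tag;
            $out   .= '<' . $map[$tag] . '>';
        }
    }
    // Close whatever is still open at the end of the input.
    while ($open) {
        $out .= '</' . $map[array_pop($open)] . '>';
    }
    return $out;
}

echo renderMisnested('[b]one [i]two[/b] three'), "\n";
```

        <p>The stack is what lets the italic tag be ended inside the bold tag and started again afterwards, so the generated HTML stays well-formed even when the input is not.</p>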

  5. Jun 23, 2008

    <ac:macro ac:name="note"><ac:parameter ac:name="title">Zend Official Response</ac:parameter><ac:rich-text-body>

    <p>This proposal is currently accepted into the Zend Laboratory as we'd like to see the following points explored and developed:</p>

    <ul>
    <li>Move Zend_TextParser into the Zend_Text_Parser namespaces (there will be other Zend_Text proposals in this space as well)</li>
    <li>We need to understand the use cases better with respect to what this component <strong>WILL NOT</strong> provide (See section 3 above), currently, this proposal is quite open ended.</li>
    <li>We need to understand better the method that will be implemented to "parse".</li>
    <li>Other areas we'd like to explore:
    <ul>
    <li>Currently, it seems as though the approach is token replacement</li>
    <li>How will the Parser handle non-well-formed data</li>
    <li>What relationship will this have with DomDocument</li>
    </ul>
    </li>
    </ul>

    </ac:rich-text-body></ac:macro>

  6. Oct 25, 2008

    <p>It'd be great if you could define languages in a declarative way. Seems like it would make it easier for users to add or extend their own.</p>

    <p>Taking this to an extreme, one would define something like BBCode in BNF, then have a parser generator (adapted from any of the C ones) spit out a parser on initial load. Parsing would be pretty darn fast in that case.</p>

    <p>Now, you can meet somewhere in the middle on this. Defining each token in its own object, putting them in a stack, and then transforming the text by popping off the tokens would potentially allow a declarative syntax.</p>

    <p>What I'm thinking is something like this:</p>

    <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
    <?php
    class Zend_Markup_Language_BBCode
    {
        public function defineTokens(Zend_Markup_Lexer $lexer)
        {
            $lexer->addToken(...);
        }
    }
    ]]></ac:plain-text-body></ac:macro>
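
    <p>To make the declarative idea a bit more concrete, here is one way it might be filled in. SketchLexer, its addToken() signature and the three token patterns are all assumptions invented for this example; they are not part of the proposal:</p>

```php
<?php
// Hypothetical sketch: SketchLexer and its methods are invented for illustration.

class SketchLexer
{
    private $patterns = array(); // token name => regex fragment, in priority order

    public function addToken($name, $pattern)
    {
        $this->patterns[$name] = $pattern;
        return $this; // allow chained declarations
    }

    /** Return a list of array(tokenName, matchedText) pairs. */
    public function tokenize($input)
    {
        $tokens = array();
        $offset = 0;
        $length = strlen($input);
        while ($offset < $length) {
            foreach ($this->patterns as $name => $pattern) {
                // \G anchors the match at the current offset.
                if (preg_match('#\G(?:' . $pattern . ')#', $input, $m, 0, $offset)
                    && $m[0] !== '') {
                    $tokens[] = array($name, $m[0]);
                    $offset  += strlen($m[0]);
                    continue 2; // move on to the next position in the input
                }
            }
            throw new RuntimeException("No token matches at offset $offset");
        }
        return $tokens;
    }
}

// A language is then just a list of declarations.
$lexer = new SketchLexer();
$lexer->addToken('open',  '\[[a-z]+\]')
      ->addToken('close', '\[/[a-z]+\]')
      ->addToken('text',  '[^\[]+|\[');
```

    <p>Adding or extending a language would then mean adding declarations rather than writing a new scanning loop, although tokens whose meaning depends on context would need more than a pattern list.</p>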

    1. Oct 25, 2008

      <p>That is a pretty good idea, but I think there is a problem. My Bbcode parser treats Bbcode as a language like SGML or XML: the tags don't have any meaning at all (because that part is done by the renderer). Languages like Textile are a bit different; they have tokens like '*', which marks bold text, or '_', which marks italic text.</p>

      1. Oct 25, 2008

        <p>I'm not sure about the benefit of decoupling the "renderer" (that is, the transformer) if you don't put it to good use: allow transformations between markup languages. That is, allow transformations between Textile and Markdown, for instance.</p>

        <p>After all, most of these languages are tied directly to HTML. I'm not sure if there is any meaning in alternate transformations of BBCode, for instance. Obviously a data interchange format like JSON is meaningless, so we're talking about presentation. BBCode => PDF, for instance. But most of these formats permit HTML embedded in the text... translating some but not all of the code to PDF format is not helpful. These lightweight markup languages were designed for HTML, so why not couple them to HTML?</p>

        <p>One difficulty with decoupling is that you don't have a one-to-one correspondence. Take Markdown's block quote for instance:</p>

        <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
        > Zend Framework (ZF) is an open source framework for developing web
        > applications and services with PHP 5. ZF is implemented using 100%
        > object-oriented code. The component structure of ZF is somewhat
        > unique; each component is designed with few dependencies on other
        > components. This loosely coupled architecture allows developers to
        > use components individually. We often call this a "use-at-will"
        > design.
        ]]></ac:plain-text-body></ac:macro>

        <p>This translates to:</p>

        <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
        <blockquote>
        Zend Framework (ZF) is an open source framework for developing web
        applications and services with PHP 5. ZF is implemented using 100%
        object-oriented code. The component structure of ZF is somewhat
        unique; each component is designed with few dependencies on other
        components. This loosely coupled architecture allows developers to
        use components individually. We often call this a "use-at-will"
        design.
        </blockquote>
        ]]></ac:plain-text-body></ac:macro>

        1. Oct 26, 2008

          <p>That is the whole point of my idea: making it possible to go not only from Textile/Bbcode/Markdown to HTML, but also to do transformations like Textile to Bbcode. Initially not all combinations will be possible, because the transformation from {insert-favorite-language} to HTML is the most important part.</p>

          <p>And yes, it is sometimes pretty hard to write a parser so it will be compatible with the renderers. But if you want to allow transformations between the languages, without a lot of copy-paste, I think this is the best idea.</p>

  7. Nov 12, 2008

    <p>Is it normal that there is nothing inside the Renderer directory in the repository?<br />
    Where can I find the basic HTML renderer class?</p>

    1. Nov 13, 2008

      <p>Yes that is normal, since I decided to rewrite everything, and I still haven't found any time to create a renderer.</p>

      1. Nov 13, 2008

        <p>OK, I understand.<br />
        When do you think you can release a first draft?<br />
        I'm very interested in using it.</p>

        1. Nov 13, 2008

          <p>Currently, I think that it will be this weekend. But if you are really interested, you should look at revisions before 12121. The old Zend_TextParser code was removed in that revision.</p>

          1. Nov 13, 2008

            <p>Thanks, I think I can wait until next week <ac:emoticon ac:name="smile" /></p>

  8. Nov 25, 2008

    <p>This parser should also allow for special markup handling. As you are already parsing the input into a tree structure, one should be able to mark elements as not being allowed inside other elements. For example, a [code] or [blockquote] tag should not be allowed within a [url] tag.</p>

    <p>You should also allow paragraph handling instead of a simple nl2br conversion. If you haven't yet, take a look at Christian Seiler's BBCode parser:</p>

    <p><a class="external-link" href="http://www.christian-seiler.de/projekte/php/bbcode/index_en.html">http://www.christian-seiler.de/projekte/php/bbcode/index_en.html</a></p>

    1. Nov 26, 2008

      <p>About the 'special markup handling': that is already in place. It is what you would call context-awareness. Currently, you can specify which tag is allowed inside another tag, but I am not sure whether that has already been checked into SVN.</p>

      <p>The paragraph handling is quite interesting. It is not that hard to realize.</p>
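
      <p>For comparison with a plain nl2br() call, paragraph handling could be sketched roughly like this. The helper name sketchParagraphs() is made up, and the rule that only completely empty lines separate paragraphs is an assumption of this example:</p>

```php
<?php
// Hypothetical helper, shown only to contrast with a plain nl2br() call.

function sketchParagraphs($text)
{
    // Normalise line endings, then split on one or more empty lines.
    $text   = str_replace(array("\r\n", "\r"), "\n", trim($text));
    $blocks = preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $html = '';
    foreach ($blocks as $block) {
        // Single line breaks inside a paragraph still become <br />.
        $html .= '<p>' . nl2br(htmlspecialchars($block)) . "</p>\n";
    }
    return $html;
}

echo sketchParagraphs("First line\nsecond line\n\nNew paragraph");
```

      <p>Block-level tags such as lists or code blocks would need to be kept out of the paragraph wrapping, which this sketch does not attempt.</p>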

      <p>Also, I have looked at Christian Seiler's BBCode parser several times <ac:emoticon ac:name="wink" />.</p>

  9. Nov 26, 2008

    <p>I'm trying to get Zend_Markup working on my website, but I'm having some problems implementing it.</p>

    <p>Below are the tags that I would like to use:</p>

    <p>[h1]test[/h1]<br />
    [list]<br />
    [*]toto est en bateau<br />
    [/list]<br />
    [url]http://example.org[/url]<br />
    [url=http://example.com]Example[/url]</p>

    <p>None of these tags is working except the "list" element.</p>

    <p>Here is what I did:</p>

    <ac:macro ac:name="code"><ac:plain-text-body><![CDATA[
    $bbCode = Zend_Markup::factory('BbCode', 'Html');

    $allowedIn    = array('i', 'u', 's', 'b');
    $allowsInside = array('i', 'u', 's', 'b');

    $bbCode->addTag('h1',   Zend_Markup::REPLACE, array('start' => '<h1>', 'end' => '</h1>'), $allowedIn, $allowsInside);
    $bbCode->addTag('*',    Zend_Markup::REPLACE, array('start' => '<li>', 'end' => '</li>'), array("list"), $allowsInside);
    $bbCode->addTag('list', Zend_Markup::REPLACE, array('start' => '<ul>', 'end' => '</ul>'), $allowedIn, array('*'));

    $bbCode->render($texte); // the output is exactly the same as the input
    ]]></ac:plain-text-body></ac:macro>

    <p>Can someone help me?</p>

    1. Nov 26, 2008

      <p>Currently, the factory is not tested, so I think there will be some issues with that. And the name of the file (and the last part of the classname) is Bbcode, not BbCode.</p>

      <p>Also, I have a known issue in my bbcode tokenizer, the '[*]' tag will not parse correctly. And the '$allowedIn' and '$allowsInside' parameters are not there currently.</p>

      <p>You should also know that this is pretty much a development snapshot, nowhere near production-ready. There could be a lot of API changes in the future <strong>without</strong> any notification.</p>

  10. Jan 09, 2009

    <ac:macro ac:name="note"><ac:parameter ac:name="title">Zend Official Response</ac:parameter><ac:rich-text-body>

    <p>We like the direction in which this is taking shape. This component is approved for development in the <strong>Standard Incubator</strong>.</p>

    <p>Issues as mentioned on the IRC channel we'd like to see explored while in the incubator:</p>
    <ul>
    <li>Parsers generate a token tree</li>
    <li>There is a standard set of Tokens defined that must be implemented by all parsers, in addition to their own Custom tokens, this would provide maximum compatibility between parsers and generators.</li>
    <li>Parsers should have some element of modes/error settings to determine if the character stream supplied is parsable and/or fixable.</li>
    </ul>

    </ac:rich-text-body></ac:macro>

  11. Mar 18, 2009

    <p>What is the status of this proposal? We'd love to include it in the 1.8 release.</p>

  12. Jul 26, 2009

    <p>Since it doesn't look like the author is pursuing this any longer, I started writing tests for Zend_Markup tonight. If nothing has happened, I'd be willing to take on the responsibility of getting it done. </p>

  13. Jul 29, 2009

    <p>I emailed the author a while back and he said he was still working on this. I haven't seen much activity in svn though. I'd love to see this component.</p>

    1. Aug 05, 2009

      <p>Yep, I am still working on this.</p>

      <p>The recent lack of activity is probably a result of my vacation in France. I have only been back for a few hours, so give me some time to pull everything back together here, and then I will continue to work on a basic HTML renderer.</p>

  14. Jul 30, 2009

    <p>Meanwhile, those who are looking for a PHP 5 solution can check out wikirenderer: <a class="external-link" href="http://wikirenderer.berlios.de/en/">http://wikirenderer.berlios.de/en/</a></p>

  15. Oct 27, 2009

    <p>ezComponents also offers a very interesting "Document" component, which can do transformations between RST (reStructuredText), XHTML, DocBook, eZ Publish XML markup, and wiki markup languages such as Creole, DokuWiki and Confluence. <a class="external-link" href="http://ezcomponents.org/docs/tutorials/Document">http://ezcomponents.org/docs/tutorials/Document</a></p>

  16. Dec 22, 2009

    <p>I would like to find out if I can merge Zend_Phaml (Zend_Haml) into this project, it is an implementation of HAML for PHP that seems to be a good fit. Can someone guide me on this?</p>

    1. Dec 22, 2009

      <p>I only took a quick look at HAML, but it seems that there is a big difference between what HAML does and what Zend_Markup does. Zend_Markup is for content, while HAML is for creating entire pages.</p>