Added by Pieter Kokx, last edited by Ralph Schindler on Nov 03, 2008

Zend Framework: Zend_Markup Component Proposal

Proposed Component Name: Zend_Markup
Developer Notes: http://framework.zend.com/wiki/display/ZFDEV/Zend_Markup
Proposers: Pieter Kokx
Zend Liaison: Ralph Schindler
Revision:
  • 1.1 - 31 January 2008: Created the proposal.
  • 1.2 - 4 February 2008: Finished the proposal and submitted for community review.
  • 1.3 - 25 April 2008: Changed a use case.
  • 1.4 - 26 April 2008: Added the source code.
  • 1.5 - 5 June 2008: Added link to subversion repository.
  • 2.0 - 9 August 2008: Refactored Zend_TextParser to Zend_Markup
  • 2.1 - 25 October 2008: Submitted the proposal for community review (wiki revision: 34)

1. Overview

Zend_Markup should provide an extensible way to tokenize and render lightweight markup languages, like BBcode and Textile.

2. References

3. Component Requirements, Constraints, and Acceptance Criteria

  • This component will provide the extensibility to tokenize different lightweight markup languages.
  • This component will provide the extensibility to render into different (lightweight) markup languages.
  • This component will provide an easy way to create your own tags.
  • It will not be possible to retrieve the original source from the rendered text.

4. Dependencies on Other Framework Components

  • Zend_Exception
  • Zend_Filter
  • Zend_Loader_PluginLoader

5. Theory of Operation

Zend_Markup tokenizes lightweight markup languages and renders them into another format. Because there are many lightweight markup languages, it should be compatible with the most important ones. It should also be possible to create your own parsers.

Zend_Markup consists of parsers and renderers. A parser splits the input text into an array containing all the information it can extract from that text.

For example, the Zend_Markup_Parser_Bbcode parser should produce this array from the string '[tag="a" attr=val]value[/tag]':

A renderer loops through the generated array and uses the provided information to generate the output.
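The split between parser and renderer can be illustrated with a minimal, self-contained sketch. The token layout (`type`, `name`, `value`, `content`) and the function names `sketchTokenize()` and `sketchRender()` are assumptions made for this illustration; they are not the array format or API of Zend_Markup.

```php
<?php
// Hypothetical token format, used only for this illustration.

/**
 * Split BBcode-like input into a flat list of tokens.
 * Each token has a 'type' of 'text', 'open', or 'close'.
 */
function sketchTokenize(string $input): array
{
    $tokens = [];
    // Split on [tag], [tag=value] and [/tag], keeping the delimiters.
    $parts = preg_split(
        '/(\[\/?[a-z]+(?:=[^\]]*)?\])/i',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    foreach ($parts as $part) {
        if (preg_match('/^\[\/([a-z]+)\]$/i', $part, $m)) {
            $tokens[] = ['type' => 'close', 'name' => strtolower($m[1])];
        } elseif (preg_match('/^\[([a-z]+)(?:=([^\]]*))?\]$/i', $part, $m)) {
            $tokens[] = [
                'type'  => 'open',
                'name'  => strtolower($m[1]),
                'value' => isset($m[2]) ? $m[2] : null,
            ];
        } else {
            $tokens[] = ['type' => 'text', 'content' => $part];
        }
    }
    return $tokens;
}

/**
 * Loop through the tokens and build HTML from a tag map.
 * Tags missing from the map are dropped in this sketch.
 */
function sketchRender(array $tokens, array $tagMap): string
{
    $output = '';
    foreach ($tokens as $token) {
        if ($token['type'] === 'text') {
            $output .= htmlspecialchars($token['content'], ENT_QUOTES, 'UTF-8');
        } elseif (isset($tagMap[$token['name']])) {
            $html = $tagMap[$token['name']];
            $output .= $token['type'] === 'open'
                ? '<' . $html . '>'
                : '</' . $html . '>';
        }
    }
    return $output;
}

$tokens = sketchTokenize('[b]bold & [i]nested[/i][/b]');
echo sketchRender($tokens, ['b' => 'strong', 'i' => 'em']);
// prints: <strong>bold &amp; <em>nested</em></strong>
```

Because the renderer only sees the token array, the same tokens could in principle be passed to a different renderer with a different tag map.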

6. Milestones / Tasks

  • Milestone 1: [DONE] Create the proposal
  • Milestone 2: [DONE] Initial class design
  • Milestone 3: Submit the proposal for community review
  • Milestone 4: Create working prototype (see the bottom of this proposal)
  • Milestone 5: Create code-covering unit tests.

7. Class Index

  • Zend_Markup
  • Zend_Markup_Parser_Interface
  • Zend_Markup_Parser_BbCode
  • Zend_Markup_Parser_Textile
  • Zend_Markup_Parser_Parsed
  • Zend_Markup_Renderer_Abstract
  • Zend_Markup_Renderer_Html
  • Zend_View_Helper_Markup

8. Use Cases

UC-01

Simple usage (bbcode)

UC-02

Creating your own tags (bbcode)

UC-03

Create an object with a textile parser (uses the same renderer as with the BBcode parser).

UC-04

Filtering (bbcode)

9. Class Skeletons

Zend_Markup_Parser_Interface

Zend_Markup_Renderer_Abstract
Zend_Markup
Zend_View_Helper_Markup

You can retrieve the current code from subversion:

According to the coding guidelines, Bbcode should be BbCode. It would also be better to use attribute(s) instead of attr(s).

That's a pretty good idea. When I'm back from work, I'm going to change that.

I'd love to see support for Markdown [1] added.

[1] http://daringfireball.net/projects/markdown/

In the beginning, only tokenizers for BBcode and MediaWiki should be available. But you can still write your own tokenizer for any language. And if you would like, you can also help me with a tokenizer for Markdown.

I think I will write a tokenizer for Markdown later, but first I'm going to build my current ideas for Zend_TextParser.

Hi,
This looks really interesting, I've actually been thinking about a similar idea. One thing I thought of is parsing the given mark-up into a DOMDocument first, and then writing that back out as HTML. This would allow you to convert any mark-up, to any other mark-up, and give you a common, standard, in-between format. Something like:

And so on. Of course I have no idea how to write tokenizers or parsers (something I'm trying to learn), so I can't really do this, but I thought I'd make the suggestion.

Hi Jack,

Can you explain your idea of using DOMDocument further, with an example of how you would use it? Right now it looks like it has a lot of unnecessary overhead.

And writing tokenizers isn't very hard, you can look at the interface for the return format.

Hi Pieter,
Well my general idea was that rather than reading in BBCode, and then directly writing that back out as HTML, you could instead parse the BBCode, and then store it in a standard intermediary format, something that could for instance be serialized and then saved into a database. Making this format an extension of DOMDocument, would allow for modifications to the DOM, and allow people to do all sorts of other processing on the markup. This is roughly how it could work:

And say for instance it were a news article with various header tags, if the mark-up was available as a DOMDocument object, someone could very easily scan through it (manually or via XPath), and build a table of contents based on the header tags, in an OO fashion. And there are all sorts of other possibilities.
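The table-of-contents idea can be sketched with PHP's DOM extension. This assumes the markup is already available as HTML and uses `DOMDocument::loadHTML()` plus `DOMXPath`; the function name `buildTableOfContents()` and the returned array layout are made up for this illustration.

```php
<?php
/**
 * Collect h1 to h3 headings from an HTML fragment, in document order.
 * Hypothetical helper; not part of the proposed component.
 */
function buildTableOfContents(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $toc = [];
    foreach ($xpath->query('//h1 | //h2 | //h3') as $heading) {
        $toc[] = [
            // 'h2' becomes level 2, and so on.
            'level' => (int) substr($heading->nodeName, 1),
            'title' => trim($heading->textContent),
        ];
    }
    return $toc;
}

$toc = buildTableOfContents('<h1>News</h1><p>Body text</p><h2>Details</h2>');
foreach ($toc as $entry) {
    echo str_repeat('  ', $entry['level'] - 1) . $entry['title'] . "\n";
}
```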

You also have the advantage of not having to parse and render the mark-up every time you need it. If for example you stored the serialized version of the DOM object in your database, rather than the raw BBCode, you only need to render it each time.

Of course I've no idea what the overheads would be like by doing this, but it's just an idea. In my mind it seems quite tidy to have the data in a standard, almost mark-up language agnostic format.

Also, using DOMDocument allows you to write the mark-up back out as XML (if you wanted it as XHTML), and the loadHTML() method, allows you to parse even badly formed HTML. You could for example load in an external document (without worrying if it's valid), and then write that back out as BBCode. Like:
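One way such an HTML-to-BBCode conversion could look is sketched below. The function names and the small tag map are assumptions for illustration only, and how `loadHTML()` repairs a given piece of malformed input depends on libxml, so the result for badly formed documents is not predicted here.

```php
<?php
/**
 * Convert a small subset of HTML to BBCode by walking the DOM tree.
 * Hypothetical sketch; only bold and italic elements are mapped.
 */
function htmlToBbcodeSketch(string $html): string
{
    $map = ['b' => 'b', 'strong' => 'b', 'i' => 'i', 'em' => 'i'];

    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($document->documentElement === null) {
        return '';
    }
    return nodeToBbcodeSketch($document->documentElement, $map);
}

function nodeToBbcodeSketch(DOMNode $node, array $map): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }
    // Convert the children first, then wrap them if this element is mapped.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= nodeToBbcodeSketch($child, $map);
    }
    if (isset($map[$node->nodeName])) {
        $tag = $map[$node->nodeName];
        return '[' . $tag . ']' . $inner . '[/' . $tag . ']';
    }
    // Unmapped elements (html, body, p, ...) contribute only their content.
    return $inner;
}

echo htmlToBbcodeSketch('<p><strong>Hello</strong>, <em>world</em>!</p>');
// prints: [b]Hello[/b], [i]world[/i]!
```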

I'm probably getting carried away with the examples here, but I'm sure you can see the flexibility that using DOMDocument introduces.

Hi Jack,

There are still some problems with using DOMDocument. Many developers like to have [code] tags with syntax highlighting on their sites. As you know, it is very hard to remove syntax highlighting again, and that isn't the task of a renderer. Okay, it would be possible, but then it has a really huge overhead.

Also, your example should be possible later, in this form:

Also, using DOMDocument means that you have to implement a renderer inside a tokenizer, which makes tokenizers a lot more complicated.

A few comments:

  • A tokenizer implies that there is a set of Tokens that are generated. Does this mean that there is a standard set of tokens that will be shared by all parsers and generators?
  • How are token trees handled? For example, [b][i]Foo[/i][/b] implies that there is a token for bold with contents "[i]Foo[/i]", which further implies that there is a token within it for italic with contents "Foo".
  • Will tokenized input be able to be serialized for output in any format that a renderer is available for?

Just a few questions to get started.

-ralph

I don't have much time, so I will only answer your last question now; I will answer the other two later.

Yes, you can pass the same tokenizer output to any renderer.

Well, a tokenizer generates an array as described in "Theory of Operation". This is totally independent of the renderer, so there is a small problem with tokenizers like the MediaWiki tokenizer. For example, if you pass "''some text''" to the MediaWiki tokenizer, it will output an array with three tokens. The first token has the tag name "i" and contains "''"; it also defines "''" as the stopper. The second token contains only "some text". The third token contains exactly the same information as the first token.

There are also some tokens defined in the MediaWiki tokenizer that no renderer supports, because a renderer isn't able to gather information like the signature of the current user. You can still use '~~~' with a MediaWiki tokenizer: if you define a new tag (see UC-02) as a single-replace callback (not yet implemented!) with the name 'signature', it will work fine.

Well, token trees are handled very simply. For example, input like "[b]...[i]...[/b]..." will produce something like "<strong>...<em>...</em></strong><em>...</em>". The renderer first sees the '[b]' token and adds all the text to a string until it reaches the '[i]' token, where it creates another string for the italic tag. All the text is added to the italic tag until the '[/b]' tag is found. At that point the renderer first ends the italic tag and adds it to the bold tag, which is then closed as well. After that, it restarts the italic tag with exactly the same parameters and adds it to the return string.
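The close-and-restart behaviour described above can be sketched with a stack of open tags. This is a simplified stand-in, not the renderer's actual code: the source describes building nested strings, while the sketch below writes directly to one output string. The `[type, value]` token pairs and the function name are assumptions for illustration.

```php
<?php
/**
 * Render tokens to HTML, repairing misnested tags by closing the
 * inner tags early and reopening them after the outer tag closes.
 * Each token is a hypothetical [type, value] pair.
 */
function renderMisnestedSketch(array $tokens, array $tagMap): string
{
    $output = '';
    $open = []; // stack of currently open tag names

    foreach ($tokens as $token) {
        list($type, $value) = $token;
        if ($type === 'text') {
            $output .= htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
        } elseif ($type === 'open' && isset($tagMap[$value])) {
            $open[] = $value;
            $output .= '<' . $tagMap[$value] . '>';
        } elseif ($type === 'close' && in_array($value, $open, true)) {
            $reopen = [];
            // Close every tag that was opened after the one being closed.
            while (($name = array_pop($open)) !== $value) {
                $output .= '</' . $tagMap[$name] . '>';
                array_unshift($reopen, $name);
            }
            $output .= '</' . $tagMap[$value] . '>';
            // Restart the interrupted tags in their original order.
            foreach ($reopen as $name) {
                $open[] = $name;
                $output .= '<' . $tagMap[$name] . '>';
            }
        }
    }
    // Close whatever is still open at the end of the input.
    while (($name = array_pop($open)) !== null) {
        $output .= '</' . $tagMap[$name] . '>';
    }
    return $output;
}

// Tokens for "[b]one [i]two [/b]three"
$tokens = [
    ['open', 'b'], ['text', 'one '],
    ['open', 'i'], ['text', 'two '],
    ['close', 'b'], ['text', 'three'],
];
echo renderMisnestedSketch($tokens, ['b' => 'strong', 'i' => 'em']);
// prints: <strong>one <em>two </em></strong><em>three</em>
```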

Zend Official Response

This proposal is currently accepted into the Zend Laboratory as we'd like to see the following points explored and developed:

  • Move Zend_TextParser into the Zend_Text_Parser namespaces (there will be other Zend_Text proposals in this space as well)
  • We need to understand the use cases better with respect to what this component WILL NOT provide (See section 3 above), currently, this proposal is quite open ended.
  • We need to understand better the method that will be implemented to "parse".
  • Other areas we'd like to explore:
    • Currently, it seems as though the approach is token replacement
    • How will the Parser handle non-well-formed data
    • What relationship will this have with DomDocument

It'd be great if you could define languages in a declarative way. Seems like it would make it easier for users to add or extend their own.

Taking this to an extreme, one would define something like BBCode in BNF, then have a parser generator (adapted from any of the C ones) spit out a parser on initial load. Parsing would be pretty darn fast in that case.

Now, you can meet somewhere in the middle on this. Defining each token in its own object, putting them in a stack, and then transforming the text by popping off the tokens would potentially allow a declarative syntax.

What I'm thinking is something like this:
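As one possible reading of this suggestion, a language could be declared as plain data and handed to a generic transformer. The sketch below is a simplification that uses regular-expression replacement rather than a token stack; the `delimiter`/`html` keys and the function name `transformDeclarative()` are assumptions for illustration.

```php
<?php
// Hypothetical declarative definition: each entry describes one inline token.
$textileLike = [
    ['delimiter' => '*', 'html' => 'strong'],
    ['delimiter' => '_', 'html' => 'em'],
];

/**
 * Apply every declared token to the text, in declaration order.
 * Limitation of this sketch: a later definition could also match
 * inside HTML that an earlier definition has already inserted.
 */
function transformDeclarative(string $text, array $definitions): string
{
    $output = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    foreach ($definitions as $definition) {
        $d = preg_quote($definition['delimiter'], '/');
        $tag = $definition['html'];
        // Match the shortest run of text enclosed by a pair of delimiters.
        $output = preg_replace(
            '/' . $d . '(.+?)' . $d . '/s',
            '<' . $tag . '>$1</' . $tag . '>',
            $output
        );
    }
    return $output;
}

echo transformDeclarative('a *bold* and _italic_ word', $textileLike);
// prints: a <strong>bold</strong> and <em>italic</em> word
```

Adding a new token then means adding one array entry instead of writing parser code.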

That is a pretty good idea, but I think there is a problem. My Bbcode parser treats Bbcode as a language like SGML or XML: the tags don't have any meaning at all (because that part is done by the renderer). Languages like Textile are a bit different; they have tokens like '*', which marks bold text, or '_', which marks italic text.

I'm not sure about the benefit of decoupling the "renderer" (that is, the transformer) if you don't put it to good use: allow transformations between markup languages. That is, allow transformations between Textile and Markdown, for instance.

After all, most of these languages are tied directly to HTML. I'm not sure if there is any meaning in alternate transformations of BBCode, for instance. Obviously a data interchange format like JSON is meaningless, so we're talking about presentation. BBCode => PDF, for instance. But most of these formats permit HTML embedded in the text... translating some but not all of the code to PDF format is not helpful. These lightweight markup languages were designed for HTML, so why not couple them to HTML?

One difficulty with decoupling is that you don't have a one-to-one correspondence. Take Markdown's block quote for instance:

This translates to:
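The mismatch can be made concrete with a small sketch: in Markdown every quoted line carries its own `>` prefix, while HTML wraps the whole group in a single element. The function name `blockquoteSketch()` is made up for this illustration, and the handling is far simpler than real Markdown (no nesting, no inline markup).

```php
<?php
/**
 * Convert "> " prefixed lines to one <blockquote> per consecutive group.
 * Hypothetical sketch covering only this single Markdown feature.
 */
function blockquoteSketch(string $markdown): string
{
    $html = '';
    $quoted = [];
    $lines = explode("\n", $markdown);
    $lines[] = ''; // sentinel line that flushes a quote ending the input

    foreach ($lines as $line) {
        if (preg_match('/^> ?(.*)$/', $line, $m)) {
            // Many prefixed lines are collected into one pending quote.
            $quoted[] = $m[1];
            continue;
        }
        if ($quoted !== []) {
            $html .= '<blockquote><p>'
                . htmlspecialchars(implode(' ', $quoted), ENT_QUOTES, 'UTF-8')
                . "</p></blockquote>\n";
            $quoted = [];
        }
        if (trim($line) !== '') {
            $html .= '<p>' . htmlspecialchars($line, ENT_QUOTES, 'UTF-8') . "</p>\n";
        }
    }
    return $html;
}

echo blockquoteSketch("> first line\n> second line\nplain text");
```

Going in the other direction, a renderer would have to decide where to break the single HTML block back into prefixed lines, which is the kind of information a shared token format needs to carry.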

That is the whole point of my idea: making it possible not only to go from Textile/Bbcode/Markdown to HTML, but also to do transformations like Textile to Bbcode. Initially not all combinations will be possible, because the transformation from {insert-favorite-language} to HTML is the most important part.

And yes, it is sometimes pretty hard to write a parser so it will be compatible with the renderers. But if you want to allow transformations between the languages, without a lot of copy-paste, I think this is the best idea.

Is it normal that there is nothing inside the Renderer directory in the repository?
Where can I find the basic HTML renderer class?

Yes that is normal, since I decided to rewrite everything, and I still haven't found any time to create a renderer.

OK, I understand.
When do you think you can release a first draft?
I'm very interested in using it.

Currently, I think that it will be this weekend. But if you are really interested in that, you should look at the revisions before 12121. The old Zend_TextParser code was removed in that revision.

Thanks, I think I can wait until next week.