Page 1 of 1

problem with description in metadesc plugin

Posted: Wed Sep 24, 2008 11:32 pm
by konus
Hello,
I use the serendipity_event_metadesc plugin and found a small bug by generating the content of <meta name="description" content=xxx">

Usually, i start every entry with an image and was wondering, why the description was filled up with html-entities and code like >a class='serendipity_image_link' ...

I found out, that in serendipity_event_metadesc.php at line 83 and following, the code for extraction the description out of the entry had a problem. He was looking for <p>-tags which normally don't exist in blog-entries. Using substr() seems to convert all html-tags to entities, so calling strip_tags() did not help. (But it was never called BTW, because the function exited not founding the first <p>-tag.

So I changed this code

Code: Select all

    function extract_description($text) {
        $x = strpos($text, '<p>');
        if ($x === false) {
            return substr($text, 0, 120);
        }

        $y = strpos($text, '</p>');
        if ($y === false) {
            return substr($text, 0, 120);
        }

        $title = substr($text, $x+3, $y-($x+3));
        $title = strip_tags($title);
        return $title;
    }
to

Code: Select all

    function extract_description($text) {
        $text = strip_html($text);
        $text = substr($text, 0, 120);
        return $text;
    }
because I wanted to strip also all text between <script></script> I found a good example and changed the code finally to

Code: Select all

    /**
    * Remove HTML tags, including invisible text such as style and
    * script code, and embedded objects.  Add line breaks around
    * block-level tags to prevent word joining after tag removal.
    * http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
    */
    function strip_html_tags($text) {
    	$text = preg_replace(
    	array(
          // Remove invisible content
            '@<iframe[^>]*?>.*?</iframe>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
          // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );
    return strip_tags($text);
}
    
    function extract_description($text) {
        $text = $this->strip_html_tags($text);
        $text = substr($text, 0, 120);
        $text = str_replace('"','',$text);
        return $text;
    }
Note, that I also stript quotes.
In my case the plugin is now working much better. It would be nice, if someone could review the code and then submit it to SVN.

Edit: replaced HEAD with IFRAME

Re: problem with description in metadesc plugin

Posted: Thu Sep 25, 2008 10:04 am
by garvinhicking
Hi!

Yes, the plugin was built for people that use <p> tags. In your case, your patch would destroy the currently working text extraction for them.

It would be nice to make your patch optional so that one can switch a config option "Use <p> as source" or "Use fulltext minus HTML tags as source".

Do you want to try this and then submit your patch so that I can commit it? :)

Regards,
Garvin

Posted: Thu Sep 25, 2008 10:58 am
by konus
I wasn't aware, that there could be people outside, using p-Tags, because my installation does not generate them. Everything is in div's and line-breaks are made with br :)
Is this really a fact to be considered? In this case you are right, the new behavior of the plugin might be a setback for them.

I am pretty sure, that i am not able to code a option to switch. This would be way beyond my abilities. But I could try the following:

look for p-tags, if found, use behavior A (extract the text between them)
if not found use the new behavior B (Use the first 120 chars of full-text minus HTML tags as source)

Some questions:
Do you think, that in case of A (Use the text between <p> as source) it would desired NOT to strip html-tags?
Should there be a char limit for case A (<p></p> could be long) too?

Posted: Thu Sep 25, 2008 3:39 pm
by Don Chambers
Entries are written using <p> tag with the wysiwyg editor - but of course, anyone using the html editor could insert them as well.

For the purposes of this plugin, all it is trying to do is get the first 120 characters of an entry to use as a <meta description> if one is not provided specifically for the entry.

Is there really any harm in stripping all html to arrive at this default meta description? I don't personally see the harm, but perhaps I am missing something here.

What about strip_tags() first followed by substr()?

Posted: Thu Sep 25, 2008 7:43 pm
by garvinhicking
Hi!

I agree with Don and Konus -- stripping all HTML would be fine. But we cannot strip it first, because once that is done, there is no more <p> Tag. :) :)

Regards
Garvin

Posted: Thu Sep 25, 2008 7:59 pm
by konus
Don Chambers wrote:Entries are written using <p> tag with the wysiwyg editor - but of course, anyone using the html editor could insert them as well.
Ah, I see!
Is there really any harm in stripping all html to arrive at this default meta description? I don't personally see the harm, but perhaps I am missing something here.

What about strip_tags() first followed by substr()?
Thats actually what I did, except the fact, that strip_html_tags() also takes care of the stuff between SCRIPT, EMBED or similar (BTW I think including IFRAME would be also a good idea. Google Maps come with IFRAME. HEAD could be dropped, because it shouldn't be in the entry. (I changed it in the script above)

The drawback of not searching after P-Tags would be, that the description is always extracted from the beginning to pos 120. With looking for P-Tags, this could be a little more flexible, description could be from pos 35 (after <h5>blabla</h5> to pos 536.
Edit: is this an important function?

Posted: Thu Sep 25, 2008 9:20 pm
by Don Chambers
garvinhicking wrote:...stripping all HTML would be fine. But we cannot strip it first, because once that is done, there is no more <p> Tag. :) :)

Regards
Garvin
So there is no <p> tag... what harm is there in that? Now we are starting our character count beginning at the first character AFTER all html, including paragraph tags, have been removed.... this is for generating a meta description - I still do not see the problem.

Posted: Fri Sep 26, 2008 11:14 am
by garvinhicking
Hi!
So there is no <p> tag... what harm is there in that?
When the <p> Tag is stripped, the code can no longer search for the content within that <p> Tag!

Imagine this HTML:

Code: Select all

<div><img src=".."> here is an unimportant image caption, it's a long, long text to go.
<p>But this is the important text</p>
after stripping this is:

Code: Select all

Here is an unimportant image caption, it's a long text to go.
But this is the important text!
and now the code would take the first 120 characters, because it cannot match on any <p> Tag it will use "Here is an unimportant image caption" as your metadescription. Not quite what we wanted...

Regards,
Garvin

Posted: Fri Sep 26, 2008 3:45 pm
by Don Chambers
The current behavior of the plugin, which I never altered, was to seek the first 120 characters of the entry body contained between <p> tags, if they exist. I believe it otherwise just takes the first 120 characters if there are no <p> tags. This becomes the <meta description>.

The <meta keywords> builds a list of keywords contained by user-supplied tags. The default is <b> and <strong>, but the backend allows a user to change this.

Again, both of these are defaults - the piece that I added (with Judebert's help) was to allow these, and the <title> element of an entry, to be specifically provided for any entry, and not automagically generated.

If an entry is written using the wysiwyg editor, it will begin with a <p>, and given the 120 character count, changes are pretty good that the first paragraph will contain all the text that will become the default <meta description>. There is nothing really significant about this text just because it is contained between <p> tags. At least one exception would be an entry constructed like this:

<p><b>My view of current events.</b></p>
<p>blah blah blah blah....</p>

If an entry is written with the html editor, odds are that few authors bother using <p> tags, especially if they are also using the nl2br plugin, so now the plugin is just using the first 120 characters without any regard for "significance".

To presume that text contained within a paragraph tag has some significance over text that is not contained within a <p> is, IHMO, making a presumption that probably does not apply in most cases. Before the plugin had the ability to allow an author to specifically write the <meta description>, it was probably as good a guess as any.

I continue to stand by my feeling that it is best to strip all html from the entry body prior to truncating it to just the first 120 characters for the purposes of generating the <meta description>.... I would be willing to bet that, with or without <p> tags, 99.9% of what is happening is still this first 120 characters.... the problem at the moment is that it includes escaped html, and that definitely needs to be removed.

Another choice MIGHT be to provide functionality similar to the <meta keywords>, where a list of tags can be provided from which to generate a default <meta description> with a fallback to just the first 120 characters in the event those tags are not found.

I personally use this plugin, and am not a good enough writer that my first sentence or two is such a clear explanation of the entry's "who, what, when, where, how". So, in my case, the first 120 characters, enclosed in a <p> tag or not, is not usually a great <meta description>.... instead, I find the ability to provide my own <meta description> far more valuable than the automagically generated one.