problem with description in metadesc plugin

Found a bug? Tell us!!
Post Reply
konus
Regular
Posts: 334
Joined: Mon Jun 16, 2008 1:57 pm
Location: Dresden, Germany
Contact:

problem with description in metadesc plugin

Post by konus »

Hello,
I use the serendipity_event_metadesc plugin and found a small bug by generating the content of <meta name="description" content=xxx">

Usually, i start every entry with an image and was wondering, why the description was filled up with html-entities and code like >a class='serendipity_image_link' ...

I found out, that in serendipity_event_metadesc.php at line 83 and following, the code for extraction the description out of the entry had a problem. He was looking for <p>-tags which normally don't exist in blog-entries. Using substr() seems to convert all html-tags to entities, so calling strip_tags() did not help. (But it was never called BTW, because the function exited not founding the first <p>-tag.

So I changed this code

Code: Select all

    function extract_description($text) {
        $x = strpos($text, '<p>');
        if ($x === false) {
            return substr($text, 0, 120);
        }

        $y = strpos($text, '</p>');
        if ($y === false) {
            return substr($text, 0, 120);
        }

        $title = substr($text, $x+3, $y-($x+3));
        $title = strip_tags($title);
        return $title;
    }
to

Code: Select all

    function extract_description($text) {
        $text = strip_html($text);
        $text = substr($text, 0, 120);
        return $text;
    }
because I wanted to strip also all text between <script></script> I found a good example and changed the code finally to

Code: Select all

    /**
    * Remove HTML tags, including invisible text such as style and
    * script code, and embedded objects.  Add line breaks around
    * block-level tags to prevent word joining after tag removal.
    * http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
    */
    function strip_html_tags($text) {
    	$text = preg_replace(
    	array(
          // Remove invisible content
            '@<iframe[^>]*?>.*?</iframe>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
          // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );
    return strip_tags($text);
}
    
    function extract_description($text) {
        $text = $this->strip_html_tags($text);
        $text = substr($text, 0, 120);
        $text = str_replace('"','',$text);
        return $text;
    }
Note, that I also stript quotes.
In my case the plugin is now working much better. It would be nice, if someone could review the code and then submit it to SVN.

Edit: replaced HEAD with IFRAME
Last edited by konus on Thu Sep 25, 2008 7:53 pm, edited 1 time in total.
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany
Contact:

Re: problem with description in metadesc plugin

Post by garvinhicking »

Hi!

Yes, the plugin was built for people that use <p> tags. In your case, your patch would destroy the currently working text extraction for them.

It would be nice to make your patch optional so that one can switch a config option "Use <p> as source" or "Use fulltext minus HTML tags as source".

Do you want to try this and then submit your patch so that I can commit it? :)

Regards,
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
konus
Regular
Posts: 334
Joined: Mon Jun 16, 2008 1:57 pm
Location: Dresden, Germany
Contact:

Post by konus »

I wasn't aware, that there could be people outside, using p-Tags, because my installation does not generate them. Everything is in div's and line-breaks are made with br :)
Is this really a fact to be considered? In this case you are right, the new behavior of the plugin might be a setback for them.

I am pretty sure, that i am not able to code a option to switch. This would be way beyond my abilities. But I could try the following:

look for p-tags, if found, use behavior A (extract the text between them)
if not found use the new behavior B (Use the first 120 chars of full-text minus HTML tags as source)

Some questions:
Do you think, that in case of A (Use the text between <p> as source) it would desired NOT to strip html-tags?
Should there be a char limit for case A (<p></p> could be long) too?
Don Chambers
Regular
Posts: 3657
Joined: Mon Feb 13, 2006 2:40 am
Location: Chicago, IL, USA
Contact:

Post by Don Chambers »

Entries are written using <p> tag with the wysiwyg editor - but of course, anyone using the html editor could insert them as well.

For the purposes of this plugin, all it is trying to do is get the first 120 characters of an entry to use as a <meta description> if one is not provided specifically for the entry.

Is there really any harm in stripping all html to arrive at this default meta description? I don't personally see the harm, but perhaps I am missing something here.

What about strip_tags() first followed by substr()?
=Don=
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany
Contact:

Post by garvinhicking »

Hi!

I agree with Don and Konus -- stripping all HTML would be fine. But we cannot strip it first, because once that is done, there is no more <p> Tag. :) :)

Regards
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
konus
Regular
Posts: 334
Joined: Mon Jun 16, 2008 1:57 pm
Location: Dresden, Germany
Contact:

Post by konus »

Don Chambers wrote:Entries are written using <p> tag with the wysiwyg editor - but of course, anyone using the html editor could insert them as well.
Ah, I see!
Is there really any harm in stripping all html to arrive at this default meta description? I don't personally see the harm, but perhaps I am missing something here.

What about strip_tags() first followed by substr()?
Thats actually what I did, except the fact, that strip_html_tags() also takes care of the stuff between SCRIPT, EMBED or similar (BTW I think including IFRAME would be also a good idea. Google Maps come with IFRAME. HEAD could be dropped, because it shouldn't be in the entry. (I changed it in the script above)

The drawback of not searching after P-Tags would be, that the description is always extracted from the beginning to pos 120. With looking for P-Tags, this could be a little more flexible, description could be from pos 35 (after <h5>blabla</h5> to pos 536.
Edit: is this an important function?
Don Chambers
Regular
Posts: 3657
Joined: Mon Feb 13, 2006 2:40 am
Location: Chicago, IL, USA
Contact:

Post by Don Chambers »

garvinhicking wrote:...stripping all HTML would be fine. But we cannot strip it first, because once that is done, there is no more <p> Tag. :) :)

Regards
Garvin
So there is no <p> tag... what harm is there in that? Now we are starting our character count beginning at the first character AFTER all html, including paragraph tags, have been removed.... this is for generating a meta description - I still do not see the problem.
=Don=
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany
Contact:

Post by garvinhicking »

Hi!
So there is no <p> tag... what harm is there in that?
When the <p> Tag is stripped, the code can no longer search for the content within that <p> Tag!

Imagine this HTML:

Code: Select all

<div><img src=".."> here is an unimportant image caption, it's a long, long text to go.
<p>But this is the important text</p>
after stripping this is:

Code: Select all

Here is an unimportant image caption, it's a long text to go.
But this is the important text!
and now the code would take the first 120 characters, because it cannot match on any <p> Tag it will use "Here is an unimportant image caption" as your metadescription. Not quite what we wanted...

Regards,
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
Don Chambers
Regular
Posts: 3657
Joined: Mon Feb 13, 2006 2:40 am
Location: Chicago, IL, USA
Contact:

Post by Don Chambers »

The current behavior of the plugin, which I never altered, was to seek the first 120 characters of the entry body contained between <p> tags, if they exist. I believe it otherwise just takes the first 120 characters if there are no <p> tags. This becomes the <meta description>.

The <meta keywords> builds a list of keywords contained by user-supplied tags. The default is <b> and <strong>, but the backend allows a user to change this.

Again, both of these are defaults - the piece that I added (with Judebert's help) was to allow these, and the <title> element of an entry, to be specifically provided for any entry, and not automagically generated.

If an entry is written using the wysiwyg editor, it will begin with a <p>, and given the 120 character count, changes are pretty good that the first paragraph will contain all the text that will become the default <meta description>. There is nothing really significant about this text just because it is contained between <p> tags. At least one exception would be an entry constructed like this:

<p><b>My view of current events.</b></p>
<p>blah blah blah blah....</p>

If an entry is written with the html editor, odds are that few authors bother using <p> tags, especially if they are also using the nl2br plugin, so now the plugin is just using the first 120 characters without any regard for "significance".

To presume that text contained within a paragraph tag has some significance over text that is not contained within a <p> is, IHMO, making a presumption that probably does not apply in most cases. Before the plugin had the ability to allow an author to specifically write the <meta description>, it was probably as good a guess as any.

I continue to stand by my feeling that it is best to strip all html from the entry body prior to truncating it to just the first 120 characters for the purposes of generating the <meta description>.... I would be willing to bet that, with or without <p> tags, 99.9% of what is happening is still this first 120 characters.... the problem at the moment is that it includes escaped html, and that definitely needs to be removed.

Another choice MIGHT be to provide functionality similar to the <meta keywords>, where a list of tags can be provided from which to generate a default <meta description> with a fallback to just the first 120 characters in the event those tags are not found.

I personally use this plugin, and am not a good enough writer that my first sentence or two is such a clear explanation of the entry's "who, what, when, where, how". So, in my case, the first 120 characters, enclosed in a <p> tag or not, is not usually a great <meta description>.... instead, I find the ability to provide my own <meta description> far more valuable than the automagically generated one.
=Don=
Post Reply