problem with regex markup

stm999999999
Regular
Posts: 1531
Joined: Tue Mar 07, 2006 11:25 pm
Location: Berlin, Germany
Contact:

Post by stm999999999 »

Hello,

Garvin suggested that I post this in the English forums.

The markup plugin turns the first line below into the second, with the link doubled:

Code: Select all

http://mueller.example

<a href="http://www.mueller.example" target="_blank"><a href="http://www.mueller.example" target="_blank">http://www.mueller.example</a></a>

I think the problem is here, in the two nearly identical regexes:

Code: Select all

    'SearchArray'=>array( 
      "/([^]_a-z0-9-=\"'\/])((https?|ftp|gopher|news|telnet):\/\/|www\.)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si", 
                         "/^((https?|ftp|gopher|news|telnet):\/\/|www\.)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"
What exactly does this part do?

Code: Select all

([^]_a-z0-9-=\"'\/])


Something else:

The code also replaces www.example.com, but it does not add http:// to the href, so the browser treats it as a relative link and the result is
href="http://mein-blog.example/www.example.com" :-(
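The broken link follows from how an href without a scheme is resolved: it is treated as a path relative to the current page. A minimal sketch of that resolution, written in Python with the standard library's urljoin purely for illustration (the plugin itself is PHP; the base URL is taken from the example above):

```python
from urllib.parse import urljoin

# Page the link appears on (taken from the example in the post).
base = "http://mein-blog.example/"

# An href without a scheme is resolved relative to the current page.
relative = urljoin(base, "www.example.com")

# With the scheme included, the href is absolute and the base is ignored.
absolute = urljoin(base, "http://www.example.com")

print(relative)
print(absolute)
```

This is why the replacement has to write the http:// itself whenever the matched text starts with a bare www.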


One more thing would be nice, which I have seen in other blog systems:

http://www.example.com should become

<a href="http://www.example.com">www.example.com</a>, without the http:// in the link text (but still in the href, of course)!
Ciao, Stephan
judebert
Regular
Posts: 2478
Joined: Sat Oct 15, 2005 6:57 am
Location: Orlando, FL
Contact:

Post by judebert »

Fiercesome.

First, the easy question:
([^]_a-z0-9-=\"'\/]) matches a single character, taken from the set between the brackets ([]). The caret (^) right after the opening bracket negates the set, so the character must NOT be in it. The first end-bracket (]) comes right after the caret, so it is read as a member of the set, not as the closing bracket. So this comes down to "a character that isn't an end-bracket, underscore, lowercase letter, digit, dash, equals sign, double or single quote, or slash." That leaves whitespace, dots, colons, angle brackets (<>), parentheses, some other punctuation, and uppercase letters. However, the "i" at the very end makes the pattern case-insensitive, so uppercase letters are excluded as well.
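To see which characters the class accepts in front of a URL, it can be tried out directly. The following is a sketch in Python rather than PHP, on the assumption that Python's re module treats this particular class the same way PCRE does (a leading ] is literal in both); the helper name allowed_before_url is made up for the illustration.

```python
import re

# The prefix class from the first pattern; re.IGNORECASE mirrors the /i flag.
prefix = re.compile(r"""[^]_a-z0-9-="'/]""", re.IGNORECASE)

def allowed_before_url(ch):
    """Return True if the single character ch is accepted by the class."""
    return prefix.fullmatch(ch) is not None

for ch in [" ", "(", ">", ":", "]", "a", "A", "7", "-", "="]:
    print(repr(ch), allowed_before_url(ch))
```

Whitespace, parentheses, angle brackets and colons are accepted; the end-bracket, letters of either case, digits, the dash and the equals sign are rejected.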

All together, the first entry in the array says:
First a special character as we've just discussed, followed by one of (http://, https://, ftp://, gopher://, news://, or telnet://) OR "www.", followed by as many characters as you can find that are NOT (space, carriage return, newline, parens, caret, dollar, exclamation point, any kind of quotes, bar, or any kind of brackets), without regard to case. It also stores the special character as \1, the "prefix://" OR "www." as \2, the bare protocol name (the nested parentheses) as \3, and the rest of the URL as \4.

The second entry is exactly the same, but without the special character; so the prefix/www goes in \1, the protocol name in \2, and the rest of the URL in \3.

I know that's long-winded, but so is regexp matching. Sorry.

When it does the replacement (I'll bet it has a replacement array too, doesn't it?), you'll see something like this for the second entry:

Code: Select all

<a href="\1\3" target="_blank">\1\3</a>
At least, that's what it SHOULD look like. To remove the "http://" from the link text, change it to:

Code: Select all

<a href="\1\3" target="_blank">\3</a>
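Counting opening parentheses is the quickest way to check which backreference holds what, because the nested protocol alternation is a group of its own. A small sketch in Python (used here only for illustration; the plugin is PHP) that runs the second pattern from the first post against a sample URL. The replacement string is hypothetical, since the plugin's real replacement array is not shown in the thread.

```python
import re

# URL body class copied from the thread's patterns.
URL_BODY = r"""([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)"""

# Second pattern from the first post; re.I | re.S mirror the /si flags.
pattern = re.compile(
    r"^((https?|ftp|gopher|news|telnet)://|www\.)" + URL_BODY,
    re.I | re.S,
)

m = pattern.match("http://www.example.com")
print(m.groups())  # group 1: prefix, group 2: protocol name, group 3: rest

# Hypothetical replacement: full URL in the href, no protocol in the link text.
linked = pattern.sub(
    r'<a href="\1\3" target="_blank">\3</a>',
    "http://www.example.com",
)
print(linked)
```

For the first pattern every number shifts up by one, because the leading character class is captured as group 1.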
Whew. Any questions?
Judebert

Post by stm999999999 »

judebert wrote:
First, the easy question:
([^]_a-z0-9-=\"'\/]) indicates a single character, taken from the group between the brackets ([]). The caret (^) indicates that it should be a character NOT in the group. The first end-bracket (]) is right after the caret, so we understand that it's part of the group, not the ending bracket.

Shouldn't the first ] be escaped with "\"?

judebert wrote:
Whew. Any questions?

Yes, what about the problems? ;-)

1) www.mueller.example is replaced, but with:

Code: Select all

<a href="www.example.com">www.example.com</a>
This is not a valid external URL! It must be:

Code: Select all

<a href="http://www.example.com">www.example.com</a>
Perhaps the case "no http or similar, but www." must be handled by a third regex, which adds the http://?

2) A malformed URI like

Code: Select all

http://example.com” onlick=”alert(document.cookie) 
is replaced by

Code: Select all

<a href="http://example.com”" target="_blank">http://example.com”</a> onlick=”alert(document.cookie)
Is it possible to keep the ” out of the href?



3) My problem with the doubled replacement turned out to be very simple: I had the plugin installed twice! I do not know how that could happen, but it did. :-(



BTW, a source of ideas could be http://minglewithingle.com/code/linktrunc.phps
Ciao, Stephan

Post by judebert »

stm999999999 wrote:
Shouldn't the first ] be escaped with "\"?

You might think so, but no. If you put the ] right after the [ or the negating ^, it's recognized as part of the character set.
stm999999999 wrote:
1) www.mueller.example is replaced, but with:

Code:
<a href="www.example.com">www.example.com</a>

This is not a valid external URL! It must be:

Code:
<a href="http://www.example.com">www.example.com</a>

Perhaps the case "no http or similar, but www." must be handled by a third regex, which adds the http://?

Sorry, I thought you were having the opposite problem. But you've got the solution almost exactly correct. Just remove the "|www\." part (which matches no-protocol links starting with www.) from the first two patterns, and add this third one:

Code: Select all

"/(^|\s)(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"
And of course, you'll need a replacement string, too:

Code: Select all

'\\1<a href="http://\\2">\\2</a>'
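A quick way to check the third pattern is to run it on a sentence with a bare www. address in the middle. This is a sketch in Python, not the plugin's PHP, and the helper name link_bare_www is made up for the illustration. The href is built as http:// plus the match with nothing appended, so a URL that already has a path stays intact.

```python
import re

# Characters that end a URL, copied from the thread's patterns.
URL_CHARS = r"""[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]"""

# Third pattern: start of the text or whitespace, then a bare www. address.
www_only = re.compile(r"(^|\s)(www\." + URL_CHARS + r"*)", re.I | re.S)

def link_bare_www(text):
    # \1 puts back the whitespace (or nothing) that preceded the URL.
    return www_only.sub(r'\1<a href="http://\2">\2</a>', text)

print(link_bare_www("see www.example.com now"))
print(link_bare_www("www.example.com/page.html"))
```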
stm999999999 wrote:
Is it possible to keep the ” out of the href?

The \" in the pattern's list of disallowed characters should make sure that straight double quotes don't get included as part of a match. Are those actually end-double-quotes or something? You could add them to the list of disallowed characters, but I'm not sure how it's going to work, since they're non-ASCII characters.
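Adding the curly quotes to the disallowed characters can at least be tried on the example. The sketch below is in Python, where strings are Unicode and no extra flag is needed; in PHP the same idea would presumably also need the u (UTF-8) pattern modifier so that the multibyte quote characters are treated as single characters, which is an assumption and is not tested here. The names ORIGINAL and EXTENDED are made up for the illustration.

```python
import re

# Disallowed-character class as used in the thread.
ORIGINAL = r"""[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]"""

# Same class with U+201C and U+201D (curly double quotes) added.
EXTENDED = r"""[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>""" + "\u201c\u201d]"

PROTOCOL = r"(https?|ftps?|gopher|news|telnet)://"
text = "http://example.com\u201d onlick=\u201dalert(document.cookie)"

before = re.match(PROTOCOL + ORIGINAL + "*", text, re.I | re.S).group(0)
after = re.match(PROTOCOL + EXTENDED + "*", text, re.I | re.S).group(0)

print(before)  # still ends with the curly quote
print(after)   # cut before the curly quote
```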
Judebert

Post by stm999999999 »

The new code did not work when the URL is not at the beginning of the line.

I changed it a little, and now it works. I also added ftps to the regex:

Code: Select all

    'SearchArray'=>array(
		"/([^]_a-z0-9-=\"'\/])((https?|ftps?|gopher|news|telnet):\/\/)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si",
		"/^((https?|ftps?|gopher|news|telnet):\/\/)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si",
		"/([^]_a-z0-9-=\"'\/])(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si",
		"/(^|\s)(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"
    ),
Ciao, Stephan

Post by stm999999999 »

judebert wrote:
Is it possible to keep the ” out of the href?
The \" in the pattern's list of disallowed characters should make sure that straight double quotes don't get included as part of a match. Are those actually end-double-quotes or something?

Code: Select all

http://example.com” onlick=”alert(document.cookie)
It is an example of a cross-site scripting attempt, which should not be converted in its full length, for security reasons. But I have seen other link-making tools that manage to cut the URL before the ”.

But it does not seem so important, because this should only happen when someone wants to inject evil code. In that case what matters is that the URL gets cut, not that a good, clickable URL comes out of it.
Ciao, Stephan

Post by judebert »

I'm all too glad to let the weird quotes go for now.

The (^|\s) is supposed to mean "start of the line or whitespace". I'm glad everything is working, of course, but I'll have to meditate on this for a while and see if I can grok what happened.
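The difference between the two www. patterns shows up when both are run over a few samples. This is a sketch in Python rather than PHP. One possible explanation for the failure, not confirmed anywhere in the thread, is that the URLs in question were preceded by something other than whitespace, such as a parenthesis or an HTML tag.

```python
import re

URL_CHARS = r"""[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]"""

# "Start of text or whitespace, then www."
after_space = re.compile(r"(^|\s)(www\." + URL_CHARS + "*)", re.I | re.S)

# "Any character outside the prefix class, then www."
after_other = re.compile(
    r"""([^]_a-z0-9-="'/])(www\.""" + URL_CHARS + "*)", re.I | re.S
)

samples = [
    "www.bla.example",
    "bla www.bla.example",
    "bla (www.bla.example)",
    "<p>www.bla.example</p>",
]
for s in samples:
    print(repr(s), bool(after_space.search(s)), bool(after_other.search(s)))
```

In this sketch the whitespace pattern misses the parenthesised and the tag-wrapped address, while the prefix-class pattern misses only the address at the very start of the text, so the two complement each other.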
Judebert

Post by stm999999999 »

What about things like "bla bla (www.bla.example)"?

But, as I said, the code above works for this. Only the ” problem is still open.
Ciao, Stephan

Post by judebert »

That should be handled by the third one, "something not in this set of characters, then www.whatever".

Are those quotes anything other than the standard double-quotes? In the forum entry, they appear to be ending-quotations.
Judebert

Post by stm999999999 »

That should be handled by the third one, "something not in this set of characters, then www.whatever".
Do you mean this?

Code: Select all

"/([^]_a-z0-9-=\"'\/])(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"
This works, I know. But I added it myself, because

Code: Select all

  "/(^|\s)(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si" 
alone did not work for me.


judebert wrote:
Are those quotes anything other than the standard double-quotes? In the forum entry, they appear to be ending-quotations.

In the XSS example? I do not know what they are. It is just copied and pasted from Garvin's post:
http://www.s9y.org/forums/viewtopic.php?p=36950#36818
Ciao, Stephan

Post by judebert »

Okay. I went to the German discussion, and tried the URLs in the WordPress demo at OpenSourceCMS. Two of them are invalid, because they contain characters not allowed in URLs: quotes and umlauts. These were converted up to the invalid character, as expected.

To match more of the URL, including invalid characters, would require something additional, like converting them from UTF-8 to ASCII. As an ignorant American, I'm afraid I don't have much experience along those lines. (I speak a little Spanish, but not enough to ask for a quarter-pound of garlic salami in Ecuador. Different story.)

To match and repair invalid URLs involving styles, targets, and onclicks, well... that'd take one heck of a regexp string. You could add something to just remove them, like this:

Code: Select all

/(style|onclick|target)\s*=\s*"[^"]*"/
And make its replacement string the empty string. That should remove all such garbage. (It says, "style=, onclick=, or target=, where spaces around the = are unimportant, followed by a quoted string whose contents don't include quotes." But you probably knew that already.)
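The attribute-stripping pattern can be tried out as follows. This is a sketch in Python rather than PHP, and the helper name remove_attrs is made up for the illustration. Two limits are worth noting: as written, the pattern only recognizes straight double quotes and lowercase attribute names, so it leaves the curly-quoted example from earlier in the thread untouched, and a regex like this is not a complete defence against script injection.

```python
import re

# Pattern from the post: attribute name, optional spaces around =, quoted value.
strip_attrs = re.compile(r'(style|onclick|target)\s*=\s*"[^"]*"')

def remove_attrs(html):
    """Replace every matching attribute with the empty string."""
    return strip_attrs.sub("", html)

cleaned = remove_attrs('<a href="x" onclick="evil()" target="_blank">x</a>')
print(cleaned)

# Curly quotes and the misspelled attribute are not matched at all.
curly = "onlick=\u201dalert(document.cookie)"
print(remove_attrs(curly) == curly)
```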

As to the (^|\s), well, I'm still meditating on that. It should match start of line or whitespace, as far as I can tell.
Judebert