problem with regex markup

stm999999999 · Post by **stm999999999** » Wed Aug 30, 2006 1:10 pm

Hello,

Garvin told me it would be better to post this in the English forums:

The markup turns

http://mueller.example

into

<a href="http://www.mueller.example" target="_blank"><a href="http://www.mueller.example" target="_blank">http://www.mueller.example</a></a>

I think the problem is here, in the doubled regex code:

Code: Select all

    'SearchArray'=>array( 
      "/([^]_a-z0-9-=\"'\/])((https?|ftp|gopher|news|telnet):\/\/|www\.)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si", 
                         "/^((https?|ftp|gopher|news|telnet):\/\/|www\.)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"

What exactly does this do:

Code: Select all

([^]_a-z0-9-=\"'\/])

?

and something else:

The code also replaces www.example.com, but it does not add http:// to the href, so the browser treats it as a relative link and the result is
href="http://mein-blog.example/www.example.com"

One more thing that would be nice, which I have seen in other blog systems:

http://www.example.com should become

<a href="http://www.example.com">www.example.com</a>, without the http:// in the link-text (not in the href of course)!

judebert · Post by **judebert** » Thu Aug 31, 2006 9:24 pm

Fiercesome.

First, the easy question:
([^]_a-z0-9-=\"'\/]) matches a single character, taken from the set between the brackets ([]). The caret (^) negates the set, so the character must NOT be in it. The first end-bracket (]) comes right after the caret, so it is a member of the set rather than the closing bracket. So this comes down to "a character that isn't an end-bracket, underscore, lowercase letter, digit, dash, equals sign, double or single quote, or slash." That leaves us with whitespace, dots, colons, angle brackets (<>), parentheses, some other punctuation, and uppercase letters. However, the "i" at the very end makes the match case-insensitive, so uppercase letters are excluded as well.
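The behaviour of that set can be checked directly. The following is a minimal sketch, not the plugin's code: `is_boundary()` is a hypothetical helper, and the dash is escaped here only for readability.

```php
<?php
// The negated set from the thread, written as a single-quoted PHP string.
// The "]" directly after "[^" is a literal member of the set.
$boundary = '/[^]_a-z0-9\-="\'\/]/i';

// Hypothetical helper: does $char count as a "special character"?
function is_boundary(string $char, string $pattern): bool
{
    return preg_match($pattern, $char) === 1;
}

var_dump(is_boundary(' ', $boundary)); // bool(true): space is not in the set
var_dump(is_boundary('(', $boundary)); // bool(true): parenthesis is not in the set
var_dump(is_boundary(']', $boundary)); // bool(false): "]" is in the set
var_dump(is_boundary('A', $boundary)); // bool(false): with /i, a-z also covers A-Z
```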

All together, the first entry in the array says:
First a special character as we've just discussed, followed by one of (http://, https://, ftp://, gopher://, news://, or telnet://) OR "www.", followed by as many characters as you can find that are NOT (space, carriage return, newline, parens, caret, dollar, exclamation point, backtick, double or single quote, bar, or any kind of brackets), without regard to case. Groups are numbered by their opening parenthesis, so it stores the special character as \1, the "prefix://" OR "www." as \2, the bare protocol name as \3 (empty for "www."), and the rest of the URL as \4.

The second entry is the same, but anchored to the start of the text instead of the special character; so the prefix/www goes in \1, the protocol name in \2, and the rest of the URL in \3.

I know that's long-winded, but so is regexp matching. Sorry.

When it does the replacement (I'll bet it has a replacement array too, doesn't it?), you'll see something like this for the second entry:

Code: Select all

<a href="\1\3" target="_blank">\1\3</a>

At least, that's what it SHOULD look like. To remove the "http://" from the link text, change it to:

Code: Select all

<a href="\1\3" target="_blank">\3</a>
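Group numbers follow the order of the opening parentheses, which is easy to miscount by eye. The sketch below applies the second pattern from the first post to an illustrative input and prints what each group captures. The replacement strings are assumptions for illustration, since the plugin's real replacement array is not shown in this thread.

```php
<?php
// Second search pattern from the thread (URL at the very start of the text).
$pattern = '/^((https?|ftp|gopher|news|telnet):\/\/|www\.)'
         . '([^ \r\n\(\)\^\$!`"\'\|\[\]\{\}<>]*)/si';

$text = 'http://www.example.com/page is my site';

preg_match($pattern, $text, $m);
echo $m[1], "\n"; // http://               (prefix)
echo $m[2], "\n"; // http                  (bare protocol name)
echo $m[3], "\n"; // www.example.com/page  (rest of the URL)

// Assumed replacement strings: full URL as link text, then without the prefix.
$full  = preg_replace($pattern, '<a href="\1\3" target="_blank">\1\3</a>', $text);
$short = preg_replace($pattern, '<a href="\1\3" target="_blank">\3</a>', $text);

echo $full, "\n";
echo $short, "\n";
```

In both variants the href keeps the prefix; only the link text changes.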

Whew. Any questions?

stm999999999 · Post by **stm999999999** » Tue Sep 05, 2006 8:03 pm

judebert wrote:
First, the easy question:
([^]_a-z0-9-=\"'\/]) indicates a single character, taken from the group between the brackets ([]). The caret (^) indicates that it should be a character NOT in the group. The first end-bracket (]) is right after the caret, so we understand that it's part of the group, not the ending bracket.

Shouldn't the first ] be escaped with "\"?

judebert wrote:
Whew. Any questions?

Yes, what about the problems?

1) www.mueller.example gets replaced, but with:

Code: Select all

<a href="www.example.com">www.example.com</a>

This is not a valid external URL! It must be

Code: Select all

<a href="http://www.example.com">www.example.com</a>

Perhaps the case "no http or the like, but www." must be handled by a third regex, which adds the http://?

2) Malformed URIs like

Code: Select all

http://example.com” onlick=”alert(document.cookie)

are replaced by

Code: Select all

<a href="http://example.com”" target="_blank">http://example.com”</a> onlick=”alert(document.cookie)

Is it possible to throw the ” out of the href?

3) My problem with the doubled replacement turned out to be very simple: I had the plugin installed twice! I do not know how that could happen, but it did.

BTW, a source of ideas could be http://minglewithingle.com/code/linktrunc.phps

judebert · Post by **judebert** » Wed Sep 06, 2006 9:43 pm

stm999999999 wrote:
Should not the first ] be encoded with "\" ?

You might think so, but no. If you put the ] right after the [ or negator, it's recognized as a part of the character set.
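A quick check that the unescaped and escaped spellings describe the same set. This is a reduced sketch using a two-member set, not the plugin's full pattern.

```php
<?php
$unescaped = '/[^]x]/';   // "]" right after "[^" is a literal member of the set
$escaped   = '/[^\]x]/';  // the same set with the "]" escaped

foreach ([']', 'x', 'y'] as $char) {
    // preg_match() returns 1 when the character is NOT in the set, else 0.
    printf("%s: %d %d\n", $char, preg_match($unescaped, $char), preg_match($escaped, $char));
}
// Prints:
// ]: 0 0
// x: 0 0
// y: 1 1
```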

stm999999999 wrote:
1) www.mueller.example will be replaced, but to:

Code:
<a href="www.example.com">www.example.com</a>

this is no valid extern URL! It must be

Code:
<a href="http://www.example.com">www.example.com</a>

Perhaps the case "no http or something, but www. must be handled by a third regex, which adds the http?

Sorry, I thought you were having the opposite problem. But you've got the solution almost exactly correct. Just remove the "|www\." part (which matches no-protocol links starting with www.) from the first two comparisons, and add this third one:

Code: Select all

'/(^|\s)(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si'

And of course, you'll need a replacement string, too:

Code: Select all

'\\1<a href="http://\\2">\\2</a>'
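Put together, the www-only pattern and a replacement of this shape behave as follows. This is a sketch on an illustrative input; the replacement here appends nothing after the matched URL, so a link such as www.example.com/page.html does not end up with a stray trailing slash.

```php
<?php
// Third pattern from the post above: "www." at the start or after whitespace.
$pattern     = '/(^|\s)(www\.[^ \r\n\(\)\^\$!`"\'\|\[\]\{\}<>]*)/si';
// \1 puts back the whitespace that was consumed by the match.
$replacement = '\1<a href="http://\2">\2</a>';

$result = preg_replace($pattern, $replacement, 'see www.example.com/page.html today');
echo $result, "\n";
// see <a href="http://www.example.com/page.html">www.example.com/page.html</a> today
```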

stm999999999 wrote:
Is it possible to throw the ” out of the href?

The \" in the pattern's list of disallowed characters makes sure that plain double-quotes don't get included as part of a match. Are those actually end-double-quotes or something? You could add them to the list of disallowed characters, but I'm not sure how it's going to work, since they're non-ASCII characters.
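One possible way to add the curly quotes to the disallowed list is PCRE's `u` modifier together with `\x{...}` code points. The sketch below assumes the text is valid UTF-8 and that PHP 7 or later is available (for the `\u{...}` string escape used to build the test input). The replacement string is an assumption as well.

```php
<?php
// Start-of-text pattern with U+201C and U+201D added to the disallowed set.
// The "u" modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '/^((https?|ftps?|gopher|news|telnet):\/\/)'
         . '([^ \r\n\(\)\^\$!`"\'\|\[\]\{\}<>\x{201C}\x{201D}]*)/siu';

// The malformed example from the thread; \u{201D} is the right curly quote.
$text = "http://example.com\u{201D} onlick=\u{201D}alert(document.cookie)";

preg_match($pattern, $text, $m);
echo $m[0], "\n"; // http://example.com

$linked = preg_replace($pattern, '<a href="\1\3">\1\3</a>', $text);
echo $linked, "\n"; // the curly quote now sits after the closing </a>
```

Without the `u` modifier, a `\x{201D}` escape is rejected by PHP's PCRE, so the modifier and the UTF-8 assumption go together.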

stm999999999 · Post by **stm999999999** » Wed Sep 06, 2006 11:46 pm

The new code did not work when the URL is not at the beginning of the line.

I changed it a little, and now it works. I also added ftps to the regex:

Code: Select all

    'SearchArray'=>array(
		"/([^]_a-z0-9-=\"'\/])((https?|ftps?|gopher|news|telnet):\/\/)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si",
		"/^((https?|ftps?|gopher|news|telnet):\/\/)([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si",
		"/([^]_a-z0-9-=\"'\/])(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si",
		"/(^|\s)(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"
    ),
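To see how the four patterns interact, they can be run through `preg_replace()` with a parallel array of replacements. The `$replace` array below is an assumption for illustration (the plugin's real replacement array is not shown in this thread), and the patterns are assembled from shared pieces only to keep the listing short.

```php
<?php
$pre   = '[^]_a-z0-9\-="\'\/]';                      // boundary character before a URL
$proto = '(https?|ftps?|gopher|news|telnet):\/\/';   // protocol plus "://"
$tail  = '[^ \r\n\(\)\^\$!`"\'\|\[\]\{\}<>]*';       // characters allowed in the URL

$search = [
    '/(' . $pre . ')(' . $proto . ')(' . $tail . ')/si', // groups: 1 boundary, 2 prefix, 3 name, 4 rest
    '/^(' . $proto . ')(' . $tail . ')/si',              // groups: 1 prefix, 2 name, 3 rest
    '/(' . $pre . ')(www\.' . $tail . ')/si',            // groups: 1 boundary, 2 URL
    '/(^|\s)(www\.' . $tail . ')/si',                    // groups: 1 start/whitespace, 2 URL
];

// Assumed replacements, one per pattern, in the same order.
$replace = [
    '\1<a href="\2\4">\2\4</a>',
    '<a href="\1\3">\1\3</a>',
    '\1<a href="http://\2">\2</a>',
    '\1<a href="http://\2">\2</a>',
];

$text   = 'Visit http://example.com/a or (www.example.org) now';
$result = preg_replace($search, $replace, $text);
echo $result, "\n";
```

`preg_replace()` applies the pairs one after another, each to the output of the previous one. The later www patterns leave the freshly inserted links alone only because no "www." in that output follows a boundary character or whitespace.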

stm999999999 · Post by **stm999999999** » Wed Sep 06, 2006 11:52 pm

judebert wrote:
Is it possible to throw the ” out of the href?
The " in the replacement should make sure that double-quotes don't get included as part of a match. Are those actually end-double-quotes or something?

Code: Select all

http://example.com” onlick=”alert(document.cookie)

It is an example of cross-site scripting, which for security reasons should not be converted in its full length. But I have seen other auto-linkers that cut the URL off before the ”.

But it does not seem so important, because this should only happen when someone tries to inject malicious code. In that case what matters is that the URL gets cut off, not that the result is a good, clickable URL.

judebert · Post by **judebert** » Thu Sep 07, 2006 8:07 pm

I'm all too glad to let the weird quotes go for now.

The (^|\s) is supposed to mean "start of the line or whitespace". I'm glad everything is working, of course, but I'll have to meditate on this for a while and see if I can grok what happened.

stm999999999 · Post by **stm999999999** » Thu Sep 07, 2006 9:06 pm

What about things like "bla bla (www.bla.example)"?

But, as I said, the code above works for this. Only the ” problem remains.

judebert · Post by **judebert** » Fri Sep 08, 2006 3:55 pm

That should be handled by the third one, "something not in this set of characters, then www.whatever".
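The parenthesised case shows the difference between the two www patterns: "(" is neither the start of the text nor whitespace, but it does fall outside the negated set. A small sketch on the example from the previous post:

```php
<?php
$tail = '[^ \r\n\(\)\^\$!`"\'\|\[\]\{\}<>]*';

$whitespaceOnly = '/(^|\s)(www\.' . $tail . ')/si';
$boundaryClass  = '/([^]_a-z0-9\-="\'\/])(www\.' . $tail . ')/si';

$text = 'bla bla (www.bla.example)';

var_dump(preg_match($whitespaceOnly, $text));    // int(0): "(" is not whitespace
var_dump(preg_match($boundaryClass, $text, $m)); // int(1)
echo $m[1], ' ', $m[2], "\n";                    // ( www.bla.example
```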

Are those quotes anything other than the standard double-quotes? In the forum entry, they appear to be ending-quotations.

stm999999999 · Post by **stm999999999** » Fri Sep 08, 2006 6:26 pm

judebert wrote:
That should be handled by the third one, "something not in this set of characters, then www.whatever".

Do you mean this?

Code: Select all

"/([^]_a-z0-9-=\"'\/])(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"

This works, I know. But I added this one myself, because

Code: Select all

  "/(^|\s)(www\.[^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/si"

alone did not work for me.

judebert wrote:
Are those quotes anything other than the standard double-quotes? In the forum entry, they appear to be ending-quotations.

In the XSS-example? I do not know what they are. It is just copy&paste from garvin:
http://www.s9y.org/forums/viewtopic.php?p=36950#36818

judebert · Post by **judebert** » Fri Sep 08, 2006 8:39 pm

Okay. I went to the German discussion, and tried the URLs in the WordPress demo at OpenSourceCMS. Two of them are invalid, because they contain characters not allowed in URLs: quotes and umlauts. These were converted up to the invalid character, as expected.

To match more of the URL, including invalid characters, would require something additional, like converting them from UTF-8 to ASCII. As an ignorant American, I'm afraid I don't have much experience along those lines. (I speak a little Spanish, but not enough to ask for a quarter-pound of garlic salami in Ecuador. Different story.)

To match and repair invalid URLs involving styles, targets, and onclicks, well... that'd take one heck of a regexp string. You could add something to just remove them, like this:

Code: Select all

/(style|onclick|target)\s*=\s*"[^"]*"/

And make its replacement string the empty string. That should remove all such garbage. (It says, "style=, onclick=, or target=, where spaces around the = are unimportant, followed by a quoted string whose contents don't include quotes." But you probably knew that already.)
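Applied to a sample tag, the removal pattern behaves as sketched below. The input HTML is made up for illustration.

```php
<?php
// Pattern from the post above; the replacement is the empty string.
$pattern = '/(style|onclick|target)\s*=\s*"[^"]*"/';

$html    = '<a href="http://example.com" onclick="alert(1)" style="color:red">x</a>';
$cleaned = preg_replace($pattern, '', $html);

echo $cleaned, "\n"; // the href survives; onclick and style are gone
```

As posted, the pattern is case-sensitive and only handles double-quoted values, so something like onClick='...' would slip through. Adding the i modifier would be one way to cover the first case.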

As to the (^|\s), well, I'm still meditating on that. It should match start of line or whitespace, as far as I can tell.