DB wrote: In which regard is this behavior a problem?
In the regard that the 404 page isn't doing its job anymore.
Totally ready to let this go, just curious is all.
Rant ensuing...
It has been brought up on more than one occasion in my experience when using rewrites.
One of my s9y setups did have a bout with it here (http://board.s9y.org/viewtopic.php?f=2&t=14063), and changing all my links from relative to absolute, as you had pointed out, seemed to take care of the problem I was having (spiders were hitting links such as /1-Category/login/css/print/terms/login/signup/login/support.html, and these URLs would go on for days, gigabytes' worth). Maybe this would still occur though, since I do still have relative links, many of them created via the admin panel WYSIWYG editor. *Sorry if this looks like a double post. The previous issue seems resolved, and this is more of a general question.
I assist in administering several non-s9y websites that produce all sections/pages of the sites dynamically using rewrites. I have been asked more than once why a 404 is not produced when a non-existent directory is hit. It might be that someone just reworked their entire website, so there are plenty of URLs out there that now point to a place that no longer exists. Most of the people I work with, who pay lots of $ to SEO companies, request that any 'now non-existent' page must take you to a 404. The Page Not Found can have some "we just redesigned our site" type of language, etc. There are usually a few regex patterns that can be added to the rewrites to attempt to deal with any foreseen circumstances.
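As a rough sketch of the kind of pattern I mean (the `old-products` path and the `/404.html` page are made up for illustration, and this assumes an Apache 2.x .htaccess with mod_rewrite enabled), a retired section can be answered with a real 404 before the general rewrites ever see it:

```apache
# Hypothetical example: /old-products/ was removed in a redesign.
# R=404 makes mod_rewrite stop rewriting and answer with a 404 status.
RewriteEngine On
RewriteRule ^old-products(/.*)?$ - [R=404,L]

# Hypothetical custom page carrying the "we just redesigned our site" message.
ErrorDocument 404 /404.html
```

The rule has to sit above the general rewrite that hands everything to the application, otherwise that one matches first.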
In terms of the s9y setup, I just find it odd that this open-ended URL stuff seems somehow okay. There is no check done on these types of URLs. Not even to match it up to something under 100 chars, or even under 1000000 chars. Anything goes. The slashes in the URLs refer to a directory structure... until you get to a category page, then all of that doesn't matter anymore. The open-ended, maybe endless, URL is a feature.
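Something along these lines could at least put a ceiling on the length. This is only a sketch: the 255-character limit is an arbitrary number picked for illustration, not anything s9y defines, and the pair would need to go above the other rewrite rules.

```apache
# Assumed limit: refuse any request path of 256 characters or more.
# REQUEST_URI is the path part of the request, without the query string.
RewriteCond %{REQUEST_URI} ^.{256,}
RewriteRule ^ - [R=404,L]
```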
The entries rewrite, although not checked to see if the entry actually exists, at least does not allow for non-existent directories:
Code:
#no endless directory here
RewriteRule ^((archives/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+\.html)/?) unite.html?/$1 [NC,L,QSA]
Categories definitely doesn't check for any slashes:
Code:
#endless directory here though
RewriteRule ^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+) unite.html?/$1 [NC,L,QSA]
Something like this might work:
Code:
# could stop with something like this, may not be perfect for all instances
RewriteRule ^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+)/?$ unite.html?/$1 [NC,L,QSA]
RewriteRule ^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+)/([P0-9]+)\.html$ unite.html?/$1 [NC,L,QSA]
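If those two rules are meant to be the only category patterns allowed, one more line placed right after them could turn everything else under categories/ into a 404 instead of letting it fall through to whatever comes next. Just an untested sketch, assuming Apache 2.x:

```apache
# Must come after the two category rules above: their [L] flag stops
# processing on a match, so only unmatched category URLs reach this line.
RewriteRule ^categories/ - [NC,R=404,L]
```

With that in place, something like categories/1-Category/login/css/print would get a 404 rather than the category page.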
By the way, while not being cleansed for length, are they at least cleansed for javascript, etc? I'm assuming they are, but haven't actually looked at the code.