Strange symbol

RavenH
Regular
Posts: 20
Joined: Sun Feb 12, 2006 10:19 pm

Post by RavenH »

Hi again,

I'm beginning to get a notion of what you are saying. Guys, it would be VERY much easier if programmers stopped using special lingo and spoke plain English instead.

Not everyone is a geek, nor a regular visitor in your corner of the net, JFYI. Plain English works best, as in e.g. "we have an unstable or beta version of Serendipity that we work on daily, which has a function you may need or want to try out".

"Nightly snapshots" in my own language are something for an XXX movie... and it's the first time I've heard such a term applied to alpha/beta or developers' versions, and I've been up and about the net for a while already.

So again:

Nope, I cannot give you the domain name/URL of the client. For reasons already explained.

Nope, I do not own the server that this client hosts on, nor do I have access to any other hosting account on this server, nor will I buy host space there just to debug software (which by the way needn't end up on the exact same server anyway). I'm sorry, but that's asking a bit much.

Chances are high that if I set up that same Serendipity installation on a different server, it will work just peachy; it obviously does so on opensourcecms. So this sure isn't going to help in any way.

So, if I know precisely what info you want and how I can procure it for you (in plain English please), I'll do that happily.

As to who helps who here, I'll try to say this easily ;-).

I'm not hot 17 anymore, I'm capable of seeing you put a nice effort together with Serendipity and achieved a respectable and commendable piece of software there, which certainly earns every bit of praise one can think of. And you do it without money paid to boot, which is also commendable. I happen to believe in the original postulation of Opensource software, else I wouldn't often enough convince clients to use it, who certainly could just as well pay for commercial software and make my life that much easier this way. These things oughtn't to be voiced aloud, at least not after someone has informed himself about the Opensource credo.

Thus - if Serendipity behaves awkwardly on finding a certain set of variables, my take is that it is in both sides' best interests to solve this. For one side it ensures that the software works with fewer flaws; for the other, that it works, is usable and shows it is a good piece of engineering, right?

Greetings

Raven
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Post by garvinhicking »

Raven,

so would you be able to:

1. Make a backup of your client's files and database
2. Download http://www.s9y.org/snapshots/s9y_200602131437.tar.gz and install it.
3. Enter the configuration of that new version, and within the database configuration section enable "Enable DB-charset conversion".
4. Make a new post and test whether the Cyrillic characters work.

Which Serendipity language file have you configured for your blog? Each language file defines the charset to be used, within the lang/serendipity_lang_XX.inc.php file. The charset noted there must match the charset you want to use. "ISO-8859-1" would be a huge problem if that were the charset of the language you are using; only "UTF-8" would really work.

You can also see which charset is being used if you go to your client's URL with Firefox and the "LiveHTTPHeaders" extension, and then look at which charset the server sends in the Content-Type HTTP header. This is what I would have checked on your client's site if I had access. It would have allowed me to check very easily what might be going wrong. I would also have inspected the raw HTML, which would very quickly have given me an idea of where the wrong characters come from.
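
The header check described above can also be done on a header value copied out of such a dump. A minimal sketch in Python (used only for illustration, it is not part of Serendipity); the helper name `charset_from_content_type` is made up for this example:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() reads the charset parameter and lower-cases it.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

If the charset reported here differs from the one in the language file, that mismatch would be the first thing to look at.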

That is also in the spirit of open source: You don't only want the programmers to share their wisdom, you also need the users to share their problem and help the programmers to solve it. :-)

Garvin.
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/

Post by RavenH »

Hi,

sorry, that file appears to be an empty tarball.

Greetings

Raven
judebert
Regular
Posts: 2478
Joined: Sat Oct 15, 2005 6:57 am
Location: Orlando, FL

Post by judebert »

Sigh. Sorry; you are absolutely correct. It looks like something went wrong with yesterday's automatic bundle. And it happened again tonight. (Garvin alert!)

The most recent valid file is http://s9y.org/snapshots/s9y_200602121438.tar.gz.
Judebert

Post by RavenH »

Hi,

this download went fine. I'll report back once I set this up and tested it.

Greetings

Raven
jhermanns
Site Admin
Posts: 378
Joined: Tue Apr 01, 2003 11:28 pm
Location: Berlin, Germany

Post by jhermanns »

i have deleted the empty tarballs. they were invalid as berlios seems to have deleted my public key. but that should have happened as well to other people also on this list: http://metawire.org/~parasitical/vulns/ ... swords.txt ?

anyways, i have uploaded the key again and am now waiting for their 6-hour-crontab-cycle to install it.

Post by RavenH »

Hi again,

short side note...

I ran Textpattern's diagnostics tool on both the host server I want to install on and the opensourcecms one. There are several distinct differences:

shared host:

Charset (default/config): latin1/utf8
character_set_client: utf8
character_set_connection: utf8
character_set_database: latin1
character_set_results: utf8
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/share/mysql/charsets/

PHP version: 4.4.0
MySQL: 4.1.13-standard
os_version: Linux 2.4.22-1.2199.5.legacy.nptlsmp

opensourcecms:

Charset (default/config): latin1/latin1
character_set: latin1
character_sets: latin1 big5 czech euc_kr gb2312 gbk latin1_de sjis tis620 ujis dec8 dos german1 hp8 koi8_ru latin2 swe7 usa7 cp1251 danish hebrew win1251 estonia hungarian koi8_ukr win1251ukr greek win1250 croat cp1257 latin5

PHP version: 4.4.1
MySQL: 4.0.25-standard
os_version: Linux 2.6.14.6-ts.grh.mh.ht
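
To make the mixed setup on the shared host easier to spot, here is a small Python sketch that takes the values listed above and reports which variables are not utf8. It only works on the copied values and does not query MySQL; the function name is made up for this example, and a latin1 server default is not necessarily a problem by itself.

```python
# Values copied from the shared-host diagnostics listed above.
shared_host = {
    "character_set_client": "utf8",
    "character_set_connection": "utf8",
    "character_set_database": "latin1",
    "character_set_results": "utf8",
    "character_set_server": "latin1",
    "character_set_system": "utf8",
}


def variables_not_matching(settings, wanted="utf8"):
    """Return the sorted names of charset variables whose value differs from `wanted`."""
    return sorted(name for name, value in settings.items() if value != wanted)


print(variables_not_matching(shared_host))
```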

Additionally, I'm beginning to wonder whether a bilingual blog is possible at all. If the blog pulls an individual non-UTF-8 charset for each language from the language setting of the blogging/template interface, then it would be impossible, e.g., for users from different countries to comment or post in their own language. Thus Serendipity would have localisation, but not true internationalisation (necessary for a multi-language CMS). Only full UTF-8 support for all languages would enable display of e.g. Cyrillic, English and French on one single page/in one single article.
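
The point about one page holding several languages can be checked with any mixed sample. A minimal Python sketch (illustration only; the sample sentence and the helper `can_encode` are made up here) that tries to encode one Cyrillic/English/French string in a few charsets:

```python
# A made-up sample mixing Cyrillic, English and French.
sample = "Привет, hello, déjà vu"


def can_encode(text, charset):
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for charset in ("utf-8", "iso-8859-1", "koi8_r", "cp1251"):
    print(charset, can_encode(sample, charset))
```

Only UTF-8 can hold the whole sentence: ISO-8859-1 has no Cyrillic letters, and the two Cyrillic charsets have no "é" or "à".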

Greetings

Raven

Post by RavenH »

Hi again,

hmmmmmm...

I installed the dev version and set it up as you requested; the error persists (still the same chars, by the way).

I then checked the "Enable DB-charset conversion" setting and discovered it was set to "off" (even though I chose "on" during installation). Several attempts at setting it to "on" failed; it reverted to "off" each time.

Charset of the blog is set (again) to UTF-8, interface was set to Russian (I also tried English, made no difference).

I then ran the LiveHTTPHeaders extension in Firefox, with these results:

http://xyz.com/ser/

GET /ser/ HTTP/1.1
Host: xyz.com
User-Agent: Mozilla/5.0 Gecko/20041122 Firefox/1.5
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: de-de,de;q=0.8,en-us;q=0.6,en;q=0.4,ru;q=0.2
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=

HTTP/1.x 200 OK
Date: Thu, 16 Feb 2006 10:28:54 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
X-Blog: Serendipity
X-Powered-By: PHP/4.4.0
X-Serendipity-InterfaceLang: ru
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
----------------------------------------------------------
http://xyz.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: xyz.com
User-Agent: Mozilla/5.0 Gecko/20041122 Firefox/1.5
Accept: image/png,*/*;q=0.5
Accept-Language: de-de,de;q=0.8,en-us;q=0.6,en;q=0.4,ru;q=0.2
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=

HTTP/1.x 200 OK
Date: Thu, 16 Feb 2006 10:28:58 GMT
X-Powered-By: PHP/4.4.0
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
----------------------------------------------------------

So...?

Greetings

Raven

Post by garvinhicking »

Hi Raven!

Thanks for keeping up with us! I really want to get to the root of this problem :)

Now, could you please edit your serendipity_config.inc.php and insert this variable:

Code:

$serendipity['dbNames'] = true;

This should force the config variable if it doesn't work in the config. You are using the mysql driver, not mysqli, right?

And then, would you be able to run "wget" on the Serendipity HTML page to get the HTML output, and post a link to it? Preferably not here on the forums, because special characters might get tampered with. The best thing would be to zip up the HTML result and post it somewhere, so that we can see the raw HTML and the wrong characters you are referring to.

If you don't have a wget utility, please let the browser save your HTML output and zip it up. If you edit the file to remove possible URL references, please make sure that your editor does not transcode any UTF-8 characters.

Best regards,
Garvin

Post by RavenH »

Hi again,

I inserted that line. This time the setting took hold and is correctly displayed in the database configuration; unfortunately, the error persists. :(

I could save both the characters I receive in the browser's HTML source and the characters as they are in the database entry in Notepad++; the difference shows there. Would that be enough?

As I mentioned in my very first post, there's a difference between them. If I copy and paste what's in the database into Notepad++, it is displayed correctly. If I do the same with the same string as found in the HTML source of the called page, I get empty rectangles instead of certain chars in Notepad++ too. At the same time, the Cyrillic of the system (interface) shows these chars without problems.

To me that would suggest that input works as it should, output of the dbase through Serendipity doesn't.

I again would like to point out that neither Textpattern nor Spip nor Mediawiki nor Websitebaker show similar problems. All work just peachy with UTF-8 of that server all the way and back.

However, something I read in the Textpattern docs might be interesting:

http://textpattern.net/wiki/index.php?t ... de_Support

Greetings

Raven

Post by RavenH »

Oh, and I checked the whole keyboard. The Cyrillic chars which get mangled are:

"ya" = the reversed (mirror-image) R (я)
"ye" = the reversed E (э)
"s" = written as "c" in Cyrillic (с)

Post by garvinhicking »

If you enter any other special characters like German umlauts (äöüß), are they output correctly?

A snippet of the HTML created by Serendipity with the broken characters (and maybe some of those which are working) would help a lot. Also a snippet of HTML showing how it should look.

Please understand that I know next to nothing about Cyrillic. I need to see it to be able to help with this issue, so a screenshot would also help me a lot.

A database dump of a serendipity_entries entry with the bad data might also help, so that I could try to reproduce the problem on my server. You must understand that working on and debugging your problem in this "black box scenario" is very hard.

I only know that UTF-8 works with my German special characters on my various MySQL 4.1, 4.0 and 5.0 installations.

Sadly I don't know the technical workings of Textpattern or any of the other systems you mentioned, and inspecting their code would take up a considerable amount of my time, so I would like to start on the Serendipity end of things, which I know.

Regards,
Garvin

Post by RavenH »

Hi Garvin,

I actually was referring to this little bit of info:

UTF-8 Support in PHP
PHP internally uses ISO-8859-1 as encoding for all strings. Even the upcoming 5.1.0 release does it this way. Again, as for MySQL, as long as we are only reading in and outputting strings this is not much of a problem, but as soon as we try to use any string-related functions we may run into problems with everything that is not in the ASCII-range. So all the powerful string-manipulation functions in PHP will likely "mangle" multibyte-strings, because each byte is treated as a character, even when multiple bytes may be describing only a single character. There is a multibyte-extension (mb) available for php, which has multibyte-safe versions of most string-functions which - despite being rather unstable in early versions - is today very usable. Unfortunately it is optional and thus can't be relied upon to be always available. Only the Regular Expressions support in PHP interestingly knows a "/u" modifier that treats string as UTF-8.

There is an excellent and more in-depth overview you can find here: http://www.phpwact.org/php/i18n/charsets?s=utf8
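
The byte-versus-character point in the quoted passage can be shown outside PHP as well. A minimal Python sketch, used here only because its bytes type behaves like the byte-oriented string functions the quote describes; it is not PHP code, and the sample word is made up:

```python
text = "яблоко"             # six Cyrillic letters
raw = text.encode("utf-8")  # the same word as UTF-8 bytes

print(len(text))  # number of characters
print(len(raw))   # number of bytes; each of these letters takes two

# Cutting after three bytes splits the second letter in half,
# which is what a byte-based "substring" does to multibyte text.
cut = raw[:3]
print(cut.decode("utf-8", errors="replace"))
```

The last line prints the first letter followed by a replacement character, because the cut lands in the middle of a two-byte sequence.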


This has nothing to do with Textpattern itself ;-)

Here are screenshots of sourcecode...

1. German umlauts work, I wrote "Übäröschung"

Image

2. strings of buggy Cyrillic (you'll see one line later in a notepad screenshot) as put out in source code

Image

3. string of correct Cyrillic from the translated interface files

Image

4. Notepad++ conversions of these strings

Image

As can be seen, the second line does contain and displays e.g. the cyrillic "s" as C.

Incidentally I discovered that this time the database also contains corrupted strings (which was different with the other version and setup where the database contained ok strings).


Greetings

Raven

Post by garvinhicking »

I'm a bit at a loss... it seems your UTF-8 is generally working, and only some strings are not?

I'm afraid this reaches the limit of my UTF-8 understanding. Are the characters you are using inside the UTF-8 specification? Or are they maybe UTF-16 or only in some national charset?

Serendipity does use the mbstring extension, if available. But the Serendipity entry body is not touched; it is pulled straight from the DB and output.

You did already try to insert the special characters with some other browsers, right?

One final idea I have: are you using the WYSIWYG editor? If yes, please try turning it off and entering some of the characters that are causing problems.

Best regards,
Garvin

Post by RavenH »

Hi again,

nope, it doesn't matter whether I use FF or IE, whether I input via WYSIWYG or by pasting from Notepad++. It doesn't even matter whether I paste from Notepad++ directly into the database ... these 3 chars will be mangled.

And nope, they aren't special in any way, they belong to the normal Unicode charset.
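
One possible explanation for exactly these three letters (the mirrored R, the reversed E and the Cyrillic "s"), which nothing in this thread confirms: if some layer between the database and the page treats the UTF-8 bytes as Windows-1252, any letter whose UTF-8 encoding contains one of the five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) cannot be converted. The Python sketch below tests that assumption for the lowercase Russian letters а to я; the function name is made up, and capitals are left out on purpose.

```python
def letters_lost_in_cp1252(letters):
    """Return the letters whose UTF-8 bytes cannot be read as Windows-1252."""
    lost = []
    for ch in letters:
        raw = ch.encode("utf-8")
        try:
            # ASSUMPTION: some layer misreads the UTF-8 bytes as cp1252.
            raw.decode("cp1252")
        except UnicodeDecodeError:
            lost.append(ch)
    return lost


# Lowercase Russian letters U+0430 (а) through U+044F (я).
lowercase_russian = [chr(cp) for cp in range(0x0430, 0x0450)]
print(letters_lost_in_cp1252(lowercase_russian))
```

Under that assumption the sketch reports с, э and я, which are the three letters described earlier; this is consistent with a Windows-1252/latin1 step somewhere in the chain, but it does not prove where that step is.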

I even tried setting the charset to native and Russian, this doesn't change anything either.

It must be something with the way these characters are processed, as the strings of the translation are fixed inside the serendipity_lang_ru.inc.php whereas the strings put into the database are placed there and taken from there via the script itself.

I wonder whether the charset employed for Cyrillic by the script and the charset employed by the database are both the same or at least compatible. I saw that instead of plain UTF-8 you specified specific versions of it (ru_RU.utf-8) in the language file. Did you do something similar with the way PHP handles Cyrillic? As I explained, my database is set to UTF8_general_ci, which obviously is good enough for those other CMS I mentioned.

Also, how does Serendipity behave when set up in a similar manner on another server - meaning setup with UTF-8, a MySQL 4.1.x database set to handle UTF8_general_ci? I don't have the chance to test this, and the opensourcecms version clearly is quite different in those specs.

What are the Russian users of Serendipity reporting?

Greetings

Raven