Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 05, 2011 12:49 am
by LazyBadger
Good night, dear comrades! I am asking for a little help from anyone interested in this idea.

Mission of code
Provide a way of sorting strings in natural alphabetical order for any non-English alphabet. I was crying bitterly over the results of the standard PHP sort: they are illogical (from the user's POV), unnatural and uncomfortable. For example, this order of tags just drives me nuts:

Code:

    [50] => АИ
    [49] => Анонс
    [92] => История
    [47] => Крестный Батька
    [48] => ПА
    [51] => Россия
    [53] => СССР
    [52] => Сэй Алек
    [54] => Тарковский
    [55] => Яндекс
    [57] => агрегатор
    [58] => алармизм
    [56] => аудио
    [61] => байки
    [62] => биология
    [63] => блоги
    [64] => боевик
    [60] => бред
    [59] => бусидо
    [65] => важно
    [66] => гарнитуры
    [67] => генетика
    [68] => детектив
    [69] => журналамеры
    [70] => злое
    [72] => идиотизмы
    [71] => история России
    [74] => котовое
    [73] => кулинарное
    [77] => масскульт
    [81] => меломанское
    [79] => миниатюра
    [78] => мифы
    [80] => мониторы
    [75] => музыка
    [76] => мысли
    [82] => на злобу дня
    [83] => наблюдения
    [84] => новость
    [89] => парадоксы
    [90] => погодное
    [91] => попаданцы
    [88] => программизмы
    [87] => профессиональное
    [85] => псевдолитература
    [86] => публицистика
    [35] => рейтинги
    [36] => религия
    [37] => реплики
    [34] => русопятство
    [33] => русский язык
    [20] => сериалы
    [16] => стандарты
    [17] => стиль разработки
    [18] => столик
    [19] => столик-трансформер
    [27] => телефоны
    [26] => типографика
    [25] => триллер
    [29] => фантастика
    [30] => филология
    [31] => фото
    [28] => фразы
    [32] => цитаты
    [24] => шутки
    [22] => экономика
    [21] => эссе
    [23] => я.ру
but this would be ideal:

Code:

    [50] => АИ
    [49] => Анонс
    [57] => агрегатор
    [58] => алармизм
    [56] => аудио
    [61] => байки
    [62] => биология
    [63] => блоги
    [64] => боевик
    [60] => бред
    [59] => бусидо
    [65] => важно
    [66] => гарнитуры
    [67] => генетика
    [68] => детектив
    [69] => журналамеры
    [70] => злое
    [92] => История
    [72] => идиотизмы
    [71] => история России
    [47] => Крестный Батька
    [74] => котовое
    [73] => кулинарное
    [77] => масскульт
    [81] => меломанское
    [79] => миниатюра
    [78] => мифы
    [80] => мониторы
    [75] => музыка
    [76] => мысли
    [83] => наблюдения
    [82] => на злобу дня
    [84] => новость
    [48] => ПА
    [89] => парадоксы
    [90] => погодное
    [91] => попаданцы
    [88] => программизмы
    [87] => профессиональное
    [85] => псевдолитература
    [86] => публицистика
    [51] => Россия
    [35] => рейтинги
    [36] => религия
    [37] => реплики
    [34] => русопятство
    [33] => русский язык
    [53] => СССР
    [52] => Сэй Алек
    [20] => сериалы
    [16] => стандарты
    [17] => стиль разработки
    [18] => столик
    [19] => столик-трансформер
    [54] => Тарковский
    [27] => телефоны
    [26] => типографика
    [25] => триллер
    [29] => фантастика
    [30] => филология
    [31] => фото
    [28] => фразы
    [32] => цитаты
    [24] => шутки
    [22] => экономика
    [21] => эссе
    [55] => Яндекс
    [23] => я.ру
Fields of application
Anywhere sorting in the order of the mother tongue is preferred:
the tag list, the category list in the block with sort order "Category", etc.

Implementation
A user-defined comparison function (UDF) for u(a|k)sort()

Implementation details
lang/UTF-8/serendipity_lang_*.inc.php (ru in example)

Code:

...
/* full alphabet - digits, english, local in both cases */
$alphabet = '0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZzАаБб...';
...
include/alphasort.inc.php (or an add-on to lang.inc.php?), the significant portion:

Code:

class utf_8_alphabet
{
   // everything else is sorted at the end
   static function cmp($a, $b) {
       // a static variable cannot be initialised from $GLOBALS, so it is assigned after declaring
       static $order;
       if (isset($GLOBALS['alphabet'])) {
           $order = $GLOBALS['alphabet'];
       } else {
           $order = '0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz';
       }
After including alphasort.inc.php, an array can be sorted alphabetically with a one-line fix: add uasort() or uksort() (sort by value or by key) before the array is output.
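
To make the idea concrete, here is a minimal sketch of what such a comparison callback could look like. It is not the attached code: the class name AlphabetSketch, its methods and the fully spelled-out Russian alphabet string are assumptions made for illustration, and it uses preg_split() with the /u modifier instead of mbstring, assuming valid UTF-8 input.

```php
<?php
// Sketch only: AlphabetSketch and its methods are hypothetical names,
// not the utf_8_alphabet class from the attachment.
final class AlphabetSketch
{
    /** Map of character => position in the alphabet. */
    private static $rank = array();

    /** Build the rank table from an alphabet string (valid UTF-8 assumed). */
    public static function setAlphabet($alphabet)
    {
        $chars = preg_split('//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY);
        self::$rank = array_flip($chars ?: array());
    }

    /** Turn a string into a list of alphabet positions. */
    private static function ranks($s)
    {
        $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
        $chars = $chars ?: array();
        $out = array();
        foreach ($chars as $ch) {
            // Characters missing from the alphabet sort after all listed ones.
            $out[] = isset(self::$rank[$ch]) ? self::$rank[$ch] : PHP_INT_MAX;
        }
        return $out;
    }

    /** Comparison callback for usort(), uasort() and uksort(). */
    public static function cmp($a, $b)
    {
        $ra = self::ranks((string) $a);
        $rb = self::ranks((string) $b);
        $n  = min(count($ra), count($rb));
        for ($i = 0; $i < $n; $i++) {
            if ($ra[$i] !== $rb[$i]) {
                return $ra[$i] < $rb[$i] ? -1 : 1;
            }
        }
        // Same prefix: the shorter string goes first.
        return count($ra) - count($rb);
    }
}

// Assumed alphabet: digits, English, then Russian, capital before small.
AlphabetSketch::setAlphabet(
    '0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
    . 'АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя'
);

$tags = array(
    'Assembla' => 1, 'сервисы' => 1, 'русский язык' => 3, 's9y' => 6,
    'Serendipity' => 3, 'Twitter' => 2, 'анонсы' => 1, 'блог' => 5,
    'обновления' => 5, 'перевод' => 2,
);
uksort($tags, array('AlphabetSketch', 'cmp'));
print_r(array_keys($tags));
// Assembla, Serendipity, s9y, Twitter, анонсы, блог, обновления, перевод, русский язык, сервисы
```

Because capital and small letters are interleaved in the alphabet string, "АИ" lands next to "агрегатор" instead of in a separate block of capitalised tags.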

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 05, 2011 2:46 am
by LazyBadger
And here my troubles began.
I have attached two files:
- a standalone test of utf_8_alphabet::cmp, where the initial array is filled with data from the test blog
- a dirty hack of serendipity_event_freetag.php from the test blog, where I added the class and debug printing to the code

In both cases the source array is (sorry for the Russian keys):

Code:

array( 
    'Assembla' => 1,
    'сервисы' => 1,
    'русский язык' => 3,
    's9y' => 6,
    'Serendipity' => 3,
    'Twitter' => 2,
    'анонсы' => 1,
    'блог' => 5,
    'обновления' => 5,
    'перевод' => 2 
 );
(dumped from the blog data)

For alphasort.php, after uksort() + print_r(), I got:

Code:

Array
(
    [Assembla] => 1
    [Serendipity] => 3
    [s9y] => 6
    [Twitter] => 2
    [анонсы] => 1
    [блог] => 5
    [обновления] => 5
    [перевод] => 2
    [русский язык] => 3
    [сервисы] => 1
)
(good)
and for serendipity_event_freetag.php with this patch:

Code:

@@ -934,6 +983,9 @@
                         <div id="backend_freetag_list" style="margin: 5px; border: 1px dotted #000; padding: 5px; font-size: 9px;">
 <?php
                             $lastletter = '';
+                            preprint($taglist);
+                            uksort($taglist, 'utf_8_alphabet::cmp');
+                            preprint($taglist);
                             foreach ($taglist as $tag => $count) {
                                 if (function_exists('mb_strtoupper')) {
                                     $upc = mb_strtoupper(mb_substr($tag, 0, 1, LANG_CHARSET), LANG_CHARSET); 
the second preprint() shows:

Code:

Array
(
    [Assembla] => 1
    [Serendipity] => 3
    [s9y] => 6
    [Twitter] => 2
    [сервисы] => 1
    [обновления] => 5
    [русский язык] => 3
    [блог] => 5
    [анонсы] => 1
    [перевод] => 2
)
(a wrong, badly sorted result)

I'm totally lost.

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 05, 2011 1:03 pm
by garvinhicking
Hi!

Doesn't natsort() help?

Regards,
Garvin

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 05, 2011 1:49 pm
by LazyBadger
natsort(), even in the form of natcasesort(), JUST SUCKS. It knows nothing about alphabetical order beyond digits and US-ASCII.
BTW, I fixed my code (one line moved one line down); we can integrate it now, if you want to and can review it.
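
A small sketch of why the byte-wise built-in sorts misplace capitalised Cyrillic tags; the three sample tags are taken from the lists above and the script assumes it is saved as UTF-8. The natcasesort() result is only printed, not asserted, since it is not something this sketch relies on.

```php
<?php
// Sketch: why byte-wise sorting puts capitalised Cyrillic tags first.
$tags = array('блог', 'Яндекс', 'анонсы');

// In UTF-8, 'Я' (U+042F) is encoded as D0 AF and 'а' (U+0430) as D0 B0,
// so the last capital letter still compares lower than the first small one.
echo bin2hex('Я'), ' ', bin2hex('а'), "\n"; // d0af d0b0

sort($tags, SORT_STRING);
print_r($tags); // Яндекс, анонсы, блог

// natcasesort() also works on single bytes; its result is shown for comparison only.
$nat = $tags;
natcasesort($nat);
print_r(array_values($nat));
```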

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 05, 2011 4:45 pm
by garvinhicking
Hi!

Sure, is the attachment in your posting above the fixed version already?

Regards,
Garvin

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 05, 2011 11:12 pm
by LazyBadger
The full file from above: no. The patch here: yes, it has the fix.
I also did not remove the preprint() calls around uksort() (which has to be done for a release version), and $order is initialised inside the class with a hardcoded Russian alphabet, without GLOBALS.

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Fri May 06, 2011 11:05 am
by garvinhicking
Hi!

Great, thanks for your effort! I believe this is a great starting step.

Hm, I think before committing this it would be an idea to create its own s9y plugin for that, so that we can sort using that algorithm on other instances as well.

A new event hook "sort" could be created, so that in the freetag plugin this is used:

Code:

serendipity_plugin_api::hook_event('sort', $taglist);
and then a serendipity_event_sorter plugin like yours could use that sort.

This has not only the advantage of a reusable plugin in more instances, but also it will not affect current users who do not need a different sorting algorithm (that takes more processing power).
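
To illustrate the design (not the actual Serendipity plugin API), here is a self-contained sketch of a hook registry: the data is passed by reference, and when nobody listens to the "sort" event the array is left exactly as it was. HookSketch and sketch_sort_listener are hypothetical names, and ksort() stands in for whatever sorter a plugin would provide.

```php
<?php
// Hypothetical stand-in for a plugin hook registry; not the Serendipity API.
final class HookSketch
{
    /** Map of event name => list of callbacks. */
    private static $listeners = array();

    public static function listen($event, $callback)
    {
        self::$listeners[$event][] = $callback;
    }

    /** $data is passed by reference so listeners can reorder it in place. */
    public static function fire($event, array &$data)
    {
        if (empty(self::$listeners[$event])) {
            return; // no plugin installed: the data stays untouched
        }
        foreach (self::$listeners[$event] as $callback) {
            call_user_func_array($callback, array(&$data));
        }
    }
}

// Placeholder listener; a real sorter plugin would call its own comparison here.
function sketch_sort_listener(array &$data)
{
    ksort($data, SORT_STRING);
}

$tags = array('b' => 1, 'a' => 2);
HookSketch::fire('sort', $tags);                    // no listener yet: order unchanged
HookSketch::listen('sort', 'sketch_sort_listener');
HookSketch::fire('sort', $tags);                    // now sorted by key
print_r(array_keys($tags)); // a, b
```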

And the new plugin could be maintained to sort by other strings as well, like Japanese?

Also, the utf_8_alphabet class would need to be adapted to use the mb_str* functions only if they exist, because there are PHP installations without mbstring, and this would break the whole freetag plugin for those people...
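
One way this guard could look, as a sketch only: sketch_utf8_chars is a hypothetical helper name, and the fallback assumes PCRE was built with UTF-8 support so that the /u modifier works.

```php
<?php
// Sketch only: sketch_utf8_chars() is a hypothetical helper, not part of the plugin.
// It splits a UTF-8 string into characters, using mbstring only when it is available.
function sketch_utf8_chars($s)
{
    if (function_exists('mb_strlen') && function_exists('mb_substr')) {
        $chars = array();
        $len = mb_strlen($s, 'UTF-8');
        for ($i = 0; $i < $len; $i++) {
            $chars[] = mb_substr($s, $i, 1, 'UTF-8');
        }
        return $chars;
    }

    // Fallback without mbstring: PCRE in UTF-8 mode (assumed to be compiled in).
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on invalid UTF-8; fall back to single bytes then.
    return $chars === false ? str_split($s) : $chars;
}

print_r(sketch_utf8_chars('блог')); // б, л, о, г
```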

Best regards,
Garvin

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Sat May 07, 2011 4:48 am
by LazyBadger
garvinhicking wrote: Hm, I think before committing this it would be an idea to create its own s9y plugin for that, so that we can sort using that algorithm on other instances as well.

A new event hook "sort" could be created
That is far outside my area of competence (and even of my incompetence); in this situation I completely trust your vision of right and wrong solutions.
garvinhicking wrote: This has not only the advantage of a reusable plugin in more instances, but also it will not affect current users who do not need a different sorting algorithm (that takes more processing power).
Agreed, that will be "The Right Way" (tm).
garvinhicking wrote: And the new plugin could be maintained to sort by other strings as well, like Japanese?
AFAICS, any alphabet that can be represented as a UTF-8 string can use this "natural sorter".
garvinhicking wrote: Also, the utf_8_alphabet class would need to be adapted to use the mb_str* functions only if they exist, because there are PHP installations without mbstring, and this would break the whole freetag plugin for those people...
For PHP without mbstring we can:
- do nothing (the best way, I think)
- use the replacements from Andreas Gohr's UTF-8 helper functions (attached) (LGPL 2.1, usable in Serendipity?)

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Mon May 09, 2011 10:27 am
by garvinhicking
Hi!

I've just committed the new "sort" event hook to the freetag plugin, and committed serendipity_event_sort which contains your code. I wasn't able to test this, so please go ahead and see if the plugin does the trick for you?

I cannot really read your special characters, so I don't know if the $order array now contains the proper characters?!

Regards,
Garvin

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 12, 2011 12:02 pm
by LazyBadger
garvinhicking wrote: I've just committed the new "sort" event hook to the freetag plugin, and committed serendipity_event_sort which contains your code.
I still can't see these commits in SPARTACUS, so I tried reading event_sort online. This snippet confuses me somewhat:

Code:

    function utf8cmp($a, $b) {
        static $order = null;
        static $char2order = null;

        if ($order === null) {
            // Kyrillic. More languages to come?
            $order =
Because the if will always be true, and everybody will get only one alphabet...
Or have I misunderstood something?
I planned to keep the alphabet in the language file as an additional variable and, if this variable is missing, to use a fallback alphabet of pure US-ASCII only (letters not listed will be sorted after the English ones). Here is the draft again:

Code:

   // a static variable cannot be initialised from $GLOBALS, so it is assigned after declaring
   static $order;
   if (isset($GLOBALS['alphabet'])) {
       $order = $GLOBALS['alphabet'];
   } else {
       $order = '0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz';
   }
This way, extending to new languages becomes an easy task: just add the alphabet to the already existing translations.

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 12, 2011 1:04 pm
by LazyBadger
I updated freetag and installed event_sort by hand. Sorting is now correct in the backend:
* new entry - tags div
* manage tags - all tags
and in the frontend tag cloud:
* sort order - tag name

The new sorting is not used in the "related tags" output (should these tags be sorted?)

I would be happy to see the category list with sort order "Category" sorted by the new algorithm as well.

Re: Correct alphabet sorting. A piece of spaghetti code

Posted: Thu May 12, 2011 1:27 pm
by garvinhicking
Hi!

Thanks for the heads-up. The spartacus sync wasn't properly working due to a PHP parse error.

About the $order === null -- currently only one alphabet is used, but I suppose in the future when more alphabets get put there, we need to decide which one to put into $order.
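
A rough sketch of how that choice could be made once several alphabets exist. Everything here is an assumption for illustration: the function name sketch_pick_alphabet, the $alphabets table and the use of a language code as its key are not taken from the committed plugin.

```php
<?php
// Sketch only: the function name, the $alphabets table and the language codes are hypothetical.
function sketch_pick_alphabet($lang, array $alphabets)
{
    // Digits and English letters always come first.
    $ascii = '0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz';

    // Append the local alphabet if one is known for this language,
    // otherwise fall back to plain US-ASCII.
    return isset($alphabets[$lang]) ? $ascii . $alphabets[$lang] : $ascii;
}

$alphabets = array(
    'ru' => 'АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя',
);

$order = sketch_pick_alphabet('ru', $alphabets); // ASCII followed by Russian
$plain = sketch_pick_alphabet('de', $alphabets); // no entry: ASCII only
```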

Regards,
Garvin