Hi,
I have a backup of an old s9y blog, which only exists in the form of an old wget HTML dump (no database). Now I would like to reinstall the blog with all its former entries.
It is important to me that I retain:
- Properties of the entries (including dates)
- Comments on the site (and their dates)
- Links, pictures, and downloads
- Correct split of "extended" entries.
I would be able to extract all this information from the HTML files using Python, but I am wondering: what would be the best or easiest output format for my program?
I also thought about filling the database directly. Would that be a viable approach?
Daniel
Best import options
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany
Re: Best import options
Hi!
I don't think there's a way to do this.
It would be faster to re-create the entries and comments manually. Otherwise, you'd have to invest many hours in custom regular-expression magic, get used to all the database tables, etc.
Regards,
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
Re: Best import options
I think parsing my HTML sources should not be much of a problem; all the required information is nicely stated in the HTML markup (regexps would work as well, but are probably not as flexible when following links, etc.). From that I would just fill a list with all entries and their properties, including comments.
Isn't there an import format that at least contains all the required information? Can I import comments using RSS? Maybe a Python script could produce such a file format (e.g. XML).
The comments especially are a pain to import manually, since there is no GUI way to edit a comment's date.
If that is not possible, I think I have to brush up my SQL again. Besides the bigger effort, I am just afraid that there are some counters or similar that I might fail to update and that would break my import.
Daniel
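If the RSS route is viable, turning already-parsed entries into a single RSS 2.0 feed is only a few lines. A minimal sketch in modern Python (stdlib only; the entry dict keys like 'title', 'body' and 'timestamp' are my assumption, not anything s9y prescribes):

```python
# Build a minimal RSS 2.0 feed from parsed entries (sketch; the
# 'title'/'body'/'timestamp' keys are assumed, not an s9y format).
import time
import xml.etree.ElementTree as ET

def entries_to_rss(entries, blog_title, blog_link):
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = blog_title
    ET.SubElement(channel, 'link').text = blog_link
    for e in entries:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = e['title']
        ET.SubElement(item, 'description').text = e['body']
        # RSS expects RFC 822 dates in pubDate
        ET.SubElement(item, 'pubDate').text = time.strftime(
            '%a, %d %b %Y %H:%M:%S +0000', time.gmtime(e['timestamp']))
    return ET.tostring(rss, encoding='unicode')

feed = entries_to_rss(
    [{'title': 'Hello', 'body': '<p>First post</p>', 'timestamp': 1100000000}],
    'Old Blog', 'http://blog.example.org/')
```

Note that plain RSS carries no comments, so those would still need a separate path.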
Re: Best import options
Hi!
You can import via RSS if you format all of your HTML pages into a single RSS feed, yes. Since comments are not part of RSS XML files, those would not be imported. However, the other importer plugins (the one for phpBB, for example) could be used as a draft template for your own.
Of course you can do all of that with regexp parsing. But I figure that I myself, with fairly good regexp knowledge, would surely need 1-2 full days of work to import all the data properly... so depending on the amount of content you have, a manual copy & paste might be much faster.
Regards,
Garvin
Re: Best import options
Summary:
It wasn't that hard after all: have a look at http://hk.danst0.com
All entries seem to be working fine. Comments, hierarchies, and dates were all imported.
The regexp idea helped a lot.
I wrote a Python script which analyzes the entries and outputs plain SQL commands, which are then imported into the database. The code is really crude (and I mean it), but maybe it helps if someone else was careless enough to back up only to plain HTML...
Here it is (after importing, the only thing I did was to copy the pictures to the uploads folder manually; Serendipity automatically added them to the media library):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import os
import glob
import re
import time

path = "blog.danst0.com/"


def replace_month(string):
    string = string.replace(u'Januar', "1")
    string = string.replace(u'Februar', "2")
    string = string.replace(u'März', "3")
    string = string.replace(u'April', "4")
    string = string.replace(u'Mai', "5")
    string = string.replace(u'Juni', "6")
    string = string.replace(u'Juli', "7")
    string = string.replace(u'August', "8")
    string = string.replace(u'September', "9")
    string = string.replace(u'Oktober', "10")
    string = string.replace(u'November', "11")
    string = string.replace(u'Dezember', "12")
    return string


def r4body(string):
    # Escape newlines and single quotes for the SQL string literals
    string = string.replace('\n', '\\n')
    string = string.replace('\r', '\\r')
    string = string.replace("'", "''")
    return string


def replace_category_by_id(string):
    string = string.replace(u'Allgemeines', "1")
    string = string.replace(u'Arbeit & Co.', "2")
    string = string.replace(u'Asien 2005', "3")
    string = string.replace(u'Hong Kong 2004', "4")
    string = string.replace(u'Mikro', "5")
    string = string.replace(u'Persönliches', "6")
    string = string.replace(u'iPod', "6")
    return string


if __name__ == "__main__":
    print "Starting Parsing"
    article_count = 0
    a_id = []
    a_title = []
    a_timestamp = []
    a_body = []
    a_extended = []
    a_category = []
    c_id = []
    c_entryid = []
    c_parentid = []
    c_timestamp = []
    c_author = []
    c_body = []
    c_url = []
    c_email = []
    for infile in glob.glob(os.path.join(path, '*.html')):
        f = open(infile, 'r')
        html_code = f.read().decode('ISO-8859-1')
        if html_code.find("""<div class="serendipity_commentsTitle">""") != -1:
            m = re.search('<h4 class=\"serendipity_title\"><a.+>(.+)</a>', html_code)
            tmp_title = m.group(1)
            if tmp_title not in a_title:
                article_count += 1
                a_title.append(tmp_title)
                a_id.append(article_count)
                m = re.search('<h3 class=\"serendipity_date\">.+, (.+)</h3>', html_code)
                tmp_date = replace_month(m.group(1))
                m = re.search('<a.+>([0-9]{1,2}:[0-9]{1,2})</a>', html_code)
                tmp_time = m.group(1)
                a_timestamp.append(int(time.mktime(time.strptime(tmp_date + " " + tmp_time, "%d. %m %Y %H:%M"))))
                # Search for the "extended" part of the entry
                m = re.search('<div class=\"serendipity_entry_extended\"><a id=\"extended\"></a>(.+)</div>.+<div class=.serendipity_entryFooter.>', html_code, re.DOTALL)
                extended = True
                text = ''
                try:
                    text = r4body(m.group(1).strip())
                except:
                    extended = False
                a_extended.append(text)
                # Search for the category
                m = re.search('in <a href=\".+?\">(.+?)</a> ', html_code)
                tmp = replace_month(m.group(1))
                tmp = replace_category_by_id(tmp)
                a_category.append(tmp)
                # Search for the entry body
                if not extended:
                    sstring = '<div class=\"serendipity_entry_body\">(.+)</div>.+<div class=.serendipity_entryFooter.>'
                else:
                    sstring = '<div class=\"serendipity_entry_body\">(.+)</div>.+<div class=\"serendipity_entry_extended\">'
                m = re.search(sstring, html_code, re.DOTALL)
                text = r4body(m.group(1).strip())
                a_body.append(text)
                if html_code.find("Keine Trackbacks") == -1:
                    # Trackbacks are only reported, not imported
                    print "Trackback found:", a_id[-1], a_title[-1], infile
                if html_code.find("Noch keine Kommentare") == -1:
                    # Process the comments
                    lm = re.findall('<a id=\"c([0-9]{1,3})\"></a>', html_code)
                    c_localid = {}
                    for m in lm:
                        # Entry ID
                        c_entryid.append(a_id[-1])
                        # Comment ID
                        tmp = int(m)
                        c_id.append(tmp)
                        # Hierarchy (e.g. "1.2" is a reply to "1")
                        j = re.search('<a href=\"#c' + m + '\" title="Link to comment #[0-9\.]+">#([0-9\.]+)</a>', html_code)
                        hi = j.group(1)
                        # Look up the parent ID
                        c_localid[hi] = tmp
                        tmp_parent = 0
                        if "." in hi:
                            tmp_parent = c_localid[hi[0:-2]]
                        c_parentid.append(tmp_parent)
                        # Comment author, URL and date
                        sstring = '#' + hi + '</a>(?P<name>.+?)(\(<a href=\"(?P<url>.+?)\".+?\).+?)?am.+?(?P<datum>[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4} [0-9]{1,2}:[0-9]{1,2}).+\(<a href=\"#serendipity_CommentForm\"'
                        j = re.search(sstring, html_code, re.DOTALL)
                        c_author.append(r4body(j.group('name').strip().strip('\r').strip('\n')))
                        tmp_url = ''
                        try:
                            tmp_url = j.group('url').strip()
                        except:
                            pass
                        c_url.append(tmp_url)
                        c_timestamp.append(int(time.mktime(time.strptime(j.group('datum').strip(), "%d.%m.%Y %H:%M"))))
                        # Find the comment body
                        sstring = '<a id=\"c' + m + '\"></a>.+?<div class=\"serendipity_commentBody\">(.+?)</div>'
                        j = re.search(sstring, html_code, re.DOTALL)
                        c_body.append(r4body(j.group(1)))

    f = open('/Users/danst/Desktop/articles.sql', 'w')
    table_name = '`serendipity_entries`'
    f.write("TRUNCATE TABLE " + table_name + ";\r\n")
    for i, id in enumerate(a_id):
        # Columns: id title timestamp body comments trackbacks extended
        #          exflag author authorid isdraft allow_comments
        #          last_modified moderate_comments
        if a_extended[i] == "":
            exflag = '0'
        else:
            exflag = '1'
        # Count the comments for this entry
        comment_count = 0
        for z in c_entryid:
            if z == id:
                comment_count += 1
        sql = "INSERT INTO " + table_name + " VALUES (" + str(id) + ", " + \
              "'" + a_title[i] + "', " + \
              str(a_timestamp[i]) + ", " + \
              "'" + a_body[i] + "', " + \
              str(comment_count) + ", " + \
              str(0) + ", " + \
              "'" + a_extended[i] + "', " + \
              exflag + ", " + \
              "'admin'" + ", " + \
              "'1'" + ", " + \
              "'false'" + ", " + \
              "'false'" + ", " + \
              "1304420225" + ", " + \
              "'false'" + ");"
        f.write(sql.encode('utf-8') + '\r\n\r\n')
    table_name = '`serendipity_entrycat`'
    f.write("TRUNCATE TABLE " + table_name + ";\r\n")
    for i, id in enumerate(a_id):
        sql = "INSERT INTO " + table_name + " VALUES (" + str(id) + ", " + \
              str(a_category[i]) + ");"
        f.write(sql.encode('utf-8') + '\r\n\r\n')
    table_name = '`serendipity_comments`'
    f.write("TRUNCATE TABLE " + table_name + ";\r\n")
    for i, id in enumerate(c_id):
        # Columns: id entry_id parent_id timestamp title author email url
        #          ip body type subscribed status referer
        sql = "INSERT INTO " + table_name + " VALUES ('" + str(id) + "', " + \
              "'" + str(c_entryid[i]) + "', " + \
              "'" + str(c_parentid[i]) + "', " + \
              "'" + str(c_timestamp[i]) + "', " + \
              "'', " + \
              "'" + c_author[i] + "', " + \
              "'', " + \
              "'" + c_url[i] + "', " + \
              "'', " + \
              "'" + c_body[i] + "', " + \
              "'NORMAL', " + \
              "'false'" + ", " + \
              "'approved'" + ", " + \
              "'');"
        f.write(sql.encode('utf-8') + '\r\n\r\n')
    f.close()

I did not import trackbacks, since I only had two.
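The trickiest part of the script above is the date handling. It can be exercised in isolation; here is a small sketch in modern Python of the same replace-month-then-strptime approach (the function name and dict are mine, mirroring the script's replace_month):

```python
import time

# German month names -> numbers, as in the script's replace_month()
MONTHS = {'Januar': '1', 'Februar': '2', 'März': '3', 'April': '4',
          'Mai': '5', 'Juni': '6', 'Juli': '7', 'August': '8',
          'September': '9', 'Oktober': '10', 'November': '11',
          'Dezember': '12'}

def german_date_to_timestamp(date_str, time_str):
    # e.g. "3. Mai 2011" -> "3. 5 2011", then parse with the same format
    for name, num in MONTHS.items():
        date_str = date_str.replace(name, num)
    return int(time.mktime(time.strptime(date_str + ' ' + time_str,
                                         '%d. %m %Y %H:%M')))

ts = german_date_to_timestamp('3. Mai 2011', '14:30')
```

The resulting Unix timestamp is what s9y stores in its timestamp columns.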
Daniel
PS: Overall, the programming and import took about four hours, and it was of course much more rewarding than playing copy-and-paste monkey.
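One caveat for anyone reusing the approach: the hand-rolled quote doubling in r4body works for a one-off import, but database placeholders avoid SQL escaping pitfalls entirely. A sketch using Python's built-in sqlite3 as a stand-in (with a MySQL driver the placeholder is typically %s; the table layout here is simplified, not the real s9y schema):

```python
import sqlite3

# In-memory database with a simplified (not the real s9y) entries table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE serendipity_entries (id INTEGER, title TEXT, body TEXT)')

title = "Rock'n'Roll"  # embedded quote: no manual doubling needed
conn.execute('INSERT INTO serendipity_entries VALUES (?, ?, ?)',
             (1, title, '<p>body</p>'))

row = conn.execute('SELECT title FROM serendipity_entries').fetchone()
```

The driver handles quoting, so titles, bodies, and comments with embedded quotes or backslashes round-trip unchanged.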
Re: Best import options
Hi!
Wow, nice work! Four hours isn't too bad at all. Good to know, and thanks for sharing!
Regards,
Garvin