
I am trying to finish exporting a 1000-article website (ASP, SQL Server) with categories and tags into a WordPress blog. The articles were originally written in Microsoft Word and included many non-ASCII characters. They were then copied and pasted into Microsoft Access. The articles are currently stored in a SQL Server 2008 database and displayed on a website using the iso-8859-1 charset.

I am using the default WordPress import/export XML file (a WordPress eXtended RSS (WXR) file), which I copied from a file produced by exporting a blog from WordPress. This file requires UTF-8 encoding.

My problem is that iso-8859-1 characters break the importer, so many articles are not fully imported. Examples include accented words such as

naïve

and curly quotes such as “ ’

My question is: how do I clean up all the text? I can create a replace function to clean up the curly quotes, but there will always be a random word like naïve that causes a problem.

What is the simplest way to convert the encoding of all the text from iso-8859-1 to UTF-8?


1 Answer


See http://en.wikipedia.org/wiki/Iconv:

iconv is a computer program and a standardized API used to convert between different character encodings.

If you are stuck on pure Windows (i.e. not even Cygwin), and you don't agree that it's probably easiest to copy the files to a Unix system and perform the conversion there, http://www.unicodetools.com/ has a bunch of conversion tools.
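If a scripting language is available, the same re-encoding can be sketched in a few lines of Python. This is only a rough illustration, not a tested migration script: the function names and file paths are made up for the example, and it assumes the exported bytes are really Windows-1252 rather than strict iso-8859-1. Curly quotes pasted from Word have no code points in iso-8859-1, so text that displays them is most likely Windows-1252; pass "iso-8859-1" instead if that guess is wrong for your data.

```python
def legacy_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a legacy single-byte encoding and re-encode them as UTF-8.

    source_encoding is an assumption about the data; bytes that are
    undefined in that encoding are replaced with U+FFFD instead of
    raising an error.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


def convert_file(src_path: str, dst_path: str, source_encoding: str = "cp1252") -> None:
    """Read a whole file as bytes, convert it, and write the UTF-8 result."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(legacy_to_utf8(raw, source_encoding))
```

Calling something like convert_file("export-latin1.xml", "export-utf8.xml") (both names hypothetical) would produce a file the importer can read as UTF-8, provided the XML declaration at the top of the file also says encoding="UTF-8".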