Replacing Invalid XML characters from an excel file and writing it back to disk causes file is corrupted error on opening in MS Excel

Question

A little background on problem:

We have an ASP.NET MVC5 Application where we use FlexMonster to show the data in grid. The data source is a stored procedure that brings all the data into the UI grid, and once user clicks on export button, it exports the report to Excel. However, in some cases export to excel is failing. Some of the data has some invalid characters, and it is not possible/feasible to fix the source as suggested here

My approach so far:

EPPlus library fails on initializing the workbook as the input excel file contains some invalid XML characters. I could find that the file is dumped with some invalid character in it. I looked into the possible approaches .

Firstly, I identified the problematic character in the excel file. I first tried to replace the invalid character with blank space manually using Notepad++ and the EPPlus could successfully read the file.

Now using the approaches given in other SO thread here and here, I replaced all possible occurrences of invalid chars. I am using at the moment

XmlConvert.IsXmlChar

method to find out the problematic XML character and replacing with blank space.

I created a sample program where I am trying to work on the problematic excel sheet.

//in main method 
String readFile = File.ReadAllText(filePath);
string content = RemoveInvalidXmlChars(readFile);
File.WriteAllText(filePath, content);

//removal of invalid characters
        static string RemoveInvalidXmlChars(string inputText)  
        {
            StringBuilder withoutInvalidXmlCharsBuilder = new StringBuilder();
            int firstOccurenceOfRealData = inputText.IndexOf("<t>");
            int lastOccurenceOfRealData = inputText.LastIndexOf("</t>");

            if (firstOccurenceOfRealData < 0 ||
                lastOccurenceOfRealData < 0 ||
                firstOccurenceOfRealData > lastOccurenceOfRealData)
                return inputText;

            withoutInvalidXmlCharsBuilder.Append(inputText.Substring(0, firstOccurenceOfRealData)); 
            int remaining = lastOccurenceOfRealData - firstOccurenceOfRealData;
            string textToCheckFor = inputText.Substring(firstOccurenceOfRealData, remaining); 

            foreach (char c in textToCheckFor)
            {
                withoutInvalidXmlCharsBuilder.Append((XmlConvert.IsXmlChar(c)) ? c : ' ');
            }
      withoutInvalidXmlCharsBuilder.Append(inputText.Substring(lastOccurenceOfRealData));

            return withoutInvalidXmlCharsBuilder.ToString();

        }

If I replaces the problematic character manually using notepad++, then the file opens fine in MSExcel. The above mentioned code successfully replaces the same invalid character and writes the content back to the file. However, when I try to open the excel file using MS Excel, it throws an error saying that file may have been corrupted and no content is displayed (snapshots below). Moreover, Following code

var excelPackage = new ExcelPackage(new FileInfo(filePath));

on the file that I updated via Notepad++, throws following exception

"CRC error: the file being extracted appears to be corrupted. Expected 0x7478AABE, Actual 0xE9191E00"}

My Questions:

Is my approach to modify content this way correct?
If yes, How can I write updated string to an Excel file?
If my approach is wrong then, How can I proceed to get rid of invalid XML chars?

Errors shown on opening file (without invalid XML char):

First Pop up

When I click on yes

Thanks in advance !

it exports the report to Excel. are you sure? Or is it a CSV/HTML file with a fake extension? Did you check the actual file contents? An xlsx file is a zip package that contains XML files. You can't just edit it with a text editor — Panagiotis Kanavos
Which means, if that really is a zip package with XML files, you can open it with ZipPackage and modify the files directly. If you can't open them using XmlSerialider or XDocument you can treat them as text files — Panagiotis Kanavos
This problematic file is a file posted back to the server in HttpPostedFileBase object input stream. The server saves this file and then using EPPlus adds another workbook sheet to the final Excel file. As per FlexMonster documentation, it says that exportTo method exports it to excel File. We use flex monster to export the grid data to an excel file — kuldeep
Assuming the file is a real xlsx file, it means you can just open it with the ZipArchive class and edit its contents. You can access individual files through the ZipArchive.Entries property — Panagiotis Kanavos
After that you can open a stream on the file with ZipArchiveEntry.Open — Panagiotis Kanavos

Ernie S Ernie S · Accepted Answer · 2019-06-01T00:49:17

It does sounds like a binary (presumable XLSX) file based on your last comment. To confirm, open the file created by the FlexMonster with 7zip. If it opens properly and you see a bunch of XML files in folders, its a XLSX.

In that case, a search/replace on a binary file sounds like a very bad idea. It might work on the XML parts but might also replace legit chars in other parts. I think the better approach would be to do as @PanagiotisKanavos suggests and use ZipArchive. But you have to do rebuild it in the right order otherwise Excel complains. Similar to how it was done here https://stackoverflow.com/a/33312038/1324284, you could do something like this:

public static void ReplaceXmlString(this ZipArchive xlsxZip, FileInfo outFile, string oldString, string newstring)
{
    using (var outStream = outFile.Open(FileMode.Create, FileAccess.ReadWrite))
    using (var copiedzip = new ZipArchive(outStream, ZipArchiveMode.Update))
    {
        //Go though each file in the zip one by one and copy over to the new file - entries need to be in order
        foreach (var entry in xlsxZip.Entries)
        {
            var newentry = copiedzip.CreateEntry(entry.FullName);
            var newstream = newentry.Open();
            var orgstream = entry.Open();

            //Copy non-xml files over
            if (!entry.Name.EndsWith(".xml"))
            {
                orgstream.CopyTo(newstream);
            }
            else
            {
                //Load the xml document to manipulate
                var xdoc = new XmlDocument();
                xdoc.Load(orgstream);

                var xml = xdoc.OuterXml.Replace(oldString, newstring);
                xdoc = new XmlDocument();
                xdoc.LoadXml(xml);

                xdoc.Save(newstream);
            }

            orgstream.Close();
            newstream.Flush();
            newstream.Close();
        }
    }
}

When it is used like this:

[TestMethod]
public void ReplaceXmlTest()
{
    var datatable = new DataTable("tblData");
    datatable.Columns.AddRange(new[]
    {
        new DataColumn("Col1", typeof (int)),
        new DataColumn("Col2", typeof (int)),
        new DataColumn("Col3", typeof (string))
    });

    for (var i = 0; i < 10; i++)
    {
        var row = datatable.NewRow();
        row[0] = i;
        row[1] = i * 10;
        row[2] = i % 2 == 0 ? "ABCD" : "AXCD";
        datatable.Rows.Add(row);
    }

    using (var pck = new ExcelPackage())
    {
        var workbook = pck.Workbook;
        var worksheet = workbook.Worksheets.Add("source");

        worksheet.Cells.LoadFromDataTable(datatable, true);
        worksheet.Tables.Add(worksheet.Cells["A1:C11"], "Table1");

        //Now similulate the copy/open of the excel file into a zip archive
        using (var orginalzip = new ZipArchive(new MemoryStream(pck.GetAsByteArray()), ZipArchiveMode.Read))
        {
            var fi = new FileInfo(@"c:\temp\ReplaceXmlTest.xlsx");
            if (fi.Exists)
                fi.Delete();

            orginalzip.ReplaceXmlString(fi, "AXCD", "REPLACED!!");
        }
    }
}

Gives this:

Just keep in mind that this is completely brute force. Anything you can do to make the file filter smarter rather then simply doing ALL xml files would be a very good thing. Maybe limit it to the SharedString.xml file if that is where the problem lies or in the xml files in the worksheet folders. Hard to say without knowing more about the data.

Replacing Invalid XML characters from an excel file and writing it back to disk causes file is corrupted error on opening in MS Excel

1 Answers