0
votes

I have uploaded a file to the server. How can I use C# to read its contents and display them? I used a StringBuilder to extract the content and display it in a multiline textbox.

The code I used is:

string[] readText = File.ReadAllLines(path);

StringBuilder strbuild = new StringBuilder();
foreach (string s in readText)
{
    strbuild.Append(s);
    strbuild.AppendLine();
}
txtPreview.Text = strbuild.ToString();

The problem is that some unreadable characters are displayed at the top and bottom, perhaps some kind of encrypted text. How can I remove those characters and show only the contents?

// Open the document through Word automation; nullobj stands in for the optional arguments.
Microsoft.Office.Interop.Word.Document doc = Application.Documents.Open(ref file, ref nullobj, ref nullobj,
                                                  ref nullobj, ref nullobj, ref nullobj,
                                                  ref nullobj, ref nullobj, ref nullobj,
                                                  ref nullobj, ref nullobj, ref nullobj,
                                                  ref nullobj, ref nullobj, ref nullobj, ref nullobj);
doc.Activate();

// Content.Text returns the document body as plain text.
string str = doc.Content.Text;

// Split into words, dropping the empty entries produced by consecutive separators.
var words = str.Split(new char[] { ' ', ':', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries);

for (int i = 0; i < words.Length; i++)
{
    string val1 = words[i];
}

UPDATE: I am now using the Microsoft Interop library, and I am able to show the contents of the Word document in the multiline text box.

I created a string variable str to hold the entire content of the Word file, and an array words[] to store the individual words. The problem I am facing now is how to read the words. If the first word is "hello", I need to read the second and third words. If the first word is "hello" and the second word is "world", I need to read the third and fourth words. Otherwise, I need to read the first and second words. How can this be done?
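One way to express these rules is sketched below. The names WordPicker, SplitWords and PickTwo are made up for illustration, the case-insensitive comparison is an assumption, and the split also treats '\n' as a separator and drops empty entries so that blank tokens do not count as words.

```csharp
using System;
using System.Linq;

public static class WordPicker
{
    // Separators from the question, plus '\n' (an assumption).
    private static readonly char[] Separators = { ' ', ':', '\r', '\n', '\t' };

    // Splits the document text into words, skipping empty tokens.
    public static string[] SplitWords(string text)
    {
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Returns up to two words chosen by the rules in the question:
    //   "hello" "world" ... -> third and fourth words
    //   "hello" ...         -> second and third words
    //   anything else       -> first and second words
    public static string[] PickTwo(string[] words)
    {
        int start = 0;
        if (IsWord(words, 0, "hello"))
        {
            start = IsWord(words, 1, "world") ? 2 : 1;
        }
        // Skip/Take return fewer items if the text is too short.
        return words.Skip(start).Take(2).ToArray();
    }

    // True when the word at the given index exists and matches (ignoring case).
    private static bool IsWord(string[] words, int index, string expected)
    {
        return index < words.Length
            && string.Equals(words[index], expected, StringComparison.OrdinalIgnoreCase);
    }
}
```

With the variables from the question, the call would look like WordPicker.PickTwo(WordPicker.SplitWords(str)). The result can contain fewer than two words when the document is short, so check its length before indexing.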

You can't. What if the Word document had pictures in it? You need to use something such as Office interop. - Sayse
Is the file a .docx file? If so, it is a package (zipped XML), so you cannot read it as text like that. You will have to use the OpenXml libraries for .NET. Here is a 'basics' tutorial: codeproject.com/Articles/20998/…. There are some useful snippets in here too: stackoverflow.com/questions/1142830/…. NPOI is also an option: npoi.codeplex.com (which also handles .doc). - CodeBeard
Best solution here depends on your use case -- what do you actually have to do? - Hogan
You need to specify the file type you are uploading in order to get help. - None
Word has text, but it is not just a text file. You can use a commercial product like Aspose to extract the text. Or you can use IFilter, but the text will just be the words with zero formatting. - paparazzo

1 Answer

4
votes

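Word documents are not plain text. Depending on the version, they are either 'packages' (zipped XML, for .docx) or a custom binary format (for .doc). That is why reading the file with File.ReadAllLines shows unreadable characters. As such, you either need to crack open the package and read the XML yourself (not advised) or use a library.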
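To see which of the two containers an uploaded file uses, you can look at its first few bytes. This is only a rough sketch: the names WordFileSniffer and ContainerKind are hypothetical, and the signature tells you the container (ZIP or OLE compound file), not that the content really is a Word document.

```csharp
using System;
using System.IO;

public enum ContainerKind { Unknown, ZipPackage, OleCompoundFile }

public static class WordFileSniffer
{
    // Local file header of a ZIP archive ("PK" 03 04), the container used by .docx.
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Header of an OLE compound file, the container used by legacy .doc.
    private static readonly byte[] OleSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Classifies a buffer holding the first bytes of a file.
    public static ContainerKind Guess(byte[] header)
    {
        if (StartsWith(header, ZipSignature)) return ContainerKind.ZipPackage;
        if (StartsWith(header, OleSignature)) return ContainerKind.OleCompoundFile;
        return ContainerKind.Unknown;
    }

    // Reads up to eight bytes from the start of the file and classifies them.
    public static ContainerKind GuessFile(string path)
    {
        byte[] header = new byte[8];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(header, 0, header.Length);
            Array.Resize(ref header, read);
        }
        return Guess(header);
    }

    // True when data begins with every byte of prefix.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A result of Unknown suggests the file may be plain text after all, in which case the File.ReadAllLines approach from the question is fine.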

Microsoft's Open XML SDK for .NET (a separate library rather than part of the framework itself) will enable you to open Word .docx files and work with them. There are some useful snippets in this example. You can also find basic tutorials like this if you don't want to follow the Microsoft documentation.
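As a rough sketch of the Open XML route, the following reads the text of each paragraph in a .docx file. It assumes the DocumentFormat.OpenXml NuGet package is referenced; the class and method names (DocxText, Read) are made up for illustration, and nested content such as text boxes may need extra handling.

```csharp
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxText
{
    // Returns the document text, one line per paragraph.
    public static string Read(string path)
    {
        var sb = new StringBuilder();

        // Open the package read-only (second argument: isEditable = false).
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // InnerText concatenates the text of all runs inside a paragraph.
            foreach (Paragraph paragraph in body.Descendants<Paragraph>())
            {
                sb.AppendLine(paragraph.InnerText);
            }
        }

        return sb.ToString();
    }
}
```

With the controls from the question, this could be used as txtPreview.Text = DocxText.Read(path). It does not need Word installed on the server, but it only handles .docx, not the older binary .doc format.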

There are non-Microsoft libraries like NPOI that will help with both .doc and .docx files.

To use interop, you would need to have Office installed on the servers handling the document. It is possible to run Word headless for this purpose; however, I personally would not recommend it.