Reading a text file word by word

5

votes

I have a text file containing only lowercase letters and spaces, with no other punctuation. I would like to know the best way to read the file character by character so that a space marks the end of one word and the start of the next. That is, each character read is appended to a string; when the next character is a space, the word is passed to another method and the string is reset, and this repeats until the reader reaches the end of the file.

I'm trying to do this with a StringReader, something like this:

public String GetNextWord(StringReader reader)
{
    String word = "";
    char c;
    do
    {
        c = Convert.ToChar(reader.Read());
        word += c;
    } while (c != ' ');
    return word;
}

I would then call GetNextWord in a while loop until the end of the file. Does this approach make sense, or are there better ways of achieving this?

c#

Please don't prefix your titles with "C#: " and such. That's what the tags are for. - John Saunders

I think you should read bigger chunks from file (say 4096 bytes), otherwise seems fine... Also, I would like to know what would be the best size for it :) - neeKo

Building strings like that will generate a lot of objects (remember that string is immutable). Use StringBuilder if you want to build the string while reading the file. - Brian Rasmussen

@Niko How do I do that, since I still need to read one character at a time? @Brian OK, I'll check that out. Thanks for the help. - Matt

If you want the fastest way and have enough memory, use the new MemoryMappedFile class. - Tim Schmelter

17

votes

There is a much better way of doing this: string.Split(). If you read the entire string in, C# can split it on every space for you:

string[] words = reader.ReadToEnd().Split(' ');

The words array now contains all of the words in the file and you can do whatever you want with them.

Additionally, you may want to investigate the File.ReadAllText method in the System.IO namespace - it may make your life much easier for file imports to text.

Edit: I guess this assumes that your file is not abhorrently large; as long as the entire thing can be reasonably read into memory, this will work most easily. If you have gigabytes of data to read in, you'll probably want to shy away from this. I'd suggest using this approach though, if possible: it makes better use of the framework that you have at your disposal.
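
As a hedged sketch of the same idea, assuming the file fits in memory: passing a null separator array to Split makes it split on any white-space character, and StringSplitOptions.RemoveEmptyEntries drops the empty entries that repeated spaces would otherwise produce. The class and method names (WordSplitter, SplitWords, ReadWords) are made up for illustration.

```csharp
using System;
using System.IO;

public static class WordSplitter
{
    // Splits text on any white-space character (a null separator array means
    // "white space") and discards the empty entries left by repeated delimiters.
    public static string[] SplitWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // Reads the whole file into memory, then splits it into words.
    // Only suitable when the file is small enough to hold in memory.
    public static string[] ReadWords(string path)
    {
        return SplitWords(File.ReadAllText(path));
    }
}
```

Calling WordSplitter.ReadWords(path) returns the words as an array; whether line breaks should also count as separators depends on the input, so treat that as an assumption.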

6

votes

If you're interested in good performance even on very large files, you should have a look at the MemoryMappedFile class, which is new in .NET 4.0.

For example:

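using (var mappedFile1 = MemoryMappedFile.CreateFromFile(filePath))
{
    // CreateViewStream() with no arguments maps the whole file. Views are sized in
    // whole system pages, so the stream may end with padding ('\0') past the file's end.
    using (Stream mmStream = mappedFile1.CreateViewStream())
    {
        using (StreamReader sr = new StreamReader(mmStream, ASCIIEncoding.ASCII))
        {
            while (!sr.EndOfStream)
            {
                var line = sr.ReadLine();
                // Words on the current line; process them here.
                var lineWords = line.Split(' ');
            }
        }
    }
}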

From MSDN:

A memory-mapped file maps the contents of a file to an application’s logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.

The CreateFromFile methods create a memory-mapped file from a specified path or a FileStream of an existing file on disk. Changes are automatically propagated to disk when the file is unmapped.

The CreateNew methods create a memory-mapped file that is not mapped to an existing file on disk; and are suitable for creating shared memory for interprocess communication (IPC).

A memory-mapped file is associated with a name.

You can create multiple views of the memory-mapped file, including views of parts of the file. You can map the same part of a file to more than one address to create concurrent memory. For two views to remain concurrent, they have to be created from the same memory-mapped file. Creating two file mappings of the same file with two views does not provide concurrency.

4

votes

First of all: StringReader reads from a string which is already in memory. This means that you will have to load up the input file in its entirety before being able to read from it, which kind of defeats the purpose of reading a few characters at a time; it can also be undesirable or even impossible if the input is very large.

The class to read from a text stream (which is an abstraction over a source of data) is StreamReader, and you might want to use that one instead. StreamReader and StringReader share an abstract base class, TextReader, which means that if you code against TextReader you can have the best of both worlds.

TextReader's public interface will indeed support your example code, so I'd say it's a reasonable starting point. You just need to fix the one glaring bug: there is no check for Read returning -1 (which signifies the end of available data).
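
A minimal sketch of the question's GetNextWord with the end-of-data check added and coded against TextReader, so it accepts either a StringReader or a StreamReader. The class name WordReader, skipping repeated spaces, and returning null once the input is exhausted are assumptions for illustration; StringBuilder is swapped in for string concatenation, as one of the comments on the question suggests.

```csharp
using System;
using System.IO;
using System.Text;

public static class WordReader
{
    // Returns the next space-delimited word, or null when no more words remain.
    public static string GetNextWord(TextReader reader)
    {
        var word = new StringBuilder();
        int next;
        // Read() returns -1 at the end of the available data, so test the int
        // before converting it to a char.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (c == ' ')
            {
                if (word.Length > 0)
                {
                    return word.ToString(); // a space ends the current word
                }
                continue; // skip leading or repeated spaces
            }
            word.Append(c);
        }
        // End of data: return the last word if there is one, otherwise null.
        return word.Length > 0 ? word.ToString() : null;
    }
}
```

A caller would loop until GetNextWord returns null, passing each word on to whatever method processes it.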

1

votes

All in one line, here you go (assuming ASCII and perhaps not a 2 GB file):

var file = File.ReadAllText(@"C:\myfile.txt", Encoding.ASCII).Split(new[] { ' ' });

This returns a string array, which you can iterate over and do whatever you need with.

1

votes

I would do something like this:

IEnumerable<string> ReadWords(StreamReader reader)
{
    string line;
    while((line = reader.ReadLine())!=null)
    {
        foreach(string word in line.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return word;
        }
    }
}

If you use File.ReadAllText (or reader.ReadToEnd), the entire file is loaded into memory, so on a large file you can get an OutOfMemoryException and a lot of other problems.
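
A related sketch, swapping in File.ReadLines (available since .NET 4.0), which enumerates lines lazily instead of loading the whole file. The names LazyWords and EnumerateWords are hypothetical. Note the assumption that the file contains line breaks: each line is still read whole, so a file that is one very long line gains nothing from this.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyWords
{
    // Lazily yields every space-separated word in the file, one line at a time.
    public static IEnumerable<string> EnumerateWords(string path)
    {
        return File.ReadLines(path)
                   .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

Because the sequence is lazy, the file is only opened and read as the caller iterates, for example with foreach (string word in LazyWords.EnumerateWords(path)).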

1

votes

If you want to read it without splitting the string (for example, because the lines are so long that you might encounter an OutOfMemoryException), you should do it like this, using a StreamReader:

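// Reads one word from sr, stopping at the first whitespace character or at the
// end of the stream. The method name ReadWord is illustrative.
static string ReadWord(StreamReader sr)
{
    var word = new StringBuilder();
    while (sr.Peek() >= 0)
    {
        char c = (char)sr.Read();
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
        {
            break;
        }
        word.Append(c);
    }
    return word.ToString();
}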

0

votes

This method splits the text into words that are separated by one or more spaces (two spaces, for example):

StreamReader streamReader = new StreamReader(filePath); //get the file
string stringWithMultipleSpaces= streamReader.ReadToEnd(); //load file to string
streamReader.Close();

Regex r = new Regex(" +"); //specify delimiter (spaces)
string [] words = r.Split(stringWithMultipleSpaces); //(convert string to array of words)

foreach (String W in words)
{
   MessageBox.Show(W);
}

0

votes

I created a simple console program for your exact requirement, using the files you mentioned. It should be easy to run and check. The code is below; I hope this helps.

static void Main(string[] args)
    {

        string[] input = File.ReadAllLines(@"C:\Users\achikhale\Desktop\file.txt");
        string[] array1File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array1.txt");
        string[] array2File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array2.txt");

        List<string> finalResultarray1File = new List<string>();
        List<string> finalResultarray2File = new List<string>();

        foreach (string inputstring in input)
        {
            string[] wordTemps = inputstring.Split(' ');//  .Split(' ');

            foreach (string array1Filestring in array1File)
            {
                string[] word1Temps = array1Filestring.Split(' ');

                var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();

                if (result.Count > 0)
                {
                    finalResultarray1File.AddRange(result);
                }

            }

        }

        foreach (string inputstring in input)
        {
            string[] wordTemps = inputstring.Split(' ');//  .Split(' ');

            foreach (string array2Filestring in array2File)
            {
                string[] word1Temps = array2Filestring.Split(' ');

                var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();

                if (result.Count > 0)
                {
                    finalResultarray2File.AddRange(result);
                }

            }

        }

        if (finalResultarray1File.Count > 0)
        {
            Console.WriteLine("file array1.txt contains words: {0}", string.Join(";", finalResultarray1File));
        }

        if (finalResultarray2File.Count > 0)
        {
            Console.WriteLine("file array2.txt contains words: {0}", string.Join(";", finalResultarray2File));
        }

        Console.ReadLine();

    }
}

0

votes

This code will extract words from a text file based on the Regex pattern. You can try playing with other patterns to see what works best for you.

    StreamReader reader =  new StreamReader(fileName);

    var pattern = new Regex(
              @"( [^\W_\d]              # starting with a letter
                                        # followed by a run of either...
                  ( [^\W_\d] |          #   more letters or
                    [-'\d](?=[^\W_\d])  #   ', -, or digit followed by a letter
                  )*
                  [^\W_\d]              # and finishing with a letter
                )",
              RegexOptions.IgnorePatternWhitespace);

    string input = reader.ReadToEnd();

    foreach (Match m in pattern.Matches(input))
        Console.WriteLine("{0}", m.Groups[1].Value);

    reader.Close();

9 Answers