31
votes

In C# what's the best way to remove blank lines i.e., lines that contain only whitespace from a string? I'm happy to use a Regex if that's the best solution.

EDIT: I should add I'm using .NET 2.0.


Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.

First, any Perl 5-compatible regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.

Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines which consist of nothing but whitespace, as well as the last newline. If there is a string which, after running through your regex, ends with "\r\n" or any whitespace characters, it fails.

19
A regex is quick and simple. What aspect are you trying to optimize when you say "the best way"? Readability? Time? Memory use? - Michael Petito
I'd say readability would be the most important in this case. - FunLovinCoder
Readability rarely equates to regular expressions. - Nick Gotch
Agreed they can get pretty hairy, but I think the one by Chris Schmich, for example, is fine. - FunLovinCoder

19 Answers

21
votes

If you want to remove lines containing only whitespace (tabs, spaces), try:

string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:

string fix =
    Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
         .TrimEnd();
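
As a quick check of the pattern above against the bounty's test string, here is a minimal sketch. The class and method names (RegexBlankLines, StripBlankLines) are hypothetical wrappers added for illustration; the pattern and the TrimEnd call are the ones from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBlankLines
{
    // Hypothetical helper: removes whitespace-only lines, then trims
    // whatever whitespace is left at the end of the string.
    public static string StripBlankLines(string original)
    {
        return Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
                    .TrimEnd();
    }
}
```

Calling RegexBlankLines.StripBlankLines("test\r\n \r\nthis\r\n\r\n") should give "test\r\nthis". Keep in mind that TrimEnd also strips trailing spaces and tabs from the last non-blank line, which may or may not be what you want.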
18
votes
string outputString;
using (StringReader reader = new StringReader(originalString))
using (StringWriter writer = new StringWriter())
{
    string line;
    while((line = reader.ReadLine()) != null)
    {
        if (line.Trim().Length > 0)
            writer.WriteLine(line);
    }
    outputString = writer.ToString();
}
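
One thing to watch with the loop above: WriteLine appends a line terminator after every kept line, so the output ends with a newline, which the bounty's test does not allow. A possible variation, sketched here under the assumption that you want to choose the separator yourself (the class name, method name, and newline parameter are all hypothetical), collects the kept lines and joins them. It sticks to APIs available in .NET 2.0.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReaderBlankLines
{
    // Hypothetical helper: keeps only lines with non-whitespace content
    // and joins them with the given separator, leaving no trailing newline.
    public static string RemoveBlankLines(string text, string newline)
    {
        List<string> kept = new List<string>();
        using (StringReader reader = new StringReader(text))
        {
            string line;
            // ReadLine treats "\r", "\n" and "\r\n" as line terminators.
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Trim().Length > 0)
                    kept.Add(line);
            }
        }
        // .NET 2.0's string.Join needs a string[], hence ToArray().
        return string.Join(newline, kept.ToArray());
    }
}
```

For example, ReaderBlankLines.RemoveBlankLines("test\r\n \r\nthis\r\n\r\n", "\r\n") should return "test\r\nthis".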
14
votes

off the top of my head...

string result = Regex.Replace(input, @"\s*(\n)", "$1");

turns this:

fdasdf
asdf
[tabs]

[spaces]  

asdf


into this:

fdasdf
asdf
asdf
8
votes

Using LINQ:

var result = string.Join("\r\n",
                 multilineString.Split(new string[] { "\r\n" }, ...None)
                                .Where(s => !string.IsNullOrWhiteSpace(s)));

If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.

4
votes

Alright, this answer is in accordance with the clarified requirements specified in the bounty:

I also need to remove any trailing newlines, and my Regex-fu is failing. My bounty goes to anyone who can give me a regex which passes this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") == "test\r\nthis"

So here's the answer:

(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z

Or in the C# code provided by @Chris Schmich:

string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", @"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);

Now let's try to understand it. There are three alternative patterns here, each of which gets replaced with string.Empty.

  1. (?<=\r?\n)(\s*$\r?\n)+ - matches one or more lines containing only whitespace and preceded by a line break (but does not match that preceding line break).
  2. (?<=\r?\n)(\r?\n)+ - matches one or more empty lines with no content that are preceded by a line break (but does not match that preceding line break).
  3. (\r?\n)+\z - matches one or more line breaks at the end of the tested string (trailing line breaks, as you called them).

That satisfies your test perfectly, and it handles both \r\n and \n line break styles. Test it out! I believe this will be the most correct answer: although a simpler expression would pass your specified bounty test, this regex passes more complex conditions.

EDIT: @Will pointed out a potential flaw in the last pattern match of the above regex in that it won't match multiple line breaks containing white space at the end of the test string. So let's change that last pattern to this:

\b\s+\z

The \b is a word boundary (beginning or END of a word), the \s+ is one or more whitespace characters, and the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file, including tabs and spaces in addition to carriage returns and line breaks. I tested both of @Will's provided test cases.

So all together now, it should be:

(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z

EDIT #2: Alright, there is one more possible case @Will found that the last regex doesn't cover: inputs that have line breaks at the beginning of the file before any content. So let's add one more pattern to match the beginning of the file.

\A\s+ - The \A matches the beginning of the file, and the \s+ matches one or more whitespace characters.

So now we've got:

\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z

So now we have four patterns for matching:

  1. whitespace at the beginning of the file,
  2. redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
  3. redundant line breaks with no content, (ex: \r\n\r\n)
  4. whitespace at the end of the file
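
To try the final four-part pattern, here is a minimal sketch that wraps it in a helper. The class name is hypothetical, and StripWhitespace simply borrows the name used in the bounty's test; the pattern and RegexOptions.Multiline come from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FourPatternStrip
{
    // The combined pattern from the answer: leading whitespace, blank lines
    // with or without whitespace, and trailing whitespace after a word.
    private const string Pattern =
        @"\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z";

    public static string StripWhitespace(string input)
    {
        return Regex.Replace(input, Pattern, string.Empty, RegexOptions.Multiline);
    }
}
```

FourPatternStrip.StripWhitespace("test\r\n \r\nthis\r\n\r\n") should return "test\r\nthis", and "\r\n\r\ntest!" should become "test!". One caveat worth checking against your own data: the last alternative starts with \b, so if the final visible character is not a word character (for example "a!\r\n"), the trailing line break may not be matched.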
3
votes

Not good. I would use this instead, using Json.NET:

var o = JsonConvert.DeserializeObject(prettyJson);
string minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);
2
votes

In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.

Use RegexOptions.Multiline with this pattern:

^\s+(?!\B)|\s*(?>[\r\n]+)$

Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.

string[] inputs = 
{
    "one\r\n \r\ntwo\r\n\t\r\n \r\n",
    "test\r\n \r\nthis\r\n\r\n",
    "\r\n\r\ntest!",
    "\r\ntest\r\n ! test",
    "\r\ntest \r\n ! "
};
string[] outputs = 
{
    "one\r\ntwo",
    "test\r\nthis",
    "test!",
    "test\r\n ! test",
    "test \r\n ! "
};

string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";

for (int i = 0; i < inputs.Length; i++)
{
    string result = Regex.Replace(inputs[i], pattern, "",
                                  RegexOptions.Multiline);
    Console.WriteLine(result == outputs[i]);
}

EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.

1
votes
string corrected = 
    System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");
1
votes

I'll go with:

  public static string RemoveEmptyLines(string value) {
    using (StringReader reader = new StringReader(value)) {
      StringBuilder builder = new StringBuilder();
      string line;
      while ((line = reader.ReadLine()) != null) {
        if (line.Trim().Length > 0)
          builder.AppendLine(line);
      }
      return builder.ToString();
    }
  }
1
votes

Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.

public static string RemoveEmptyLines(this string text) {
    var builder = new StringBuilder();

    using (var reader = new StringReader(text)) {
        while (reader.Peek() != -1) {
            string line = reader.ReadLine();
            if (!string.IsNullOrWhiteSpace(line))
                builder.AppendLine(line);
        }
    }

    return builder.ToString();
}

Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:

public static bool IsNullOrWhiteSpace(string text) {
    return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}
1
votes

In response to Will's bounty here is a Perl sub that gives correct response to the test case:

sub StripWhitespace {
    my $str = shift;
    print "'",$str,"'\n";
    $str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
    print "'",$str,"'\n";
    return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");

output:

'test

this

'
'test
this'

To avoid using \R, replace it with [\r\n] and reverse the order of the alternatives. This one produces the same result:

$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;

No special configuration or multi-line support is needed. Nevertheless, you can add the s flag if it's mandatory.

$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;
1
votes

If it's only whitespace, why don't you use the C# string method?

    string yourstring = "A O P V 1.5";
    yourstring = yourstring.Replace(" ", string.Empty);

The result will be "AOPV1.5".

0
votes
char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines);
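
Note that StringSplitOptions.RemoveEmptyEntries only drops entries that are completely empty, so a line holding just spaces or tabs survives the split. A possible extension, sketched with hypothetical class and method names and restricted to .NET 2.0 APIs, filters those out before joining:

```csharp
using System;

public static class SplitJoinStrip
{
    // Hypothetical helper: split on CR/LF, drop empty entries, then also
    // drop entries that contain nothing but whitespace.
    public static string RemoveBlankLines(string value)
    {
        char[] delimiters = new char[] { '\r', '\n' };
        string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
        string[] kept = Array.FindAll(lines,
            delegate(string line) { return line.Trim().Length > 0; });
        return string.Join(Environment.NewLine, kept);
    }
}
```

With this, "test\r\n \r\nthis\r\n\r\n" should come back as "test" and "this" separated by Environment.NewLine, with no trailing line break.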
0
votes

Here is something simple if working against each individual line...

(^\s+|\s+|^)$
0
votes

Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips

  1. All empty lines from the start of a string
    • Not including any spaces at the beginning of the first non-whitespace line
  2. All empty lines after the first non-whitespace line and before the last non-whitespace line
    • Again, preserving all whitespace at the beginning of any non-whitespace line
  3. All empty lines after the last non-whitespace line, including the last newline

(?<=(\r\n)|^)\s*\r\n|\r\n\s*$

which essentially says:

  • Immediately after
    • The beginning of the string OR
    • The end of the last line
  • Match as much contiguous whitespace as possible that ends in a newline*
  • OR
  • Match a newline and as much contiguous whitespace as possible that ends at the end of the string

The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.

Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.

*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
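
For anyone who wants to try it, here is a minimal sketch that wraps the pattern above in a helper. The class name is hypothetical; StripWhitespace is the name from the bounty's test. As noted, no RegexOptions are passed and \r\n line endings are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrLfBlankLineRegex
{
    // First alternative: blank lines at the start of the string or right
    // after a line break. Second alternative: the final line break plus
    // any trailing whitespace.
    private const string Pattern = @"(?<=(\r\n)|^)\s*\r\n|\r\n\s*$";

    public static string StripWhitespace(string input)
    {
        return Regex.Replace(input, Pattern, string.Empty);
    }
}
```

CrLfBlankLineRegex.StripWhitespace("test\r\n \r\nthis\r\n\r\n") should return "test\r\nthis", and "\r\ntest\r\n ! test" should come back as "test\r\n ! test" with the leading space on the second line preserved.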

0
votes

String Extension

public static string UnPrettyJson(this string s)
{
    try
    {
        // var jsonObj = Json.Decode(s);
        // var sObject = Json.Encode(value);   doesn't work well with arrays of strings, e.g. c:['a','b','c']

        object jsonObj = JsonConvert.DeserializeObject(s);
        return JsonConvert.SerializeObject(jsonObj, Formatting.None);
    }
    catch (Exception e)
    {
        throw new Exception(
            s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
    }
}
0
votes

I'm not sure whether it's efficient, but =)

  List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
  myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).ToList());
-1
votes

Try this.

string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);

string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);
-2
votes
s = Regex.Replace(s, @"^[^\n\S]*\n", "");

[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:

s = Regex.Replace(s, @"^[ \t\r]*\n", "");

And if you want it to catch the last line, without a final linefeed:

s = Regex.Replace(s, @"^[ \t\r]*\n?", "");