8
votes

I have been happily using gawk with FPAT. Here's the script I use for my examples:

#!/usr/bin/gawk -f

BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
}

{
    for (i=1; i<=NF; i++) {
        printf "Record #%s, field #%s: %s\n", NR, i, $i
    }
}

Simple, no quotes

Works well.

$ echo 'a,b,c,d' | ./test.awk 
Record #1, field #1: a
Record #1, field #2: b
Record #1, field #3: c
Record #1, field #4: d

With quotes

Works well.

$ echo '"a","b",c,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: c
Record #1, field #4: d

With empty columns and quotes

Works well.

$ echo '"a","b",,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With escaped quotes, empty columns and quotes

Works well.

$ echo '"""a"": aaa","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With a column containing escaped quotes and ending with a comma

Fails.

$ echo '"""a"": aaa,","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa
Record #1, field #2: ","
Record #1, field #3: b"
Record #1, field #4: 
Record #1, field #5: d

Expected output:

$ echo '"""a"": aaa,","b",,d' | ./test_that_would_be_working.awk 
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

Is there a regex for FPAT that would make this work, or is this just not supported by awk?

The pattern would be " followed by anything but a single ". A bracket expression matches one character at a time, so [^"] has no way to reject a lone " while still accepting a doubled "".

I think there may be an option with lookaround, but I'm not good enough with it to make it work.

@RomanPerekhrest, it's four fields. Using | as a field separator: a||"b" b,|cjas
@BenoitDuffez, can you accept another working alternative solution? A simple one-liner. - RomanPerekhrest
@BenoitDuffez, unfortunately, the question is marked as a duplicate and I can't add an answer. If you still want it, post the question with the python tag without mentioning awk, and let me know. - RomanPerekhrest
@BenoitDuffez, which input do you prefer: a single string or a file? - RomanPerekhrest
Please fix the second example, it's obviously wrong, and explain what "somewhat parses" means. Add the exact output you want for each case. Right now it is fuzzy, i.e. the second example says that field """b"" b" needs further parsing (why?) and the third one says it would be ok. - thanasisp

1 Answer

4
votes

Because gawk's regular expressions don't support lookarounds, the FPAT pattern has to spell out every case explicitly. This one will do:

FPAT="[^,\"]*|\"([^\"]|\"\")*\""

Explanation:

[^,\"]*             # match 0 or more characters other than , and "
|                   # OR
\"                  # match an opening '"'
  ([^\"]            #   followed by any character other than '"'
   |                #   OR
   \"\"             #   an escaped quote '""'
  )*                #   repeated 0 or more times
\"                  # ending with a closing '"'

Now testing it:

$ cat tst.awk
BEGIN {
    FPAT="[^,\"]*|\"([^\"]|\"\")*\""
}
{ 
   for (i=1; i<=NF; i++){ printf "Record #%s, field #%s: %s\n", NR, i, $i }
}


$ echo '"""a"": aaa,","b",,d' | awk -f tst.awk
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3:
Record #1, field #4: d