0
votes

I have insider transactions of a company in a space delimited file. Sample data looks like the following:

1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26

Col1 is Serial# that I dont need to print
Col2 is the name of the person who did trades. This column is not consistent. It has first name and second name and middle initial and for some insiders salutations as well (Mr, Dr. Jr etc)
col3 is the date format Month Day,Year
col4 is the number of shares traded
col5 is the price at which shares were purchased or sold.

I need you guys help to print each column value separately. Thanks for your help.

3
If you have spaces inside of fields and your columns are space-delimited, you won't be able to get a clean solution. Ideally you want to be able to extract your data in tsv format. Also, it is customary to show what you have tried before soliciting assistance on StackOverflow. - merlin2011
You mean you want a command that will print the names, another command that will print the dates, and so on? - Beta
Do you have any control over how the data looks? - Zack
No, unfortunately not. This is how we are receiving the data from the source. - user3781132
Are you using AWK to try to get the values? I have no experience with it, so I was wondering if you know if it has regex capabilities? That might be your only hope. - Zack

3 Answers

1
votes

Count the total number of fields read; the difference between that and the number of non-name fields gives you the width of the name.

#!/bin/bash
# uses bash features, so needs a /bin/bash shebang, not /bin/sh

# read all fields into an array
while read -r -a fields; do

  # calculate name width assuming 5 non-name fields
  name_width=$(( ${#fields[@]} - 5 ))
  cur_field=0

  # read initial serial number
  ser_id=${fields[cur_field]}; (( ++cur_field ))

  # read name
  name=''
  for ((i=0; i<name_width; i++)); do
    name+=" ${fields[cur_field]}"; (( ++cur_field ))
  done
  name=${name# } # trim leading space

  # date spans two fields due to containing a space
  date=${fields[cur_field]}; (( ++cur_field ))
  date+=" ${fields[cur_field]}"; (( ++cur_field ))

  # final fields are one span each
  num_shares=${fields[cur_field]}; (( ++cur_field ))
  price=${fields[cur_field]}; (( ++cur_field ))

  # print in newline-delimited form
  printf '%s\n' "$ser_id" "$name" "$date" "$num_shares" "$price" ""
done

Run as follows (if you saved the script as process):

./process <input.txt >output.txt
1
votes

It might be a little easier in perl.

perl -lane '
    @date = splice @F, -4, 2;
    @left = splice @F, -2, 2;
    splice @F, 0, 1;
    print join "|", "@F", "@date", @left
' file
Gilliland Michael S|January 2,2013|20,000|19
Still George J Jr|January 2,2013|20,000|19
Bishkin S. James|February 1,2013|150,000|21
Mellin Mark P|May 28,2013|238,000|25.26

You can change the delimiter in the join as per your requirement.

0
votes

Here is the data separated using awk

awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file
1|Gilliland Michael S|January 2,2013|20,000|19
2|Still George J Jr|January 2,2013|20,000|19
3|Bishkin S. James|February 1,2013|150,000|21
4|Mellin Mark P|May 28,2013|238,000|25.26

You know have your data in variable c1 to c5

Or better displayed here:

awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file | column -t -s "|"
1  Gilliland Michael S  January 2,2013   20,000   19
2  Still George J Jr    January 2,2013   20,000   19
3  Bishkin S. James     February 1,2013  150,000  21
4  Mellin Mark P        May 28,2013      238,000  25.26