1
votes

I have a huge flat file: 100K records, each spanning 3000 columns. I need to remove a segment of the data, say from position 300 to position 500, before archiving. This is a sensitive part of the data that needs to be wiped before I can archive. I am looking for an awk or sed or any similar command that can do the trick for me.

Sample file

003133780 MORNING GLORY DR                                        SOUTHAMPTON         PA18966780 MORNING GLORY DR    
0054381303 MADISON ST                                             RADFORD             VA241411303 MADISON ST         
00586728 CONESTOGA COURT                                          CHADDS FORD         PA1931728 CONESTOGA COURT      
1852921800 SAMER RD                                               MILAN               MI481601800 SAMER RD           
192717175 EVERGREEN CIRCLE                                        HENDERSONVILLE      TN37075175 EVERGREEN CIRCLE    
213673217 EAST BRANCH                                             LONGVIEW            TX75604217 EAST BRANCH         
2490423205 NOTTAGE LANE                                           FALLS CHURCH        VA220423205 NOTTAGE LANE       
249357344 BALOGH PLACE                                            LONGWOOD            FL32750344 BALOGH PLACE        
2502811224 WILFORD HOLLOW ROAD                                    VINTON              VA241791224 WILFORD HOLLOW ROAD
277634210 AMANDA CT                                               WHITEHOUSE          TX7579119726 COPPER OAKS DRIVE 
282482507 B ST.                                                   CHESAPEAKE          VA23324507 B ST.               

Expected output

003133780 MORNING GLORY DR                                        SOUTHAMPTON         PA780 MORNING GLORY DR    
0054381303 MADISON ST                                             RADFORD             VA1303 MADISON ST         
00586728 CONESTOGA COURT                                          CHADDS FORD         PA28 CONESTOGA COURT      
1852921800 SAMER RD                                               MILAN               MI1800 SAMER RD           
192717175 EVERGREEN CIRCLE                                        HENDERSONVILLE      TN175 EVERGREEN CIRCLE    
213673217 EAST BRANCH                                             LONGVIEW            TX217 EAST BRANCH         
2490423205 NOTTAGE LANE                                           FALLS CHURCH        VA3205 NOTTAGE LANE       
249357344 BALOGH PLACE                                            LONGWOOD            FL344 BALOGH PLACE        
2502811224 WILFORD HOLLOW ROAD                                    VINTON              VA1224 WILFORD HOLLOW ROAD
277634210 AMANDA CT                                               WHITEHOUSE          TX19726 COPPER OAKS DRIVE 
282482507 B ST.                                                   CHESAPEAKE          VA507 B ST.               

Here I removed the characters between positions 89 and 95. One small change: I also need to write the changed content back to the same file.

Below is the script I have so far. I am looping through all the files, splitting each into files of at most 20000 rows, and then removing the characters between positions X and Y before archiving.

for currentfilename in `ls -1 *.[tT][xX][tT]`
do
    echo $currentfilename
    tempfilename=${currentfilename%%.*}
    awk -v A="$tempfilename" '{filename = A "Part" int((NR-1)/20000) ".txt"; print >> filename}' $currentfilename
    awk '{print substr($0,1,522) substr($0,953) >> filename}' $currentfilename
    mv $currentfilename $APP_ROOT/Archive
done

3
Columns 300 to 500, or characters 300 to 500? - phs
And you sure it doesn't start at 301? Also how is each column separated if they are columns? Care to provide a sample input even just 1 line? You can also upload it to pastebin.com. - konsolebox
What is your delimiter? Commas or Tabs? - merlin2011

3 Answers

5
votes

Assuming that position means column, you can use cut to select the columns you want.

cut -f 1-299,501-3000 CutMe.txt

If your data is delimited by commas instead of tabs, then use -d.

cut -d, -f 1-299,501-3000 CutMe.txt

If position means character, you can do the same with cut -c.

cut -c 1-299,501-3000 CutMe.txt
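
The question also asks for the result to end up in the same file. cut has no in-place option, and redirecting its output straight onto the input file would truncate that file before it is read. A minimal sketch of one way around this, assuming character positions and that a temporary file is acceptable; the function name strip_chars_inplace and its arguments are hypothetical, and it assumes the first position is greater than 1:

```shell
# Hypothetical helper: delete characters first..last (1-based, inclusive)
# from every line of a file and write the result back to that file.
# Assumes first > 1, since cut rejects a range such as "1-0".
strip_chars_inplace() {
    file=$1 first=$2 last=$3
    tmp=$(mktemp) || return 1
    # Keep everything before `first` and everything after `last`.
    # The open-ended range "N-" runs to the end of each line.
    if cut -c "1-$((first - 1)),$((last + 1))-" "$file" > "$tmp"; then
        # Copy the contents back so the original file keeps its permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For the ranges discussed here the call would be strip_chars_inplace CutMe.txt 300 500. Writing to a temporary file first means a failed cut leaves the original untouched.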
3
votes

Assuming "position" means "character":

awk '{print substr($0,1,299) substr($0,501)}' file

If it doesn't then edit your question to add some REPRESENTATIVE sample input and expected output (e.g. 5 lines of 6 columns each, not thousands of lines of thousands of columns).
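
The same substr call can be folded into the splitting step from the question, so that each line is trimmed before it is written to its part file and no second pass is needed. This is only a sketch under the "position means character" assumption; the function name split_and_strip and the awk variables first, last, rows and prefix are hypothetical:

```shell
# Hypothetical helper: strip characters first..last (1-based, inclusive)
# from each line of the input and write the result in chunks of `rows`
# lines to <prefix>Part0.txt, <prefix>Part1.txt, ...
# Usage: split_and_strip input first last rows prefix
split_and_strip() {
    awk -v first="$2" -v last="$3" -v rows="$4" -v prefix="$5" '
    BEGIN { current = -1 }   # no chunk is open yet
    {
        part = int((NR - 1) / rows)
        if (part != current) {
            # Close the finished chunk so only one output file is open.
            if (out != "") close(out)
            out = prefix "Part" part ".txt"
            current = part
        }
        # Everything before `first` plus everything after `last`.
        print substr($0, 1, first - 1) substr($0, last + 1) > out
    }' "$1"
}
```

With the numbers from the question's script this would be called as split_and_strip "$currentfilename" 523 952 20000 "$tempfilename", after which the untrimmed original can be dealt with separately. Note that > truncates an existing part file the first time it is opened, unlike the >> used in the question.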

2
votes

Using sed:

sed -r -i.bak 's/(.{299}).{201}/\1/' file

The -r option enables extended regex (GNU sed; BSD/macOS sed spells it -E). If you need it to be portable, you can drop that option and backslash-escape the parentheses and braces instead. The -i option makes the change in place. I have added the extension .bak so that a backup of the original is kept in case anything goes wrong. You can remove it if you don't need the backup.

As for how it works: we capture the first 299 characters in a capture group and then match the next 201 characters (positions 300 through 500, inclusive), which are the ones to remove. The whole match is replaced with just the captured group, so the rest of the line is left untouched.
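
To make the offsets easier to check, the two repeat counts can be derived from the first and last positions rather than typed by hand. A sketch using basic regular expressions, so no -r is needed; strip_range is a hypothetical name, and it filters standard input instead of editing a file in place:

```shell
# Hypothetical helper: remove characters first..last (1-based, inclusive)
# from each line read on standard input.
strip_range() {
    first=$1 last=$2
    keep=$((first - 1))          # characters preserved before the range
    drop=$((last - first + 1))   # inclusive length of the range
    # Basic regex: groups and intervals are written \( \) and \{ \}.
    sed "s/^\(.\{$keep\}\).\{$drop\}/\1/"
}
```

For example, strip_range 300 500 < file keeps 299 characters and drops the next 201. One difference from the awk and cut versions: a line shorter than the last position does not match the pattern at all, so it passes through unchanged instead of being cut off at the first position.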