
I have a several-hundred-million-row CSV with no headers, tab delimited, and about 40 fields. I need to split this CSV into 6 smaller CSVs.

The smaller CSVs' fields will be specified by cherry-picking, as in this example:

Output csv#1 uses fields 1, 4, 23, 16

Output csv#2 uses fields 2, 3, 5, 24, 34

Output csv#3 uses fields 6, 25, 20, 16

Output csv#4 uses fields 10, 29, 31, 32, 34, 35

Etc.

Several of the original fields won't get used.

I have had the pleasure of reading many other users' posts on splitting a single CSV into multiple CSVs, but I haven't stumbled across one that hints at how to specify which fields go into which output CSVs.

I am assuming I should use StreamReader/StreamWriter, splitting on the tab ("`t") delimiter. I don't mind running over the source file once per output file rather than doing it all in a single pass; I expect this to run overnight (or over a weekend!).

I just haven't figured out how to pick each field for the correct output CSV. Thank you for any and all ideas!

-= Bruce

Here is my sample data. This sample is comma delimited, NOT tab delimited; I can adapt any code to the tab-delimited change.

12/22/2018 16:00,Something,,4503333883,137.11.21.251,113,,Tackov,SO,1968974515,192.168.4312.179,37468,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,RST,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,With Stackoverflow all things are possible,0,7,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something2,stackover,2018138499,120.43.212.240,34749,,Tackov,SO,1968982930,192.168.4345.146,4545,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something3,Overflows,3203409083,12.45.36.187,60170,,Tackov,SO,1968976181,192.168.4319.53,4545,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something4,,1390242054,82.221.105.6,59922,,Tackov,SO,1968986785,192.168.4360.161,88,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something7,Ackover,1245693210,11.63.197.26,60300,,Tackov,SO,1968982959,192.168.4345.175,99,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something2,Koverflo,3026283045,199.97.106.37,47861,,Tackov,SO,1968982790,192.168.4345.6,443,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win

In case the data doesn't copy/paste well: each row ends with the characters "/Fail/Win".

=-=-=-=-=-=-=-=-==- 7 hours later..... I think I have everything going except for one tiny detail: I haven't been able to get my columns split out when reading and writing with this method. My test file now reads and writes in 29.1 seconds. YAY! But all it does is read the file and write it back out, which is not terribly effective. The variable $Header, per the 'Tessalating' solution, contains the headers I want to split off in my test code. I have not found how to specify a delimiter and subsequently apply the $Header variable. Thank you for any help.

    Clear-Host
    $sw = New-Object System.Diagnostics.Stopwatch
    $sw.Start()

    # Column names I want split off into this output file
    $Header = 'SourceGeoCountryName','SourceGeoLocationInfo','SourceAddress','SourceHostName','SourcePort','SourceDnsDomain'

    $MyPath         = "C:\Users\myusername\SplitOuts\"
    $InputFile      = "mysourcefile.csv"
    $OutputFileName = "table1.csv"
    $FileName       = $MyPath + $InputFile
    $OutputFile     = $MyPath + $OutputFileName
    Set-Location $MyPath

    If (Test-Path $OutputFile) { Clear-Content $OutputFile }
    $stream = [System.IO.StreamWriter] $OutputFile

    # Currently this just copies every line verbatim -- no column splitting yet
    ForEach ($line in [System.IO.File]::ReadLines($FileName)) {
        $stream.WriteLine($line)
    }

    $stream.Close()
    $sw.Stop()
    Write-Host "Completed in " $sw.Elapsed.TotalSeconds.ToString("00.0") "seconds" -ForegroundColor Yellow

-= Bruce
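As an illustration of one way the $Header names could be applied inside that loop, here is a rough sketch (not the asker's code): the $AllColumns list below is a placeholder for the real ~40-column order, the sample's comma delimiter is assumed, and every name in $Header must appear in $AllColumns.

    # Placeholder: the real file's ~40 column names, in file order (must include every name in $Header)
    $AllColumns = 'DateTime','Something','Name','SourceGeoCountryName','SourceGeoLocationInfo',
                  'SourceAddress','SourceHostName','SourcePort','SourceDnsDomain' # ... etc.

    # Look up the 0-based position of each wanted column once, before the loop
    $indexes = foreach ($name in $Header) { [array]::IndexOf($AllColumns, $name) }

    ForEach ($line in [System.IO.File]::ReadLines($FileName)) {
        $fields = $line -split ','                        # use "`t" for the real tab-delimited file
        $stream.WriteLine(($fields[$indexes] -join ','))  # keep only the wanted columns
    }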

For PowerShell the obvious answer would be to use Import-Csv, data munging and Export-Csv. However, with files that size loading everything into memory becomes prohibitive. But dropping down to manual broken parsing feels weird, too ... :-| - Joey
How long does Import-Csv take to import the file once? Measure-Command { Import-Csv "file/path" -Delimiter "yourdelimiter" } - sodawillow
@sodawillow It's not just about the import operation. If the data volume exceeds the address space of the process you're gonna end up waiting on swaps instead. One strategy could be to chunk the import (Get-Content |Select -Skip $x -First $n |ConvertFrom-Csv |%{<# split data and Export-Csv -Append #>}) - Mathias R. Jessen
Could you provide a sample (5 lines) of the data? If it's simple (eg. barewords and numbers, no text qualifiers or escape sequences) you may win a bit of memory compaction by doing pure string operations rather than trying to parse it as an actual CSV - Mathias R. Jessen
@sodawillow, I will manufacture 5 rows of allowable sample data. - Bruce

1 Answer


Since you're willing to run it over a weekend, i.e. speed doesn't matter too much, I don't see the need to avoid Import-Csv immediately. Specify some headers that make sense, and do something like:

$Header = 'Date', 'IP', 'CustName', 
          'Things', 'Code', 'AccountNum', 
          'Whatever' # etc. 

Import-Csv -Path c:\temp\input.tsv -Delimiter "`t" -Header $Header | ForEach-Object {

        # small CSV 1
        $_ | Select-Object -Property 'Date', 'Code', 'Whatever' | 
                 Export-Csv -Path c:\temp\output1.csv -Append -Delimiter "`t" -NoTypeInformation

        # small CSV 2
        $_ | Select-Object -Property 'AccountNum', 'IP', 'CustName' | 
                 Export-Csv -Path c:\temp\output2.csv -Append -Delimiter "`t" -NoTypeInformation

         #... etc.

}

It's not the most efficient approach (lots of file opening and closing for the output, plus per-line overhead and object churn), but it is pretty straightforward: name the fields, then choose the ones you want in each file.
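If the per-line Export-Csv -Append cost turns out to hurt, one variation (a sketch, leaning on the asker's note that multiple passes over the source are acceptable) is to run one streaming pipeline per output file, so each output is opened only once while memory still stays flat:

$Header = 'Date', 'IP', 'CustName',
          'Things', 'Code', 'AccountNum',
          'Whatever' # etc.

# pass 1: build small CSV 1 in a single pipeline, so output1.csv is opened once
Import-Csv -Path c:\temp\input.tsv -Delimiter "`t" -Header $Header |
    Select-Object -Property 'Date', 'Code', 'Whatever' |
    Export-Csv -Path c:\temp\output1.csv -Delimiter "`t" -NoTypeInformation

# pass 2: small CSV 2, and so on for the remaining outputs
Import-Csv -Path c:\temp\input.tsv -Delimiter "`t" -Header $Header |
    Select-Object -Property 'AccountNum', 'IP', 'CustName' |
    Export-Csv -Path c:\temp\output2.csv -Delimiter "`t" -NoTypeInformation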


Edit for a StreamReader / StreamWriter approach which will hopefully run faster:

$out1 = [System.IO.File]::CreateText('d:\test\out1.csv')
$out2 = [System.IO.File]::CreateText('d:\test\out2.csv')

$in = [System.IO.File]::OpenText('d:\test\input.csv')
while ($in.Peek() -ge 0) {

    # read and split next line by tab
    $fields = $in.ReadLine() -split "`t"

    # select some fields by number, join them into a new line 
    # write to small csv1
    $f1 = $fields[0,3,5] -join "`t"
    $out1.WriteLine($f1)

    # same for csv2
    $f2 = $fields[1,2,7] -join "`t"
    $out2.WriteLine($f2)

    # ..
    # etc.

}
$out1.Close()
$out2.Close()
$in.Close()
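Since there are six output files with fixed field picks, the same loop could also be driven from a lookup of output path to field indexes instead of hard-coding a block per file. A minimal sketch, with placeholder paths and the 0-based index lists taken from the question's first two examples:

# map each output file to the 0-based indexes of the fields it should contain
$map = [ordered]@{
    'd:\test\out1.csv' = 0, 3, 22, 15          # question's csv#1: fields 1, 4, 23, 16
    'd:\test\out2.csv' = 1, 2, 4, 23, 33       # question's csv#2: fields 2, 3, 5, 24, 34
    # ... and so on for the remaining four outputs
}

# open one writer per output file up front
$writers = @{}
foreach ($path in $map.Keys) { $writers[$path] = [System.IO.File]::CreateText($path) }

$in = [System.IO.File]::OpenText('d:\test\input.csv')
while ($in.Peek() -ge 0) {
    $fields = $in.ReadLine() -split "`t"
    foreach ($path in $map.Keys) {
        # pick this file's fields by index and write them as one tab-joined line
        $writers[$path].WriteLine(($fields[$map[$path]] -join "`t"))
    }
}
$in.Close()
foreach ($w in $writers.Values) { $w.Close() }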