I have some information in two large files.
One of them (file1.txt, ~4 million lines) contains all object names (which are unique) and their types.
The other (file2.txt, ~2 million lines) contains some object names (which can be duplicated) and the values assigned to them.
So I have something like this in file1.txt:
objName1 objType1
objName2 objType2
objName3 objType3
...
And in file2.txt I have:
objName3 val3_1
objName3 val3_2
objName4 val4
...
For all objects in file2.txt, I need to output the object names, their types, and the values assigned to them in a single file, like below:
objType3 val3_1 "objName3"
objType3 val3_2 "objName3"
objType4 val4 "objName4"
...
Previously, object names in file2.txt were supposed to be unique, so I implemented a solution that reads all the data from both files into Tcl arrays, then iterates over the larger array and checks whether an object with the same name exists in the smaller array; if so, it writes the needed information to a separate file. But this runs too long (> 10 hours, and it still hasn't completed).
How can I improve my solution, or is there another way to do this?
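Since the names in file2.txt can now repeat, a plain set on an array element would keep only the last value per name. A minimal sketch of one way to keep every value, assuming lappend is acceptable here (the lines list is just the sample data from above, not the real file):

```tcl
# Sample lines copied from the file2.txt excerpt above (illustration only).
set lines {
    "objName3 val3_1"
    "objName3 val3_2"
    "objName4 val4"
}

array set objValMap {}
foreach line $lines {
    # Split at the first space: name before it, value after it.
    set sep [string first " " $line]
    set objName  [string range $line 0 [expr {$sep - 1}]]
    set objValue [string range $line [expr {$sep + 1}] end]
    # lappend keeps all values for a repeated name instead of overwriting.
    lappend objValMap($objName) $objValue
}

foreach objName [lsort [array names objValMap]] {
    puts "$objName -> $objValMap($objName)"
}
```

Each array element then holds a Tcl list of values, so the output step would need an inner foreach over that list.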
EDIT:
Actually, I don't have file1.txt; I get that data from a procedure and write it into a Tcl array. So I run a procedure to get the object types and save them to a Tcl array, then read file2.txt and save its data to another Tcl array. Then I iterate over the items in the first array, and if an object name matches an object in the second (object values) array, I write the info to the output file and erase that element from the second array. Here is a piece of the code that I'm running:
set outFileName "output.txt"
if {[catch {open $outFileName "w"} fid]} {
    puts "ERROR: Failed to open file '$outFileName', no write permission"
    exit 1
}
# get object types
set TIME_start [clock clicks -milliseconds]
array set objTypeMap [list]
# here is some proc that fills up objTypeMap
set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Object types are found. Elapsed time $TIME_taken"
# read file2.txt
set TIME_start [clock clicks -milliseconds]
set file2 [lindex $argv 5]
if {[catch {set fp [open $file2 r]} errMsg]} {
    puts "ERROR: Failed to open file '$file2' for reading"
    exit 1
}
set objValData [read $fp]
close $fp
# Tcl list containing the lines of file2.txt
set objValData [split $objValData "\n"]
# remove last empty line
set objValData [lreplace $objValData end end]
array set objValMap [list]
foreach item $objValData {
    # split each line at the first space: name before it, value after it
    set sep [string first " " $item]
    set objName [string range $item 0 [expr {$sep - 1}]]
    set objValue [string range $item [expr {$sep + 1}] end]
    set objValMap($objName) $objValue
}
# clear objValData
unset objValData
set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Object value data is read and processed. Elapsed time $TIME_taken"
# write to file
set TIME_start [clock clicks -milliseconds]
foreach {objName objType} [array get objTypeMap] {
    # stop early once every value has been written out
    if {[array size objValMap] == 0} {
        break
    }
    if {[info exists objValMap($objName)]} {
        set objValue $objValMap($objName)
        puts $fid "$objType $objValue \"$objName\""
        unset objValMap($objName)
    }
}
# anything left in objValMap has no known type
if {[array size objValMap] != 0} {
    foreach {objName objVal} [array get objValMap] {
        puts "WARNING: Can not find object $objName type, skipped..."
    }
}
close $fid
set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Output is created. Elapsed time $TIME_taken"
It seems that the last step (writing to a file) needs ~8 * 10^12 iterations, which is not realistic to complete in a reasonable time: I tried running 8 * 10^12 iterations in a for loop that just prints the iteration index, and ~850 * 10^6 iterations took ~30 minutes (so at that rate the whole loop would take roughly 4,700 hours).
So, there should be another solution.
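For reference, the extrapolation from the measured rate (850 * 10^6 iterations in 30 minutes) to 8 * 10^12 iterations can be checked with a few lines of Tcl. This is only arithmetic on the numbers quoted above; the variable names are made up for the example:

```tcl
# Numbers quoted in the post; variable names are illustrative only.
set measuredIterations 850e6   ;# iterations completed in the test run
set measuredMinutes    30.0    ;# how long that test run took
set totalIterations    8e12    ;# 4 million * 2 million

# Scale the measured time linearly up to the full iteration count.
set hours [expr {$totalIterations / $measuredIterations * $measuredMinutes / 60.0}]
puts [format "Estimated time: %.0f hours (about %.0f days)" $hours [expr {$hours / 24.0}]]
# prints: Estimated time: 4706 hours (about 196 days)
```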
EDIT:
It seems the reason was poor hashing for the file2.txt map: I tried shuffling the lines in file2.txt and got results in about 3 minutes.
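One way to look at how keys are spread over an array's hash table is Tcl's array statistics command. A minimal sketch, using a hypothetical demo array as a stand-in for objValMap (the key names here are invented for illustration, not taken from the real data):

```tcl
# Hypothetical stand-in for objValMap, filled with made-up keys.
array set demo {}
for {set i 0} {$i < 1000} {incr i} {
    set demo(objName$i) val$i
}

# Reports the entry count, bucket count, and how many buckets hold
# 0, 1, 2, ... entries, plus the average search distance.
set stats [array statistics demo]
puts $stats
```

Running the same command on the real array before and after shuffling might show whether the bucket distribution actually differs, though I haven't verified that this explains the timing.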
read and instead use while and gets to get the lines one at a time (so there is no need to split, and it doesn't consume huge memory in the few variables). – Jerry
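Following Jerry's suggestion, here is a minimal sketch of reading file2.txt line by line with while and gets and writing each match immediately, assuming objTypeMap is already filled. The proc name joinValues and its parameters are hypothetical, not part of my original code. Because every line is looked up on its own, duplicated names are handled and no second array is needed:

```tcl
# Hypothetical helper: streams "name value" lines from inChan, looks each
# name up in the caller's name -> type array, and writes matches to outChan.
# Returns the number of lines whose name has no known type.
proc joinValues {typeArrName inChan outChan} {
    upvar 1 $typeArrName typeMap
    set missing 0
    while {[gets $inChan line] >= 0} {
        # Split at the first space; skip empty or malformed lines.
        set sep [string first " " $line]
        if {$sep < 0} {
            continue
        }
        set objName  [string range $line 0 [expr {$sep - 1}]]
        set objValue [string range $line [expr {$sep + 1}] end]
        if {[info exists typeMap($objName)]} {
            puts $outChan "$typeMap($objName) $objValue \"$objName\""
        } else {
            incr missing
        }
    }
    return $missing
}

# Possible usage with the variables from the code above:
#   set fp [open $file2 r]
#   set skipped [joinValues objTypeMap $fp $fid]
#   close $fp
```

This does one hash lookup per line of file2.txt (~2 million in total) and avoids holding the whole file in memory.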