
I am trying to update a master file from a change file that has the same layout. I would like to replace or append records in the master file based on the key in the change file. Both input files contain duplicate keys. Master records whose key also appears in the change file should be replaced by the change records, and records with new keys should be appended to the master file.

Both input files have the same layout, are "|"-delimited, and are huge (25-40 GB). Could you please help me with this?

Example -

  • Master file:

Key1|AAA|BBB|CCC

Key1|AAA|BBB|DDD

Key1|XXX|YYY|ZZZ

Key2|ZZZ|YYY|123

Key2|EEE|FFF|RRR

Key3|RRR|EEE|GGG

Key3|SSS|TTT|GGG

  • Change file:

Key1|111|222|333

Key1|222|333|444

Key4|888|333|222

Key4|888|777|222

  • output file:

Key1|111|222|333

Key1|222|333|444

Key2|ZZZ|YYY|123

Key2|EEE|FFF|RRR

Key3|RRR|EEE|GGG

Key3|SSS|TTT|GGG

Key4|888|333|222

Key4|888|777|222


Are there only spacers and no line delimiters in your file? If the "key1" part is your primary key, why is there more than one "key1"? How do you determine which key1 to exchange with which key1? The change file has key1 only twice, but the master file has it three times. - veritaS
Thanks for your response. There were alignment issues, which I have corrected in my question. The matching key values have duplicates in both files. For matching keys, we need to delete all the records from the master file and keep only the records from the change file. For example, Key1 is common to both files: there are 3 records for Key1 in the master file and 2 in the change file. The output file should contain only the 2 records from the change file, and all 3 records from the master file should be dropped. - pythonlearner1985
I suggest that you turn off your computer and get a piece of paper and pencil. Describe in words what steps you need to take to solve this problem. Don't worry about python syntax. Just get a clear idea of what the solution should be. After you do that, turn your computer back on and attempt to translate your steps into Python code. - Code-Apprentice

1 Answer


So I experimented a bit, since this sounded interesting and I am currently learning Python.
Please find the code below.
It works with your samples. However, if the master file has a larger gap between consecutive keys, the order gets mixed up. I was not able to fix that.

I really had a lot of issues with the data structure having duplicate primary keys distributed over several rows.

I don't know what you are doing exactly, but I work a lot with databases and I can tell you that this kind of data structure is highly unusual. You would probably benefit a lot from restructuring your data set.

With this amount of data you would probably also benefit from storing it in a database, unless you are running deep learning algorithms over it.
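As a rough illustration of the database idea, here is a minimal sketch using Python's built-in sqlite3 module. The table names (master_rows, change_rows), the column names and the functions key_of and apply_changes are all made up for this example, and it has not been tried on files anywhere near 25-40 GB. For real data you would connect to a database file on disk instead of ":memory:" and pass in the lines read from the two files (with the trailing newline stripped).

```python
import sqlite3


def key_of(line):
    # The key is everything before the first "|".
    return line.split("|", 1)[0]


def apply_changes(conn, master_lines, change_lines):
    """Drop master rows whose key occurs in the change data, then add the change rows.

    Table and column names are hypothetical, chosen only for this sketch.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE master_rows (rec_key TEXT, line TEXT)")
    cur.execute("CREATE TABLE change_rows (rec_key TEXT, line TEXT)")
    cur.executemany("INSERT INTO master_rows VALUES (?, ?)",
                    ((key_of(l), l) for l in master_lines))
    cur.executemany("INSERT INTO change_rows VALUES (?, ?)",
                    ((key_of(l), l) for l in change_lines))
    # Indexes so the key lookups do not scan the whole table each time.
    cur.execute("CREATE INDEX idx_master_key ON master_rows (rec_key)")
    cur.execute("CREATE INDEX idx_change_key ON change_rows (rec_key)")
    # Remove every master row whose key appears in the change data ...
    cur.execute("DELETE FROM master_rows "
                "WHERE rec_key IN (SELECT rec_key FROM change_rows)")
    # ... then copy all change rows over, keeping their original order.
    cur.execute("INSERT INTO master_rows "
                "SELECT rec_key, line FROM change_rows ORDER BY rowid")
    conn.commit()
    cur.execute("SELECT line FROM master_rows ORDER BY rec_key, rowid")
    return [row[0] for row in cur.fetchall()]
```

Called as apply_changes(sqlite3.connect(":memory:"), master_lines, change_lines), this returns the merged records ordered by key. The ordering is plain text ordering, so "Key10" would sort before "Key2".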

Example: this is the case where the order gets mixed up, but it works nonetheless

Master file

Key1|AAA|BBB|CCC
Key1|AAA|BBB|DDD
Key1|XXX|YYY|ZZZ
Key2|ZZZ|YYY|123
Key2|EEE|FFF|RRR
Key3|RRR|EEE|GGG
Key3|SSS|TTT|GGG
Key7|RRR|EEE|GGG
Key7|SSS|TTT|GGG

changefile

Key1|111|222|333
Key1|222|333|444
Key1|222|333|555
Key4|888|333|222
Key4|888|777|222
Key5|888|333|222
Key5|888|777|222
Key6|888|333|222
Key6|888|777|222
Key8|888|333|222
Key8|888|777|222
Key9|888|333|222
Key9|888|777|222

Code:

import fileinput

with open('changefile.txt') as infile:
    keyindex = []
    for line in infile:
        linelist = line.strip().split("|") ## split line by |
        key = linelist[0] ## assign the key
        keyid = linelist[0][3:] ## assign keyid
        keylist = [] ## assign keylist for loop

        ## finding duplicate keys in changefile and assign them to list
        if key not in keyindex: ## we need this because multiple keys in multiple lines
            with open('changefile.txt') as infile2:
                #spawning extra loop for each new key to open and search all duplicate keys and assign them to list
                for line2 in infile2:
                    if line2.startswith(key + "|"): ## include the delimiter so Key1 does not match Key10
                        print(line2)
                        keylist.append(line2)
            ## Delete line with current key of loop from master file
            keyindex.append(key)
            print(keylist)
            for linem in fileinput.input('test.txt', inplace=True):
                if linem.startswith(key + "|"): ## match the key field only, not any occurrence in the line
                    continue
                print(linem, end='')
            ## insert the lines collected in keylist before the first line of the next key
            for linei in fileinput.input('test.txt', inplace=True):
                ## this match is case sensitive; the delimiter keeps Key2 from matching Key20
                if linei.startswith('Key' + str(int(keyid) + 1) + '|'):
                    for item in keylist:
                        print(item, end='')
                    keylist = []
                print(linei, end='')


# I had problems going to the next line at the beginning of this code. If you fix this, it would be better than opening the file anew.
##                    if last in linei and keylist:
##                        ##print('\n')
##                        for item in keylist:
##                            print(item, end='')
##                        keylist = []
##                        print('\n')

## This block may cause memory problems; you may be able to fix this with the commented-out block above.
## This block appends leftover keys from the end of the change file to the end of the master file, e.g. Key9 is in the change file, but the master file only goes up to Key8.
            with open("test.txt", "a") as myfile:
                if keylist:
                    for item in keylist:
                            myfile.write(item)
                    keylist = []
                else:
                    continue




        ## because we spawned a separate loop each time we found a new key, we can skip the duplicate lines
        else:
            ## print('>>>key '+line+'already worked at! go to next line') # if you want to skip, uncomment continue and comment this 
            continue
    ##print all keyindexes that have been changed
    print('Following keys have been changed:', keyindex)
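
For files this large, a streaming merge might avoid both the repeated rewriting of the master file and the ordering problem. The sketch below is not the code above: it swaps in a different technique, a merge of two key-sorted streams using itertools.groupby. It only works under the assumption that both files are already sorted by key in plain string order (so "Key10" comes before "Key2"), which also means that equal keys sit on consecutive lines. The names key_of, merge_sorted and merge_files are made up for illustration.

```python
from itertools import groupby


def key_of(line):
    # The key is everything before the first "|".
    return line.split("|", 1)[0]


def merge_sorted(master_lines, change_lines):
    """Yield the merged records.

    ASSUMPTION: both inputs are sorted by key in plain string order.
    For a key present in both inputs, only the change records are kept.
    """
    master_groups = groupby(master_lines, key=key_of)
    change_groups = groupby(change_lines, key=key_of)
    m = next(master_groups, None)
    c = next(change_groups, None)
    while m is not None or c is not None:
        if c is None or (m is not None and m[0] < c[0]):
            # Key exists only in the master file: keep its records.
            yield from m[1]
            m = next(master_groups, None)
        elif m is None or c[0] < m[0]:
            # Key exists only in the change file: add its records.
            yield from c[1]
            c = next(change_groups, None)
        else:
            # Same key in both: emit the change records, skip the master ones.
            yield from c[1]
            c = next(change_groups, None)
            m = next(master_groups, None)


def with_newline(lines):
    # Make sure a final line without "\n" does not run into the next record.
    return (l if l.endswith("\n") else l + "\n" for l in lines)


def merge_files(master_path, change_path, out_path):
    # Reads both inputs line by line, so memory use stays small.
    with open(master_path) as m, open(change_path) as c, \
            open(out_path, "w") as out:
        out.writelines(merge_sorted(with_newline(m), with_newline(c)))
```

Something like merge_files('test.txt', 'changefile.txt', 'merged.txt') would then write the result to a new file instead of editing the master file in place ('merged.txt' is just a placeholder name). Each input is read only once, and only one line of each file is held in memory at a time.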