0
votes

Asked in a recent interview

There is a log file containing one million integers. Each integer is 32 bits in length. Specific integer values in the log file might be repeated. You can read the log file sequentially. You can also read from and write to temp files sequentially; there is no limit to the number of files that may be open at any time. However, you may keep no more than 2000 integers in memory at any given time.

I was asked to produce a histogram showing absolute counts for each integer value that occurs in the log file and state an upper bound on order complexity for the number of times that each integer must be loaded into memory

4
I hope the job was a good one I'd have told them to stick itcusimar9

4 Answers

4
votes

That's a confusing question. Can't you just read 2,000 numbers, sort them, and then write to a temp file? Do that 500 times and then do an N-way merge. Each number would be loaded into memory twice.

It's the same kind of thing you'd do if you had to sort a terabyte-sized file on a computer that only has 2 gigabytes of RAM.

3
votes

Open 232 temporary files, one for every integer. Read sequentially through the log file once. Whenever reading integer n, write '1' to the temp file number n. Then produce the histogram by going through all the temp files. Every integer is read into memory only once, so it's O(n) algorithm.

0
votes
  1. Keep in memory 2000 int = size of the buffer
  2. No limit for files for read / write = each number count will be stored in one file.

  3. Numbers of 32 bits length = Each file is one number and the file name is the 32bit that represents the integer ( integer value can be used as well )

  4. Show histogram (means no order is needed )

pseudo code:

count = 2000
HashMap<number, number> = new 

code:readbuffer
while count != 0 
read NextNumber
if HashMap.HasKey NextNumber then HashMap[NextNumber]++
else HashMap[NextNumber]=1
count --;
end while 

code:flushbuffer
foreach Key in HashMap 
 if exists FileName Key.ToBinnary 
  FileValue += HashMap[HashKey]
 else WriteNewFile FileName=Key.toBinnay; SetValue = 1
end foreach


code: histogram
each file name is the number;
each file value is the count. 

Costs: readbuffer

Sequencial number read = 2M (M = Million )

Map.HasKey = (search for the key in 2000 record, worst case the number is not there, SUMMARY( ∑2000)x2M )

SetValue on Map same cost as above

Total: (2M)+(2Mx∑2000)x2

Costs: flushbuffer File.exists 2M

Costs: histogram 2M

Total = 6M + 4Mx∑2000

0
votes

Implement a file based B-tree.

File names are GUIDs.

File content is: number-seen count left-node-file right-node-file

After one pass, the complexity can be derived from that of a B-tree.

And your historgram is implicit in the structure.