Popular Posts

Sunday, March 20, 2011

Big File containing billions of numbers

Q: There is big file containing billions of numbers. How do we find maximum 10 numbers from those files?

Priority queue is the best data structure to answer such question. Of course loading all the data and building the heap is not such a good idea. Main challenge is how do we manage chunks of data in memory and arrive at the solution.

Divide the data into some sets of 1000 elements, make a heap of them and take 10 element from each heap one by one. Finally sort all the sets of 10 elements and take top 10 among those. But where do we store 10 element from each heap which may itself require large amount of memory as we have billions of numbers.

Reuse top 10 elements from earlier heap in subsequent elements can solve this problem. Take first block of 1000 elements and subsequent blocks of 990 elements. The last set of 990 elements or less than that and taking top 10 elements from the final heap will fetch you the answer.

Time Complexity would be O( (n/1000)*(complexity of heapsort of 1000 elements) = O( n )

http://flexaired.blogspot.com/2011/03/big-file-containing-billions-of-numbers.html


http://tech-queries.blogspot.com/2009/05/find-largest-20-elements-from-billions.html



1 comment:

  1. We can do it using quick sort.As you said Divide the data into some sets of 1000 elements and in each set get max 10 elements using quick sort and mix them in next set.Repeat the previous step till the end.The only difference is using quick sort instead of heap

    ReplyDelete