Efficiently selecting a set of random elements from a linked list,
Say I have a linked list of numbers of length
N is very large and I donâ€™t know in advance the exact value of
How can I most efficiently write a function that will return
k completely random numbers from the list?
There’s a very nice and efficient algorithm for this using a method called reservoir sampling.
Let me start by giving you its history:
Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven’t been able to find Waterman’s original document, if it exists, which may be why you’ll most often see Knuth quoted as the source of this algorithm.
McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I’m aware of) that the algorithm works.
Vitter 1985 (2) reviews Algorithm R and then presents an additional three algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each in-coming number.
In pseudocode the algorithm is:
Let R be the result array of size s Let I be an input queue > Fill the reservoir array for j in the range [1,s]: R[j]=I.pop() elements_seen=s while I is not empty: elements_seen+=1 j=random(1,elements_seen) > This is inclusive if j<=s: R[j]=I.pop() else: I.pop()
Note that I’ve specifically written the code to avoid specifying the size of the input. That’s one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in
R (that is, there is no bias). Furthermore,
R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an
Why does this work?
McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It’s pretty, but it would be a bit difficult to reconstruct it here. Therefore, I’ve generated an alternative proof which is easier to explain.
We proceed via proof by induction.
Say we want to generate a set of
s elements and that we have already seen
Let’s assume that our current
s elements have already each been chosen with probability
By the definition of the algorithm, we choose element
n+1 with probability
Each element already part of our result set has a probability
1/s of being replaced.
The probability that an element from the
n-seen result set is replaced in the
n+1-seen result set is therefore
(1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is
n+1-seen result set contains an element either if it was part of the
n-seen result set and was not replaced—this probability is
(s/n)*n/(n+1)=s/(n+1)—or if the element was chosen—with probability
The definition of the algorithm tells us that the first
s elements are automatically included as the first
n=s members of the result set. Therefore, the
n-seen result set includes each element with
s/n (=1) probability giving us the necessary base case for the induction.
- McLeod, A. Ian, and David R. Bellhouse. “A convenient algorithm for drawing a simple random sample.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184.
- Vitter, Jeffrey S. “Random sampling with a reservoir.” ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.
That's the answer Efficiently selecting a set of random elements from a linked list, Hope this helps those looking for an answer. Then we suggest to do a search for the next question and find the answer only on our site.
The answers provided above are only to be used to guide the learning process. The questions above are open-ended questions, meaning that many answers are not fixed as above. I hope this article can be useful, Thank you