0 votes

I use 538's Riddler to practice coding. I wrote a simple simulation in Python, but I had to add another nested loop to take a mean of means in order to reduce the output variance. When I tried to run it, I stopped it after 45 minutes, figuring there must be a way to improve the efficiency of the code.

For context, the problem is: a radio station plays 100 songs per day. How large must its playlist be for the probability of playing the same song twice in one day to equal 50%?

My approach is to increment the playlist size (starting at 7000) by 1 until the grand mean of the per-sample replay probabilities reaches 50%, using 1000 for both the sample size and the number of samples.

The code is:

import random

playlist = 7000          # starting playlist size
chance_of_replay = []    # one replay-frequency estimate per outer sample
sample = 1000            # both the sample size and the number of samples
mean_chance_of_replay = 0
replays = 0
temp_sum = 0

while mean_chance_of_replay > 0.5 or mean_chance_of_replay == 0.0:

    playlist += 1

    # outer loop: gather `sample` estimates of the replay probability
    for j in range(0, sample):

        # inner loop: simulate `sample` days for one estimate
        for i in range(1, sample + 1):

            songs_to_play = 100
            songs_played = []

            # play until the day ends or the first replay occurs
            while songs_to_play > 0:

                # randint's bounds are inclusive, so this picks from playlist + 1 song IDs
                song_pick = random.randint(1, playlist + 1)

                if song_pick not in songs_played:
                    songs_played.append(song_pick)
                    songs_to_play -= 1
                else:
                    replays += 1
                    break

        chance_of_replay.insert(j, (replays / sample))
        replays = 0

    for element in chance_of_replay:
        temp_sum = temp_sum + element

    mean_chance_of_replay = temp_sum / sample

print(playlist)

2 Answers

0 votes

Before looking at performance issues in your code, there is a bigger problem to fix first: the code is stuck in an infinite loop.

The list chance_of_replay is never cleared and the variable temp_sum is never reset to 0. Because of that, mean_chance_of_replay keeps growing from one playlist size to the next, the loop condition never becomes false, and the code runs forever.

After fixing those two logical errors, you should start worrying about performance optimization.
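
For instance, a minimal sketch of the fix, reusing your variable names and resetting both accumulators each time the playlist grows:

while mean_chance_of_replay > 0.5 or mean_chance_of_replay == 0.0:

    playlist += 1
    chance_of_replay = []    # clear the previous playlist size's estimates
    temp_sum = 0             # restart the running sum from zero

    # ... rest of the loop body unchanged ...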

0
votes

The chance of playing the same song twice decreases as the playlist size goes up. Have you considered that the chance of a replay at a playlist size of 7000 might already be below 50%? If that is the case, then checking any higher values will just produce smaller percentages, so you will never find your answer.
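
You can test that hypothesis analytically before simulating anything: this is the birthday problem, so the probability of at least one replay in a day of 100 plays from a playlist of n songs is 1 - (1 - 0/n)(1 - 1/n)···(1 - 99/n). A quick sketch (the helper function is mine, not part of your code):

def replay_probability(n, plays=100):
    # multiply up the probability that every pick is a new song
    p_no_replay = 1.0
    for k in range(plays):
        p_no_replay *= (n - k) / n    # k songs used so far, n - k still fresh
    return 1 - p_no_replay

print(replay_probability(7000))    # about 0.51

So at 7000 the probability is in fact still just above 50%, and it first drops below 50% at about n = 7175; a scan starting at 7000 would terminate, just slowly.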

If you want to do simulation (as opposed to the pure-mathematical approach sketched above), the main optimisation I can find is that the list membership tests and inserts are a performance killer when done many times: song_pick not in songs_played scans the whole list on every pick, and the repeated inserts add overhead on top of that. What I did instead was create a list of Boolean values recording whether each song has been played. Checking and marking a given song is then a single constant-time index operation, with no list scans or insertions at all.

Here's the code:

from random import randint

playlist_size = 1
samples = 1000
songs_per_sample = 100

while True:
    replays = 0

    for _ in range(samples):
        # songs_played[i] is True once song i has been played this day
        songs_played = [False] * playlist_size

        for _ in range(songs_per_sample):
            song_to_play_index = randint(0, playlist_size - 1)

            if songs_played[song_to_play_index]:
                replays += 1

            songs_played[song_to_play_index] = True

    # fraction of all individual plays that were replays
    replay_chance = replays / (samples * songs_per_sample)

    if replay_chance <= 0.5:
        break

    playlist_size += 1

print(playlist_size)

Running this actually gives a surprising answer, way below 7000! The reason is that this code estimates the fraction of individual plays that are replays, whereas the original question asks for the probability that a day contains at least one replay, so the two thresholds are crossed at very different playlist sizes.
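
If you want to estimate the per-day probability instead, so that the simulation matches the original problem, a minimal variation of the loop body (a hedged sketch, reusing the names above) is to break on the first replay of each simulated day and count days rather than plays:

replays = 0

for _ in range(samples):
    songs_played = [False] * playlist_size

    for _ in range(songs_per_sample):
        song_to_play_index = randint(0, playlist_size - 1)

        if songs_played[song_to_play_index]:
            replays += 1    # count the day as a replay day
            break           # one replay is enough for this day

        songs_played[song_to_play_index] = True

# fraction of simulated days containing at least one replay
replay_chance = replays / samples

With that change (and a starting playlist_size near the analytic estimate rather than 1), the estimate tracks the birthday-problem probability, which crosses 50% at about 7175 songs; with only 1000 samples per step, though, sampling noise will make the exact stopping point wander.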