How Does Multi-armed bandit algorithms - Epsilon greedy algorithm Work?

You choose between different options by trying them out and learning what works best, just like a kid picking candies from a jar without knowing which one tastes the best.

Imagine you're at a candy store with 5 jars of different candies. You don't know which jar has the best candy, so you start picking randomly, sometimes getting a sweet treat, sometimes getting something yucky. But as you go on, you begin to remember which jars gave you good candies before.

That’s how the epsilon greedy algorithm works in multi-armed bandit problems: it's like being that kid who tries new candies but also remembers their favorites.

How It Chooses

Most of the time, the kid picks from the jar they think has the best candy, just like the algorithm choosing the option with the highest reward so far. But sometimes, the kid decides to try a new jar, this is called exploration.

That "sometimes" is where epsilon comes in: it's like saying “1 out of 10 times, I'll try something new.” If epsilon is 0.1 (or 10%), then 10% of the time, the kid will pick a random jar instead of their favorite one.

This balance between exploration and exploitation helps find the best candy, or in real life, the best choice, without getting stuck on something that might not be the best forever.

Take the quiz →

Examples

  1. A kid trying different ice cream flavors to find their favorite
  2. A student switching between study methods to get the best results
  3. A shopper trying out different stores for the best deals

Ask a question

See also

Discussion

Recent activity