An Analysis of Chess Openings

Tutorial by Ben Moskowitz

Introduction

Chess is an ancient game that until recently was mostly a niche pursuit rather than part of mainstream culture. However, beginning with the pandemic, with people stuck at home and chess content growing rapidly, along with the release of the popular Netflix show The Queen's Gambit, the chess community has seen an unprecedented upsurge in interest and player base. While in-person tournaments are on pause, we can see this growth in the statistics from online chess services. Chess.com, one of the most popular chess platforms, saw a 160% increase in players over the past year, with daily active users rising from 1.3M in March 2020 to over 3M by 2021.

With the chess community growing rapidly, it is important to educate this new generation of players. Chess has been around for many years, with vast knowledge and analysis of the game produced by its champions. One area where this wealth of knowledge can be transferred to even the newest players is opening analysis.

The opening is a crucial part of the chess game, and one that can be greatly improved upon at any level through studying opening theory. The opening you play sets up the rest of your game, and without proper understanding of the opening it is very hard to progress in skill level. Additionally, studying openings can give one a better understanding of their style of play, which can help improve other aspects of their game. With all the new players beginning their chess journey, now is as good a time as any to do an analysis of chess openings!

This tutorial will be studying many of what are commonly considered the top openings. We will look at statistics of these openings, see how they change at different skill levels and time formats, and help give a better understanding of what openings might work best for you.

Setup

These are the packages we will be using in this tutorial:

Data Collection

Luckily for us, Kaggle has a dataset containing approximately 5 million games played on Lichess, another popular online chess platform. These games were collected during November 2019. Since the data can be downloaded straight from Kaggle, we can begin by taking a look at what it contains. You can find the dataset here.

As the file is 3GB, which is too large for pandas to simply read in, I will be using dask. Since some columns have mixed data types, we will set low_memory to False and specify the data types for the columns that are inferred incorrectly.

Whoa, we see this dataframe has 217 columns. Let's see what those columns are:

Okay, now it's easy to see there are many columns that are unnecessary for our analysis, mainly the columns holding the specific moves. Since the only thing we need from the moves is which opening was played, we can drop all the 'Move' columns. Additionally, we can delete Event, Date, Round, White, Black, BlackRatingDifference, UtcDateTime, and ECO, since these are irrelevant for our analysis.
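As a rough illustration of the column selection just described, here is a plain-Python sketch. The column list is a shortened, hypothetical stand-in for the real 217 columns, and the per-move column names (Move1, Move2, ...) are an assumption about how the dataset labels them.

```python
# Shortened, hypothetical stand-in for the dataset's 217 column names.
# The "MoveN" naming of the per-move columns is an assumption.
columns = [
    "Site", "Event", "Date", "Round", "White", "Black", "Result",
    "WhiteElo", "BlackElo", "WhiteRatingDifference", "BlackRatingDifference",
    "UtcDateTime", "ECO", "OpeningName", "TimeControl", "TimeIncrement",
    "Termination", "Move1", "Move2", "Move3",
]

# Columns the text names as irrelevant for the analysis.
IRRELEVANT = {
    "Event", "Date", "Round", "White", "Black",
    "BlackRatingDifference", "UtcDateTime", "ECO",
}


def columns_to_keep(all_columns):
    """Drop every per-move column and every column listed as irrelevant."""
    return [c for c in all_columns if "Move" not in c and c not in IRRELEVANT]


kept = columns_to_keep(columns)
```

With a pandas or dask dataframe `df`, selecting `df[kept]` would then leave only the columns used in the rest of the analysis.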

Now that we have the relevant columns, let's take a look at what each one represents:

  1. Site - A link to view the chess game
  2. Result - The result of the game: white won (1-0), black won (0-1), or a draw (1/2-1/2)
  3. WhiteElo/BlackElo - Elo rating of the players in the game
  4. WhiteRatingDifference - Difference in elo rating between white and black (from white's elo)
  5. OpeningName - The name of the opening played
  6. TimeControl - The time control for the game (in seconds)
  7. TimeIncrement - The time increment after each move (in seconds)
  8. Termination - Description of how the game terminated

Data Cleaning

Data cleaning is a critical step in the data science pipeline. Proper preprocessing of your data makes downstream analysis much simpler since the data is already formatted in the desired way. For the cleaning I will be handling missing data and altering some columns to format the data in a way that will make sense for analysis.

Missing Data

In any data science project it is crucial to take care of missing data, along with any rows that could skew the results because of some unwanted feature. Luckily for us, there are very few rows with missing data (in fact there is only 1), so we'll drop those completely since we have plenty of data left to work with. Additionally, the 'Termination' column shows a few games where a player cheated. We definitely don't want these games included in our analysis, so we will remove these rows too.
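A minimal sketch of this filtering step, written in plain Python over a few made-up games. The label "Rules infraction" is an assumption about how the 'Termination' column marks games ended for cheating; check the actual values in the dataset before relying on it.

```python
# Made-up example rows; only the fields needed for the illustration.
games = [
    {"Result": "1-0", "OpeningName": "Sicilian Defense", "Termination": "Normal"},
    {"Result": "0-1", "OpeningName": None, "Termination": "Normal"},
    {"Result": "1-0", "OpeningName": "French Defense", "Termination": "Rules infraction"},
    {"Result": "1/2-1/2", "OpeningName": "Italian Game", "Termination": "Time forfeit"},
]


def is_clean(game, cheat_label="Rules infraction"):
    """Keep a game only if no field is missing and it did not end by cheating.

    cheat_label is an assumed value of the Termination column.
    """
    has_missing = any(value is None or value == "" for value in game.values())
    return not has_missing and game["Termination"] != cheat_label


clean_games = [g for g in games if is_clean(g)]
```

On a dataframe, the same idea would be a `dropna()` followed by a boolean filter on the 'Termination' column.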

Large Rating Difference

If two players have a large rating difference, which here I'll define as a gap of more than 300 Elo points, then the higher rated player will almost always win. It doesn't really matter what opening the stronger player chooses, since they will most likely beat their opponent later in the game. Stronger players in these games often pick more "fun" openings, usually gambits, which would artificially skew the data to suggest that those gambits are better openings than they actually are. Therefore, we will be dropping any games with a rating difference of over 300.
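The rating-gap rule can be sketched as a small predicate. The function name and the `max_gap` parameter are illustrative choices, not names from the original notebook; making the threshold a parameter simply makes it easy to experiment with a stricter cutoff.

```python
def within_rating_gap(white_elo, black_elo, max_gap=300):
    """Return True when the two ratings differ by at most max_gap points."""
    return abs(white_elo - black_elo) <= max_gap


# A 300-point gap is kept; anything larger is dropped.
keep_equal_match = within_rating_gap(1500, 1800)
keep_mismatch = within_rating_gap(1500, 1801)
```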

Data Type Conversion

There are a few columns in the dataframe that would be easier to work with if they had different data types, namely the 'Result' column. It would be useful to have the results as a float type rather than a string, so below we will change this.

Formatting Changes

As we're doing opening analysis, we want to have sufficient data on the openings that we're analyzing. In the 'OpeningName' column we have quite specific openings, counting different lines of the same general opening as different openings. While we want the data on the specific lines played in the opening and their success rates, it is too much data for a single column. Therefore, we will split this into 2 columns: 1 with the main opening and 1 with the specific line played. If it doesn't include a specific line, a 'None' term will be the default value for the specific opening line column.
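A sketch of the split, assuming opening names follow the "Main Opening: Specific Line" pattern with a colon and a space as the separator (an assumption about the dataset's naming; verify it against a sample of the 'OpeningName' values).

```python
def split_opening(name):
    """Split an opening name into (main opening, specific line).

    Assumes the separator is ': '. When no line is given, the string
    'None' is used as the default, as described in the text.
    """
    main, separator, line = name.partition(": ")
    return main, (line if separator else "None")


with_line = split_opening("Sicilian Defense: Bowdler Attack")
without_line = split_opening("King's Pawn Game")
```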

Additionally, for the time control we want to reduce the data to just 1 column rather than 2. We will also convert it to categorical data, since chess speeds are separated into 4 categories: bullet, blitz, rapid, and classical. To see how to calculate which time controls map to which speed you can view this page. Making this column categorical will make the analysis of how the openings' strengths vary across different time controls much easier to understand.
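The mapping could look roughly like the sketch below. It follows my understanding of the Lichess rule, which estimates a game's duration as the initial time plus 40 times the increment; the cutoffs of 180, 480, and 1500 seconds, and the choice to fold ultra-fast games into "bullet" so that only the 4 categories named above remain, are assumptions to check against the page referenced in the text.

```python
def speed_category(initial_seconds, increment_seconds):
    """Map a time control to bullet, blitz, rapid, or classical.

    Assumed rule: estimated duration = initial time + 40 * increment,
    with cutoffs at 180, 480, and 1500 seconds.
    """
    estimated = initial_seconds + 40 * increment_seconds
    if estimated < 180:
        return "bullet"
    if estimated < 480:
        return "blitz"
    if estimated < 1500:
        return "rapid"
    return "classical"


# (TimeControl, TimeIncrement) pairs in seconds.
examples = [(60, 0), (180, 0), (300, 3), (600, 0), (900, 15)]
categories = [speed_category(tc, inc) for tc, inc in examples]
```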

Adding a Column Indicating Color

In chess it is known that white has a slight advantage, which leads white openings to have a higher average win percentage than black openings overall. However, since you should have an opening repertoire for both white and black, it is important to analyze each color's openings separately, so that the list of best openings is not skewed toward white. To accomplish this it would be useful to have an additional column indicating which color each opening belongs to. However, as we see above, there are 350 unique openings, and there is no way to assign a color besides manually going through the openings one by one. Because I don't want to go through the tedious task of labeling them, in the analysis stage I will carefully indicate how I deal with this problem and what I do to minimize the amount of bias I introduce. Keep this issue in the back of your mind until then.

Exploratory Analysis

Distribution of Data

It is important to check how the data in each column is distributed to better understand where your data lies and what further analysis you can perform. Below we will check the distribution on a few of our columns, starting with the ratings distribution.

The first 2 plots show us that the distribution of ratings is nearly identical for white and black, which is definitely good for analysis. Additionally, it is quite a smooth, roughly normal distribution centered at a rating of about 1650. There does seem to be a longer tail on the right side, which makes sense: low rated players can only do so badly, while near the top level some of the best players can achieve much higher ratings than everyone else.

The 3rd plot shows us the rating difference between white and black. It seems from this plot that the vast majority of our games have a very small rating difference, with the players tending to be within +/- 20 points. Knowing this, we can go back to our data cleaning and restrict the games we consider to those where the ratings are within 25 points of one another. We know there will still be plenty of games to analyze, and those games should be a better indicator of an opening's success since the players are more evenly matched.

Next we'll look at the distribution of the categorical data, namely the game speed and win percentages as each color.

We see from the first pie chart that almost half the games we are analyzing are blitz games. This may seem like a problem at first, but we need to remember that we have nearly 5 million games, so even the 13.7% of rapid games, which might seem like a low percentage, equates to approximately 700,000 games, which is plenty of data. We will keep this in mind though during later analysis.

The second pie chart tells us the win percentages of each color. Here we see white tends to win a bit more than black, which is what we'd expect, and that there are very few draws. The reason for this low percentage of draws relates to the game speeds. Faster games such as bullet or blitz tend to have more decisive results, while in slower speeds such as rapid and classical draws are much more common (we have no classical data, since classical games usually come from over-the-board tournament chess and our data comes from online chess).

"Best" Openings

The time has come. Now that we have an idea of what our data looks like, let's try to figure out what the best overall openings in the dataset are. A logical way to do this would be to look at which openings have the highest win percentage, and those should be the best openings, right? Let's go ahead and make a new dataframe with all of the openings and their win percentages.
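A plain-Python sketch of this aggregation is shown here on a handful of made-up games. It assumes each result has already been converted to a float score (1.0 win, 0.5 draw, 0.0 loss) and treats the "win percentage" as the average score per opening; the original notebook may define it differently.

```python
from collections import defaultdict


def win_percentages(games):
    """Return {opening: (win percentage, number of games)}.

    games is an iterable of (opening name, float score) pairs. Treating
    the average score as the win percentage is an assumption.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for opening, score in games:
        totals[opening][0] += score
        totals[opening][1] += 1
    return {
        opening: (100.0 * total / count, count)
        for opening, (total, count) in totals.items()
    }


# Made-up sample: a rarely played opening can easily show a 100% win rate.
sample = [
    ("Amar Gambit", 1.0), ("Amar Gambit", 1.0),
    ("Sicilian Defense", 1.0), ("Sicilian Defense", 0.0),
    ("Sicilian Defense", 0.5), ("Sicilian Defense", 0.0),
]
table = win_percentages(sample)
```

With pandas, a `groupby` on the opening column followed by `mean` and `count` would give the same table.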

With such a large dataset we're bound to have some novelty openings that are only played a few times and are not really considered sound openings. For example, you can see that we have the "Amar Gambit" above, which has an astounding 100% win rate. However, this opening was only played twice in the entire dataset of almost 5 million games, indicating that it's not a "real" opening.

Because openings with low game counts have very skewed win percentages that can distort our analysis, we will drop any opening that accounts for less than 0.1% of our data (5,000 games). If these openings were high quality, surely they would be played more often. This idea of dropping openings with low game counts will come back quite shortly, so stay tuned.
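The frequency cutoff can be sketched as follows, using a small made-up sample rather than the real 5 million games. The function name and the `min_share` parameter are illustrative.

```python
from collections import Counter


def frequent_openings(opening_names, min_share=0.001):
    """Keep openings that make up at least min_share of all games.

    min_share=0.001 corresponds to the 0.1% cutoff described in the text.
    """
    counts = Counter(opening_names)
    threshold = min_share * len(opening_names)
    return {name: n for name, n in counts.items() if n >= threshold}


# Made-up sample of 5,000 games: the cutoff is 5 games here.
names = ["Sicilian Defense"] * 3000 + ["French Defense"] * 1999 + ["Amar Gambit"]
kept_openings = frequent_openings(names)
```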

Now it's time to graph this and see what the "best" openings are according to win percentage. We will only be plotting the top 25 openings, as otherwise the graph would be too cluttered.

The results of the graph are probably not what one would expect the best openings to be. This is mainly due to 2 reasons, both of which were hinted at earlier. Can you spot why?

This is a critical point in our analysis where we need to reflect on what this plot is telling us. As I said above, this plot is not representative of what chess players usually consider the "best" openings, and there are 2 major reasons for that. Let's go through them below:

  1. White vs. Black openings: As we saw in the pie chart on results by color, black had a slightly lower overall win percentage than white. What this means is that openings that are classified as "black openings" will tend to have lower win percentages than many of the white openings simply because white has a slight advantage. However, we need an opening repertoire for both white and black, and therefore our analysis should also be looking at the top black openings. If you know some openings, you can see that the vast majority of the openings in the graph above are for white.

    To account for this issue, we come back to the problem that I mentioned at the start, namely labeling openings as black or white. My solution here will be to do a similar analysis to the one above and manually select the top 5 black openings, on which we will then do further analysis. While not perfect, that approach should yield meaningful results without the trouble of manually labeling all of the openings.

  2. Frequency: Probably the larger issue with this method of finding the best opening is that it focuses purely on win percentage. Some openings have smaller sample sizes (which is why we removed ones with fewer than 5,000 games), and if such an opening happens to be played well in its small sample it will have an artificially large win percentage. These would tend to be openings with a large variance, of which gambits (where you sacrifice material for positional advantage) are the prime candidates. As our results show, many of the top 25 openings by this measure are in fact gambits, supporting this hypothesis.

    Now, to check this hypothesis, we will graph the frequency of these top 25 openings below:

Aha! So it seems our hypothesis was correct. Many of what we initially considered our top 25 openings are played with very low frequency relative to other openings in our dataset. Thinking about it a little more, the frequency of an opening may be as important as its win percentage, or even more so. This is because if an opening were bad, people would not play it as frequently (to be clear, this is an assumption, but I think I can fairly assume that people do not want to play bad chess), so we can treat a rarely played opening as a bad one.

Taking this new idea into account, maybe we can define a new measure of how good an opening is as WinPerc*Freq, since this takes both ideas into account. One thing to note is that the dominant term in this product is the opening's frequency, because frequency varies far more than win percentage does. For example, an opening that is played 30,000 times with a win percentage of just 33% will be ranked about the same as an opening played 10,000 times with a 100% win percentage. To account for this I will take the log of the number of games played to reduce its effect.
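To make the effect of the log concrete, here is a small sketch comparing the two versions of the score on the example from the paragraph above. The natural log is used here as an assumption, since the text does not specify a base, and the function names are illustrative.

```python
import math


def raw_score(win_fraction, games):
    """WinPerc * Freq with the raw game count."""
    return win_fraction * games


def log_score(win_fraction, games):
    """WinPerc * log(Freq); the natural log is an assumed choice of base."""
    return win_fraction * math.log(games)


# The example from the text: 33% over 30,000 games vs 100% over 10,000 games.
raw_popular = raw_score(0.33, 30000)
raw_strong = raw_score(1.0, 10000)
log_popular = log_score(0.33, 30000)
log_strong = log_score(1.0, 10000)
```

With the raw count the two openings score almost identically, while with the log the opening that wins every game scores well over twice as high, so win percentage regains most of its weight.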

Best Openings By WinPerc*Freq

Hmm, again we are getting results that don't align well with what chess experts would expect. While this is an analysis of all players and not just expert play, the results still don't feel right.

Seeing these results makes us wonder what other data we have that can answer our question. What if we just look at frequency? Well, that probably doesn't make much sense since people can be blindly playing an opening often without realizing that a different one can have a higher win percentage, and that is of course what we are looking for from the opening. The other data we have also probably wouldn't help answer this question. So what can we do?

Reevaluation of the Question

We have come to an important point in any data science project. That is, what do you do when you get stuck trying to answer the question you were asking? The first thing you should do is reevaluate the question itself. Is this something that can be answered with our data, or even with any amount of data?

Our question of "what is the best chess opening by rating," while it seems like a simple question to answer with data, is actually quite complex for a multitude of reasons. For example, in lower rated games often the opening is nearly irrelevant to the result of the game since the outcome is more heavily decided by midgame and endgame play. Additionally, while an opening may have a high general win percentage that doesn't necessarily mean that changing to play that opening will make you have that same win percentage (this is because openings tend to have different flavors, creating different piece structures that lend themselves better to some players more than others). Maybe we cannot answer this question with the data we possess. So what question should we be looking at?

Updated Question

A different question we can look at with our data is the general success rate of popular openings. In other words, are the most popular openings performing well, and how does this change at different rating levels and time controls? This is a good question to ask because we will be able to get clear results. With this new question in mind, let's take a look at our data in this light.

Here we'll look at the top 20 most popular openings:

Here, since we're just looking at popularity, we actually have a good mix of white and black openings, alleviating the issue of only analyzing white openings since they tend to have higher win percentages.

The openings in this list are definitely highly popular openings. Many of them are "defined" after only a move or 2, which makes sense since these positions will be reached often. Additionally, these openings seem to have high win percentages overall, with the black openings nearing a win percentage of 50%, while white openings reach almost 54%. Let's now look at how these change (or stay similar) in different time formats.

Openings Popularity in Different Time Formats

This is a crucial point in our analysis, where we finally get to see the trends in popular openings at each time control with their success rate. The first thing to point out about this is that interestingly, the success rate of each opening stays pretty much the same across the time controls, with no opening differing by more than 0.3%. We also see that some of the overall most popular openings stay consistently in the top openings in all the time controls, such as the Sicilian, French, and Queen's Pawn.

One trend that makes a lot of sense is that some of the more foundationally solid openings, which are played at the highest level and can be quite complex, are not so popular in bullet, where there is no time to think of clever tactics in difficult positions, and are much more popular in longer time formats. Openings that fit this trend are the King's Pawn, Ruy Lopez, and Italian. These, as you can see, have some of the highest success scores in the blitz and rapid time formats, while they're not even in the top 10 most popular for bullet. This would suggest that you should aim to play these openings when you will have more time to think of clever ideas, and that they may be less reliable in a very quick game.

Some of the more obscure openings such as the Modern, Zukertort, and Hungarian are played with relatively high frequency (and success) in bullet games, but then disappear as we get to the longer time formats. This may be because in a quick game, since these openings are more obscure, your opponent may not be as well prepared and will need to spend valuable time figuring out the position rather than just playing known patterns, which can be a major advantage in bullet games. However, since in blitz and rapid there is more time to think of how to counter these, and since they're not as foundationally solid, they begin to get played less.

Overall, it seems that the popular openings tend to have high win percentages, which makes sense, and that the "stronger" openings tend to get played more in longer time formats while more obscure openings are played more in the quick games.

Opening Popularity Across Different Ratings

There's lots to analyze about this plot, so let's take it step by step:

First of all, we can notice that the top 10 openings at the lowest ratings are quite different from the top openings at the higher Elo ratings. Even where an opening appears in both top 10 lists, the two ends of the rating range have no overlap between their top 5 openings. This makes sense: some openings may be "better," but they require much more theory and can lead to complex games that wouldn't suit beginners.

Next we can observe the rise and fall of particular openings, since these indicate which openings are popular at a certain rank and might be good to learn if you're within that level. One of the most obvious falls is the King's Pawn Game, which starts as the most popular opening and stays that way through much of the lower ranks. However, once the Sicilian takes over, the King's Pawn Game quickly falls, until 500 Elo points later it's not even in the top 10 anymore. A reasonable explanation for this is that the opening follows strong opening principles, so it is strong and easy to learn. However, once you understand the game more thoroughly there are openings with much more theory that expand upon this, so stronger players tend to play those openings more often. The opening that made the biggest rise was the Sicilian, which, once it overtook the King's Pawn Game as the most popular, stayed that way until the end. A major reason I believe this happened is that the Sicilian is considered one of the best responses to e4, and is therefore a great option when playing black. Additionally, the basic concepts of the Sicilian are not too hard to learn (definitely more complex than others such as the King's Pawn Game, but attainable for mid-level players). It also goes into really deep theory and lines that work very well even at top level play, which is why it stays at the top (until the final rating bracket, but that has a very low sample size).

I'd say the major takeaway from this is that popular openings vary quite a bit depending on what rating you are at, and that when thinking about expanding your opening repertoire you should mainly consider learning the popular openings at your rating level. Other openings may seem better because stronger players use them, but the game is not just about memorizing openings and seeing who can play them better; you actually need to finish the game. The positions that arise from those openings may be too complex for you to play well, which makes the opening a poor option for you even if it is considered good.
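A ranking of the most popular openings per rating bracket, like the one discussed above, could be computed along these lines. The bracket width of 200 points and the use of the two players' average rating to place a game in a bracket are assumptions made for this sketch, not details taken from the original notebook.

```python
from collections import Counter, defaultdict


def top_openings_by_bracket(games, bracket_size=200, top_n=10):
    """Return {bracket lower bound: most played openings in that bracket}.

    games is an iterable of (opening, white Elo, black Elo) tuples.
    bracket_size and the mean-rating bucketing are assumed choices.
    """
    by_bracket = defaultdict(Counter)
    for opening, white_elo, black_elo in games:
        mean_elo = (white_elo + black_elo) / 2
        bracket = int(mean_elo // bracket_size) * bracket_size
        by_bracket[bracket][opening] += 1
    return {
        bracket: [name for name, _ in counts.most_common(top_n)]
        for bracket, counts in sorted(by_bracket.items())
    }


# Made-up games: one low-rated bracket and one higher-rated bracket.
sample_games = [
    ("King's Pawn Game", 900, 920),
    ("King's Pawn Game", 850, 870),
    ("Sicilian Defense", 880, 900),
    ("Sicilian Defense", 1800, 1820),
    ("Sicilian Defense", 1850, 1900),
    ("King's Pawn Game", 1900, 1880),
]
ranking = top_openings_by_bracket(sample_games)
```

Looking up your own bracket in the result gives a shortlist of openings that players at your level actually choose.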

In Depth Exploration of an Opening

Here we will take the most frequently played opening overall, which we can see is the Sicilian, and look at some of its lines. Since each opening has many lines, it is important to analyze these as well, to see which lines may suit you best and which have the best overall statistics. Luckily, we can run the same analysis as before on the opening lines, since we're trying to understand the same concepts, just at the level of specific lines rather than the broad overview.

These results are quite interesting. There are many things to say about it, but there is one specific aspect I would like to highlight.

Because the Sicilian is quite a complex opening, white may sometimes prefer a simple move that avoids walking into all of black's preparation. As a chess player myself, when I face people playing openings with lots of theory behind them, I often try to get out of theory, since I assume they are better prepared. This is the case with the Bowdler Attack here. It is a white setup that stays out of the mainline Sicilians, and in both blitz and rapid we can see that many people play this way. However, looking at the win percentage we see that it's under 50%, and it's a white opening! It seems foolish to go blindly into this attack against the Sicilian when there are plenty of other lines with a much higher success rate. This tells us that blindly following frequency is the wrong approach to analyzing openings, and it gives evidence for why an analysis such as this project can be useful.

Another important thing to note about these graphs is that while the top 3 most frequently played openings don't change much across the time controls, the bottom half change a lot. The Sicilian can be a very dynamic opening with many different ideas, and with each time control comes different tactics for the best play. This causes different lines of the Sicilian to be chosen to best fit whichever time control is being played.

Conclusion + Further Work

In this project we began by asking what at first glance seems like a simple question: "what are the best openings in chess?" By going through some exploratory analysis we realized that chess is quite a complex game and that we can't simply determine the "best" opening with the data we have, and potentially not even with more. However, by rephrasing our question we were able to discover trends in opening popularity, and how well these popular openings perform at different time controls and rating levels. From this we were able to understand more about certain openings, which can help guide your decisions about which openings to try out for yourself!

To expand upon this tutorial, you can do even further analysis. For one, you can take an opening you are learning and run it through an analysis similar to what we did for the Sicilian Defense, or see how it performs at your Elo. Additionally, you could gather more data, such as what color each opening is defined as and what type of opening it is (closed/open, aggressive/passive, tactical/positional), and from there build a machine learning tool that suggests good openings for you based on what type of opening fits your play style and what rating you currently are. You can also expand this to include classical games and computer games, which can add an interesting additional layer. Plenty more analysis is possible, so be creative!

Additional Resources