Some Useful Python Stuff
A few quick Python tips that can help save you a lot of time. If you’re an experienced coder, this probably won’t be of much use, but when I was starting out I definitely wish I had learned these things earlier.
Note: I’ll be doing all this in Jupyter Notebooks.
Filtering a dataframe
Say you want your dataframe to only include players on a certain team, outside of a certain team, or above/below a certain threshold. Here’s how you can do those things.
First, load in your data (I’m using some basic information on La Liga players from Wyscout) and use df.head() to see the top of the dataframe.
import pandas as pd
df = pd.read_csv("LaLigaData.csv")
df.head()
This is what I get after that:
With that, I can see the different columns and the type of data that’s in each of them.
Let’s start by filtering for a specific team. Here, you can see there are currently 283 rows in the dataframe.
This line creates a new dataframe called dfbarca that only includes the rows from the original dataframe, df, where the value in the “Team” column is “Barcelona.”
dfbarca = df[(df['Team']=="Barcelona")]
Upon viewing that new dataframe, we see we have only Barcelona players.
What if we wanted all players who do not have Barcelona as their team? This is possible with just a slight adjustment.
dfnonbarca = df[(df['Team']!="Barcelona")]
That exclamation point makes it a “not equal to”, so the resulting dataframe has all 269 non-Barça players from the original.
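A quick sanity check is that the two filters split the original rows between them. Here is a minimal sketch of that on a tiny made-up dataframe (the player names and teams are placeholders, not the Wyscout data):

```python
import pandas as pd

# Toy stand-in for the La Liga data; all values here are made up
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Barcelona", "Sevilla", "Barcelona", "Valencia"],
})

dfbarca = df[df["Team"] == "Barcelona"]
dfnonbarca = df[df["Team"] != "Barcelona"]

# Every row lands in exactly one of the two filtered dataframes
print(len(df), len(dfbarca), len(dfnonbarca))
```

If the two filtered counts don't add up to the original, the column probably has missing values or stray whitespace in the team names.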
Of course, you can also filter with numbers. Let’s use the “Age” column for that. This code gets you a new dataframe with all players 23 and younger:
dfyoung = df[(df['Age']<=23)]
Here are some other options:
# 23 or less
dfyoung = df[(df['Age']<=23)]
# 30 or greater
dfold = df[(df['Age']>=30)]
# less than 23
dfyoung2 = df[(df['Age']<23)]
# greater than 30
dfold2 = df[(df['Age']>30)]
# exactly 27
df27yearolds = df[(df['Age']==27)]
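These filters can also be combined. The sketch below uses a small made-up dataframe (names, teams, and ages are placeholders) and assumes you want either both conditions to hold or at least one of them. In pandas, each condition needs its own parentheses, and you use & and | instead of the words and/or.

```python
import pandas as pd

# Toy data; all values here are made up for illustration
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Barcelona", "Sevilla", "Barcelona", "Valencia"],
    "Age": [21, 23, 27, 31],
})

# Both conditions must be true: 23 or younger AND on Barcelona
young_barca = df[(df["Age"] <= 23) & (df["Team"] == "Barcelona")]

# Either condition can be true: younger than 23 OR older than 30
young_or_old = df[(df["Age"] < 23) | (df["Age"] > 30)]
```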
Creating a new column in a dataframe
Next up, what if you wanted to add a column to your dataframe? To give every row the same value in this new column, you can do something like this:
df["League"] = "La Liga"
That adds a new column named “League” at the end of the dataframe, and every single player has “La Liga” in that column.
But you can also create a new column where the value for each player depends on what they have in other columns. We have goals and xG here, so why not make a quick over/under performance column?
df["xG +/-"] = df["Goals"] - df["xG"]
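To see what that subtraction does row by row, here is a small sketch with made-up numbers (not the real Wyscout values). The calculation runs on the whole column at once, so there is no need for a loop.

```python
import pandas as pd

# Toy data; goals and xG values are made up
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Goals": [10, 3, 6],
    "xG": [7.5, 5.25, 6.0],
})

# Constant column: every row gets the same value
df["League"] = "La Liga"

# Computed column: positive means scoring more than expected
df["xG +/-"] = df["Goals"] - df["xG"]
```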
Sorting a dataframe by a column
That new column seems like something we might want to sort in order to see the biggest over/under performers. Here’s how to do something like that:
df.sort_values(by=['xG +/-'], inplace=True, ascending=True)
In this case, I have ascending = True, which sorts the players from lowest to highest. This means the greatest underperformers will come up at the top.
To flip that around and get those outperforming their xG on top, you just change it to ascending = False.
df.sort_values(by=['xG +/-'], inplace=True, ascending=False)
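If you would rather keep the original order around, you can leave out inplace=True and store the sorted result in a new variable instead. A minimal sketch with made-up values (the variable names sorted_df and top_two are just for illustration):

```python
import pandas as pd

# Toy data; values are made up
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "xG +/-": [2.5, -2.25, 0.0],
})

# Without inplace=True, sort_values returns a new sorted dataframe
sorted_df = df.sort_values(by="xG +/-", ascending=False)

# head(2) then gives the two biggest overperformers
top_two = sorted_df.head(2)
```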
Dropping a column in a dataframe
Say you want to drop a column for some reason. Maybe we don’t need the column with contract expiration dates. To get rid of it, you can just use this:
df = df.drop(["Contract expires"],axis=1)
That takes the dataframe from this:
Make calculations from a dataframe column
Now, what if you wanted to use that dataframe to find the total xG of all the players from each individual team? This is something I used to waste a lot of time on splitting apart manually and summing in Excel.
To get each team and their total xG, you can group by the different names in the “Team” column and sum the xG. (Just note that not all players are included here; the cutoff was somewhere around 1000 minutes so I could download the whole sheet.)
teamxG = df.groupby(['Team']).xG.sum()
Granted, this example is even more imperfect because of how Wyscout display the teams — some players have transferred during the season (that’s why Torino is there) and it still has some players as playing for their B team. But, you get the general idea.
You can also replace sum with mean, median, min, max, count (which would just count the number of players on each team), and more.
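If you want several of those calculations at once, one option is agg with a list of function names. This is a sketch on made-up numbers, and the name summary is just for illustration:

```python
import pandas as pd

# Toy data; values are made up
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Team": ["Barcelona", "Barcelona", "Sevilla"],
    "xG": [7.5, 5.25, 6.0],
})

# Single calculation, as in the line above
teamxG = df.groupby("Team")["xG"].sum()

# Several calculations at once: one row per team, one column per function
summary = df.groupby("Team")["xG"].agg(["sum", "mean", "count"])
```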
Merging two dataframes
For this section, I’m going to bring in two new basic dataframes. The first one looks like this:
The second one looks like this:
Now, to put these two together and have each player, their age, and their goals scored all in one dataframe, you can do this:
merged = pd.merge(GoalsScored,Ages, on = ["Player"],how='inner')
That merges the data in the two dataframes based on the values of each of their “Player” columns.
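One thing worth knowing is what how='inner' does when a player is only in one of the two dataframes. Here is a sketch with two small made-up dataframes (same names as above, placeholder values) comparing it with how='outer':

```python
import pandas as pd

# Toy dataframes; Player C has no age and Player D has no goals
GoalsScored = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Goals": [10, 3, 6],
})
Ages = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player D"],
    "Age": [24, 31, 19],
})

# inner: keeps only players that appear in both dataframes
merged = pd.merge(GoalsScored, Ages, on=["Player"], how="inner")

# outer: keeps everyone, filling the missing values with NaN
merged_all = pd.merge(GoalsScored, Ages, on=["Player"], how="outer")
```

So if rows seem to disappear after a merge, mismatched player names between the two files are a likely reason.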
Concatenating two dataframes
Also in the vein of combining sets of data, you can concatenate dataframes to essentially add one to the bottom of the other. Maybe you have separate data for different teams, or separate event data files for different matches. Here, I’ll be combining three dataframes that have some goal scoring stats for different teams.
To put them all together:
ThreeTeams = pd.concat([BarcaGoals, RealMadridGoals, AtleticoMadridGoals], axis=0)
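By default, each dataframe keeps its own row numbers when stacked, so the index repeats. If that gets in the way, ignore_index=True renumbers the rows. A sketch with small made-up dataframes (same names as above, placeholder values):

```python
import pandas as pd

# Toy dataframes; values are made up
BarcaGoals = pd.DataFrame({"Player": ["Player A", "Player B"], "Goals": [10, 3]})
RealMadridGoals = pd.DataFrame({"Player": ["Player C", "Player D"], "Goals": [6, 8]})
AtleticoMadridGoals = pd.DataFrame({"Player": ["Player E", "Player F"], "Goals": [5, 2]})

# Default: rows are stacked and the index repeats 0, 1, 0, 1, 0, 1
ThreeTeams = pd.concat([BarcaGoals, RealMadridGoals, AtleticoMadridGoals], axis=0)

# ignore_index=True: rows are renumbered 0 through 5
ThreeTeamsClean = pd.concat(
    [BarcaGoals, RealMadridGoals, AtleticoMadridGoals], axis=0, ignore_index=True
)
```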
Saving a dataframe as a csv
You probably already know how to do this, but I’m including it specifically for saving data with accents. It took me way too long to realize what was going wrong when I was saving CSVs and player names were coming up in Excel or Tableau with a bunch of weird characters.
By setting the csv to save with encoding = “utf-8-sig”, you avoid this issue.
df.to_csv("Data.csv", encoding = "utf-8-sig")
All the accents remain intact.
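What "utf-8-sig" does is write a short marker (a byte order mark) at the start of the file, which is what lets programs like Excel recognize it as UTF-8. Here is a sketch that saves a small made-up dataframe to a temporary folder and reads it back. The names are invented, and index=False is my addition to leave the row numbers out of the file.

```python
import os
import tempfile

import pandas as pd

# Toy data with accented, made-up names
df = pd.DataFrame({"Player": ["Iñigo Núñez", "José Peña"], "Goals": [1, 8]})

# Save into a temporary folder so nothing in your project is overwritten
path = os.path.join(tempfile.mkdtemp(), "Data.csv")
df.to_csv(path, encoding="utf-8-sig", index=False)

# The file starts with the three-byte UTF-8 byte order mark
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading it back with the same encoding gives the original names
df_back = pd.read_csv(path, encoding="utf-8-sig")
```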
Turning a dataframe column into a list
To take a column of data and turn it into a list, you can use a simple line like this:
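One common way to write it is with pandas' tolist(). This is a sketch on made-up data, and it assumes the list is called playerlist, the name used in the widget section further down:

```python
import pandas as pd

# Toy data; names are placeholders
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Team": ["Barcelona", "Sevilla", "Valencia"],
})

# Turn the "Player" column into a plain Python list
playerlist = df["Player"].tolist()
print(playerlist)
```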
That takes all the names from the “Player” column of the dataframe, and gets you this:
Creating and using a widget
Piggybacking off of that, a very cool way to use a list (that I myself only started using recently thanks to @Soumyaj15209314) is to provide options for a widget.
There are many different types of widgets, all very useful for certain applications, but I’ll be using a combobox. It is like a dropdown menu, but you also have the ability to start typing in it, and it will show you the options that contain the text you’ve typed.
This is how to make a combobox widget where the options are the player names in the playerlist we just made:
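import ipywidgets as widgets
playerlistpick = widgets.Combobox(
    # value=playerlist[0],  # assumed example default; uncomment to use it
    placeholder='Enter a name',
    options=playerlist)  # the list of player names made in the previous section
playerlistpick  # displays the widget in the notebook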
If you uncomment the value = part, that will be the default value for the widget. So if you had a combobox or dropdown with each of the big five leagues, maybe you would set it to start at La Liga (by doing value = “La Liga”) if that’s your favorite league, so then you wouldn’t even have to select it each time.
I can’t actually get the combobox working in a screenshot, but it’s cool stuff. You can then use a line like this:
playername = playerlistpick.value
to turn the value of the widget (the player selected in this case) into a variable. That variable can then be used for something like creating a dataframe that only has that selected player.
playerdf = df[(df['Player']==playername)]
Creating a colormap
Now onto some visualization stuff. In my opinion, one of the coolest parts of making a plot is applying a colormap. There are many good ones in the matplotlib library, but it can be very beneficial to make your own if you want your colors just right, if you want certain parts of your viz to blend into others, etc.
You can use LinearSegmentedColormap to do just that. In this bit of code, the custom = line creates the custom colormap. Within those brackets where I have black, gray, orange, and red, you can have as many colors as you like (they can be in hex code form too). That last line is just to quickly view the colormap so you can see what this created.
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import LinearSegmentedColormap
custom = LinearSegmentedColormap.from_list('custom cmap', ["black", "gray", "orange", "red"])
plt.colorbar(cm.ScalarMappable(cmap=custom), ax=plt.gca())  # newer matplotlib versions need an explicit ax
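Under the hood, a colormap is just a function that turns a number between 0 and 1 into a color. This sketch builds the same black, gray, orange, red colormap from hex codes and looks up a few points along it (the variable names start_color, end_color, and sampled are only for illustration):

```python
from matplotlib.colors import LinearSegmentedColormap

# Same four colors as above, written as hex codes
custom = LinearSegmentedColormap.from_list(
    "custom cmap", ["#000000", "#808080", "#ffa500", "#ff0000"]
)

# Calling the colormap with a number from 0 to 1 returns an RGBA tuple
start_color = custom(0.0)  # the first color in the list
end_color = custom(1.0)    # the last color in the list

# A few evenly spaced points along the colormap
sampled = [custom(x) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

To use it in a plot, you can pass cmap=custom to something like plt.scatter along with c= set to the values you want colored.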
Saving a figure
The last thing I’ll include here is how to save a figure you’ve created with matplotlib. A blurry visual can be unattractive and hard to read, so it’s definitely worth saving your images in high quality.
I’ve created a simple scatter plot to demonstrate this, but the main focus here is that second line.
plt.scatter(df.Goals, df.xG, s=50, c=None, marker="o", color = "green")
plt.savefig("Graph.png", dpi = 300, bbox_inches='tight')
Here you have the output graph in the notebook:
And here is the png that was saved:
Obviously, there isn’t a lot going on here, but it’s crisp. The higher you set your dpi, the sharper the image will be. I’ve found anywhere from 300–500 tends to be good, especially since setting it too high can make the file too large for places like Twitter.
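To get a feel for how big the saved image will be, you can multiply the figure size in inches by the dpi. This is a rough sketch with made-up points and an assumed 6 by 4 inch figure. It saves to an in-memory buffer rather than a file, and leaves out bbox_inches='tight' because that trims the image and changes the final size.

```python
import io

import matplotlib.pyplot as plt

# Assumed figure size of 6 x 4 inches; the plotted points are made up
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter([1, 2, 3], [0.8, 2.1, 2.9], s=50, marker="o", color="green")

dpi = 300

# Pixel dimensions are inches multiplied by dots per inch
width_px = int(fig.get_figwidth() * dpi)
height_px = int(fig.get_figheight() * dpi)
print(width_px, height_px)

# Save to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=dpi)
```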
That’s all for me. Hopefully you found some of this useful, or maybe you found some inspiration to create something like this yourself.