Building a Player Identification Tool

15 min read · Jun 15, 2023

A walkthrough of the methodology used to create a tool for data scouting — including output examples and discussion.

*This project was made using Opta data from Football Reference and the Opta Analyst website. Data was collected from Football Reference using the worldfootballR package, created by JaseZiv.

Intro/Background

What makes a player a good fit for a potential new club? Even when limiting ourselves to the on-pitch stuff, there are an infinite number of details and factors that combine to make us sure of only one thing: that we will never be 100% sure.

With this in mind, the goal from a club perspective is to put yourself in the best possible position — to identify the targets who are most likely to succeed. As you have probably heard someone say by now, data can be an incredibly valuable tool for “casting the net” — cutting your pool of options down from thousands to a much more realistic number that can be subject to further forms of scouting and analysis.

For some, this process entails applying a set of filters using raw stats or combining stats to create performance indicators of some type. While there is no right or wrong answer here, one potential weakness of these methods is their one-size-fits-all nature. If you use them to create a shortlist of dribbling wingers, the output will be the same for Manchester City as it is for Union Berlin. My qualm here isn’t necessarily with price ranges and feasibility. You will see that this “flaw” can still be present in what I created, and it is a minor thing that can be easily accounted for with post-search filters or the subjective removal of impossible targets. My issue is more that the method is blind to the context it wants these players to perform in.

Of course, you could argue that this can be accounted for later with your other forms of analysis, but can we do more with the data? Can we push it to identify a better, more tailor-made initial shortlist? Even without super-advanced metrics or complex modeling, I thought it was worth a shot.

Thus, I thought back to the opening question of this piece. Knowing that the output will never be 100% perfect, I tried to think of the key factors in broad, overarching terms.

I realized I wanted to at least attempt to create a tool that combines what I believe to be the most important, logical aspects:

Player profile: The role(s) a player is accustomed to performing in and the related responsibilities they take on for their side.
Team style: The traits and tendencies of the side in which the player is performing.
Team quality: The quality and pedigree of the side in which the player is performing.
League quality: The overall level of competition and pedigree at which the player is performing.

By this logic, if you identify a need, the most sensible option would be a player who has proven capable of performing the desired key tasks and responsibilities in a similar system at a similar level of play. Seems to make sense, right?

Let’s imagine a top European side who dominate possession. They want to sign an active ball-winning midfielder to help increase their control over matches. First things first, the player’s data should reflect, to some extent, that when their team is defending, they have a proven ability to put in a shift to close down opponents and win the ball (player profile). From there, in terms of predicting their scaling to the new target team, it is not required, but optimal if the player is accustomed to performing these tasks in a team that also dominates the ball and plays in a similar manner (team style). This should help identify players who not only have the right data, but tend to get that data from the desired situations — maybe counterpressing and sniffing out transitions as opposed to sitting deep in a compact block. Then, it is, once again, optimal if the player is experienced at a top European level against high-quality opponents (team and league quality).

Compiling/Quantifying Different Factors

Hopefully that introduction has shown you all that the thinking behind this project is based in common sense and simple considerations on player-team fits. The data work that went into creating the tool reflects this, as the math is all relatively straightforward and easily explainable.

Player profile

For the player perspective, I set out to create different “role scores” to reflect their profile. The players were first assigned to one of five overarching outfield position groups: center backs, full backs/wing backs, center midfielders, attacking midfielders/wingers, and forwards. Each player went to the position group at which they made the most starts, with a cutoff of at least 8 starts at that “primary” position group to be included in the dataset.
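As a rough illustration of that assignment step (not the exact code behind the tool), the logic could look something like the sketch below. The position codes and their mapping to groups are assumptions for illustration; only the five groups and the 8-start cutoff come from the description above.

```python
from collections import Counter

MIN_STARTS = 8  # cutoff described in the text

# Hypothetical mapping from listed positions to the five outfield groups.
POSITION_GROUPS = {
    "CB": "Center Back",
    "LB": "Full Back/Wing Back", "RB": "Full Back/Wing Back", "WB": "Full Back/Wing Back",
    "DM": "Center Midfielder", "CM": "Center Midfielder",
    "AM": "Attacking Midfielder/Winger", "LM": "Attacking Midfielder/Winger",
    "LW": "Attacking Midfielder/Winger", "RM": "Attacking Midfielder/Winger",
    "RW": "Attacking Midfielder/Winger",
    "FW": "Forward",
}


def primary_group(starts_by_position):
    """Return the group with the most starts, or None if it misses the cutoff."""
    totals = Counter()
    for position, starts in starts_by_position.items():
        totals[POSITION_GROUPS[position]] += starts
    if not totals:
        return None
    group, starts = totals.most_common(1)[0]
    return group if starts >= MIN_STARTS else None


# Made-up start counts for two imaginary players.
print(primary_group({"DM": 14, "CM": 9, "CB": 3}))
print(primary_group({"LW": 5, "FW": 6}))
```

The first player lands in the Center Midfielder group with 23 starts there, while the second is left out because no single group reaches 8 starts.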

Within each position group, multiple roles were created to reflect different archetypes and styles that teams tend to look for. I got inspiration for several roles and role names from Mike Imburgio’s DAVIES model. Here they are:

Center backs: Box Defender, Proactive Defender, Progressor
Full backs/wing backs: Defensive, Progressive, Attacking
Center midfielders: Deep-Lying, Ball-Winner, Dribbler, Final Third
Attacking midfielders/wingers: Wide Creator, Central Creator, Goal Threat, Dribbler, Defensive
Forwards: Poacher, Outlet, Deep-Lying, Dribbler, Creator, Hold-Up/Target

The role scores are made up of combinations of different metrics that attempt to account for some level of opportunity and tendency, not just raw stats. For instance, the Deep-Lying role score for center midfielders involves the player’s rate of deep touches (defensive + middle third) per team touches, progressive passing distance per team touches, and turnovers per player touch (inverse).

So, the players who score highly tend to get on the ball in deeper areas frequently, pass ahead to advanced teammates when their team is in possession, and tend to be safer on the ball.

In the Box Defender score for center backs, blocked shots and clearances are adjusted per opposition attacking third touch. In the different Dribbler role scores, progressive carries are adjusted per team touches in the middle + attacking thirds (the criteria for progressive carries mean they only take place in the attacking half). You get the gist.

The metrics that go into a given role score are given different weights that I altered through experimentation. Going back to the Deep-Lying center midfielder score, the deep touch rate is worth 45% of the final score, the progressive passing distance rate is worth 40%, and the turnover rate is worth 15%.
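To make the weighting concrete, here is a minimal sketch of how a Deep-Lying score could be assembled from those three components. The 45/40/15 weights come from the text; the min-max scaling of each metric, the field names, and the sample numbers are assumptions, since the exact normalization is not spelled out.

```python
DEEP_LYING_WEIGHTS = {
    "deep_touch_rate": 0.45,
    "prog_pass_rate": 0.40,
    "turnover_rate": 0.15,
}
INVERSE_METRICS = {"turnover_rate"}  # lower is better, so the scale is flipped


def min_max_scale(values):
    """Scale a list of numbers to 0-100 (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [50.0] * len(values)
    return [100 * (v - lo) / (hi - lo) for v in values]


def opportunity_adjusted_rates(p):
    """Turn raw counts into the per-opportunity rates described above."""
    return {
        "deep_touch_rate": (p["def_third_touches"] + p["mid_third_touches"]) / p["team_touches"],
        "prog_pass_rate": p["prog_pass_distance"] / p["team_touches"],
        "turnover_rate": p["turnovers"] / p["touches"],
    }


def deep_lying_scores(players):
    names = list(players)
    rates = {n: opportunity_adjusted_rates(players[n]) for n in names}
    scores = {n: 0.0 for n in names}
    for metric, weight in DEEP_LYING_WEIGHTS.items():
        scaled = min_max_scale([rates[n][metric] for n in names])
        for n, value in zip(names, scaled):
            if metric in INVERSE_METRICS:
                value = 100 - value
            scores[n] += weight * value
    return scores


# Made-up season totals for three imaginary midfielders.
players = {
    "A": {"def_third_touches": 400, "mid_third_touches": 600, "team_touches": 20000,
          "prog_pass_distance": 6000, "turnovers": 30, "touches": 1500},
    "B": {"def_third_touches": 200, "mid_third_touches": 500, "team_touches": 18000,
          "prog_pass_distance": 3000, "turnovers": 60, "touches": 1200},
    "C": {"def_third_touches": 300, "mid_third_touches": 550, "team_touches": 19000,
          "prog_pass_distance": 4500, "turnovers": 40, "touches": 1300},
}
scores = deep_lying_scores(players)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 1))
```

Player A leads every component and tops the list, B trails in every component, and C lands in between.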

Below you can see the output of the top 10 center midfielders in the dataset just based on the role score alone (scaled from 0 to 100 here):

Team style

The initial process for obtaining team style similarity was largely the same as creating the player role scores. The first step was creating style ratings that reflected different areas of a team’s play and their tactical choices.

While there are a lot of potential micro-level things you can look into, I tried to limit the ratings to key areas that would relate to all players within a system. The six ratings I settled on, along with the metrics and weights behind them, are as follows:

High Press: attacking third tackles per opposition defensive third touches (40%), opposition pass completion rate (inverse, 30%), opposition defensive third touches per opposition middle third touch (30%)
Deep Block: opposition attacking third touches per opposition attacking penalty area touch (65%), opposition npxG per shot (inverse, 35%)
Narrow Defense: opposition switches per opposition pass attempt (70%), opposition crosses into penalty area per opposition attacking third touch (30%)
Deep Circulation: long pass attempts per total pass attempts (inverse, 40%), progressive passing distance per total passing distance (inverse, 30%), pass completion rate (30%)
High Retention: attacking third touches per attacking penalty area touch (100%)
Wide Attack: switches per pass attempt (70%), crosses into penalty area per attacking third touch (30%)
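One way to keep six ratings like these manageable is to store them as a small declarative spec and compute every rating with the same function. The sketch below transcribes the weights and inverse flags from the list above; the metric key names are hypothetical, and it assumes each metric has already been scaled to 0-100 across all teams.

```python
# (metric key, weight in percent, inverse?) for each style rating.
STYLE_RATINGS = {
    "High Press": [
        ("att_third_tackles_per_opp_def_third_touch", 40, False),
        ("opp_pass_completion_rate", 30, True),
        ("opp_def_third_touches_per_opp_mid_third_touch", 30, False),
    ],
    "Deep Block": [
        ("opp_att_third_touches_per_opp_box_touch", 65, False),
        ("opp_npxg_per_shot", 35, True),
    ],
    "Narrow Defense": [
        ("opp_switches_per_opp_pass", 70, False),
        ("opp_box_crosses_per_opp_att_third_touch", 30, False),
    ],
    "Deep Circulation": [
        ("long_pass_share", 40, True),
        ("prog_pass_distance_share", 30, True),
        ("pass_completion_rate", 30, False),
    ],
    "High Retention": [
        ("att_third_touches_per_box_touch", 100, False),
    ],
    "Wide Attack": [
        ("switches_per_pass", 70, False),
        ("box_crosses_per_att_third_touch", 30, False),
    ],
}

# Sanity check: every rating's weights should add up to 100%.
for rating_name, parts in STYLE_RATINGS.items():
    assert sum(weight for _, weight, _ in parts) == 100, rating_name


def style_rating(rating_name, scaled_metrics):
    """Weighted sum of 0-100 metrics, flipping the ones marked inverse."""
    total = 0.0
    for metric, weight, inverse in STYLE_RATINGS[rating_name]:
        value = scaled_metrics[metric]
        if inverse:
            value = 100 - value
        total += weight / 100 * value
    return total


# Made-up scaled values for one imaginary team.
example_team = {
    "opp_att_third_touches_per_opp_box_touch": 80.0,
    "opp_npxg_per_shot": 20.0,
}
print(round(style_rating("Deep Block", example_team), 1))
```

In this made-up case, opponents get plenty of attacking third touches per box touch and low-quality shots, so the Deep Block rating comes out high.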

For some output, below are the teams in the dataset with the 10 highest Deep Block ratings (scaled 0 to 100):

For the team aspect of the project, though, I needed the additional step of creating actual similarity scores using the ratings. For this, I weighted the different style ratings (25% High Press, 25% Deep Circulation, 15% Deep Block, 15% High Retention, 10% Narrow Defense, 10% Wide Attack) and then utilized a technique called Euclidean distance.
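For readers who have not used it before, a weighted Euclidean distance can be sketched in a few lines. The style weights are the ones listed above; applying them inside the squared differences and min-max scaling the result to 0-100 are assumptions about the details, and the team ratings are made up.

```python
import math

STYLE_WEIGHTS = {
    "High Press": 0.25,
    "Deep Circulation": 0.25,
    "Deep Block": 0.15,
    "High Retention": 0.15,
    "Narrow Defense": 0.10,
    "Wide Attack": 0.10,
}


def style_distance(a, b):
    """Weighted Euclidean distance between two teams' style ratings."""
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in STYLE_WEIGHTS.items()))


def style_similarity(target, teams):
    """Map distances to 0-100, where 100 is the closest team to the target."""
    distances = {
        name: style_distance(teams[target], ratings)
        for name, ratings in teams.items()
        if name != target
    }
    lo, hi = min(distances.values()), max(distances.values())
    if hi == lo:
        return {name: 100.0 for name in distances}
    return {name: 100 * (hi - d) / (hi - lo) for name, d in distances.items()}


def ratings(hp, dc, db, hr, nd, wa):
    return {"High Press": hp, "Deep Circulation": dc, "Deep Block": db,
            "High Retention": hr, "Narrow Defense": nd, "Wide Attack": wa}


# Made-up 0-100 style ratings for four imaginary teams.
teams = {
    "Team A": ratings(90, 85, 10, 80, 40, 50),
    "Team B": ratings(85, 80, 15, 75, 45, 55),
    "Team C": ratings(30, 20, 90, 25, 60, 40),
    "Team D": ratings(60, 55, 40, 50, 50, 45),
}
similarity = style_similarity("Team A", teams)
for name in sorted(similarity, key=similarity.get, reverse=True):
    print(name, round(similarity[name], 1))
```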

Essentially, this allows you to get an output where the team with the lowest distance to the target team in the included variables is considered the most similar. Below are the 10 teams with the greatest similarity to Manchester City (scaled from 0 to 100):

Team and league quality

Finally, for quantifying the level of teams and leagues, I thankfully did not have to do much work. I simply took advantage of Opta’s publicly available Power Rankings, which, in their words, “utilise a hierarchical Elo-based rating system to measure the strength of each team.” Admittedly, this part would be quite complex to work out from scratch, but in this case, we don’t have to.

When I gathered this data the day after the Champions League final, these were the top 10 teams in the world:

And this is how the 11 leagues in the dataset rank in average rating:

For the ultimate player ID “equation”, what matters is the difference in team quality and league quality compared to the target team.
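A simple way to turn that difference into a similarity score is sketched below. Using the absolute gap and scaling it against the largest gap in the pool are assumptions, and the ratings are invented rather than real Opta values.

```python
def quality_similarity(target_rating, ratings):
    """Return 0-100 similarity: 100 means the same rating as the target."""
    gaps = {name: abs(rating - target_rating) for name, rating in ratings.items()}
    worst = max(gaps.values()) or 1.0  # avoid dividing by zero if all gaps are 0
    return {name: 100 * (1 - gap / worst) for name, gap in gaps.items()}


# Made-up ratings for three imaginary clubs, compared with a target rated 90.
club_ratings = {"Club X": 88.0, "Club Y": 70.0, "Club Z": 95.0}
team_quality_sim = quality_similarity(90.0, club_ratings)
for name, value in team_quality_sim.items():
    print(name, round(value, 1))
```

The same function could be reused for league quality by passing average league ratings instead of club ratings.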

Putting It All Together

Now for putting everything into tangible, useful output. You enter a team, a position group, and one or two roles, and get back a shortlist.

To combine our four factors, as I’ve done throughout this process, I relied on my views of the game and objectives for the project along with trial and error. I didn’t set out to find a scientifically perfect combination or ratio — I don’t think there is one.

After testing and revising, the formula that I think has found the best results is:

Player role score(s): 50%
Team style similarity: 20%
Team quality similarity: 20%
League quality similarity: 10%
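Put into code, the combination might look like the following sketch. The 50/20/20/10 weights are the ones above; the function and field names are hypothetical, every input is assumed to be on a 0 to 100 scale already, and the two candidates are invented.

```python
FACTOR_WEIGHTS = {
    "role": 0.50,
    "team_style": 0.20,
    "team_quality": 0.20,
    "league_quality": 0.10,
}


def blended_role_score(role_scores, role_weights):
    """Blend one or two role scores, e.g. {"Dribbler": 0.6, "Goal Threat": 0.4}."""
    return sum(role_scores[role] * weight for role, weight in role_weights.items())


def fit_score(role, team_style, team_quality, league_quality):
    parts = {
        "role": role,
        "team_style": team_style,
        "team_quality": team_quality,
        "league_quality": league_quality,
    }
    return sum(FACTOR_WEIGHTS[key] * parts[key] for key in FACTOR_WEIGHTS)


def shortlist(candidates, role_weights, top_n=10):
    """Rank candidates by overall fit for the target team."""
    def score(candidate):
        role = blended_role_score(candidate["roles"], role_weights)
        return fit_score(role, candidate["team_style"],
                         candidate["team_quality"], candidate["league_quality"])
    ranked = sorted(candidates, key=score, reverse=True)
    return [candidate["name"] for candidate in ranked[:top_n]]


# Made-up candidates; similarity values are relative to some target team.
candidates = [
    {"name": "Player A", "roles": {"Dribbler": 90, "Goal Threat": 70},
     "team_style": 40, "team_quality": 50, "league_quality": 60},
    {"name": "Player B", "roles": {"Dribbler": 75, "Goal Threat": 80},
     "team_style": 90, "team_quality": 85, "league_quality": 90},
]
print(shortlist(candidates, {"Dribbler": 0.6, "Goal Threat": 0.4}))
```

Player A has the better blended role score (82 against 77), but Player B ranks first because their current team and league are a much closer match to the target, which is exactly the kind of reordering the contextual factors are meant to produce.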

In my first attempts, in addition to the weights being different, I was using the similarity in relative team quality (each team’s rating rescaled within its own league) as opposed to raw quality. This meant that the ratings within each individual league would range from 0 to 100.

With this, I was mainly trying to address the scaling of a player’s production. A mid-table-or-lower Premier League side signing a player from a team who essentially run their domestic league (think a Red Bull Salzburg) can’t expect their output to be the same, and I wanted that risk to be reflected.

However, after observing some limitations of the initial setup, I decided to switch to raw team quality for a few reasons:

The team style aspect already accounts for things like territorial dominance and build-up style that I believe are responsible for a lot of potential scaling up/down of output.
Once you have scaling accounted for, a side like Celtic certainly have an overall level of quality that matches a mid-table-or-lower Premier League team (and vice versa), and we see moves that reflect this (Edouard, Ajer, etc.).
At the very top level, things were getting thrown off because of just how good Manchester City are, basically. The difference between City and third-rated team Liverpool was 6.2, which was the same as the difference between Liverpool and 28th-ranked team Ajax. This means you would get things like Manchester United, who are supposed to be a top 10 side in the world, not getting players from other top teams in their shortlist (cough, cough, Victor Osimhen) because City made them look almost mediocre in relative quality.
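The distortion is easy to reproduce with a toy example. The sketch below rescales ratings within each league, which is my reading of the "relative quality" setup described earlier; the leagues, teams, and numbers are all invented.

```python
def relative_quality(ratings_by_league):
    """Rescale each team's rating to 0-100 within its own league."""
    relative = {}
    for league, teams in ratings_by_league.items():
        lo, hi = min(teams.values()), max(teams.values())
        span = (hi - lo) or 1.0  # guard against a league with identical ratings
        for team, rating in teams.items():
            relative[team] = 100 * (rating - lo) / span
    return relative


# Made-up raw ratings: League 1 has one runaway leader.
raw = {
    "League 1": {"Dominant": 100.0, "Strong": 93.0, "Mid": 85.0, "Weak": 80.0},
    "League 2": {"Champion": 86.0, "Rest": 70.0},
}
relative = relative_quality(raw)
for team, value in relative.items():
    print(team, round(value, 1))
```

Here "Strong" has a raw rating of 93 but drops to 65 in relative terms because of the runaway leader, while "Champion", with a raw rating of 86, comes out at 100. A search for "Strong" would then treat teams like "Champion" as a poor quality match, even though they are much closer in raw terms.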

So, there you have it, the framework for my tool all laid out. Still mainly reliant on the individual’s performance, but combined with tactical and level-of-play context to personalize the process for each unique club. Now it’s time to judge whether or not any of this was worth it.

Examples and Use Cases

Remember in the intro when I talked about creating winger shortlists for Manchester City and Union Berlin? You’d best believe I was going to come back to that. Let’s try this thing out.

*Quickly before we jump in, I did also add some useful filters to allow for an even more personalized search. Using my favorite Python widgets, I can filter the shortlist to only include a certain age range, players with a certain specific primary position (attacking midfielder/winger includes AM, LM, LW, RM, RW), players who primarily played in a certain formation, only specific leagues, and more. For each case, I’ll note when I’m using these.
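Setting the widgets aside, the filtering logic itself can be sketched as a plain function. The field names and sample players below are hypothetical; in a notebook, the arguments would be supplied by whatever widgets you prefer.

```python
def apply_filters(players, age_range=None, positions=None, leagues=None, formations=None):
    """Keep only the players that pass every filter that was supplied."""
    kept = []
    for player in players:
        if age_range and not (age_range[0] <= player["age"] <= age_range[1]):
            continue
        if positions and player["position"] not in positions:
            continue
        if leagues and player["league"] not in leagues:
            continue
        if formations and player["formation"] not in formations:
            continue
        kept.append(player)
    return kept


# Made-up shortlist rows.
players = [
    {"name": "Player A", "age": 22, "position": "LW", "league": "League 1", "formation": "4-2-3-1"},
    {"name": "Player B", "age": 29, "position": "RW", "league": "League 1", "formation": "4-3-3"},
    {"name": "Player C", "age": 24, "position": "AM", "league": "League 2", "formation": "4-2-3-1"},
]
young_wingers = apply_filters(players, age_range=(16, 25), positions={"LM", "LW", "RM", "RW"})
print([player["name"] for player in young_wingers])
```

Filters left as None are simply skipped, so the same function covers both the unfiltered and the heavily filtered searches in the examples that follow.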

Manchester City, Attacking Midfielder/Winger, Dribbler + Goal Threat (60–40), Primary position LM, LW, RM, or RW:

And then…

Union Berlin, Attacking Midfielder/Winger, Dribbler + Goal Threat (60–40), Primary position LM, LW, RM, or RW:

Two Champions League teams. Exact same search parameters, no differentiation in desired roles. No cutting down of the player pool. No filtering for team’s formation.

And… each team gets results that I believe are better-catered to their further scouting interests — with only 4 out of 20 players overlapping.

As you can see, unfeasible potential transfers are not removed — even in the Champions League, I don’t think Union Berlin can get Rafael Leão — but overall, the results make sense for the target team. This thing might be alright after all, so let’s see what else it can do.

I hear Manchester United are in the market for a center forward. They’ve got some good creative pieces and could use a real focal point up top. Let’s say they want someone who’s most importantly active in the box and generates shots, but it helps if they can also play off the back line a bit — maybe make some dangerous channel runs. Oh, they also want options 25 or younger, and who have played in a one-striker setup as Erik ten Hag has mainly utilized a 4–2–3–1.

Manchester United, Forward, Poacher + Outlet (70–30), 25 or younger, Primary formation with one striker

Osimhen, Openda, Højlund, Kolo Muani, Thuram, Balogun, Brobbey, Ferguson — quite a few interesting options that they’ve actually been linked with already or the fans have been calling for.

Speaking of links, I hear an awful lot of talk about Arsenal and center midfielders. Let’s find some young options who could potentially slot in at the base of their midfield.

Arsenal, Center Midfielder, Deep-Lying (full weight), 25 or younger, Primary position DM

It may be too late for Enzo Fernández, and Orkun Kökçü recently joined Benfica as his replacement, but there are still plenty of good options. Everyone wants Caicedo and Rice, of course, but the likes of Mats Wieffer (recently broke into the Netherlands starting lineup), Bennacer, and Zubimendi may be worth taking a look at.

For one last Premier League example (I’ve leaned on them for ideas, as, believe it or not, most discussions of transfer rumors involve Premier League clubs), let’s head to Bournemouth. With Jefferson Lerma on his way to Crystal Palace, Bournemouth will need reinforcement in midfield.

Bournemouth, Center Midfielder, Deep-Lying + Ball Winner (50–50)

Interestingly, Azor Matusiwa of Reims is the one player here who was present in that previous Arsenal shortlist. I don’t know if Bournemouth have the juice for this, but with signings like Illia Zabarnyi, Dango Ouattara, and Marcos Senesi, it may be worth a shot. Guido Rodríguez, Walace, and Wilfred Ndidi are some options with experience, while Samu Costa, Lucien Agoume, and Nicolò Rovella (up for a loan from Juventus? I have no idea) are younger names to look into.

Enough of this Premier League nonsense. It’s time we travelled across the pond to the “worst” league in our dataset — Major League Soccer. The Vancouver Whitecaps have been an around-mid-table side for the past few seasons, while having some of the lowest take-on and carry numbers in the division. They already have their 10/central creator profile (which is so common in MLS) taken care of in Ryan Gauld, so we’ll give them some options for a dynamic winger who can beat defenders and get into scoring positions.

Vancouver Whitecaps, Attacking Midfielder/Winger, Dribbler + Goal Threat (60–40), 25 or younger, Primary position LM, LW, RM, or RW

Our dataset is lacking some of the big areas where MLS clubs like to search for young talent (Argentina, Uruguay especially), but we have a decent list. As a teenager playing major minutes for Santos, Ângelo Borges will certainly have eyes on him. We can tell Feyenoord did some good scouting, as they brought in Igor Paixão from relegation-battling Coritiba and had him play a big role in their Eredivisie title win. There are also some good in-league options that could be targeted through trade.

To wrap things up, let’s finally turn the spotlight to defenders. To show off the usefulness of our extra filters, I wanted to include an example where we identify wing back options for a side who play with a back three or back five system. I went with Hoffenheim, as they had a bit of a rough season. Is this one of their big needs? I’m not entirely sure, but Angeliño’s loan will be up and Pavel Kadeřábek is 31, so maybe. Anyway…

Hoffenheim, Full Back/Wing Back, Attacking (full weight), 28 or younger, Primary position WB

Among other things, we can see that Tottenham’s style of play last season was similar to a lot of mid-table sides. For the players, though, we see a good return of energetic profiles who are used to getting up and down the flank and contributing in attack. Julian Gressel even overcomes the league quality disparity from MLS!

Lastly, Diego Simeone’s Atlético Madrid seem to have a penchant for experienced center backs. Admittedly, they usually sign these players younger and then keep them in the side into their 30s, but if they were looking to bring in a battle-tested warrior with a bit of quality on the ball, who would be a good fit?

Atlético Madrid, Center Back, Box Defender + Progressor (70–30), 28 or older

Honestly, I can imagine quite a few of these defenders celebrating a big clearance at the Metropolitano. Also, with Atleti or elsewhere, the idea of a Harry Maguire revival within the compactness of La Liga sounds enticing.

Weaknesses and Shortcomings

I’d be lying if I said I wasn’t personally happy with the tool’s outputs. Nevertheless, my creation is still far from perfect and has its limitations.

One of the big ones is that the whole process is based on what has happened (data from the previous season). It can’t really account for a team’s project or vision of what they are trying to become. If I want to find options for Tottenham, it will present players who were likely to fit Antonio Conte’s vision, not Ange Postecoglou’s.

In a similar vein, while I have, of course, tried to build the project around roles rather than just positions, it still views players based on where they were playing. There is a tradeoff involved here, as when you think about minimizing the risk of a signing, you would say that ideally the player shouldn’t have to undergo a position change. However, this can also be limiting when it comes to presenting options with good potential.

For instance, when teams now go to look for a full back/wing back, it can be valuable to look for players who can scale back from winger (e.g. Marc Cucurella from Getafe to Brighton, Ismail Jakobs from Köln to Monaco). With the tool I built, one could maybe try to work around this by making an additional search for wingers who score highly in the Defensive and maybe Wide Creator roles, but there is not a perfect answer.

I have thought about how the incorporation of event data (to assess the physical zones where a player actually performs their actions on the pitch) and tracking data (to get a sense of physical, movement, and positioning profile) could help here.

Overall, though, I am pretty happy with how this turned out. I believe the tool does a good job of creating useful output while having a better understanding of a team’s specific needs, and it is not difficult to interpret or explain.

For a dataset that includes double or triple the number of leagues and teams at varying levels, and maybe more or better-quality data points (pressures, off-ball runs, etc.), a tool like this could be even more helpful for guiding a search.