Blog: ML Petitioners Tend To Be US-based Students

10 Jun 2018

Nature recently announced the launch of Nature Machine Intelligence, a new journal focusing on machine learning. Journal launches are usually fairly uneventful, but this one caught the attention of open-access advocates within the machine learning community, leading to the release of an open statement decrying Nature for not adopting a fully zero-cost open-access approach (such as that used by the Journal of Machine Learning Research).

Politics of open access aside, despite claims that the signatories included some 3,000 ‘researchers’ and ‘tech giants’, I had heard suggestions that the majority of them were students and therefore unlikely to impact the journal through a boycott. Having had a quick glance at the list itself, I realised this could make a good little web-scraping project.

Scraping Websites With BeautifulSoup
Looking at the website where the statement is hosted, we can see that basic information for each signature is available but not necessarily accessible. Therefore step one is to extract that information and convert it into a more usable format. For this we’re going to use BeautifulSoup, a classic Python package for converting messy HTML and XML documents (i.e. most websites) into more easily digestible objects.

For ease of replication we’ll be using a fixed version of the web page, saved on the 4th of June 2018 when 3,245 signatures were present. We then pass that file, in this case ‘signatures.html’, to BeautifulSoup. It’s also possible to feed the result of a requests.get call for a specific URL into BeautifulSoup, should we want to create an on-demand analysis pipeline, e.g. to track changes over time or raise alerts when certain threshold numbers are reached.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('signatures.html'), 'html.parser')
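
As a minimal sketch of the live-fetching alternative mentioned above (the URL below is only a placeholder, not the statement’s real address):

import requests

# placeholder: substitute the statement page's actual URL here
url = 'https://example.com/signatures'
response = requests.get(url)
# parse the fetched page in exactly the same way as the saved copy
soup = BeautifulSoup(response.text, 'html.parser')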

Taking a look at the parsed object, we can see that each signatory has an associated field for their professional position, with that information displayed and stored inside a ‘field-content’ div nested within a ‘views-field views-field-field-professional-position’ div. Once found, we can collect these all up into a list, which we’ll suitably name ‘positions’.

positions = soup.find_all("div", class_="views-field views-field-field-professional-position")
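
As a quick sanity check that the divs have been picked up as described, we can peek at the text of the first entry (the strip=True argument simply trims surrounding whitespace):

# print the text of the first professional-position div, stripped of surrounding whitespace
print(positions[0].get_text(strip=True))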

A Quick Look At The Data
With all the job positions collected, it’s time to start mining for some insights. One good place to start is to convert our list of positions into a pandas Series, call the value_counts method, and view the top ten values.

import pandas as pd

print(pd.Series([x.get_text() for x in positions]).value_counts().head(10))

This lets us peek at the most common specific job titles in descending order of frequency, producing the following table:

Position Count
Blank 423
PhD Student 152
Student 124
PhD student 114
Research Scientist 101
Researcher 92
Data Scientist 85
Assistant Professor 73
Professor 71
Software Engineer 55

Clearly, the top hits from this rough approach support students, and particularly PhD students, as a significant cohort within the total list. However, we can also see that this approach overlooks some basic overlaps. For example, ‘PhD Student’ and ‘PhD student’ are treated as two categories simply because of capitalisation, and ‘Student’ and ‘PhD student’ should probably both be classed as ‘student’. Similar groupings should probably also be applied to ‘Assistant Professor’ with ‘Professor’, and ‘Research Scientist’ with ‘Researcher’.
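
Lowercasing each title before counting is a quick, if only partial, fix for the capitalisation problem; a minimal sketch:

# lowercase each title so that 'PhD Student' and 'PhD student' collapse into one category
print(pd.Series([x.get_text().lower() for x in positions]).value_counts().head(10))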

Using Custom Categories
A better approach is to build custom groups where clear overlaps exist, though this is slower and requires sensible group definitions. In this case I’ve created nine groups: blank (no information given), students (position contains ‘phd’, ‘student’, ‘candidate’, ‘msc’ or ‘masters’), professors (contains ‘professor’ or ‘lecturer’), postdocs (contains ‘postdoc’), data scientists (contains ‘data scientist’), engineers (contains ‘engineer’), software (contains ‘software’), developers (contains ‘developer’), and researchers (contains ‘researcher’).

Each group is counted using a variant of the following list comprehension, which loops through all positions and checks whether each one contains a keyword of interest (with both lowercased), then sums up how many signatures match.

professor = sum([('professor' in x.get_text().lower()) or ('lecturer' in x.get_text().lower()) for x in positions])
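
Rather than writing one comprehension per group, the same idea can be wrapped up in a dictionary. This is only a sketch of that pattern; category_keywords and category_counts are names chosen here for illustration rather than anything from the original code:

# map each keyword-based category to the terms that define it
category_keywords = {
    'Students': ['phd', 'student', 'candidate', 'msc', 'masters'],
    'Professors': ['professor', 'lecturer'],
    'PostDocs': ['postdoc'],
    'Data Scientists': ['data scientist'],
    'Engineers': ['engineer'],
    'Software': ['software'],
    'Developers': ['developer'],
    'Researchers': ['researcher'],
}
# count a signature towards a category if its position contains any of that category's keywords
category_counts = {name: sum(any(k in x.get_text().lower() for k in keywords) for x in positions)
                   for name, keywords in category_keywords.items()}
# the blank group needs a slightly different check: an empty position string
category_counts['Blank'] = sum(x.get_text().strip() == '' for x in positions)
print(category_counts)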

Running this for all the custom categories produces the following table, in which we can see that a notable chunk of the signatures fall under the ‘students’ category (840 of 3,245), but not a majority. Similarly, only 563 specifically self-declare as researchers.

Category Count
Blank 423
Students 840
Professors 396
PostDocs 110
Data Scientists 144
Engineers 294
Software 114
Developers 28
Researchers 563

Identifying Keywords
One final approach is to chop up each job position, e.g. ‘phd student’ would become ‘phd’ and ‘student’, and then count up the instances of each keyword. In doing so we gain a nice middle ground between custom categories and rigid job-title matches. First, though, we need a quick clean-up step to remove any odd characters from the positions data. For this we can use the sub function from the re module to replace a regex pattern matching special characters, r'[^\w\s]', with '', i.e. nothing.

To unpick that regex pattern a little (because they’re pretty terrifying to look at): '' indicates a string, the r before the opening quote makes it a raw string (backslashes aren’t converted to special characters, such as \n to a newline), [] defines a set of characters, \w matches word characters (letters, digits and underscores), \s matches whitespace characters (space, tab, etc.), and ^ at the start of [] negates the character set. The pattern therefore matches everything that isn’t a word or whitespace character (%@£$!? and so on), allowing us to strip it out for a more effective analysis.
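
As a tiny, made-up example of that substitution in action:

import re
# punctuation is removed while letters and spaces survive
print(re.sub(r'[^\w\s]', '', 'Ph.D. Student!'))  # prints 'PhD Student'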

import re
# strip punctuation, lowercase and split each position into words, then flatten into one list (dropping empty strings)
keywords = [i for j in [re.sub(r'[^\w\s]','', x.get_text().lower().strip(' ')).split(' ') for x in positions] for i in j if i != '']
print(pd.Series(keywords).value_counts().head(10))

Running that snippet gives us:

Keyword Count
student 681
phd 510
scientist 359
professor 359
research 309
researcher 247
engineer 246
data 218
learning 192
machine 181

Naturally this approach is still imperfect, particularly as we aren’t accounting for spelling mistakes, foreign languages or synonyms. But it’s clear that we don’t need a full investigation to determine that a large proportion of the signatories are students, particularly PhD students, and that fewer are more senior researchers. Interestingly, a notable portion also appear to be engineers and software developers, which is reflected in the associated institutions (more on this below). Ultimately it’s unclear how much impact a boycott by these signatories will have on the new journal: you don’t need to be a professor to decide where papers get published, but it certainly helps, and many of these individuals don’t appear to be in that position.

Alternative Lines of Investigation
Beyond positions and into more blue-sky data mining, a similar approach can be applied to the ‘country’ and ‘institution’ fields to see whether any values are particularly over-represented. For example, by country the United States of America accounts for by far the largest share of signatures (1,168 of 3,245).
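
A minimal sketch of the same pipeline for the country field; note that the class name below is my guess, extrapolated from the pattern of the position field, so it’s worth checking against the page itself (the institution field can be handled in exactly the same way):

# collect the country divs and count the most frequent values, just as we did for positions
countries = soup.find_all("div", class_="views-field views-field-field-country")
print(pd.Series([x.get_text().strip() for x in countries]).value_counts().head(10))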

Country Count
United States of America (the) 1168
United Kingdom of Great Britain and Northern Ireland (the) 292
Canada 239
India 193
Germany 161
France 158
China 74
Netherlands (the) 68
Spain 61
Switzerland 56

And by institution, the most frequent counts suggest that the signatories are fairly well distributed, though a significant subset (115) comes from Google, Google Brain or DeepMind. On the other hand, it’s a little surprising to see that only 20 signatories are from Oregon State University, from which the statement originated.

Institution Count
Blank 552
Google 47
Google Brain 41
Carnegie Mellon University 30
DeepMind 27
UC Berkeley 27
University of Oxford 26
Stanford University 23
MIT 22
McGill University 19

The full code used for this mini-project, and the saved version of the web page, are available on GitHub.