@article{delcey2024becoming,
title={Becoming an Economist: A Database of French Economics PhDs},
author={Delcey, Thomas and Goutsmedt, Aurélien},
journal={Zenodo},
year={2024},
doi={10.5281/zenodo.14541427}
}
Documentation: the French database
Introduction
This database compiles information on Ph.D. dissertations in economics defended in France since 1900.[1]
The French database is implemented as a relational database that integrates multiple interconnected data frames. It is organized around four main components:
- Thesis Metadata: This table contains the core information for each dissertation. Each entry corresponds to a single thesis and includes details such as the title, defense date, abstract, and other relevant metadata.
- Edges Data: This table captures the connections between the other three tables, linking individuals, institutions, and theses. It associates each thesis with the individuals and institutions involved in its production, thereby enabling a synthesized view of these relationships.[2]
- Institutions Data: This table includes information on universities, laboratories, doctoral schools, and other institutions associated with the dissertations. Each entry corresponds to a single institution.
- Individual Data: This table contains information on the individuals involved in the dissertations, including authors, supervisors, and jury members. Each entry corresponds to a single individual.
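As a rough illustration of how these components fit together, the Python sketch below follows edges from an individual to thesis titles (the project's own scripts are written in R). All rows are invented placeholders, and the role labels "author" and "supervisor" are assumed for the example; only the column names come from the table descriptions in this documentation.

```python
# Toy rows standing in for two of the tables; every value is an invented placeholder.
thesis_metadata = [
    {"thesis_id": "T1", "title_fr": "Titre A", "year_defence": 1990},
    {"thesis_id": "T2", "title_fr": "Titre B", "year_defence": 2001},
]
thesis_edge = [
    {"thesis_id": "T1", "entity_id": "P1", "entity_role": "author"},
    {"thesis_id": "T1", "entity_id": "P2", "entity_role": "supervisor"},
    {"thesis_id": "T2", "entity_id": "P3", "entity_role": "author"},
    {"thesis_id": "T2", "entity_id": "P2", "entity_role": "supervisor"},
]


def theses_with_role(entity_id, role):
    """Return the titles of the theses in which `entity_id` holds `role`."""
    # Index the metadata table by its key so each edge lookup is direct.
    titles_by_id = {row["thesis_id"]: row["title_fr"] for row in thesis_metadata}
    return [
        titles_by_id[edge["thesis_id"]]
        for edge in thesis_edge
        if edge["entity_id"] == entity_id and edge["entity_role"] == role
    ]


print(theses_with_role("P2", "supervisor"))
```

The edge table acts as the junction between the metadata table and the entity tables, so any question of the form "which theses involve this entity" reduces to a filter on the edges followed by a lookup on thesis_id.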
Sources
The data used in this project come from three main sources:
- Theses.fr: https://theses.fr/
- Sudoc: https://www.sudoc.fr/
- IdRef: https://www.idref.fr/
These sources are the result of the work of the ABES (Agence bibliographique de l’enseignement supérieur), which produces metadata and APIs covering research and higher education. The data from the three sources mentioned above are released under the Etalab “Open Licence”.[3]
Theses.fr is a comprehensive repository for PhD dissertations defended in French institutions since 1985.[4] It includes metadata such as the title of the dissertation, author, date of defense, institution, supervisor, and abstract. The database covers a wide range of disciplines and, in some cases, provides access to the digital version of the thesis.
Sudoc stands for “Système Universitaire de Documentation”. It is a union catalog that includes references to various documents held in French academic and research libraries. It covers books, journal articles, dissertations, and other academic works. The Sudoc database includes metadata like title, author, publication date, and library locations where the documents can be found. It is a key resource for academic research in France, providing a broad overview of available scholarly materials. For PhDs, it makes it possible to find dissertations defended before 1985 and to recover the relevant metadata.
IdRef stands for “Identifiants et Référentiels pour l’Enseignement supérieur et la Recherche”. It is a database focused on managing and standardizing the names and identifiers of authors and other contributors to academic and research works. It provides authority control for names used in academic cataloging, ensuring consistency and aiding in accurate attribution of works. IdRef is used in conjunction with Sudoc and other databases to support the management of bibliographic data in the French higher education and research sectors. In our project, it allows us to find additional data on individuals and institutions.
Our data-building approach focuses on ensuring consistency and quality while preserving the integrity of the original information. We rely on two principles to build this database:
- No data transformation: Our work primarily involves data collection, categorization, and cleaning. We intentionally minimized transformations, restricting them to minor and non-impactful adjustments. Specifically, we avoided altering the cell values of the original data, instead encoding any modifications in new columns.[5] The complete edges data frame keeps track of our transformations.
- Disambiguation: We aimed to disambiguate theses and associated entities (individuals and institutions) as thoroughly as possible. Disambiguation involves identifying and distinguishing between entries with similar descriptions. This process is essential to avoid both duplicated data and the erroneous merging of distinct entries. To address this, we assigned a unique identifier to each entity. The identifiers provided by the Agence Bibliographique de l’Enseignement Supérieur (ABES) through IdRef served as our primary source for unique identifiers. In cases where ABES identifiers were unavailable, we generated our own unique identifiers to maintain consistency and accuracy.
Usage and Access
Our database is released under the CC-BY-4.0 licence, allowing free access and use by anyone. The data are hosted in a Zenodo repository. While the database focuses on Ph.D. dissertations in economics and queries sources based on the field of the thesis, the scripts have been designed to be flexible and can be adapted to other queries, notably for other disciplines.
If you use our data or scripts, please cite the following reference: “Delcey, Thomas, and Aurélien Goutsmedt. (2024). Becoming an Economist: A Database of French Economics PhDs. Zenodo. https://doi.org/10.5281/zenodo.14541427”
Note that some of our data cleaning steps are tailored to the specific characteristics of the datasets we extracted. We systematically identified issues in our data and applied manual cleaning processes, such as removing problematic titles, identifying duplicates, and standardizing institutional names.
If you use our code to extract data, it is essential to carefully assess the quality of your extracted data and adapt the cleaning steps accordingly. Feel free to reach out to us for guidance if you plan to use our code for similar data extraction tasks.
Presentation of the tables
Thesis Metadata
The thesis_metadata table contains 20 variables:
- thesis_id: the unique identifier of the thesis. If it exists, it is the official “national number of the thesis” created by the ABES and the theses.fr website. If not, it is a temporary identifier we have created.
- year_defence: the year of the thesis defense. Our database covers the period 1899-2023.
- author: the first name and last name of the author of the thesis.
- author_id: the identifier of the author. If it exists, it is the official “idref” created by the ABES. If not, it is a temporary identifier we have created.
- title_fr: the title of the thesis in French.
- title_en: the title of the thesis in English.
- title_other: the title of the thesis in another language.
- abstract_fr: the abstract of the thesis in French.
- abstract_en: the abstract of the thesis in English.
- abstract_other: the abstract of the thesis in another language.
- language and language_2: the languages of the thesis. The variable is harmonized to make the information on language found in Sudoc and Theses.fr compatible.
- institution_thesis_name: the name of the institution where the thesis was defended. The name is standardized using the IdRef preferred name.
- institution_thesis_id: the identifier of the institution where the thesis was defended. If it exists, it is the official “idref” created by the ABES. If not, it is a temporary identifier we have created.
- country: the country where the thesis was defended, i.e. France.
- field: the field of the thesis (such as “Sciences économiques”). The field remains unaltered by our work and can take on a wide range of values, as indicated by the number of distinct entries (696).
- type: the type of the thesis. Type can take 6 values: Thèse, Thèse d’État, Thèse complémentaire, Thèse de 3e cycle, Thèse de docteur-ingénieur, Thèse sur travaux. All categories are derived from categories found in Sudoc.
- accessible: a binary variable indicating whether the full text is accessible or not (data coming only from theses.fr).
- url: the URL of the thesis on the theses.fr or Sudoc website.
- duplicates: a list of the thesis_id values identified as duplicates of the thesis. We identified duplicates but did not remove them to preserve the maximum information from the raw sources. This variable allows users to handle duplicates using their preferred strategy.
When we were unable to provide an ABES identifier for a thesis or entity, we created our own unique “temporary identifiers”, coded as temp_X_Y, with X representing the source of the original information (either “sudoc” or “thesesfr”) and Y being a randomly generated unique number. We refer to these identifiers as temporary because they may be replaced in future updates by ABES identifiers, either as ABES updates its sources or as we improve our cleaning process.
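A minimal Python sketch of how a user might separate temporary identifiers from official ones, based only on the temp_X_Y format described above. It assumes that Y is made of digits only; the function name and the sample identifiers are hypothetical.

```python
import re

# Documented format: temp_X_Y with X in {"sudoc", "thesesfr"}.
# Assumption: Y ("a randomly generated unique number") contains only digits.
TEMP_ID = re.compile(r"^temp_(sudoc|thesesfr)_(\d+)$")


def parse_identifier(identifier):
    """Return ("temporary", source) for temp_X_Y identifiers, else ("official", None)."""
    match = TEMP_ID.match(identifier)
    if match:
        return ("temporary", match.group(1))
    return ("official", None)


# Both identifiers below are invented placeholders.
print(parse_identifier("temp_sudoc_12345"))
print(parse_identifier("123456789"))
```

Filtering on the temporary status is a quick way to measure how much of a table still lacks an ABES identifier.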
Language
The language variables may be important for exploring issues related to the internationalization of economics. However, the raw source data exhibited a notable error rate in the classification of title and abstract languages: French titles were frequently mislabeled as English, and vice versa. To address this issue, we employed language prediction models to correct the misclassified information when possible. For a detailed explanation of the cleaning process, refer to Section 3.2.4.
Table 1 shows a sample of the thesis metadata table. The thesis metadata table contains 21025 theses. Figure 1 shows the distribution of theses over time.
type
The French education system lacked a standardized Ph.D. system between the early 1960s and 1984, the year of the Savary reform, which harmonized the Ph.D. system. During this period, various types of theses coexisted. For instance, in the mid-1970s, it was common for scholars to first complete a “Doctorat de 3e cycle” before pursuing a “Doctorat d’État.” As a result, a single author may have produced multiple types of theses. Figure 2 illustrates the distribution of theses over time, categorized by type. It should also be noted that the inclusion of thesis type in the metadata is not systematically ensured. This variability depends on the quality of metadata provided by individual institutions, which may affect the reliability of classification.
abstracts
The practice of providing abstracts started in the 1980s. Prior to this period, abstracts were missing (see Figure 3).
Edges
Each line in the thesis_edge table represents a unique edge between a thesis and an entity. We define an entity as any individual or institution involved in the thesis. The edge table has five columns:
- thesis_id: the identifier of a thesis (the same as in thesis_metadata). In the edge table, a thesis_id can have several edges. A thesis_id has at least two edges: the author and the institution where the thesis was defended.
- entity_id: the identifier of an entity. If it exists, it is the official “idref”, a unique identifier created by the ABES (see https://www.idref.fr/). If not, it is a temporary identifier we have created following the strategy we used for thesis_id.
- entity_role: the role of the entity. An individual, for example, may serve as an author, supervisor, referee, president, or jury member. In addition to identifying the primary institution where the Ph.D. was defended, the entity_role variable may include supplementary information we collected, such as affiliations with other institutions, laboratories, or doctoral schools (the organizations responsible for overseeing doctoral programs in French universities). This detailed information primarily applies to theses recorded in theses.fr after 1985. For data sourced from Sudoc, the value “etablissements_soutenance_from_info” of entity_role may provide additional information regarding the institutions associated with the thesis.
- entity_name: the name of the entity. It is derived from the preferred name in the official IdRef notice, or from raw information when the entity has no IdRef. See Section 3.2.5 and Section 3.2.6 for more information about name standardization.
- entity_firstname: the first name of the individual. Coded as a missing value when the entity is an institution.
Through the https://www.idref.fr/ platform, the ABES assigns unique identifiers to institutions and individuals involved in research in France. This system provides valuable information about entities, such as their dates of existence and the various names used to refer to them. For example, see the entry for the former University of Paris, which was split up after 1968. We scraped this information to enrich our institution and individual tables (see Section 3.1.3).
Table 2 shows a sample of the thesis edge table. We identify 100814 edges in total. Figure 4 shows the distribution of individuals by role. Figure 5 shows the distribution of individuals for the top institutions.
Note that the two figures represent the raw count of observations in the edge table and do not account for thesis duplicates. For example, to determine the exact number of theses published by the Université Paris I Panthéon-Sorbonne, it is necessary to first address duplicates in the metadata tables. This can be achieved by merging entries identified as duplicates using the duplicates column.
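One possible way to merge duplicates before counting is sketched below in Python. It assumes that the duplicates column has been loaded as a list of thesis_id values for each row (the storage format on disk may differ), and it groups identifiers with a small union-find structure. The function name and the sample rows are hypothetical.

```python
def duplicate_groups(rows):
    """Group thesis_ids that are connected through their `duplicates` lists."""
    parent = {}

    def find(item):
        # Follow parent links up to the group representative (with path halving).
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for row in rows:
        root = find(row["thesis_id"])
        for other in row["duplicates"]:
            # Attach the other thesis's group to the current group.
            parent[find(other)] = root

    groups = {}
    for thesis_id in list(parent):
        groups.setdefault(find(thesis_id), set()).add(thesis_id)
    return sorted(sorted(group) for group in groups.values())


# Invented rows: A and B flag each other as duplicates, C stands alone.
rows = [
    {"thesis_id": "A", "duplicates": ["B"]},
    {"thesis_id": "B", "duplicates": ["A"]},
    {"thesis_id": "C", "duplicates": []},
]
groups = duplicate_groups(rows)
print(len(groups), "distinct theses:", groups)
```

Counting groups rather than rows gives the number of distinct theses; which record to keep within each group remains the user's choice.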
Complete Edges Data
The thesis_edge_complete_data table allows the comparison between the original data as collected on theses.fr and Sudoc and the results of our cleaning process. In addition to the columns of thesis_edge, it has four additional columns:
- original_id: the original identifier of the entity in the raw data. This makes it possible to see how temporary identifiers for institutions have been cleaned to find the official idref.
- original_entity_name: the name of the entity as in the original raw data.
- original_entity_firstname: the first name of the individual as in the original data source.
- source: the source of the data. It can be “thesesfr” or “sudoc”.
Table 3 shows a sample of the additional information contained in the thesis_edge_complete table.
Institutions
In the thesis_institution table, each line represents a unique institution. Institutions are the universities, laboratories, doctoral schools, and other institutions associated with the theses. The table contains 1435 institutions and 19 variables. It has three core variables:
- entity_id: the unique identifier of the entity (here the institution).
- entity_name: the name of the entity. When an IdRef exists, the entity_name comes from the pref_name variable of the IdRef database.
- old_id: the list of the temporary identifiers of the entity that have been merged with the entity_id (see Section 3.2.5 for details on the cleaning process).
The other variables are additional information on the institutions scraped from IdRef:
- url: the IdRef URL of the entity.
- other_labels: other labels of the entity.
- date_of_birth: the date of creation of the entity.
- date_of_death: the date of disappearance of the entity.
- information: additional information on the entity.
- replaced_idref: the identifier of the entity that replaced the entity.
- predecessor: the predecessor of the entity.
- predecessor_idref: the identifier of the predecessor of the entity.
- successor: the successor of the entity.
- successor_idref: the identifier of the successor of the entity.
- subordinated: list of the entities subordinated to the entity.
- subordinated_idref: list of the identifiers of the entities subordinated to the entity.
- unit_of: the entities of which the entity in question is a unit.
- unit_of_idref: the identifiers of the entities of which the entity in question is a unit.
- other_link: other links of the entity.
- country_name: the country of the entity.
An essential aspect of our work involved associating institutions without an IdRef identifier to an existing IdRef. This step was crucial for standardizing information, particularly regarding the names of entities, and for enabling users to accurately assess the involvement of a given entity in theses. The process was relatively straightforward for the institution table, as it contains only a few hundred unique institutions. Consequently, the main institutions—universities—are well identified by a unique IdRef in most cases.
Table 4 shows a sample of the thesis_institution table.
Individuals
In the individuals table, each line represents a unique individual. Individuals are the authors, supervisors and other jury members associated with the theses. The table contains 27553 individuals and 14 variables:
- entity_id: the unique identifier of the individual.
- entity_name: the family name of the individual.
- entity_firstname: the first name of the individual.
- gender: the gender of the individual according to the IdRef database.
- gender_expanded: the gender of the individual according to the IdRef database, augmented for missing values with the French census data (see details in Section 3.2.6).
The other variables are additional information on the individual provided by the IdRef database:
- birth: the birth date of the individual.
- country_name: the country name of the individual.
- information: additional information on the individual.
- organization: a list of organizations in which the individual worked.
- last_date_org: the last dates recorded for which the individual was still a member of these organizations.
- start_date_org: the starting dates for each organization in which the individual worked.
- end_date_org: the ending dates for each organization in which the individual worked.
- other_link: a list of links to relevant online repository pages of the individual.
- homonym_of: a list of the entity_id values of the individual’s homonyms (see Section 3.2.6 for details).
homonym_of
Disambiguating individual entities, when IdRef identifiers were missing, proved more challenging than disambiguating institutions. For example, it is relatively straightforward to determine that the strings “Université Paris I” and “Université Paris I Panthéon-Sorbonne” refer to the same institution. In contrast, identifying whether “Robert Martin,” who authored a Ph.D. in 1985, is the same individual as “Robert Martin,” who supervised a Ph.D. in 2022, is far less certain. To assist users in identifying potential matches between individuals, the variable homonym_of highlights cases where two records may represent the same individual. For further details on the methodology, refer to Section 3.2.6.
Table 5 shows a sample of the individuals table.
Data collection and cleaning process
This section outlines our strategy for constructing the database, which is divided into two main steps:
- Scraping: The first step consists of scraping data from the three main sources: Theses.fr, Sudoc, and IdRef.
- Cleaning: The second step entails processing and cleaning the raw data files to generate five relational tables.
The R code is available in the following GitHub repository. The following diagram illustrates the relationships between each script. If you encounter any errors or have questions regarding the data or the code, please submit an issue.
Scraping
theses.fr
Thesis records have been registered in theses.fr since 1985. Theses.fr data are also stored on the data.gouv.fr website. They can be downloaded directly at this URL. The dataset we downloaded dates from January 2024. The downloading_theses_fr.R script downloads the .csv file from data.gouv.fr, then compresses it and stores it in .rds format.
Sudoc
We systematically collect metadata on French theses archived in the Sudoc database, focusing on theses in economics. To identify theses in economics, we employ a dual-query approach:
In the main query, we search for theses with a term beginning with “econo” in the “Note de Thèse” field, which specifies the discipline of the thesis.[6] The search is restricted to the period from 1900 to 1985, as theses from subsequent years are systematically cataloged in Theses.fr. Here is the query used to retrieve the thesis records.
One issue specific to French history is that economic research was predominantly conducted within law faculties until the 1960s. Consequently, in a second query, we focus on theses where the term “droit” (law) appears in the “Note de Thèse” field and a word beginning with “econo” is present in the title. This search is restricted to the period 1900–1968, aiming to identify theses classified as law theses prior to 1968 that likely pertain to economics. Here is the query used to retrieve these thesis records.
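The selection logic of the two queries can be restated as a local predicate. The Python sketch below is only an illustration: the actual selection is performed by Sudoc's search interface, and the function names, argument names, and sample records here are hypothetical. Accents and case are ignored, in line with the behaviour of the Sudoc search described in the footnotes.

```python
import re
import unicodedata


def words(text):
    """Lowercase the text, strip accents, and split it into alphanumeric words."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return re.findall(r"[a-z0-9]+", stripped)


def matches_economics_queries(note_de_these, title, year):
    """Return True if a record would satisfy the main query or the law query."""
    note_words = words(note_de_these)
    # Main query: a word beginning with "econo" in the thesis note, 1900-1985.
    main_query = (
        any(word.startswith("econo") for word in note_words)
        and 1900 <= year <= 1985
    )
    # Second query: "droit" in the note and an "econo..." word in the title, 1900-1968.
    law_query = (
        "droit" in note_words
        and any(word.startswith("econo") for word in words(title))
        and 1900 <= year <= 1968
    )
    return main_query or law_query


# Invented records for illustration.
print(matches_economics_queries("Thèse : Sciences économiques : Paris : 1975", "Titre", 1975))
print(matches_economics_queries("Thèse : Droit : Paris : 1930", "L'Économie rurale", 1930))
```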
The scraping_sudoc_id.R script collects the URLs of the thesis records. The scraping_sudoc_api.R script then queries the Sudoc API to retrieve structured metadata for each thesis, including information such as title, author, defence date, abstract, supervisor and other relevant details. These metadata are stored in an .xml file, which we then parse to extract the relevant information.[7]
scraping_sudoc_api.R utilizes parallel processing to accelerate data collection. It is designed with robust error and exception handling, ensuring efficient and reliable data retrieval. Moreover, the script is highly adaptable and can be easily used for other query types.
IdRef
We use the IdRef identifiers collected from Sudoc and Theses.fr to retrieve additional information about entities, such as date of birth, gender, last known institutions, institutions’ preferred and alternate names, and years of existence. The scripts scraping_idref_person.R and scraping_idref_institution.R use the IdRef identifiers as input to query the IdRef API and organize the retrieved information into structured tables.
Cleaning
This section outlines the data cleaning process. Starting with the raw sources, we clean and harmonize the data to enable seamless merging of the two datasets, Sudoc and theses.fr. Following this, we construct our five data tables.
Sudoc
The cleaning_sudoc.R script cleans the Sudoc data. It has two main objectives: managing duplicated identifiers and transforming the raw Sudoc data into a structured dataset. The process involves evaluating the data quality and restructuring the raw sources to ensure consistency and facilitate future merging with the theses.fr dataset.
The script handles duplicate identifiers, which fall into two categories:
- True duplicates: these occur when the same dissertation appears multiple times with identical identifiers and authors but differing defense dates. In such cases, the script retains the most recent record, as it is more likely to contain accurate metadata.
- False duplicates: these arise when the same identifier is linked to different authors, typically due to data entry errors from ABES. To resolve this, the script generates unique identifiers by appending a counter to the “national number thesis” field.
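The two cases above can be sketched in Python as follows. This is not the logic of cleaning_sudoc.R itself: it assumes that "most recent" means the latest defense year, it treats any identifier shared by more than one author as a false duplicate, and the "_1", "_2" suffix format is a hypothetical choice for the appended counter.

```python
from collections import defaultdict


def resolve_duplicate_identifiers(records):
    """Keep one record per true duplicate and re-label false duplicates."""
    by_id = defaultdict(list)
    for record in records:
        by_id[record["thesis_id"]].append(record)

    resolved = []
    for thesis_id, group in by_id.items():
        authors = {record["author"] for record in group}
        if len(authors) == 1:
            # True duplicate (or a single record): keep the latest defense year.
            resolved.append(max(group, key=lambda record: record["year_defence"]))
        else:
            # False duplicates: different authors share one identifier,
            # so a counter is appended (suffix format is an assumption).
            for counter, record in enumerate(group, start=1):
                resolved.append({**record, "thesis_id": f"{thesis_id}_{counter}"})
    return resolved


# Invented records for illustration.
records = [
    {"thesis_id": "N1", "author": "Author A", "year_defence": 1970},
    {"thesis_id": "N1", "author": "Author A", "year_defence": 1972},
    {"thesis_id": "N2", "author": "Author B", "year_defence": 1960},
    {"thesis_id": "N2", "author": "Author C", "year_defence": 1961},
]
for record in resolve_duplicate_identifiers(records):
    print(record)
```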
Most of the columns of the final data are created here from the raw data. Two variables deserve particular attention:
- year_defence:
  - For some theses, multiple defense dates are retrieved for a single observation (line). In such cases, the earliest date is selected, as it is more likely to correspond to the original, unfinished thesis.
  - When dates differ significantly, manual checks are performed.
  - Anomalous dates outside the query range (1899–1985) are cleaned to maintain consistency.
- type:
  - The type of the thesis is determined from various Sudoc metadata fields, reflecting the diversity of thesis types in the French system before the 1984 reform.
  - Thesis types are recoded into consistent categories, such as “Thèse d’État” and “Thèse de 3e cycle.” Entries that are not doctoral theses (e.g., master’s dissertations) are excluded to focus solely on relevant records.
  - If the thesis type cannot be determined, the variable is assigned the generic value “Thèse,” which therefore serves as the default value.
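The year_defence rules above could look roughly like the Python sketch below. Several points are assumptions rather than documented behaviour: out-of-range years are simply discarded, "differ significantly" is approximated by a hypothetical max_gap of two years, and the function name is invented.

```python
def clean_year_defence(years, first_year=1899, last_year=1985, max_gap=2):
    """Return (selected_year, needs_manual_check) for one thesis record.

    `max_gap` is a hypothetical threshold standing in for "dates differ
    significantly"; the documentation does not state the actual criterion.
    """
    # Assumption: anomalous years outside the query range are dropped.
    in_range = [year for year in years if first_year <= year <= last_year]
    if not in_range:
        return (None, False)
    # Flag records whose candidate years are far apart for manual review.
    needs_manual_check = max(in_range) - min(in_range) > max_gap
    # The earliest date is selected, as described above.
    return (min(in_range), needs_manual_check)


print(clean_year_defence([1976, 1975]))
print(clean_year_defence([1950, 1975]))
```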
Note that the values of language are also standardized to align with ISO conventions, ensuring compatibility with theses.fr data.
The final dataset is divided into four tables that constitute the relational database: metadata, edge, individual, and institution. For entities without official identifiers, temporary IDs are generated to enable future identification and disambiguation. Temporary identifiers follow the format temp_X_Y, with X representing the source of the original information (either “sudoc” or “thesesfr”) and Y being a randomly generated unique number.
Theses.fr
The cleaning_thesesfr.R script focuses on cleaning and structuring metadata for theses related to economics, extracted from the theses.fr database. The methodology closely parallels the approach used for Sudoc: assessing data quality, standardizing raw data, and preparing the dataset for integration with Sudoc data.
A specific challenge addressed in this script involves filtering out theses that were incorrectly classified as economics-related in the query results. After resolving this issue, the script applies the same steps as those used for Sudoc data, including the categorization and harmonization of variables, to ensure consistency and facilitate merging.
As with the Sudoc data, temporary IDs are generated for entities lacking official identifiers from IdRef. These temporary IDs support future identification and disambiguation efforts.
Merging
The merging_database.R script processes four types of tables—theses, edges, individuals, and institutions—generated from both the Sudoc and Theses.fr datasets. The script merges these tables in pairs to produce four intermediate merged tables. These intermediate data frames are subsequently cleaned and standardized in the following scripts.
Metadata
The script cleaning_metadata.R is designed to clean and harmonize thesis metadata. Metadata from Sudoc and theses.fr are derived from a variety of local institutions and individuals, which often results in inconsistencies and errors. This script focuses on addressing two major challenges: language detection and duplicate identification.
- Language detection: To ensure consistency across metadata about titles and abstracts, the script employs the cld3 (Ooms 2024) and fastText (Bojanowski et al. 2016) models for robust language identification. Key tasks include:
  - Verifying that titles and abstracts in French and English fields contain text in the correct languages. Discrepancies are resolved by reassigning text to appropriate fields.
  - Missing French or English titles and abstracts are supplemented using auxiliary columns from the scraped data (title_other and abstract_other) when relevant.
  - Titles and abstracts in full uppercase are converted to sentence case to enhance readability.
  - Placeholder text, irrelevant symbols, and uninformative entries are removed, with such entries replaced by missing values (NA).
- Duplicates: Duplicated thesis records are a common issue, arising from cross-database redundancy (the same thesis may appear in both Sudoc and theses.fr) and intra-database redundancy (a thesis may be registered multiple times by different institutions within a single database). To address this, we developed a duplicate detection algorithm. The core of the process involves grouping titles by author and comparing all possible title pairs within each group. We use the Optimal String Alignment (OSA) distance as the primary metric for these comparisons. OSA counts the number of operations (insertions, deletions, substitutions, and adjacent character transpositions) needed to align two strings. This method is implemented using the stringdist package (van der Loo 2014). We adjust the distance measure by the number of characters in the title. Each potential duplicate is manually reviewed. In alignment with the project’s overall approach, we do not remove duplicates but instead flag them in a new column, duplicates. Table 6 provides an example of distinct theses in the sources that we flagged as duplicates.
Our script also allows duplicates to be handled manually. If you spot an undetected duplicate, please let us know.
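To make the comparison step concrete, here is a self-contained Python sketch of the OSA distance and of the grouping by author. It is not the project's implementation, which relies on the R stringdist package. The normalization (dividing by the length of the longer title) and the 0.1 threshold are assumptions chosen for illustration, and the sample records are invented.

```python
from collections import defaultdict
from itertools import combinations


def osa_distance(a, b):
    """Optimal String Alignment distance with unit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def candidate_duplicates(theses, threshold=0.1):
    """Return pairs of thesis_ids by the same author with near-identical titles."""
    by_author = defaultdict(list)
    for thesis in theses:
        by_author[thesis["author_id"]].append(thesis)
    pairs = []
    for group in by_author.values():
        for first, second in combinations(group, 2):
            a, b = first["title_fr"].lower(), second["title_fr"].lower()
            longest = max(len(a), len(b))
            # Assumed normalization: distance divided by the longer title's length.
            if longest and osa_distance(a, b) / longest <= threshold:
                pairs.append((first["thesis_id"], second["thesis_id"]))
    return pairs


theses = [
    {"thesis_id": "T1", "author_id": "P1", "title_fr": "Essai sur la monnaie"},
    {"thesis_id": "T2", "author_id": "P1", "title_fr": "Essai sur la monnnaie"},
    {"thesis_id": "T3", "author_id": "P2", "title_fr": "Essai sur la monnaie"},
]
print(candidate_duplicates(theses))
```

Grouping by author before comparing keeps the number of pairwise comparisons small, since titles are only compared within the set of theses attributed to the same person.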
Institutions
The script cleaning_institution.R is dedicated to standardizing and improving the quality of institution data.
Institution names extracted from metadata have been stored in a separate table, thesis_institution. This script focuses on cleaning and standardizing these names to ensure consistency and accuracy. A key goal is replacing temporary institution identifiers (id_sudoc_temp or id_thesesfr_temp) created in merging_database.R with the official IdRef identifiers. This process relies on matching institution names and thesis defense dates, accounting for historical changes in institutional structures (e.g., the division of the University of Paris after 1968) and carefully handling ambiguous cases.
The script employs a manually curated table that associates regular expressions (RegEx) for institution names with their corresponding IdRef identifiers. The table also includes the institutions’ dates of creation (date_of_birth) and dissolution (date_of_death) to set clear temporal boundaries for identifier replacement. For instance, if the institution name matches “University of Paris”:
- If the thesis was defended before 1968, the identifier is replaced with the identifier of the historic University of Paris, as it was the only university in Paris at the time.
- If the thesis was defended after 1968, the string “Université de Paris” is ambiguous since it describes several distinct institutions. In this case, we kept the temporary identifier because we are not able to resolve the ambiguity.
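A minimal Python sketch of this date-bounded matching is given below. The table content is hypothetical: the regular expression, the placeholder identifier IDREF_HISTORIC_PARIS, and the treatment of the boundary year (a defense in 1968 itself is left unresolved) are assumptions for illustration, not values taken from the curated table.

```python
import re

# Hypothetical excerpt of a curated matching table. The identifier is a
# placeholder, not a real IdRef, and None means "no lower bound".
MATCHING_TABLE = [
    {
        "pattern": re.compile(r"^universit[eé] de paris$", re.IGNORECASE),
        "idref": "IDREF_HISTORIC_PARIS",
        "date_of_birth": None,
        "date_of_death": 1968,
    },
]


def resolve_institution(name, year_defence, temporary_id):
    """Return the matched IdRef, or keep the temporary identifier if ambiguous."""
    for entry in MATCHING_TABLE:
        if not entry["pattern"].search(name.strip()):
            continue
        born, died = entry["date_of_birth"], entry["date_of_death"]
        after_birth = born is None or year_defence >= born
        # Assumption: only defenses strictly before the dissolution year match.
        before_death = died is None or year_defence < died
        if after_birth and before_death:
            return entry["idref"]
    return temporary_id


print(resolve_institution("Université de Paris", 1950, "temp_sudoc_1"))
print(resolve_institution("Université de Paris", 1975, "temp_sudoc_1"))
```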
Individuals
The script cleaning_individuals.R is designed to standardize and enhance the quality of individual data.
The script first enriches individual records by incorporating information from the idref_individual_table, built from scraping_idref_person.R. When a name entity is linked to an IdRef identifier, supplementary details about the individual (such as organization affiliations, birth date, and relevant links, e.g., Wikipedia pages) are added from the IdRef database. Additionally, raw names extracted from Sudoc or theses.fr are replaced with the standardized names provided by IdRef.
A key focus of the script is addressing inconsistencies in individual identifiers. Challenges include:
- Variations in names: The same individual may appear with slight name differences (e.g., “Jean A. Dupont” vs. “Jean Dupont”).
- Duplicate identifiers: A single individual may be associated with different identifiers across or within datasets (e.g., as an author in Sudoc in 1983 and as a jury member in theses.fr in 1999).
While the script strives to identify and group such cases, disambiguating individual identifiers is constrained by the risk of homonyms: two records with identical names may refer to distinct individuals. Due to this ambiguity, it is not possible to merge identifiers confidently.
To address potential ambiguities, the script introduces a new column, homonym_of, which groups potential homonyms. For each individual, the homonym_of field lists the identifiers of individuals with identical or highly similar names. This approach prevents premature merges while flagging possible relationships for users to investigate further.
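The idea behind homonym_of can be sketched in Python as follows. This simplified version only groups names that are identical once case, accents, and hyphens are ignored; it does not cover the "highly similar" names also handled by the project's script. The function names and the sample records are hypothetical.

```python
import unicodedata
from collections import defaultdict


def name_key(firstname, lastname):
    """Build a comparison key that ignores case, accents, and hyphens."""
    text = unicodedata.normalize("NFD", f"{firstname} {lastname}".lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return " ".join(text.replace("-", " ").split())


def add_homonyms(individuals):
    """Return copies of the records with a `homonym_of` list of other entity_ids."""
    ids_by_key = defaultdict(list)
    for person in individuals:
        key = name_key(person["entity_firstname"], person["entity_name"])
        ids_by_key[key].append(person["entity_id"])

    result = []
    for person in individuals:
        key = name_key(person["entity_firstname"], person["entity_name"])
        homonyms = [i for i in ids_by_key[key] if i != person["entity_id"]]
        result.append({**person, "homonym_of": homonyms})
    return result


# Invented records: the first two share a name, so each lists the other.
individuals = [
    {"entity_id": "P1", "entity_firstname": "Robert", "entity_name": "Martin"},
    {"entity_id": "temp_thesesfr_7", "entity_firstname": "Robert", "entity_name": "MARTIN"},
    {"entity_id": "P2", "entity_firstname": "Jean", "entity_name": "Dupont"},
]
for person in add_homonyms(individuals):
    print(person["entity_id"], person["homonym_of"])
```

Records are flagged rather than merged, which mirrors the choice described above: the decision to treat two homonyms as one person is left to the user.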
Finally, the script enhances the gender column using data from the IdRef source. For individuals with missing gender values, we leverage French census data to predict gender. If a first name is associated with a single gender in more than 95% of cases, we assign that gender to the individual.
This approach has the advantage of simplicity but presents obvious limitations for handling some important cases (e.g., unisex names or cultural variations). The threshold of 95% is also arbitrary. To clarify the origin of the information, whether from IdRef or census data, we did not modify the gender column, and we created a new column, gender_expanded.
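The 95% rule can be illustrated with the Python sketch below. The census counts are invented numbers, the labels "female" and "male" are assumed codes, and the function names are hypothetical; only the threshold and the separation between gender and gender_expanded come from the description above.

```python
def predict_gender(firstname, census_counts, threshold=0.95):
    """Return the majority gender of a first name if its share exceeds the threshold."""
    counts = census_counts.get(firstname.lower())
    if not counts:
        return None
    total = sum(counts.values())
    if total == 0:
        return None
    gender, count = max(counts.items(), key=lambda item: item[1])
    # "More than 95% of cases": strict comparison with the threshold.
    return gender if count / total > threshold else None


def gender_expanded(person, census_counts):
    """Keep the IdRef gender when present, otherwise fall back on the prediction."""
    return person["gender"] or predict_gender(person["entity_firstname"], census_counts)


# Invented counts for illustration only.
census_counts = {
    "marie": {"female": 990, "male": 10},
    "camille": {"female": 600, "male": 400},
}
print(predict_gender("Marie", census_counts))
print(predict_gender("Camille", census_counts))
```

Returning None below the threshold leaves ambiguous first names as missing values instead of forcing a guess.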
References
Footnotes
[1] While the focus is on France, both the database and its accompanying documentation are presented in English. This decision reflects its integration into a larger initiative, which seeks to establish a comprehensive global repository of Ph.D. dissertations in economics.
[2] The edges data are provided in two formats: (1) a ready-to-use format with cleaned and standardized information; and (2) a more extensive format that allows for comparison between the original collected data and the results of the cleaning process.
[4] This corresponds to the reform of the French PhD and the implementation of the “new regime”.
[5] Exceptions were made for minimal transformations, such as replacing fully uppercase titles and abstracts with standardized capitalization, or correcting errors, such as changing the language of a title or abstract when it was mistakenly assigned.
[6] This RegEx captures terms such as “économie” and “Economique” because Sudoc’s search function is case-insensitive and disregards accents.
[7] The structure of the .xml used by the ABES is explained here.