Data and Sources
Data for this study was obtained from three publicly available sources, each contributing a distinct category of indicators. The goal was to assemble a longitudinal, cross-national dataset covering as many countries and years as the sources permitted, with filtering applied downstream during the join and subset stages.
World Bank
The World Bank’s open data API was queried to retrieve the following macroeconomic indicators:
| Indicator | Description |
|---|---|
| R&D expenditure (% of GDP) | National spending on R&D as a share of GDP |
| GDP growth (annual %) | Year-on-year GDP growth rate |
| Population | Total population, used as a scaling variable |
| Tertiary enrollment (% gross) | Gross enrollment ratio in tertiary education |
These indicators capture each country’s macroeconomic trajectory and supply the validation and control variables used in the analysis.
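A minimal sketch of how one page of a World Bank API response could be requested and parsed follows. The indicator codes are an assumption chosen to match the table above, since the text does not list the codes that were queried. The helper names (`build_url`, `parse_page`) and output column names are hypothetical, and the response layout (a metadata object followed by a list of records) is assumed from the public v2 JSON format.

```python
import json
from urllib.parse import urlencode

# Assumed indicator codes matching the table above; not confirmed by the text.
INDICATORS = {
    "GB.XPD.RSDV.GD.ZS": "rd_expenditure_pct_gdp",
    "NY.GDP.MKTP.KD.ZG": "gdp_growth_pct",
    "SP.POP.TOTL": "population",
    "SE.TER.ENRR": "tertiary_enrollment_pct",
}

BASE_URL = "https://api.worldbank.org/v2/country/all/indicator"


def build_url(indicator_code, page=1, per_page=1000):
    """Build the request URL for one indicator and one result page."""
    query = urlencode({"format": "json", "per_page": per_page, "page": page})
    return f"{BASE_URL}/{indicator_code}?{query}"


def parse_page(payload_text, column_name):
    """Turn one JSON response page into (total_pages, list of row dicts).

    Assumes a successful response of the form [metadata, records];
    error responses have a different shape and are not handled here.
    """
    meta, records = json.loads(payload_text)
    rows = [
        {
            "country_code": rec["countryiso3code"],
            "year": int(rec["date"]),
            column_name: rec["value"],  # None where the source has no value
        }
        for rec in (records or [])
    ]
    return meta["pages"], rows
```

Keeping the parsing separate from the HTTP call means null coverage and types can be checked on saved responses without querying the API again.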
OECD
The OECD’s Main Science and Technology Indicators (MSTI) dataset was accessed via the OECD Data Explorer API. The following indicators were extracted:
| Indicator | Description |
|---|---|
| GERD | Gross domestic expenditure on R&D |
| BERD | Business enterprise expenditure on R&D |
| GOVERD | Government expenditure on R&D |
| Researchers | Full-time equivalent researchers per country per year |
These indicators provide a more granular decomposition of R&D investment by performing sector, complementing the World Bank’s aggregate R&D expenditure figure.
USPTO
Two core datasets were obtained from the United States Patent and Trademark Office’s public bulk data repository, hosted on Amazon S3:
| File | Description |
|---|---|
| g_patent.tsv | Patent-level records including application year, grant year, and patent type |
| g_inventor_disambiguated.tsv | Disambiguated inventor records linked to patents and location identifiers |
A third file was needed during transformation to resolve location identifiers:
| File | Description |
|---|---|
| g_location_disambiguated.tsv | Mapping of location_id values to country names and country codes |
Together, these datasets allow patent counts and inventor counts to be attributed to countries and aggregated by year, providing the empirical measure of R&D output used throughout the analysis.
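The attribution step can be sketched as a lookup from inventor records to countries and from patents to years. This is an illustration under stated assumptions, not the study's actual code: the field names (`patent_id`, `grant_year`, `location_id`, `country`) and the function name are hypothetical. The counting rule is also an assumption: a patent with inventors in several countries counts once for each of those countries.

```python
from collections import Counter


def count_patents_by_country_year(patents, inventors, locations):
    """Count distinct patents per (country, grant year).

    patents:   iterable of dicts with "patent_id" and "grant_year"
    inventors: iterable of dicts with "patent_id" and "location_id"
    locations: iterable of dicts with "location_id" and "country"
    All field names are assumed for illustration.
    """
    country_of = {loc["location_id"]: loc["country"] for loc in locations}
    year_of = {p["patent_id"]: p["grant_year"] for p in patents}

    # A set de-duplicates patents with several inventors in the same country.
    seen = set()
    for inv in inventors:
        country = country_of.get(inv["location_id"])
        year = year_of.get(inv["patent_id"])
        if country is None or year is None:
            continue  # unmatched location or patent: skip rather than guess
        seen.add((inv["patent_id"], country, year))

    return Counter((country, year) for _, country, year in seen)
```

Counting inventors instead of patents would only require dropping the de-duplication and counting inventor records directly.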
ETL Pipeline
Extraction: Retrieving and validating datasets from the World Bank, OECD, and USPTO consistently required source-specific API clients and chunked-loading routines. Endpoints were checked for correctness and memory use was kept bounded, so that large-scale extraction was reliable and repeatable.
Transformation: This stage validated data types and null coverage for the World Bank and OECD sources, and performed memory-efficient joins and ISO standardization of country codes on millions of USPTO records, producing data in a usable format for analysis.
Load: Cleaned data was stored in .csv and .parquet formats, with Parquet prioritized for its superior columnar storage efficiency and query performance during analysis.
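Chunked loading, as mentioned in the extraction and transformation stages, can be sketched with the standard library alone. The function names and the default chunk size are hypothetical, and the original pipeline may well have used a dataframe library instead. The point is that only one chunk of rows is held in memory at a time while a running aggregate is updated.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(handle, chunk_size=100_000):
    """Yield lists of row dicts from a tab-separated file, chunk_size rows at a time."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_field(handle, field, chunk_size=100_000):
    """Aggregate row counts per value of `field` without loading the whole file."""
    totals = Counter()
    for chunk in iter_chunks(handle, chunk_size):
        totals.update(row[field] for row in chunk)
    return totals


# Usage with an in-memory stand-in for a large TSV file (made-up rows).
sample = io.StringIO("patent_id\tpatent_type\n1\tutility\n2\tdesign\n3\tutility\n")
print(count_by_field(sample, "patent_type", chunk_size=2))
```

For a real file, the handle would come from `open(path, newline="", encoding="utf-8")`.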
The following analytical subsets were produced:
| Subset | Contents |
|---|---|
| subset_A.parquet | Patent counts with GDP growth and population; no R&D expenditure requirement |
| subset_B.parquet | Patent counts with R&D expenditure, GDP growth, population, and tertiary enrollment |
| subset_C.parquet | A stricter filtered subset with fewer null values in some indicators |
| patent_counts.parquet | Aggregated patent counts by country and year |
| oecd_clean.parquet | Cleaned OECD MSTI data |
| wb_clean.parquet | Cleaned World Bank data |
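One way to read the difference between the subsets is as a set of columns that must be non-null for a country-year row to be kept. The sketch below illustrates that idea only. The column names and the `build_subset` helper are hypothetical, and the exact rules behind each subset (in particular subset_C) are not specified in the text.

```python
def build_subset(rows, required):
    """Keep rows where every required column is present and not None."""
    return [r for r in rows if all(r.get(col) is not None for col in required)]


# Assumed column requirements, inferred from the subset descriptions.
SUBSET_A = ("patent_count", "gdp_growth", "population")
SUBSET_B = SUBSET_A + ("rd_expenditure", "tertiary_enrollment")
```

Under this reading, every row in subset B also qualifies for subset A, which is why A can cover more countries and years.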
Analysis
The analysis follows this funnel structure:
Raw correlations (Pearson, Spearman) first determine whether relationships exist and whether they are linear or monotonic.
Log-transformations and OLS regression quantify the functional form and control for confounders like population.
Lag analysis determines whether the relationships are time-displaced rather than concurrent.
Efficiency measures are derived ratios that normalize patent and GDP output by investment and workforce size, allowing cross-country comparisons.
Income-group stratification determines whether aggregate signals are hiding divergent behavior across high-income and emerging economies.
Finally, binned aggregation combined with logarithmic curve fitting was used to answer the plateau question and confirm the result.