Data and Sources

Data for this study were obtained from three publicly available sources, each contributing a distinct category of indicators. The goal was to assemble a longitudinal, cross-national dataset covering as many countries and years as the sources permitted, with filtering applied downstream during the join and subset stages.


World Bank

The World Bank’s open data API was queried to retrieve the following macroeconomic indicators:

Indicator                        Description
R&D expenditure (% of GDP)       National spending on R&D as a share of GDP
GDP growth (annual %)            Year-on-year GDP growth rate
Population                       Total population, used as a scaling variable
Tertiary enrollment (% gross)    Gross enrollment ratio in tertiary education

These indicators capture a country’s macroeconomic trajectory alongside validation and control parameters relevant to the analysis.
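As a rough illustration of the retrieval step, the sketch below pages through the World Bank Indicators API (v2) and flattens each JSON page into rows. The indicator codes, the output column names, and the helper names (`build_url`, `parse_page`, `fetch_indicator`) are assumptions made for this example; the project's own extract.py may be organized differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed mapping from World Bank indicator codes to column names.
# The text names the indicators but does not list their codes.
INDICATORS = {
    "GB.XPD.RSDV.GD.ZS": "rd_expenditure_pct_gdp",
    "NY.GDP.MKTP.KD.ZG": "gdp_growth_pct",
    "SP.POP.TOTL": "population",
    "SE.TER.ENRR": "tertiary_enrollment_pct",
}

BASE_URL = "https://api.worldbank.org/v2/country/all/indicator/{code}"


def build_url(code, page=1, per_page=1000):
    """Build the request URL for one page of one indicator."""
    query = urlencode({"format": "json", "per_page": per_page, "page": page})
    return BASE_URL.format(code=code) + "?" + query


def parse_page(payload, column):
    """Flatten one JSON page ([metadata, records]) into row dicts.

    Returns the total page count together with the parsed rows.
    """
    meta, records = payload[0], payload[1] or []
    rows = [
        {"iso3": r["countryiso3code"], "year": int(r["date"]), column: r["value"]}
        for r in records
        if r["countryiso3code"]  # drop entries that carry no ISO3 code
    ]
    return meta["pages"], rows


def fetch_indicator(code, column):
    """Walk every page of one indicator (performs network requests)."""
    page, pages, rows = 1, 1, []
    while page <= pages:
        with urlopen(build_url(code, page), timeout=30) as resp:
            pages, page_rows = parse_page(json.load(resp), column)
        rows.extend(page_rows)
        page += 1
    return rows
```

Calling `fetch_indicator(code, column)` for each entry in `INDICATORS` would yield one long-format list per indicator, which can then be joined on `iso3` and `year`. Missing observations arrive as `None` and are kept, so null coverage can be assessed later.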


OECD

The OECD’s Main Science and Technology Indicators (MSTI) dataset was accessed via the OECD Data Explorer API. The following indicators were extracted:

Indicator      Description
GERD           Gross domestic expenditure on R&D
BERD           Business enterprise expenditure on R&D
GOVERD         Government expenditure on R&D
Researchers    Full-time equivalent researchers per country per year

These indicators provide a more granular decomposition of R&D investment by sector, complementing the World Bank’s aggregate R&D expenditure figure.


USPTO

Two core datasets were obtained from the United States Patent and Trademark Office’s public bulk data repository, hosted on Amazon S3:

File                            Description
g_patent.tsv                    Patent-level records including application year, grant year, and patent type
g_inventor_disambiguated.tsv    Disambiguated inventor records linked to patents and location identifiers

A third file was needed during transformation to resolve location identifiers:

File                            Description
g_location_disambiguated.tsv    Mapping of location_id values to country names and country codes

Together, these datasets allow patent counts and inventor counts to be attributed to countries and aggregated by year, providing the empirical measure of R&D output used throughout the analysis.
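The attribution described above can be sketched with the standard library alone. The example streams the inventor file row by row instead of loading it whole, and it counts a patent once for every country in which it has at least one inventor. The column names (`patent_id`, `patent_date`, `inventor_id`, `location_id`, `disambig_country`) and that counting rule are assumptions for illustration, not a description of the project's actual transform.py.

```python
import csv
from collections import defaultdict


def load_location_countries(location_lines):
    """Map location_id -> country code (assumed column: disambig_country)."""
    reader = csv.DictReader(location_lines, delimiter="\t")
    return {row["location_id"]: row["disambig_country"] for row in reader}


def load_grant_years(patent_lines):
    """Map patent_id -> grant year, read from an assumed YYYY-MM-DD patent_date."""
    reader = csv.DictReader(patent_lines, delimiter="\t")
    return {row["patent_id"]: int(row["patent_date"][:4]) for row in reader}


def count_by_country_year(inventor_lines, countries, grant_years):
    """Return {(country, year): (distinct patents, distinct inventors)}."""
    patents = defaultdict(set)
    inventors = defaultdict(set)
    for row in csv.DictReader(inventor_lines, delimiter="\t"):
        country = countries.get(row["location_id"])
        year = grant_years.get(row["patent_id"])
        if not country or year is None:
            continue  # skip rows whose location or patent cannot be resolved
        key = (country, year)
        patents[key].add(row["patent_id"])
        inventors[key].add(row["inventor_id"])
    return {key: (len(patents[key]), len(inventors[key])) for key in patents}
```

Each function accepts any iterable of text lines, so an open file handle works, for example `open("g_inventor_disambiguated.tsv", newline="", encoding="utf-8")`. The two lookup tables are held in memory while the much larger inventor file is only iterated once.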


ETL Pipeline

  • Extraction: To retrieve and validate datasets from the World Bank, OECD, and USPTO consistently, this stage required source-specific API clients and chunked-loading routines. Endpoints were checked for correctness and memory use was kept in check, so that large-scale extraction stayed reliable and repeatable.

  • Transformation: Data types and null coverage were validated for the World Bank and OECD sources, while the USPTO data required memory-efficient joins and ISO standardization of country codes across millions of records. The outcome was a clear strategy for transforming the data into a usable format for analysis.

  • Load: Cleaned data was stored in .csv and .parquet formats, with Parquet prioritized for its superior columnar storage efficiency and query performance during analysis.

  • The following analytical subsets were produced:

    Subset                   Contents
    subset_A.parquet         Patent counts with GDP growth and population; no R&D expenditure requirement
    subset_B.parquet         Patent counts with R&D expenditure, GDP growth, population, and tertiary enrollment
    subset_C.parquet         A stricter filtered subset with fewer null values in some indicators
    patent_counts.parquet    Aggregated patent counts by country and year
    oecd_clean.parquet       Cleaned OECD MSTI data
    wb_clean.parquet         Cleaned World Bank data

    The ETL pipeline was first designed in three notebooks: extraction_validation.ipynb, transformation_validation.ipynb, and load.ipynb. This allowed the ideas, methods, and logic to be tested thoroughly for weaknesses and inconsistencies that might arise during implementation.
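The null-coverage check and the difference between subsets A and B can be illustrated with a small sketch. It works on plain row dictionaries to stay self-contained; the column names and the helper names (`null_coverage`, `require_columns`) are hypothetical, and the project itself presumably applies the same idea to Parquet-backed tables.

```python
def null_coverage(rows, columns):
    """Share of rows in which each column is missing (None)."""
    total = len(rows)
    if total == 0:
        return {}
    return {c: sum(r.get(c) is None for r in rows) / total for c in columns}


def require_columns(rows, columns):
    """Keep only the rows in which every listed column is present."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


# Hypothetical column names mirroring the subset descriptions above.
SUBSET_A_COLUMNS = ["patents", "gdp_growth", "population"]
SUBSET_B_COLUMNS = SUBSET_A_COLUMNS + ["rd_expenditure", "tertiary_enrollment"]
```

With these helpers, `require_columns(rows, SUBSET_A_COLUMNS)` keeps country-years that lack R&D expenditure, whereas `require_columns(rows, SUBSET_B_COLUMNS)` drops them. Running `null_coverage` first shows how many rows each additional requirement would cost.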

Analysis

The analysis follows this funnel structure:

  • Raw correlations (Pearson, Spearman) first determine whether relationships exist and whether they are linear or monotonic.

  • Log-transformations and OLS regression quantify the functional form and control for confounders like population.

  • Lag analysis determines whether the relationships are time-displaced rather than concurrent.

  • Income-group stratification determines whether aggregate signals are hiding divergent behavior across high-income and emerging economies.

  • Efficiency measures, derived formulas that normalize patent and GDP production by investment and workforce size, allow cross-country comparisons.

  • Binned aggregation combined with logarithmic curve fitting answers the plateau question and confirms the result.

    The analysis was also worked through in a notebook first: analysis_test.ipynb. This produced a mapped-out plan for reaching the target results and reduced the chance of mismatches or gaps in decision-making when implementing the analytical methods and laying out the results.

  • All notebooks referenced in the ETL Pipeline and Analysis sections can be found in the dedicated notebooks directory.

  • The ETL implementation scripts (extract.py, transform.py, and load.py) can be found in the ETL directory.

  • The analytical models mapped out in analysis_test.ipynb are implemented in the model.py script in the models directory, which also compiles the visualizations of the results.
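To make the first three funnel steps of the Analysis section concrete, here is a minimal pure-Python sketch of Pearson and Spearman correlation, a log-log slope, and a lagged correlation. It is an illustration under simplifying assumptions, not the contents of model.py: the regression uses a single predictor and therefore does not control for population, and the lag helper assumes two equally long yearly series for one country.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of a linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = shared
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson applied to ranks (monotonic trend)."""
    return pearson(ranks(x), ranks(y))


def ols_line(x, y):
    """Least-squares fit y = intercept + slope * x for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return mean_y - slope * mean_x, slope


def loglog_slope(x, y):
    """Slope of log(y) on log(x); requires strictly positive values."""
    return ols_line([math.log(a) for a in x], [math.log(b) for b in y])[1]


def lagged_pearson(x, y, lag):
    """Correlate x in year t with y in year t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])
```

A Spearman value close to 1 alongside a visibly lower Pearson value points to a monotonic but non-linear relationship, which is the case a log transformation is meant to handle. Comparing `lagged_pearson` across several lags indicates whether the association is stronger when the outcome is shifted forward in time.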

Note

The data files, which exist at multiple processing stages, were too large to upload to GitHub in full. To obtain the data, run the extraction script and follow the steps in the notebook corresponding to the extraction layer. The data are openly available and accessible with a little effort. After that, follow the remaining steps in the corresponding notebooks and run the scripts where necessary to reproduce the full project. The project was designed to be reproducible; the only missing piece is a single end-to-end pipeline script, which was omitted intentionally.