Overview
Jupyter notebooks served as a validation layer for the whole pipeline in this project,ensuring that every logical step was verified separately before being included to production scripts. In addition to assuring that the final scripts were based on previously validated logic and results, this dual-track method produced a documented audit trail of the development process.
ETL Notebooks
Extraction — extraction_validation.ipynb
This notebook covers the design and testing of the data extraction pipeline. Its workflow proceeds as follows:
World Bank: Before a dependable implementation was developed, several iterations of the API request were tried. At first, testing was restricted to two indicators for a shorter range of years. The entire indicator set and all accessible years were added to the extraction template after the pattern was verified.
OECD: The OECD Data Explorer’s URL generator was used to produce a valid API endpoint. The trial-and-error process and the final URL structure are documented and explained within the notebook.
USPTO: The patent and inventor files are substantial in size — roughly 2.1 GB combined, with the inventor dataset alone containing approximately 23.7 million rows. Both files were downloaded in chunked batches from the public S3 source. Chunk size was determined empirically: sizes were tested against local RAM utilization, and a value of 500,000 rows per chunk was selected as the safe upper bound for the used local system.
Transformation — transformation_validation.ipynb
This notebook documents all data quality checks, format corrections, and join logic applied to the raw datasets.
World Bank: Initial checks covered data types, null patterns, and country code formatting. Individual country codes were declared and validated within the notebook.
OECD: The full MSTI dataset was inspected and filtered to retain only the four indicators selected for this study. The notebook reports the complete indicator list present in the raw data.
USPTO: In accordance with the extraction stage, all processing was done in segments. The processed row counts for each file are displayed in the cell outputs: 23,753,556 rows from g_inventor_disambiguated.tsv and 9,361,444 rows from g_patent.tsv.
A data type discrepancy between the patent and inventor files that required to be fixed before merging was verified by quality tests. Another significant problem discovered was that the inventor file only had a location_id field rather than a nation or country code column. In order to resolve location identifiers to country names, a third file, g_location_disambiguated.tsv, had to be downloaded. In order to be consistent with the World Bank and OECD datasets, country codes that were originally in ISO alpha-2 format were changed to alpha-3 format after the join.
The notebook concludes with a section documenting all merging operations and the resulting analytical subsets.
Load — load.ipynb
The load stage required no complex transformation. Its primary function was converting the cleaned and merged .csv files into .parquet format for efficient downstream querying. Both .csv and .parquet versions of all datasets were saved in the local processed/ directory.
Analysis Notebook
analysis_test.ipynb
The project’s analytical center is this notebook. During the analysis stage, it documents all exploratory concepts, statistical tests, interim results, and interpretation choices. Each of the primary or secondary research questions stated in the Index page corresponds to each of the eight sections that make up the notebook.
The conclusions from this notebook, which were ultimately implemented by the model.py script, are summarized on the dedicated Analysis and Results page.
Chunked Reads
Instead of fully loading into RAM, large USPTO files are processed in predetermined 500,000-row batches.To ensure consistent RAM utilization, this consistent chunking technique is used for extraction, transformation, and analysis.The process operates dependably without manual monitoring or performance anxiety thanks to the removal of memory crashes and resource limitations.