Discussion – A Humble Computational Model for Proteins

Insights

The core finding of this study runs against intuition: no feature examined (hydrophobicity, entropy, amino acid composition, or PCA-derived axes) cleanly separates proteins by species. Viewed through the lens of physicochemical properties, the mammalian proteome behaves as a single conserved landscape rather than a set of species-specific ones. This points to selection acting on biophysical viability rather than on sequence identity alone.
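As a rough illustration of the per-protein features discussed above, the sketch below computes amino acid composition, mean hydropathy, and the Shannon entropy of the composition for a single sequence. The Kyte-Doolittle scale and the function names are assumptions made for this example; the scale and code actually used in the study may differ.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values. Assumed scale for illustration only;
# the study's own hydrophobicity scale is not stated in this section.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 standard amino acids; other symbols are ignored."""
    counts = Counter(aa for aa in seq.upper() if aa in KD)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts.get(aa, 0) / total for aa in KD}


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over the sequence (often called GRAVY)."""
    return sum(KD[aa] * frac for aa, frac in composition(seq).items())


def shannon_entropy(seq):
    """Shannon entropy of the amino acid composition, in bits."""
    return -sum(f * math.log2(f) for f in composition(seq).values() if f > 0)


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary toy sequence, not from the dataset
    print(mean_hydropathy(example), shannon_entropy(example))
```

Each protein then becomes a point with continuous coordinates, which is the representation the comparison across species relies on.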

Categorical labels such as “hydrophobic” or “disordered” are useful shorthand but, overall, poor analytical categories. Any downstream modelling (classification, clustering, functional annotation) should treat these properties as coordinates on a manifold rather than as discrete class memberships.
Rare amino acids constrain their local sequence context more strongly than their frequency would imply. This points toward active sites and interface residues as disproportionate sources of co-occurrence signal, a pattern worth isolating in future analyses that incorporate structural annotation.
There is a genuine biological constraint (larger proteins reduce hydrophobic exposure to prevent aggregation), but it explains only a small fraction of the observed variance because many other factors co-determine hydrophobicity. Compositional complexity, likewise, appears to be set by functional requirements rather than by size.
Cumulative explained variance does not reach 95% until PC18, which argues for caution in any dimensionality-reduction strategy that retains only the first few components.
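The point about explained variance can be checked with a few lines of scikit-learn. The sketch below is a minimal example that assumes a feature matrix with one row per protein; the function name `components_for_variance` and the synthetic data are hypothetical and stand in for the study's actual features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(X, threshold=0.95):
    """Return (k, cumulative), where k is the smallest number of principal
    components whose cumulative explained-variance ratio reaches threshold."""
    X_scaled = StandardScaler().fit_transform(X)   # put features on a common scale
    pca = PCA().fit(X_scaled)                      # keep all components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index where the cumulative ratio is >= threshold, as a 1-based count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative


if __name__ == "__main__":
    # Synthetic stand-in for a proteins-by-features matrix (not the study's data).
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(500, 25))
    k, cumulative = components_for_variance(X_demo)
    print(f"{k} components needed to reach 95% of the variance")
```

Plotting `cumulative` against the component index gives the usual scree-style curve and makes it easy to see how much is lost by keeping only the first two or three components.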

The findings reinforce a view of proteome evolution as operating under strong universal biophysical constraints — solubility, folding stability, resistance to aggregation — that are largely conserved across mammalian lineages while leaving substantial room for species-level divergence in properties such as size distribution and specific residue frequencies.

The Horizon to be Explored!

Possible Expansions

We could enhance this study by increasing its taxonomic breadth. Adding non-mammalian vertebrates, invertebrates, other eukaryotes such as plants, fungi, and simple multicellular organisms, as well as bacteria and archaea, would test whether the conserved physicochemical manifold observed here is a mammalian phenomenon or a universal feature of life. Extremophile proteomes (from thermophiles, halophiles, and psychrophiles) are of particular interest, as their proteins must function under conditions that should impose measurable shifts in hydrophobicity and compositional bias. The dataset could also be enlarged as needed; its size here was constrained by practical considerations.

Structurally informed data from AlphaFold-predicted structures would enable the same physicochemical analyses to be performed more thoroughly. The geometric properties of the hydrocarbons were not considered at all, as they fall outside the scope of this study, although they are accessible with computational techniques.

Applying supervised and deep learning models opens a new front with significant possibilities: transformer-based protein language models (ESM-2, ProtTrans) encode richer representations than amino acid frequencies, and graph neural networks can incorporate structural context directly.

Stricter mathematical frameworks — topological data analysis to characterise the manifold geometry, information-theoretic measures beyond pairwise mutual information — would allow more rigorous tests of the continuity and dimensionality conclusions reached here.
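For reference, the pairwise mutual information baseline mentioned above can be sketched in plain Python. This is a minimal version that assumes the pairs of interest are residues a fixed distance apart along the sequence; the function name and the `gap` parameter are hypothetical, and the study's own estimator may differ.

```python
import math
from collections import Counter


def pairwise_mutual_information(sequences, gap=1):
    """Mutual information (bits) between residues separated by `gap` positions,
    estimated from raw pair counts pooled over all sequences."""
    pair_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[gap:]):
            pair_counts[(a, b)] += 1

    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("no residue pairs found at this gap")

    # Marginal counts for the first and second position of each pair.
    first, second = Counter(), Counter()
    for (a, b), n in pair_counts.items():
        first[a] += n
        second[b] += n

    mi = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / total
        # p_ab / (p_a * p_b) simplifies to n * total / (first[a] * second[b]).
        mi += p_ab * math.log2(n * total / (first[a] * second[b]))
    return mi
```

Estimates from raw counts are biased upward for small samples, which is one reason more careful information-theoretic measures would be a useful extension.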

Regardless, this work could be thought of as a quantitative foundation for further exploration of the topic.

Current State of Research

Computational biochemistry has undergone a step-change in the past decade, driven primarily by the application of deep learning to protein structure prediction. AlphaFold2 (2021) and its successors have largely solved the single-chain structure prediction problem, with accuracy sufficient for many drug discovery and functional annotation tasks. The field has since moved toward structure-based function prediction, protein–protein interaction modelling, and the inverse problem of de novo protein design, where tools such as RFdiffusion and ProteinMPNN generate novel sequences with target folds.

Protein language models — trained on hundreds of millions of sequences in a manner analogous to large language models for text — now produce dense vector representations that capture evolutionary, structural, and functional information simultaneously. These embeddings are increasingly used as feature inputs for downstream classifiers, replacing hand-crafted physicochemical descriptors of the type computed in this study.

At the systems level, proteome-scale mass spectrometry and single-cell proteomics are generating datasets of previously unattainable scale and resolution, creating demand for the kind of quantitative baseline analyses performed here. Meanwhile, reinforcement learning approaches, inspired by successes in game-playing systems, are beginning to appear in protein design and drug–target interaction contexts, framing sequence optimisation as a search problem in a high-dimensional fitness landscape.

Author’s Note

Stepping outside my comfort zone into biochemistry, and the seemingly infinite world of proteins, has meaningfully boosted my confidence as a researcher. What fascinated me most were the patterns revealed by the results: counterintuitive, often unpredictable, and all the more rewarding for it. The study of proteins itself proved to be an absorbing journey, intellectually demanding and exciting in equal measure. Fortunately, my background in physics, with grounding in organic chemistry fundamentals, along with prior experience in data analysis and computational techniques, gave me the footing needed to navigate concepts that might otherwise have been overwhelming. Through this work, I deepened my practical knowledge of scikit-learn and Python-based machine learning, while independently exploring Julia libraries and GPU-accelerated computation on the side. Both have broadened my perspective and sharpened my ambitions as an emerging researcher in modern computational science.

P.S. I would defend this work against potentially harsh critiques from biochemistry and allied fields by stating:
- This has been a practice project, one in a series in an ongoing endeavour to gain a strong footing in data science and computational techniques. The priorities were to explore Python libraries and built-in statistical tools.
- The work was completed under the tightest time constraint I would agree to!
- The explanations and presentation quality were compromised by my lack of deep understanding of the subject and by a lack of time to close that gap; a more thorough presentation also falls outside the scope of this project’s goals.
- This is not to be viewed as a reference or a peer-reviewed scientific paper, only as the humble effort of a curious mind fascinated by the possibilities!