Project Structure

├── clean_data
├── raw_data
├── results
└── src

Workflow

The complete workflow follows:

Data Extraction

Protein sequence data was retrieved from UniProt through its REST API. Two scripts, download_data_test_failed.py and download_data.py, document an iterative approach: the first attempt exposed pagination and formatting issues, which were corrected in the second implementation. The downloaded data was stored in TSV format under the raw_data directory.

Two data files were produced: data.tsv, a small test pull used to confirm that the response format was correct and that pagination advanced properly between API requests, and proteins_raw.tsv, the full dataset downloaded once quality was confirmed. The data covers three organisms: Homo sapiens (15,000 proteins), Mus musculus (7,000 proteins), and Bos taurus (8,000 proteins).
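The pagination step can be sketched as follows. This is a minimal illustration, not the contents of download_data.py: it assumes the API announces the next page through an HTTP Link header carrying rel="next", which is how UniProt's REST documentation describes its cursor-based pagination. The names next_page_url, iter_pages, and fetch are hypothetical.

```python
import re
from typing import Callable, Iterator, Optional, Tuple

# Matches the URL of the rel="next" entry in an HTTP Link header.
_NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL of the next page, or None when there is no next page."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def iter_pages(
    first_url: str,
    fetch: Callable[[str], Tuple[str, Optional[str]]],
) -> Iterator[str]:
    """Yield the body of every page, following rel="next" links until exhausted.

    `fetch` is a hypothetical callable that returns (body_text, link_header)
    for a URL; it keeps the paging logic independent of any HTTP library.
    """
    url: Optional[str] = first_url
    while url is not None:
        body, link_header = fetch(url)
        yield body
        url = next_page_url(link_header)
```

With the requests library, fetch could wrap requests.get(url) and return response.text together with response.headers.get("Link"). Passing the fetcher in as an argument also makes the paging logic testable without network access.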

Data Cleaning

Raw data quality turned out to be high. The validate.py script performed upfront checks on sequence length, amino acid characters and sequence validity, with findings preserved as inline comments.

Of the 75 sequences initially flagged as invalid, most were later found to be biologically legitimate. The amino acid codes X, O, and U, which triggered the original rejection, are valid one-letter codes in UniProt: X represents an unknown or ambiguous residue, O denotes pyrrolysine, and U denotes selenocysteine. Selenocysteine occurs in mammalian selenoproteins; pyrrolysine is known only from certain archaea and bacteria, so O is accepted here as a valid code rather than as an expected residue.

clean.py then handled the actual cleaning — checking for tab-separation anomalies, formatting inconsistencies, and invalid amino acid characters. The cleaned dataset was saved in clean_data as proteins_clean.tsv; sequences that failed validation were written to removed_sequences.tsv.
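The character check can be illustrated with a short sketch. It assumes the accepted alphabet is the 20 standard residues plus X, O, and U, as discussed above; other IUPAC ambiguity codes such as B, Z, and J are not mentioned in this report and are treated as invalid here. The function names are hypothetical and need not match those in clean.py.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Extra one-letter codes that this report treats as legitimate (assumption:
# no other extended codes are accepted).
EXTENDED_CODES = set("XOU")
ALLOWED = STANDARD_AMINO_ACIDS | EXTENDED_CODES


def invalid_residues(sequence):
    """Return the set of characters in `sequence` outside the allowed alphabet."""
    return set(sequence.upper()) - ALLOWED


def split_valid(records):
    """Partition (identifier, sequence) pairs into kept and removed lists.

    Empty sequences and sequences with disallowed characters are removed.
    """
    kept, removed = [], []
    for identifier, sequence in records:
        if sequence and not invalid_residues(sequence):
            kept.append((identifier, sequence))
        else:
            removed.append((identifier, sequence))
    return kept, removed
```

Returning the removed records instead of discarding them mirrors the split between proteins_clean.tsv and removed_sequences.tsv.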

Feature Extraction

feature.py computed four per-sequence descriptors and stored them in features.tsv under the results directory:

  • Molecular Weight — estimated from amino acid composition
  • Kyte–Doolittle Hydrophobicity — a residue-averaged score reflecting the overall hydrophobic character of the sequence
  • Shannon Entropy — a measure of compositional complexity based on residue frequency distributions
  • Sequence Length — recalculated here as an independent verification step
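Two of these descriptors can be sketched with the standard library alone. The sketch below is illustrative and may differ from feature.py: it assumes that residues without a Kyte-Doolittle value (X, O, U) are skipped when averaging, and it leaves out molecular weight, which additionally needs a table of residue masses.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard residues.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle score over residues that have a scale value.

    Assumption: residues without a value (X, O, U) are skipped, not scored as 0.
    Returns NaN when no residue can be scored.
    """
    scores = [KD_SCALE[aa] for aa in sequence if aa in KD_SCALE]
    return sum(scores) / len(scores) if scores else float("nan")


def shannon_entropy(sequence):
    """Shannon entropy of the residue frequency distribution, in bits."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A sequence made of a single repeated residue has an entropy of 0 bits, and a sequence using all 20 standard residues equally often reaches the maximum of log2(20), about 4.32 bits.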

Analysis

stats.py provides a lightweight sanity check before deeper analysis: it groups sequences by organism and computes the mean of each feature.

Each analysis method was selected to address a different structural question about the dataset. To reduce ambiguity, the main ones are described below:

  • Correlation Parameter (Pearson’s r) \(\implies\) Measures the strength and direction of the linear relationship between two continuous features (e.g., hydrophobicity and entropy). It ranges from −1 (perfect negative linear relationship) to +1 (perfect positive linear relationship); values close to 0 indicate no linear relationship.

  • Mutual Information \(\implies\) Quantifies statistical dependence between two variables, including non-linear dependence, as the amount by which knowing one feature reduces uncertainty about the other (units: bits). Unlike correlation, MI can detect complex, non-monotonic relationships.

  • Allometric Scaling by Feature \(\implies\) Tests whether a protein feature scales as a power law with size (length or molecular weight): \(Y = a \cdot X^b\). The exponent b identifies the scaling regime: b ≈ 1 (isometric), b < 1 (diminishing returns), b > 1 (accelerating).

  • PCA and PC Loadings \(\implies\) Principal Component Analysis (PCA) rotates correlated features into orthogonal axes (principal components, PCs) ordered by the variance they explain. Each PC’s loadings are the weights of the original features on that axis; features with high absolute loadings contribute most to it. The PC1 loadings reveal the predominant physicochemical “axis of variation”.

  • Statistical Significance (ANOVA/Kruskal-Wallis + Bonferroni) \(\implies\) Tests whether feature distributions differ across organisms. ANOVA (parametric) or Kruskal-Wallis (non-parametric) yields a p-value; the Bonferroni correction adjusts for multiple testing (\(p_{adj} = \min(1,\ p \times n_{tests})\)). Significant results indicate differences between organisms that are unlikely to arise from sampling noise alone.

  • t-SNE (t-Distributed Stochastic Neighbor Embedding) \(\implies\) A non-linear dimensionality reduction method that preserves local neighborhood structure by modeling pairwise similarities in both the high-dimensional and the low-dimensional space. The perplexity (~30) balances local against global structure. Only relative distances and cluster patterns carry meaning; the output coordinates cannot be interpreted on their own.

  • Silhouette Score \(\implies\) Measures each sample’s clustering quality as \(s = \frac{b - a}{\max(a, b)}\), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. Scores range from −1 to +1; a score above 0.5 denotes an acceptable structure, while a score below 0.25 indicates the absence of natural clusters. Used to support or contradict hypotheses of discrete grouping.

  • Outlier Proteins \(\implies\) Proteins with extreme feature values, representing biologically extreme cases (e.g., intrinsically disordered regulators, transmembrane structural proteins). They are highlighted to identify functional specializations that deviate from proteome norms.

  • All Python scripts can be found in the src directory.

  • The results directory holds all datasets produced by the methods above, organized by method; the corresponding plotting scripts also save their plots there.

  • The practical application of the computational methods discussed in the Analysis section is elaborated in the Results page of this report.

  • The common names of the organisms and their scientific names are used interchangeably in this study.
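Three of the statistical measures described in the Analysis list (Pearson’s r, the allometric exponent, and the Bonferroni adjustment) are simple enough to sketch with the standard library. This is an illustration under stated assumptions, not the project’s analysis code: the power law is fitted by ordinary least squares on log-transformed values, which requires strictly positive inputs, and all function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_power_law(xs, ys):
    """Fit Y = a * X**b by least squares on log-transformed data.

    Requires strictly positive xs and ys. Returns (a, b).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_lx, mean_ly = sum(log_x) / n, sum(log_y) / n
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in zip(log_x, log_y))
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    b = sxy / sxx  # slope in log-log space is the scaling exponent
    a = math.exp(mean_ly - b * mean_lx)  # intercept back-transformed to a
    return a, b


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p times the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]
```

Taking logarithms turns \(Y = a \cdot X^b\) into the straight line \(\log Y = \log a + b \log X\), so the fitted slope is the exponent b directly.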

Table: Scientific Names

Organism Name    Scientific Name
Human            Homo sapiens
Cow              Bos taurus
Mouse            Mus musculus
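The silhouette formula from the Analysis section can be made concrete with a small pure-Python sketch. It assumes Euclidean distance between feature vectors and assigns a score of 0 to samples in single-member clusters, a common convention; the function names are hypothetical, and a library implementation would normally be preferred for a dataset of this size.

```python
import math


def silhouette_samples(points, labels):
    """Per-sample silhouette values s = (b - a) / max(a, b).

    a: mean distance to the other members of the sample's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Requires at least two distinct labels and at least two distinct points.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Single-member cluster: silhouette is set to 0 by convention.
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores


def silhouette_score(points, labels):
    """Mean silhouette value over all samples."""
    values = silhouette_samples(points, labels)
    return sum(values) / len(values)
```

Two tight, well-separated groups give a score close to +1, whereas assigning labels that cut across those groups drives the score below 0.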