# Systematic Peptide Names

# Introduction

Surprisingly, there is no universal authoritative source of names for proteins. Neither is there an agreed vocabulary that encompasses cleaved peptide fragments or post-translationally modified forms. Reactome frequently represents a protein in several forms, perhaps as the initial translated form, as fragments following processing, or following many different kinds of post-translational modification. Consequently we have developed a systematic nomenclature that can be used to name peptides in Reactome.

# Process

The majority of peptide names have been generated by a scripted process. New peptide instances are named manually and verified when they are first made visible as part of a Reactome quarterly update. Some peptides are exempt from the naming process, either to prevent name duplications or because the peptide represents a modification or state that is not currently included in the naming process. See the Exemptions section below for more details.

# Explanation Of Systematic Names

## Gene symbol core

The systematic names use HGNC gene symbols as the 'core' of the name. These are identified from UniProt via the Reactome referenceEntity.

## Peptide coordinates suffix

Reactome often represents several peptides that are derived by processing the same translated peptide and consequently have the same UniProt external reference. To generate unique names for these peptides we add the start and end coordinates of the peptide in brackets as a suffix to the gene symbol.

We compare the Reactome peptide with UniProt's 'Chain' feature; in UniProt this feature is part of an annotation group called Molecular Features. This feature is used because it represents the 'default' peptide and is consistent with our use of UniProt IDs as our primary external peptide reference. If the start and end coordinates of the Reactome peptide agree with those of UniProt's Chain feature, no coordinates are added to the gene symbol. If either coordinate differs from the UniProt Chain, both Reactome coordinates are added as a bracketed suffix to the gene symbol. Unknown coordinates are represented as '?' symbols. If the UniProt record has no Chain feature or more than one Chain feature, start-end coordinates are added to every Reactome peptide derived from it.

This combination of gene symbol and coordinates is usually sufficient to generate a unique name, with the following exceptions. Peptides with unknown start or stop coordinates, or peptides that are cleaved more than once at unknown sites, can lead to duplicated names. When this occurs, the peptides are exempted from fully systematic naming and are manually named to be as close to the systematic name as possible, with manual additions creating a unique name.

For example, the caspase-9 precursor, with peptide coordinates start:1 end:416, is named **CASP9**. The large and small subunits of caspase-9 are named **CASP9(1-315)** and **CASP9(316-416)** respectively.

An N-terminal fragment of Aggrecan, where the exact cleavage position is unknown, would be named **ACAN(17-?)**.

Note that Reactome peptide coordinates always refer to the UniProt peptide, even when the literature convention is to number a cleaved fragment following the removal of a signal peptide or initiating methionine.
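
The comparison against the UniProt Chain feature can be sketched in Python as follows. This is a minimal illustration, not Reactome's actual script: the function names `coordinate_suffix` and `systematic_core`, the use of `None` for an unknown coordinate, and the representation of Chain features as a list of `(start, end)` tuples are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

Coord = Optional[int]  # None stands for an unknown coordinate (hypothetical convention)


def _fmt(coord: Coord) -> str:
    """Render a coordinate, using '?' when it is unknown."""
    return "?" if coord is None else str(coord)


def coordinate_suffix(start: Coord, end: Coord,
                      chains: Sequence[Tuple[int, int]]) -> str:
    """Return the bracketed coordinate suffix, or '' when none is needed."""
    # Only a record with exactly one Chain feature can yield a suffix-free name,
    # and only when both coordinates agree with that Chain.
    if len(chains) == 1 and (start, end) == tuple(chains[0]):
        return ""
    # Zero or several Chain features, or any mismatch: always add both coordinates.
    return f"({_fmt(start)}-{_fmt(end)})"


def systematic_core(gene_symbol: str, start: Coord, end: Coord,
                    chains: Sequence[Tuple[int, int]]) -> str:
    """Gene symbol plus optional coordinate suffix."""
    return gene_symbol + coordinate_suffix(start, end, chains)
```

With a single Chain of 1-416, `systematic_core("CASP9", 1, 416, [(1, 416)])` gives `CASP9`, while coordinates 1-315 give `CASP9(1-315)`. An unknown end coordinate is passed as `None` and rendered as `?`.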

## Post-translational modification (PTM) prefixes

PTMs are shown as a prefix to the gene symbol. Reactome represents PTMs as modifiedResidue annotations. These use PSI-MOD terms as their primary external reference. PSI-MOD terms can be searched here. PSI-MOD terms are cross-referenced to the RESID database. The prefix(es) to use are determined by using the PSI-MOD ID to select the appropriate prefix from a lookup table (see below). Some infrequently used PTM types are not represented here.

Reactome annotation will specify the coordinate position of PTMs when this is known, but to help shorten names most PTM prefixes do not include the coordinate. The exceptions are di- and tri-methylation of lysine, lysine acetylation, ubiquitination and phosphorylation; coordinates are necessary for these PTM types to avoid name duplications. PTM prefixes for phosphorylation include the coordinate and in addition use letters to distinguish between the phosphorylation subtypes of serine, threonine, tyrosine and unknown type. Phosphorylations are ordered by coordinate. If there are more than 4 occurrences of any PTM subtype, the coordinates are not included; instead the number of occurrences is given in front of the subtype letter prefix.

Here are some examples of phosphorylation prefixes and their meaning:

- **p-Y139-DAPP1** is DAPP1 phosphorylated on tyrosine-139.
- **p-Y150,S343,T346-WASF2** is WASF2 phosphorylated on tyrosine-150, serine-343 and threonine-346. Note that the phosphorylations are ordered by coordinate.
- **p-Y55,S112,S121,Y227-SPRY2**: note that the ordering is by coordinate; phosphorylations are not grouped by subtype.
- **p-Y-GAB2** is GAB2 phosphorylated on a tyrosine, but the coordinate position of this tyrosine is unknown.
- **p-GLI3** is GLI3 phosphorylated, but both the subtype and position are unknown.
- **p-7Y-KIT** is KIT phosphorylated on seven tyrosines. The coordinates are omitted from the name as there are more than 4 tyrosine phosphorylations.
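
The phosphorylation rules can be sketched as a small Python function. This is an illustrative reading of the rules, not the script Reactome uses: the name `phospho_prefix`, the `(letter, coordinate)` tuple representation, and the use of `""` and `None` for an unknown subtype or position are assumptions. The source does not say how a collapsed subtype (more than 4 occurrences) is ordered relative to individually listed sites, so placing collapsed counts first is also an assumption.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (subtype letter "S"/"T"/"Y", or "" if unknown; coordinate, or None if unknown)
Site = Tuple[str, Optional[int]]


def phospho_prefix(sites: List[Site]) -> str:
    """Build the phosphorylation prefix, including its trailing hyphen."""
    if not sites:
        return ""
    counts = Counter(letter for letter, _ in sites)
    # Subtypes with more than 4 occurrences collapse to "<count><letter>".
    collapsed = sorted(letter for letter, n in counts.items() if n > 4)
    tokens = [f"{counts[letter]}{letter}" for letter in collapsed]
    # Remaining sites are listed one by one, ordered by coordinate (unknown last).
    rest = [site for site in sites if site[0] not in collapsed]
    rest.sort(key=lambda site: (site[1] is None, site[1] or 0))
    for letter, coord in rest:
        token = f"{letter}{'' if coord is None else coord}"
        if token:
            tokens.append(token)
    # Unknown subtype and unknown position leave just "p-".
    if not tokens:
        return "p-"
    return "p-" + ",".join(tokens) + "-"
```

For instance, `phospho_prefix([("S", 343), ("Y", 150), ("T", 346)]) + "WASF2"` reproduces `p-Y150,S343,T346-WASF2`, and seven tyrosine sites followed by `"KIT"` reproduce `p-7Y-KIT`.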

Ubiquitination commences with a modification of a lysine in the target protein, followed by the addition of multiple ubiquitin peptides, which can cross-link in several different positions. Consequently it is necessary to indicate the site of the initial PTM and the nature of the subsequent cross-linking.

**K63polyUb-13,57-p-Y200-XYZ1** is XYZ1 with K63 cross-linked polyubiquitin attached to residues 13 and 57, and a phosphorylation on Y200.

When combinations of phosphorylation and another PTM occur, the phosphorylations are included after everything else:

**2xPalmC-MyrG-p-S1177-NOS3(2-1203)** is NOS3 fragment 2-1203 with two palmitoylated cysteines, one myristoylated glycine, and a phosphorylation on serine-1177.

Note the use of x after the number when there are multiple PTMs of any type except phosphorylation. The x is included because some PTM prefixes start with a number (e.g. 4Hyp for 4-hydroxyproline).
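
A minimal Python sketch of how the prefixes combine is given below. The function names `ptm_prefix` and `full_name` are hypothetical, the phosphorylation prefix is passed in as a ready-made string, and the non-phosphorylation PTMs are kept in the order supplied because the source does not state an ordering rule for them. PTM types that carry coordinates (such as ubiquitination) are deliberately left out of this sketch.

```python
from typing import List, Tuple


def ptm_prefix(other_ptms: List[Tuple[str, int]], phospho: str = "") -> str:
    """Join non-phosphorylation prefixes, then append the phosphorylation prefix.

    other_ptms holds (prefix, occurrence count) pairs, e.g. ("PalmC", 2).
    phospho is an already formatted prefix such as "p-S1177-" (or "").
    """
    tokens = []
    for prefix, count in other_ptms:
        # The 'x' separates the count from prefixes that begin with a digit,
        # so three 4-hydroxyprolines read "3x4Hyp" rather than "34Hyp".
        tokens.append(f"{count}x{prefix}" if count > 1 else prefix)
    head = "".join(token + "-" for token in tokens)
    # Phosphorylations come after every other PTM prefix.
    return head + phospho


def full_name(other_ptms: List[Tuple[str, int]], phospho: str, core: str) -> str:
    """Prefix the gene symbol core (with any coordinate suffix) with its PTMs."""
    return ptm_prefix(other_ptms, phospho) + core
```

Calling `full_name([("PalmC", 2), ("MyrG", 1)], "p-S1177-", "NOS3(2-1203)")` reproduces `2xPalmC-MyrG-p-S1177-NOS3(2-1203)`.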

## Exemptions

Some Reactome peptides are exempt from systematic renaming. Note that referenceEntity is a Reactome term describing a key external reference, from which our internal molecular records are derived. For most proteins this is UniProt.

Exemptions are made when:

- The peptide has the word 'mutant' in its name, indicating that the peptide has a disease-associated mutation.
- The peptide has an annotation in the Disease field, again indicating that it is an abnormal peptide associated with a disease process.
- The referenceEntity has no gene name. This exempts most non-peptides from renaming.
- The referenceEntity is not a ReferenceGeneProduct. This exempts mRNAs, miRNAs, etc. from renaming.
- The referenceEntity is a referenceIsoform with variantIdentifier > 1. This avoids using potentially misleading UniProt coordinates for isoforms that are not represented as isoform 1 by UniProt, which is the isoform they reference with the 'Chain' feature.
- The peptide has a modification that is not a simple modifiedResidue instance. This again excludes mutations, as well as other unusual modifiedResidue types such as GroupModifiedResidues and internal peptide crosslinks.
- The peptide name contains the word 'active', which is used in Reactome to indicate a peptide that has an active conformation, but is otherwise identical to an inactive precursor.
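
The automatic exemption rules listed above could be checked along the following lines. This is only a sketch: `PeptideRecord` and its fields are hypothetical stand-ins for the real Reactome data model, treating a ReferenceIsoform as an acceptable kind of ReferenceGeneProduct is an assumption, and matching 'mutant' and 'active' as whole words (so that 'inactive' does not trigger an exemption) is an interpretation of the rules.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PeptideRecord:
    """Hypothetical, simplified view of a Reactome peptide instance."""
    name: str
    gene_name: Optional[str] = None
    reference_class: str = "ReferenceGeneProduct"
    variant_identifier: Optional[int] = None
    has_disease_annotation: bool = False
    only_simple_modified_residues: bool = True


def exemption_reasons(p: PeptideRecord) -> List[str]:
    """Return every reason the peptide is exempt; an empty list means 'rename'."""
    reasons = []
    if re.search(r"\bmutant\b", p.name, re.IGNORECASE):
        reasons.append("name contains 'mutant'")
    if p.has_disease_annotation:
        reasons.append("has a Disease annotation")
    if not p.gene_name:
        reasons.append("referenceEntity has no gene name")
    if p.reference_class not in ("ReferenceGeneProduct", "ReferenceIsoform"):
        reasons.append("referenceEntity is not a ReferenceGeneProduct")
    if p.reference_class == "ReferenceIsoform" and (p.variant_identifier or 0) > 1:
        reasons.append("isoform with variantIdentifier > 1")
    if not p.only_simple_modified_residues:
        reasons.append("has a non-simple modification")
    if re.search(r"\bactive\b", p.name, re.IGNORECASE):
        reasons.append("name contains 'active'")
    return reasons
```

Returning a list of reasons rather than a single boolean makes it easier to record why a given peptide was skipped.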

EWASs (EntityWithAccessionedSequence instances) that have been exempted from renaming are named in a similar style so far as possible.

Peptides can be manually exempted from systematic renaming:

- If the automated process creates duplicated names, see comments on peptides with unknown start, stop or cleavage sites above.
- If the Curator responsible believes there is a more suitable, widely-recognised and unambiguous alternative. In these cases the systematic name will be retained as an alias name.

A spreadsheet listing all current exemptions is available here

## PTM Lookup Table

| MOD | Prefix | Letter | MOD preferred name |
| --- | --- | --- | --- |
| MOD:00036 | 3D | | (2S-3R)-3-hydroxyaspartic acid |
| MOD:00037 | 5Hyl | | 5-hydroxy-L-lysine |
| MOD:00038 | 3Hyp | | 3-hydroxy-L-proline |
| MOD:00039 | 4Hyp | | 4-hydroxy-L-proline |
| MOD:00041 | CbxE | | L-gamma-carboxyglutamic acid |
| MOD:00046 | p- | S | O-phospho-L-serine |
| MOD:00047 | p- | T | O-phospho-L-threonine |
| MOD:00048 | p- | Y | O4'-phospho-L-tyrosine |
| MOD:00064 | AcK | | N6-acetyl-L-lysine |
| MOD:00065 | AcC | | S-acetyl-L-cysteine |
| MOD:00068 | MyrG | | N-myristoylglycine |
| MOD:00083 | Me3K | | N6,N6,N6-trimethyl-L-lysine |
| MOD:00084 | Me2K | | N6,N6-dimethyl-L-lysine |
| MOD:00085 | MeK | | N6-methyl-L-lysine |
| MOD:00087 | Myri | | N6-myristoyl-L-lysine |
| MOD:00091 | ArgN | | L-arginine amide |
| MOD:00111 | FarC | | S-farnesyl-L-cysteine |
| MOD:00113 | GGC | | S-geranylgeranyl-L-cysteine |
| MOD:00115 | PalmC | | S-palmitoyl-L-cysteine |
| MOD:00125 | Hypu | | Hypusine |
| MOD:00126 | Btn | | N6-biotinyl-L-lysine |
| MOD:00127 | Lipo | | N6-lipoyl-L-lysine |
| MOD:00128 | PXLP | | N6-pyridoxal phosphate-L-lysine |
| MOD:00130 | Alys | | L-allysine |
| MOD:00134 | GlyK | | N6-glycyl-L-lysine |
| MOD:00159 | PpantS | | O-phosphopantetheine-L-serine |
| MOD:00160 | N4GlycN | | N4-glycosyl-L-asparagine |
| MOD:00162 | GlcGalHyl | | O5-glucosylgalactosyl-L-hydroxylysine |
| MOD:00163 | GalNAc | | O-(N-acetylamino)galactosyl-L-serine |
| MOD:00164 | GalNAc | | O-(N-acetylamino)galactosyl-L-threonine |
| MOD:00166 | GlcY | | O4'-glucosyl-L-tyrosine |
| MOD:00167 | GPIN | | N-asparaginyl-glycosylphosphatidylinositolethanolamine |
| MOD:00168 | GPID | | N-aspartyl-glycosylphosphatidylinositolethanolamine |
| MOD:00170 | GPIG | | N-glycyl-glycosylphosphatidylinositolethanolamine |
| MOD:00171 | GPIS | | N-seryl-glycosylphosphatidylinositolethanolamine |
| MOD:00239 | MetC | | S-methyl-L-cysteine |
| MOD:00274 | CysS | | L-cysteine persulfide |
| MOD:00300 | ADPRib | | L-glutamyl-5-poly(ADP-ribose) |
| MOD:00314 | CHOL | | glycine cholesterol ester |
| MOD:00342 | MeL | | N-methyl-L-leucine |
| MOD:00369 | AcS | | O-acetyl-L-serine |
| MOD:00390 | DecS | | O-decanoyl-L-serine |
| MOD:00437 | Far | | farnesylated residue |
| MOD:00438 | MYS | | myristoylated residue |
| MOD:00465 | dHF | | dihydroxyphenylalanine (Phe) |
| MOD:00599 | Me | | monomethylated residue |
| MOD:00685 | dNQ | | deamidated L-glutamine |
| MOD:00696 | p- | | phosphorylated residue |
| MOD:00752 | RibC | | adenosine diphosphoribosyl (ADP-ribosyl) modified residue |
| MOD:00798 | HC | | half cystine |
| MOD:00803 | CysY | | 3-(S-L-cysteinyl)-L-tyrosine |
| MOD:00804 | GlcS | | O-glucosyl-L-serine |
| MOD:00812 | FucS | | O-fucosyl-L-serine |
| MOD:00813 | FucT | | O-fucosyl-L-threonine |
| MOD:00814 | XylS | | O-xylosyl-L-serine |
| MOD:00835 | OxA | | L-3-oxoalanine (Ser) |
| MOD:00971 | OxoH | | 2-oxo-histidine |
| MOD:01024 | HP | | monohydroxylated proline |
| MOD:01148 | Ub | | ubiquitinylated lysine |
| MOD:01152 | CO | | carboxylated residue |
| MOD:01228 | IY | | monoiodinated tyrosine |
| MOD:01381 | PalmS | | O-palmitoleyl-L-serine |
| MOD:01625 | SOG | | 1-thioglycine |
| MOD:01688 | HN | | 3-hydroxy-L-asparagine |
| MOD:01699 | H+ | | protonated residue |
| MOD:01777 | CysO | | S-(glycyl)-L-cysteine (Cys-Gly) |
| MOD:01880 | Dhp | | L-deoxyhypusine |
| MOD:01914 | GalHyl | | O5-galactosyl-L-hydroxylysine |
| MOD:00076 | Me2sR | | symmetric dimethyl-L-arginine |
| MOD:00077 | Me2aR | | asymmetric dimethyl-L-arginine |
| MOD:00078 | MeR | | omega-N-methyl-L-arginine |
| MOD:00219 | Cit | | L-citrulline |
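
As a rough illustration of how a script might consult this table, the sketch below holds a handful of rows in a Python dictionary keyed by PSI-MOD ID. The names `PTM_LOOKUP` and `lookup_prefix` are hypothetical, only a few rows are copied in, and returning `None` for an unlisted ID is an assumption about how infrequently used PTM types might be handled.

```python
from typing import Dict, Optional, Tuple

# Excerpt of the lookup table: PSI-MOD ID -> (prefix, subtype letter).
# The letter is empty for every row except the serine, threonine and
# tyrosine phosphorylation subtypes.
PTM_LOOKUP: Dict[str, Tuple[str, str]] = {
    "MOD:00039": ("4Hyp", ""),
    "MOD:00046": ("p-", "S"),
    "MOD:00047": ("p-", "T"),
    "MOD:00048": ("p-", "Y"),
    "MOD:00068": ("MyrG", ""),
    "MOD:00115": ("PalmC", ""),
    "MOD:00696": ("p-", ""),
}


def lookup_prefix(mod_id: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, letter) for a PSI-MOD ID, or None if it is not listed."""
    return PTM_LOOKUP.get(mod_id)
```

Here `lookup_prefix("MOD:00048")` returns `("p-", "Y")`, and an ID absent from the excerpt returns `None`.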