Systematic Peptide Names

From ReactomeWiki

Introduction

Surprisingly, there is no universal authoritative source of names for proteins. Neither is there an agreed vocabulary that encompasses cleaved peptide fragments or post-translationally modified forms. Reactome frequently represents a protein in several forms, perhaps as the initial translated form, as fragments following processing, or following many different kinds of post-translational modification. Consequently we have developed a systematic nomenclature that can be used to name peptides in Reactome.

Process

The majority of peptide names have been generated by a scripted process. New peptide instances are named manually and verified at the time they are first made visible as part of a Reactome quarterly update. Some peptides are exempt from the naming process, either to prevent name duplications or because the peptide represents a modification or state that is not currently included in the naming process. See the Exemptions section below for more details.

Explanation Of Systematic Names

Gene symbol core

The systematic names use HGNC gene symbols as the 'core' of the name. These are identified from UniProt via the Reactome referenceEntity.

Peptide coordinates suffix

Reactome often represents several peptides that are derived by processing the same translated peptide, and consequently have the same UniProt external reference. To generate unique names for these multiple peptides we add the start and end coordinates of the peptide in brackets as a suffix to the gene symbol. We compare the Reactome peptide with UniProt's 'Chain' feature; in UniProt this feature is part of an annotation group called Molecular Features. This feature is used because it represents the 'default' peptide, and is consistent with our use of UniProt IDs as our primary external peptide reference.

If the start and end coordinates of the Reactome peptide agree with those of UniProt's Chain feature, no coordinates are added to the gene symbol. If either coordinate differs from the UniProt Chain, both Reactome coordinates are added as a bracketed suffix to the gene symbol. If the UniProt record has no Chain feature or more than one Chain feature, start-end coordinates are added to every Reactome peptide derived from it. Unknown coordinates are represented as '?' symbols.

This combination of gene symbol and coordinates is usually sufficient to generate a unique name, with the following exceptions. Peptides with unknown start or stop coordinates, or peptides that are cleaved more than once at unknown sites, can lead to duplicated names. When this occurs, the peptides are exempted from fully systematic naming and are instead named manually, as close to the systematic name as possible, with manual additions that make the name unique.

e.g. Caspase-9 precursor, with peptide coordinates start:1 end:416 is named CASP9. The large and small subunits of caspase-9 are respectively named CASP9(1-315) and CASP9(316-416).

An N-terminal fragment of Aggrecan, where the exact cleavage position is unknown, would be named ACAN(17-?).

Note that Reactome peptide coordinates always refer to the UniProt peptide, even when the literature convention is to number a cleaved fragment following the removal of a signal peptide or initiating methionine.
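
The coordinate rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described on this page, not the script Reactome uses; the function names and the representation of UniProt Chain features as a list of (start, end) tuples are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

Coord = Optional[int]  # None stands for an unknown coordinate, shown as '?'


def _show(coord: Coord) -> str:
    """Render one coordinate, using '?' when it is unknown."""
    return "?" if coord is None else str(coord)


def coordinate_suffix(start: Coord, end: Coord,
                      chains: Sequence[Tuple[int, int]]) -> str:
    """Return the bracketed suffix, or '' when no suffix is needed.

    `chains` is a hypothetical list of the (start, end) pairs of the
    UniProt Chain features for the referenced record.
    """
    # No suffix only when there is exactly one Chain and both coordinates agree.
    if len(chains) == 1 and (start, end) == tuple(chains[0]):
        return ""
    # Otherwise (mismatch, no Chain, or several Chains) add both coordinates.
    return f"({_show(start)}-{_show(end)})"


def systematic_core(symbol: str, start: Coord, end: Coord,
                    chains: Sequence[Tuple[int, int]]) -> str:
    """Gene symbol plus optional coordinate suffix."""
    return symbol + coordinate_suffix(start, end, chains)


casp9_chain = [(1, 416)]  # taken from the Caspase-9 example above
print(systematic_core("CASP9", 1, 416, casp9_chain))    # CASP9
print(systematic_core("CASP9", 1, 315, casp9_chain))    # CASP9(1-315)
print(systematic_core("CASP9", 316, 416, casp9_chain))  # CASP9(316-416)
```

An unknown end coordinate never matches a Chain, so a call such as systematic_core("ACAN", 17, None, chains) yields ACAN(17-?) whatever the Chain coordinates are.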

Post-translational modification (PTM) prefixes

PTMs are shown as a prefix to the gene symbol. Reactome represents PTMs as modifiedResidue annotations. These use PSI-MOD terms as their primary external reference. PSI-MOD terms can be searched here. PSI-MOD terms are cross-referenced to the RESID database. The prefix(es) to use are determined by using the PSI-MOD ID to select the appropriate prefix from a lookup table (see below). Some infrequently used PTM types are not represented here.

Reactome annotation will specify the coordinate position of PTMs when this is known, but to help shorten names most PTM prefixes do not include the coordinate. The exceptions are di- and tri-lysine methylation, lysine acetylation, ubiquitination and phosphorylation; coordinates are necessary for these PTM types to avoid name duplications. PTM prefixes for phosphorylation include the coordinate and in addition use letters to distinguish between the phosphorylation subtypes of serine, threonine, tyrosine and unknown type. Phosphorylations are ordered by coordinate. If there are more than 4 occurrences of any PTM subtype, the coordinates are not included; instead the number of occurrences is given in front of the subtype letter prefix.

Here are some examples of phosphorylation prefixes and their meaning:

  • p-Y139-DAPP1 is DAPP1 phosphorylated on tyrosine-139
  • p-Y150,S343,T346-WASF2 is WASF2 phosphorylated on tyrosine-150, serine-343 and threonine-346. Note that the phosphorylations are ordered by coordinate.
  • p-Y55,S112,S121,Y227-SPRY2 - note that the ordering is by coordinate, phosphorylations are not grouped by subtype.
  • p-Y-GAB2 is GAB2 phosphorylated on a tyrosine, but the coordinate position of this tyrosine is unknown.
  • p-GLI3 is GLI3 phosphorylated but both the subtype and position are unknown.
  • p-7Y-KIT is KIT phosphorylated on seven tyrosines. The coordinates are omitted from the name as there are more than 4 tyrosine phosphorylations.
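
As a rough sketch, the phosphorylation examples above can be reproduced with the Python function below. It is not Reactome's naming script. The function name, the (letter, coordinate) tuple representation and the max_listed parameter are assumptions for illustration, and the page does not say how a collapsed subtype (such as 7Y) should be ordered relative to individually listed sites, so placing collapsed subtypes first is also an assumption.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (residue letter, coordinate). The letter is "" for an unknown subtype and
# the coordinate is None for an unknown position.
Phospho = Tuple[str, Optional[int]]


def phospho_name(sites: List[Phospho], gene_symbol: str,
                 max_listed: int = 4) -> str:
    """Build a 'p-...' name for a peptide whose only PTMs are phosphorylations."""
    counts = Counter(letter for letter, _ in sites)
    tokens = []
    # Subtypes with more than max_listed occurrences collapse to count + letter.
    for letter in sorted(counts):
        if counts[letter] > max_listed:
            tokens.append(f"{counts[letter]}{letter}")
    # Remaining sites are listed one by one, ordered by coordinate
    # (sites with unknown coordinates go last).
    listed = [s for s in sites if counts[s[0]] <= max_listed]
    listed.sort(key=lambda s: (s[1] is None, s[1] or 0))
    for letter, pos in listed:
        tokens.append(f"{letter}{'' if pos is None else pos}")
    body = ",".join(t for t in tokens if t)
    return f"p-{body}-{gene_symbol}" if body else f"p-{gene_symbol}"


print(phospho_name([("S", 343), ("Y", 150), ("T", 346)], "WASF2"))
# p-Y150,S343,T346-WASF2
print(phospho_name([("Y", None)], "GAB2"))  # p-Y-GAB2
print(phospho_name([("", None)], "GLI3"))   # p-GLI3
```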

Ubiquitination commences with the modification of a lysine in the target protein, followed by the addition of multiple ubiquitin peptides, which can cross-link in several different positions. Consequently it is necessary to indicate the site of the initial PTM and the nature of the subsequent cross-linking.

K63polyUb-13,57-p-Y200-XYZ1 is XYZ1 with K63 cross-linked polyubiquitin attached to residues 13 and 57, and a phosphorylation on Y200.

When combinations of phosphorylation and another PTM occur, the phosphorylations are included after everything else:

2xPalmC-MyrG-p-S1177-NOS3(2-1203) is NOS3 fragment 2-1203 with two palmitoylated cysteines, one myristoylated glycine, and a phosphorylation on serine-1177.

Note the use of x after the number when there are multiple PTMs of any type except phosphorylation. The x is included because some PTM prefixes start with a number (e.g. 4Hyp for 4-hydroxyproline).
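
The way prefixes are combined can be sketched as follows. This is a minimal illustration, not the Reactome script: the function names are hypothetical, non-phosphorylation PTMs are simply kept in the order supplied (the page does not define their ordering), and PTMs that carry coordinates, such as ubiquitination, are not handled.

```python
from typing import List, Tuple


def ptm_prefix(other_ptms: List[Tuple[str, int]], phospho: str = "") -> str:
    """Join non-phosphorylation prefixes, then append the phosphorylation block.

    `other_ptms` holds (prefix, number of occurrences) pairs, for example
    ("PalmC", 2). `phospho` is an already formatted block such as "p-S1177".
    """
    parts = []
    for prefix, count in other_ptms:
        # The 'x' keeps the count apart from prefixes that begin with a digit,
        # so three 4-hydroxyprolines read 3x4Hyp rather than 34Hyp.
        parts.append(f"{count}x{prefix}" if count > 1 else prefix)
    if phospho:
        parts.append(phospho)  # phosphorylations come after everything else
    return "-".join(parts)


def full_name(other_ptms: List[Tuple[str, int]], phospho: str, core: str) -> str:
    """Prefix block plus the gene symbol core (with any coordinate suffix)."""
    prefix = ptm_prefix(other_ptms, phospho)
    return f"{prefix}-{core}" if prefix else core


print(full_name([("PalmC", 2), ("MyrG", 1)], "p-S1177", "NOS3(2-1203)"))
# 2xPalmC-MyrG-p-S1177-NOS3(2-1203)
```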

Exemptions

Some Reactome peptides are exempt from systematic renaming. Note that referenceEntity is a Reactome term describing a key external reference, from which our internal molecular records are derived. For most proteins this is UniProt.

Exemptions are made when:

  1. The peptide has the word 'mutant' in its name, indicating that the peptide has a disease-associated mutation.
  2. The peptide has an annotation in the Disease field, again indicating that it is an abnormal peptide associated with a disease process.
  3. The referenceEntity has no gene name. This exempts most non-peptides from renaming.
  4. The referenceEntity is not ReferenceGeneProduct. This exempts mRNAs, miRNAs etc. from renaming.
  5. The referenceEntity is a referenceIsoform with variantIdentifier > 1. This avoids using potentially misleading UniProt coordinates for isoforms that are not represented as isoform 1 by UniProt, which is the isoform they reference with the 'Chain' feature.
  6. The peptide has a modification that is not a simple modifiedResidue instance. This avoids mutations again, and other unusual modifiedResidue types such as GroupModifiedResidues, and Internal peptide crosslinks.
  7. The peptide name contains the word 'active', which is used in Reactome to indicate a peptide that has an active conformation, but is otherwise identical to an inactive precursor.
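
The seven conditions above amount to a simple check, sketched below in Python. The Peptide record and its field names are hypothetical stand-ins for the Reactome attributes mentioned in the list, not the real data model. The sketch also assumes that a referenceIsoform counts as a ReferenceGeneProduct for condition 4, and it matches 'mutant' and 'active' as whole words so that a name containing 'inactive' is not exempted by accident.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Peptide:
    """Hypothetical, simplified view of a Reactome peptide record."""
    name: str
    disease: Optional[str] = None
    reference_class: str = "ReferenceGeneProduct"
    gene_name: Optional[str] = None
    variant_identifier: int = 1
    modification_classes: List[str] = field(default_factory=list)


def exemption_reason(p: Peptide) -> Optional[str]:
    """Return the first matching exemption, or None if the peptide is renamed."""
    if re.search(r"\bmutant\b", p.name, re.IGNORECASE):
        return "name contains 'mutant'"
    if p.disease:
        return "Disease field is annotated"
    if not p.gene_name:
        return "referenceEntity has no gene name"
    if p.reference_class not in ("ReferenceGeneProduct", "ReferenceIsoform"):
        return "referenceEntity is not a ReferenceGeneProduct"
    if p.reference_class == "ReferenceIsoform" and p.variant_identifier > 1:
        return "isoform with variantIdentifier > 1"
    if any(c != "ModifiedResidue" for c in p.modification_classes):
        return "modification is not a simple modifiedResidue"
    if re.search(r"\bactive\b", p.name, re.IGNORECASE):
        return "name contains 'active'"
    return None


print(exemption_reason(Peptide(name="CASP9", gene_name="CASP9")))        # None
print(exemption_reason(Peptide(name="active XYZ1", gene_name="XYZ1")))   # name contains 'active'
```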

EWASs (Reactome EntityWithAccessionedSequence instances) that have been exempted from renaming are named in a similar style so far as possible.

Peptides can be manually exempted from systematic renaming:

  1. If the automated process creates duplicated names (see the comments above on peptides with unknown start, stop or cleavage sites).
  2. If the Curator responsible believes there is a more suitable, widely-recognised and unambiguous alternative. In these cases the systematic name will be retained as an alias name.

A spreadsheet listing all current exemptions is available here

PTM Lookup Table

MOD ID     Prefix     Letter  MOD preferred name
MOD:00036  3D                 (2S-3R)-3-hydroxyaspartic acid
MOD:00037  5Hyl               5-hydroxy-L-lysine
MOD:00038  3Hyp               3-hydroxy-L-proline
MOD:00039  4Hyp               4-hydroxy-L-proline
MOD:00041  CbxE               L-gamma-carboxyglutamic acid
MOD:00046  p-         S       O-phospho-L-serine
MOD:00047  p-         T       O-phospho-L-threonine
MOD:00048  p-         Y       O4'-phospho-L-tyrosine
MOD:00064  AcK                N6-acetyl-L-lysine
MOD:00065  AcC                S-acetyl-L-cysteine
MOD:00068  MyrG               N-myristoylglycine
MOD:00083  Me3K               N6,N6,N6-trimethyl-L-lysine
MOD:00084  Me2K               N6,N6-dimethyl-L-lysine
MOD:00085  MeK                N6-methyl-L-lysine
MOD:00087  Myri               N6-myristoyl-L-lysine
MOD:00091  ArgN               L-arginine amide
MOD:00111  FarC               S-farnesyl-L-cysteine
MOD:00113  GGC                S-geranylgeranyl-L-cysteine
MOD:00115  PalmC              S-palmitoyl-L-cysteine
MOD:00125  Hypu               Hypusine
MOD:00126  Btn                N6-biotinyl-L-lysine
MOD:00127  Lipo               N6-lipoyl-L-lysine
MOD:00128  PXLP               N6-pyridoxal phosphate-L-lysine
MOD:00130  Alys               L-allysine
MOD:00159  PpantS             O-phosphopantetheine-L-serine
MOD:00160  N4GlycN            N4-glycosyl-L-asparagine
MOD:00162  GlcGalHyl          O5-glucosylgalactosyl-L-hydroxylysine
MOD:00163  GalNAc             O-(N-acetylamino)galactosyl-L-serine
MOD:00164  GalNAc             O-(N-acetylamino)galactosyl-L-threonine
MOD:00166  GlcY               O4'-glucosyl-L-tyrosine
MOD:00167  GPIN               N-asparaginyl-glycosylphosphatidylinositolethanolamine
MOD:00168  GPID               N-aspartyl-glycosylphosphatidylinositolethanolamine
MOD:00170  GPIG               N-glycyl-glycosylphosphatidylinositolethanolamine
MOD:00171  GPIS               N-seryl-glycosylphosphatidylinositolethanolamine
MOD:00239  MetC               S-methyl-L-cysteine
MOD:00274  CysS               L-cysteine persulfide
MOD:00300  ADPRib             L-glutamyl-5-poly(ADP-ribose)
MOD:00314  CHOL               glycine cholesterol ester
MOD:00369  AcS                O-acetyl-L-serine
MOD:00390  DecS               O-decanoyl-L-serine
MOD:00437  Far                farnesylated residue
MOD:00438  MYS                myristoylated residue
MOD:00465  dHF                dihydroxyphenylalanine (Phe)
MOD:00599  Me                 monomethylated residue
MOD:00685  dNQ                deamidated L-glutamine
MOD:00696  p-                 phosphorylated residue
MOD:00752  RibC               modified residue
MOD:00798  HC                 half cystine
MOD:00803  CysY               3-(S-L-cysteinyl)-L-tyrosine
MOD:00804  GlcS               O-glucosyl-L-serine
MOD:00812  FucS               O-fucosyl-L-serine
MOD:00813  FucT               O-fucosyl-L-threonine
MOD:00814  XylS               O-xylosyl-L-serine
MOD:00835  OxA                L-3-oxoalanine (Ser)
MOD:00971  OxoH               2-oxo-histidine
MOD:01024  HP                 monohydroxylated proline
MOD:01148  Ub                 ubiquitinylated lysine
MOD:01152  CO                 carboxylated residue
MOD:01228  IY                 monoiodinated tyrosine
MOD:01381  PalmS              O-palmitoleyl-L-serine
MOD:01625  SOG                1-thioglycine
MOD:01688  HN                 3-hydroxy-L-asparagine
MOD:01699  H+                 protonated residue
MOD:01777  CysO               S-(glycyl)-L-cysteine (Cys-Gly)
MOD:01880  Dhp                L-deoxyhypusine
MOD:01914  GalHyl             O5-galactosyl-L-hydroxylysine
MOD:00076  Me2sR              symmetric dimethyl-L-arginine
MOD:00077  Me2aR              asymmetric dimethyl-L-arginine
MOD:00078  MeR                omega-N-methyl-L-arginine
MOD:00219  Cit                L-citrulline
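
For scripted use, the lookup table can be held as a mapping from PSI-MOD ID to a (prefix, letter) pair. The Python sketch below copies only a handful of rows from the table. The names PTM_PREFIXES and prefix_for are hypothetical, and returning None for an ID that is missing from the table is an assumption about how unlisted PTM types might be handled.

```python
from typing import Dict, Optional, Tuple

# Excerpt of the lookup table: PSI-MOD ID -> (prefix, subtype letter).
# The letter is "" for rows that have no entry in the Letter column.
PTM_PREFIXES: Dict[str, Tuple[str, str]] = {
    "MOD:00039": ("4Hyp", ""),
    "MOD:00046": ("p-", "S"),
    "MOD:00047": ("p-", "T"),
    "MOD:00048": ("p-", "Y"),
    "MOD:00068": ("MyrG", ""),
    "MOD:00115": ("PalmC", ""),
    "MOD:00696": ("p-", ""),
    "MOD:01148": ("Ub", ""),
}


def prefix_for(mod_id: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, letter) for a PSI-MOD ID, or None if it is not listed."""
    return PTM_PREFIXES.get(mod_id)


print(prefix_for("MOD:00115"))  # ('PalmC', '')
print(prefix_for("MOD:00048"))  # ('p-', 'Y')
```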